Re: acpi ->video_device_list corruption
On Wed, Dec 12, 2007 at 12:48:09PM +0100, Mikael Pettersson wrote:
> IMO the memset(ptr, 0, sizeof(*ptr)) idiom is both safer
> and avoids having to write an uninteresting type name.

How about this, then?

The ->cap fields of struct acpi_video_device and struct acpi_video_bus
are 1B each, not 4B. The oversized memset()'s corrupted the subsequent
list_head fields. This resulted in silent corruption without
CONFIG_DEBUG_LIST and BUG's with it. This patch uses sizeof() to pass
the proper bounds to the memset() calls and thereby correct the bugs.

The patch was seen to resolve the issue on the affected system.

vs. 2.6.24-rc5

Signed-off-by: William Irwin <[EMAIL PROTECTED]>

diff --git a/drivers/acpi/video.c b/drivers/acpi/video.c
index 44a0d9b..bd77e81 100644
--- a/drivers/acpi/video.c
+++ b/drivers/acpi/video.c
@@ -577,7 +577,7 @@ static void acpi_video_device_find_cap(struct acpi_video_device *device)
 	struct acpi_video_device_brightness *br = NULL;
-	memset(&device->cap, 0, 4);
+	memset(&device->cap, 0, sizeof(device->cap));
 	if (ACPI_SUCCESS(acpi_get_handle(device->dev->handle, "_ADR", &h_dummy1))) {
 		device->cap._ADR = 1;
@@ -697,7 +697,7 @@ static void acpi_video_bus_find_cap(struct acpi_video_bus *video)
 {
 	acpi_handle h_dummy1;
-	memset(&video->cap, 0, 4);
+	memset(&video->cap, 0, sizeof(video->cap));
 	if (ACPI_SUCCESS(acpi_get_handle(video->device->handle, "_DOS", &h_dummy1))) {
 		video->cap._DOS = 1;
 	}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
acpi ->video_device_list corruption
The ->cap fields of struct acpi_video_device and struct acpi_video_bus
are 1B each, not 4B. The oversized memset()'s corrupted the subsequent
list_head fields. This resulted in silent corruption without
CONFIG_DEBUG_LIST and BUG's with it. This patch uses sizeof() to pass
the proper bounds to the memset() calls and thereby correct the bugs.

Included as a MIME attachment is a compressed dmesg from an affected
system. The patch was seen to resolve the issue on the affected system.

vs. 2.6.24-rc5

Signed-off-by: William Irwin <[EMAIL PROTECTED]>

-- wli

diff --git a/drivers/acpi/video.c b/drivers/acpi/video.c
index 44a0d9b..7895d57 100644
--- a/drivers/acpi/video.c
+++ b/drivers/acpi/video.c
@@ -577,7 +577,7 @@ static void acpi_video_device_find_cap(struct acpi_video_device *device)
 	struct acpi_video_device_brightness *br = NULL;
-	memset(&device->cap, 0, 4);
+	memset(&device->cap, 0, sizeof(struct acpi_video_device_cap));
 	if (ACPI_SUCCESS(acpi_get_handle(device->dev->handle, "_ADR", &h_dummy1))) {
 		device->cap._ADR = 1;
@@ -697,7 +697,7 @@ static void acpi_video_bus_find_cap(struct acpi_video_bus *video)
 {
 	acpi_handle h_dummy1;
-	memset(&video->cap, 0, 4);
+	memset(&video->cap, 0, sizeof(struct acpi_video_bus_cap));
 	if (ACPI_SUCCESS(acpi_get_handle(video->device->handle, "_DOS", &h_dummy1))) {
 		video->cap._DOS = 1;
 	}

Attachment: dmesg.acpibug.gz
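The failure mode in this thread can be modelled in ordinary userspace C. What follows is only a sketch: the struct and function names (demo_cap, demo_device, demo_clear_cap, demo_overrun) are hypothetical stand-ins and do not reproduce the layout of the real ACPI video structures. It shows that taking the length from the object itself clears exactly the ->cap member, and it computes how many bytes a hard-coded length would spill into whatever follows.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel structures; not the real layout. */
struct demo_cap {
	unsigned char bits;		/* one byte of capability flags */
};

struct demo_list_head {
	struct demo_list_head *next, *prev;
};

struct demo_device {
	struct demo_cap cap;		/* the member being cleared */
	unsigned char flags;		/* whatever happens to follow ->cap */
	struct demo_list_head entry;
};

/* Clear only ->cap: the length is taken from the object itself. */
void demo_clear_cap(struct demo_device *dev)
{
	memset(&dev->cap, 0, sizeof(dev->cap));
}

/*
 * Number of bytes that memset(&dev->cap, 0, len) would write past the
 * end of ->cap. Anything nonzero lands on neighbouring members.
 */
size_t demo_overrun(size_t len)
{
	return len > sizeof(struct demo_cap) ? len - sizeof(struct demo_cap) : 0;
}
```

With a one-byte ->cap, demo_overrun(4) reports three stray bytes, which is the kind of spill the original memset(..., 0, 4) calls produced; which members those bytes hit depends on the real structure layout.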
Re: [RFC] [PATCH] hugetlbfs :shmget with SHM_HUGETLB only works as root
On Fri, Nov 30, 2007 at 12:02:32AM +0530, Ciju Rajan K wrote:
> I tested your patch. But that is not solving the problem.
> If the code change to user_shm_lock() is not a good solution, could
> you please suggest a method so that the normal user is able to allocate
> the huge pages, if his gid is added to /proc/sys/vm/hugetlb_shm_group

The patch I posted resolves a race unrelated to your issue.

Raising your locked memory limits should not be difficult.
/etc/limits.conf or similar should set it up for you. You can also
change the default rlimit in the kernel and compile it with default
limits elevated to what you want your unprivileged process to have to
start with if you're truly having lots of trouble getting userspace to
set the default limits properly. I'd look in include/asm-generic/resource.h

-- wli
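For reference, a process can inspect its own locked-memory limit with getrlimit() and raise the soft limit as far as the hard limit with setrlimit(). The sketch below assumes a Linux/POSIX userspace; the helper names raise_memlock_soft_limit and print_memlock_limit are made up for illustration. It does not raise the hard limit, which still needs privilege or login-time configuration of the limits.conf kind mentioned above.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: raise the soft RLIMIT_MEMLOCK to the hard limit.
 * Returns 0 on success, -1 if getrlimit() or setrlimit() fails.
 * If out is non-NULL it receives the limits now in effect.
 */
int raise_memlock_soft_limit(struct rlimit *out)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	rl.rlim_cur = rl.rlim_max;	/* unprivileged processes may go this far */
	if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (out)
		*out = rl;
	return 0;
}

/* Print the soft limit in kilobytes, or "unlimited" for RLIM_INFINITY. */
void print_memlock_limit(const struct rlimit *rl)
{
	if (rl->rlim_cur == RLIM_INFINITY)
		printf("memlock: unlimited\n");
	else
		printf("memlock: %llu kB\n",
		       (unsigned long long)rl->rlim_cur / 1024);
}
```

If the soft limit printed afterwards is still too small for the intended SHM_HUGETLB segment, the hard limit is the thing to change.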
Re: Why preallocate pmd in x86 32-bit PAE?
Linus Torvalds wrote:
>> IIRC, the present bit is ignored in the magic 4-entry PGD. All entries
>> have to be present.

On Thu, Nov 15, 2007 at 02:42:46PM -0800, H. Peter Anvin wrote:
> This is true, although you could point a PGD to an all-zero page if you
> really wanted to. You have to re-load CR3 after modifying the top-level
> entries.

There may be bigger fish to fry in terms of per-process overhead, if
you're trying to cut that down. The trouble with trying to address
some of those is that there is mutual antagonism between compactness
and expansibility in the process address space layout, so you'll end
up instantiating a lot more than you want barring some sort of
provision for a compact address space layout.

Pagetable sharing is a far more powerful resource scalability method,
though it also needs cooperation in user address space layout to reap
its gains.

There are other overheads, of course, though they're more typically
per-something besides processes.

-- wli
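As background on the "magic 4-entry PGD": with 32-bit PAE and 4KB pages, a linear address splits into a 2-bit top-level index, two 9-bit indices and a 12-bit page offset, so the top level has only four entries, each covering 1GB. The sketch below does nothing more than that arithmetic. The macro and function names are illustrative and are not the kernel's own helpers.

```c
#include <assert.h>
#include <stdint.h>

/* x86 32-bit PAE, 4KB pages: 2 + 9 + 9 + 12 = 32 address bits. */
#define DEMO_PAGE_SHIFT	12
#define DEMO_PMD_SHIFT	21	/* each PMD entry maps 2MB */
#define DEMO_PGD_SHIFT	30	/* each top-level entry maps 1GB */

/* Index into the 4-entry top level (0..3). */
unsigned demo_pgd_index(uint32_t addr)
{
	return addr >> DEMO_PGD_SHIFT;
}

/* Index into the 512-entry PMD page selected by the top-level entry. */
unsigned demo_pmd_index(uint32_t addr)
{
	return (addr >> DEMO_PMD_SHIFT) & 0x1ff;
}

/* Index into the 512-entry PTE page selected by the PMD entry. */
unsigned demo_pte_index(uint32_t addr)
{
	return (addr >> DEMO_PAGE_SHIFT) & 0x1ff;
}

/* Byte offset within the 4KB page. */
unsigned demo_page_offset(uint32_t addr)
{
	return addr & 0xfff;
}
```

Each top-level entry points at one 4KB PMD page, so four preallocated PMDs cost at most 16KB per address space, which gives a sense of scale for the per-process overhead being discussed.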
Re: [RFC] [PATCH] hugetlbfs :shmget with SHM_HUGETLB only works as root
On Wed, Nov 14, 2007 at 09:31:41AM -0600, aglitke wrote:
> ... if the user's locked limit (ulimit -l) is set to unlimited, allowed
> (above) is set to 1. In that case, the second part of that if() is
> bypassed, and the function grants permission. Therefore, the easy
> solution is to make sure your user's lock_limit is RLIM_INFINITY.

This function deserves a minor cleanup and a bit more commenting.
Reading user->locked_shm within shmlock_user_lock would be nice, too.
Maybe something like this (untested, uncompiled) would do.

-- wli

diff --git a/mm/mlock.c b/mm/mlock.c
index 7b26560..5f51792 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -234,6 +234,12 @@ asmlinkage long sys_munlockall(void)
 /*
  * Objects with different lifetime than processes (SHM_LOCK and SHM_HUGETLB
  * shm segments) get accounted against the user_struct instead.
+ * First, user_shm_lock() checks that the user has permission to lock
+ * enough memory; then if so, the locked shm is accounted to the user's
+ * system-wide state. shmlock_user_lock protects the per-user field
+ * tracking how much locked_shm is in use within the struct user_struct.
+ * shmlock_user_lock is taken early to guard the read-only check that
+ * user->locked_shm is in-bounds against updates to user->locked_shm.
  */
 static DEFINE_SPINLOCK(shmlock_user_lock);
@@ -242,19 +248,22 @@ int user_shm_lock(size_t size, struct user_struct *user)
 	unsigned long lock_limit, locked;
 	int allowed = 0;
+	spin_lock(&shmlock_user_lock);
 	locked = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
 	if (lock_limit == RLIM_INFINITY)
 		allowed = 1;
-	lock_limit >>= PAGE_SHIFT;
-	spin_lock(&shmlock_user_lock);
-	if (!allowed &&
-	    locked + user->locked_shm > lock_limit && !capable(CAP_IPC_LOCK))
-		goto out;
-	get_uid(user);
-	user->locked_shm += locked;
-	allowed = 1;
-out:
+	else {
+		lock_limit >>= PAGE_SHIFT;
+		if (locked + user->locked_shm <= lock_limit)
+			allowed = 1;
+		else if (capable(CAP_IPC_LOCK))
+			allowed = 1;
+	}
+	if (allowed) {
+		get_uid(user);
+		user->locked_shm += locked;
+	}
 	spin_unlock(&shmlock_user_lock);
 	return allowed;
 }
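The decision logic of the restructured function can be exercised in userspace. The following is a model under stated assumptions, not kernel code: demo_user stands in for the single user_struct field involved, the has_cap argument stands in for capable(CAP_IPC_LOCK), the limit is passed in rather than read from current, a 4KB page size is assumed, and the spinlock and get_uid() reference are left out.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_PAGE_SHIFT		12
#define DEMO_PAGE_SIZE		(1UL << DEMO_PAGE_SHIFT)
#define DEMO_RLIM_INFINITY	ULONG_MAX	/* stand-in for RLIM_INFINITY */

/* Hypothetical stand-in for the one user_struct field the check touches. */
struct demo_user {
	unsigned long locked_shm;	/* pages already accounted to this user */
};

/*
 * Userspace model of the check. lock_limit is in bytes, as an rlimit
 * would be. Returns 1 and accounts the pages if locking is allowed,
 * otherwise returns 0 and leaves the accounting untouched.
 */
int demo_user_shm_lock(size_t size, struct demo_user *user,
		       unsigned long lock_limit, int has_cap)
{
	/* Round the request up to whole pages. */
	unsigned long locked = (size + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
	int allowed = 0;

	if (lock_limit == DEMO_RLIM_INFINITY)
		allowed = 1;
	else {
		lock_limit >>= DEMO_PAGE_SHIFT;
		if (locked + user->locked_shm <= lock_limit)
			allowed = 1;
		else if (has_cap)
			allowed = 1;
	}
	if (allowed)
		user->locked_shm += locked;
	return allowed;
}
```

With a 16KB limit, an 8KB request is granted and accounted as two pages; a further 12KB request is refused unless has_cap is set, and an unlimited rlimit grants any request.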
Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
On Wed, Jul 25, 2007 at 04:39:04PM +0200, Andrea Arcangeli wrote:
> For the kernel stack btw, when alloc_pages(order=1) fails vmalloc
> should be used and 4k stacks can be dropped. Nobody does dma from the
> stack anymore these days IIRC (it doesn't work in all archs anyway).

I have recent code for that circulating, albeit intended for debugging
purposes. There's nothing particularly debug-oriented about it, though,
apart from the fact that a guard page is automatically set up by
vmalloc() and that the use of vmalloc() is unconditional.

As for the rest, I'm sure there could be a lively conversation, but no
consensus, so I'll let it go.

-- wli
Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
On Wed, Jul 18, 2007 at 06:32:22AM -0700, William Lee Irwin III wrote:
>> Actually I'd worked on what was called MPSS (Multiple Page Size Support)
>> before I ever started on pgcl. Some large portion of the pgcl proposal
>> as I presented it internally was to reduce the order of large page
>> allocations and provide a promotion and demotion mechanism enabling
>> different processes to have different sized translations for the same
>> large page, and hence no out-of-context pagetable/TLB updates during
>> promotion and demotion, essentially by making the TLB translation to
>> page relation M:N. ISTR describing this in a KS presentation for which
>> IIRC you were present. But that's neither here nor there.

On Tue, Jul 24, 2007 at 09:44:18PM +0200, Andrea Arcangeli wrote:
> Well the whole difference between you back then and SGI now, is that
> your stuff wasn't being pushed to be merged very hard (it was proposed
> but IIRC more as research topic, like the large PAGE_SIZE also fallen
> into that same research area). See now the emails from SGI fs folks
> about variable order page size, they want it merged badly instead.

Neither were research topics, but I'm tired of correcting the history
of my failures. I've got enough ongoing failures as things stand.

On Tue, Jul 24, 2007 at 09:44:18PM +0200, Andrea Arcangeli wrote:
> My whole point is that the single moment the variable order page size
> isn't pure research anymore like MPSS, the CONFIG_PAGE_SHIFT isn't
> research anymore either, like the tail packing in pagecache with
> kmalloc also isn't research anymore.

There was never any research involved in the page clustering per se.
It was supposed to be a generally advantageous thing that Linus had at
least once explicitly approved of that just so happened to relieve
mem_map[] pressure on 64GB i386, the side effect intended to attract
corporate patronage. That last fact was not only demonstrable, it was
used in the first ever public demonstration of a 64GB i386 machine
running Linux, which I personally carried out.

Beyond active hindrances and lacks of cooperation, a "competing
solution" with distro backing appeared that removed the last vestige
of corporate patronage from the project. It ended up bitrotting faster
than I could singlehandedly do all the maintenance, testing, and coding
work on it while also trying to get anything else done.

MPSS was not as well-developed at the time the hugetlb "solution"
killed it, but is not terribly dissimilar in how it came into being,
developed, and then died, apart from less active hindrance.

The one and only aspect in which any research was involved was a
proposal, never accepted or pursued, to investigate how larger base
page sizes implemented via page clustering mitigated external
fragmentation for the purposes of MPSS and also how certain techniques
borrowed from page clustering could reduce the frequency of and
performance penalties associated with demotion in MPSS. The proposal
has never been publicly circulated, though some of its content was
described in the KS presentation as "future directions" or similar.

On Tue, Jul 24, 2007 at 09:44:18PM +0200, Andrea Arcangeli wrote:
> About the fs deciding the size of the pagecache granularity I totally
> dislike that design, there's no reason why the fs should control that,
[...]

This is all valid commentary, though I don't have any particular
response to it.

In any event, I've never been involved in a research project, though I
would've liked to have been. The emphasis in all cases was enabling
specific functionality in production, using techniques whose viability
had furthermore already been demonstrated elsewhere, by others.

In both instances, insurmountable nontechnical obstacles were present,
which remain in place and effectively limit the scale and scope of any
sort of project I can personally lead with any sort of likelihood of
mainline acceptance.

Where I am limited, you are not. Good luck to you.

-- wli
Re: [PATCH for review] [7/48] i386: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G
From: William Lee Irwin III <[EMAIL PROTECTED]>
>> PAE is useful for more than supporting more than 4GB RAM. It supports
>> expanded swapspace and NX executable protections. Some users may want NX
>> or expanded swapspace support without the overhead or instability of
>> highmem. For these reasons, the following patch divorces CONFIG_X86_PAE
>> from CONFIG_HIGHMEM64G.

On Thu, Jul 19, 2007 at 03:52:29PM +0100, Christoph Hellwig wrote:
> What overhead of instability of highmem? Sorry folks but this is utter
> bollocks. Back in the Caldera days we did a lot of measurement on highmem
> overhead, and CONFIG_HIGHMEM has no measurable overhead at all on a system
> that doesn't use it. CONFIG_HIGHMEM64G on the other hand has
> a quite visible overhead on small systems, but that's entirely due to the
> bigger page table entries that you need for NX.

The missing context here is CONFIG_VMSPLIT on laptops. Laptop users,
who frequently use CONFIG_VMSPLIT options to avoid highmem, wanted to
turn on NX. Prior to the patch, those options were barred for all
highmem configurations. In response to those requests, I produced the
patch.

The overhead and instability derived from tiny zones as opposed to
kmap()/kunmap(), or at least such was the case historically.

-- wli
Re: [PATCH] Check for compound pages in set_page_dirty()
On Thu, Jul 19, 2007 at 06:35:17PM +0100, Hugh Dickins wrote:
> I started from your patch. But it now seems to me a bugfix to remove
> those PageCompound tests, because they're preventing a hugetlb page
> from being marked dirty, when Ken needs it to be marked dirty so
> /proc/sys/vm/drop_caches doesn't drop the data read in by DIO.
> (His original patch went into -stable: would the patch fixing
> this all up need to go into -stable?)

This needs to be done some other way. The dirty bit should not be
significant for pseudofs's with no backing store. The consequences of
making it so are becoming apparent in the IO path, and it caused
performance regressions elsewhere as well. ramfs, for instance, doesn't
require anything of this sort to cope with drop_caches.

-- wli
Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
On Tue, Jul 17, 2007 at 10:47:37AM -0700, William Lee Irwin III wrote: >> You may rest assured that it's technically feasible. It's been done. >> The larger obstacles to all this are nontechnical. On Tue, Jul 17, 2007 at 09:33:08PM +0200, Andrea Arcangeli wrote: > Back then there was no variable order page size proposal, no slub, > generally nothing of that kind. > I think these days it worth to get it working again and solve the > technical obstacles once more time. Then we should plug into it a > pagecache logic to handle small files. That means if the soft page > size is 64k, we should kmalloc 32k of pagecache if the file is < 64k > but >= 32k, or kmalloc 16k if the file is < 32k but >= 16k, etc... Actually I'd worked on what was called MPSS (Multiple Page Size Support) before I ever started on pgcl. Some large portion of the pgcl proposal as I presented it internally was to reduce the order of large page allocations and provide a promotion and demotion mechanism enabling different processes to have different sized translations for the same large page, and hence no out-of-context pagetable/TLB updates during promotion and demotion, essentially by making the TLB translation to page relation M:N. ISTR describing this in a KS presentation for which IIRC you were present. But that's neither here nor there. On Tue, Jul 17, 2007 at 09:33:08PM +0200, Andrea Arcangeli wrote: > Down to 32bytes if we memcpy the 32bytes away to a 64k page, and we > disable the logic the moment somebody attempts to mmap the "kmalloced" > pagecache (which I think it's a lot simpler than trying to mmap a > kmalloced 4k naturally aligned object into userland). I wouldn't call > it tail packing, it's more a fine-granular pagecache with the already > available kmalloc granularities. That will maximize pagecache > utilization with read syscall for hg/git compared to current 2.6.22 > plus memory will be allocated faster in 64k chunks etc... 
Ideally it > should be possible to disable the finer-granular-kmalloc-pagecache on > the big irons with lots of memory and only working with big files. In any event, that is a sound strategy for mitigating internal fragmentation of pagecache, though internal fragmentation of anonymous memory has more severe consequences and is less easily mitigated. -- wli
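The sizing rule Andrea describes above (32k for a file below 64k but at least 32k, 16k for a file below 32k but at least 16k, and so on down to 32 bytes) can be sketched in C. This is only a literal reading of the quoted rule, not code from any posted patch: the name pagecache_chunk_size() and both constants are assumptions made for illustration, and the thread does not say how the part of a file beyond the chosen power of two would be covered.

```c
#include <stddef.h>

/* Assumed values taken from the example in the thread, not from a patch. */
#define SOFT_PAGE_SIZE ((size_t)64 * 1024) /* 64k software page */
#define MIN_CHUNK      ((size_t)32)        /* smallest chunk mentioned */

/*
 * Hypothetical helper: pick the pagecache chunk size for a file.
 * Files of at least one soft page use a full soft page; smaller files
 * get the largest power of two that does not exceed the file size,
 * with a floor of MIN_CHUNK.
 */
size_t pagecache_chunk_size(size_t file_size)
{
    size_t chunk = SOFT_PAGE_SIZE;

    if (file_size >= SOFT_PAGE_SIZE)
        return SOFT_PAGE_SIZE;

    /* Halve the chunk until it no longer exceeds the file size. */
    while (chunk > MIN_CHUNK && chunk > file_size)
        chunk >>= 1;

    return chunk;
}
```

Under these assumptions a 40k file maps to a 32k chunk and a 20k file to a 16k chunk, matching the two cases spelled out in the quoted text.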
Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote: > BTW, in a parallel thread (the thread where I've been suggested to > post this), Rik rightfully mentioned Bill once also tried to get this > working and basically asked for the differences. I don't know exactly > what Bill did, I only remember well the major reason he did it. Below > I add some more comment on the Bill, taken from my answer to Rik: I got it working. It merely bitrotted faster than I could maintain it. On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote: > Right, I almost forgot he also tried enlarging the PAGE_SIZE at some > point, back then it was for the 32bit systems with 64G of ram, to > reduce the mem_map array, something my patch achieves too btw. It was done for the occasion of the first publicly-announced boot of Linux on a 64GB x86-32 machine. On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote: > I thought his approach was of the old type, not backwards compatible, > the one we also thought for amd64, and I seem to remember he was > trying to solve the backwards compatibility issue without much > success. It was not of the old type. It followed Hugh's strategy, which made it fully backward-compatible. The only deficits in terms of success were performance, maintenance, and attracting any sort of audience. The only tester besides myself was literally Zwane. On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote: > But really I'm unsure how Bill could achieve anything backwards > compatible back then without anon-vma... anon-vma is the enabler. I > remember he worked on enlarging the PAGE_SIZE back then, but I don't > recall him exposing HARD_PAGE_SIZE to the common code either (actually > I never seen his code so I can't be sure of this). 
Even if he had pte > chains back then, reaching the pte wasn't enough and I doubt he could > unwalk the pagetable tree from pte up to pmd up to pgd/mm, up to vma > to read the vm_pgoff that btw was meaningless back then for the anon > vmas ;). It was exposed to the common code as MMUPAGE_SIZE. Significant pte vectoring code in the core was involved, as well as partial page distribution policies, mmap()/mprotect() et al handling splitting across physical page boundaries, and the like. When done wrong, applications such as /sbin/init didn't run. It was all there, though Hugh's earlier implementation was far superior. pte_chains didn't make things anywhere near as awkward as highpte. pte_chains didn't really care so much how large an area a struct page tracked. highpte OTOH needed more effort, though I don't recall specifically why anymore. My long-dead code should be at: ftp://ftp.kernel.org/pub/linux/kernel/people/wli/vm/pgcl/ dmesg's from 64GB x86-32 machines are also in that directory, dating from March 2003. On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote: > Things are very complex, but I think it's possible by doing proper > math on vm_pgoff, vm_start/vm_end and address, just with that 4 things > we should have enough info to know which parts of each page to map in > which pte, and that's all we need to solve it. At the second mprotect > of 4k over the same 8k page will get two vmas queued in the same > anon-vma. So we check both vmas and looking at the vm_pgoff(hardpage > units)+(((address-vm_start)&~PAGE_MASK)>>HARD_PAGE_SHIFT we should be > able to tell if the ptes behind the vma need to be updated and if the > second vma can be merged back. > The idea to make it work is to synchronously map all the ptes for all > indexes covered by each page as long as they're in the range > vm_start>>HARD_PAGE_SHIFT to vm_end >> HARD_PAGE_SHIFT. We should > threat a page fault like a multiple page fault. 
Then when you mprotect > or mremap you already know which ptes are mapped and that you need to > unmap/update by looking the start/end hard-page-indexes, and you also > have to always check all vmas that could possibly map that page, if > the page cross the vm_start/vm_end boundary. Hugh had this all worked out in 2001. I explored some alternatives in the design space, but they didn't perform as well as the original. It's best to refer to his original patch for reference as it's far cleaner, though in principle one should be able to find machines where the late 2.5.x patches I did will run. It was never exposed to a very broad variety of systems, so I can't vouch for much beyond NUMA-Q and ThinkPad and whatever Zwane booted it on. On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote: > Easy definitely not, but feasible I hope yes because I couldn't think > of a case where we can't figure out which part of the page to map in > which pte. I wish I had it implemented before posting because then I > would be 100% sure it was feasible ;). > Now if somebody here can think of a case where we can't know where to > map which part of the page in which pte, then *that* would be very > interesting and it could save some wasted development effort.
Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?
At some point in the past, I wrote: >> If at some point one of the pro-4k stacks crowd can prove that all >> code paths are safe, or introduce another viable alternative (such as >> Matt's idea for extending the stack dynamically), then removing the 8k >> stacks option makes sense. On Mon, Jul 16, 2007 at 11:54:38PM +0100, Alan Cox wrote: > Any x86-32 path unsafe with 4K stacks is almost certainly unsafe with 8K > stacks because the 8K stacks do not have separate IRQ stack paths, so you > have the same space but split. It might be less predictable on 8K stacks > but it isn't absent. At hch's suggestion I rewrote the separate IRQ stack configurability patch into one making IRQ stacks mandatory and unconfigurable, and hence enabled with 8K stacks. -- wli
Re: state of stack patches
On Thu, Jul 05, 2007 at 01:34:25PM -0700, Jeremy Fitzhardinge wrote: > What's the state of your stack patches? I'm still using the ones you > posted some time ago, and they seem like useful things to have in the > kernel. Is there anything preventing you from pushing them upstream? Just one thing: 2.6.22. I can, of course, do updating for -mm, -ak, et al. -- wli
Re: [RFC] fsblock
On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote: > fsblock is a rewrite of the "buffer layer" (ding dong the witch is > dead), which I have been working on, on and off and is now at the stage > where some of the basics are working-ish. This email is going to be > long... Long overdue. Thank you. -- wli
Re: JIT emulator needs
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote: >>> c. open() flag to unlink a file before returning the fd On Jun 19, 2007, at 11:08:24, William Lee Irwin III wrote: >> You probably want a tmpfile(3) -like affair which never has a >> pathname to begin with. It could be useful for security purposes >> more generally. On Fri, Jun 22, 2007 at 11:52:12PM -0400, Kyle Moffett wrote: > maybe this: open("/some/dir", O_TMPFILE); > and this? open("/some/dir", O_TMPFILE|O_DIRECTORY); > The former would return a filehandle to a new anonymous file > somewhere on whatever filesystem backs the specified path. The > latter would do the same, except create an anonymous directory where > you could use "openat()" or something. Presumably "lsof" and "/proc" > should show either type of handle as referring to either "/some/ > filesystem/" or "/some/filesystem/ (anonymous temp file)" or something. This is plausible (and I did indeed consider the file variant), though it may require more infrastructure than for tmpfs only. It may be worth clarifying that I have no concrete plans to work on the JIT emulator issues myself. I'm only disseminating ideas I think will pass review. I expect others to take up the issue(s) perhaps with some inspiration from what I described. I may review some, but I have a large review backlog as things now stand. -- wli
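For comparison, the userspace approximation that the proposed flag would replace can be sketched as below. It uses only standard POSIX calls (mkstemp(), unlink(), fstat()); the function names open_unlinked_tmpfile() and fd_link_count() and the "anon-XXXXXX" template are made up for this illustration. Unlike the in-kernel variant discussed above, the file briefly has a visible pathname between creation and unlink, which is exactly the window the proposal aims to close. (Linux later gained an O_TMPFILE open flag, in 3.11, with broadly the semantics of the file variant; the directory variant is not something this sketch assumes exists.)

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a temporary file inside `dir` and remove
 * its name right away, so that only the returned descriptor refers to it.
 * Returns the descriptor, or -1 on failure.
 */
int open_unlinked_tmpfile(const char *dir)
{
    char path[4096];
    int fd, n;

    n = snprintf(path, sizeof(path), "%s/anon-XXXXXX", dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    fd = mkstemp(path);          /* creates and opens a uniquely named file */
    if (fd < 0)
        return -1;

    if (unlink(path) != 0) {     /* drop the name; the inode lives on via fd */
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: number of directory entries still naming the file. */
long fd_link_count(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}
```

A caller would use the descriptor like any other and simply close() it when done; the storage is released once the last descriptor and mapping go away, even if the program dies without cleaning up.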
Re: JIT emulator needs
William Lee Irwin III wrote: >> I presumed an ELF note or extended filesystem attributes were already >> in place for this sort of affair. It may be that the model implemented >> is so restrictive that users are forbidden to create new executables, >> in which case using a different model is certainly in order. Otherwise >> the ELF note or attributes need to be implemented. On Wed, Jun 20, 2007 at 09:37:31AM -0700, H. Peter Anvin wrote: > Another thing to keep in mind, since we're talking about security > policies in the first place, is that anything like this *MUST* be > "opt-in" on the part of the security policy, because what we're talking > about is circumventing an explicit security policy just based on a > user-provided binary saying, in effect, "don't worry, I know what I'm > doing." > Changing the meaning of an established explicit security policy is not > acceptable. This is what I had in mind with the commentary on the intentions of the policy. Thank you for correcting my hamhanded attempt to describe it. -- wli
Re: JIT emulator needs
On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote: >> If the policy forbidding self-modifying code lacks a method of >> exempting programs such as JIT interpreters (which I doubt) then >> it's a problem. I'm with Alan on this one. On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote: > It does and it doesn't. There is not a reasonable way for a > user to mark an app as needing full self-modifying ability. > It's not like the executable stack, which can be set via the > ELF note markings on the executable. (ELF note markings are > ideal because they can not be used via a ret-to-libc attack) > With admin privs, one can change SE Linux settings. Mark the > executable, disable the protection system-wide, generate a > completely new SE Linux policy, or just turn SE Linux off. > Normally we don't expect/require admin privs to install an > executable in one's own ~/bin directory. This is broken. > It ought to be easier to get a JIT working well without > enabling arbitrary mprotect. This would allow a JIT to > partially benefit from the recent security enhancements. > (think of all the buggy browser-based JIT things!) I presumed an ELF note or extended filesystem attributes were already in place for this sort of affair. It may be that the model implemented is so restrictive that users are forbidden to create new executables, in which case using a different model is certainly in order. Otherwise the ELF note or attributes need to be implemented. On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote: >> This sort of logic might be appropriate for a sort of parametrized >> and specialized vma allocator setting the policy in /proc/ along >> with various sorts of limits. There are limits to such and at some >> point things will have to manually manage their own process address >> spaces in a platform-specific fashion. If kernel assistance here is >> rejected they may have to do so in all cases. 
On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote: > I prefer ELF notes (for start-up allocations) and prctl, > plus a mmap flag for per-allocation behavior. Beware that the kernel (upstream of me) will likely refuse to support exotic mmap() placement policies. At that point userspace will have to implement them itself with a front-end to mmap(). Userspace can actually live without kernel placement support for everything but the executable itself, which is already implemented via ELF loading standards. This is not to downplay the tremendous amounts of pain involved for moving the stack, getting ld.so to land in the right place, and so on. Actually I'm less sure about .interp placement. In any event, exotic virtualspace allocation policies are largely yet another "simple matter of programming" implementable entirely in userspace. On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote: >> This is a bad idea. The standard semantics are needed for programs >> relying upon them. On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote: > I didn't mean that the default default :-) setting would change. > I meant that people could change the behavior from a boot script. > Things that break are really foul and nasty anyway, probably with > serious problems that ought to get fixed. It's actually not a good idea to make it the default even via sysctl. People won't realize something will break until it does, and what will break is likely to be a database responsible for data integrity. The IPC_RMID creation flag should suffice. On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote: >> You probably want a tmpfile(3) -like affair which never has a pathname >> to begin with. It could be useful for security purposes more generally. On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote: > Yes, exactly. I think there are some possible optimizations > available too, particularly with the cifs filesystem.
I doubt this will be controversial, but it's not clear to me that there is any convenient way to obtain an anonymous inode on anything but tmpfs, in which case it's not really anonymous, but not visible to userspace on account of the default kern_mount(). Essentially it's possible to hoist the tmpfile name generation in-kernel to where it's in a disconnected namespace not visible to any userspace whatsoever, and kernel threads can cooperatively ensure safety via access discipline. Alternatively, one could kern_mount() a fresh tmpfs filesystem for some concurrency domain, e.g. per-uid, per-process, or per-thread. On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote: >> This sounds vaguely like another syscall, like mdup(). This is >> particularly meaningful in the context of anonymous memory, for >> which there is no method of replicating mappings within a single >> process address space. On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote: > Yes, mdup() and probably mdup2(). It could be mremap flags or not. > JIT emulators generally need a second mapping so that they can have > both read/write and execute
Re: JIT emulator needs
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> Right now, Linux isn't all that friendly to JIT emulators.
> Here are the problems and suggestions to improve the situation.
> There is an SE Linux execmem restriction that enforces W^X.
> Assuming you don't wish to just disable SE Linux, there are
> two ugly ways around the problem. You can mmap a file twice,
> or you can abuse SysV shared memory. The mmap method requires
> that you know of a filesystem mounted rw,exec where you can
> write a very large temporary file. This arbitrary filesystem,
> rather than swap space, will be the backing store. The SysV
> shared memory method requires an undocumented flag and is
> subject to some annoying size limits. Both methods create
> objects that will fail to be deleted if the program dies
> before marking the objects for deletion.

If the policy forbidding self-modifying code lacks a method of exempting programs such as JIT interpreters (which I doubt) then it's a problem. I'm with Alan on this one.

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> Processors often have annoying limits on the immediate values
> in instructions. An x86 or x86_64 JIT can go a bit faster if
> all allocations are kept to the low 2 GB of address space.
> There are also reasons for a 32bit-to-x86_64 JIT to choose
> a nearly arbitrary 2 GB region that lies above 4 GB.
> Other archs have other limits, such as 32 MB or 256 MB.

This sort of logic might be appropriate for a sort of parametrized and specialized vma allocator setting the policy in /proc/ along with various sorts of limits. There are limits to such and at some point things will have to manually manage their own process address spaces in a platform-specific fashion. If kernel assistance here is rejected they may have to do so in all cases.

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> Sometimes it is very helpful to have the read/write mapping
> be a fixed offset from the read/exec mapping. A power of 2
> can be especially desirable.

As far as the kernel is concerned they're unrelated, so this will likely need MAP_FIXED barring a staggering array of fresh system calls to act on tuples of memory ranges in lockstep.

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> Emulators often need a cheap way to change page permissions.
> One VMA per page is no good. Besides taking up space and making
> many things generally slower, having one VMA per page causes
> a huge performance loss for snapshot roll-back operations.
> Just tearing down all those VMAs takes a good while.

remap_file_pages_prot() is reputedly waiting in the wings somewhere for this.

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> Additions to better support JIT emulators:
> a. sysctl to set IPC_RMID by default

This is a bad idea. The standard semantics are needed for programs relying upon them.

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> b. shmget() flag to set IPC_RMID by default

This is relatively innocuous.

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> c. open() flag to unlink a file before returning the fd

You probably want a tmpfile(3)-like affair which never has a pathname to begin with. It could be useful for security purposes more generally.

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> d. mremap() flag to always keep the old mapping

This sounds vaguely like another syscall, like mdup(). This is particularly meaningful in the context of anonymous memory, for which there is no method of replicating mappings within a single process address space.

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> e. mremap() flag to get a read/write mapping of a read/exec one
> f. mremap() flag to get a read/exec mapping of a read/write one

Presumably to be used in conjunction with keeping the old mapping. A composite mdup()/mremap() and mprotect(), presumably saving a TLB flush or other sorts of overhead, may make some sort of sense here. Odds are it'll get rejected as the sequence of syscalls is a rather precise equivalent, though it would optimize things (as would other composite syscalls, e.g. ones combining fork() and execve() etc.).

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> g. mremap() flag to make the 5th arg (new addr) be the upper limit
> h. 6-bit wide mremap() "flag" to set the upper limit above given base

Essentially more placement support for mremap()/mdup(). It's not clear to me those particular semantics are the ideal ones. A target range for placement should do, if not manual address space management.

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> i. support the prot argument to remap_file_pages

This is probably going to happen anyway.

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> j. a documented way (madvise?) to punch same-VMA zero-page holes

This is MADV_REMOVE, though most filesystems don't
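For readers unfamiliar with the "mmap a file twice" workaround mentioned at the top of this message, here is a minimal userspace sketch of it. It is not code from the thread: struct dual_map, dual_map_create() and dual_map_destroy() are hypothetical names, and the directory passed in is assumed to sit on a filesystem mounted rw,exec. It only uses mkstemp(), unlink(), ftruncate(), mmap() and munmap(). Note that the window between mkstemp() and unlink() is exactly the "program dies before marking the object for deletion" problem described above.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper type: two views of the same file-backed pages. */
struct dual_map {
	void *rw;	/* read/write view, for the code generator */
	void *rx;	/* read/exec view, for running generated code */
	size_t len;
};

/*
 * Create both mappings from a temporary file in `dir`, which is assumed
 * to be on a filesystem mounted rw,exec. Returns 0 on success and -1 on
 * failure. The file is unlinked right after creation, so once that call
 * has run the backing object cannot outlive the process.
 */
int dual_map_create(struct dual_map *dm, const char *dir, size_t len)
{
	char path[4096];
	int fd, n;

	assert(dm != NULL && dir != NULL && len > 0);

	n = snprintf(path, sizeof(path), "%s/jitXXXXXX", dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);	/* no pathname remains; the fd keeps the inode alive */

	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}

	/* Two MAP_SHARED mappings of the same file offset share pages. */
	dm->rw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	dm->rx = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
	close(fd);	/* the mappings hold their own references */

	if (dm->rw == MAP_FAILED || dm->rx == MAP_FAILED) {
		if (dm->rw != MAP_FAILED)
			munmap(dm->rw, len);
		if (dm->rx != MAP_FAILED)
			munmap(dm->rx, len);
		return -1;
	}
	dm->len = len;
	return 0;
}

/* Tear down both views. */
void dual_map_destroy(struct dual_map *dm)
{
	munmap(dm->rw, dm->len);
	munmap(dm->rx, dm->len);
}
```

A caller would write generated instructions through dm.rw (for example with memcpy()) and jump to the same offset in dm.rx; the jump itself is architecture-specific and is left out here. The two addresses are unrelated as far as the kernel is concerned, so a fixed offset between them would still need MAP_FIXED.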
Re: [patch 00/14] Page cache cleanup in anticipation of Large Blocksize support
On Sun, 17 Jun 2007, Matt Mackall wrote:
>> Is it? Last I looked it had reverted to handing out reverse-contiguous
>> pages.

On Sun, Jun 17, 2007 at 07:08:41PM -0700, Christoph Lameter wrote:
> I thought that was fixed? Bill Irwin was working on it.
> But the contiguous pages usually only work shortly after boot. After
> a while memory gets sufficiently scrambled that the coalescing in the I/O
> layer becomes ineffective.

It fell off the bottom of my priority queue, sorry.

-- wli
Re: [2/2] 2.6.22-rc4: known regressions v3
On Thu, Jun 14, 2007 at 03:57:25PM +0100, Mark Fortescue wrote:
> Benh's ptep_set_access_flags() patch needs to be applied in order to get
> anywhere with sun4c for all kernels >= linux-2.6.15. If not applied, you
> will be lucky to get sash running as your init and even that will have
> very limited capabilities before it locks up the processor (power up
> reset required).
> It has been applied to both the kernels I used for testing so this
> problem is independent of the ptep_set_access_flags patch but that
> does not mean that it is not a related issue.
> I will try to get some testing done over the weekend to narrow down
> when the random illegal instructions first occur.
> If I start with 2.6.21 then if that is OK, then I should be able to narrow
> the issue down without too much trouble. If it is between 2.6.20 and
> 2.6.21 then it will be a right pig as there are a large number of commits
> that don't compile for sun4c between these two. What I am hoping is that
> it occurs in the 2.6.22-rc2 as per the x86_64.

Sounds like I'll be digging through my hardware stockpiles this weekend to find a functional sun4c box.

-- wli
Re: [2/2] 2.6.22-rc4: known regressions v3
On Thu, Jun 14, 2007 at 11:30:25AM +0100, Mark Fortescue wrote:
> They appear as soon as simpleinit starts up. Sometimes I get to a login
> prompt before seeing any. Other times, commands in the simpleinit rc
> script fail.
> They do appear to be random. If a command fails, you re-run the command
> and it is OK. Commands seen to fail are basic (depmod, rm, cat ...).
> The tests I did used the same binaries with both the OK and problem kernels
> so it is not a change to the application code, it is definitely a kernel
> issue.

This sounds like it may be addressed by benh's ptep_set_access_flags() fixes. Those fixes are still in -mm, hopefully to hit mainline by 2.6.22.

-- wli
Re: [2/2] 2.6.22-rc4: known regressions v3
On Wed, Jun 13, 2007 at 11:25:20PM +0100, Mark Fortescue wrote:
> The random seg faults on x86_64 are interesting as I have been getting
> random illegal instruction faults on sparc (sun4c) with 2.6.22-rc3. I have
> not yet tried to track it down. All I know at present is that it is not a
> problem on 2.6.20.9.

Very interesting. Any hints as to how to test or how long to wait before the illegal instructions happen?

-- wli
Re: [shm][hugetlb] Fix get_policy for stacked shared memory files
On Tue, Jun 12, 2007 at 12:20:52AM -0600, Eric W. Biederman wrote:
> Does this perhaps need to be:
>> diff --git a/ipc/shm.c b/ipc/shm.c
>> index 4fefbad..8d2672d 100644
>> --- a/ipc/shm.c
>> +++ b/ipc/shm.c
>> @@ -254,8 +254,10 @@ struct mempolicy *shm_get_policy(struct vm_area_struct *vma, unsigned long addr)
>>
>> +pol = NULL;
>>
>> if (sfd->vm_ops->get_policy)
>> pol = sfd->vm_ops->get_policy(vma, addr);
>> -else
>> +else if (vma->vm_policy && vma->vm_policy->policy != MPOL_DEFAULT)
>> pol = vma->vm_policy;
>> return pol;

Those paths are above the level where shm_get_policy() is called. It may be that shm_get_policy() doesn't need to recapitulate them if it's only ever called through such codepaths. It's not clear to me whether that's intended as an invariant or is coincidental and not guaranteed for future callsites.

-- wli
Re: [shm][hugetlb] Fix get_policy for stacked shared memory files
On Mon, Jun 11, 2007 at 09:30:20PM -0700, Andrew Morton wrote:
> Can we just double-check the refcounting please?

The refcounting for mpols doesn't look good in general. I'm more curious as to what releases the refcounts. alloc_page_vma(), for instance, does get_vma_policy() which eventually takes a reference, without ever releasing the reference it acquires. get_vma_policy() itself uses a similar idiom to that used in aglitke's patch. I think mpol refcounting needs to be addressed elsewhere besides this patch.

-- wli
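To make the concern concrete, here is a toy userspace model of the get-without-put pattern described above. None of this is kernel code: struct policy, policy_get(), policy_put() and the two alloc_* functions are hypothetical stand-ins, and the real mempolicy refcounting may differ in detail.

```c
#include <assert.h>

/* Toy stand-in for a refcounted policy object. */
struct policy {
	int refcnt;
};

/* Take a reference and hand the object back to the caller. */
static struct policy *policy_get(struct policy *p)
{
	p->refcnt++;
	return p;
}

/* Drop a reference previously taken with policy_get(). */
static void policy_put(struct policy *p)
{
	assert(p->refcnt > 0);
	p->refcnt--;
}

/* The pattern being questioned: a reference is taken and never dropped. */
int alloc_leaky(struct policy *shared)
{
	struct policy *p = policy_get(shared);

	return p->refcnt;	/* reference count seen while "allocating" */
}

/* The balanced version: every get is paired with a put. */
int alloc_balanced(struct policy *shared)
{
	struct policy *p = policy_get(shared);
	int seen = p->refcnt;

	policy_put(p);
	return seen;
}
```

With a policy that starts at refcnt 1, each alloc_leaky() call leaves the count one higher than before, so the object can never be freed, while alloc_balanced() leaves it unchanged.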
Re: [shm][hugetlb] Fix get_policy for stacked shared memory files
On Mon, Jun 11, 2007 at 04:34:54PM -0500, Adam Litke wrote:
> Here's another breakage as a result of shared memory stacked files :(
> The NUMA policy for a VMA is determined by checking the following (in
> the order given):
> 1) vma->vm_ops->get_policy() (if defined)
> 2) vma->vm_policy (if defined)
> 3) task->mempolicy (if defined)
> 4) Fall back to default_policy
> By switching to stacked files for shared memory, get_policy() is now
> always set to shm_get_policy which is a wrapper function. This
> causes us to stop at step 1, which yields NULL for hugetlb instead of
> task->mempolicy which was the previous (and correct) result.
> This patch modifies the shm_get_policy() wrapper to maintain steps 1-3 for the
> wrapped vm_ops. Andi and Christoph, does this look right to you?
> Signed-off-by: Adam Litke <[EMAIL PROTECTED]>

Thanks for fielding this. The fix is certainly clear enough.

Acked-by: William Irwin <[EMAIL PROTECTED]>

-- wli
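The lookup order and the wrapper problem described in the quoted text can be modelled in a few lines of userspace C. This is a simplified sketch, not the kernel source: the structures are cut-down stand-ins, inner_ops is a hypothetical field standing in for the wrapped vm_ops, current_task stands in for the running task, and both wrapper functions are guesses at the shape of the code before and after the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures; not the real layouts. */
struct mempolicy {
	const char *name;
};

struct vma;

struct vm_ops {
	struct mempolicy *(*get_policy)(struct vma *vma, unsigned long addr);
};

struct vma {
	const struct vm_ops *vm_ops;	/* ops seen by the VM (the wrapper) */
	const struct vm_ops *inner_ops;	/* wrapped ops; hypothetical field */
	struct mempolicy *vm_policy;
};

struct task {
	struct mempolicy *mempolicy;
};

struct mempolicy default_policy = { "default" };
struct task *current_task;	/* stand-in for the running task */

/* Steps 1-4 from the message: a NULL from step 1 skips steps 2 and 3. */
struct mempolicy *lookup_policy(struct vma *vma, unsigned long addr)
{
	struct mempolicy *pol;

	assert(current_task != NULL && vma != NULL);
	pol = current_task->mempolicy;				/* step 3 */
	if (vma->vm_ops && vma->vm_ops->get_policy)
		pol = vma->vm_ops->get_policy(vma, addr);	/* step 1 */
	else if (vma->vm_policy)
		pol = vma->vm_policy;				/* step 2 */
	return pol ? pol : &default_policy;			/* step 4 */
}

/* Modelled pre-fix wrapper: NULL when the wrapped ops define nothing. */
struct mempolicy *wrap_get_policy_old(struct vma *vma, unsigned long addr)
{
	if (vma->inner_ops && vma->inner_ops->get_policy)
		return vma->inner_ops->get_policy(vma, addr);
	return NULL;
}

/* Modelled post-fix wrapper: redo steps 1-3 for the wrapped ops. */
struct mempolicy *wrap_get_policy_fixed(struct vma *vma, unsigned long addr)
{
	if (vma->inner_ops && vma->inner_ops->get_policy)
		return vma->inner_ops->get_policy(vma, addr);
	if (vma->vm_policy)
		return vma->vm_policy;
	return current_task->mempolicy;
}
```

In this model, a vma whose wrapped ops have no get_policy and whose vm_policy is NULL resolves to default_policy with the old wrapper and to the task's mempolicy with the fixed one, which is the behaviour change the patch description claims.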
Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G
On Thu, Jun 07, 2007 at 07:35:51PM -0700, William Lee Irwin III wrote:
>> + PAE is required for NX support, and furthermore enables
>> + larger swapspace support for non-overcommit purposes. It
>> + has the cost of more pagetable lookup overhead, and also
>> + consumes more pagetable space per process.

On Tue, Jun 12, 2007 at 01:52:35AM +0200, Adrian Bunk wrote:
> It's not specific to this help text, but I start becoming a bit picky
> about these issues:
> If you understand this help text after reading it, you don't need a help
> text for this option... ;-)
> What is "NX support"?
> What are "non-overcommit purposes"?
> What is "pagetable lookup overhead"?
> And if in doubt, should I say Y or N?
> "System administrator who knows which hardware components he put into
> the computer and which filesystems his data is on" might be a good
> description for the average kconfig user, and these are the people who
> should understand this help text.

I would like to have some place to explain issues such as those, but there are as of yet no designated places for tutorial-level information. If such a place were provided, I would provide storybook commentary to explain all those. The same actually holds for kernel function docbook.

-- wli
Re: [patch 7/8] fdmap v2 - implement sys_socket2
On Sun, Jun 10, 2007 at 04:26:07PM +1000, Paul Mackerras wrote:
> If you don't think we should be bound by POSIX, then you are perfectly
> free to go off and write your own research kernel with whatever
> interface you want, and no programs available to run on it. :)

This isn't fair to research kernels. Breaking applications is not an active area of research.

-- wli
Re: 2.6.21 numa policy and huge pages not working
On Sat, Jun 09, 2007 at 09:10:51PM -0700, dean gaudet wrote:
> ok i've narrowed it some... maybe.
> in commit 8ef8286689c6b5bc76212437b85bdd2ba749ee44 things work fine, numa
> policy is respected...
> the very next commit bc56bba8f31bd99f350a5ebfd43d50f411b620c7 breaks shm
> badly causing the test program to oops the kernel.
> commit 516dffdcd8827a40532798602830dfcfc672294c fixes that breakage but
> numa policy is no longer respected.
> i've added the authors of those two commits to the recipient list and
> reattached the test program. hopefully someone can shed light on the
> problem.

Thanks, this helps a lot.

-- wli
Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G
On Fri, Jun 08, 2007 at 10:07:52AM +0200, Mikael Pettersson wrote:
> Is this really needed? I can see why VMSPLIT_{2,3}G_OPT would
> depend on !HIGHMEM, but why would they depend on !X86_PAE?

The only reason they depend on !HIGHMEM is because handling for 1GB-unaligned splits is unimplemented for PAE, which formerly only occurred in conjunction with HIGHMEM64G. That said, they were oriented toward avoiding highmem on laptops, hence the broader !HIGHMEM constraint.

The entire point of the patch is to add an option to use PAE without highmem for the purposes of NX and secondarily expanded swapspace, at which point CONFIG_VMSPLIT_[23]G_OPT need some other way besides !HIGHMEM to exclude PAE, such as specifying !X86_PAE directly.

-- wli
Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G
William Lee Irwin III wrote:
>> Beg your pardon? Are you reading the patch description correctly?

On Thu, Jun 07, 2007 at 08:44:09PM -0700, H. Peter Anvin wrote:
> I mean, with your patch CONFIG_HIGHMEM4G versus CONFIG_HIGHMEM64G really
> don't make sense as separate selections anymore.

I thought about sweeping those up, but defaulted to minimal diffsize.
I can sweep them up given more votes in favor of doing so.

-- wli
Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G
William Lee Irwin III wrote:
>> !CONFIG_X86_PAE && CONFIG_HIGHMEM64G doesn't make sense and is not allowed
>> by this patch. CONFIG_X86_PAE && !CONFIG_HIGHMEM64G works here.

On Thu, Jun 07, 2007 at 08:38:22PM -0700, H. Peter Anvin wrote:
> But what's the point?
> If you're going to divorce these, at least do it in a way that makes
> sense, specifically the two independent variables are PAE and HIGHMEM.
> PAE and !HIGHMEM does make (some amount of) sense, due to no kmap overhead.

Beg your pardon? Are you reading the patch description correctly?

-- wli
Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G
On Thu, 7 Jun 2007 19:35:51 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> PAE is useful for more than supporting more than 4GB RAM. It supports
>> expanded swapspace and NX executable protections. Some users may want
>> NX or expanded swapspace support without the overhead or instability
>> of highmem. For these reasons, the following patch divorces
>> CONFIG_X86_PAE from CONFIG_HIGHMEM64G.

On Thu, Jun 07, 2007 at 07:41:56PM -0700, Andrew Morton wrote:
> Do (CONFIG_X86_PAE && !CONFIG_HIGHMEM64G) and (!CONFIG_X86_PAE && CONFIG_HIGHMEM64G)
> kernels actually work? I wouldn't be surprised if there are places where we used
> the incorrect one.

!CONFIG_X86_PAE && CONFIG_HIGHMEM64G doesn't make sense and is not allowed
by this patch. CONFIG_X86_PAE && !CONFIG_HIGHMEM64G works here.

-- wli
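The combinations discussed in this reply can be enumerated mechanically. A minimal sketch in C, assuming the two Kconfig relations the patch introduces (HIGHMEM64G selects X86_PAE, and X86_PAE depends on !HIGHMEM4G). The enum and the function config_allowed are made up for illustration and are not kernel identifiers; this models the constraints, it is not how Kconfig evaluates them.

```c
#include <assert.h>
#include <stdbool.h>

/* The three mutually exclusive highmem choices on i386 (illustrative enum). */
enum highmem_choice { NOHIGHMEM, HIGHMEM4G, HIGHMEM64G };

/*
 * Hypothetical model of the constraints after the patch:
 *   - HIGHMEM64G selects X86_PAE, so it cannot appear without PAE;
 *   - X86_PAE depends on !HIGHMEM4G, so the two cannot be combined;
 *   - without highmem, PAE becomes a free choice.
 */
static bool config_allowed(enum highmem_choice hm, bool pae)
{
	assert(hm == NOHIGHMEM || hm == HIGHMEM4G || hm == HIGHMEM64G);
	if (hm == HIGHMEM64G)
		return pae;
	if (hm == HIGHMEM4G)
		return !pae;
	return true;
}
```

Under this model config_allowed(HIGHMEM64G, false) is rejected, which is the "!CONFIG_X86_PAE && CONFIG_HIGHMEM64G" case above, while config_allowed(NOHIGHMEM, true) is accepted, which is the new PAE-without-highmem configuration.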
divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G
PAE is useful for more than supporting more than 4GB RAM. It supports
expanded swapspace and NX executable protections. Some users may want
NX or expanded swapspace support without the overhead or instability
of highmem. For these reasons, the following patch divorces
CONFIG_X86_PAE from CONFIG_HIGHMEM64G.

vs. 2.6.22-rc4-mm2

Cc: Mark Lord <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>
Cc: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: William Irwin <[EMAIL PROTECTED]>

Index: mm-2.6.22-rc4-2/arch/i386/Kconfig
===
--- mm-2.6.22-rc4-2.orig/arch/i386/Kconfig	2007-06-07 00:05:53.609599701 -0700
+++ mm-2.6.22-rc4-2/arch/i386/Kconfig	2007-06-07 17:02:24.333262965 -0700
@@ -544,6 +544,7 @@
 config HIGHMEM64G
 	bool "64GB"
 	depends on X86_CMPXCHG64
+	select X86_PAE
 	help
 	  Select this if you have a 32-bit processor and more than 4
 	  gigabytes of physical RAM.
@@ -573,12 +574,12 @@
 	config VMSPLIT_3G
 		bool "3G/1G user/kernel split"
 	config VMSPLIT_3G_OPT
-		depends on !HIGHMEM
+		depends on !X86_PAE
 		bool "3G/1G user/kernel split (for full 1G low memory)"
 	config VMSPLIT_2G
 		bool "2G/2G user/kernel split"
 	config VMSPLIT_2G_OPT
-		depends on !HIGHMEM
+		depends on !X86_PAE
 		bool "2G/2G user/kernel split (for full 2G low memory)"
 	config VMSPLIT_1G
 		bool "1G/3G user/kernel split"
@@ -598,10 +599,15 @@
 	default y
 
 config X86_PAE
-	bool
-	depends on HIGHMEM64G
-	default y
+	bool "PAE (Physical Address Extension) Support"
+	default n
+	depends on !HIGHMEM4G
 	select RESOURCES_64BIT
+	help
+	  PAE is required for NX support, and furthermore enables
+	  larger swapspace support for non-overcommit purposes. It
+	  has the cost of more pagetable lookup overhead, and also
+	  consumes more pagetable space per process.
 
 # Common NUMA Features
 config NUMA
Index: mm-2.6.22-rc4-2/arch/i386/kernel/setup.c
===
--- mm-2.6.22-rc4-2.orig/arch/i386/kernel/setup.c	2007-06-06 23:52:18.839168580 -0700
+++ mm-2.6.22-rc4-2/arch/i386/kernel/setup.c	2007-06-07 17:02:24.349263876 -0700
@@ -273,18 +273,18 @@
 		printk(KERN_WARNING "Warning only %ldMB will be used.\n",
 					MAXMEM>>20);
 		if (max_pfn > MAX_NONPAE_PFN)
-			printk(KERN_WARNING "Use a PAE enabled kernel.\n");
+			printk(KERN_WARNING "Use a HIGHMEM64G enabled kernel.\n");
 		else
 			printk(KERN_WARNING "Use a HIGHMEM enabled kernel.\n");
 		max_pfn = MAXMEM_PFN;
 #else /* !CONFIG_HIGHMEM */
-#ifndef CONFIG_X86_PAE
+#ifndef CONFIG_HIGHMEM64G
 		if (max_pfn > MAX_NONPAE_PFN) {
 			max_pfn = MAX_NONPAE_PFN;
 			printk(KERN_WARNING "Warning only 4GB will be used.\n");
-			printk(KERN_WARNING "Use a PAE enabled kernel.\n");
+			printk(KERN_WARNING "Use a HIGHMEM64G enabled kernel.\n");
 		}
-#endif /* !CONFIG_HIGHMEM64G */
+#endif /* !CONFIG_HIGHMEM64G */
 #endif /* !CONFIG_HIGHMEM */
 	} else {
 		if (highmem_pages == -1)
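The setup.c hunk keys the 4GB clamp on CONFIG_HIGHMEM64G rather than CONFIG_X86_PAE, since after the patch PAE alone no longer implies access to memory above 4GB. A simplified sketch of that clamp for the highmem-enabled branch only, written as ordinary C with a boolean standing in for the preprocessor test. The function clamp_max_pfn and its parameters are made up for illustration; the 4KB page size and the value of MAX_NONPAE_PFN are stated from memory of the i386 code and should be read as assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT	12	/* 4KB pages, assumed */
/* Page frames reachable with 32-bit physical addresses: 4GB / 4KB. */
#define MAX_NONPAE_PFN	(UINT64_C(1) << (32 - PAGE_SHIFT))

/*
 * Hypothetical stand-in for the "#ifndef CONFIG_HIGHMEM64G" block:
 * a kernel built without HIGHMEM64G uses at most 4GB worth of frames.
 */
static uint64_t clamp_max_pfn(uint64_t max_pfn, bool highmem64g)
{
	assert(max_pfn > 0);
	if (!highmem64g && max_pfn > MAX_NONPAE_PFN)
		return MAX_NONPAE_PFN;	/* "Warning only 4GB will be used." */
	return max_pfn;
}
```

For a machine with 8GB of RAM (2^21 page frames under these assumptions) the sketch returns 2^20 without HIGHMEM64G and leaves the value untouched with it.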
Re: why does the macro "ZERO_PAGE" take an argument?
Robert P. J. Day wrote:
>> although it's not clear where in the source tree are the invocations
>> that would actually make a difference to a MIPS system, which is why
>> i've CC'ed ralf on this. i'm sure he can clear this up. :-)

On Thu, Jun 07, 2007 at 10:32:29AM -0700, H. Peter Anvin wrote:
> x86 could also benefit from coloured zeropages. In fact, I thought it
> already had them (K8 wants as many as 8.)

How would one demonstrate the beneficial effect of such?

-- wli
Re: 2.6.22-rc4-mm2
On Thu, Jun 07, 2007 at 12:19:22AM -0700, Andrew Morton wrote:
> hm, OK, this seems to work:
[...]
> -#ifdef CONFIG_HIGHMEM
> +#if defined(CONFIG_HIGHMEM) && defined(CONFIG_ARCH_POPULATES_NODE_MAP)
>  	return movable_zone == ZONE_HIGHMEM;
>  #else
>  	return 0;
> _
> (the first ifdef is just there to trip things at compile time rather than
> link time)

I guess it's not the arch's fault after all. I probably would've
conditionally out-of-lined the thing so as never to expose movable_zone
but this will do just fine.

-- wli
Re: 2.6.22-rc4-mm2
On Thu, Jun 07, 2007 at 12:01:25AM -0700, Andrew Morton wrote:
>> config, please?

On Thu, Jun 07, 2007 at 12:04:07AM -0700, William Lee Irwin III wrote:
> It's the sparc32 defconfig. Included below for completeness.

The error output looks like the following.

-- wli

$ quilt top
create-the-zone_movable-zone-fix.patch
$ (yes "" | make ARCH=sparc CROSS_COMPILE="sparc-linux-" CC="gcc-sparc-4.1" quiet=1 -j16 defconfig) >& /dev/null; yes "" | make ARCH=sparc CROSS_COMPILE="sparc-linux-" CC="gcc-sparc-4.1" quiet=1 -j16 image modules
scripts/kconfig/conf -s arch/sparc/Kconfig
drivers/macintosh/Kconfig:116:warning: 'select' used by config symbol 'PMAC_APM_EMU' refers to undefined symbol 'APM_EMULATION'
drivers/input/keyboard/Kconfig:170:warning: 'select' used by config symbol 'KEYBOARD_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
drivers/input/mouse/Kconfig:182:warning: 'select' used by config symbol 'MOUSE_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
sound/soc/sh/Kconfig:6:warning: 'select' used by config symbol 'SND_SOC_PCM_SH7760' refers to undefined symbol 'SH_DMABRG'
  CHK     include/linux/version.h
  UPD     include/linux/version.h
  CHK     include/linux/utsrelease.h
  UPD     include/linux/utsrelease.h
  SYMLINK include/asm -> include/asm-sparc
<stdin>:752:2: warning: #warning syscall setresuid not implemented
<stdin>:756:2: warning: #warning syscall getresuid not implemented
<stdin>:776:2: warning: #warning syscall setresgid not implemented
<stdin>:780:2: warning: #warning syscall getresgid not implemented
  CHK     include/linux/compile.h
  UPD     include/linux/compile.h
ipc/msg.c: In function 'sys_msgctl':
ipc/msg.c:390: warning: 'setbuf.qbytes' may be used uninitialized in this function
ipc/msg.c:390: warning: 'setbuf.uid' may be used uninitialized in this function
ipc/msg.c:390: warning: 'setbuf.gid' may be used uninitialized in this function
ipc/msg.c:390: warning: 'setbuf.mode' may be used uninitialized in this function
ipc/sem.c: In function 'sys_semctl':
ipc/sem.c:861: warning: 'setbuf.uid' may be used uninitialized in this function
ipc/sem.c:861: warning: 'setbuf.gid' may be used uninitialized in this function
ipc/sem.c:861: warning: 'setbuf.mode' may be used uninitialized in this function
mm/vmalloc.c: In function 'unmap_kernel_range':
mm/vmalloc.c:75: warning: unused variable 'start'
drivers/char/rtc.c:118: warning: 'hpet_rtc_interrupt' defined but not used
kernel/time/ntp.c: In function 'do_adjtimex':
kernel/time/ntp.c:309: warning: comparison of distinct pointer types lacks a cast
kernel/time/ntp.c:312: warning: comparison of distinct pointer types lacks a cast
drivers/pci/search.c: In function 'pci_find_slot':
drivers/pci/search.c:99: warning: 'pci_find_device' is deprecated (declared at include/linux/pci.h:478)
drivers/pci/search.c: At top level:
drivers/pci/search.c:434: warning: 'pci_find_device' is deprecated (declared at drivers/pci/search.c:241)
drivers/pci/search.c:434: warning: 'pci_find_device' is deprecated (declared at drivers/pci/search.c:241)
drivers/pci/search.c:435: warning: 'pci_find_slot' is deprecated (declared at drivers/pci/search.c:96)
drivers/pci/search.c:435: warning: 'pci_find_slot' is deprecated (declared at drivers/pci/search.c:96)
drivers/pci/syscall.c: In function 'sys_pciconfig_read':
drivers/pci/syscall.c:22: warning: 'dev' may be used uninitialized in this function
fs/partitions/check.c: In function 'add_partition':
fs/partitions/check.c:392: warning: ignoring return value of 'kobject_add', declared with attribute warn_unused_result
fs/partitions/check.c:395: warning: ignoring return value of 'sysfs_create_link', declared with attribute warn_unused_result
fs/partitions/check.c:402: warning: ignoring return value of 'sysfs_create_file', declared with attribute warn_unused_result
  CHK     include/linux/compile.h
  UPD     include/linux/compile.h
WARNING: arch/sparc/kernel/head.o(.text+0x9040): Section mismatch: reference to .init.text:no_sun4u_here (between 'current_pc' and 'already_mapped')
WARNING: arch/sparc/kernel/head.o(.text+0x9280): Section mismatch: reference to .init.text:execute_in_high_mem (after 'go_to_highmem')
WARNING: arch/sparc/kernel/head.o(.text+0x9284): Section mismatch: reference to .init.text:execute_in_high_mem (after 'go_to_highmem')
  Building modules, stage 2.
WARNING: vmlinux(.text+0x9040): Section mismatch: reference to .init.text:no_sun4u_here (between 'current_pc' and 'already_mapped')
WARNING: vmlinux(.text+0x9280): Section mismatch: reference to .init.text:execute_in_high_mem (between 'go_to_highmem' and 'init_thread_union')
WARNING: vmlinux(.text+0x9284): Section mismatch: reference to .init.text:execute_in_high_mem (between 'go_to_highmem' and 'init_thread_union')
WARNING: vmlinux(.text+0x1dfb
Re: 2.6.22-rc4-mm2
On Wed, 6 Jun 2007 23:55:44 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> The fully-applied tree fails with a link error having to do with
>> movable_zone. I'm not entirely sure what arches are supposed to do
>> about that.

On Thu, Jun 07, 2007 at 12:01:25AM -0700, Andrew Morton wrote:
> config, please?

It's the sparc32 defconfig. Included below for completeness.

-- wli

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.22-rc4-mm2
# Thu Jun 7 00:01:24 2007
#
CONFIG_MMU=y
CONFIG_HIGHMEM=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=14
CONFIG_SYSFS_DEPRECATED=y
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_BLOCK=y
# CONFIG_LBD is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# General machine setup
#
# CONFIG_SMP is not set
CONFIG_SPARC=y
CONFIG_SPARC32=y
CONFIG_SBUS=y
CONFIG_SBUSCHAR=y
CONFIG_SERIAL_CONSOLE=y
CONFIG_SUN_AUXIO=y
CONFIG_SUN_IO=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_EMULATED_CMPXCHG=y
CONFIG_SUN_PM=y
# CONFIG_SUN4 is not set
CONFIG_PCI=y
# CONFIG_ARCH_SUPPORTS_MSI is not set
# CONFIG_PCI_DEBUG is not set
CONFIG_SUN_OPENPROMFS=m
# CONFIG_SPARC_LED is not set
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_MISC=m
CONFIG_SUNOS_EMUL=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_RESOURCES_64BIT is not set
CONFIG_ZONE_DMA_FLAG=1

#
# Networking
#
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_MMAP is not set
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=m
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
CONFIG_NET_KEY=m
# CONFIG_NET_KEY_MIGRATE is not set
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
# CONFIG_IP_PNP_BOOTP is not set
# CONFIG_IP_PNP_RARP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_ARPD is not set
# CONFIG_SYN_COOKIES is not set
CONFIG_INET_AH=y
CONFIG_INET_ESP=y
CONFIG_INET_IPCOMP=y
CONFIG_INET_XFRM_TUNNEL=y
CONFIG_INET_TUNNEL=y
CONFIG_INET_XFRM_MODE_TRANSPORT=y
CONFIG_INET_XFRM_MODE_TUNNEL=y
CONFIG_INET_XFRM_MODE_BEET=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
# CONFIG_IPV6_MIP6 is not set
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=m
CONFIG_IPV6_TUNNEL=m
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_NETWORK_SECMARK is not set
# CONFIG_NETFILTER is not set
# CONFIG_IP_DCCP is not set
CONFIG_IP_SCTP=m
# CONFIG_SCTP_DBG_MSG is not set
CONFIG_SCTP_DBG_OBJCNT=y
# CONFIG_SCTP_HMAC_NONE is not set
# CONFIG_SCTP_HMAC_SHA1 is not set
Re: 2.6.22-rc4-mm2
On Wed, 6 Jun 2007 23:42:31 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> create-the-zone_movable-zone.patch breaks the build on sparc32.

On Wed, Jun 06, 2007 at 11:51:31PM -0700, Andrew Morton wrote:
> Nope, there are no instances of GFP_HIGH_MOVABLE in the tree once all
> patches are applied. You hit a bad bisection point: between
> create-the-zone_movable-zone.patch and
> create-the-zone_movable-zone-fix.patch.

The fully-applied tree fails with a link error having to do with
movable_zone. I'm not entirely sure what arches are supposed to do
about that.

-- wli
Re: 2.6.22-rc4-mm2
On Wed, Jun 06, 2007 at 10:03:13PM -0700, Andrew Morton wrote:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc4/2.6.22-rc4-mm2/
> - Basically a bugfixed version of 2.6.22-rc4-mm1. None of the subsystem
>   trees were repulled, several bad patches were dropped, a few were fixed.

create-the-zone_movable-zone.patch breaks the build on sparc32.

-- wli

$ good=0; bad=`quilt series -v | wc -l`; time while [[ $(( $bad - $good )) -gt 1 ]]; do cur=`quilt series -v |egrep -c '(=|\+)'`; chkpt=$(( ($good + $bad)/2 )); delta=$(( $chkpt - $cur )); if [[ $delta -lt 0 ]]; then (quilt pop $(( 0 - $delta )) ) >& /dev/null; elif [[ $delta -gt 0 ]]; then (quilt push $delta) >& /dev/null; else true; fi; cur=$chkpt; (yes "" | make ARCH=sparc CROSS_COMPILE="sparc-linux-" CC="gcc-sparc-4.1" quiet=1 -j16 defconfig) >& /dev/null; echo "last known good = $good, first known bad = $bad, trying $chkpt"; yes "" | make ARCH=sparc CROSS_COMPILE="sparc-linux-" CC="gcc-sparc-4.1" quiet=1 -j16 image modules; s=$?; if [[ $s -ne 0 ]]; then echo "$chkpt bad"; bad=$chkpt; else echo "$chkpt good"; good=$chkpt; fi; done
...
last known good = 641, first known bad = 645, trying 643
scripts/kconfig/conf -s arch/sparc/Kconfig
drivers/macintosh/Kconfig:116:warning: 'select' used by config symbol 'PMAC_APM_EMU' refers to undefined symbol 'APM_EMULATION'
drivers/input/keyboard/Kconfig:170:warning: 'select' used by config symbol 'KEYBOARD_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
drivers/input/mouse/Kconfig:182:warning: 'select' used by config symbol 'MOUSE_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
sound/soc/sh/Kconfig:6:warning: 'select' used by config symbol 'SND_SOC_PCM_SH7760' refers to undefined symbol 'SH_DMABRG'
  CHK     include/linux/version.h
  CHK     include/linux/utsrelease.h
<stdin>:752:2: warning: #warning syscall setresuid not implemented
<stdin>:756:2: warning: #warning syscall getresuid not implemented
<stdin>:776:2: warning: #warning syscall setresgid not implemented
<stdin>:780:2: warning: #warning syscall getresgid not implemented
  CHK     include/linux/compile.h
mm/page_alloc.c: In function 'nr_free_pagecache_pages':
mm/page_alloc.c:1706: error: 'GFP_HIGH_MOVABLE' undeclared (first use in this function)
mm/page_alloc.c:1706: error: (Each undeclared identifier is reported only once
mm/page_alloc.c:1706: error: for each function it appears in.)
make[1]: *** [mm/page_alloc.o] Error 1
make[1]: *** Waiting for unfinished jobs
make: *** [mm] Error 2
make: *** Waiting for unfinished jobs
drivers/char/rtc.c:118: warning: 'hpet_rtc_interrupt' defined but not used
make: *** wait: No child processes. Stop.
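The shell loop in this message is a binary search over the quilt series: it keeps a count of patches known to build and a count known to fail, and halves the gap until they are adjacent. The same idea as a small C sketch, with the build step abstracted behind a callback. The names first_bad_patch, builds_fn and builds_before are made up for illustration, and the toy predicate stands in for the real "quilt push/pop, then make" step.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical predicate: does the tree build with the first n patches applied? */
typedef bool (*builds_fn)(int n, void *ctx);

/*
 * Returns the smallest patch count that fails to build.
 * Invariant: `good` patches build, `bad` patches do not.
 */
static int first_bad_patch(int good, int bad, builds_fn builds, void *ctx)
{
	assert(good < bad);
	while (bad - good > 1) {
		int chkpt = good + (bad - good) / 2;	/* midpoint, as in the script */
		if (builds(chkpt, ctx))
			good = chkpt;
		else
			bad = chkpt;
	}
	return bad;
}

/* Toy predicate for illustration: every count below *ctx builds. */
static bool builds_before(int n, void *ctx)
{
	return n < *(const int *)ctx;
}
```

Like the script, this needs one build per halving step, and it assumes a single transition from building to broken. A series that breaks and is later fixed, as with a patch and its follow-up -fix patch, can make the search land between the two.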
Re: 2.6.22-rc4-mm1
On Wed, Jun 06, 2007 at 06:09:24PM -0700, Andrew Morton wrote:
> ooh, yes, lockdep_init() really does want to be called before anything
> else.
> So do we take it that this code hasn't been tested with lockdep? Please
> don't forget that step - lockdep finds some pretty nasty bugs sometimes.
> This?

I found this patch when I woke and it got things booting with the full
-mm stack. Now to fix the sparc32 build and see if it boots.

-- wli
Re: 2.6.22-rc4-mm2
On Wed, 6 Jun 2007 23:55:44 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> The fully-applied tree fails with a link error having to do with
>> movable_zone. I'm not entirely sure what arches are supposed to do
>> about that.

On Thu, Jun 07, 2007 at 12:01:25AM -0700, Andrew Morton wrote:
> config, please?

It's the sparc32 defconfig. Included below for completeness.

-- wli

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.22-rc4-mm2
# Thu Jun 7 00:01:24 2007
#
CONFIG_MMU=y
CONFIG_HIGHMEM=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=14
CONFIG_SYSFS_DEPRECATED=y
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_BLOCK=y
# CONFIG_LBD is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# General machine setup
#
# CONFIG_SMP is not set
CONFIG_SPARC=y
CONFIG_SPARC32=y
CONFIG_SBUS=y
CONFIG_SBUSCHAR=y
CONFIG_SERIAL_CONSOLE=y
CONFIG_SUN_AUXIO=y
CONFIG_SUN_IO=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_EMULATED_CMPXCHG=y
CONFIG_SUN_PM=y
# CONFIG_SUN4 is not set
CONFIG_PCI=y
# CONFIG_ARCH_SUPPORTS_MSI is not set
# CONFIG_PCI_DEBUG is not set
CONFIG_SUN_OPENPROMFS=m
# CONFIG_SPARC_LED is not set
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_MISC=m
CONFIG_SUNOS_EMUL=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_RESOURCES_64BIT is not set
CONFIG_ZONE_DMA_FLAG=1

#
# Networking
#
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_MMAP is not set
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=m
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
CONFIG_NET_KEY=m
# CONFIG_NET_KEY_MIGRATE is not set
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
# CONFIG_IP_PNP_BOOTP is not set
# CONFIG_IP_PNP_RARP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_ARPD is not set
# CONFIG_SYN_COOKIES is not set
CONFIG_INET_AH=y
CONFIG_INET_ESP=y
CONFIG_INET_IPCOMP=y
CONFIG_INET_XFRM_TUNNEL=y
CONFIG_INET_TUNNEL=y
CONFIG_INET_XFRM_MODE_TRANSPORT=y
CONFIG_INET_XFRM_MODE_TUNNEL=y
CONFIG_INET_XFRM_MODE_BEET=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
# CONFIG_IPV6_MIP6 is not set
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=m
CONFIG_IPV6_TUNNEL=m
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_NETWORK_SECMARK is not set
# CONFIG_NETFILTER is not set
# CONFIG_IP_DCCP is not set
CONFIG_IP_SCTP=m
# CONFIG_SCTP_DBG_MSG is not set
CONFIG_SCTP_DBG_OBJCNT=y
# CONFIG_SCTP_HMAC_NONE is not set
# CONFIG_SCTP_HMAC_SHA1 is not set
Re: 2.6.22-rc4-mm2
On Thu, Jun 07, 2007 at 12:01:25AM -0700, Andrew Morton wrote:
>> config, please?

On Thu, Jun 07, 2007 at 12:04:07AM -0700, William Lee Irwin III wrote:
> It's the sparc32 defconfig. Included below for completeness.

The error output looks like the following.

-- wli

$ quilt top
create-the-zone_movable-zone-fix.patch
$ (yes | make ARCH=sparc CROSS_COMPILE=sparc-linux- CC=gcc-sparc-4.1 quiet=1 -j16 defconfig) > /dev/null; yes | make ARCH=sparc CROSS_COMPILE=sparc-linux- CC=gcc-sparc-4.1 quiet=1 -j16 image modules
scripts/kconfig/conf -s arch/sparc/Kconfig
drivers/macintosh/Kconfig:116:warning: 'select' used by config symbol 'PMAC_APM_EMU' refers to undefined symbol 'APM_EMULATION'
drivers/input/keyboard/Kconfig:170:warning: 'select' used by config symbol 'KEYBOARD_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
drivers/input/mouse/Kconfig:182:warning: 'select' used by config symbol 'MOUSE_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
sound/soc/sh/Kconfig:6:warning: 'select' used by config symbol 'SND_SOC_PCM_SH7760' refers to undefined symbol 'SH_DMABRG'
  CHK     include/linux/version.h
  UPD     include/linux/version.h
  CHK     include/linux/utsrelease.h
  UPD     include/linux/utsrelease.h
  SYMLINK include/asm -> include/asm-sparc
<stdin>:752:2: warning: #warning syscall setresuid not implemented
<stdin>:756:2: warning: #warning syscall getresuid not implemented
<stdin>:776:2: warning: #warning syscall setresgid not implemented
<stdin>:780:2: warning: #warning syscall getresgid not implemented
  CHK     include/linux/compile.h
  UPD     include/linux/compile.h
ipc/msg.c: In function 'sys_msgctl':
ipc/msg.c:390: warning: 'setbuf.qbytes' may be used uninitialized in this function
ipc/msg.c:390: warning: 'setbuf.uid' may be used uninitialized in this function
ipc/msg.c:390: warning: 'setbuf.gid' may be used uninitialized in this function
ipc/msg.c:390: warning: 'setbuf.mode' may be used uninitialized in this function
ipc/sem.c: In function 'sys_semctl':
ipc/sem.c:861: warning: 'setbuf.uid' may be used uninitialized in this function
ipc/sem.c:861: warning: 'setbuf.gid' may be used uninitialized in this function
ipc/sem.c:861: warning: 'setbuf.mode' may be used uninitialized in this function
mm/vmalloc.c: In function 'unmap_kernel_range':
mm/vmalloc.c:75: warning: unused variable 'start'
drivers/char/rtc.c:118: warning: 'hpet_rtc_interrupt' defined but not used
kernel/time/ntp.c: In function 'do_adjtimex':
kernel/time/ntp.c:309: warning: comparison of distinct pointer types lacks a cast
kernel/time/ntp.c:312: warning: comparison of distinct pointer types lacks a cast
drivers/pci/search.c: In function 'pci_find_slot':
drivers/pci/search.c:99: warning: 'pci_find_device' is deprecated (declared at include/linux/pci.h:478)
drivers/pci/search.c: At top level:
drivers/pci/search.c:434: warning: 'pci_find_device' is deprecated (declared at drivers/pci/search.c:241)
drivers/pci/search.c:434: warning: 'pci_find_device' is deprecated (declared at drivers/pci/search.c:241)
drivers/pci/search.c:435: warning: 'pci_find_slot' is deprecated (declared at drivers/pci/search.c:96)
drivers/pci/search.c:435: warning: 'pci_find_slot' is deprecated (declared at drivers/pci/search.c:96)
drivers/pci/syscall.c: In function 'sys_pciconfig_read':
drivers/pci/syscall.c:22: warning: 'dev' may be used uninitialized in this function
fs/partitions/check.c: In function 'add_partition':
fs/partitions/check.c:392: warning: ignoring return value of 'kobject_add', declared with attribute warn_unused_result
fs/partitions/check.c:395: warning: ignoring return value of 'sysfs_create_link', declared with attribute warn_unused_result
fs/partitions/check.c:402: warning: ignoring return value of 'sysfs_create_file', declared with attribute warn_unused_result
  CHK     include/linux/compile.h
  UPD     include/linux/compile.h
WARNING: arch/sparc/kernel/head.o(.text+0x9040): Section mismatch: reference to .init.text:no_sun4u_here (between 'current_pc' and 'already_mapped')
WARNING: arch/sparc/kernel/head.o(.text+0x9280): Section mismatch: reference to .init.text:execute_in_high_mem (after 'go_to_highmem')
WARNING: arch/sparc/kernel/head.o(.text+0x9284): Section mismatch: reference to .init.text:execute_in_high_mem (after 'go_to_highmem')
  Building modules, stage 2.
WARNING: vmlinux(.text+0x9040): Section mismatch: reference to .init.text:no_sun4u_here (between 'current_pc' and 'already_mapped')
WARNING: vmlinux(.text+0x9280): Section mismatch: reference to .init.text:execute_in_high_mem (between 'go_to_highmem' and 'init_thread_union')
WARNING: vmlinux(.text+0x9284): Section mismatch: reference to .init.text:execute_in_high_mem (between 'go_to_highmem' and 'init_thread_union')
WARNING: vmlinux(.text+0x1dfb38): Section mismatch: reference to .init.text:kernel_init (between 'rest_init
Re: 2.6.22-rc4-mm2
On Thu, Jun 07, 2007 at 12:19:22AM -0700, Andrew Morton wrote:
> hm, OK, this seems to work:
[...]
> -#ifdef CONFIG_HIGHMEM
> +#if defined(CONFIG_HIGHMEM) && defined(CONFIG_ARCH_POPULATES_NODE_MAP)
>  	return movable_zone == ZONE_HIGHMEM;
>  #else
>  	return 0;
> _
> (the first ifdef is just there to trip things at compile time rather than
> link time)

I guess it's not the arch's fault after all. I probably would've
conditionally out-of-lined the thing so as never to expose movable_zone
but this will do just fine.

-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
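The effect of the two-condition guard can be tried outside the kernel. What follows is a minimal userspace sketch under stated assumptions, not the actual header: the enum values, the `#define CONFIG_HIGHMEM` line, and the helper name `zone_movable_is_highmem()` are stand-ins chosen for illustration (the quoted hunk elides the name of the function it patches). It mimics a configuration like the sparc32 defconfig, where CONFIG_HIGHMEM is set, and assumes CONFIG_ARCH_POPULATES_NODE_MAP is not.

```c
/* Userspace sketch only; these names are stand-ins, not kernel headers. */
enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

/* Assumed configuration: highmem on, arch does not populate the node map. */
#define CONFIG_HIGHMEM 1
/* CONFIG_ARCH_POPULATES_NODE_MAP is deliberately left undefined. */

#if defined(CONFIG_HIGHMEM) && defined(CONFIG_ARCH_POPULATES_NODE_MAP)
/* Declared only when some other object file actually defines it. */
extern int movable_zone;
#endif

/* Helper name is hypothetical; the quoted hunk does not show it. */
static inline int zone_movable_is_highmem(void)
{
#if defined(CONFIG_HIGHMEM) && defined(CONFIG_ARCH_POPULATES_NODE_MAP)
	return movable_zone == ZONE_HIGHMEM;
#else
	return 0;
#endif
}
```

With both conditions required, this configuration never references `movable_zone`, so nothing is left for the linker to resolve. Guarding on CONFIG_HIGHMEM alone would presumably leave that reference in place, which would be consistent with the link error reported earlier in the thread.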
Re: why does the macro ZERO_PAGE take an argument?
Robert P. J. Day wrote:
>> although it's not clear where in the source tree are the invocations
>> that would actually make a difference to a MIPS system, which is why
>> i've CC'ed ralf on this. i'm sure he can clear this up. :-)

On Thu, Jun 07, 2007 at 10:32:29AM -0700, H. Peter Anvin wrote:
> x86 could also benefit from coloured zeropages. In fact, I thought it
> already had them (K8 wants as many as 8.)

How would one demonstrate the beneficial effect of such?

-- wli
Re: 2.6.22-rc4-mm1
On Wed, 6 Jun 2007 09:30:53 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> Something brings down i386/qemu before even earlyprintk can handle.
>> Bisection has narrowed it down to patch 1140 after everything got
>> renumbered by peterz' fix for mm-variable-length-argument-support.patch,
>> namely containersv10-make-cpusets-a-client-of-containers.patch

On Wed, Jun 06, 2007 at 11:13:15AM -0700, Andrew Morton wrote:
> erk. A step-by-step how-to-make-this-happen might help if poss, please.

(1) build for i386 with my .config
(2) attempt to boot in qemu's i386 system simulator

I'm not seeing the sort of nondeterminism Andy Whitcroft is. It breaks
every time when I try this.

-- wli
Re: 2.6.22-rc4-mm1
On Wed, Jun 06, 2007 at 05:26:49PM +0100, Mel Gorman wrote:
> I do not believe this is Nick's problem. I encountered the same issue and
> the bisect ended up here;
> # BISECT HERE
> mm-variable-length-argument-support.patch
> mm-variable-length-argument-support-fix.patch
> # BISECT BAD
> Reverting those two patches boots ok on my standalone x86 laptop.
> Patch authors cc'd. I have not read the patches yet to see what might
> be the problem.

I found this a while ago and peterz already has a tentative fix for it at
http://programming.kicks-ass.net/kernel-patches/max_arg_pages/move_anon_vma.patch
I'm sure he himself will chime in with more/better code when he returns.

-- wli
Re: 2.6.22-rc4-mm1
On Wed, Jun 06, 2007 at 02:07:37AM -0700, Andrew Morton wrote:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc4/2.6.22-rc4-mm1/
> - Somebody broke it on my powerpc G5, but I didn't have time to do yet
> another bisection yet.
> - There's a lengthy patch series here from Nick which attempts to address
> the longstanding pagefault-vs-buffered-write deadlock.
> A great shower of filesystems were broken and have been disabled with
> CONFIG_BROKEN. This includes reiser4.
> - Complex patches which eliminate the kernel's fixed size limit on the
> command-line length. These break nommu builds.

Someone remind me what the pagefault vs. buffered write deadlock is.

Something brings down i386/qemu before even earlyprintk can handle.
Bisection has narrowed it down to patch 1140 after everything got
renumbered by peterz' fix for mm-variable-length-argument-support.patch,
namely containersv10-make-cpusets-a-client-of-containers.patch

-- wli

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.22-rc4-mm1
# Wed Jun 6 09:08:11 2007
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SWAP_PREFETCH=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=15
CONFIG_CONTAINERS=y
CONFIG_CPUSETS=y
CONFIG_SYSFS_DEPRECATED=y
CONFIG_CONTAINER_CPUACCT=y
CONFIG_PROC_PID_CPUSET=y
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_EMBEDDED=y
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_PROC_SMAPS=y
CONFIG_PROC_CLEAR_REFS=y
CONFIG_PROC_PAGEMAP=y
CONFIG_PROC_KPAGEMAP=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_LBD=y
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_SMP=y
# CONFIG_X86_PC is not set
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
CONFIG_X86_GENERICARCH=y
# CONFIG_X86_ES7000 is not set
# CONFIG_PARAVIRT is not set
CONFIG_X86_CYCLONE_TIMER=y
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
CONFIG_M686=y
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MCORE2 is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
CONFIG_X86_GENERIC=y
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=7
CONFIG_X86_XADD=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_PPRO_FENCE=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_CMPXCHG64=y
Re: libata & no PCI: dma_[un]map_single undefined
From: Alan Cox <[EMAIL PROTECTED]>
Date: Mon, 4 Jun 2007 14:30:05 +0100
>> There are PCMCIA controllers and PCI/PCMCIA/Cardbus adapters for the
>> Sparc platform I thought ?

On Mon, Jun 04, 2007 at 02:22:43PM -0700, David Miller wrote:
> The 32-bit sparc port has some but those PCMCIA controllers aren't
> going to be supported in the foreseeable future, you have to abstract
> out all the inb/outb etc. operations to go through the pcmcia
> controller driver for one thing.
> Secondarily, sparc32 lacks an active maintainer and it's
> been like this for several years, the only things getting
> worked on therefore are basic functionality and the most
> important bug fixes.

I don't foresee my ever dealing with those PCMCIA controllers. If by
some miracle I manage to get any work done on basic functionality I'll
consider that having won.

-- wli
Re: SLUB: Return ZERO_SIZE_PTR for kmalloc(0)
On Mon, Jun 04, 2007 at 10:50:41AM -0700, Linus Torvalds wrote:
> The exception is if you use the memory allocator as an "ID allocator", but
> quite frankly, if you use a size of zero, it's your own damn problem.
> Insane code is not an argument for insane behaviour.
> If people can't be bothered to create a "random ID generator" themselves,
> they had damn well better use "kmalloc(1)" rather than "kmalloc(0)" to get
> a unique cookie. Asking the allocator to do something idiotic because some
> idiot thinks a memory allocator is a cookie allocator is just crazy.

It's not such a great idea in general. Maybe it's a dumb device to cut
down on lines of code for merging or some such.

On Mon, Jun 04, 2007 at 10:50:41AM -0700, Linus Torvalds wrote:
> I can understand that things like user-level libraries have to take crazy
> people into account, but the kernel internal libraries definitely do not.
> (Right now we warn once for zero-sized allocations anyway, and all the
> cases we've found so far are either bugs that would have been found with
> ZERO_ALLOC_PTR or would have been perfectly fine with it, so I don't think
> anybody really _is_ that insane in the kernel)

There are always drivers for that, but I doubt any were sufficiently
creative to pick up on this. At least I've not seen any.

-- wli
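The behaviour under discussion can be sketched in userspace. This is an illustration under stated assumptions, not the SLUB code: `sketch_alloc()`, `sketch_free()`, `sketch_is_usable()` and `SKETCH_ZERO_SIZE_PTR` are hypothetical names, `malloc()` stands in for the real allocator, and the sentinel value is picked only for the example.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical sentinel: a small non-NULL address that is never a valid
 * allocation. The value is an assumption chosen for illustration.
 */
#define SKETCH_ZERO_SIZE_PTR ((void *)16)

/* Zero-sized requests get the sentinel instead of real memory. */
static void *sketch_alloc(size_t size)
{
	if (size == 0)
		return SKETCH_ZERO_SIZE_PTR;
	return malloc(size);
}

/* Freeing NULL or the sentinel is a no-op, so callers need no special case. */
static void sketch_free(void *p)
{
	if (p == NULL || p == SKETCH_ZERO_SIZE_PTR)
		return;
	free(p);
}

/* Nonzero if p is backed by real storage and may be dereferenced. */
static int sketch_is_usable(const void *p)
{
	return p != NULL && p != SKETCH_ZERO_SIZE_PTR;
}
```

In this sketch every zero-sized request returns the same pointer, so the result cannot double as a unique cookie. A caller that wants one would have to ask for at least one byte, which is the `kmalloc(1)` suggestion quoted above.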
Re: SLUB: Return ZERO_SIZE_PTR for kmalloc(0)
On Fri, 1 Jun 2007 21:45:15 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote:
>> That would have to occur with objects that are repeatedly allocated and
>> then linked together etc. Linking typically requires a listhead so it's
>> typically difficult to do zero length objects.

On Fri, Jun 01, 2007 at 09:54:27PM -0700, Andrew Morton wrote:
> Well I can't immediately think of a scenario in which it's likely to occur,
> but we're in the position of trying to prove a negative.
> Poke Bill Irwin - he'll think of something ;)

I've yet to see anyone get quite that creative, but I've not gone
fishing for instances of this. I can think of plenty of places where
one could do something like this in practice, but don't care to give
anyone any ideas.

-- wli
Re: [patch 9/9] Scheduler profiling - Use conditional calls
On Wed, May 30, 2007 at 10:00:34AM -0400, Mathieu Desnoyers wrote:
>>> +	if (prof_on)
>>> +		BUG_ON(cond_call_arm("profile_on"));

* William Lee Irwin III ([EMAIL PROTECTED]) wrote:
>> What's the point of this BUG_ON()? The condition is a priori impossible.

On Thu, May 31, 2007 at 05:12:58PM -0400, Mathieu Desnoyers wrote:
> Not impossible: hash_add_cond_call() can return -ENOMEM if kmalloc lacks
> memory.

Shouldn't it just propagate the errors like anything else instead of
going BUG(), then? One can easily live without profiling if the profile
buffers should fail to be allocated e.g. due to memory fragmentation.
These things all have to handle errors for hotplugging anyway AIUI.

-- wli
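The alternative suggested here, returning the error rather than halting, might look roughly like the following userspace sketch. All names are hypothetical: `sketch_cond_call_arm()` merely stands in for the arming call in the quoted patch, and its `simulate_oom` flag exists only so the failure path can be exercised.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for the arming call in the quoted patch. The
 * simulate_oom flag is an illustration device, not part of any real API.
 */
static int sketch_cond_call_arm(const char *name, int simulate_oom)
{
	(void)name;
	return simulate_oom ? -ENOMEM : 0;
}

/*
 * Returns 0 on success or a negative errno. The caller can log the
 * failure and carry on without profiling instead of halting.
 */
static int sketch_profile_init(int prof_on, int simulate_oom)
{
	int err;

	if (!prof_on)
		return 0;

	err = sketch_cond_call_arm("profile_on", simulate_oom);
	if (err)
		return err;

	return 0;
}
```

The design point is only that an allocation failure becomes a return value the caller can act on; whether the real call sites can tolerate running without profiling is for the patch author to confirm.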
Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS
On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
>> Its ->wait_runtime will drop less significantly, which lets it be
>> inserted in rb-tree much to the left of those 1000 tasks (and which
>> indirectly lets it gain back its fair share during subsequent
>> schedule cycles).
>> Hmm ..is that the theory?

On Thu, May 31, 2007 at 02:26:00PM +0530, Srivatsa Vaddagiri wrote:
> My only concern is the time needed to converge to this fair
> distribution, especially in face of fluctuating workloads. For ex: a
> container who does a fork bomb can have a very adverse impact on
> other container's fair share under this scheme compared to other
> schemes which dedicate separate rb-trees for different containers
> (and which also support two level hierarchical scheduling inside the
> core scheduler).
> I am inclined to have the core scheduler support at least two levels
> of hierarchy (to better isolate each container) and resort to the
> flattening trick for higher levels.

Yes, the larger number of schedulable entities and hence slower
convergence to groupwise weightings is a disadvantage of the flattening.
A hybrid scheme seems reasonable enough. Ideally one would chop the
hierarchy in pieces so that n levels of hierarchy become k levels of
n/k weight-flattened hierarchies for this sort of attack to be most
effective (at least assuming similar branching factors at all levels
of hierarchy and sufficient depth to the hierarchy to make it
meaningful) but this is awkward to do. Peeling off the outermost
container or whichever level is deemed most important in terms of
accuracy of aggregate enforcement as a hierarchical scheduler is a
practical compromise.

Hybrid schemes will still incur the difficulties of hierarchical
scheduling, but they're by no means insurmountable. Sadly, only
complete flattening yields the simplifications that make task group
weighting enforcement orthogonal to load balancing and the like.

The scheme I described for global nice number behavior is also not
readily adaptable to hybrid schemes.

-- wli
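One reading of "n levels of hierarchy become k levels of n/k
weight-flattened hierarchies" can be sketched in user-space C. This is
an illustration only, not scheduler code: collapse_levels() is a
hypothetical helper, and share[i] is assumed to be an entity's weight
divided by the total weight of its siblings at level i along one path
from the root to a task.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Collapse n per-level shares into k shares by multiplying each run of
 * n / k consecutive levels together. k == 1 is complete flattening;
 * k == n leaves the hierarchy untouched.
 * Hypothetical helper; requires k > 0 and n divisible by k.
 */
static void collapse_levels(const double *share, size_t n, size_t k,
                            double *out)
{
    size_t per, i, j;

    assert(k > 0 && n % k == 0);
    per = n / k;                    /* original levels per collapsed level */
    for (i = 0; i < k; i++) {
        out[i] = 1.0;
        for (j = 0; j < per; j++)
            out[i] *= share[i * per + j];
    }
}
```

With k levels left in the hierarchy, a hierarchical scheduler would
still pick among entities at each of those k levels, while inside each
level the n/k collapsed shares act as plain task weights.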
Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS
On Wed, May 30, 2007 at 11:36:47PM -0700, William Lee Irwin III wrote:
>> Temporarily, yes. All this only works when averaged out.

On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
> So essentially when we calculate delta_mine component for each of those
> 1000 tasks, we will find that it has executed for 1 tick (4 ms say) but
> its fair share was very very low.
>	fair_share = delta_exec * p->load_weight / total_weight
> If p->load_weight has been calculated after factoring in hierarchy (as
> you outlined in a previous mail), then p->load_weight of those 1000 tasks
> will be far less compared to the p->load_weight of one task belonging to
> other user, correct? Just to make sure I get all this correct:

You've got it all correct.

On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
> User U1 has tasks T0 - T999
> User U2 has task T1000
> assuming each task's weight is 1 and each user's weight is 1 then:
>	WT0 = (WU1 / (WU1 + WU2)) * (WT0 / (WT0 + WT1 + ... + WT999))
>	    = (1 / (1 + 1)) * (1 / 1000)
>	    = 1/2000
>	    = 0.0005
> WT1 ..WT999 will be same as WT0
> whereas, weight of T1000 will be:
>	WT1000 = (WU2 / (WU1 + WU2)) * (WT1000 / WT1000)
>	       = (1 / (1 + 1)) * (1/1)
>	       = 0.5
> ?

Yes, these calculations are correct.

On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
> So when T0 (or T1 ..T999) executes for 1 tick (4ms), their fair share would
> be:
>	T0's fair_share (delta_mine)
>		= 4 ms * 0.0005 / (0.0005 * 1000 + 0.5)
>		= 4 ms * 0.0005 / 1
>		= 0.002 ms (2000 ns)
> This would cause T0's ->wait_runtime to go negative sharply, causing it to be
> inserted back in rb-tree well ahead in future. One change I can foresee
> in CFS is with regard to limit_wait_runtime() ..We will have to change
> its default limit, at least when group fairness thingy is enabled.
> Compared to this when T1000 executes for 1 tick, its fair share would be
> calculated as:
>	T1000's fair_share (delta_mine)
>		= 4 ms * 0.5 / (0.0005 * 1000 + 0.5)
>		= 4 ms * 0.5 / 1
>		= 2 ms (2000000 ns)
> Its ->wait_runtime will drop less significantly, which lets it be
> inserted in rb-tree much to the left of those 1000 tasks (and which indirectly
> lets it gain back its fair share during subsequent schedule cycles).

This analysis is again entirely correct.

On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
> Hmm ..is that the theory?
> Ingo, do you have any comments on this approach?
> /me is tempted to try this all out.

Yes, this is the theory behind using task weights to flatten the task
group hierarchies. My prior post assumed all this and described a method
to make nice numbers behave as expected in the global context atop it.

-- wli
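The arithmetic in this exchange can be checked with a small user-space
sketch. It is not CFS code: flat_weight() and fair_share_ns() are
hypothetical helper names, floating point is used only for readability
(kernel code would use scaled integers), and the inputs are the ones
assumed in the thread, namely that every task and every user has
weight 1.

```c
#include <assert.h>

/*
 * Flattened weight of one task: the owning user's share of the total
 * user weight times the task's share of that user's total task weight.
 * Hypothetical helper, not a kernel function.
 */
static double flat_weight(double user_w, double total_user_w,
                          double task_w, double total_task_w)
{
    assert(total_user_w > 0.0 && total_task_w > 0.0);
    return (user_w / total_user_w) * (task_w / total_task_w);
}

/*
 * fair_share = delta_exec * load_weight / total_weight, as quoted in
 * the thread; delta_exec_ns and the result are in nanoseconds.
 */
static double fair_share_ns(double delta_exec_ns, double load_weight,
                            double total_weight)
{
    assert(total_weight > 0.0);
    return delta_exec_ns * load_weight / total_weight;
}
```

For the example in the thread, flat_weight(1, 2, 1, 1000) gives 0.0005
for each of T0..T999 and flat_weight(1, 2, 1, 1) gives 0.5 for T1000.
The total weight is 1000 * 0.0005 + 0.5 = 1, so a 4 ms tick yields a
fair share of about 2000 ns for T0 and about 2000000 ns for T1000.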
Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS
On Wed, May 30, 2007 at 09:09:26PM -0700, William Lee Irwin III wrote:
>> It's not all that tricky.

On Thu, May 31, 2007 at 11:18:28AM +0530, Srivatsa Vaddagiri wrote:
> Hmm ..the fact that each task runs for a minimum of 1 tick seems to
> complicate the matters to me (when doing group fairness given a single
> level hierarchy). A user with 1000 (or more) tasks can be unduly
> advantaged compared to another user with just 1 (or fewer) task
> because of this?

Temporarily, yes. All this only works when averaged out. The basic idea
is that you want a constant upper bound on the difference between the
CPU time a task receives and the CPU time it was intended to get. This
discretization is one of the larger sources of the "error" in the CPU
time granted.

The constant upper bound usually only applies to the largest difference
for any task. When absolute values of differences are summed across
tasks the aggregate will be O(tasks) because there's something almost
like a constant per-task lower bound a la Heisenberg. It would have to
get more exact the more tasks there are on the system for that to work,
and something of the opposite actually holds.

It might be appropriate for the scheduler to dynamically adjust a
periodic timer's period or to set up one-shot timers at involuntary
preemption times in order to achieve more precise fairness in this sort
of situation. In the case of few preemption points such one-shot code
or low periodicity code would also save on taking interrupts that would
otherwise manifest as overhead.

In short, a user with many tasks can reap a temporary advantage relative
to users with fewer tasks because of this, but over time, longer-running
tasks will receive the CPU time intended to within some constant upper
bound, provided other things aren't broken.

-- wli
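The difference between a constant per-task bound and an O(tasks)
aggregate can be illustrated with a deliberately simple model. It is
not CFS: rr_lag() and struct lag_stats are hypothetical names, all
tasks are assumed to have equal weight, and they are served
round-robin in whole ticks.

```c
#include <assert.h>
#include <stddef.h>

struct lag_stats {
    double max_abs;   /* largest |entitled - received| of any task, in ticks */
    double sum_abs;   /* sum of |entitled - received| over all tasks, in ticks */
};

/*
 * After `ticks` ticks of round-robin over `ntasks` equal-weight tasks,
 * task i has received ticks / ntasks whole rounds, plus one more tick
 * if i < ticks % ntasks. Each task was entitled to ticks / ntasks as a
 * real number, so the difference comes purely from tick granularity.
 */
static struct lag_stats rr_lag(size_t ntasks, size_t ticks)
{
    struct lag_stats s = { 0.0, 0.0 };
    double entitled;
    size_t i;

    assert(ntasks > 0);
    entitled = (double)ticks / (double)ntasks;
    for (i = 0; i < ntasks; i++) {
        size_t got = ticks / ntasks + (i < ticks % ntasks ? 1 : 0);
        double lag = entitled - (double)got;

        if (lag < 0.0)
            lag = -lag;
        if (lag > s.max_abs)
            s.max_abs = lag;
        s.sum_abs += lag;
    }
    return s;
}
```

Halfway through a round (ticks == ntasks / 2), every task is off by
half a tick in this model, so max_abs stays at 0.5 regardless of the
task count while sum_abs is ntasks / 2 and grows linearly with it.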
Re: [patch 9/9] Scheduler profiling - Use conditional calls
On Wed, May 30, 2007 at 10:00:34AM -0400, Mathieu Desnoyers wrote:
>>> +	if (prof_on)
>>> +		BUG_ON(cond_call_arm(profile_on));

* William Lee Irwin III ([EMAIL PROTECTED]) wrote:
>> What's the point of this BUG_ON()? The condition is a priori
>> impossible.

On Thu, May 31, 2007 at 05:12:58PM -0400, Mathieu Desnoyers wrote:
> Not impossible: hash_add_cond_call() can return -ENOMEM if kmalloc lacks
> memory.

Shouldn't it just propagate the errors like anything else instead of
going BUG(), then? One can easily live without profiling if the profile
buffers should fail to be allocated e.g. due to memory fragmentation.
These things all have to handle errors for hotplugging anyway AIUI.

-- wli
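The suggestion to propagate the error rather than BUG() can be
sketched in user-space C. Everything here is a stand-in:
stub_cond_call_arm(), profile_enable_stub(), simulate_enomem and
prof_on_stub are hypothetical names invented for the illustration and
are not the interfaces from the patch under review.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Test knob: when nonzero, the stub below pretends allocation failed. */
static int simulate_enomem;

/* Stand-in for profiling state; 0 means profiling is disabled. */
static int prof_on_stub;

/* Hypothetical stand-in for an arming routine that can fail with -ENOMEM. */
static int stub_cond_call_arm(const char *name)
{
    assert(name != NULL);
    return simulate_enomem ? -ENOMEM : 0;
}

/*
 * Returns 0 on success or a negative errno. On failure profiling is
 * simply left off and the caller decides what to do with the error,
 * instead of the whole system going down.
 */
static int profile_enable_stub(void)
{
    int err = stub_cond_call_arm("profile_on");

    if (err) {
        prof_on_stub = 0;
        return err;
    }
    prof_on_stub = 1;
    return 0;
}
```

A caller can then report the failure and carry on without profiling,
which matches the point that one can live without it when the
allocation fails.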