Re: [PATCH/RFC] A method for clearing out page cache
Hi!

> So what it comes down to is
>
>	sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
>
> where `what_to_free' consists of a bunch of bitflags (unmapped pagecache,
> mapped pagecache, anonymous memory, slab, ...).

Heh, swsusp needs shrink_all_memory() and I'd like to use something more
generic, as shrink_all_memory() does not seem to work properly. I guess
that loop over all node_ids should be easy ;-).

								Pavel
-- 
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] A method for clearing out page cache
Andrew asked:
> So...  Cannot the application remove all its pagecache with
> posix_fadvise() prior to exiting?

Hang on ... The replies of Ray and Martin answer your immediate question.
But we (SGI) are still busy discussing the bigger picture behind the
scenes ...

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [PATCH/RFC] A method for clearing out page cache
Andrew Morton wrote:
> Paul Jackson <[EMAIL PROTECTED]> wrote:
>> As Martin wrote, when he submitted this patch:
>>> The motivation for this patch is for setting up High Performance
>>> Computing jobs, where initial memory placement is very important to
>>> overall performance.
>>
>> Any left over cache is wrong, for this situation.
>
> So...  Cannot the application remove all its pagecache with
> posix_fadvise() prior to exiting?

Even if we modified all applications to do this, it still wouldn't help
for dirty page cache, which would eventually be cleaned, and hang around
long after the application has departed. But the previous statement has a
false hypothesis, namely, that we could change all applications to do
this.

-- 
Best Regards,
Ray
---
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
[EMAIL PROTECTED]             [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better", so I installed Linux.
---
Re: [PATCH/RFC] A method for clearing out page cache
On Tue, Feb 22, 2005 at 10:45:35AM -0800, Andrew Morton wrote:
> Paul Jackson <[EMAIL PROTECTED]> wrote:
>> As Martin wrote, when he submitted this patch:
>>> The motivation for this patch is for setting up High Performance
>>> Computing jobs, where initial memory placement is very important to
>>> overall performance.
>>
>> Any left over cache is wrong, for this situation.
>
> So...  Cannot the application remove all its pagecache with
> posix_fadvise() prior to exiting?

I think Paul's referring to pagecache (as well as other caches) that are
on the node from other uses, not necessarily from another HPC job that has
recently terminated.

mh
-- 
Martin Hicks || Silicon Graphics Inc. || [EMAIL PROTECTED]
Re: [PATCH/RFC] A method for clearing out page cache
Paul Jackson <[EMAIL PROTECTED]> wrote:
> As Martin wrote, when he submitted this patch:
>> The motivation for this patch is for setting up High Performance
>> Computing jobs, where initial memory placement is very important to
>> overall performance.
>
> Any left over cache is wrong, for this situation.

So...  Cannot the application remove all its pagecache with
posix_fadvise() prior to exiting?
Re: [PATCH/RFC] A method for clearing out page cache
Ingo Molnar wrote:
> * Andrew Morton <[EMAIL PROTECTED]> wrote:
>>> . enable users to specify an 'allocation priority' of some sort, which
>>> kicks out the pagecache on the local node - or something like that.
>>
>> Yes, that would be preferable - I don't know what the difficulty is
>> with that.  sys_set_mempolicy() should provide a sufficiently good
>> hint.
>
> yes. I'm not against some flushing mechanism for debugging or test
> purposes (it can be useful to start from a new, clean state - and as
> such the sysctl, for root only and depending on KERNEL_DEBUG, is
> probably better than an explicit syscall), but the idea of giving a
> flushing API to applications is bad, I believe.

We're pretty agnostic about this. I agree that if we were to make this a
system call, then it should be restricted to root. Or make it a sysctl.
Whichever way you guys want to go is fine with us.

> It is the 'easy and incorrect path' to a number of NUMA (and non-NUMA)
> VM problems, and I fear that it will destroy the evolution of VM
> priority/placement/affinity APIs (NUMAlib, etc.).

I have two observations about this:

(1) It is our intent to use the infrastructure provided by this patch as
    the basis for an automatic (i.e. included with the VM) approach that
    selectively removes unused page cache pages before spilling off node.
    We just figured it would be easier to get the infrastructure in place
    first.

(2) If a sufficiently well behaved application knows in advance how much
    free memory it needs per node, then it makes sense to provide a
    mechanism for the application to request this, rather than for the VM
    to try to puzzle this out later. Automatic algorithms in the VM are
    never perfect; they should be reserved for those cases where the
    applications either cooperate in such a way as to make memory demands
    impossible to predict, or the application programmer can't (or can't
    take the time to) predict how much memory the application will use.

> At least making it sufficiently painful to use (via the originally
> proposed root-only sysctl) could still preserve some of the incentive
> to provide a clean solution for applications. 'Time to market'
> constraints should not be considered when adding core mechanisms.
>
>	Ingo

-- 
Best Regards,
Ray
---
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
[EMAIL PROTECTED]             [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better", so I installed Linux.
---
Re: [PATCH/RFC] A method for clearing out page cache
Ingo wrote:
> app designers very frequently think that the VM gets its act wrong
> (most of the time for the wrong reasons),

As Martin wrote, when he submitted this patch:
> The motivation for this patch is for setting up High Performance
> Computing jobs, where initial memory placement is very important to
> overall performance.

Any left over cache is wrong, for this situation. The only right answer
(through no fault of the VM, which can't predict such use) is to clear out
the past cache and ensure that all allocations are satisfied with
node-local memory, with no page-out delays, for all the threads in such
tightly coupled jobs. These jobs have been sized to use every ounce of CPU
and memory from sometimes hundreds of nodes, for hours or days, using
tightly coupled MPI and OpenMP codes. Any misplaced pages (off the local
node) or paging delays quickly lead to erratic and reduced performance.

Flushing all the cache like this hurts any normal load that has any
continuity of working set, and such flushing is not cheap. I have not
observed much interest in doing this outside of appropriate use when
starting up a big HPC app, as described above, or the test and debug
situations that you mention. For certain HPC apps, it can be essential to
repeatable job performance.

Granted, this might not be for most systems. Perhaps a CONFIG option, so
that by default this worked on builds for big honkin' NUMA boxes, but was
an -ENOSYS error on ordinary sized systems? Though I prefer not to create
artificial distinctions between configurations without good reason,
perhaps this is such a reason.

Making the API ugly won't reduce its use much; rather, it will just
increase code maintenance costs a bit and breed a few more bugs. Those who
think they want this will find a way to do it. If something's worth doing,
it's worth doing cleanly.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [PATCH/RFC] A method for clearing out page cache
* Andrew Morton <[EMAIL PROTECTED]> wrote:
>> . enable users to specify an 'allocation priority' of some sort, which
>> kicks out the pagecache on the local node - or something like that.
>
> Yes, that would be preferable - I don't know what the difficulty is
> with that.  sys_set_mempolicy() should provide a sufficiently good
> hint.

yes. I'm not against some flushing mechanism for debugging or test
purposes (it can be useful to start from a new, clean state - and as such
the sysctl, for root only and depending on KERNEL_DEBUG, is probably
better than an explicit syscall), but the idea of giving a flushing API to
applications is bad, I believe.

It is the 'easy and incorrect path' to a number of NUMA (and non-NUMA) VM
problems, and I fear that it will destroy the evolution of VM
priority/placement/affinity APIs (NUMAlib, etc.). At least making it
sufficiently painful to use (via the originally proposed root-only sysctl)
could still preserve some of the incentive to provide a clean solution for
applications. 'Time to market' constraints should not be considered when
adding core mechanisms.

	Ingo
Re: [PATCH/RFC] A method for clearing out page cache
Ingo Molnar <[EMAIL PROTECTED]> wrote:
> app designers very frequently think that the VM gets its act wrong
> (most of the time for the wrong reasons), and the last thing we want is
> to enable them to hack around real problems.

Not really. Memory reclaim tries to predict the future and expects some
sort of "average" workload. For some workloads that prediction is
hopelessly wrong. Although we could surely provide manual hinting
machinery which is less crude than this proposal.

> . enable users to specify an 'allocation priority' of some sort, which
> kicks out the pagecache on the local node - or something like that.

Yes, that would be preferable - I don't know what the difficulty is with
that.  sys_set_mempolicy() should provide a sufficiently good hint.
Re: [PATCH/RFC] A method for clearing out page cache
* Andrew Morton <[EMAIL PROTECTED]> wrote:
>> However, the first step is to do this manually from user space.
>
> Yup.  The thing is, lots of people want this feature for various
> reasons.  Not just numerical-computing-users-on-NUMA.  We should get it
> right for them too.
>
> Especially kernel developers, who have various nasty userspace tools
> which will manually reclaim pagecache.  But non-kernel-developers will
> use it too, when they think the VM is screwing them over ;)

app designers very frequently think that the VM gets its act wrong (most
of the time for the wrong reasons), and the last thing we want is to
enable them to hack around real problems. How are we supposed to debug VM
problems where one player periodically flushes the whole pagecache? What
if that flushing, when disabled, 'results in the app being broken' (_if_
the app gives any option to disable the flushing at all)?

Providing APIs to flush system caches, sysctl or syscall, is the road to
VM madness. If the goal is to override the pagecache (and other kernel
caches) on a given node then, for God's sake, think a bit harder. E.g.
enable users to specify an 'allocation priority' of some sort, which kicks
out the pagecache on the local node - or something like that. Giving a
half-assed tool to clean out one aspect of the system caches will only
muddy the waters, with no real road back to sanity.

	Ingo
Re: [PATCH/RFC] A method for clearing out page cache
Andrew wrote:
> Yes, I ... [clarifies pj's various confusions]

Yup - all sounds good - thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [PATCH/RFC] A method for clearing out page cache
Andrew Morton wrote:
> Ray Bryant <[EMAIL PROTECTED]> wrote:
>> We did it this way because it was easier to get it into SLES9 that
>> way.  But there is no particular reason that we couldn't use a system
>> call.  It's just that we figured adding system calls is hard.
>
> aarggh.  This is why you should target kernel.org kernels first.  Now
> we risk ending up with poor old suse carrying an obsolete interface and
> application developers having to cater for both interfaces.

I agree, but time-to-market decisions overrode that. Anyway, everyone uses
a program called "bcfree" to actually do the buffer-cache freeing, so
changing the interface is not as bad as all that.

Let us put something together along these lines and we will get back to
you. Thanks,

-- 
Best Regards,
Ray
---
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
[EMAIL PROTECTED]             [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better", so I installed Linux.
---
Re: [PATCH/RFC] A method for clearing out page cache
Ray Bryant <[EMAIL PROTECTED]> wrote:
> Andrew Morton wrote:
>> Martin Hicks <[EMAIL PROTECTED]> wrote:
>>> This patch introduces a new sysctl for NUMA systems that tries to
>>> drop as much of the page cache as possible from a set of nodes.  The
>>> motivation for this patch is for setting up High Performance
>>> Computing jobs, where initial memory placement is very important to
>>> overall performance.
>>
>> - Using a write to /proc for this seems a bit hacky.  Why not simply
>>   add a new system call for it?
>
> We did it this way because it was easier to get it into SLES9 that way.
> But there is no particular reason that we couldn't use a system call.
> It's just that we figured adding system calls is hard.

aarggh.  This is why you should target kernel.org kernels first.  Now we
risk ending up with poor old suse carrying an obsolete interface, and
application developers will have to cater for both interfaces.

>> If it does, then userspace could arrange for that concurrency by
>> starting a number of processes to perform the toss, each with a
>> different nodemask.
>
> That works fine as well if we can get a system call number assigned,
> and it avoids the hackiness of both /proc and the kernel threads.

syscall numbers are per-arch.  We don't need to assign a syscall number
for this one - we can surely have this ready for 2.6.12.  Simply include
i386 and ia64 in the initial patch and other architectures will catch up
pretty quickly.  (It would be nice to generate patches for the arch
maintainers, however).

>> - Dropping "as much pagecache as possible" might be a bit crude.  I
>>   wonder if we should pass in some additional parameter which
>>   specifies how much of the node's pagecache should be removed.
>>
>>   Or, better, specify how much free memory we will actually require on
>>   this node.  The syscall terminates when it determines that enough
>>   pagecache has been removed.
>
> Our thoughts exactly.  This is clearly a "big hammer" and we want to
> make a lighter hammer to free up a certain number of pages.  Indeed, we
> would like to have these calls occur automatically from __alloc_pages()
> when we try to allocate local storage and find that there isn't any.
> For our workloads, we want to free up unmapped, clean pagecache, if
> that is what is keeping us from allocating a local page.  Not all
> workloads want that, however, so we would probably use a sysctl() to
> enable/disable this.
>
> However, the first step is to do this manually from user space.

Yup.  The thing is, lots of people want this feature for various reasons.
Not just numerical-computing-users-on-NUMA.  We should get it right for
them too.

Especially kernel developers, who have various nasty userspace tools
which will manually reclaim pagecache.  But non-kernel-developers will use
it too, when they think the VM is screwing them over ;)

I think Solaris used to have such a tool - /usr/etc/chill, although I
don't know if it had kernel support.

>> - To make the syscall more general, we should be able to reclaim
>>   mapped pagecache and anonymous memory as well.
>>
>> So what it comes down to is
>>
>>	sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
>>
>> where `what_to_free' consists of a bunch of bitflags (unmapped
>> pagecache, mapped pagecache, anonymous memory, slab, ...).
>
> Do we have to implement all of those, or just allow for the possibility
> of that being implemented in the future?  E.g. in our case we'd just
> implement the bit that says "unmapped pagecache".

Well...  please take a look at what's involved.  It should just be a
matter of sprinkling a few tests such as

+	if (sc->mode & SC_RECLAIM_SLAB) {
		...
+	}

into the existing code.  If things turn nasty then we can take another
look at it.
Re: [PATCH/RFC] A method for clearing out page cache
Paul Jackson <[EMAIL PROTECTED]> wrote:
> Andrew wrote:
>>	sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
>> ...
>> - To make the syscall more general, we should be able to reclaim
>>   mapped pagecache and anonymous memory as well.
>
> sys_free_node_memory() - nice.
>
> Does it make sense to also have it be able to free up slab cache,
> calling shrink_slab()?

Yes, I suggested that slab be one of the `what_to_free' flags.  (Some of
this may be tricky to implement.  But a good interface with an
initially-crappy implementation is OK ;)

> Did you mean to pass a nodemask, or a single node id?  Passing a single
> node id is easier - we've shown that it is difficult to pass bitmaps
> across the user/kernel boundary without confusion.  But if only a
> single node id is passed, then you get the thread per node that you
> just argued was sometimes overkill.

I meant a single node ID.  With a bitmap, the kernel needs to futz around
scanning the bitmap, launching kernel threads, etc.  I'm proposing that
there be no kernel threads at all.  If you have four nodes:

	for i in 0 1 2 3
	do
		call-sys_free_node_memory $i -1 -1 &
	done

> I'd prefer the single node id, because it's easier to get right.

yup.
Re: [PATCH/RFC] A method for clearing out page cache
Andrew Morton wrote:
> Martin Hicks <[EMAIL PROTECTED]> wrote:
> >
> > This patch introduces a new sysctl for NUMA systems that tries to drop
> > as much of the page cache as possible from a set of nodes.  The
> > motivation for this patch is for setting up High Performance Computing
> > jobs, where initial memory placement is very important to overall
> > performance.
>
> - Using a write to /proc for this seems a bit hacky.  Why not simply add
>   a new system call for it?

We did it this way because it was easier to get it into SLES9 that way.
But there is no particular reason that we couldn't use a system call.
It's just that we figured adding system calls is hard.

> - Starting a kernel thread for each node might be overkill.  Yes, it
>   would take longer if one process was to do all the work, but does this
>   operation need to be very fast?

It is possible that this call might need to be executed at the start of
each batch job in the system.  The reason for using a kernel thread was
that there was no good way to start concurrency due to a write to /proc.

>   If it does, then userspace could arrange for that concurrency by
>   starting a number of processes to perform the toss, each with a
>   different nodemask.

That works fine as well if we can get a system call number assigned and
avoids the hackiness of both /proc and the kernel threads.

> - Dropping "as much pagecache as possible" might be a bit crude.  I
>   wonder if we should pass in some additional parameter which specifies
>   how much of the node's pagecache should be removed.  Or, better,
>   specify how much free memory we will actually require on this node.
>   The syscall terminates when it determines that enough pagecache has
>   been removed.

Our thoughts exactly.  This is clearly a "big hammer" and we want to
make a lighter hammer to free up a certain number of pages.  Indeed,
we would like to have these calls occur automatically from __alloc_pages()
when we try to allocate local storage and find that there isn't any.

For our workloads, we want to free up unmapped, clean pagecache, if that
is what is keeping us from allocating a local page.  Not all workloads
want that, however, so we would probably use a sysctl() to enable/disable
this.

However, the first step is to do this manually from user space.

> - To make the syscall more general, we should be able to reclaim mapped
>   pagecache and anonymous memory as well.
>
> So what it comes down to is
>
> 	sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
>
> where `what_to_free' consists of a bunch of bitflags (unmapped pagecache,
> mapped pagecache, anonymous memory, slab, ...).

Do we have to implement all of those or just allow for the possibility of
that being implemented in the future?  E. g. in our case we'd just
implement the bit that says "unmapped pagecache".

-- 
Best Regards,
Ray
---
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
[EMAIL PROTECTED]             [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better", so I installed Linux.
---
Re: [PATCH/RFC] A method for clearing out page cache
Andrew wrote:
> sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
> ...
> - To make the syscall more general, we should be able to reclaim mapped
>   pagecache and anonymous memory as well.

sys_free_node_memory() - nice.

Does it make sense to also have it be able to free up slab cache,
calling shrink_slab()?

Did you mean to pass a nodemask, or a single node id?  Passing a single
node id is easier - we've shown that it is difficult to pass bitmaps
across the user/kernel boundary without confusions.  But if only a
single node id is passed, then you get the thread per node that you just
argued was sometimes overkill.

I'd prefer the single node id, because it's easier to get right.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [PATCH/RFC] A method for clearing out page cache
On Mon, 21 Feb 2005 14:27:21 -0500, Martin Hicks <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I've made a bunch of changes that Paul suggested.  I've also responded
> to his concerns further down.  Paul correctly pointed out that this
> patch uses some helper functions that are part of the cpusets patch.  I
> should have mentioned this before.

<snip>

> This patch introduces a new sysctl for NUMA systems that tries to drop
> as much of the page cache as possible from a set of nodes.  The
> motivation for this patch is for setting up High Performance Computing
> jobs, where initial memory placement is very important to overall
> performance.

<snip>

> +	/* wait for the kernel threads to complete */
> +	while (atomic_read(&num_toss_threads_active) > 0) {
> +		__set_current_state(TASK_INTERRUPTIBLE);
> +		schedule_timeout(10);
> +	}

<snip>

Would it be possible to use msleep_interruptible() here?  Or is it a
strict check every 10 ticks, regardless of HZ?  Could a comment be
inserted indicating which is the case?

Thanks,
Nish
Re: [PATCH/RFC] A method for clearing out page cache
Martin Hicks <[EMAIL PROTECTED]> wrote:
>
> This patch introduces a new sysctl for NUMA systems that tries to drop
> as much of the page cache as possible from a set of nodes.  The
> motivation for this patch is for setting up High Performance Computing
> jobs, where initial memory placement is very important to overall
> performance.

- Using a write to /proc for this seems a bit hacky.  Why not simply add
  a new system call for it?

- Starting a kernel thread for each node might be overkill.  Yes, it
  would take longer if one process was to do all the work, but does this
  operation need to be very fast?

  If it does, then userspace could arrange for that concurrency by
  starting a number of processes to perform the toss, each with a
  different nodemask.

- Dropping "as much pagecache as possible" might be a bit crude.  I
  wonder if we should pass in some additional parameter which specifies
  how much of the node's pagecache should be removed.  Or, better,
  specify how much free memory we will actually require on this node.
  The syscall terminates when it determines that enough pagecache has
  been removed.

- To make the syscall more general, we should be able to reclaim mapped
  pagecache and anonymous memory as well.

So what it comes down to is

	sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)

where `what_to_free' consists of a bunch of bitflags (unmapped pagecache,
mapped pagecache, anonymous memory, slab, ...).
Re: [PATCH/RFC] A method for clearing out page cache
Hi,

I've made a bunch of changes that Paul suggested.  I've also responded
to his concerns further down.  Paul correctly pointed out that this
patch uses some helper functions that are part of the cpusets patch.  I
should have mentioned this before.

The major changes are:

- Cleanup proc_dobitmask_list() a bit more, including adding bounds
  checking on *lenp.
- An important bugfix in vmscan.c around line 390.  Go to the
  keep_locked label, not the "keep" label.
- Add locking in proc_do_toss_page_cache_nodes() to protect the global
  nodemask_t from getting corrupted.
- Change a few functions to "static"
- Paul Jackson's suggested changes to greatly simplify
  proc_do_toss_page_cache_nodes()

The patch is inlined at the end of the mail.

On Mon, Feb 14, 2005 at 07:37:04PM -0800, Paul Jackson wrote:
>
> 1) A couple of kmalloc's are done using lengths that
>    so far as I could tell, came straight from user land.

Okay, I've stuck in maximums that are based on MAX_NUMNODES.

> 2) Beware that this patch depends on the cpuset patch:
>    new-bitmap-list-format-for-cpusets.patch
>    which is still in *-mm only, for the routines
>    bitmap_scnlistprintf/bitmap_parselist.

Thanks.  I hadn't realized that.

> 3) Should the maxlen of a nodemask for the sysctl
>    handler for proc_do_toss_page_cache_nodes be the byte
>    length of the kernels internal binary nodemask, or

It is the byte length of the kernel's bitmask struct.

> 5) The requirement to read the string in one read(2) syscall
>    seemed like it might be draconian.  If the available

But that's the way the rest of the sysctl read functions work.  There's
no safe way that I can see to ensure that the data doesn't change in
between two consecutive read calls.

> 9) Comment - dont we need to protect the kernel global variable
>    toss_page_cache_nodes from simulaneous access by two tasks?

yes, I protected this with a semaphore.

mh

-- 
Martin Hicks          Wild Open Source Inc.
[EMAIL PROTECTED]      613-266-2296


This patch introduces a new sysctl for NUMA systems that tries to drop
as much of the page cache as possible from a set of nodes.  The
motivation for this patch is for setting up High Performance Computing
jobs, where initial memory placement is very important to overall
performance.

Signed-off-by: Martin Hicks <[EMAIL PROTECTED]>
Signed-off-by: Ray Bryant <[EMAIL PROTECTED]>

[EMAIL PROTECTED] patches]$ diffstat toss_page_cache_nodes_v2.patch
 include/linux/sysctl.h |    3 +
 kernel/sysctl.c        |   95
 mm/vmscan.c            |  105 -
 3 files changed, 201 insertions(+), 2 deletions(-)

Index: linux-2.6.10/include/linux/sysctl.h
===
--- linux-2.6.10.orig/include/linux/sysctl.h	2005-02-16 12:43:19.0 -0800
+++ linux-2.6.10/include/linux/sysctl.h	2005-02-19 10:36:41.0 -0800
@@ -170,6 +170,7 @@
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+	VM_TOSS_PAGE_CACHE_NODES=29, /* nodemask_t: nodes to free page cache on */
 };

@@ -803,6 +804,8 @@
 				     void __user *, size_t *, loff_t *);
 extern int proc_doulongvec_ms_jiffies_minmax(ctl_table *table, int, struct file *,
 				     void __user *, size_t *, loff_t *);
+extern int proc_dobitmap_list(ctl_table *table, int, struct file *,
+			      void __user *, size_t *, loff_t *);

 extern int do_sysctl (int __user *name, int nlen,
 		      void __user *oldval, size_t __user *oldlenp,
Index: linux-2.6.10/kernel/sysctl.c
===
--- linux-2.6.10.orig/kernel/sysctl.c	2005-02-16 12:43:19.0 -0800
+++ linux-2.6.10/kernel/sysctl.c	2005-02-21 10:49:18.0 -0800
@@ -41,6 +41,8 @@
 #include <linux/limits.h>
 #include <linux/dcache.h>
 #include <linux/syscalls.h>
+#include <linux/bitmap.h>
+#include <linux/nodemask.h>

 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -72,6 +74,12 @@
 			     void __user *, size_t *, loff_t *);
 #endif

+#ifdef CONFIG_NUMA
+extern nodemask_t toss_page_cache_nodes;
+extern int proc_do_toss_page_cache_nodes(ctl_table *, int, struct file *,
+					 void __user *, size_t *, loff_t *);
+#endif
+
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
 static int minolduid;
@@ -836,6 +844,16 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+#ifdef CONFIG_NUMA
+	{
+		.ctl_name	= VM_TOSS_PAGE_CACHE_NODES,
+		.procname	= "toss_page_cache_nodes",
+		.data		= &toss_page_cache_nodes,
+		.maxlen		=
Re: [PATCH/RFC] A method for clearing out page cache
Ray Bryant <[EMAIL PROTECTED]> wrote:
> Andrew Morton wrote:
> > Martin Hicks <[EMAIL PROTECTED]> wrote:
> > > This patch introduces a new sysctl for NUMA systems that tries to drop
> > > as much of the page cache as possible from a set of nodes.  The
> > > motivation for this patch is for setting up High Performance Computing
> > > jobs, where initial memory placement is very important to overall
> > > performance.
> >
> > - Using a write to /proc for this seems a bit hacky.  Why not simply add
> >   a new system call for it?
>
> We did it this way because it was easier to get it into SLES9 that way.
> But there is no particular reason that we couldn't use a system call.
> It's just that we figured adding system calls is hard.

aarggh.  This is why you should target kernel.org kernels first.  Now we
risk ending up with poor old suse carrying an obsolete interface and
application developers have to be able to cater for both interfaces.

> > If it does, then userspace could arrange for that concurrency by
> > starting a number of processes to perform the toss, each with a
> > different nodemask.
>
> That works fine as well if we can get a system call number assigned and
> avoids the hackiness of both /proc and the kernel threads.

syscall numbers are per-arch.  We don't need to assign a syscall number
for this one - we can surely have this ready for 2.6.12.  Simply include
i386 and ia64 in the initial patch and other architectures will catch up
pretty quickly.  (It would be nice to generate patches for the arch
maintainers, however).

> > - Dropping "as much pagecache as possible" might be a bit crude.  I
> >   wonder if we should pass in some additional parameter which specifies
> >   how much of the node's pagecache should be removed.  Or, better,
> >   specify how much free memory we will actually require on this node.
> >   The syscall terminates when it determines that enough pagecache has
> >   been removed.
>
> Our thoughts exactly.  This is clearly a "big hammer" and we want to
> make a lighter hammer to free up a certain number of pages.
>
> Indeed, we would like to have these calls occur automatically from
> __alloc_pages() when we try to allocate local storage and find that
> there isn't any.
>
> For our workloads, we want to free up unmapped, clean pagecache, if that
> is what is keeping us from allocating a local page.  Not all workloads
> want that, however, so we would probably use a sysctl() to enable/disable
> this.
>
> However, the first step is to do this manually from user space.

Yup.  The thing is, lots of people want this feature for various
reasons.  Not just numerical-computing-users-on-NUMA.  We should get it
right for them too.

Especially kernel developers, who have various nasty userspace tools
which will manually reclaim pagecache.  But non-kernel-developers will
use it too, when they think the VM is screwing them over ;)

I think Solaris used to have such a tool - /usr/etc/chill, although I
don't know if it had kernel support.

> > - To make the syscall more general, we should be able to reclaim mapped
> >   pagecache and anonymous memory as well.
> >
> > So what it comes down to is
> >
> > 	sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
> >
> > where `what_to_free' consists of a bunch of bitflags (unmapped pagecache,
> > mapped pagecache, anonymous memory, slab, ...).
>
> Do we have to implement all of those or just allow for the possibility
> of that being implemented in the future?  E. g. in our case we'd just
> implement the bit that says "unmapped pagecache".

Well... please take a look at what's involved.  It should just be a
matter of sprinkling a few tests such as

+	if (sc->mode & SC_RECLAIM_SLAB) {
	...
+	}

into the existing code.  If things turn nasty then we can take another
look at it.
Re: [PATCH/RFC] A method for clearing out page cache
Andrew Morton wrote:
> Ray Bryant <[EMAIL PROTECTED]> wrote:
> > We did it this way because it was easier to get it into SLES9 that way.
> > But there is no particular reason that we couldn't use a system call.
> > It's just that we figured adding system calls is hard.
>
> aarggh.  This is why you should target kernel.org kernels first.  Now we
> risk ending up with poor old suse carrying an obsolete interface and
> application developers have to be able to cater for both interfaces.

I agree, but time-to-market decisions overrode that.

Anyway, everyone uses a program called bcfree to actually do the
buffer-cache freeing, so changing the interface is not as bad as all
that.

Let us put something together along these lines and we will get back to
you.

Thanks,

-- 
Best Regards,
Ray
---
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
[EMAIL PROTECTED]             [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better", so I installed Linux.
---
Re: [PATCH/RFC] A method for clearing out page cache
Andrew wrote:
> Yes, I ... [clarifies pj's various confusions]

Yup - all sounds good - thanks.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [PATCH/RFC] A method for clearing out page cache
* Andrew Morton <[EMAIL PROTECTED]> wrote:

> > However, the first step is to do this manually from user space.
>
> Yup.  The thing is, lots of people want this feature for various
> reasons.  Not just numerical-computing-users-on-NUMA.  We should get it
> right for them too.
>
> Especially kernel developers, who have various nasty userspace tools
> which will manually reclaim pagecache.  But non-kernel-developers will
> use it too, when they think the VM is screwing them over ;)

app designers very frequently think that the VM gets its act wrong (most
of the time for the wrong reasons), and the last thing we want to enable
them is to hack real problems around.  How are we supposed to debug VM
problems where one player periodically flushes the whole pagecache?  If
that flushing, when disabled, 'results in the app being broken' (_if_
the app gives any option to disable the flushing).

Providing APIs to flush system caches, sysctl or syscall, is the road to
VM madness.  If the goal is to override the pagecache (and other kernel
caches) on a given node then for God's sake, think a bit harder.  E.g.
enable users to specify an 'allocation priority' of some sort, which
kicks out the pagecache on the local node - or something like that.

Giving a half-assed tool to clean out one aspect of the system caches
will only muddy the waters, with no real road back to sanity.

	Ingo
Re: [PATCH/RFC] A method for clearing out page cache
On Mon, Feb 14, 2005 at 07:37:04PM -0800, Paul Jackson wrote:
> Questions concerning this page cache patch that Martin submitted,
> as a merge of something originally written by Ray Bryant.
>
> The following patch is not really a patch. It is a few questions, a
> couple minor space tweaks, and a never compiled nor tested rewrite of
> proc_do_toss_page_cache_nodes() to try to make it look a little
> prettier.

Thanks for the review, Paul. I'll take a harder look at your feedback
and reply.

--
Martin Hicks || Silicon Graphics Inc. || [EMAIL PROTECTED]
Re: [PATCH/RFC] A method for clearing out page cache
Questions concerning this page cache patch that Martin submitted, as a
merge of something originally written by Ray Bryant.

The following patch is not really a patch. It is a few questions, a
couple of minor space tweaks, and a never compiled nor tested rewrite of
proc_do_toss_page_cache_nodes() to try to make it look a little
prettier.

Some of the issues are cosmetic, but some I suspect warrant a competent
response by Martin or Ray before this goes into *-mm, such as some
questions as to whether locking is adequate, or whether a kmalloc() size
might be forced huge by the user. And my suggested rewrite changes the
kernel API in one error case, so better to decide that matter before it
is too widely used.

Specifically:

 1) A couple of kmalloc's are done using lengths that, so far as I could
    tell, came straight from user land. Never let the user size a kernel
    malloc without limit, as it makes it way too easy to ask for
    something huge and give the kernel indigestion. If the lengths in
    question are actually limited, then never mind (or add a terse
    one-line comment, for worry warts such as myself).

 2) Beware that this patch depends on the cpuset patch
    new-bitmap-list-format-for-cpusets.patch, which is still in *-mm
    only, for the routines bitmap_scnlistprintf/bitmap_parselist.

 3) Should the maxlen of a nodemask for the sysctl handler
    proc_do_toss_page_cache_nodes be the byte length of the kernel's
    internal binary nodemask, or a reasonable upper bound on the max
    length of the ascii representation thereof, which is about the value
    100 + 6 * MAX_NUMNODES when using the
    bitmap_scnlistprintf/bitmap_parselist format?

 4) A couple of existing blank lines were nuked by this patch - I
    restored them. I thought them to be nice blank lines ;).

 5) The requirement to read the string in one read(2) syscall seemed
    like it might be draconian. If the available apparatus supports it,
    better to allocate the ascii buffer on the open for read, let the
    reads (and seeks) feast on that buffer, using f_pos as it should be
    used, and free the buffer on the close. Mind you, I have no idea
    whether the sysctl.c apparatus conveniently supports this.

 6) The kernel header bitops.h is no longer needed by sysctl.c,
    following my (uncompiled, untested) rewrite.

 7) Instead of two counters to track how many threads remained to be
    waited for, toss_done and nodes_to_toss, my rewrite has just one:
    num_toss_threads_active. It bumps that value once for each kthread
    it starts, decrements it as each thread finishes, and waits for it
    to get back to zero in the loop.

 8) Several changes in the rewrite of proc_do_toss_page_cache_nodes():
    - rename 'retval' to 'ret' (more common, shorter)
    - nuke the bitmap and use nodemask routines
    - don't error if some nodes are offline (the general idea is to
      either do something useful and claim success, or do nothing at
      all and complain of error, but don't both do something useful
      and complain)
    - convert to a single return, at the bottom of the function
    - XXX Comment: doesn't this code require locking node_online_map?
    - remove unused 'started'
    - remove no longer used 'i'
    - remove no longer used 'errors'
    - replace a 3-line bitop for loop with a one-line for_each_node_mask
    - replace 15 lines of 'validity checking' with a one-line check for
      the node being online

 9) Comment - don't we need to protect the kernel global variable
    toss_page_cache_nodes from simultaneous access by two tasks?
Index: 2.6.11-rc4/include/linux/sysctl.h
===================================================================
--- 2.6.11-rc4.orig/include/linux/sysctl.h	2005-02-14 18:26:28.000000000 -0800
+++ 2.6.11-rc4/include/linux/sysctl.h	2005-02-14 18:27:31.000000000 -0800
@@ -803,6 +803,7 @@ extern int proc_doulongvec_ms_jiffies_mi
 		      struct file *, void __user *, size_t *, loff_t *);
 extern int proc_dobitmap_list(ctl_table *table, int, struct file *,
 		      void __user *, size_t *, loff_t *);
+
 extern int do_sysctl (int __user *name, int nlen, void __user *oldval,
 		      size_t __user *oldlenp, void __user *newval, size_t newlen);
Index: 2.6.11-rc4/kernel/sysctl.c
===================================================================
--- 2.6.11-rc4.orig/kernel/sysctl.c	2005-02-14 18:26:28.000000000 -0800
+++ 2.6.11-rc4/kernel/sysctl.c	2005-02-14 18:27:46.000000000 -0800
@@ -42,7 +42,6 @@
 #include <linux/dcache.h>
 #include <linux/syscalls.h>
 #include <linux/bitmap.h>
-#include <linux/bitops.h>
 #include <linux/nodemask.h>

 #include <asm/uaccess.h>
@@ -839,6 +838,8 @@ static ctl_table vm_table[] = {
 		.ctl_name	= VM_TOSS_PAGE_CACHE_NODES,
 		.procname	= "toss_page_cache_nodes",
 		.data		= &toss_page_cache_nodes,
+/* XXX
[PATCH/RFC] A method for clearing out page cache
Hi,

This patch introduces a new sysctl for NUMA systems that tries to drop
as much of the page cache as possible from a set of nodes. The
motivation for this patch is for setting up High Performance Computing
jobs, where initial memory placement is very important to overall
performance.

Currently, if a job is started and there is page cache lying around on a
particular node, then allocations will spill onto remote nodes and page
cache won't be reclaimed until the whole system is short on memory. This
can result in a significant performance hit for HPC applications that
planned on that memory being allocated locally.

This patch is intended to be used to clean out the entire page cache
before starting a new job. Ideally, we would like to clear only as much
page cache as is required to avoid non-local memory allocation. Patches
to do this can be built on top of this patch, so this patch should be
regarded as the first step in that direction. The long term goal is to
have some mechanism that would better control the page cache (and other
memory caches) for machines that put a higher priority on memory
placement than on maintaining big caches.

It allows you to clear page cache on nodes in the following manner:

	echo 1,3,9-12 > /proc/sys/vm/toss_page_cache_nodes

The patch was written by Ray Bryant <[EMAIL PROTECTED]> and forward
ported by me, Martin Hicks <[EMAIL PROTECTED]>, to 2.6.11-rc3-mm2.

Could we get this included in -mm, Andrew?

mh

--
Martin Hicks || Wild Open Source Inc. || [EMAIL PROTECTED]
613-266-2296


This patch introduces a new sysctl for NUMA systems that tries to drop
as much of the page cache as possible from a set of nodes. The
motivation for this patch is for setting up High Performance Computing
jobs, where initial memory placement is very important to overall
performance.
It allows you to clear page cache on nodes in the following manner:

	echo 1,3,9-12 > /proc/sys/vm/toss_page_cache_nodes

Signed-off-by: Martin Hicks <[EMAIL PROTECTED]>
Signed-off-by: Ray Bryant <[EMAIL PROTECTED]>

[EMAIL PROTECTED] patches]$ diffstat toss_page_cache_nodes.patch
 include/linux/sysctl.h |    4 +
 kernel/sysctl.c        |   82 +++++++++++++
 mm/vmscan.c            |  128 +++++++++++++++++++-
 3 files changed, 211 insertions(+), 3 deletions(-)

Index: linux-2.6.10/include/linux/sysctl.h
===================================================================
--- linux-2.6.10.orig/include/linux/sysctl.h	2005-02-11 10:54:13.000000000 -0800
+++ linux-2.6.10/include/linux/sysctl.h	2005-02-11 10:54:14.000000000 -0800
@@ -170,6 +170,7 @@
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+	VM_TOSS_PAGE_CACHE_NODES=29, /* nodemask_t: nodes to free page cache on */
 };

@@ -803,7 +804,8 @@
 		      void __user *, size_t *, loff_t *);
 extern int proc_doulongvec_ms_jiffies_minmax(ctl_table *table, int,
 		      struct file *, void __user *, size_t *, loff_t *);
-
+extern int proc_dobitmap_list(ctl_table *table, int, struct file *,
+		      void __user *, size_t *, loff_t *);
 extern int do_sysctl (int __user *name, int nlen, void __user *oldval,
 		      size_t __user *oldlenp, void __user *newval, size_t newlen);

Index: linux-2.6.10/kernel/sysctl.c
===================================================================
--- linux-2.6.10.orig/kernel/sysctl.c	2005-02-11 10:54:14.000000000 -0800
+++ linux-2.6.10/kernel/sysctl.c	2005-02-11 10:54:14.000000000 -0800
@@ -41,6 +41,9 @@
 #include <linux/limits.h>
 #include <linux/dcache.h>
 #include <linux/syscalls.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+#include <linux/nodemask.h>

 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -72,6 +75,12 @@
 		      void __user *, size_t *, loff_t *);
 #endif

+#ifdef CONFIG_NUMA
+extern nodemask_t toss_page_cache_nodes;
+extern int proc_do_toss_page_cache_nodes(ctl_table *, int, struct file *,
+		     void __user *, size_t *, loff_t *);
+#endif
+
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
 static int minolduid;
@@ -836,6 +845,16 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+#ifdef CONFIG_NUMA
+	{
+		.ctl_name	= VM_TOSS_PAGE_CACHE_NODES,
+		.procname	= "toss_page_cache_nodes",
+		.data		= &toss_page_cache_nodes,
+		.maxlen		= sizeof(nodemask_t),
+		.mode		= 0644,
+		.proc_handler	= &proc_do_toss_page_cache_nodes,
+	},
+#endif
 	{ .ctl_name = 0 }
 };

@@ -2071,6 +2090,68 @@
 		do_proc_dointvec_userhz_jiffies_conv, NULL);
 }

+/**
+ * proc_dobitmap_list -- read/write a