Re: userspace pagecache management tool
On Thu, 08 Mar 2007 13:29:02 +0530 Vaidyanathan Srinivasan <[EMAIL PROTECTED]> wrote:

> > That all sounds reasonably doable.  It'd be pretty complex to do it
> > in-kernel but we could do it there too.  Problem is of course that the
> > above strategy is explicitly optimised for the backup program and if it's
> > in-kernel it becomes applicable to all other workloads.
>
> This strategy looks very good.  However we are not considering the
> performance impact on the 'backup' application as such.  Removing
> pagecache pages brought in by the application without knowledge of
> the application's usage and behavior may severely affect its performance.
>
> Certainly we are interested in improving system performance at the
> cost of certain applications, but not to such an extent that the backup
> process will drag on and on for an unreasonable amount of time.
>
> Also backup processes may consist of a group of applications working
> on the same stream of data, like a compression program, an encryption
> program etc., which could be independent applications.

Well yes, if the application is that funky then suitably funky userspace
tricks will be needed to avoid hurting it.

> We should consider having a limit on pagecache usage rather than
> denying any space in the pagecache for these applications.

That's what containerisation is for:

	run-in-container --memory=16M /bin/backup-program

This can be done today with x86_64 fake-numa, controlled by cpusets.  One
day, when we get our containerisation story sorted out, things will be more
convenient...

> Can fadvise() be enhanced to have a limit on pagecache usage and
> reclaim used pages in LRU order?  This way data stays for a little
> while for other applications to pick up from pagecache.
>
> Pages already in memory or brought in by other applications need not
> be placed in this list and hence we prevent any collateral pageouts.

We could teach the presently-unimplemented POSIX_FADV_NOREUSE to dump this
file's pages at the tail of the inactive list (after cleaning them if
needed).  That way, they're the first to get reclaimed.

The standard says "Specifies that the application expects to access the
specified data once and then not reuse it thereafter."  That's a bit
ambiguous: is it before the process accessed the data, or after?  Before,
I suspect.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
Andrew Morton wrote:
> On Sun, 4 Mar 2007 00:01:55 +0100 bert hubert <[EMAIL PROTECTED]> wrote:
>
>> On Sat, Mar 03, 2007 at 02:26:09PM -0800, Andrew Morton wrote:
>>>>> It is *not* a global instruction.  It uses setenv, so the user's
>>>>> policy affects only the target process and its forked children.
>>>> ... and all other processes accessing the same file(s)!  Your library
>>>> and the system calls may be limited to one process, but the
>>>> consequences are global.
>>> Yes.  So what?  If the user wants to go and evict libc.so from pagecache
>>> then he can do so - the kernel has provided syscalls with which this can
>>> be done for at least seven years.  Bad user, shouldn't do that.
>> While I agree with your sentiments that userspace can have a good idea on
>> how to deal with the page cache, your program does more than it claims to
>> do - because of how linux implements posix_fadvise.
>>
>> I don't think anybody expects or desires your program to actually *evict*
>> the stuff from the cache you are trying to access, which happens in case
>> the data was in the cache prior to starting your program.
>>
>> What people expect is that a solution such as the one you wrote simply
>> won't *add* anything to the cache.  They don't expect it will actually
>> globally *remove* stuff from the cache.
>>
>> Making a backup this way would hurt even worse than usual with your
>> pagecache management tool if the file being backed up was still being
>> read.
>>
>> This is not your fault, but in practice, it makes your program less
>> useful than it could be.
>
> yup.  As I said, it's a proof-of-concept.  It's a project.  And I have
> about one free femtosecond per fortnight :(
>
>> One could conceivably fix that up using mincore and simply not fadvise
>> if a page was in core already.
>
> Yes.  Let's flesh out the backup program policy some more:
>
> - Unconditionally invalidate output files
>
> - on entry to read(), probe pagecache, record which pages in the range are
>   present
>
> - on entry to next read(), shoot down those pages from the previous read
>   which weren't in pagecache.
>
> - But we can do better!  LRU the file's pages up to a certain number of
>   pages.
>
> - Once that point is exceeded, we need to reclaim some pages.  Which
>   ones?  Well, we've been observing all reads, so we can record which pages
>   were referenced once, and which ones were referenced multiple times so we
>   can do arbitrarily complex page aging in there.
>
> - On close(), nuke all pages which weren't in core during open(), even if
>   this app referenced them multiple times.
>
> - If the backup program decided to read its input files with mmap we're
>   rather screwed.  We can't intercept pagefaults so the best we can do is
>   to restore the file's pagecache to its previous state on close().
>
>   Or if it's really a problem, get control in there somehow and
>   periodically poll the pagecache occupancy via mincore(), use madvise()
>   then fadvise() to trim it back.
>
> That all sounds reasonably doable.  It'd be pretty complex to do it
> in-kernel but we could do it there too.  Problem is of course that the
> above strategy is explicitly optimised for the backup program and if it's
> in-kernel it becomes applicable to all other workloads.

This strategy looks very good.  However we are not considering the
performance impact on the 'backup' application as such.  Removing
pagecache pages brought in by the application without knowledge of
the application's usage and behavior may severely affect its performance.

Certainly we are interested in improving system performance at the
cost of certain applications, but not to such an extent that the backup
process will drag on and on for an unreasonable amount of time.

Also backup processes may consist of a group of applications working
on the same stream of data, like a compression program, an encryption
program etc., which could be independent applications.

We should consider having a limit on pagecache usage rather than
denying any space in the pagecache for these applications.

Can fadvise() be enhanced to have a limit on pagecache usage and
reclaim used pages in LRU order?  This way data stays for a little
while for other applications to pick up from pagecache.

Pages already in memory or brought in by other applications need not
be placed in this list and hence we prevent any collateral pageouts.
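The limit proposed in this message (a capped, LRU-ordered list that only holds pages the application itself brought in) can be sketched in userspace roughly as follows. This is an illustration under assumptions, not anything from the posted tool: `PagecacheBudget`, `touch()` and the `drop` callback are hypothetical names, and `was_resident` is assumed to come from a probe such as mincore() done before the read.

```python
from collections import OrderedDict


class PagecacheBudget:
    """LRU of pages this application brought in, capped at `limit` pages.

    Hypothetical helper: `drop` is a caller-supplied callback that evicts
    one page (for example via posix_fadvise(POSIX_FADV_DONTNEED)).
    """

    def __init__(self, limit, drop):
        self.limit = limit
        self.drop = drop
        self.lru = OrderedDict()      # page index -> None, oldest first

    def touch(self, page, was_resident):
        """Record an access; was_resident is the pre-read probe result."""
        if page in self.lru:
            self.lru.move_to_end(page)    # re-referenced: now most recent
            return
        if was_resident:
            return                        # cached by someone else: never listed
        self.lru[page] = None
        while len(self.lru) > self.limit:
            victim, _ = self.lru.popitem(last=False)   # least recently used
            self.drop(victim)
```

Because pages that were already resident never enter the list, they can never be chosen as victims, which is the "no collateral pageouts" property asked for above.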
Re: userspace pagecache management tool
On Wed, 07 Mar 2007 11:39:02 + Pádraig Brady <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Tue, 06 Mar 2007 12:10:49 +
> > Pádraig Brady <[EMAIL PROTECTED]> wrote:
> >> Perhaps one could possibly just evict pages with _mapcount==0 ?
> >
> > That is the present fadvise(FADV_DONTNEED) behaviour.
>
> Ah right. It doesn't invalidate page_mapped() pages.

yup

> If that means it doesn't invalidate pages previously cached
> by other processes, then great.

It will do that.  This is why I point out that this userspace tool could
(easily) be enhanced to not invalidate pages which were in pagecache prior
to their being read by the managed application.
Re: userspace pagecache management tool
Andrew Morton wrote:
> On Tue, 06 Mar 2007 12:10:49 +
> Pádraig Brady <[EMAIL PROTECTED]> wrote:
>> Perhaps one could possibly just evict pages with _mapcount==0 ?
>
> That is the present fadvise(FADV_DONTNEED) behaviour.

Ah right. It doesn't invalidate page_mapped() pages.
If that means it doesn't invalidate pages previously cached
by other processes, then great.

However I think what I meant though was
fadvise(FADV_DONTNEED) should only invalidate
pages where page_count()<=1

From include/linux/mm.h

"    For pages belonging to inodes, the page_count() is the number of
     attaches, plus 1 if `private' contains something, plus one for
     the page cache itself."

cheers,
Pádraig.
Re: userspace pagecache management tool
Andrew Morton wrote:
> On Tue, 06 Mar 2007 12:10:49 +
> Pádraig Brady <[EMAIL PROTECTED]> wrote:
>> Andrew Morton wrote:
>> If I'm the target audience for that API then it's broken as I'd mess
>> it up, or would take too long to get it right.
>>
>> Can't we just fix the posix_fadvise() implementation to
>> only evict pages paged in by the current process.
>
> The kernel doesn't have that information.

It doesn't _keep_ the information.  File readahead is done in
the process context, so we had it originally.

I agree though that we probably should not bother trying to
keep that kind of information :)

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
Re: userspace pagecache management tool
On Tue, 06 Mar 2007 12:10:49 + Pádraig Brady <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > Yes.  Let's flesh out the backup program policy some more:
> >
> > - Unconditionally invalidate output files
> >
> > - on entry to read(), probe pagecache, record which pages in the range are
> >   present
> >
> > - on entry to next read(), shoot down those pages from the previous read
> >   which weren't in pagecache.
> >
> > - But we can do better!  LRU the file's pages up to a certain number of
> >   pages.
> >
> > - Once that point is exceeded, we need to reclaim some pages.  Which
> >   ones?  Well, we've been observing all reads, so we can record which pages
> >   were referenced once, and which ones were referenced multiple times so we
> >   can do arbitrarily complex page aging in there.
> >
> > - On close(), nuke all pages which weren't in core during open(), even if
> >   this app referenced them multiple times.
> >
> > - If the backup program decided to read its input files with mmap we're
> >   rather screwed.  We can't intercept pagefaults so the best we can do is
> >   to restore the file's pagecache to its previous state on close().
> >
> >   Or if it's really a problem, get control in there somehow and
> >   periodically poll the pagecache occupancy via mincore(), use madvise()
> >   then fadvise() to trim it back.
> >
> > That all sounds reasonably doable.  It'd be pretty complex to do it
> > in-kernel but we could do it there too.  Problem is of course that the
> > above strategy is explicitly optimised for the backup program and if it's
> > in-kernel it becomes applicable to all other workloads.
>
> I can see the above being possible, but I can't see the reason
> for exposing that complexity to userspace.

That's sophistication, not complexity.  It doesn't have to do all that
stuff to be effective.

> If I'm the target
> audience for that API then it's broken as I'd mess it up,
> or would take too long to get it right.
>
> Can't we just fix the posix_fadvise() implementation to
> only evict pages paged in by the current process.

The kernel doesn't have that information.

> Perhaps one could possibly just evict pages with _mapcount==0 ?

That is the present fadvise(FADV_DONTNEED) behaviour.
Re: userspace pagecache management tool
Andrew Morton wrote:
> Yes.  Let's flesh out the backup program policy some more:
>
> - Unconditionally invalidate output files
>
> - on entry to read(), probe pagecache, record which pages in the range are
>   present
>
> - on entry to next read(), shoot down those pages from the previous read
>   which weren't in pagecache.
>
> - But we can do better!  LRU the file's pages up to a certain number of
>   pages.
>
> - Once that point is exceeded, we need to reclaim some pages.  Which
>   ones?  Well, we've been observing all reads, so we can record which pages
>   were referenced once, and which ones were referenced multiple times so we
>   can do arbitrarily complex page aging in there.
>
> - On close(), nuke all pages which weren't in core during open(), even if
>   this app referenced them multiple times.
>
> - If the backup program decided to read its input files with mmap we're
>   rather screwed.  We can't intercept pagefaults so the best we can do is
>   to restore the file's pagecache to its previous state on close().
>
>   Or if it's really a problem, get control in there somehow and
>   periodically poll the pagecache occupancy via mincore(), use madvise()
>   then fadvise() to trim it back.
>
> That all sounds reasonably doable.  It'd be pretty complex to do it
> in-kernel but we could do it there too.  Problem is of course that the
> above strategy is explicitly optimised for the backup program and if it's
> in-kernel it becomes applicable to all other workloads.

I can see the above being possible, but I can't see the reason
for exposing that complexity to userspace.  If I'm the target
audience for that API then it's broken as I'd mess it up,
or would take too long to get it right.

Can't we just fix the posix_fadvise() implementation to
only evict pages paged in by the current process.
Perhaps one could possibly just evict pages with _mapcount==0 ?

cheers,
Pádraig.
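The second and third items of the quoted policy (probe on entry to read(), then shoot down on the next read whatever the previous read added) might look roughly like this in an interposing library. It is only a sketch: `ReadTracker`, `probe` and `drop` are hypothetical names, with `probe` standing in for a mincore()-style residency check and `drop` for an fadvise-style eviction, both supplied by the caller.

```python
class ReadTracker:
    """Per-file bookkeeping for the probe-then-shoot-down policy.

    probe(first, last) -> set of page indices resident in [first, last]
    drop(pages)        -> evict the given set of page indices
    Both callables are assumptions supplied by the caller.
    """

    def __init__(self, probe, drop, page_size=4096):
        self.probe = probe
        self.drop = drop
        self.page_size = page_size
        self.pending = set()   # pages the previous read pulled into pagecache

    def _span(self, offset, length):
        first = offset // self.page_size
        last = (offset + max(length, 1) - 1) // self.page_size
        return first, last

    def on_read(self, offset, length):
        """Call on entry to read(), before the data is fetched."""
        if self.pending:
            self.drop(self.pending)        # added by the previous read
        first, last = self._span(offset, length)
        resident = self.probe(first, last)
        # Only pages that are absent now will be this read's responsibility.
        self.pending = set(range(first, last + 1)) - resident

    def on_close(self):
        """Drop whatever the final read added."""
        if self.pending:
            self.drop(self.pending)
        self.pending = set()
```

Pages that were resident before a read never enter `pending`, so data cached on behalf of other processes is left alone.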
Re: userspace pagecache management tool
On Mon, 05 Mar 2007 11:02:43 + Pádraig Brady <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > I've uploaded to http://userweb.kernel.org/~akpm/pagecache-management/ a
> > little tool which permits the management of the pagecache usage of
> > arbitrary applications.  Effectively it prevents the targeted application
> > from using any pagecache at all.
>
> Cool, Kinda like noca?
> http://kernel.umbrella.ro/vm/

yup, same concept.

> Though I could easily read your code,
> but couldn't immediately figure out what noca was doing.
>
> I used posix_fadvise in an app I did recently:
> http://www.pixelbeat.org/programs/dvd-vr/
> There is a stream_data() func there that does:
>
>   read(src)
>   write(dst)
>   posix_fadvise(src)
>   posix_fadvise(dst)
>
> for performance I found I needed to do it in that order
> so that any readahead done with the read(src)
> was not thrown away by the posix_fadvise(src).
> In addition to the order, one must be careful
> to throw away only what you've actually written.
>
> I'm not sure your lib gives enough control over this,
> as you essentially do:
>
>   posix_fadvise(src)
>   read(src)
>   posix_fadvise(dst)
>   write(dst)

That could be so - it's just a demo.  But readahead should be OK - I only
invalidate from start-of-file up to current-offset-minus-pagesize.  So the
cache at and ahead of the linear reader is undisturbed.
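The invalidation window described in this reply (start-of-file up to current-offset-minus-pagesize) comes down to a small calculation. The sketch below is an illustration rather than the tool's code; `invalidation_length` and `invalidate_behind` are hypothetical names. Note that a length of 0 passed to posix_fadvise() means "to end of file", so that case has to be skipped.

```python
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")


def invalidation_length(current_offset, page_size=PAGE_SIZE):
    """Bytes to drop, counted from start-of-file, for a linear reader.

    Stops one page short of the current offset so that the page being read
    and any readahead beyond it stay in pagecache.
    """
    return max(0, current_offset - page_size)


def invalidate_behind(fd, current_offset, page_size=PAGE_SIZE):
    """Drop the already-consumed part of the file; returns bytes advised."""
    length = invalidation_length(current_offset, page_size)
    # length == 0 would mean "through end of file" to posix_fadvise(),
    # which is the opposite of what is wanted here.
    if length > 0 and hasattr(os, "posix_fadvise"):
        os.posix_fadvise(fd, 0, length, os.POSIX_FADV_DONTNEED)
    return length
```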
Re: userspace pagecache management tool
Andrew Morton wrote:
> I've uploaded to http://userweb.kernel.org/~akpm/pagecache-management/ a
> little tool which permits the management of the pagecache usage of
> arbitrary applications.  Effectively it prevents the targeted application
> from using any pagecache at all.

Cool, Kinda like noca?
http://kernel.umbrella.ro/vm/
Though I could easily read your code,
but couldn't immediately figure out what noca was doing.

I used posix_fadvise in an app I did recently:
http://www.pixelbeat.org/programs/dvd-vr/
There is a stream_data() func there that does:

  read(src)
  write(dst)
  posix_fadvise(src)
  posix_fadvise(dst)

for performance I found I needed to do it in that order
so that any readahead done with the read(src)
was not thrown away by the posix_fadvise(src).
In addition to the order, one must be careful
to throw away only what you've actually written.

I'm not sure your lib gives enough control over this,
as you essentially do:

  posix_fadvise(src)
  read(src)
  posix_fadvise(dst)
  write(dst)

cheers,
Pádraig.
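The ordering described in this message (read, write, then advise on both files, and only for the bytes already handled) can be sketched as a copy loop. This is not the dvd-vr stream_data() code; `stream_copy`, `drop_cache` and the chunk size are assumptions made for illustration. The fsync() before the advice is also an assumption of this sketch: Linux does not discard dirty pages on POSIX_FADV_DONTNEED, so the output is flushed first, at a real throughput cost.

```python
import os

CHUNK = 1 << 20  # copy granularity; an arbitrary choice for this sketch


def drop_cache(fd, offset, length):
    """Best-effort POSIX_FADV_DONTNEED over [offset, offset + length)."""
    if length <= 0 or not hasattr(os, "posix_fadvise"):
        return
    try:
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
    except OSError:
        pass  # the advice is only a hint; ignore files that reject it


def stream_copy(src_path, dst_path, chunk=CHUNK):
    """Copy src to dst, dropping each chunk from pagecache once written."""
    copied = 0
    with open(src_path, "rb", buffering=0) as src, \
         open(dst_path, "wb", buffering=0) as dst:
        while True:
            buf = src.read(chunk)            # read(src): readahead may run on
            if not buf:
                break
            view = memoryview(buf)
            while view:                      # write(dst), handling short writes
                view = view[dst.write(view):]
            os.fsync(dst.fileno())           # dirty pages would not be dropped
            drop_cache(src.fileno(), copied, len(buf))   # only what was read
            drop_cache(dst.fileno(), copied, len(buf))   # only what was written
            copied += len(buf)
    return copied
```

Advising after the read, and only over bytes already consumed, leaves any readahead beyond the current position in place for the next iteration.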
Re: userspace pagecache management tool
Andrew Morton wrote: I've uploaded to http://userweb.kernel.org/~akpm/pagecache-management/ a little tool which permits the management of the pagecache usage of arbitrary applications. Effectively it prevents the targetted application from using any pagecache at all. Cool, Kinda like noca? http://kernel.umbrella.ro/vm/ Though I could easily read your code, but couldn't immediately figure out what noca was doing. I used posix_fadvise in an app I did recently: http://www.pixelbeat.org/programs/dvd-vr/ There is a stream_data() func there that does: read(src) write(dst) posix_fadvise(src) posix_fadvise(dst) for performance I found I needed to do it in that order so that any readahead done with the read(src) was not thrown away by the posix_fadvise(src). In addition to the order, one must be careful to throw away only what you've actually written. I'm not sure your lib gives enough control over this, as you essentially do: posix_fadvise(src) read(src) posix_fadvise(dst) write(dst) cheers, Pádraig. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
On Mon, 05 Mar 2007 11:02:43 + Pádraig Brady [EMAIL PROTECTED] wrote:

> Andrew Morton wrote:
> > I've uploaded to http://userweb.kernel.org/~akpm/pagecache-management/ a
> > little tool which permits the management of the pagecache usage of
> > arbitrary applications. Effectively it prevents the targeted application
> > from using any pagecache at all.
>
> Cool, Kinda like noca? http://kernel.umbrella.ro/vm/

yup, same concept.

> Though I could easily read your code, but couldn't immediately figure out
> what noca was doing.
>
> I used posix_fadvise in an app I did recently:
> http://www.pixelbeat.org/programs/dvd-vr/
> There is a stream_data() func there that does:
>
>     read(src)
>     write(dst)
>     posix_fadvise(src)
>     posix_fadvise(dst)
>
> for performance I found I needed to do it in that order
> so that any readahead done with the read(src) was not
> thrown away by the posix_fadvise(src).
>
> In addition to the order, one must be careful to
> throw away only what you've actually written.
> I'm not sure your lib gives enough control over this,
> as you essentially do:
>
>     posix_fadvise(src)
>     read(src)
>     posix_fadvise(dst)
>     write(dst)

That could be so - it's just a demo. But readahead should be OK - I only
invalidate from start-of-file up to current-offset-minus-pagesize. So the
cache at and ahead of the linear reader is undisturbed.
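The "start-of-file up to current-offset-minus-pagesize" rule can be sketched as below. The helper names are hypothetical and rounding the end of the range down to a page boundary is an assumption of this sketch, not something stated in the message. The explicit zero-length check matters because a `len` of 0 in `posix_fadvise()` means "through to end of file", which would discard exactly the readahead the rule is meant to protect.

```python
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")


def invalidate_range(offset, page_size=PAGE_SIZE):
    """Return (start, length) covering start-of-file up to offset - page_size.

    The end is rounded down to a page boundary (an assumption of this sketch).
    """
    end = (max(0, offset - page_size) // page_size) * page_size
    return 0, end


def drop_behind(fd, offset, page_size=PAGE_SIZE):
    """Drop cached pages strictly behind a linear reader at `offset`."""
    start, length = invalidate_range(offset, page_size)
    if length > 0:  # length 0 would mean "to end of file" for posix_fadvise
        os.posix_fadvise(fd, start, length, os.POSIX_FADV_DONTNEED)
    return length
```

With a 4096-byte page, a reader at offset 10000 would drop only the first page and leave the page it is currently in, plus everything ahead of it, untouched.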
Re: userspace pagecache management tool
Andrew Morton wrote:
> On Sat, 03 Mar 2007 20:56:27 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
>
> > Andrew Morton wrote:
> >
> > >>> Doing a refault thing would help a bit, but stops working at a
> > >>> certain point.
> > >> At what point does it stop working?
> > >
> > > We need to store that this-page-got-reclaimed info somewhere. I don't
> > > know how space-efficient that is. Did anyone ever do an implementation?
> >
> > One 32 bit word per evicted page that we keep track of.
>
> ok...
>
> I wonder if we really need a new data structure to track that. I mean,
> once a file-backed (or indeed swapcache) page has been reclaimed, its
> radix-tree slot is just sitting there with zeroes in it, asking us to reuse
> that space for something interesting, no?
>
> Of course, if all 64 pages in a radix-tree node get removed, we'll
> currently free the node itself. We could stop doing that, but the effects
> of that might be pretty bad sometimes. Instead, it sounds sensible to
> populate the now-null slot in the parent radix-tree node with an
> average/max/min/per-child-bitmap/whatever of the metrics for the 64
> non-resident pages which that non-leaf slot represents. So as the period
> since a single page got evicted increases and increases, our information
> about its state becomes less and less accurate.
>
> If that inaccuracy is a problem then perhaps we could defer the collapsing
> of a now-empty node into its parent in some manner.

We know exactly how far to defer that collapsing, too.

We know at what rate we rotate through the active list, and
the size of the active list. We also know the rate at which
we reclaim pages, and the size of the inactive list. Combine
the two, and you have an idea roughly how many page faults
there are between the accesses to the coldest page on the
active list.

We don't have to keep the evicted page history beyond that
point, because pages that get refaulted after such a long
interval have a longer inter-reference distance and should
go onto the inactive list - i.e. the default list for unknown
pages.

> > If you can find holes in http://linux-mm.org/PageReplacementDesign
> > please let me know :)
>
> That all looks pretty non-crazy and implementable to me.
>
> Alas, getting the stuff written and working is 1% of the effort. The rest
> is the nasty hunt for new corner-cases and general productisation hassle.
> But if initial results show benefit, I expect we could manage all that.

True, but I've looked through a few hundred VM bugzillas to
validate the design against all the common corner cases, all
the way from RHEL3 (which also has split anon/file lists)
through today.

I'm trying to keep the known-good bits of our policy as much
as possible, introducing big changes only for those corner
cases that plagued multiple VMs in the past.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: userspace pagecache management tool
On Sun, 2007-03-04 at 04:07 -0800, Andrew Morton wrote: > On Sat, 03 Mar 2007 20:56:27 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote: > > > Andrew Morton wrote: > > > > >>> Doing a refault thing would help a bit, but stops working at a certain > > >>> point. > > >> At what point does it stop working? > > > > > > We need to store that this-page-got-reclaimed info somewhere. I don't > > > know > > > how space-efficient that is. Did anyone ever do an implementation? > > > > One 32 bit word per evicted page that we keep track of. > > ok... > > I wonder if we really need a new data structure to track that. I mean, > once a file-backed (or indeed swapcache) page has been reclaimed, its > radix-tree slot is just sitting there with zeroes in it, asking us to reuse > that space for something interesting, no? > > Of course, if all 64 pages in a radix-tree node get removed, we'll > currently free the node itself. We could stop doing that, but the effects > of that might be pretty bad sometimes. Instead, it sounds sensible to > populate the now-null slot in the parent radix-tree node with an > average/max/min/per-child-bitmap/whatever of the metrics for the 64 > non-resident pages which that non-leaf slot represents. So as the period > since a single page got evicted increases and increases, our information > about its state becomes less and less accurate. > > If that inaccuracy is a problem then perhaps we could defer the collapsing > of a now-empty node into its parent in some manner. Getting the refault distance out of such a radix tree would be tricky. One solution I can think of would entail keeping a global fault count and storing the current fault count in the radix node and on refault subtract from the global count. The downside however is this global thing, perhaps we could do some smart percpu count aggregate to fix it. 
The other point you mention is when we reap these radix tree nodes. Normally nonresident information gets dropped once the distance is further than our memory is big, but these nodes don't have explicit order. The collapsing idea is interesting, esp. if we could delay the collapse so that the avg refault distance would be in some relation to the error.
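The global-fault-count idea can be sketched in a few lines. This is only an illustration of the bookkeeping: the class and method names are hypothetical, a plain dict stands in for the radix-tree slots, and the per-CPU aggregation mentioned as a possible fix for the global counter is not modelled.

```python
class RefaultTracker:
    """Toy model: stamp evicted pages with the global fault count."""

    def __init__(self):
        self.faults = 0   # global fault count
        self.stamp = {}   # page -> fault count at eviction (stand-in for radix slots)

    def evict(self, page):
        # Store the current global count where the page pointer used to be.
        self.stamp[page] = self.faults

    def fault(self, page):
        """Record a fault; return the refault distance, or None if unknown."""
        self.faults += 1
        stamp = self.stamp.pop(page, None)
        return None if stamp is None else self.faults - stamp

    def prune(self, memory_pages):
        """Drop history whose distance already exceeds the size of memory."""
        stale = [p for p, s in self.stamp.items()
                 if self.faults - s > memory_pages]
        for page in stale:
            del self.stamp[page]
        return len(stale)
```

The `prune()` step corresponds to dropping nonresident information once the distance is larger than memory. In this toy it scans every entry, which is exactly the cost the lack of explicit ordering would impose.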
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 20:56:27 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote: > Andrew Morton wrote: > > >>> Doing a refault thing would help a bit, but stops working at a certain > >>> point. > >> At what point does it stop working? > > > > We need to store that this-page-got-reclaimed info somewhere. I don't know > > how space-efficient that is. Did anyone ever do an implementation? > > One 32 bit word per evicted page that we keep track of. ok... I wonder if we really need a new data structure to track that. I mean, once a file-backed (or indeed swapcache) page has been reclaimed, its radix-tree slot is just sitting there with zeroes in it, asking us to reuse that space for something interesting, no? Of course, if all 64 pages in a radix-tree node get removed, we'll currently free the node itself. We could stop doing that, but the effects of that might be pretty bad sometimes. Instead, it sounds sensible to populate the now-null slot in the parent radix-tree node with an average/max/min/per-child-bitmap/whatever of the metrics for the 64 non-resident pages which that non-leaf slot represents. So as the period since a single page got evicted increases and increases, our information about its state becomes less and less accurate. If that inaccuracy is a problem then perhaps we could defer the collapsing of a now-empty node into its parent in some manner. > > You mean design it and review the design before coding it? You'll find few > > objections there. > > Few objections, but sadly also very few people interested in > actually reviewing the design :( > > If you can find holes in http://linux-mm.org/PageReplacementDesign > please let me know :) That all looks pretty non-crazy and implementable to me. Alas, getting the stuff written and working is 1% of the effort. The rest is the nasty hunt for new corner-cases and general productisation hassle. But if initial results show benefit, I expect we could manage all that. 
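To make the loss of accuracy from collapsing concrete, here is a small sketch under stated assumptions: each evicted page carries a numeric eviction stamp (for instance a fault count at eviction time), and the parent slot keeps only the minimum and maximum of its 64 children. The min/max summary is just one of the options listed above, and the function names are hypothetical.

```python
SLOTS = 64  # pages per radix-tree node, as in the message above


def collapse(stamps):
    """Summarise a fully evicted node's stamps as (oldest, newest)."""
    if len(stamps) != SLOTS or any(s is None for s in stamps):
        raise ValueError("node must have all 64 pages evicted")
    return min(stamps), max(stamps)


def distance_bounds(summary, now):
    """After collapsing, a page's distance is only known to lie in this range."""
    oldest, newest = summary
    return now - newest, now - oldest
```

The wider the spread of eviction stamps within a node, the wider the range returned by `distance_bounds()`, which is the inaccuracy that deferring the collapse would trade against memory use.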
Re: userspace pagecache management tool
On Sat, 3 Mar 2007 21:35:59 -0500 "Lee Revell" <[EMAIL PROTECTED]> wrote:

> On 3/3/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > But yes, updatedb's pagecache usage will be mainly metadata, and this tool
> > doesn't address metadata pagecache, although it could do so.
>
> With no kernel changes? How? I can't find an equivalent API to
> posix_fadvise() for metadata.

We can use mincore and fadvise against /dev/sda1, too. mincore's linear
search would hurt but you could just run fadvise regularly.

A lot of the blockdev pagecache is pretty useless anyway: we've already
copied much of it into dentries and inodes, and some of ext2/3/4's
pagecache is already pinned by the fs.
Re: userspace pagecache management tool
On 3/3/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> But yes, updatedb's pagecache usage will be mainly metadata, and this tool
> doesn't address metadata pagecache, although it could do so.

With no kernel changes? How? I can't find an equivalent API to
posix_fadvise() for metadata.

Lee
Re: userspace pagecache management tool
Andrew Morton wrote:

>>> Doing a refault thing would help a bit, but stops working at a certain
>>> point.
>> At what point does it stop working?
>
> We need to store that this-page-got-reclaimed info somewhere. I don't know
> how space-efficient that is. Did anyone ever do an implementation?

One 32 bit word per evicted page that we keep track of.

> Of course, the pages need to be re-read again so there's a potential 100%
> hit there, which is in fact not a huge amount in this context. Depends how
> often it occurs (all the time when refault is being useful?) versus what we
> gain from it.

At this point, when we see that a refaulted page is more
active than the coldest page on the active list, we can
also immediately shrink the active list.

That gives the next inactive page a better chance to get
promoted before it gets evicted.

>> I am not asking this to be difficult, I just want to get Linux
>> a VM that does not need to be kludged up every time a distro
>> ships it to its customers.
>
> We have a communication problem here. Please please please work harder to
> get these problems communicated to the MM developers. The only vendor MM
> kludge of which I'm aware is a thing which Andrea is working on to address
> a large-shm-segment versus bulk-IO problem (yup, database). If you have
> enough of an understanding of a problem to be able to develop and
> productise a fix then share that info madly, asap.

The problem is that most of the distro patches are kludges,
which we would rather not see again in future kernels.

They tend to work around the problem, instead of being a
proper fix, since reorganizing the VM in the middle of a
release is not an option.

However, incremental small-to-medium changes might be an
option for the upstream kernel, if you are interested.

>> I believe one starting point would be a concept that people
>> cannot shoot holes in any more. That is no guarantee, but
>> as long as the concept has known holes coding it up is likely
>> to be a waste of time since the code will need kludges to
>> deal with the problems later on and we'd be back to square
>> one.
>
> You mean design it and review the design before coding it? You'll find few
> objections there.

Few objections, but sadly also very few people interested in
actually reviewing the design :(

If you can find holes in http://linux-mm.org/PageReplacementDesign
please let me know :)

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 20:23:07 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote: > Andrew Morton wrote: > > On Sat, 03 Mar 2007 19:01:01 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote: > > >> The use-once policy we have in the kernel should work > >> perfectly fine for backups. All we need to do is > >> actually honor the accessed bit on active page cache > >> pages, instead of flushing them onto the inactive > >> list. > >> > >> What am I overlooking? > > > > That'll improve backups but will break other things. > > > > To do this effectively we'd need to change the policy so that new pagecache > > allocations cause no scanning of used-twice pages at all. So that even > > after many gigs of backing up, the working set is still there. > > > > Problem is, (for example) what about the person who has 80% of memory in > > used-twice state and who then reads a file or files which are 20% or more of > > the size of memory, two or more times. It'll be 100% cache misses, every > > time. > > This will happen quite a lot. IOW, once those pages are in used-twice > > state, > > how does further pagecache activity ever get them _out_ of that state? Only > > by joining the used-twice page set, and that can't happen if the > > used-once-so-far > > pages got reclaimed. > > > > Doing a refault thing would help a bit, but stops working at a certain > > point. > > At what point does it stop working? We need to store that this-page-got-reclaimed info somewhere. I don't know how space-efficient that is. Did anyone ever do an implementation? Of course, the pages need to be re-read again so there's a potential 100% hit there, which is in fact not a huge amount in this context. Depends how often it occurs (all the time when refault is being useful?) versus what we gain from it. > I am not asking this to be difficult, I just want to get Linux > a VM that does not need to be kludged up every time a distro > ships it to its customers. We have a communication problem here. 
Please please please work harder to get these problems communicated to the MM developers. The only vendor MM kludge of which I'm aware is a thing which Andrea is working on to address a large-shm-segment versus bulk-IO problem (yup, database). If you have enough of an understanding of a problem to be able to develop and productise a fix then share that info madly, asap. otoh, rhel-on-the-desktop-or-smaller probably isn't a huge priority, which can be taken advantage of. > I believe one starting point would be a concept that people > cannot shoot holes in any more. That is no guarantee, but > as long as the concept has known holes coding it up is likely > to be a waste of time since the code will need kludges to > deal with the problems later on and we'd be back to square > one. You mean design it and review the design before coding it? You'll find few objections there. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
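The 80%/20% scenario quoted above can be checked with a toy model. The sketch assumes the used-twice pages are never scanned, so new pagecache competes only for the remaining pool, and that the pool behaves as strict LRU. The function name and page counts are made up for illustration, and this is not how the kernel's lists are actually implemented.

```python
from collections import OrderedDict


def cyclic_read_hits(pool_pages, file_pages, passes=2):
    """Count cache hits when a file is read sequentially `passes` times
    through a strict-LRU pool holding `pool_pages` pages."""
    pool = OrderedDict()
    hits = 0
    for _ in range(passes):
        for page in range(file_pages):
            if page in pool:
                hits += 1
                pool.move_to_end(page)        # refresh recency on a hit
            else:
                if len(pool) >= pool_pages:
                    pool.popitem(last=False)  # evict the least recently used page
                pool[page] = True
    return hits
```

With 100 pages of memory and 80 of them pinned in the used-twice state, `cyclic_read_hits(20, 21)` finds no hits at all: each page is evicted just before it is needed again, so it never gets the second reference that would promote it. In this strict-LRU toy a file that fits the pool exactly still hits on the second pass.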
Re: userspace pagecache management tool
Eric St-Laurent wrote:

> While I think that more user space applications should use fadvise() to
> avoid polluting the page cache with unneeded data, I still think the
> kernel should be more fair in regard to page cache management.
>
> Personally, I've experienced some sluggish performance after copying
> large files around. Even more when using NFS. It's difficult to file a
> bug report for "interactive feel", I don't know how to measure it. I
> just feel it's a weak aspect of the OS.

Fairness and interactiveness are very hard to quantify and
measure, which makes it hard to justify patches that improve
this behaviour.

On the other hand, patches that improve benchmark results
are easily justifiable, which makes it easy to merge those
even if it comes at the expense of fairness.

I think fairness and robustness are important, but I have not
figured out a way to justify such changes for upstream inclusion.
Well, except perhaps by coming up with artificial test cases,
but that feels like cheating :)

> My personal opinion is that the VM seems tuned for database-type
> workloads. Of course, making the page cache more fair to prevent one
> process from using most of it will most likely slow down database-type
> applications.

The database people disagree.

For one, the accessed bit on active page cache pages gets ignored,
so Linux does not properly keep the most actively used page
cache pages in memory.

Secondly, the VM can waste quite a lot of time scanning over
the anonymous pages that it does not even want to evict from
memory. If the VM does not plan on evicting anonymous memory
(or shared memory segments), why waste CPU time scanning them
and randomizing their LRU order?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: userspace pagecache management tool
On Sat, 3 Mar 2007 20:16:09 -0500 "Lee Revell" <[EMAIL PROTECTED]> wrote: > On 3/3/07, Andrew Morton <[EMAIL PROTECTED]> wrote: > > The tool uses an LD_PRELOAD hack to intercept glibc's read(), pread(), > > write(), pwrite(), close() and dup2() functions. pagecache control is done > > via posix_fadvise() and sync_file_range(). > > > > How could this have any effect on the updatedb problem? updatedb does > not read() anything, it just open()s and stat()s every file on the > disk. > err, good point. _one_ of those dang things which goes off when you've stayed up too late does a lot of pagecache IO, not sure which one. Maybe rpmq? But I'd expect that to be doing direct-io. But yes, updatedb's pagecache usage will be mainly metadata, and this tool doesn't address metadata pagecache, although it could do so. It instantiated 5MB of pagecache and 20MB of slab, took about one minute. rpm uses rather a lot of pagecache. So yes, it looks like updatedb is a slab problem. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
Andrew Morton wrote:
> On Sat, 03 Mar 2007 19:01:01 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

>> The use-once policy we have in the kernel should work
>> perfectly fine for backups. All we need to do is
>> actually honor the accessed bit on active page cache
>> pages, instead of flushing them onto the inactive
>> list.
>>
>> What am I overlooking?
>
> That'll improve backups but will break other things.
>
> To do this effectively we'd need to change the policy so that new pagecache
> allocations cause no scanning of used-twice pages at all. So that even
> after many gigs of backing up, the working set is still there.
>
> Problem is, (for example) what about the person who has 80% of memory in
> used-twice state and who then reads a file or files which are 20% or more of
> the size of memory, two or more times. It'll be 100% cache misses, every
> time.
> This will happen quite a lot. IOW, once those pages are in used-twice state,
> how does further pagecache activity ever get them _out_ of that state? Only
> by joining the used-twice page set, and that can't happen if the
> used-once-so-far pages got reclaimed.
>
> Doing a refault thing would help a bit, but stops working at a certain
> point.

At what point does it stop working?

I am not asking this to be difficult, I just want to get Linux
a VM that does not need to be kludged up every time a distro
ships it to its customers.

I believe one starting point would be a concept that people
cannot shoot holes in any more. That is no guarantee, but
as long as the concept has known holes coding it up is likely
to be a waste of time since the code will need kludges to
deal with the problems later on and we'd be back to square
one.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 17:02:31 -0800 Ray Lee <[EMAIL PROTECTED]> wrote: > Andrew Morton wrote: > > > > Would there be any other users of it than updatedb? updatedb is the notorious one. Alas, one can envisage sane workloads which really really really really want to cache millions of dentries and inodes. Workloads which get run more often than once-per-day-at-4AM. So if we "fix" updatedb, those people with kernel-indistinguishable workloads get real unhappy. We have to pass instructions to the kernel to resolve this. A sys_reclaim_dentry() would also need a sys_is_dentry_there() so updatedb could restore the previous state. Probably it'd be better to fix the well-known internal fragmentation problem we have with the VFS caches. That's fairly hard. > I'm not coming up > with much, but given that I'm not always clever, that doesn't mean much. > > A hypothetical on-demand file virus scanner is > going to hit already cached or about-to-be-cached entries by definition. > Perhaps some system audit daemon, such as tripwire. Well, that has the > same access patterns as updatedb, doesn't it: a directory at a time. > find, cp -a, the same. > > So instead of sys_reclaim_dentry, how about extending fadvise to work on > the fd returned via opendir? That'd be pretty simple, but a) would reclaim the pagecache for the directory and not the dentry object itself and b) will only be easy to do for ext2 and minixfs, which maintain a separate pagecache per directory. Yes we could do a "nuke all the dentries in this directory thing", but that's equivalent to sys_reclaim_dentry() in a loop. > And extending POSIX_FADV_NOREUSE on a file > fd to drop the dentry at close? > > (Call me chicken; I just don't want to be the guy suggesting a new > syscall for a single or few users.) > > ~ ~ > > Alternately, there have been requests for a way for userspace to get > notification of all file events for indexing of data and metadata > (inotify, unfortunately, doesn't scale to a full filesystem). (cf. 
> http://lkml.org/lkml/2006/9/30/98 .) yes, that's a disappointment. > That'd allow an updatedb daemon to > keep the index up to date all the time, amortizing the cost. More > usefully, it'd allow a content indexing daemon to stay up to date all > the time, though inotify mostly works for those, I suppose. > > (Hmm... > [EMAIL PROTECTED]:~$ find ~ -type d | wc -l > 14067 > > ...right. So it probably works fine for normal people.) > > Hey, waitaminute. This should be a solved problem? SELinux must have > some sort of requirement for logging file access attempts. Google, at > least, implies so. Perhaps whatever it implements could be lifted into > the core kernel without dragging the rest behind it. Maybe the syscall auditing code can be persuaded to spit out records which can be used for this. > Dunno. Who do we CC? That's a problem. Nobody and everybody. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
On 3/3/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> The tool uses an LD_PRELOAD hack to intercept glibc's read(), pread(), write(), pwrite(), close() and dup2() functions. pagecache control is done via posix_fadvise() and sync_file_range().

How could this have any effect on the updatedb problem? updatedb does not read() anything, it just open()s and stat()s every file on the disk.

Lee
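For readers following along, the mechanism quoted above (write the data back, then tell the kernel the cached pages are no longer needed) can be sketched in a few lines. This is an illustration only, not the tool's code: the function name `copy_dropping_output_cache` is made up here, and `os.fsync()` stands in for `sync_file_range()`, which Python's standard library does not expose.

```python
import os

CHUNK = 1 << 20  # copy 1 MiB at a time


def copy_dropping_output_cache(src_path, dst_path):
    """Copy src to dst, then ask the kernel to drop dst's pagecache pages.

    Hypothetical helper for illustration; the real tool does this from an
    LD_PRELOAD shim around write()/close().
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            buf = src.read(CHUNK)
            if not buf:
                break
            dst.write(buf)
        dst.flush()
        # Dirty pages cannot be dropped, so force writeback first.
        # (Stand-in for sync_file_range(), which the tool is said to use.)
        os.fsync(dst.fileno())
        # posix_fadvise() is not available on every platform Python runs on.
        if hasattr(os, "posix_fadvise"):
            # offset=0, length=0 means "from the start to the end of file".
            os.posix_fadvise(dst.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
```

Note that nothing here touches dentries or inodes, which is the point of the question above: a program that only open()s and stat()s gives such a shim nothing to act on.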
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 19:14:59 -0500 Eric St-Laurent <[EMAIL PROTECTED]> wrote: > On Sat, 2007-03-03 at 12:29 -0800, Andrew Morton wrote: > > > > There is much more which could be done to make this code smarter, but I > > think the lesson here is that we can produce a far, far better result doing > > this work in userspace than we could ever hope to do with an in-kernel > > implementation. There are some enhancement suggestions in the > > documentation file. > > While I think that more user space applications should use fadvise() to > avoid polluting the page cache with unneeded data, I still think the > kernel should be more fair in regard to page cache management. > > Personally, I've experienced some sluggish performance after copying > large files around. Even more when using NFS. It's difficult to file a > bug report for "interactive feel", I don't know how to measure it. I > just feel it's a weak aspect of the OS. yeah. It'd be worth spending some time, try to come up with some set of commands which produce an effect which you find objectionable. > Surely it's possible to make the kernel a little bit better to protect > the page cache from abuse, from simple or badly designed applications. > > Why fairness is provided by the process scheduler with good results, yet > it somewhat easy for a process to cause slowdowns from page cache usage. > > My personal opinion is that the VM seem tuned for database types > workloads. VM hasn't actually been tuned *for* anything much at all, really. Looking back on it, much of the tweaking in there has been to avoid really bad situations. We put much work into avoiding the 100%, 1000% or 1% slowdowns, but not a lot of work into providing the 15% speedups. So it may well be that the result is not particularly great at anything, but it's also not horridly bad at anything, either. Or at least, it's not supposed to be. 
> Of course, making the page cache more fair to prevent one
> process to use most of it will most likely slowdown database type
> applications.

Databases actually like to manage their own cache via various means. There are some situations in which bulk IO activities can trash the database's cache. That's the sort of thing which this tool is trying to help address.

> Maybe the situation should be reversed, much like the process scheduler.
> Fairness by default, and the possibility to request for more system
> resources by asking for them with necessary privileges. Much like
> SCHED_FIFO policy.

Well. If the CPU scheduler makes a mistake, we see 5% or 15% degradations. If the VM makes a mistake (or fails to read the operator's mind), we go to disk and can suffer 1000% degradations or worse.
Re: userspace pagecache management tool
On Sat, 2007-03-03 at 12:29 -0800, Andrew Morton wrote:
> There is much more which could be done to make this code smarter, but I
> think the lesson here is that we can produce a far, far better result doing
> this work in userspace than we could ever hope to do with an in-kernel
> implementation. There are some enhancement suggestions in the
> documentation file.

While I think that more user space applications should use fadvise() to avoid polluting the page cache with unneeded data, I still think the kernel should be fairer in its page cache management.

Personally, I've experienced sluggish performance after copying large files around, even more so when using NFS. It's difficult to file a bug report for "interactive feel"; I don't know how to measure it. I just feel it's a weak aspect of the OS.

Surely it's possible to make the kernel a little bit better at protecting the page cache from abuse by simple or badly designed applications.

Why is fairness provided by the process scheduler with good results, yet it is somewhat easy for a process to cause slowdowns through page cache usage?

My personal opinion is that the VM seems tuned for database-type workloads. Of course, making the page cache fairer, to prevent one process from using most of it, will most likely slow down database-type applications.

Maybe the situation should be reversed, much like the process scheduler: fairness by default, and the possibility to request more system resources with the necessary privileges, much like the SCHED_FIFO policy.

- Eric
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 19:01:01 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote: > Andrew Morton wrote: > > On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote: > > > >> backup program > > > > A suitable policy for a backup program would probably be to invalidate any > > output file(s) and to invalidate those pages of the input files which were > > not in cache when the backup program first opened those files. That way > > the backup program will have no effect on the cache state, except for the > > race situation where someone read an uncached file while the backup program > > was reading from it too. > > The use-once policy we have in the kernel should work > perfectly fine for backups. All we need to do is > actually honor the accessed bit on active page cache > pages, instead of flushing them onto the inactive > list. > > What am I overlooking? That'll improve backups but will break other things. To do this effectively we'd need to change the policy so that new pagecache allocations cause no scanning of used-twice pages at all. So that even after many gigs of backing up, the working set is still there. Problem is, (for example) what about the person who has 80% of memory in used-twice state and who then reads a file or files which are 20% or more of the size of memory, two or more times. It'll be 100% cache misses, every time. This will happen quite a lot. IOW, once those pages are in used-twice state, how does further pagecache activity ever get them _out_ of that state? Only by joining the used-twice page set, and that can't happen if the used-once-so-far pages got reclaimed. Doing a refault thing would help a bit, but stops working at a certain point. > > This can be added in an hour or two with no kernel changes (use mincore). > > mincore only works for mmaped areas, we'd need an fincore > to work with file handles. The LD_PRELOAD code has the fd and can mmap it to perform the pagecache probe. 
fincore() would be a bit neater, but given the rarity with which mincore() is used it's perhaps hard to justify adding a slightly more efficient and slightly more convenient subset of mincore(). - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
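The probe described in this message (map the file, then ask mincore() which pages are resident) might look roughly like the sketch below. It is an assumption-laden illustration: `resident_pages` is a made-up name, and the libc calls are reached through ctypes because Python has no mincore() wrapper of its own. Mapping the file does not fault any pages in, so the probe itself should not change what is cached.

```python
import ctypes
import ctypes.util
import mmap
import os

# Bind the three libc calls we need; signatures follow the man pages.
_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.mmap.restype = ctypes.c_void_p
_libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                       ctypes.c_int, ctypes.c_int, ctypes.c_long]
_libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
_libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                          ctypes.POINTER(ctypes.c_ubyte)]
_MAP_FAILED = ctypes.c_void_p(-1).value


def resident_pages(fd):
    """Return one bool per page of the file: True if mincore() reports it cached."""
    size = os.fstat(fd).st_size
    if size == 0:
        return []  # a zero-length mapping is not allowed
    npages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    addr = _libc.mmap(None, size, mmap.PROT_READ, mmap.MAP_SHARED, fd, 0)
    if addr == _MAP_FAILED:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    try:
        vec = (ctypes.c_ubyte * npages)()
        if _libc.mincore(addr, size, vec) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
        # Only the least significant bit of each byte is defined.
        return [bool(b & 1) for b in vec]
    finally:
        _libc.munmap(addr, size)
```

One caveat worth checking on a current system: later kernels are, as far as I know, more conservative about what mincore() reports for files the caller could not write to, so the result may be less informative than it was in 2007.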
Re: userspace pagecache management tool
Andrew Morton wrote: > Would there be any other users of it than updatedb? I'm not coming up with much, but given that I'm not always clever, that doesn't mean much. A hypothetical on-demand file virus scanner is going to hit already cached or about-to-be-cached entries by definition. Perhaps some system audit daemon, such as tripwire. Well, that has the same access patterns as updatedb, doesn't it: a directory at a time. find, cp -a, the same. So instead of sys_reclaim_dentry, how about extending fadvise to work on the fd returned via opendir? And extending POSIX_FADV_NOREUSE on a file fd to drop the dentry at close? (Call me chicken; I just don't want to be the guy suggesting a new syscall for a single or few users.) ~ ~ Alternately, there have been requests for a way for userspace to get notification of all file events for indexing of data and metadata (inotify, unfortunately, doesn't scale to a full filesystem). (cf. http://lkml.org/lkml/2006/9/30/98 .) That'd allow an updatedb daemon to keep the index up to date all the time, amortizing the cost. More usefully, it'd allow a content indexing daemon to stay up to date all the time, though inotify mostly works for those, I suppose. (Hmm... [EMAIL PROTECTED]:~$ find ~ -type d | wc -l 14067 ...right. So it probably works fine for normal people.) Hey, waitaminute. This should be a solved problem? SELinux must have some sort of requirement for logging file access attempts. Google, at least, implies so. Perhaps whatever it implements could be lifted into the core kernel without dragging the rest behind it. Dunno. Who do we CC? Ray - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
Andrew Morton wrote:
> On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
>
>> backup program
>
> A suitable policy for a backup program would probably be to invalidate any
> output file(s) and to invalidate those pages of the input files which were
> not in cache when the backup program first opened those files. That way
> the backup program will have no effect on the cache state, except for the
> race situation where someone read an uncached file while the backup program
> was reading from it too.

The use-once policy we have in the kernel should work perfectly fine for backups. All we need to do is actually honor the accessed bit on active page cache pages, instead of flushing them onto the inactive list.

What am I overlooking?

> This can be added in an hour or two with no kernel changes (use mincore).

mincore only works for mmaped areas, we'd need an fincore to work with file handles.

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: userspace pagecache management tool
On Sun, 4 Mar 2007 00:01:55 +0100 bert hubert <[EMAIL PROTECTED]> wrote: > On Sat, Mar 03, 2007 at 02:26:09PM -0800, Andrew Morton wrote: > > > > It is *not* a global instruction. It uses setenv, so the user's policy > > > > affects only the target process and its forked children. > > > > > > ... and all other processes accessing the same file(s)! > > > > > > Your library and the system calls may be limited to one process, > > > but the consequences are global. > > > > Yes. So what? If the user wants to go and evict libc.so from pagecache > > then he can do so - the kernel has provided syscalls with which this can be > > done for at least seven years. Bad user, shouldn't do that. > > While I agree with your sentiments that userspace can have a good idea on > how to deal with the page cache, your program does more than it claims to > do - because of how linux implements posix_fadvise. > > I don't think anybody expects or desires your program to actually *evict* > the stuff from the cache you are trying access, which happens in case the > data was in the cache prior to starting your program. > > What people expect is that a solution such as you wrote it simply won't > *add* anything to the cache. They don't expect it will actually globally > *remove* stuff from the cache. > > Making a backup this way would hurt even worse than usual with your > pagecache management tool if the file being backupped was still being read. > > This is not your fault, but in practice, it makes your program less useful > than it could be. yup. As I said, it's a proof-of-concept. It's a project. And I have about one free femtosecond per fortnight :( > One could conceivably fix that up using mincore and simply not fadvise if a > page was in core already. Yes. 
Let's flesh out the backup program policy some more:

- Unconditionally invalidate output files.

- On entry to read(), probe pagecache and record which pages in the range are present.

- On entry to the next read(), shoot down those pages from the previous read which weren't in pagecache.

- But we can do better! LRU the file's pages up to a certain number of pages.

- Once that point is exceeded, we need to reclaim some pages. Which ones? Well, we've been observing all reads, so we can record which pages were referenced once and which ones were referenced multiple times, so we can do arbitrarily complex page aging in there.

- On close(), nuke all pages which weren't in core during open(), even if this app referenced them multiple times.

- If the backup program decided to read its input files with mmap we're rather screwed. We can't intercept pagefaults, so the best we can do is to restore the file's pagecache to its previous state on close(). Or if it's really a problem, get control in there somehow and periodically poll the pagecache occupancy via mincore(), use madvise() then fadvise() to trim it back.

That all sounds reasonably doable. It'd be pretty complex to do it in-kernel but we could do it there too. Problem is of course that the above strategy is explicitly optimised for the backup program and if it's in-kernel it becomes applicable to all other workloads.
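The "probe on read(), shoot down on the next read()" steps of this policy reduce to a small piece of bookkeeping. The sketch below assumes the probe result is already available as a list of booleans (for example from a mincore() call); `ranges_to_drop` and `shoot_down` are hypothetical names, and the LRU and page-aging refinements are left out.

```python
import os


def ranges_to_drop(resident_before, first_page, page_size):
    """Turn a pre-read residency probe into byte ranges that are safe to drop.

    resident_before[i] is True if page (first_page + i) was already in
    pagecache before our read; those pages belong to someone else and are
    left alone. Adjacent uncached pages are merged into one
    (offset, length) range.
    """
    ranges = []
    run_start = None
    for i, was_resident in enumerate(resident_before):
        if not was_resident:
            if run_start is None:
                run_start = i  # a new run of pages we brought in
        elif run_start is not None:
            ranges.append(((first_page + run_start) * page_size,
                           (i - run_start) * page_size))
            run_start = None
    if run_start is not None:
        ranges.append(((first_page + run_start) * page_size,
                       (len(resident_before) - run_start) * page_size))
    return ranges


def shoot_down(fd, ranges):
    """Ask the kernel to drop each range; called on entry to the next read()."""
    for offset, length in ranges:
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
```

For a five-page read where only the first and fourth pages were cached beforehand, `ranges_to_drop([True, False, False, True, False], 0, 4096)` yields two ranges: pages 1-2 and page 4. The close() rule in the list is the same computation applied to the probe taken at open().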
Re: userspace pagecache management tool
On Sat Mar 03, 2007 at 02:26:09PM -0800, Andrew Morton wrote: > On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote: > > > > It is *not* a global instruction. It uses setenv, so the user's policy > > > affects only the target process and its forked children. > > > > ... and all other processes accessing the same file(s)! > > > > Your library and the system calls may be limited to one process, > > but the consequences are global. > > Yes. So what? If the user wants to go and evict libc.so from pagecache > then he can do so - the kernel has provided syscalls with which this can be > done for at least seven years. Bad user, shouldn't do that. I think what Rik is pointing out is that as currently implemented, posix_fadvise is a much bigger hammer than is generally useful or desirable. Using posix_fadvise on the other hand says "immediately drop this stuff from the pagecache, consequences be damned". If someone else happens to be using the specified data, well too bad, they suffer collateral damage. Process A can, maliciously or ignorantly, deny service to process B. On the other hand, your old but super cool O_STREAMING patch took a kinder gentler approach, where applications could tell the kernel "please do not keep this file descriptor's data in cache on my account since I will not reuse it." If someone else however was using the same data, the kernel would keep things cached as usual and thereby avoid doing collateral damage. -Erik -- Erik B. Andersen http://codepoet-consulting.com/ --This message was written using 73% post-consumer electrons-- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
On Sat, 3 Mar 2007 14:58:48 -0800 "Ray Lee" <[EMAIL PROTECTED]> wrote:
> On 3/3/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > It is to address the "waah, backups fill my memory with pagecache" and the
> > "waah, updatedb swapped everything out" and the "waah, copying a DVD
> > gobbled all my memory" problems.
>
> Is the updatedb problem really due to pagecache?

It's a combination of pagecache, slab cache and of course contention for the disk. In my experience the latter preponderates: the disk is seeking like mad and I can't get its attention.

Others report lots of swapout, which will be a combination of slab and pagecache, varying degrees of each.

> > When running
> >
> > pagecache-management.sh dd if=100-mb-file of=foo
> > or
> > pagecache-management.sh cp -a /usr/src/linux-2.6.20 /usr/src/foo
> >
> > the amount of pagecache in the machine is pretty much unaltered. Maybe a
> > megabyte of additional cache in the second case, because of ext3 indirect
> > blocks.
>
> [EMAIL PROTECTED]:~/work/home/pagecache-management$ grep ext3_i /proc/slabinfo; ./pagecache-management.sh sudo updatedb; grep ext3_i /proc/slabinfo
> ext3_inode_cache 21024 23722 1584 2 1 : tunables 24 12 0 : slabdata 11861 11861 0
> ext3_inode_cache 41332 41332 1584 2 1 : tunables 24 12 0 : slabdata 20666 20666 0
> [EMAIL PROTECTED]:~/work/home/pagecache-management$ echo $(( 1584 * (41332-21024) ))
> 32167872

If 32 MB is the whole lot then by eliminating pagecache, we just solved the problem. But perhaps you instantiated a lot more VFS cache and all you're seeing there is the leftovers.

> Or is there a /proc/sys/vm/* knob that can be tweaked for this
> before/after the updatedb?

/proc/sys/vm/vfs_cache_pressure should help. I don't recall anyone reporting its effects with updatedb.

> But yeah, I for one would happily submit patches to upstream authors
> to address this there. There's no reason code should be making the
> kernel guess its intention on these things.

I think so.
We're dealing with super-special cases here and often trying to fix those in-kernel will degrade other, often more common cases. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
On Sat, Mar 03, 2007 at 02:26:09PM -0800, Andrew Morton wrote:
> > > It is *not* a global instruction. It uses setenv, so the user's policy
> > > affects only the target process and its forked children.
> >
> > ... and all other processes accessing the same file(s)!
> >
> > Your library and the system calls may be limited to one process,
> > but the consequences are global.
>
> Yes. So what? If the user wants to go and evict libc.so from pagecache
> then he can do so - the kernel has provided syscalls with which this can be
> done for at least seven years. Bad user, shouldn't do that.

While I agree with your sentiment that userspace can have a good idea of how to deal with the page cache, your program does more than it claims to do, because of how Linux implements posix_fadvise.

I don't think anybody expects or desires your program to actually *evict* from the cache the data you are trying to access, which is what happens if that data was in the cache before your program started.

What people expect is that a solution such as the one you wrote simply won't *add* anything to the cache. They don't expect it to actually globally *remove* stuff from the cache.

Making a backup this way would hurt even worse than usual with your pagecache management tool if the file being backed up was still being read.

This is not your fault, but in practice, it makes your program less useful than it could be.

One could conceivably fix that up using mincore and simply not fadvise if a page was in core already.

Bert

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services
Re: userspace pagecache management tool
On 3/3/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> It is to address the "waah, backups fill my memory with pagecache" and the
> "waah, updatedb swapped everything out" and the "waah, copying a DVD
> gobbled all my memory" problems.

Is the updatedb problem really due to pagecache?

> When running
>
> pagecache-management.sh dd if=100-mb-file of=foo
> or
> pagecache-management.sh cp -a /usr/src/linux-2.6.20 /usr/src/foo
>
> the amount of pagecache in the machine is pretty much unaltered. Maybe a
> megabyte of additional cache in the second case, because of ext3 indirect
> blocks.

[EMAIL PROTECTED]:~/work/home/pagecache-management$ grep ext3_i /proc/slabinfo; ./pagecache-management.sh sudo updatedb; grep ext3_i /proc/slabinfo
ext3_inode_cache 21024 23722 1584 2 1 : tunables 24 12 0 : slabdata 11861 11861 0
ext3_inode_cache 41332 41332 1584 2 1 : tunables 24 12 0 : slabdata 20666 20666 0
[EMAIL PROTECTED]:~/work/home/pagecache-management$ echo $(( 1584 * (41332-21024) ))
32167872

Or is there a /proc/sys/vm/* knob that can be tweaked for this before/after the updatedb?

But yeah, I for one would happily submit patches to upstream authors to address this there. There's no reason code should be making the kernel guess its intention on these things.

Ray
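The shell arithmetic in this message can be repeated for any slab cache with a few lines of script. The sketch below assumes the usual /proc/slabinfo column order (name, active_objs, num_objs, objsize, ...) and takes the two ext3_inode_cache lines from the message as sample input, with the object size read as 1584 bytes to match the echo command; `slab_growth_bytes` is a made-up helper name.

```python
def slab_growth_bytes(before_line, after_line):
    """Estimate how many bytes of live objects a slab cache gained.

    Each argument is one line of /proc/slabinfo for the same cache.
    Assumed columns: name, active_objs, num_objs, objsize, ...
    """
    before = before_line.split()
    after = after_line.split()
    if before[0] != after[0]:
        raise ValueError("lines describe different slab caches")
    objsize = int(after[3])
    # Same calculation as the message: objsize * (active_after - active_before)
    return objsize * (int(after[1]) - int(before[1]))


BEFORE = "ext3_inode_cache 21024 23722 1584 2 1 : tunables 24 12 0 : slabdata 11861 11861 0"
AFTER = "ext3_inode_cache 41332 41332 1584 2 1 : tunables 24 12 0 : slabdata 20666 20666 0"

print(slab_growth_bytes(BEFORE, AFTER))  # 32167872, i.e. roughly 32 MB
```

This counts active objects only; partially filled slabs hold on to more memory than the figure suggests.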
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote: > backup program A suitable policy for a backup program would probably be to invalidate any output file(s) and to invalidate those pages of the input files which were not in cache when the backup program first opened those files. That way the backup program will have no effect on the cache state, except for the race situation where someone read an uncached file while the backup program was reading from it too. This can be added in an hour or two with no kernel changes (use mincore). - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 17:28:35 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote: > Andrew Morton wrote: > > On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote: > > > >>> It is *not* a global instruction. It uses setenv, so the user's policy > >>> affects only the target process and its forked children. > >> ... and all other processes accessing the same file(s)! > >> > >> Your library and the system calls may be limited to one process, > >> but the consequences are global. > > > > Yes. So what? If the user wants to go and evict libc.so from pagecache > > then he can do so - the kernel has provided syscalls with which this can be > > done for at least seven years. Bad user, shouldn't do that. > > Are you saying the user should not use your script with their > backup program? No. This is getting silly. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
> Andrew Morton wrote:
>
> > Well, backup programs are a unique case. Let's say instead that the user
> > has just generated a 600MB ISO image.
> >
> > The kernel *just doesn't know* whether the user will next try to read the
> > kernel tree or will next try to read that ISO image.
> >
> > That, Rik, is my point, and is the entire point of this work.
>
> I still don't understand why "the backup program flushed my data out
> of the cache with POSIX_FADV_DONTNEED" is an improvement over "the
> backup program flushed my data out of the cache by reading other files".

Oh. Well, yes, if the user elected to instruct the backup program to invalidate both its input files and its output files and if it's a full dump, you end up with nothing in pagecache. Possibly a more sensible setting would be to invalidate only the output.

But having some batch program come in from the side and perform a bulk read of your present working set isn't very common.

> Your code may be useful for a few specialized situations,

That's quite wrong. It is useful for a great number of well-known problem scenarios, all of which are *already* "specialized situations". It's *you* who is chasing down the 1% scenarios and portraying them as general problems. Backups only happen once in 24 hours, for example.

> but I don't
> see it actually fixing most of the examples you gave in your
> announcement, except for the DVD copying one.

I don't know how much benefit it will provide for the updatedb problem - I expect it'll help sometimes. otoh maybe it'll worsen the existing slab internal fragmentation problem, dunno.

But the other scenarios it solves completely and optimally.
Re: userspace pagecache management tool
Andrew Morton wrote:
> On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
>
>>> It is *not* a global instruction. It uses setenv, so the user's policy
>>> affects only the target process and its forked children.
>> ... and all other processes accessing the same file(s)!
>>
>> Your library and the system calls may be limited to one process,
>> but the consequences are global.
>
> Yes. So what? If the user wants to go and evict libc.so from pagecache
> then he can do so - the kernel has provided syscalls with which this can be
> done for at least seven years. Bad user, shouldn't do that.

Are you saying the user should not use your script with their backup program?

Then what's the point?

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote: > > It is *not* a global instruction. It uses setenv, so the user's policy > > affects only the target process and its forked children. > > ... and all other processes accessing the same file(s)! > > Your library and the system calls may be limited to one process, > but the consequences are global. Yes. So what? If the user wants to go and evict libc.so from pagecache then he can do so - the kernel has provided syscalls with which this can be done for at least seven years. Bad user, shouldn't do that. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
Andrew Morton wrote:

> Well, backup programs are a unique case. Let's say instead that the user
> has just generated a 600MB ISO image.
>
> The kernel *just doesn't know* whether the user will next try to read the
> kernel tree or will next try to read that ISO image.
>
> That, Rik, is my point, and is the entire point of this work.

I still don't understand why "the backup program flushed my data out of the cache with POSIX_FADV_DONTNEED" is an improvement over "the backup program flushed my data out of the cache by reading other files".

Your code may be useful for a few specialized situations, but I don't see it actually fixing most of the examples you gave in your announcement, except for the DVD copying one.

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: userspace pagecache management tool
Andrew Morton wrote:
> On Sat, 3 Mar 2007 22:41:09 +0100 bert hubert <[EMAIL PROTECTED]> wrote:
>
>>> How can you make global policy decisions based on the intent
>>> of one program?
>> By not doing so.
>
> yup.
>
>> Andrew's program is fine in principle, except that the
>> linux kernel treats the communication of a program's intent as a global
>> instruction.
>
> argh. That felt good - let's do it again. argh.
>
> It is *not* a global instruction. It uses setenv, so the user's policy
> affects only the target process and its forked children.

... and all other processes accessing the same file(s)!

Your library and the system calls may be limited to one process, but the consequences are global.

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: userspace pagecache management tool
On Sat, 3 Mar 2007 22:41:09 +0100 bert hubert <[EMAIL PROTECTED]> wrote:
> > How can you make global policy decisions based on the intent
> > of one program?
>
> By not doing so.

yup.

> Andrew's program is fine in principle, except that the
> linux kernel treats the communication of a program's intent as a global
> instruction.

argh. That felt good - let's do it again. argh.

It is *not* a global instruction. It uses setenv, so the user's policy affects only the target process and its forked children.

> Also, Andrew's description is a tad misleading, as the size of the page
> cache might be altered a lot in case content is accessed that was previously
> cached!

That's true. Although if the user knows that he'll want to use that data again soon, and he elects to purge it all from cache beforehand then we're dealing with a pretty dumb user.

If this user doesn't plan to use the data again, but some other user does, then he loses.

> With my userspace hat on, I'd love to have a proper way to communicate my
> *program's* expectations to the kernel, without stomping other programs.

That is the aim and effect of this work.

> Also with the same hat on, I hope to rarely *need* to communicate my
> expectations because the kernel correctly predicts many cases.

yup. It's those odd cases where it goes wrong which I'm addressing here. And of course there's no way for the kernel to work out that it's about to go wrong - kernel can't read user's mind.

Well. That's not strictly true. One could envisage a database-backed learning program which observes the user's work patterns and, based on various pattern-matchings, works out what the best strategy is likely to be when the user starts some operation which he has performed previously.
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 16:30:56 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Sat, 03 Mar 2007 15:40:42 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
> >
> > > I am sick and tired of the "this is hard, let userspace do it" attitude.
> >
> > Anything you try to do in-kernel will catastrophically screw up some
> > workloads. You don't have a chance of getting this right.
>
> Any time you follow the directions of one userspace program,
> you can screw up others. I suspect that userspace has far
> less of a chance of getting it right than the kernel.
>
> ALSA would be a good example of why it is bad to export
> tuning knobs directly to userspace - many sound cards have
> non-standard names for the volume controls, making it almost
> impossible for userspace to present the user with a simple
> user interface for tweaking the volume.

What on earth are you talking about? Please, go and look at the thing.

> > You are the kernel. The user just read an entire kernel tree. You face a
> > binary decision: do you cache that tree or do you not? Your time starts
> > now. What is your answer?
>
> Let's turn this around.
>
> The user has been accessing the kernel tree over and over
> again, for hours on end (compile testing a patch). Along
> comes a backup program, that tells you to evict the whole
> thing from the cache.
>
> What do you do?

Well, backup programs are a unique case. Let's say instead that the user has just generated a 600MB ISO image. The kernel *just doesn't know* whether the user will next try to read the kernel tree or will next try to read that ISO image.

That, Rik, is my point, and is the entire point of this work.

> How can you make global policy decisions based on the intent
> of one program?

You can't, that's why I did this work.

> Only the kernel knows the state of the whole system and has
> observed the behaviour of all the processes.

The kernel knows the past, and tries to predict the future from that past. Sometimes, as you well know, that goes badly wrong. That's why I did this work.

> One process has
> no idea what the other processes in the system are doing.

argh. Please, next time click on the link?

http://userweb.kernel.org/~akpm/pagecache-management/
Re: userspace pagecache management tool
On Sat, Mar 03, 2007 at 04:30:56PM -0500, Rik van Riel wrote:

> The user has been accessing the kernel tree over and over
> again, for hours on end (compile testing a patch). Along
> comes a backup program, that tells you to evict the whole
> thing from the cache.

This is arguably due to a linux misimplementation of posix_fadvise. SuS v3 clearly states:

    The posix_fadvise() function shall advise the implementation on the
    expected behavior of the application with respect to the data in the
    file associated with the open file descriptor

Note how it refers to the *application*. This is reiterated here:

    POSIX_FADV_WILLNEED
        Specifies that the application expects to access the specified
        data in the near future.
    POSIX_FADV_DONTNEED
        Specifies that the application expects that it will not access
        the specified data in the near future.
    POSIX_FADV_NOREUSE
        Specifies that the application expects to access the specified
        data once and then not reuse it thereafter.

Linux however implements posix_fadvise globally:

    POSIX_FADV_DONTNEED attempts to free cached pages associated with the
    specified region.

    POSIX_FADV_WILLNEED and POSIX_FADV_NOREUSE both initiate a
    non-blocking read of the specified region into the page cache.

> How can you make global policy decisions based on the intent
> of one program?

By not doing so. Andrew's program is fine in principle, except that the linux kernel treats the communication of a program's intent as a global instruction.

Also, Andrew's description is a tad misleading, as the size of the page cache might be altered a lot in case content is accessed that was previously cached!

With my userspace hat on, I'd love to have a proper way to communicate my *program's* expectations to the kernel, without stomping other programs.

Also with the same hat on, I hope to rarely *need* to communicate my expectations because the kernel correctly predicts many cases.

Bert

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services
Re: userspace pagecache management tool
Andrew Morton wrote:
> On Sat, 03 Mar 2007 15:40:42 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
>
> > I am sick and tired of the "this is hard, let userspace do it" attitude.
>
> Anything you try to do in-kernel will catastrophically screw up some
> workloads. You don't have a chance of getting this right.

Any time you follow the directions of one userspace program, you can screw up others. I suspect that userspace has far less of a chance of getting it right than the kernel.

ALSA would be a good example of why it is bad to export tuning knobs directly to userspace - many sound cards have non-standard names for the volume controls, making it almost impossible for userspace to present the user with a simple user interface for tweaking the volume.

> You are the kernel. The user just read an entire kernel tree. You face a
> binary decision: do you cache that tree or do you not? Your time starts
> now. What is your answer?

Let's turn this around.

The user has been accessing the kernel tree over and over again, for hours on end (compile testing a patch). Along comes a backup program, that tells you to evict the whole thing from the cache.

What do you do?

How can you make global policy decisions based on the intent of one program?

Only the kernel knows the state of the whole system and has observed the behaviour of all the processes. One process has no idea what the other processes in the system are doing.

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 15:40:42 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
>
> > It is to address the "waah, backups fill my memory with pagecache" and the
> > "waah, updatedb swapped everything out" and the "waah, copying a DVD
> > gobbled all my memory" problems.
>
> By removing pressure from the page cache, you'll only allow updatedb
> to grow the inode and dentry caches larger than before.

Well duh. That's a two-order-of-magnitude lesser problem and only affects one of many problematic workloads.

> I am sick and tired of the "this is hard, let userspace do it" attitude.

Anything you try to do in-kernel will catastrophically screw up some workloads. You don't have a chance of getting this right.

You are the kernel. The user just read an entire kernel tree. You face a binary decision: do you cache that tree or do you not? Your time starts now. What is your answer?
Re: userspace pagecache management tool
Andrew Morton wrote:

> It is to address the "waah, backups fill my memory with pagecache" and the
> "waah, updatedb swapped everything out" and the "waah, copying a DVD
> gobbled all my memory" problems.

By removing pressure from the page cache, you'll only allow updatedb to grow the inode and dentry caches larger than before.

I am sick and tired of the "this is hard, let userspace do it" attitude.

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
userspace pagecache management tool
I've uploaded to

    http://userweb.kernel.org/~akpm/pagecache-management/

a little tool which permits the management of the pagecache usage of arbitrary applications. Effectively it prevents the targeted application from using any pagecache at all.

It is to address the "waah, backups fill my memory with pagecache" and the "waah, updatedb swapped everything out" and the "waah, copying a DVD gobbled all my memory" problems.

Although it is little more than a proof-of-concept it seems to be fairly useful. When running

    pagecache-management.sh dd if=100-mb-file of=foo

or

    pagecache-management.sh cp -a /usr/src/linux-2.6.20 /usr/src/foo

the amount of pagecache in the machine is pretty much unaltered. Maybe a megabyte of additional cache in the second case, because of ext3 indirect blocks.

The tool uses an LD_PRELOAD hack to intercept glibc's read(), pread(), write(), pwrite(), close() and dup2() functions. Pagecache control is done via posix_fadvise() and sync_file_range().

btw, for a while I was using fdatasync() on close(), but it was slow, because fdatasync() has to run an ext3 commit to commit the metadata. sync_file_range() doesn't do that, and the copy-a-kernel-tree testcase sped up by a factor of five. So sync_file_range() rocks, but the powerpc guys haven't wired it up yet.

There is much more which could be done to make this code smarter, but I think the lesson here is that we can produce a far, far better result doing this work in userspace than we could ever hope to do with an in-kernel implementation. There are some enhancement suggestions in the documentation file.

It would be good if someone could turn this into a real product, get it fed into distros. Once the design is settled we should look at moving all the functionality into glibc itself, IMO, and get rid of the LD_PRELOAD trick.

It might help if the kernel offered APIs which permit userspace to query the number of resident pages in a file (well, actually it already does, kind-of: mincore()) and the ability to query the number of dirty pages in a file, etc. I'd be reluctant to tie the kernel ABI too closely to the current pagecache implementation and data structures, but we can look at these things.
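The mechanism described in the announcement (do the I/O, then tell the kernel the pages are no longer needed) can be sketched roughly as below. This is not the tool's code: the tool is an LD_PRELOAD shim around glibc, whereas this is a standalone copy loop, and the function name `copy_without_caching` and the chunk size are made up for illustration. It also assumes Linux, and it uses `os.fdatasync()` where the tool uses sync_file_range(), because Python's `os` module does not expose the latter; as the post notes, that substitution is the slow variant.

```python
import os

CHUNK = 1 << 20  # 1 MiB per step; an arbitrary size chosen for this sketch


def copy_without_caching(src_path, dst_path, chunk=CHUNK):
    """Copy src_path to dst_path, dropping pagecache pages behind the copy.

    Returns the number of bytes copied.
    """
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            offset = 0
            while True:
                buf = os.read(src_fd, chunk)
                if not buf:
                    break
                view = memoryview(buf)
                while view:  # os.write() may write less than requested
                    view = view[os.write(dst_fd, view):]
                # Dirty pages cannot be dropped, so write them back first.
                os.fdatasync(dst_fd)
                # Ask the kernel to free the range just read and just written.
                for fd in (src_fd, dst_fd):
                    os.posix_fadvise(fd, offset, len(buf),
                                     os.POSIX_FADV_DONTNEED)
                offset += len(buf)
            return offset
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
```

Note that this sketch drops the source pages unconditionally, including any that were already cached before the copy started.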
Re: userspace pagecache management tool
Andrew Morton wrote:
> On Sat, 3 Mar 2007 22:41:09 +0100 bert hubert <[EMAIL PROTECTED]> wrote:
>
> > > How can you make global policy decisions based on the intent
> > > of one program?
> >
> > By not doing so.
>
> yup.
>
> > Andrew's program is fine in principle, except that the
> > linux kernel treats the communication of a program's intent as a global
> > instruction.
>
> argh. That felt good - let's do it again. argh.
>
> It is *not* a global instruction. It uses setenv, so the user's policy
> affects only the target process and its forked children.

... and all other processes accessing the same file(s)!

Your library and the system calls may be limited to one process, but the consequences are global.

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: userspace pagecache management tool
Andrew Morton wrote:

> Well, backup programs are a unique case. Let's say instead that the user
> has just generated a 600MB ISO image. The kernel *just doesn't know*
> whether the user will next try to read the kernel tree or will next try
> to read that ISO image.
>
> That, Rik, is my point, and is the entire point of this work.

I still don't understand why "the backup program flushed my data out of the cache with POSIX_FADV_DONTNEED" is an improvement over "the backup program flushed my data out of the cache by reading other files".

Your code may be useful for a few specialized situations, but I don't see it actually fixing most of the examples you gave in your announcement, except for the DVD copying one.

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> > It is *not* a global instruction. It uses setenv, so the user's policy
> > affects only the target process and its forked children.
>
> ... and all other processes accessing the same file(s)!
>
> Your library and the system calls may be limited to one process,
> but the consequences are global.

Yes. So what? If the user wants to go and evict libc.so from pagecache then he can do so - the kernel has provided syscalls with which this can be done for at least seven years. Bad user, shouldn't do that.
Re: userspace pagecache management tool
Andrew Morton wrote:
> On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
>
> > > It is *not* a global instruction. It uses setenv, so the user's policy
> > > affects only the target process and its forked children.
> >
> > ... and all other processes accessing the same file(s)!
> >
> > Your library and the system calls may be limited to one process,
> > but the consequences are global.
>
> Yes. So what? If the user wants to go and evict libc.so from pagecache
> then he can do so - the kernel has provided syscalls with which this can
> be done for at least seven years. Bad user, shouldn't do that.

Are you saying the user should not use your script with their backup program?

Then what's the point?

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
>
> > Well, backup programs are a unique case. Let's say instead that the user
> > has just generated a 600MB ISO image. The kernel *just doesn't know*
> > whether the user will next try to read the kernel tree or will next try
> > to read that ISO image.
> >
> > That, Rik, is my point, and is the entire point of this work.
>
> I still don't understand why "the backup program flushed my
> data out of the cache with POSIX_FADV_DONTNEED" is an improvement
> over "the backup program flushed my data out of the cache by
> reading other files".

Oh. Well, yes, if the user elected to instruct the backup program to invalidate both its input files and its output files and if it's a full dump, you end up with nothing in pagecache. Possibly a more sensible setting would be to invalidate only the output.

But having some batch program come in from the side and perform a bulk read of your present working set isn't very common.

> Your code may be useful for a few specialized situations,

That's quite wrong. It is useful for a great number of well-known problem scenarios, all of which are *already* specialized situations. It's *you* who is chasing down the 1% scenarios and portraying them as general problems. Backups only happen once in 24 hours, for example.

> but I don't see it actually fixing most of the examples
> you gave in your announcement, except for the DVD copying
> one.

I don't know how much benefit it will provide for the updatedb problem - I expect it'll help sometimes. otoh maybe it'll worsen the existing slab internal fragmentation problem, dunno.

But the other scenarios it solves completely and optimally.
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 17:28:35 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
> >
> > > > It is *not* a global instruction. It uses setenv, so the user's policy
> > > > affects only the target process and its forked children.
> > >
> > > ... and all other processes accessing the same file(s)!
> > >
> > > Your library and the system calls may be limited to one process,
> > > but the consequences are global.
> >
> > Yes. So what? If the user wants to go and evict libc.so from pagecache
> > then he can do so - the kernel has provided syscalls with which this can
> > be done for at least seven years. Bad user, shouldn't do that.
>
> Are you saying the user should not use your script with their
> backup program?

No. This is getting silly.
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> backup program

A suitable policy for a backup program would probably be to invalidate any output file(s) and to invalidate those pages of the input files which were not in cache when the backup program first opened those files.

That way the backup program will have no effect on the cache state, except for the race situation where someone read an uncached file while the backup program was reading from it too.

This can be added in an hour or two with no kernel changes (use mincore).
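A minimal sketch of that input-file policy, assuming Linux and a libc that can be reached through `ctypes`: take a mincore() snapshot when the file is opened, and afterwards invalidate only the pages that were absent from the snapshot. The helper names `resident_pages` and `drop_pages_we_brought_in` are hypothetical, and it is not claimed that the tool does it this way.

```python
import ctypes
import mmap
import os

PAGE = mmap.PAGESIZE

# CDLL(None) resolves symbols from the running process, which includes libc.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                          ctypes.POINTER(ctypes.c_ubyte)]
_libc.mincore.restype = ctypes.c_int


def resident_pages(fd, length):
    """Return one bool per page of the file: True if it is in pagecache."""
    if length == 0:
        return []
    npages = (length + PAGE - 1) // PAGE
    # ACCESS_COPY gives a private writable mapping, which ctypes needs in
    # order to take the buffer's address; nothing is written through it.
    mm = mmap.mmap(fd, length, access=mmap.ACCESS_COPY)
    try:
        buf = (ctypes.c_char * length).from_buffer(mm)
        try:
            vec = (ctypes.c_ubyte * npages)()
            rc = _libc.mincore(ctypes.addressof(buf), length, vec)
            err = ctypes.get_errno()
        finally:
            del buf  # release the buffer export so mm.close() succeeds
        if rc != 0:
            raise OSError(err, os.strerror(err))
        # Only the least significant bit of each byte is defined.
        return [bool(v & 1) for v in vec]
    finally:
        mm.close()


def drop_pages_we_brought_in(fd, was_resident):
    """Invalidate only the pages that were not cached in the snapshot."""
    dropped = 0
    for index, resident in enumerate(was_resident):
        if not resident:
            os.posix_fadvise(fd, index * PAGE, PAGE, os.POSIX_FADV_DONTNEED)
            dropped += 1
    return dropped
```

A caller would take `snapshot = resident_pages(fd, size)` right after open(), read the file, then call `drop_pages_we_brought_in(fd, snapshot)`. The race mentioned above still applies: a page that another process faulted in after the snapshot would be dropped too.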
Re: userspace pagecache management tool
On 3/3/07, Andrew Morton [EMAIL PROTECTED] wrote:

> It is to address the "waah, backups fill my memory with pagecache" and the "waah, updatedb swapped everything out" and the "waah, copying a DVD gobbled all my memory" problems.

Is the updatedb problem really due to pagecache?

> When running
>   pagecache-management.sh dd if=100-mb-file of=foo
> or
>   pagecache-management.sh cp -a /usr/src/linux-2.6.20 /usr/src/foo
> the amount of pagecache in the machine is pretty much unaltered. Maybe a megabyte of additional cache in the second case, because of ext3 indirect blocks.

[EMAIL PROTECTED]:~/work/home/pagecache-management$ grep ext3_i /proc/slabinfo; ./pagecache-management.sh sudo updatedb; grep ext3_i /proc/slabinfo
ext3_inode_cache 21024 23722 1584 2 1 : tunables 24 12 0 : slabdata 11861 11861 0
ext3_inode_cache 41332 41332 1584 2 1 : tunables 24 12 0 : slabdata 20666 20666 0
[EMAIL PROTECTED]:~/work/home/pagecache-management$ echo $(( 1584 * (41332-21024) ))
32167872

Or is there a /proc/sys/vm/* knob that can be tweaked for this before/after the updatedb?

But yeah, I for one would happily submit patches to upstream authors to address this there. There's no reason code should be making the kernel guess its intention on these things.

Ray
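The arithmetic in Ray's shell session can be reproduced as a small script. This is only a sketch: it assumes the fused "158421" in the archived slabinfo rows was originally the three columns objsize 1584, objperslab 2 and pagesperslab 1, which matches the 1584 used in the echo command and the slab counts in the slabdata columns. The helper name parse_slabinfo_line is made up for this illustration.

```python
# Sketch: estimate how much the ext3 inode slab grew across the updatedb run,
# using the two /proc/slabinfo rows quoted in the message above.
# Assumption: the columns are name, active_objs, num_objs, objsize, ...

def parse_slabinfo_line(line):
    """Return (name, active_objs, num_objs, objsize) from one slabinfo row."""
    fields = line.split()
    return fields[0], int(fields[1]), int(fields[2]), int(fields[3])

before = "ext3_inode_cache 21024 23722 1584 2 1 : tunables 24 12 0 : slabdata 11861 11861 0"
after = "ext3_inode_cache 41332 41332 1584 2 1 : tunables 24 12 0 : slabdata 20666 20666 0"

_, active_before, _, objsize = parse_slabinfo_line(before)
_, active_after, _, _ = parse_slabinfo_line(after)

# Same calculation as: echo $(( 1584 * (41332-21024) ))
growth_bytes = objsize * (active_after - active_before)
print(growth_bytes)                    # 32167872
print(round(growth_bytes / 2**20, 1))  # 30.7 (MiB)
```

This counts only active inode objects; dentries and any slab pages left partially empty are not included, so it is a lower bound on what updatedb instantiated.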
Re: userspace pagecache management tool
On Sat, Mar 03, 2007 at 02:26:09PM -0800, Andrew Morton wrote: It is *not* a global instruction. It uses setenv, so the user's policy affects only the target process and its forked children. ... and all other processes accessing the same file(s)! Your library and the system calls may be limited to one process, but the consequences are global. Yes. So what? If the user wants to go and evict libc.so from pagecache then he can do so - the kernel has provided syscalls with which this can be done for at least seven years. Bad user, shouldn't do that. While I agree with your sentiments that userspace can have a good idea on how to deal with the page cache, your program does more than it claims to do - because of how linux implements posix_fadvise. I don't think anybody expects or desires your program to actually *evict* the stuff from the cache you are trying access, which happens in case the data was in the cache prior to starting your program. What people expect is that a solution such as you wrote it simply won't *add* anything to the cache. They don't expect it will actually globally *remove* stuff from the cache. Making a backup this way would hurt even worse than usual with your pagecache management tool if the file being backupped was still being read. This is not your fault, but in practice, it makes your program less useful than it could be. One could conceivably fix that up using mincore and simply not fadvise if a page was in core already. Bert -- http://www.PowerDNS.com Open source, database driven DNS Software http://netherlabs.nl Open and Closed source services - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
On Sat, 3 Mar 2007 14:58:48 -0800 Ray Lee [EMAIL PROTECTED] wrote:

> On 3/3/07, Andrew Morton [EMAIL PROTECTED] wrote:
> > It is to address the "waah, backups fill my memory with pagecache" and the "waah, updatedb swapped everything out" and the "waah, copying a DVD gobbled all my memory" problems.
>
> Is the updatedb problem really due to pagecache?

It's a combination of pagecache, slab cache and of course contention for the disk. In my experience the latter preponderates: the disk is seeking like mad and I can't get its attention. Others report lots of swapout, which will be a combination of slab and pagecache, varying degrees of each.

> > When running
> >   pagecache-management.sh dd if=100-mb-file of=foo
> > or
> >   pagecache-management.sh cp -a /usr/src/linux-2.6.20 /usr/src/foo
> > the amount of pagecache in the machine is pretty much unaltered. Maybe a megabyte of additional cache in the second case, because of ext3 indirect blocks.
>
> [EMAIL PROTECTED]:~/work/home/pagecache-management$ grep ext3_i /proc/slabinfo; ./pagecache-management.sh sudo updatedb; grep ext3_i /proc/slabinfo
> ext3_inode_cache 21024 23722 1584 2 1 : tunables 24 12 0 : slabdata 11861 11861 0
> ext3_inode_cache 41332 41332 1584 2 1 : tunables 24 12 0 : slabdata 20666 20666 0
> [EMAIL PROTECTED]:~/work/home/pagecache-management$ echo $(( 1584 * (41332-21024) ))
> 32167872

If 32 MB is the whole lot then by eliminating pagecache, we just solved the problem. But perhaps you instantiated a lot more VFS cache and all you're seeing there is the leftovers.

> Or is there a /proc/sys/vm/* knob that can be tweaked for this before/after the updatedb?

/proc/sys/vm/vfs_cache_pressure should help. I don't recall anyone reporting its effects with updatedb.

> But yeah, I for one would happily submit patches to upstream authors to address this there. There's no reason code should be making the kernel guess its intention on these things.

I think so. We're dealing with super-special cases here and often trying to fix those in-kernel will degrade other, often more common cases.

<wonders about sys_reclaim_dentry(const char *pathname)>
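For anyone who wants to try the vfs_cache_pressure suggestion around an updatedb run, a minimal sketch follows. It assumes root privileges and that a higher value makes the kernel more willing to reclaim dentries and inodes; whether that actually helps updatedb is, as noted above, unreported. The helper name temporary_sysctl and the value 1000 are illustrative choices, not anything from the thread.

```python
import contextlib

VFS_CACHE_PRESSURE = "/proc/sys/vm/vfs_cache_pressure"

@contextlib.contextmanager
def temporary_sysctl(value, path=VFS_CACHE_PRESSURE):
    """Write `value` to a sysctl file and restore the old value on exit."""
    with open(path) as f:
        saved = f.read().strip()
    with open(path, "w") as f:
        f.write(str(value))
    try:
        yield saved
    finally:
        # Restore the previous setting even if the wrapped command failed.
        with open(path, "w") as f:
            f.write(saved)

# Hypothetical usage (needs root):
#   import subprocess
#   with temporary_sysctl(1000):
#       subprocess.run(["updatedb"], check=True)
```

Comparing the ext3_inode_cache and dentry rows of /proc/slabinfo before and after such a run would show whether the knob made any difference.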
Re: userspace pagecache management tool
On Sat Mar 03, 2007 at 02:26:09PM -0800, Andrew Morton wrote: On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel [EMAIL PROTECTED] wrote: It is *not* a global instruction. It uses setenv, so the user's policy affects only the target process and its forked children. ... and all other processes accessing the same file(s)! Your library and the system calls may be limited to one process, but the consequences are global. Yes. So what? If the user wants to go and evict libc.so from pagecache then he can do so - the kernel has provided syscalls with which this can be done for at least seven years. Bad user, shouldn't do that. I think what Rik is pointing out is that as currently implemented, posix_fadvise is a much bigger hammer than is generally useful or desirable. Using posix_fadvise on the other hand says immediately drop this stuff from the pagecache, consequences be damned. If someone else happens to be using the specified data, well too bad, they suffer collateral damage. Process A can, maliciously or ignorantly, deny service to process B. On the other hand, your old but super cool O_STREAMING patch took a kinder gentler approach, where applications could tell the kernel please do not keep this file descriptor's data in cache on my account since I will not reuse it. If someone else however was using the same data, the kernel would keep things cached as usual and thereby avoid doing collateral damage. -Erik -- Erik B. Andersen http://codepoet-consulting.com/ --This message was written using 73% post-consumer electrons-- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
On Sun, 4 Mar 2007 00:01:55 +0100 bert hubert [EMAIL PROTECTED] wrote:

> On Sat, Mar 03, 2007 at 02:26:09PM -0800, Andrew Morton wrote:
> > > > It is *not* a global instruction. It uses setenv, so the user's policy affects only the target process and its forked children.
> > > ... and all other processes accessing the same file(s)! Your library and the system calls may be limited to one process, but the consequences are global.
> > Yes. So what? If the user wants to go and evict libc.so from pagecache then he can do so - the kernel has provided syscalls with which this can be done for at least seven years. Bad user, shouldn't do that.
>
> While I agree with your sentiments that userspace can have a good idea on how to deal with the page cache, your program does more than it claims to do - because of how linux implements posix_fadvise.
>
> I don't think anybody expects or desires your program to actually *evict* the stuff from the cache you are trying to access, which happens in case the data was in the cache prior to starting your program.
>
> What people expect is that a solution such as you wrote it simply won't *add* anything to the cache. They don't expect it will actually globally *remove* stuff from the cache.
>
> Making a backup this way would hurt even worse than usual with your pagecache management tool if the file being backed up was still being read.
>
> This is not your fault, but in practice, it makes your program less useful than it could be.

yup. As I said, it's a proof-of-concept. It's a project. And I have about one free femtosecond per fortnight :(

> One could conceivably fix that up using mincore and simply not fadvise if a page was in core already.

Yes. Let's flesh out the backup program policy some more:

- Unconditionally invalidate output files

- on entry to read(), probe pagecache, record which pages in the range are present

- on entry to next read(), shoot down those pages from the previous read which weren't in pagecache.

- But we can do better! LRU the file's pages up to a certain number of pages.

- Once that point is exceeded, we need to reclaim some pages. Which ones? Well, we've been observing all reads, so we can record which pages were referenced once, and which ones were referenced multiple times so we can do arbitrarily complex page aging in there.

- On close(), nuke all pages which weren't in core during open(), even if this app referenced them multiple times.

- If the backup program decided to read its input files with mmap we're rather screwed. We can't intercept pagefaults so the best we can do is to restore the file's pagecache to its previous state on close(). Or if it's really a problem, get control in there somehow and periodically poll the pagecache occupancy via mincore(), use madvise() then fadvise() to trim it back.

That all sounds reasonably doable. It'd be pretty complex to do it in-kernel but we could do it there too. Problem is of course that the above strategy is explicitly optimised for the backup program and if it's in-kernel it becomes applicable to all other workloads.
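The first three steps of the policy above (probe before a read, then shoot down only the pages this read brought in) can be sketched in userspace roughly as follows. This is an illustration under assumptions, not the tool's actual code: the function names are invented, and the set of pages that were resident beforehand is taken as an input, obtained by whatever probe is available (for example mincore() on a mapping of the file).

```python
import os

PAGE = os.sysconf("SC_PAGE_SIZE")

def pages_in_range(offset, length, page_size=PAGE):
    """Page indices touched by a read of `length` bytes at `offset`."""
    if length <= 0:
        return range(0)
    return range(offset // page_size, (offset + length - 1) // page_size + 1)

def runs_to_drop(touched, resident_before):
    """Group touched pages that were NOT cached beforehand into (first, count) runs.

    `touched` must be in ascending order; `resident_before` is the set of page
    indices that the probe found in pagecache before the read.
    """
    runs = []
    for page in touched:
        if page in resident_before:
            continue  # someone else had this page cached; leave it alone
        if runs and runs[-1][0] + runs[-1][1] == page:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((page, 1))
    return runs

def drop_previous_read(fd, offset, length, resident_before, page_size=PAGE):
    """Evict only the pages that the previous read itself added to pagecache."""
    touched = pages_in_range(offset, length, page_size)
    for first, count in runs_to_drop(touched, resident_before):
        os.posix_fadvise(fd, first * page_size, count * page_size,
                         os.POSIX_FADV_DONTNEED)
```

Merging adjacent pages into runs keeps the number of posix_fadvise() calls small. The race mentioned earlier in the thread remains: a page that another process starts using after the probe will still be dropped.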
Re: userspace pagecache management tool
Andrew Morton wrote: On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel [EMAIL PROTECTED] wrote: backup program A suitable policy for a backup program would probably be to invalidate any output file(s) and to invalidate those pages of the input files which were not in cache when the backup program first opened those files. That way the backup program will have no effect on the cache state, except for the race situation where someone read an uncached file while the backup program was reading from it too. The use-once policy we have in the kernel should work perfectly fine for backups. All we need to do is actually honor the accessed bit on active page cache pages, instead of flushing them onto the inactive list. What am I overlooking? This can be added in an hour or two with no kernel changes (use mincore). mincore only works for mmaped areas, we'd need an fincore to work with file handles. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 19:01:01 -0500 Rik van Riel [EMAIL PROTECTED] wrote: Andrew Morton wrote: On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel [EMAIL PROTECTED] wrote: backup program A suitable policy for a backup program would probably be to invalidate any output file(s) and to invalidate those pages of the input files which were not in cache when the backup program first opened those files. That way the backup program will have no effect on the cache state, except for the race situation where someone read an uncached file while the backup program was reading from it too. The use-once policy we have in the kernel should work perfectly fine for backups. All we need to do is actually honor the accessed bit on active page cache pages, instead of flushing them onto the inactive list. What am I overlooking? That'll improve backups but will break other things. To do this effectively we'd need to change the policy so that new pagecache allocations cause no scanning of used-twice pages at all. So that even after many gigs of backing up, the working set is still there. Problem is, (for example) what about the person who has 80% of memory in used-twice state and who then reads a file or files which are 20% or more of the size of memory, two or more times. It'll be 100% cache misses, every time. This will happen quite a lot. IOW, once those pages are in used-twice state, how does further pagecache activity ever get them _out_ of that state? Only by joining the used-twice page set, and that can't happen if the used-once-so-far pages got reclaimed. Doing a refault thing would help a bit, but stops working at a certain point. This can be added in an hour or two with no kernel changes (use mincore). mincore only works for mmaped areas, we'd need an fincore to work with file handles. The LD_PRELOAD code has the fd and can mmap it to perform the pagecache probe. 
fincore() would be a bit neater, but given the rarity with which mincore() is used it's perhaps hard to justify adding a slightly more efficient and slightly more convenient subset of mincore(). - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
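The "mmap the fd and call mincore()" probe described above might look like the sketch below. It assumes Linux, CPython and a libc that exports mincore(); resident_pages is a made-up name for what a hypothetical fincore() would return. Note that some kernels report every page as resident when the caller neither owns the file nor can write to it, so the answer is a hint at best.

```python
import ctypes
import mmap
import os

def resident_pages(fd):
    """Return the set of page indices of `fd` that mincore() reports as cached."""
    size = os.fstat(fd).st_size
    if size == 0:
        return set()  # nothing to map; mmap of length 0 would fail
    page = mmap.PAGESIZE
    npages = (size + page - 1) // page

    libc = ctypes.CDLL(None, use_errno=True)
    libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                             ctypes.POINTER(ctypes.c_ubyte)]
    libc.mincore.restype = ctypes.c_int
    vec = (ctypes.c_ubyte * npages)()

    # ACCESS_COPY gives a private, writable mapping, which ctypes needs in
    # order to take the buffer's address. Nothing is written through it.
    mm = mmap.mmap(fd, size, access=mmap.ACCESS_COPY)
    try:
        view = ctypes.c_char.from_buffer(mm)
        try:
            if libc.mincore(ctypes.addressof(view), size, vec) != 0:
                raise OSError(ctypes.get_errno(), "mincore failed")
        finally:
            del view  # release the exported buffer so mm.close() succeeds
    finally:
        mm.close()
    # Only the least significant bit of each vector byte is defined.
    return {i for i in range(npages) if vec[i] & 1}
```

Mapping the file does not fault its pages in, so the probe itself should not disturb the cache state it is measuring.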
Re: userspace pagecache management tool
Andrew Morton wrote: wonders about sys_reclaim_dentry(const char *pathname) Would there be any other users of it than updatedb? I'm not coming up with much, but given that I'm not always clever, that doesn't mean much. thinks out loud... A hypothetical on-demand file virus scanner is going to hit already cached or about-to-be-cached entries by definition. Perhaps some system audit daemon, such as tripwire. Well, that has the same access patterns as updatedb, doesn't it: a directory at a time. find, cp -a, the same. So instead of sys_reclaim_dentry, how about extending fadvise to work on the fd returned via opendir? And extending POSIX_FADV_NOREUSE on a file fd to drop the dentry at close? (Call me chicken; I just don't want to be the guy suggesting a new syscall for a single or few users.) ~ ~ Alternately, there have been requests for a way for userspace to get notification of all file events for indexing of data and metadata (inotify, unfortunately, doesn't scale to a full filesystem). (cf. http://lkml.org/lkml/2006/9/30/98 .) That'd allow an updatedb daemon to keep the index up to date all the time, amortizing the cost. More usefully, it'd allow a content indexing daemon to stay up to date all the time, though inotify mostly works for those, I suppose. (Hmm... [EMAIL PROTECTED]:~$ find ~ -type d | wc -l 14067 ...right. So it probably works fine for normal people.) Hey, waitaminute. This should be a solved problem? SELinux must have some sort of requirement for logging file access attempts. Google, at least, implies so. Perhaps whatever it implements could be lifted into the core kernel without dragging the rest behind it. Dunno. Who do we CC? Ray - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
On Sat, 2007-03-03 at 12:29 -0800, Andrew Morton wrote: There is much more which could be done to make this code smarter, but I think the lesson here is that we can produce a far, far better result doing this work in userspace than we could ever hope to do with an in-kernel implementation. There are some enhancement suggestions in the documentation file. While I think that more user space applications should use fadvise() to avoid polluting the page cache with unneeded data, I still think the kernel should be more fair in regard to page cache management. Personally, I've experienced some sluggish performance after copying large files around. Even more when using NFS. It's difficult to file a bug report for interactive feel, I don't know how to measure it. I just feel it's a weak aspect of the OS. Surely it's possible to make the kernel a little bit better to protect the page cache from abuse, from simple or badly designed applications. Why fairness is provided by the process scheduler with good results, yet it somewhat easy for a process to cause slowdowns from page cache usage. My personal opinion is that the VM seem tuned for database types workloads. Of course, making the page cache more fair to prevent one process to use most of it will most likely slowdown database type applications. Maybe the situation should be reversed, much like the process scheduler. Fairness by default, and the possibility to request for more system resources by asking for them with necessary privileges. Much like SCHED_FIFO policy. - Eric - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 19:14:59 -0500 Eric St-Laurent [EMAIL PROTECTED] wrote: On Sat, 2007-03-03 at 12:29 -0800, Andrew Morton wrote: There is much more which could be done to make this code smarter, but I think the lesson here is that we can produce a far, far better result doing this work in userspace than we could ever hope to do with an in-kernel implementation. There are some enhancement suggestions in the documentation file. While I think that more user space applications should use fadvise() to avoid polluting the page cache with unneeded data, I still think the kernel should be more fair in regard to page cache management. Personally, I've experienced some sluggish performance after copying large files around. Even more when using NFS. It's difficult to file a bug report for interactive feel, I don't know how to measure it. I just feel it's a weak aspect of the OS. yeah. It'd be worth spending some time, try to come up with some set of commands which produce an effect which you find objectionable. Surely it's possible to make the kernel a little bit better to protect the page cache from abuse, from simple or badly designed applications. Why fairness is provided by the process scheduler with good results, yet it somewhat easy for a process to cause slowdowns from page cache usage. My personal opinion is that the VM seem tuned for database types workloads. VM hasn't actually been tuned *for* anything much at all, really. Looking back on it, much of the tweaking in there has been to avoid really bad situations. We put much work into avoiding the 100%, 1000% or 1% slowdowns, but not a lot of work into providing the 15% speedups. So it may well be that the result is not particularly great at anything, but it's also not horridly bad at anything, either. Or at least, it's not supposed to be. Of course, making the page cache more fair to prevent one process to use most of it will most likely slowdown database type applications. 
databases actually like to manage their own cache via various means. There are some situations in which bulk IO activities can trash the database's cache. That's the sort of thing which this tool is trying to help address. Maybe the situation should be reversed, much like the process scheduler. Fairness by default, and the possibility to request for more system resources by asking for them with necessary privileges. Much like SCHED_FIFO policy. Well. If the CPU scheduler makes a mistake, we see 5% or 15% degradations. If the VM makes a mistake (or fails to read the operator's mind), we go to disk and can suffer 1000% degradations or worse.
Re: userspace pagecache management tool
On 3/3/07, Andrew Morton [EMAIL PROTECTED] wrote:

> The tool uses an LD_PRELOAD hack to intercept glibc's read(), pread(), write(), pwrite(), close() and dup2() functions. pagecache control is done via posix_fadvise() and sync_file_range().

How could this have any effect on the updatedb problem? updatedb does not read() anything, it just open()s and stat()s every file on the disk.

Lee
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 17:02:31 -0800 Ray Lee [EMAIL PROTECTED] wrote: Andrew Morton wrote: wonders about sys_reclaim_dentry(const char *pathname) Would there be any other users of it than updatedb? updatedb is the notorious one. Alas, one can envisage sane workloads which really really really really want to cache millions of dentries and inodes. Workloads which get run more often than once-per-day-at-4AM. So if we fix updatedb, those people with kernel-indistinguishable workloads get real unhappy. We have to pass instructions to the kernel to resolve this. A sys_reclaim_dentry() would also need a sys_is_dentry_there() so updatedb could restore the previous state. Probably it'd be better to fix the well-known internal fragmentation problem we have with the VFS caches. That's fairly hard. I'm not coming up with much, but given that I'm not always clever, that doesn't mean much. thinks out loud... A hypothetical on-demand file virus scanner is going to hit already cached or about-to-be-cached entries by definition. Perhaps some system audit daemon, such as tripwire. Well, that has the same access patterns as updatedb, doesn't it: a directory at a time. find, cp -a, the same. So instead of sys_reclaim_dentry, how about extending fadvise to work on the fd returned via opendir? That'd be pretty simple, but a) would reclaim the pagecache for the directory and not the dentry object itself and b) will only be easy to do for ext2 and minixfs, which maintain a separate pagecache per directory. Yes we could do a nuke all the dentries in this directory thing, but that's equivalent to sys_reclaim_dentry() in a loop. And extending POSIX_FADV_NOREUSE on a file fd to drop the dentry at close? (Call me chicken; I just don't want to be the guy suggesting a new syscall for a single or few users.) 
~ ~ Alternately, there have been requests for a way for userspace to get notification of all file events for indexing of data and metadata (inotify, unfortunately, doesn't scale to a full filesystem). (cf. http://lkml.org/lkml/2006/9/30/98 .) yes, that's a disappointment. That'd allow an updatedb daemon to keep the index up to date all the time, amortizing the cost. More usefully, it'd allow a content indexing daemon to stay up to date all the time, though inotify mostly works for those, I suppose. (Hmm... [EMAIL PROTECTED]:~$ find ~ -type d | wc -l 14067 ...right. So it probably works fine for normal people.) Hey, waitaminute. This should be a solved problem? SELinux must have some sort of requirement for logging file access attempts. Google, at least, implies so. Perhaps whatever it implements could be lifted into the core kernel without dragging the rest behind it. Maybe the syscall auditing code can be persuaded to spit out records which can be used for this. Dunno. Who do we CC? That's a problem. Nobody and everybody. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: userspace pagecache management tool
Andrew Morton wrote: On Sat, 03 Mar 2007 19:01:01 -0500 Rik van Riel [EMAIL PROTECTED] wrote: The use-once policy we have in the kernel should work perfectly fine for backups. All we need to do is actually honor the accessed bit on active page cache pages, instead of flushing them onto the inactive list. What am I overlooking? That'll improve backups but will break other things. To do this effectively we'd need to change the policy so that new pagecache allocations cause no scanning of used-twice pages at all. So that even after many gigs of backing up, the working set is still there. Problem is, (for example) what about the person who has 80% of memory in used-twice state and who then reads a file or files which are 20% or more of the size of memory, two or more times. It'll be 100% cache misses, every time. This will happen quite a lot. IOW, once those pages are in used-twice state, how does further pagecache activity ever get them _out_ of that state? Only by joining the used-twice page set, and that can't happen if the used-once-so-far pages got reclaimed. Doing a refault thing would help a bit, but stops working at a certain point. At what point does it stop working? I am not asking this to be difficult, I just want to get Linux a VM that does not need to be kludged up every time a distro ships it to its customers. I believe one starting point would be a concept that people cannot shoot holes in any more. That is no guarantee, but as long as the concept has known holes coding it up is likely to be a waste of time since the code will need kludges to deal with the problems later on and we'd be back to square one. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. 
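Andrew's "80% used-twice, 20% file read twice" objection can be illustrated with a toy model. The sketch below assumes 100 page frames, 80 of them pinned on a never-scanned used-twice list, so new pagecache only ever gets the remaining 20 frames, managed as a plain LRU. The numbers and the function name miss_rate are illustrative only; promotion on a second touch is not modelled because it does not change the counts here.

```python
from collections import OrderedDict

def miss_rate(inactive_slots, file_pages, passes=2):
    """Read `file_pages` pages sequentially `passes` times through an LRU list
    of `inactive_slots` frames; the protected used-twice pages are never scanned."""
    inactive = OrderedDict()
    misses = accesses = 0
    for _ in range(passes):
        for page in range(file_pages):
            accesses += 1
            if page in inactive:
                inactive.move_to_end(page)        # hit: refresh LRU position
            else:
                misses += 1
                if len(inactive) >= inactive_slots:
                    inactive.popitem(last=False)  # evict least recently used
                inactive[page] = True
    return misses / accesses

print(miss_rate(inactive_slots=20, file_pages=25))  # 1.0: every access misses
print(miss_rate(inactive_slots=20, file_pages=15))  # 0.5: second pass all hits
```

With a file even slightly larger than the unprotected 20 frames, each page is evicted before it is touched again, so it can never earn a second reference and join the used-twice set, which is the trap described in the quoted text.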
Re: userspace pagecache management tool
On Sat, 3 Mar 2007 20:16:09 -0500 Lee Revell [EMAIL PROTECTED] wrote:

> On 3/3/07, Andrew Morton [EMAIL PROTECTED] wrote:
> > The tool uses an LD_PRELOAD hack to intercept glibc's read(), pread(), write(), pwrite(), close() and dup2() functions. pagecache control is done via posix_fadvise() and sync_file_range().
>
> How could this have any effect on the updatedb problem? updatedb does not read() anything, it just open()s and stat()s every file on the disk.

err, good point. _one_ of those dang things which goes off when you've stayed up too late does a lot of pagecache IO, not sure which one. Maybe rpmq? But I'd expect that to be doing direct-io.

But yes, updatedb's pagecache usage will be mainly metadata, and this tool doesn't address metadata pagecache, although it could do so.

<does an updatedb on a modest system>

It instantiated 5MB of pagecache and 20MB of slab, took about one minute.

<runs all the other things in /etc/cron.daily>

rpm uses rather a lot of pagecache. So yes, it looks like updatedb is a slab problem.
Re: userspace pagecache management tool
Eric St-Laurent wrote:

> While I think that more user space applications should use fadvise() to avoid polluting the page cache with unneeded data, I still think the kernel should be more fair in regard to page cache management.
>
> Personally, I've experienced some sluggish performance after copying large files around. Even more when using NFS. It's difficult to file a bug report for interactive feel, I don't know how to measure it. I just feel it's a weak aspect of the OS.

Fairness and interactiveness are very hard to quantify and measure, which makes it hard to justify patches that improve this behaviour.

On the other hand, patches that improve benchmark results are easily justifiable, which makes it easy to merge those even if it comes at the expense of fairness.

I think fairness and robustness are important, but I have not figured out a way to justify such changes for upstream inclusion. Well, except perhaps by coming up with artificial test cases, but that feels like cheating :)

> My personal opinion is that the VM seem tuned for database types workloads.
>
> Of course, making the page cache more fair to prevent one process to use most of it will most likely slowdown database type applications.

The database people disagree.

For one, the accessed bit on active pages gets ignored, so Linux does not properly keep the most actively used page cache pages in memory.

Secondly, the VM can waste quite a lot of time scanning over the anonymous pages that it does not even want to evict from memory. If the VM does not plan on evicting anonymous memory (or shared memory segments), why waste CPU time scanning them and randomizing their LRU order?

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: userspace pagecache management tool
On Sat, 03 Mar 2007 20:23:07 -0500 Rik van Riel [EMAIL PROTECTED] wrote:

Andrew Morton wrote:

On Sat, 03 Mar 2007 19:01:01 -0500 Rik van Riel [EMAIL PROTECTED] wrote:

The use-once policy we have in the kernel should work perfectly fine for backups. All we need to do is actually honor the accessed bit on active page cache pages, instead of flushing them onto the inactive list. What am I overlooking?

That'll improve backups but will break other things.

To do this effectively we'd need to change the policy so that new pagecache allocations cause no scanning of used-twice pages at all. So that even after many gigs of backing up, the working set is still there.

Problem is, (for example) what about the person who has 80% of memory in used-twice state and who then reads a file or files which are 20% or more of the size of memory, two or more times. It'll be 100% cache misses, every time. This will happen quite a lot.

IOW, once those pages are in used-twice state, how does further pagecache activity ever get them _out_ of that state? Only by joining the used-twice page set, and that can't happen if the used-once-so-far pages got reclaimed.

Doing a refault thing would help a bit, but stops working at a certain point.

At what point does it stop working?

We need to store that this-page-got-reclaimed info somewhere. I don't know how space-efficient that is. Did anyone ever do an implementation?

Of course, the pages need to be re-read again so there's a potential 100% hit there, which is in fact not a huge amount in this context.

Depends how often it occurs (all the time when refault is being useful?) versus what we gain from it.

I am not asking this to be difficult, I just want to get Linux a VM that does not need to be kludged up every time a distro ships it to its customers.

We have a communication problem here. Please please please work harder to get these problems communicated to the MM developers.

The only vendor MM kludge of which I'm aware is a thing which Andrea is working on to address a large-shm-segment versus bulk-IO problem (yup, database).

If you have enough of an understanding of a problem to be able to develop and productise a fix then share that info madly, asap.

otoh, rhel-on-the-desktop-or-smaller probably isn't a huge priority, which can be taken advantage of.

I believe one starting point would be a concept that people cannot shoot holes in any more. That is no guarantee, but as long as the concept has known holes coding it up is likely to be a waste of time since the code will need kludges to deal with the problems later on and we'd be back to square one.

You mean design it and review the design before coding it? You'll find few objections there.
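The 80%/20% scenario raised in this message can be reproduced with a toy model. The sketch below is not kernel code: it assumes the used-twice pages are never scanned, so only the remaining slots are available to newly read pages, and it manages those slots as a plain LRU. The function name and its parameters are made up for illustration.

```python
from collections import OrderedDict


def sequential_rereads(inactive_slots, file_pages, passes):
    """Count cache hits when a file of `file_pages` pages is read
    sequentially `passes` times through an LRU of `inactive_slots` pages.

    Simplification: a page that is hit stays in the same list; promotion
    to a used-twice list is not modelled.
    """
    lru = OrderedDict()  # oldest page first
    hits = 0
    for _ in range(passes):
        for page in range(file_pages):
            if page in lru:
                hits += 1
                lru.move_to_end(page)
            else:
                if len(lru) >= inactive_slots:
                    lru.popitem(last=False)  # reclaim the oldest page
                lru[page] = True
    return hits
```

With 100 pages of memory of which 80 are pinned in used-twice state, `sequential_rereads(20, 25, 3)` returns 0: a 25-page file cycled through 20 slots misses on every access, because each page is reclaimed before it is touched a second time. A file that fits exactly, `sequential_rereads(20, 20, 3)`, returns 40 (every access after the first pass hits).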
Re: userspace pagecache management tool
Andrew Morton wrote:

Doing a refault thing would help a bit, but stops working at a certain point.

At what point does it stop working?

We need to store that this-page-got-reclaimed info somewhere. I don't know how space-efficient that is. Did anyone ever do an implementation?

One 32 bit word per evicted page that we keep track of.

Of course, the pages need to be re-read again so there's a potential 100% hit there, which is in fact not a huge amount in this context.

Depends how often it occurs (all the time when refault is being useful?) versus what we gain from it.

At this point, when we see that a refaulted page is more active than the coldest page on the active list, we can also immediately shrink the active list. That gives the next inactive page a better chance to get promoted before it gets evicted.

I am not asking this to be difficult, I just want to get Linux a VM that does not need to be kludged up every time a distro ships it to its customers.

We have a communication problem here. Please please please work harder to get these problems communicated to the MM developers.

The only vendor MM kludge of which I'm aware is a thing which Andrea is working on to address a large-shm-segment versus bulk-IO problem (yup, database).

If you have enough of an understanding of a problem to be able to develop and productise a fix then share that info madly, asap.

The problem is that most of the distro patches are kludges, which we would rather not see again in future kernels. They tend to work around the problem, instead of being a proper fix, since reorganizing the VM in the middle of a release is not an option. However, incremental small-to-medium changes might be an option for the upstream kernel, if you are interested.

I believe one starting point would be a concept that people cannot shoot holes in any more. That is no guarantee, but as long as the concept has known holes coding it up is likely to be a waste of time since the code will need kludges to deal with the problems later on and we'd be back to square one.

You mean design it and review the design before coding it? You'll find few objections there.

Few objections, but sadly also very few people interested in actually reviewing the design :(

If you can find holes in http://linux-mm.org/PageReplacementDesign please let me know :)

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
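The refault idea in this message (remember evicted pages at a cost of one 32-bit word each, and shrink the active list when one of them comes back) can be sketched as a toy model. Everything here is an assumption made for illustration: the class and method names, the CRC-based tag, the ring buffer, and the rule that any refault demotes the coldest active page. The message compares the refaulted page's activity with the coldest active page; this sketch skips that comparison.

```python
import zlib
from array import array
from collections import OrderedDict


class RefaultCache:
    """Toy two-list page cache that remembers recently evicted pages."""

    def __init__(self, capacity, history):
        self.capacity = capacity       # max resident pages, both lists
        self.inactive = OrderedDict()  # used-once pages, oldest first
        self.active = OrderedDict()    # used-twice pages, coldest first
        # One word per remembered eviction, used as a ring buffer.
        # Typecode "I" is a C unsigned int, 32 bits on common platforms.
        self.evicted = array("I", [0] * history)
        self.next_slot = 0

    @staticmethod
    def _tag(page):
        # 32-bit tag for a page id; collisions are possible.
        # 0 is reserved to mean "empty slot".
        return zlib.crc32(repr(page).encode()) or 1

    def _evict_one(self):
        victims = self.inactive if self.inactive else self.active
        page, _ = victims.popitem(last=False)
        self.evicted[self.next_slot] = self._tag(page)
        self.next_slot = (self.next_slot + 1) % len(self.evicted)

    def access(self, page):
        """Touch `page`; return 'hit', 'miss' or 'refault'."""
        if page in self.active:
            self.active.move_to_end(page)
            return "hit"
        if page in self.inactive:
            # Second reference: promote to the used-twice list.
            del self.inactive[page]
            self.active[page] = True
            return "hit"
        refault = self._tag(page) in self.evicted
        if len(self.inactive) + len(self.active) >= self.capacity:
            self._evict_one()
        if refault and self.active:
            # Shrink the active list: demote its coldest page.
            cold, _ = self.active.popitem(last=False)
            self.inactive[cold] = True
        self.inactive[page] = True
        return "refault" if refault else "miss"
```

With `RefaultCache(4, 8)`, touching `a` and `b` twice each puts both on the active list. Streaming `c`, `d`, `e` then evicts `c`, and the next access to `c` is reported as a refault and pushes `a` back to the inactive list, which is one way the used-twice set can be made to give up memory.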
Re: userspace pagecache management tool
On 3/3/07, Andrew Morton [EMAIL PROTECTED] wrote:

But yes, updatedb's pagecache usage will be mainly metadata, and this tool doesn't address metadata pagecache, although it could do so.

With no kernel changes? How? I can't find an equivalent API to posix_fadvise() for metadata.

Lee
Re: userspace pagecache management tool
On Sat, 3 Mar 2007 21:35:59 -0500 Lee Revell [EMAIL PROTECTED] wrote:

On 3/3/07, Andrew Morton [EMAIL PROTECTED] wrote:

But yes, updatedb's pagecache usage will be mainly metadata, and this tool doesn't address metadata pagecache, although it could do so.

With no kernel changes? How? I can't find an equivalent API to posix_fadvise() for metadata.

We can use mincore and fadvise against /dev/sda1, too. mincore's linear search would hurt but you could just run fadvise regularly.

A lot of the blockdev pagecache is pretty useless anyway: we've already copied much of it into dentries and inodes, and some of ext2/3/4's pagecache is already pinned by the fs.
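As a rough illustration of "just run fadvise regularly", here is a minimal sketch that asks the kernel to drop the cached pages behind a path. It assumes `POSIX_FADV_DONTNEED` is the advice intended (the message only says "fadvise"), the function name is made up, and the `mincore` side is left out because Python's standard library has no wrapper for it. Pointing it at a block device such as /dev/sda1 would normally require root.

```python
import os


def drop_cached_pages(path):
    """Advise the kernel that cached pages for `path` are not needed.

    Returns False on platforms without posix_fadvise, True otherwise.
    Dirty pages are generally not dropped until written back.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset 0 with length 0 means "from the start to the end"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return True
```

A periodic job could call `drop_cached_pages("/dev/sda1")` after each updatedb or backup run. The advice is only a hint, so how much is actually reclaimed depends on the kernel and on which pages are still pinned or dirty.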