Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node
On 12/14/2013 06:09 AM, Linus Torvalds wrote:
> On Wed, Dec 11, 2013 at 3:05 PM, Andrew Morton wrote:
> > But I'm really struggling to think up an implementation! The current
> > code looks only at the caller's node and doesn't seem to make much
> > sense. Should we look at all nodes? Hard to say without prior
> > knowledge of where those pages will be coming from.
>
> I really think we want to put an upper bound on the read-ahead, and I'm
> not convinced we need to try to be excessively clever about it. We also
> probably don't want to make it too expensive to calculate, because afaik
> this ends up being called for each file we open when we don't have pages
> in the page cache yet.
>
> The current function seems reasonable on a single-node system. Let's not
> kill it entirely just because it has some odd corner-case on multi-node
> systems.
>
> In fact, for all I care, I think it would be perfectly ok to just use a
> truly stupid hard limit ("you can't read-ahead more than 16MB" or
> whatever). What we do *not* want to allow is to have people call
> "readahead" functions and basically kill the machine because you now
> have an unkillable IO that is insanely big. So I'd much rather limit it
> too much than too little. And on absolutely no sane IO subsystem does it
> make sense to read ahead insane amounts. So I'd rather limit it to
> something stupid and small than not limit things at all.
>
> Looking at the interface, the natural thing to do for the "readahead()"
> system call is to just give it a size of ~0ul and let the system limit
> things, because limiting things in user space is just not reasonable.
>
> So I really do *not* think it's fine to just remove the limit entirely.

Very sorry for the late reply (was on a very long vacation).
How about having a 16MB limit only for remote readaheads and continuing the rest as is, something like below:

#define MAX_REMOTE_READAHEAD	4096UL

unsigned long max_sane_readahead(unsigned long nr)
{
	unsigned long local_free_page = node_page_state(numa_node_id(), NR_INACTIVE_FILE) +
					node_page_state(numa_node_id(), NR_FREE_PAGES);
	unsigned long sane_nr = min(nr, MAX_REMOTE_READAHEAD);

	return local_free_page ? min(nr, local_free_page / 2) : sane_nr;
}

Or we can enforce the 16MB limit for all cases too. I'll send a patch accordingly. (The readahead max will scale accordingly if we don't have a 4K page size above.)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node
On Wed, Dec 11, 2013 at 3:05 PM, Andrew Morton wrote:
>
> But I'm really struggling to think up an implementation! The current
> code looks only at the caller's node and doesn't seem to make much
> sense. Should we look at all nodes? Hard to say without prior
> knowledge of where those pages will be coming from.

I really think we want to put an upper bound on the read-ahead, and I'm not convinced we need to try to be excessively clever about it. We also probably don't want to make it too expensive to calculate, because afaik this ends up being called for each file we open when we don't have pages in the page cache yet.

The current function seems reasonable on a single-node system. Let's not kill it entirely just because it has some odd corner-case on multi-node systems.

In fact, for all I care, I think it would be perfectly ok to just use a truly stupid hard limit ("you can't read-ahead more than 16MB" or whatever). What we do *not* want to allow is to have people call "readahead" functions and basically kill the machine because you now have an unkillable IO that is insanely big. So I'd much rather limit it too much than too little. And on absolutely no sane IO subsystem does it make sense to read ahead insane amounts. So I'd rather limit it to something stupid and small than not limit things at all.

Looking at the interface, the natural thing to do for the "readahead()" system call is to just give it a size of ~0ul and let the system limit things, because limiting things in user space is just not reasonable.

So I really do *not* think it's fine to just remove the limit entirely.

Linus
Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node
On Wed 11-12-13 15:05:22, Andrew Morton wrote:
> On Wed, 11 Dec 2013 23:49:17 +0100 Jan Kara wrote:
> > >  /*
> > > - * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> > > - * sensible upper limit.
> > > + * max_sane_readahead() is disabled. It can later be removed altogether, but
> > > + * let's keep a skeleton in place for now, in case disabling was the wrong call.
> > >  */
> > >  unsigned long max_sane_readahead(unsigned long nr)
> > >  {
> > > -	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> > > -		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> > > +	return nr;
> > >  }
> > >
> > > Can anyone see a problem with this?
> > Well, the downside seems to be that if userspace previously issued
> > MADV/FADV_WILLNEED on a huge file, we trimmed the request to a sensible
> > size. Now we try to read the whole huge file which is pretty much
> > guaranteed to be useless (as we'll be pushing out of cache data we just
> > read a while ago). And guessing the right readahead size from userspace
> > isn't trivial so it would make WILLNEED advice less useful. What do you
> > think?
> OK, yes, there is conceivably a back-compatibility issue there. There
> indeed might be applications which decide to chuck the whole thing at
> the kernel and let the kernel work out what is a sensible readahead
> size to perform.
>
> But I'm really struggling to think up an implementation! The current
> code looks only at the caller's node and doesn't seem to make much
> sense. Should we look at all nodes? Hard to say without prior
> knowledge of where those pages will be coming from.
Well, I believe that we might have some compatibility issues only for non-NUMA machines - there the current logic makes sense. For NUMA machines I believe we are free to do basically anything because results of the current logic are pretty random.
Thinking about a proper implementation for NUMA - max_sane_readahead() is really interesting for madvise() and fadvise() calls (standard on-demand readahead is bounded by bdi->ra_pages, which tends to be pretty low anyway - like 512K or so). For these calls we will do the reads from the process issuing the [fm]advise() call and thus we will allocate pages depending on the NUMA policy. So depending on this policy we should be able to pick some estimate of the number of available pages, shouldn't we?

BTW, the fact that [fm]advise() calls submit all reads synchronously is another reason why we should bound the readahead requests to a sensible size.

								Honza
--
Jan Kara
SUSE Labs, CR
Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node
On Wed, 11 Dec 2013 23:49:17 +0100 Jan Kara wrote:
> >  /*
> > - * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> > - * sensible upper limit.
> > + * max_sane_readahead() is disabled. It can later be removed altogether, but
> > + * let's keep a skeleton in place for now, in case disabling was the wrong call.
> >  */
> >  unsigned long max_sane_readahead(unsigned long nr)
> >  {
> > -	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> > -		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> > +	return nr;
> >  }
> >
> > Can anyone see a problem with this?
> Well, the downside seems to be that if userspace previously issued
> MADV/FADV_WILLNEED on a huge file, we trimmed the request to a sensible
> size. Now we try to read the whole huge file which is pretty much
> guaranteed to be useless (as we'll be pushing out of cache data we just
> read a while ago). And guessing the right readahead size from userspace
> isn't trivial so it would make WILLNEED advice less useful. What do you
> think?

OK, yes, there is conceivably a back-compatibility issue there. There indeed might be applications which decide to chuck the whole thing at the kernel and let the kernel work out what is a sensible readahead size to perform.

But I'm really struggling to think up an implementation! The current code looks only at the caller's node and doesn't seem to make much sense. Should we look at all nodes? Hard to say without prior knowledge of where those pages will be coming from.
Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node
On Wed 04-12-13 13:48:38, Andrew Morton wrote:
> On Wed, 04 Dec 2013 14:38:11 +0530 Raghavendra K T wrote:
> > On 12/04/2013 02:11 PM, Andrew Morton wrote:
> > > On Wed, 04 Dec 2013 14:00:09 +0530 Raghavendra K T wrote:
> > >
> > >> Unfortunately, from my search, I saw that the code belonged to pre-git
> > >> time, so could not get much information on that.
> > >
> > > Here: https://lkml.org/lkml/2004/8/20/242
> > >
> > > It seems it was done as a rather thoughtless performance optimisation.
> > > I'd say it's time to reimplement max_sane_readahead() from scratch.
> >
> > Ok. Thanks for the link. I think after that, here it was changed to
> > per-node: https://lkml.org/lkml/2004/8/21/9 to avoid iterating all over.
> >
> > do you think above patch (+comments) with some sanitized nr (thus
> > avoiding iteration over nodes in remote numa readahead case) does look
> > better? or should we iterate all memory.
>
> I dunno, the whole thing smells of arbitrary woolly thinking to me.
> Going back further in time..
>
> : commit f76d03dc9fcff7ac88e2d23c5814fd0f50c59bb6
> : Author: akpm
> : AuthorDate: Sun Dec 15 03:18:58 2002 +0000
> : Commit: akpm
> : CommitDate: Sun Dec 15 03:18:58 2002 +0000
> :
> : [PATCH] madvise_willneed() maximum readahead checking
> :
> : madvise_willneed() currently has a very strange check on how much readahead
> : it is prepared to do.
> :
> : It is based on the user's rss limit. But this is usually enormous, and
> : the user isn't necessarily going to map all that memory at the same time
> : anyway.
> :
> : And the logic is wrong - it is comparing rss (which is in bytes) with
> : `end - start', which is in pages.
> :
> : And it returns -EIO on error, which is not mentioned in the Open Group
> : spec and doesn't make sense.
> :
> : This patch takes it all out and applies the same upper limit as is used in
> : sys_readahead() - half the inactive list.
> :
> : +/*
> : + * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> : + * sensible upper limit.
> : + */
> : +unsigned long max_sane_readahead(unsigned long nr)
> : +{
> : +	unsigned long active;
> : +	unsigned long inactive;
> : +
> : +	get_zone_counts(&active, &inactive);
> : +	return min(nr, inactive / 2);
> : +}
>
> And one would need to go back further still to understand the rationale
> for the sys_readahead() decision and that even predates the BK repo.
>
> iirc the thinking was that we need _some_ limit on readahead size so
> the user can't go and do ridiculously large amounts of readahead via
> sys_readahead(). But that doesn't make a lot of sense because the user
> could do the same thing with plain old read().
>
> So for argument's sake I'm thinking we just kill it altogether and
> permit arbitrarily large readahead:
>
> --- a/mm/readahead.c~a
> +++ a/mm/readahead.c
> @@ -238,13 +238,12 @@ int force_page_cache_readahead(struct ad
>  }
>
>  /*
> - * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> - * sensible upper limit.
> + * max_sane_readahead() is disabled. It can later be removed altogether, but
> + * let's keep a skeleton in place for now, in case disabling was the wrong call.
>  */
>  unsigned long max_sane_readahead(unsigned long nr)
>  {
> -	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> -		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> +	return nr;
>  }
>
>  /*
>
> Can anyone see a problem with this?
Well, the downside seems to be that if userspace previously issued MADV/FADV_WILLNEED on a huge file, we trimmed the request to a sensible size. Now we try to read the whole huge file which is pretty much guaranteed to be useless (as we'll be pushing out of cache data we just read a while ago). And guessing the right readahead size from userspace isn't trivial so it would make WILLNEED advice less useful. What do you think?
								Honza
--
Jan Kara
SUSE Labs, CR
Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node
On 12/05/2013 03:18 AM, Andrew Morton wrote:
> On Wed, 04 Dec 2013 14:38:11 +0530 Raghavendra K T wrote:
> > On 12/04/2013 02:11 PM, Andrew Morton wrote:
>
> : This patch takes it all out and applies the same upper limit as is used in
> : sys_readahead() - half the inactive list.
> :
> : +/*
> : + * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> : + * sensible upper limit.
> : + */
> : +unsigned long max_sane_readahead(unsigned long nr)
> : +{
> : +	unsigned long active;
> : +	unsigned long inactive;
> : +
> : +	get_zone_counts(&active, &inactive);
> : +	return min(nr, inactive / 2);
> : +}

Hi Andrew,
Thanks for digging that out. So it seems like earlier we had not even considered free pages?

> And one would need to go back further still to understand the rationale
> for the sys_readahead() decision and that even predates the BK repo.
>
> iirc the thinking was that we need _some_ limit on readahead size so
> the user can't go and do ridiculously large amounts of readahead via
> sys_readahead(). But that doesn't make a lot of sense because the user
> could do the same thing with plain old read().

True.

> So for argument's sake I'm thinking we just kill it altogether and
> permit arbitrarily large readahead:
>
> --- a/mm/readahead.c~a
> +++ a/mm/readahead.c
> @@ -238,13 +238,12 @@ int force_page_cache_readahead(struct ad
>  }
>
>  /*
> - * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> - * sensible upper limit.
> + * max_sane_readahead() is disabled. It can later be removed altogether, but
> + * let's keep a skeleton in place for now, in case disabling was the wrong call.
>  */
>  unsigned long max_sane_readahead(unsigned long nr)
>  {
> -	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> -		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> +	return nr;
>  }

I had something like below in mind for posting. But it looks simple now with your patch.
unsigned long max_sane_readahead(unsigned long nr)
{
	int nid;
	unsigned long free_page = 0;

	for_each_node_state(nid, N_MEMORY)
		free_page += node_page_state(nid, NR_INACTIVE_FILE) +
			     node_page_state(nid, NR_FREE_PAGES);

	/*
	 * Readahead onto remote memory is better than no readahead when the
	 * local numa node does not have memory. We sanitize the readahead
	 * size depending on potential free memory in the whole system.
	 */
	return min(nr, free_page / (2 * nr_node_ids));
}

Or, if we wanted to avoid iterating over the nodes, simply return something like nr/8 for the remote numa fault cases.
Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node
On Wed, 04 Dec 2013 14:38:11 +0530 Raghavendra K T wrote:
> On 12/04/2013 02:11 PM, Andrew Morton wrote:
> > On Wed, 04 Dec 2013 14:00:09 +0530 Raghavendra K T wrote:
> >
> >> Unfortunately, from my search, I saw that the code belonged to pre-git
> >> time, so could not get much information on that.
> >
> > Here: https://lkml.org/lkml/2004/8/20/242
> >
> > It seems it was done as a rather thoughtless performance optimisation.
> > I'd say it's time to reimplement max_sane_readahead() from scratch.
>
> Ok. Thanks for the link. I think after that, here it was changed to
> per-node: https://lkml.org/lkml/2004/8/21/9 to avoid iterating all over.
>
> do you think above patch (+comments) with some sanitized nr (thus
> avoiding iteration over nodes in remote numa readahead case) does look
> better? or should we iterate all memory.

I dunno, the whole thing smells of arbitrary woolly thinking to me. Going back further in time..

: commit f76d03dc9fcff7ac88e2d23c5814fd0f50c59bb6
: Author: akpm
: AuthorDate: Sun Dec 15 03:18:58 2002 +0000
: Commit: akpm
: CommitDate: Sun Dec 15 03:18:58 2002 +0000
:
: [PATCH] madvise_willneed() maximum readahead checking
:
: madvise_willneed() currently has a very strange check on how much readahead
: it is prepared to do.
:
: It is based on the user's rss limit. But this is usually enormous, and
: the user isn't necessarily going to map all that memory at the same time
: anyway.
:
: And the logic is wrong - it is comparing rss (which is in bytes) with
: `end - start', which is in pages.
:
: And it returns -EIO on error, which is not mentioned in the Open Group
: spec and doesn't make sense.
:
: This patch takes it all out and applies the same upper limit as is used in
: sys_readahead() - half the inactive list.
:
: +/*
: + * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
: + * sensible upper limit.
: + */
: +unsigned long max_sane_readahead(unsigned long nr)
: +{
: +	unsigned long active;
: +	unsigned long inactive;
: +
: +	get_zone_counts(&active, &inactive);
: +	return min(nr, inactive / 2);
: +}

And one would need to go back further still to understand the rationale for the sys_readahead() decision and that even predates the BK repo.

iirc the thinking was that we need _some_ limit on readahead size so the user can't go and do ridiculously large amounts of readahead via sys_readahead(). But that doesn't make a lot of sense because the user could do the same thing with plain old read().

So for argument's sake I'm thinking we just kill it altogether and permit arbitrarily large readahead:

--- a/mm/readahead.c~a
+++ a/mm/readahead.c
@@ -238,13 +238,12 @@ int force_page_cache_readahead(struct ad
 }

 /*
- * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
- * sensible upper limit.
+ * max_sane_readahead() is disabled. It can later be removed altogether, but
+ * let's keep a skeleton in place for now, in case disabling was the wrong call.
 */
 unsigned long max_sane_readahead(unsigned long nr)
 {
-	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
-		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+	return nr;
 }

 /*

Can anyone see a problem with this?
Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node
On 12/04/2013 02:11 PM, Andrew Morton wrote:
> On Wed, 04 Dec 2013 14:00:09 +0530 Raghavendra K T wrote:
>
>> Unfortunately, from my search, I saw that the code belonged to pre-git
>> time, so could not get much information on that.
>
> Here: https://lkml.org/lkml/2004/8/20/242
>
> It seems it was done as a rather thoughtless performance optimisation.
> I'd say it's time to reimplement max_sane_readahead() from scratch.

Ok. Thanks for the link. I think after that, here it was changed to
per-node: https://lkml.org/lkml/2004/8/21/9 to avoid iterating over all
nodes.

Do you think the above patch (+comments) with some sanitized nr (thus
avoiding iteration over nodes in the remote numa readahead case) looks
better? Or should we iterate over all memory?
Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node
On Wed, 04 Dec 2013 14:00:09 +0530 Raghavendra K T wrote:

> > I don't recall the rationale for the current code and of course we
> > didn't document it.  It might be in the changelogs somewhere - could
> > you please do the git digging and see if you can find out?
>
> Unfortunately, from my search, I saw that the code belonged to pre-git
> time, so could not get much information on that.

Here: https://lkml.org/lkml/2004/8/20/242

It seems it was done as a rather thoughtless performance optimisation.
I'd say it's time to reimplement max_sane_readahead() from scratch.
Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node
Thank you Andrew.

On 12/04/2013 04:08 AM, Andrew Morton wrote:
> On Tue, 3 Dec 2013 16:06:17 +0530 Raghavendra K T wrote:
>
>> On a cpu with an empty numa node,
>
> This makes no sense - numa nodes don't reside on CPUs.  I think you
> mean "on a CPU which resides on a memoryless NUMA node"?

You are right. I was not precise there. I had this example in mind while
talking:

IBM P730
--------
# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24 25 26 27 28 29 30 31
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus:
node 1 size: 12288 MB
node 1 free: 10440 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

>> readahead fails because max_sane_readahead returns zero. The reason is
>> we look into the number of inactive + free pages available on the
>> current node.
>>
>> The following patch tries to fix the behaviour by checking for
>> potential empty numa node cases. The rationale for the patch is that
>> readahead may be worth doing on a remote node instead of incurring
>> costly disk faults later.
>>
>> I still feel we may have to sanitize the nr below (e.g., nr/8) to
>> avoid serious consequences of a malicious application trying to do a
>> big readahead on an empty numa node, causing unnecessary load on
>> remote nodes. (Or it may even be that the current behaviour is right
>> in not going ahead with readahead, to avoid the memory load on remote
>> nodes.)
>
> I don't recall the rationale for the current code and of course we
> didn't document it.  It might be in the changelogs somewhere - could
> you please do the git digging and see if you can find out?

Unfortunately, from my search, I saw that the code belonged to pre-git
time, so could not get much information on that.

> I don't immediately see why readahead into a different node is
> considered a bad thing.

Ok.
>> --- a/mm/readahead.c
>> +++ b/mm/readahead.c
>> @@ -243,8 +243,11 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
>>   */
>>  unsigned long max_sane_readahead(unsigned long nr)
>>  {
>> -	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
>> -		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
>> +	unsigned long numa_free_page;
>> +	numa_free_page = (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
>> +		+ node_page_state(numa_node_id(), NR_FREE_PAGES));
>> +
>> +	return numa_free_page ? min(nr, numa_free_page / 2) : nr;
>
> Well even if this CPU's node has very little pagecache at all, what's
> wrong with permitting readahead?  We don't know that the new pagecache
> will be allocated exclusively from this CPU's node anyway.  All very
> odd.

True, we do not know from where it gets allocated, and I completely agree
that I could not think of why we should not consider the entire memory
rather than tying our decision to one node. Or is this a proactive case,
to stop worsening the situation when the system is really short of
memory? Do let me know if you have any idea to handle the 'little cache'
case, or whether you think the current one is simple enough for now to
live with.

> Whatever we do, we should leave behind some good code comments which
> explain the rationale(s), please.  Right now it's rather opaque.

Yes. For the current code, maybe we have to have a comment something
like:

/*
 * Sanitize readahead when we have less memory on the current node.
 * We do not want to load remote memory with readahead in this case.
 */

and if this patch is okay, then something like:

/*
 * Sanitized readahead onto remote memory is better than no readahead
 * when the local numa node does not have memory. If the local numa node
 * has less memory, we trim the readahead size depending on the potential
 * free memory available.
 */
Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node
On 12/05/2013 03:18 AM, Andrew Morton wrote:
> On Wed, 04 Dec 2013 14:38:11 +0530 Raghavendra K T wrote:
>> On 12/04/2013 02:11 PM, Andrew Morton wrote:
>
> :
> : This patch takes it all out and applies the same upper limit as is used in
> : sys_readahead() - half the inactive list.
> :
> : +/*
> : + * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> : + * sensible upper limit.
> : + */
> : +unsigned long max_sane_readahead(unsigned long nr)
> : +{
> : +	unsigned long active;
> : +	unsigned long inactive;
> : +
> : +	get_zone_counts(&active, &inactive);
> : +	return min(nr, inactive / 2);
> : +}

Hi Andrew, thanks for digging that out. So it seems that earlier we had
not even considered free pages?

> And one would need to go back further still to understand the rationale
> for the sys_readahead() decision, and that even predates the BK repo.
>
> iirc the thinking was that we need _some_ limit on readahead size so
> the user can't go and do ridiculously large amounts of readahead via
> sys_readahead().  But that doesn't make a lot of sense because the user
> could do the same thing with plain old read().

True.

> So for argument's sake I'm thinking we just kill it altogether and
> permit arbitrarily large readahead:
>
> --- a/mm/readahead.c~a
> +++ a/mm/readahead.c
> @@ -238,13 +238,12 @@ int force_page_cache_readahead(struct ad
>  }
>  
>  /*
> - * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> - * sensible upper limit.
> + * max_sane_readahead() is disabled.  It can later be removed altogether, but
> + * let's keep a skeleton in place for now, in case disabling was the wrong call.
>   */
>  unsigned long max_sane_readahead(unsigned long nr)
>  {
> -	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> -		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> +	return nr;
>  }

I had something like below in mind for posting. But it looks simple now
with your patch.
unsigned long max_sane_readahead(unsigned long nr)
{
	int nid;
	unsigned long free_page = 0;

	for_each_node_state(nid, N_MEMORY)
		free_page += node_page_state(nid, NR_INACTIVE_FILE)
			  + node_page_state(nid, NR_FREE_PAGES);

	/*
	 * Readahead onto remote memory is better than no readahead when the
	 * local numa node does not have memory. We sanitize the readahead
	 * size depending on the potential free memory in the whole system.
	 */
	return min(nr, free_page / (2 * nr_node_ids));
}

Or, if we wanted to avoid iterating over the nodes, simply return
something like nr/8 for the remote numa fault case.
Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node
On Tue, 3 Dec 2013 16:06:17 +0530 Raghavendra K T wrote:

> On a cpu with an empty numa node,

This makes no sense - numa nodes don't reside on CPUs.  I think you mean
"on a CPU which resides on a memoryless NUMA node"?

> readahead fails because max_sane_readahead returns zero. The reason is
> we look into the number of inactive + free pages available on the
> current node.
>
> The following patch tries to fix the behaviour by checking for
> potential empty numa node cases. The rationale for the patch is that
> readahead may be worth doing on a remote node instead of incurring
> costly disk faults later.
>
> I still feel we may have to sanitize the nr below (e.g., nr/8) to avoid
> serious consequences of a malicious application trying to do a big
> readahead on an empty numa node, causing unnecessary load on remote
> nodes. (Or it may even be that the current behaviour is right in not
> going ahead with readahead, to avoid the memory load on remote nodes.)

I don't recall the rationale for the current code and of course we didn't
document it.  It might be in the changelogs somewhere - could you please
do the git digging and see if you can find out?

I don't immediately see why readahead into a different node is considered
a bad thing.

> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -243,8 +243,11 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
>   */
>  unsigned long max_sane_readahead(unsigned long nr)
>  {
> -	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> -		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> +	unsigned long numa_free_page;
> +	numa_free_page = (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> +		+ node_page_state(numa_node_id(), NR_FREE_PAGES));
> +
> +	return numa_free_page ? min(nr, numa_free_page / 2) : nr;

Well even if this CPU's node has very little pagecache at all, what's
wrong with permitting readahead?
We don't know that the new pagecache will be allocated exclusively from
this CPU's node anyway.

All very odd.

Whatever we do, we should leave behind some good code comments which
explain the rationale(s), please.  Right now it's rather opaque.