Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node

2013-12-31 Thread Raghavendra K T

On 12/14/2013 06:09 AM, Linus Torvalds wrote:

On Wed, Dec 11, 2013 at 3:05 PM, Andrew Morton
 wrote:


But I'm really struggling to think up an implementation!  The current
code looks only at the caller's node and doesn't seem to make much
sense.  Should we look at all nodes?  Hard to say without prior
knowledge of where those pages will be coming from.


I really think we want to put an upper bound on the read-ahead, and
I'm not convinced we need to try to be excessively clever about it. We
also probably don't want to make it too expensive to calculate,
because afaik this ends up being called for each file we open when we
don't have pages in the page cache yet.

The current function seems reasonable on a single-node system. Let's
not kill it entirely just because it has some odd corner-case on
multi-node systems.

In fact, for all I care, I think it would be perfectly ok to just use
a truly stupid hard limit ("you can't read-ahead more than 16MB" or
whatever).

What we do *not* want to allow is to have people call "readahead"
functions and basically kill the machine because you now have an
unkillable IO that is insanely big. So I'd much rather limit it too
much than too little. And on absolutely no sane IO subsystem does it
make sense to read ahead insane amounts.

So I'd rather limit it to something stupid and small, than to not
limit things at all.

Looking at the interface, for example, the natural thing to do for the
"readahead()" system call is to just give it a size of
~0ul, and let the system limit things, because limiting things in user
space is just not reasonable.

So I really do *not* think it's fine to just remove the limit entirely.



Very sorry for the late reply (was on a very long vacation).

How about having a 16MB limit only for remote readaheads and continuing
with the rest as is, something like below:

#define MAX_REMOTE_READAHEAD	4096UL

unsigned long max_sane_readahead(unsigned long nr)
{
	unsigned long local_free_page = node_page_state(numa_node_id(), NR_INACTIVE_FILE)
				      + node_page_state(numa_node_id(), NR_FREE_PAGES);
	unsigned long sane_nr = min(nr, MAX_REMOTE_READAHEAD);

	return local_free_page ? min(nr, local_free_page / 2) : sane_nr;
}

Or we can enforce the 16MB limit for all the cases too.

I'll send a patch accordingly.

(The readahead max above will scale accordingly if we don't have a 4k
 page size.)
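
For illustration, a minimal sketch of that page-size point (not from the
posted patch; MAX_REMOTE_READAHEAD_BYTES is a hypothetical name): the cap
could be expressed in bytes and converted to pages, so it stays at 16MB for
any PAGE_SIZE instead of growing with the page size as a fixed page count
does.

/*
 * Hypothetical variant, assuming we want a fixed 16MB remote cap that is
 * independent of PAGE_SIZE (the 4096UL above is a page count instead).
 */
#define MAX_REMOTE_READAHEAD_BYTES	(16UL << 20)
#define MAX_REMOTE_READAHEAD		(MAX_REMOTE_READAHEAD_BYTES / PAGE_SIZE)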



Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node

2013-12-13 Thread Linus Torvalds
On Wed, Dec 11, 2013 at 3:05 PM, Andrew Morton
 wrote:
>
> But I'm really struggling to think up an implementation!  The current
> code looks only at the caller's node and doesn't seem to make much
> sense.  Should we look at all nodes?  Hard to say without prior
> knowledge of where those pages will be coming from.

I really think we want to put an upper bound on the read-ahead, and
I'm not convinced we need to try to be excessively clever about it. We
also probably don't want to make it too expensive to calculate,
because afaik this ends up being called for each file we open when we
don't have pages in the page cache yet.

The current function seems reasonable on a single-node system. Let's
not kill it entirely just because it has some odd corner-case on
multi-node systems.

In fact, for all I care, I think it would be perfectly ok to just use
a truly stupid hard limit ("you can't read-ahead more than 16MB" or
whatever).

What we do *not* want to allow is to have people call "readahead"
functions and basically kill the machine because you now have an
unkillable IO that is insanely big. So I'd much rather limit it too
much than too little. And on absolutely no sane IO subsystem does it
make sense to read ahead insane amounts.

So I'd rather limit it to something stupid and small, than to not
limit things at all.
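
(As a concrete illustration of the "stupid hard limit" idea, a sketch only
and not a patch actually proposed in this thread: the cap could be made
entirely static.)

/*
 * Illustrative sketch of a fixed cap: never read ahead more than 16MB,
 * independent of any per-node memory statistics.
 */
unsigned long max_sane_readahead(unsigned long nr)
{
	return min(nr, (16UL << 20) / PAGE_SIZE);
}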

Looking at the interface, for example, the natural thing to do for the
"readahead()" system call is to just give it a size of
~0ul, and let the system limit things, because limiting things in user
space is just not reasonable.
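
(For reference, a userspace sketch of that usage pattern, with "datafile" as
a placeholder path: the caller passes a huge count and relies on the kernel
to clamp it.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("datafile", O_RDONLY);	/* placeholder file name */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* "read ahead as much as makes sense"; the kernel applies the limit */
	readahead(fd, 0, ~0UL);
	close(fd);
	return 0;
}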

So I really do *not* think it's fine to just remove the limit entirely.

  Linus


Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node

2013-12-12 Thread Jan Kara
On Wed 11-12-13 15:05:22, Andrew Morton wrote:
> On Wed, 11 Dec 2013 23:49:17 +0100 Jan Kara  wrote:
> 
> > >  /*
> > > - * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> > > - * sensible upper limit.
> > > + * max_sane_readahead() is disabled.  It can later be removed altogether, but
> > > + * let's keep a skeleton in place for now, in case disabling was the wrong call.
> > >   */
> > >  unsigned long max_sane_readahead(unsigned long nr)
> > >  {
> > > - return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> > > - + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> > > + return nr;
> > >  }
> > >  
> > >  /*
> > > 
> > > Can anyone see a problem with this?
> >   Well, the downside seems to be that if userspace previously issued
> > MADV/FADV_WILLNEED on a huge file, we trimmed the request to a sensible
> > size. Now we try to read the whole huge file which is pretty much
> > guaranteed to be useless (as we'll be pushing out of cache data we just
> > read a while ago). And guessing the right readahead size from userspace
> > isn't trivial so it would make WILLNEED advice less useful. What do you
> > think?
> 
> OK, yes, there is conceivably a back-compatibility issue there.  There
> indeed might be applications which decide to chuck the whole thing at
> the kernel and let the kernel work out what is a sensible readahead
> size to perform.
>
> But I'm really struggling to think up an implementation!  The current
> code looks only at the caller's node and doesn't seem to make much
> sense.  Should we look at all nodes?  Hard to say without prior
> knowledge of where those pages will be coming from.
  Well, I believe that we might have some compatibility issues only for
non-NUMA machines - there the current logic makes sense. For NUMA machines
I believe we are free to do basically anything because results of the
current logic are pretty random.

Thinking about proper implementation for NUMA - max_sane_readahead() is
really interesting for madvise() and fadvise() calls (standard on demand
readahead is bounded by bdi->ra_pages which tends to be pretty low anyway
(like 512K or so)). For these calls we will do the reads from the process
issuing the [fm]advise() call and thus we will allocate pages depending on
the NUMA policy. So depending on this policy we should be able to pick some
estimate on the number of available pages, shouldn't we?

BTW, the fact that [fm]advise() calls submit all reads synchronously is
another reason why we should bound the readahead requests to a sensible
size.
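
(For reference, a minimal userspace sketch of the kind of request being
discussed, with "hugefile" as a placeholder path; an unbounded WILLNEED hint
on a huge file is exactly what max_sane_readahead() used to trim:)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
	struct stat st;
	int fd = open("hugefile", O_RDONLY);	/* placeholder file name */

	if (fd < 0 || fstat(fd, &st) < 0) {
		perror("open/fstat");
		return 1;
	}
	/* Hint that the whole file will be needed; the reads are issued
	 * synchronously from this process, so the request size matters. */
	posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);
	close(fd);
	return 0;
}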

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node

2013-12-11 Thread Andrew Morton
On Wed, 11 Dec 2013 23:49:17 +0100 Jan Kara  wrote:

> >  /*
> > - * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> > - * sensible upper limit.
> > + * max_sane_readahead() is disabled.  It can later be removed altogether, but
> > + * let's keep a skeleton in place for now, in case disabling was the wrong call.
> >   */
> >  unsigned long max_sane_readahead(unsigned long nr)
> >  {
> > -   return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> > -   + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> > +   return nr;
> >  }
> >  
> >  /*
> > 
> > Can anyone see a problem with this?
>   Well, the downside seems to be that if userspace previously issued
> MADV/FADV_WILLNEED on a huge file, we trimmed the request to a sensible
> size. Now we try to read the whole huge file which is pretty much
> guaranteed to be useless (as we'll be pushing out of cache data we just
> read a while ago). And guessing the right readahead size from userspace
> isn't trivial so it would make WILLNEED advice less useful. What do you
> think?

OK, yes, there is conceivably a back-compatibility issue there.  There
indeed might be applications which decide to chuck the whole thing at
the kernel and let the kernel work out what is a sensible readahead
size to perform.

But I'm really struggling to think up an implementation!  The current
code looks only at the caller's node and doesn't seem to make much
sense.  Should we look at all nodes?  Hard to say without prior
knowledge of where those pages will be coming from.



Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node

2013-12-11 Thread Jan Kara
On Wed 04-12-13 13:48:38, Andrew Morton wrote:
> On Wed, 04 Dec 2013 14:38:11 +0530 Raghavendra K T 
>  wrote:
> 
> > On 12/04/2013 02:11 PM, Andrew Morton wrote:
> > > On Wed, 04 Dec 2013 14:00:09 +0530 Raghavendra K T 
> > >  wrote:
> > >
> > >> Unfortunately, from my search, I saw that the code belonged to pre-git
> > >> time, so could not get much information on that.
> > >
> > > Here: https://lkml.org/lkml/2004/8/20/242
> > >
> > > It seems it was done as a rather thoughtless performance optimisation.
> > > I'd say it's time to reimplement max_sane_readahead() from scratch.
> > >
> > 
> > Ok. Thanks for the link. I think after that,
> > Here it was changed to pernode:
> > https://lkml.org/lkml/2004/8/21/9 to avoid iteration all over.
> > 
> > do you think above patch (+comments) with some sanitized nr (thus
> > avoiding iteration over nodes in remote numa readahead case) does look
> > better?
> > or should we iterate all memory.
> 
> I dunno, the whole thing smells of arbitrary woolly thinking to me. 
> Going back further in time..
> 
> : commit f76d03dc9fcff7ac88e2d23c5814fd0f50c59bb6
> : Author: akpm 
> : AuthorDate: Sun Dec 15 03:18:58 2002 +
> : Commit: akpm 
> : CommitDate: Sun Dec 15 03:18:58 2002 +
> : 
> : [PATCH] madvise_willneed() maximum readahead checking
> : 
> : madvise_willneed() currently has a very strange check on how much readahead
> : it is prepared to do.
> : 
> :   It is based on the user's rss limit.  But this is usually enormous, and
> :   the user isn't necessarily going to map all that memory at the same time
> :   anyway.
> : 
> :   And the logic is wrong - it is comparing rss (which is in bytes) with
> :   `end - start', which is in pages.
> : 
> :   And it returns -EIO on error, which is not mentioned in the Open Group
> :   spec and doesn't make sense.
> : 
> : 
> : This patch takes it all out and applies the same upper limit as is used in
> : sys_readahead() - half the inactive list.
> : 
> : +/*
> : + * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> : + * sensible upper limit.
> : + */
> : +unsigned long max_sane_readahead(unsigned long nr)
> : +{
> : +   unsigned long active;
> : +   unsigned long inactive;
> : +
> : +   get_zone_counts(&active, &inactive);
> : +   return min(nr, inactive / 2);
> : +}
> 
> And one would need to go back further still to understand the rationale
> for the sys_readahead() decision and that even predates the BK repo.
> 
> iirc the thinking was that we need _some_ limit on readahead size so
> the user can't go and do ridiculously large amounts of readahead via
> sys_readahead().  But that doesn't make a lot of sense because the user
> could do the same thing with plain old read().
> 
> So for argument's sake I'm thinking we just kill it altogether and
> permit arbitrarily large readahead:
> 
> --- a/mm/readahead.c~a
> +++ a/mm/readahead.c
> @@ -238,13 +238,12 @@ int force_page_cache_readahead(struct ad
>  }
>  
>  /*
> - * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> - * sensible upper limit.
> + * max_sane_readahead() is disabled.  It can later be removed altogether, but
> + * let's keep a skeleton in place for now, in case disabling was the wrong call.
>   */
>  unsigned long max_sane_readahead(unsigned long nr)
>  {
> - return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> - + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> + return nr;
>  }
>  
>  /*
> 
> Can anyone see a problem with this?
  Well, the downside seems to be that if userspace previously issued
MADV/FADV_WILLNEED on a huge file, we trimmed the request to a sensible
size. Now we try to read the whole huge file which is pretty much
guaranteed to be useless (as we'll be pushing out of cache data we just
read a while ago). And guessing the right readahead size from userspace
isn't trivial so it would make WILLNEED advice less useful. What do you
think?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node

2013-12-04 Thread Raghavendra K T

On 12/05/2013 03:18 AM, Andrew Morton wrote:

On Wed, 04 Dec 2013 14:38:11 +0530 Raghavendra K T 
 wrote:


On 12/04/2013 02:11 PM, Andrew Morton wrote:

:
: This patch takes it all out and applies the same upper limit as is used in
: sys_readahead() - half the inactive list.
:
: +/*
: + * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
: + * sensible upper limit.
: + */
: +unsigned long max_sane_readahead(unsigned long nr)
: +{
: +   unsigned long active;
: +   unsigned long inactive;
: +
: +   get_zone_counts(&active, &inactive);
: +   return min(nr, inactive / 2);
: +}



Hi Andrew, thanks for digging that out. So it seems that earlier we had not
even considered free pages?


And one would need to go back further still to understand the rationale
for the sys_readahead() decision and that even predates the BK repo.

iirc the thinking was that we need _some_ limit on readahead size so
the user can't go and do ridiculously large amounts of readahead via
sys_readahead().  But that doesn't make a lot of sense because the user
could do the same thing with plain old read().



True.


So for argument's sake I'm thinking we just kill it altogether and
permit arbitrarily large readahead:

--- a/mm/readahead.c~a
+++ a/mm/readahead.c
@@ -238,13 +238,12 @@ int force_page_cache_readahead(struct ad
  }

  /*
- * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
- * sensible upper limit.
+ * max_sane_readahead() is disabled.  It can later be removed altogether, but
+ * let's keep a skeleton in place for now, in case disabling was the wrong call.
   */
  unsigned long max_sane_readahead(unsigned long nr)
  {
-   return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
-   + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+   return nr;
  }



I had something like the below in mind for posting. But it looks
simpler now with your patch.


unsigned long max_sane_readahead(unsigned long nr)
{
	int nid;
	unsigned long free_page = 0;

	for_each_node_state(nid, N_MEMORY)
		free_page += node_page_state(nid, NR_INACTIVE_FILE)
			   + node_page_state(nid, NR_FREE_PAGES);

	/*
	 * Readahead onto remote memory is better than no readahead when local
	 * numa node does not have memory. We sanitize readahead size depending
	 * on potential free memory in the whole system.
	 */
	return min(nr, free_page / (2 * nr_node_ids));
}

Or, if we wanted to avoid iterating over the nodes, simply return
something like nr/8 for the remote numa fault cases.
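
A rough sketch of that simpler alternative (illustrative only, not a posted
patch): keep the existing local-node check and just fall back to nr/8 when
the local node is memoryless.

unsigned long max_sane_readahead(unsigned long nr)
{
	unsigned long local_free_page = node_page_state(numa_node_id(), NR_INACTIVE_FILE)
				      + node_page_state(numa_node_id(), NR_FREE_PAGES);

	/* Memoryless local node: allow a trimmed readahead onto remote nodes. */
	return local_free_page ? min(nr, local_free_page / 2) : nr / 8;
}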




Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node

2013-12-04 Thread Andrew Morton
On Wed, 04 Dec 2013 14:38:11 +0530 Raghavendra K T 
 wrote:

> On 12/04/2013 02:11 PM, Andrew Morton wrote:
> > On Wed, 04 Dec 2013 14:00:09 +0530 Raghavendra K T 
> >  wrote:
> >
> >> Unfortunately, from my search, I saw that the code belonged to pre-git
> >> time, so could not get much information on that.
> >
> > Here: https://lkml.org/lkml/2004/8/20/242
> >
> > It seems it was done as a rather thoughtless performance optimisation.
> > I'd say it's time to reimplement max_sane_readahead() from scratch.
> >
> 
> Ok. Thanks for the link. I think after that,
> Here it was changed to pernode:
> https://lkml.org/lkml/2004/8/21/9 to avoid iteration all over.
> 
> do you think above patch (+comments) with some sanitized nr (thus
> avoiding iteration over nodes in remote numa readahead case) does look
> better?
> or should we iterate all memory.

I dunno, the whole thing smells of arbitrary woolly thinking to me. 
Going back further in time..

: commit f76d03dc9fcff7ac88e2d23c5814fd0f50c59bb6
: Author: akpm 
: AuthorDate: Sun Dec 15 03:18:58 2002 +
: Commit: akpm 
: CommitDate: Sun Dec 15 03:18:58 2002 +
: 
: [PATCH] madvise_willneed() maximum readahead checking
: 
: madvise_willneed() currently has a very strange check on how much readahead
: it is prepared to do.
: 
:   It is based on the user's rss limit.  But this is usually enormous, and
:   the user isn't necessarily going to map all that memory at the same time
:   anyway.
: 
:   And the logic is wrong - it is comparing rss (which is in bytes) with
:   `end - start', which is in pages.
: 
:   And it returns -EIO on error, which is not mentioned in the Open Group
:   spec and doesn't make sense.
: 
: 
: This patch takes it all out and applies the same upper limit as is used in
: sys_readahead() - half the inactive list.
: 
: +/*
: + * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
: + * sensible upper limit.
: + */
: +unsigned long max_sane_readahead(unsigned long nr)
: +{
: +   unsigned long active;
: +   unsigned long inactive;
: +
: +   get_zone_counts(&active, &inactive);
: +   return min(nr, inactive / 2);
: +}

And one would need to go back further still to understand the rationale
for the sys_readahead() decision and that even predates the BK repo.

iirc the thinking was that we need _some_ limit on readahead size so
the user can't go and do ridiculously large amounts of readahead via
sys_readahead().  But that doesn't make a lot of sense because the user
could do the same thing with plain old read().
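
(As an aside, a purely illustrative userspace sketch of that equivalence,
with "hugefile" as a placeholder path: a plain read() loop drags the same
data into the page cache, just synchronously and through a buffer.)

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	static char buf[1 << 16];
	int fd = open("hugefile", O_RDONLY);	/* placeholder file name */

	if (fd < 0)
		return 1;
	/* Reading the whole file populates the page cache much like an
	 * unbounded readahead request would. */
	while (read(fd, buf, sizeof(buf)) > 0)
		;
	close(fd);
	return 0;
}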

So for argument's sake I'm thinking we just kill it altogether and
permit arbitrarily large readahead:

--- a/mm/readahead.c~a
+++ a/mm/readahead.c
@@ -238,13 +238,12 @@ int force_page_cache_readahead(struct ad
 }
 
 /*
- * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
- * sensible upper limit.
+ * max_sane_readahead() is disabled.  It can later be removed altogether, but
+ * let's keep a skeleton in place for now, in case disabling was the wrong call.
  */
 unsigned long max_sane_readahead(unsigned long nr)
 {
-   return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
-   + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+   return nr;
 }
 
 /*

Can anyone see a problem with this?


Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node

2013-12-04 Thread Raghavendra K T

On 12/04/2013 02:11 PM, Andrew Morton wrote:

On Wed, 04 Dec 2013 14:00:09 +0530 Raghavendra K T 
 wrote:


Unfortunately, from my search, I saw that the code belonged to pre-git
time, so could not get much information on that.


Here: https://lkml.org/lkml/2004/8/20/242

It seems it was done as a rather thoughtless performance optimisation.
I'd say it's time to reimplement max_sane_readahead() from scratch.



Ok, thanks for the link. I think after that, it was changed to per-node here:
https://lkml.org/lkml/2004/8/21/9 to avoid iterating over everything.

Do you think the above patch (+comments), with some sanitized nr (thus
avoiding iteration over nodes in the remote numa readahead case), looks
better? Or should we iterate over all memory?






Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node

2013-12-04 Thread Andrew Morton
On Wed, 04 Dec 2013 14:00:09 +0530 Raghavendra K T 
 wrote:

> > I don't recall the rationale for the current code and of course we
> > didn't document it.  It might be in the changelogs somewhere - could
> > you please do the git digging and see if you can find out?
> 
> Unfortunately, from my search, I saw that the code belonged to pre-git
> time, so could not get much information on that.

Here: https://lkml.org/lkml/2004/8/20/242

It seems it was done as a rather thoughtless performance optimisation. 
I'd say it's time to reimplement max_sane_readahead() from scratch.



Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node

2013-12-04 Thread Raghavendra K T


Thank you Andrew.

On 12/04/2013 04:08 AM, Andrew Morton wrote:

On Tue,  3 Dec 2013 16:06:17 +0530 Raghavendra K T 
 wrote:


On a cpu with an empty numa node,


This makes no sense - numa nodes don't reside on CPUs.

I think you mean "on a CPU which resides on a memoryless NUMA node"?


You are right. I was not precise there.
I had this example in mind while talking.

IBM P730
--
# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus:
node 1 size: 12288 MB
node 1 free: 10440 MB
node distances:
node   0   1
0:  10  40
1:  40  10




readahead fails because max_sane_readahead
returns zero. The reason is that we look at the number of inactive + free pages
available on the current node.

The following patch tries to fix the behaviour by checking for potential
empty numa node cases.
The rationale for the patch is that readahead may be worth doing on a remote
node instead of incurring costly disk faults later.

I still feel we may have to sanitize the nr below (e.g., nr/8)
to avoid serious consequences of a malicious application trying to do
a big readahead on an empty numa node, causing unnecessary load on remote nodes.
(Or it may even be that the current behaviour is right in not going ahead with
readahead, to avoid the memory load on remote nodes.)



I don't recall the rationale for the current code and of course we
didn't document it.  It might be in the changelogs somewhere - could
you please do the git digging and see if you can find out?


Unfortunately, from my search, I saw that the code belonged to pre-git
time, so could not get much information on that.



I don't immediately see why readahead into a different node is
considered a bad thing.



Ok.


--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -243,8 +243,11 @@ int force_page_cache_readahead(struct address_space 
*mapping, struct file *filp,
   */
  unsigned long max_sane_readahead(unsigned long nr)
  {
-   return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
-   + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+   unsigned long numa_free_page;
+   numa_free_page = (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
+  + node_page_state(numa_node_id(), NR_FREE_PAGES));
+
+   return numa_free_page ? min(nr, numa_free_page / 2) : nr;


Well even if this CPU's node has very little pagecache at all, what's
wrong with permitting readahead?  We don't know that the new pagecache
will be allocated exclusively from this CPU's node anyway.  All very
odd.



True, we do not know where it gets allocated from, and I also completely
agree that I could not think of why we should not consider the entire
memory rather than sticking our decision to one node.

Or is this one of the proactive cases to stop a worsening situation when
the system is really short of memory?


Do let me know if you have any idea to handle the 'little cache' case,
or do you think the current one is simple enough for now to live with?


Whatever we do, we should leave behind some good code comments which
explain the rationale(s), please.  Right now it's rather opaque.



Yes. For the current code maybe we have to have a comment, something
like:

/*
 * Sanitize readahead when we have less memory on the current node.
 * We do not want to load remote memory with readahead in this case.
 */

And if this patch is okay, then something like:

/*
 * Sanitized readahead onto remote memory is better than no readahead
 * when local numa node does not have memory. If local numa has less
 * memory we trim readahead size depending on potential free memory
 * available.
 */



Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node

2013-12-03 Thread Andrew Morton
On Tue,  3 Dec 2013 16:06:17 +0530 Raghavendra K T 
 wrote:

> On a cpu with an empty numa node,

This makes no sense - numa nodes don't reside on CPUs.

I think you mean "on a CPU which resides on a memoryless NUMA node"?

> readahead fails because max_sane_readahead
> returns zero. The reason is we look into number of inactive + free pages 
> available on the current node.
> 
> The following patch tries to fix the behaviour by checking for potential
> empty numa node cases.
> The rationale for the patch is, readahead may be worth doing on a remote
> node instead of incurring costly disk faults later.
> 
> I still feel we may have to sanitize the nr below, (for e.g., nr/8)
> to avoid serious consequences of malicious application trying to do
> a big readahead on an empty numa node causing unnecessary load on remote nodes.
> ( or it may even be that current behaviour is right in not going ahead with
> readahead to avoid the memory load on remote nodes).
> 

I don't recall the rationale for the current code and of course we
didn't document it.  It might be in the changelogs somewhere - could
you please do the git digging and see if you can find out?

I don't immediately see why readahead into a different node is
considered a bad thing.

> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -243,8 +243,11 @@ int force_page_cache_readahead(struct address_space 
> *mapping, struct file *filp,
>   */
>  unsigned long max_sane_readahead(unsigned long nr)
>  {
> - return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> - + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> + unsigned long numa_free_page;
> + numa_free_page = (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> ++ node_page_state(numa_node_id(), NR_FREE_PAGES));
> +
> + return numa_free_page ? min(nr, numa_free_page / 2) : nr;

Well even if this CPU's node has very little pagecache at all, what's
wrong with permitting readahead?  We don't know that the new pagecache
will be allocated exclusively from this CPU's node anyway.  All very
odd.

Whatever we do, we should leave behind some good code comments which
explain the rationale(s), please.  Right now it's rather opaque.