Re: [PATCH] add extra free kbytes tunable

2013-03-08 Thread Simon Jeons

Hi Hugh,
On 03/02/2013 11:08 AM, Hugh Dickins wrote:

On Sat, 2 Mar 2013, Simon Jeons wrote:

On 03/02/2013 09:42 AM, Hugh Dickins wrote:

On Sat, 2 Mar 2013, Simon Jeons wrote:

In function __add_to_swap_cache, if the page is added to the radix tree
successfully, NR_FILE_PAGES is increased. Why? This is an anonymous page,
not a file-backed page.

Right, that's hard to understand without historical background.

I think the quick answer would be that we used to (and still do) think
of file-cache and swap-cache as two halves of page-cache.  And then when

Should a shmem page be treated as file-cache or as swap-cache? It is strange,
since it consists of anonymous pages and yet these pages make up files.

A shmem page is swap-backed file-cache, and it may get transferred to or
from swap-cache: yes, it's a difficult and confusing case, as I said below.

I would never call it "anonymous", but it is counted in /proc/meminfo's
Active(anon) or Inactive(anon) rather than in (file), because "anon"
there is shorthand for "swap-backed".


In read_swap_cache_async:

SetPageSwapBacked(new_page);
__add_to_swap_cache();
swap_readpage();
ClearPageSwapBacked(new_page);

Why is the PG_swapbacked flag cleared?
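(For context, a condensed sketch of the surrounding flow, paraphrased from the
~3.x-era mm/swap_state.c and simplified rather than quoted exactly: the clear
only happens on the path where adding the page to the swap cache fails.)

__set_page_locked(new_page);
SetPageSwapBacked(new_page);
err = __add_to_swap_cache(new_page, entry);
if (likely(!err)) {
	/* the page is in the swap cache: kick off the read and return it */
	lru_cache_add_anon(new_page);
	swap_readpage(new_page);
	return new_page;
}
/* the insert failed: undo the flags before dropping the page */
ClearPageSwapBacked(new_page);
__clear_page_locked(new_page);
swapcache_free(entry, NULL);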




So you'll find that shmem and swap are counted as file in some places
and anon in others, and it's hard to grasp which is where and why,
without remembering the history.

Hugh


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] add extra free kbytes tunable

2013-03-01 Thread Simon Jeons

On 03/02/2013 11:08 AM, Hugh Dickins wrote:

On Sat, 2 Mar 2013, Simon Jeons wrote:

On 03/02/2013 09:42 AM, Hugh Dickins wrote:

On Sat, 2 Mar 2013, Simon Jeons wrote:

In function __add_to_swap_cache, if the page is added to the radix tree
successfully, NR_FILE_PAGES is increased. Why? This is an anonymous page,
not a file-backed page.

Right, that's hard to understand without historical background.

I think the quick answer would be that we used to (and still do) think
of file-cache and swap-cache as two halves of page-cache.  And then when

Should a shmem page be treated as file-cache or as swap-cache? It is strange,
since it consists of anonymous pages and yet these pages make up files.

A shmem page is swap-backed file-cache, and it may get transferred to or
from swap-cache: yes, it's a difficult and confusing case, as I said below.

I would never call it "anonymous", but it is counted in /proc/meminfo's
Active(anon) or Inactive(anon) rather than in (file), because "anon"
there is shorthand for "swap-backed".


Oh, I see. Thanks. :)




So you'll find that shmem and swap are counted as file in some places
and anon in others, and it's hard to grasp which is where and why,
without remembering the history.

Hugh




Re: [PATCH] add extra free kbytes tunable

2013-03-01 Thread Hugh Dickins
On Sat, 2 Mar 2013, Simon Jeons wrote:
> On 03/02/2013 09:42 AM, Hugh Dickins wrote:
> > On Sat, 2 Mar 2013, Simon Jeons wrote:
> > > In function __add_to_swap_cache, if the page is added to the radix tree
> > > successfully, NR_FILE_PAGES is increased. Why? This is an anonymous page,
> > > not a file-backed page.
> > Right, that's hard to understand without historical background.
> > 
> > I think the quick answer would be that we used to (and still do) think
> > of file-cache and swap-cache as two halves of page-cache.  And then when
> 
> Should a shmem page be treated as file-cache or as swap-cache? It is strange,
> since it consists of anonymous pages and yet these pages make up files.

A shmem page is swap-backed file-cache, and it may get transferred to or
from swap-cache: yes, it's a difficult and confusing case, as I said below.

I would never call it "anonymous", but it is counted in /proc/meminfo's
Active(anon) or Inactive(anon) rather than in (file), because "anon"
there is shorthand for "swap-backed".

> > So you'll find that shmem and swap are counted as file in some places
> > and anon in others, and it's hard to grasp which is where and why,
> > without remembering the history.

Hugh


Re: [PATCH] add extra free kbytes tunable

2013-03-01 Thread Simon Jeons

On 03/02/2013 09:42 AM, Hugh Dickins wrote:

On Sat, 2 Mar 2013, Simon Jeons wrote:

In function __add_to_swap_cache, if the page is added to the radix tree
successfully, NR_FILE_PAGES is increased. Why? This is an anonymous page,
not a file-backed page.

Right, that's hard to understand without historical background.

I think the quick answer would be that we used to (and still do) think
of file-cache and swap-cache as two halves of page-cache.  And then when


Should a shmem page be treated as file-cache or as swap-cache? It is strange,
since it consists of anonymous pages and yet these pages make up files.



someone changed the way stats were gathered, they couldn't very well
name the stat for page-cache pages NR_PAGE_PAGES, so they called it
NR_FILE_PAGES - but it still included swap.

We have tried down the years to keep the info shown in /proc/meminfo
(for example, but it is the prime example) consistent across releases,
while adding new lines and new distinctions.

But it has often been hard to find good enough short enough names for
those new distinctions: when 2.6.28 split the LRUs between file-backed
and swap-backed, it used "anon" for swap-backed in /proc/meminfo.

So you'll find that shmem and swap are counted as file in some places
and anon in others, and it's hard to grasp which is where and why,
without remembering the history.

I notice that fs/proc/meminfo.c:meminfo_proc_show() subtracts
total_swapcache_pages from the NR_FILE_PAGES count for /proc/meminfo:
so it's undoing what you observe __add_to_swap_cache() to be doing.

It's quite possible that if you went through all the users of
NR_FILE_PAGES, you'd find it makes much more sense to leave out
the swap-cache pages, and just add those on where needed.

But you might find a few places where it's hard to decide whether
the swap-cache pages were ever intended to be included or not, and
hard to decide if it's safe to change those numbers now or not.

Hugh




Re: [PATCH] add extra free kbytes tunable

2013-03-01 Thread Hugh Dickins
On Sat, 2 Mar 2013, Simon Jeons wrote:
> 
> In function __add_to_swap_cache, if the page is added to the radix tree
> successfully, NR_FILE_PAGES is increased. Why? This is an anonymous page,
> not a file-backed page.

Right, that's hard to understand without historical background.

I think the quick answer would be that we used to (and still do) think
of file-cache and swap-cache as two halves of page-cache.  And then when
someone changed the way stats were gathered, they couldn't very well
name the stat for page-cache pages NR_PAGE_PAGES, so they called it
NR_FILE_PAGES - but it still included swap.
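(A minimal sketch of where that counting happens, paraphrased from the ~3.x-era
mm/swap_state.c __add_to_swap_cache() and not line-exact:)

error = radix_tree_insert(&swapper_space.page_tree, entry.val, page);
if (likely(!error)) {
	total_swapcache_pages++;
	/* swap-cache pages are counted in the "file" page-cache stat */
	__inc_zone_page_state(page, NR_FILE_PAGES);
	INC_CACHE_INFO(add_total);
}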

We have tried down the years to keep the info shown in /proc/meminfo
(for example, but it is the prime example) consistent across releases,
while adding new lines and new distinctions.

But it has often been hard to find good enough short enough names for
those new distinctions: when 2.6.28 split the LRUs between file-backed
and swap-backed, it used "anon" for swap-backed in /proc/meminfo.

So you'll find that shmem and swap are counted as file in some places
and anon in others, and it's hard to grasp which is where and why,
without remembering the history.

I notice that fs/proc/meminfo.c:meminfo_proc_show() subtracts
total_swapcache_pages from the NR_FILE_PAGES count for /proc/meminfo:
so it's undoing what you observe __add_to_swap_cache() to be doing.
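(Roughly, the meminfo side looks like this, paraphrased from the ~3.x-era
fs/proc/meminfo.c with approximate variable names:)

/* "Cached:" strips the swap-cache pages (and buffers) back out of the
 * page-cache counter that __add_to_swap_cache() bumped */
cached = global_page_state(NR_FILE_PAGES) -
		total_swapcache_pages - i.bufferram;
if (cached < 0)
	cached = 0;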

It's quite possible that if you went through all the users of
NR_FILE_PAGES, you'd find it makes much more sense to leave out
the swap-cache pages, and just add those on where needed.

But you might find a few places where it's hard to decide whether
the swap-cache pages were ever intended to be included or not, and
hard to decide if it's safe to change those numbers now or not.

Hugh


Re: [PATCH] add extra free kbytes tunable

2013-03-01 Thread Simon Jeons

On 03/02/2013 06:33 AM, Hugh Dickins wrote:

On Fri, 1 Mar 2013, Simon Jeons wrote:

On 03/01/2013 05:22 PM, Simon Jeons wrote:

On 02/23/2013 01:56 AM, Johannes Weiner wrote:

Mapped file pages have to get scanned twice before they are reclaimed
because we don't have enough usage information after the first scan.

It seems that just VM_EXEC mapped file pages are protected.
Issue in page reclaim subsystem:
static inline int page_is_file_cache(struct page *page)
{
 return !PageSwapBacked(page);
}
AFAIK, PG_swapbacked is set when an anonymous page is added to the swap cache,
and cleared when it is removed from the swap cache. So anonymous pages which are
reclaimed and added to the swap cache won't have this flag, and then they will be
treated as

s/are/aren't

PG_swapbacked != PG_swapcache


Oh, I see. Thanks Hugh, thanks for your patience. :)

In function __add_to_swap_cache, if the page is added to the radix tree
successfully, NR_FILE_PAGES is increased. Why? This is an anonymous page,
not a file-backed page.




Re: [PATCH] add extra free kbytes tunable

2013-03-01 Thread Hugh Dickins
On Fri, 1 Mar 2013, Simon Jeons wrote:
> On 03/01/2013 05:22 PM, Simon Jeons wrote:
> > On 02/23/2013 01:56 AM, Johannes Weiner wrote:
> > > Mapped file pages have to get scanned twice before they are reclaimed
> > > because we don't have enough usage information after the first scan.
> > 
> > It seems that just VM_EXEC mapped file pages are protected.
> > Issue in page reclaim subsystem:
> > static inline int page_is_file_cache(struct page *page)
> > {
> > return !PageSwapBacked(page);
> > }
> > AFAIK, PG_swapbacked is set when an anonymous page is added to the swap cache,
> > and cleared when it is removed from the swap cache. So anonymous pages which are
> > reclaimed and added to the swap cache won't have this flag, and then they will be
> > treated as
> 
> s/are/aren't

PG_swapbacked != PG_swapcache
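(To spell the difference out, a sketch in terms of the ~3.x flag helpers; the
comments are a summary and the two helper functions are hypothetical names used
only for illustration, not kernel API:)

/*
 * PG_swapbacked: "this page's backing store is swap, not a file".  Set when
 * an anonymous page is first mapped (SetPageSwapBacked() in
 * page_add_new_anon_rmap()) or when shmem allocates a page, and it normally
 * stays set for the page's whole life on the LRU.
 *
 * PG_swapcache: "this page currently sits in the swap cache".  Set by
 * __add_to_swap_cache() and cleared by __delete_from_swap_cache(), so it
 * comes and goes while PG_swapbacked stays put.
 */
static inline bool page_on_swap_backed_lru(struct page *page)
{
	/* the file/anon LRU split keys off PG_swapbacked only */
	return PageSwapBacked(page);
}

static inline bool page_in_swap_cache(struct page *page)
{
	/* swap-cache membership is a separate, transient property */
	return PageSwapCache(page);
}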


Re: [PATCH] add extra free kbytes tunable

2013-03-01 Thread Simon Jeons

On 03/01/2013 05:22 PM, Simon Jeons wrote:

Hi Johannes,

On 02/23/2013 01:56 AM, Johannes Weiner wrote:

On Tue, Feb 19, 2013 at 09:19:27PM -0800, dormando wrote:

The problem is that adding this tunable will constrain future VM
implementations.  We will forever need to at least retain the
pseudo-file.  We will also need to make some effort to retain its
behaviour.

It would of course be better to fix things so you don't need to tweak
VM internals to get acceptable behaviour.

I sympathize with this. It's presently all that keeps us afloat though.
I'll whine about it again later if nothing else pans out.


You said:

: We have a server workload wherein machines with 100G+ of "free" memory
: (used by page cache), scattered but frequent random io reads from 12+
: SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
: in a few different ways.
:
: 1) It'll run into small amounts of reclaim randomly (a few hundred
: thousand).
:
: 2) A burst of reads or traffic can cause extra pressure, which kswapd
: occasionally responds to by freeing up 40g+ of the pagecache all at once
: (!) while pausing the system (Argh).
:
: 3) A blip in an upstream provider or failover from a peer causes the
: kernel to allocate massive amounts of memory for retransmission
: queues/etc, potentially along with buffered IO reads and (some, but not
: often a ton) of new allocations from an application. This paired with 2)
: can cause the box to stall for 15+ seconds.

Can we prioritise these?  2) looks just awful - kswapd shouldn't just
go off and free 40G of pagecache.  Do you know what's actually in that
pagecache?  Large number of small files or small number of (very) large
files?

We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
accessed via address. occasionally madvise (WILLNEED) applied to the
address ranges before attempting to use them. There're a mix of other
files but nothing significant. The mmap's are READONLY and writes are done
via pwrite-ish functions.

I could use some guidance on inspecting/tracing the problem. I've been
trying to reproduce it in a lab, and respecting to 2)'s issue I've found:

- The amount of memory freed back up is either a percentage of total
memory or a percentage of free memory. (a machine with 48G of ram will
"only" free up an extra 4-7g)

- It's most likely to happen after a fresh boot, or if "3 > drop_caches"
is applied with the application down. As it fills it seems to get itself
into trouble, but becomes more stable after that. Unfortunately 1) and 3)
still apply to a stable instance.

- Protecting the DMA32 zone with something like "1 1 32" into
lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.

- While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
hundred thousand pages before finding anything it actually wants to
reclaim (low vmeff). I've only been able to reproduce this from a clean
start. It can take up to 3 seconds before kswapd starts actually
reclaiming pages.

- So far as I can tell we're almost exclusively using 0 order allocations.
THP is disabled.

There's not much dirty memory involved. It's not flushing out writes while
reclaiming, it just kills off massive amount of cached memory.

Mapped file pages have to get scanned twice before they are reclaimed
because we don't have enough usage information after the first scan.


It seems that just VM_EXEC mapped file pages are protected.
Issue in page reclaim subsystem:
static inline int page_is_file_cache(struct page *page)
{
return !PageSwapBacked(page);
}
AFAIK, PG_swapbacked is set when an anonymous page is added to the swap cache,
and cleared when it is removed from the swap cache. So anonymous pages which
are reclaimed and added to the swap cache won't have this flag, and then they
will be treated as


s/are/aren't

file backed pages?  Is that a bug? In function __add_to_swap_cache, if the
page is added to the radix tree successfully, NR_FILE_PAGES is increased.
Why?


In your case, when you start this workload after a fresh boot or
dropping the caches, there will be 48G of mapped file pages that have
never been scanned before and that need to be looked at twice.

Unfortunately, if kswapd does not make progress (and it won't for some
time at first), it will scan more and more aggressively with


Why does kswapd not make progress for some time at first?


increasing scan priority.  And when the 48G of pages are finally
cycled, kswapd's scan window is a large percentage of your machine's
memory, and it will free every single page in it.

I think we should think about capping kswapd zone reclaim cycles just
as we do for direct reclaim.  It's a little ridiculous that it can run
unbounded and reclaim every page in a zone without ever checking back
against the watermark.  We still increase the scan window evenly when
we don't make forward progress, but we are more carefully inching zone
levels back toward the watermarks.

diff --git a/mm/vmscan.c b/mm/vmscan.c

Re: [PATCH] add extra free kbytes tunable

2013-03-01 Thread Simon Jeons

Hi Johannes,

On 02/23/2013 01:56 AM, Johannes Weiner wrote:

On Tue, Feb 19, 2013 at 09:19:27PM -0800, dormando wrote:

The problem is that adding this tunable will constrain future VM
implementations.  We will forever need to at least retain the
pseudo-file.  We will also need to make some effort to retain its
behaviour.

It would of course be better to fix things so you don't need to tweak
VM internals to get acceptable behaviour.

I sympathize with this. It's presently all that keeps us afloat though.
I'll whine about it again later if nothing else pans out.


You said:

: We have a server workload wherein machines with 100G+ of "free" memory
: (used by page cache), scattered but frequent random io reads from 12+
: SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
: in a few different ways.
:
: 1) It'll run into small amounts of reclaim randomly (a few hundred
: thousand).
:
: 2) A burst of reads or traffic can cause extra pressure, which kswapd
: occasionally responds to by freeing up 40g+ of the pagecache all at once
: (!) while pausing the system (Argh).
:
: 3) A blip in an upstream provider or failover from a peer causes the
: kernel to allocate massive amounts of memory for retransmission
: queues/etc, potentially along with buffered IO reads and (some, but not
: often a ton) of new allocations from an application. This paired with 2)
: can cause the box to stall for 15+ seconds.

Can we prioritise these?  2) looks just awful - kswapd shouldn't just
go off and free 40G of pagecache.  Do you know what's actually in that
pagecache?  Large number of small files or small number of (very) large
files?

We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
accessed via address. occasionally madvise (WILLNEED) applied to the
address ranges before attempting to use them. There're a mix of other
files but nothing significant. The mmap's are READONLY and writes are done
via pwrite-ish functions.

I could use some guidance on inspecting/tracing the problem. I've been
trying to reproduce it in a lab, and respecting to 2)'s issue I've found:

- The amount of memory freed back up is either a percentage of total
memory or a percentage of free memory. (a machine with 48G of ram will
"only" free up an extra 4-7g)

- It's most likely to happen after a fresh boot, or if "3 > drop_caches"
is applied with the application down. As it fills it seems to get itself
into trouble, but becomes more stable after that. Unfortunately 1) and 3)
still apply to a stable instance.

- Protecting the DMA32 zone with something like "1 1 32" into
lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.

- While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
hundred thousand pages before finding anything it actually wants to
reclaim (low vmeff). I've only been able to reproduce this from a clean
start. It can take up to 3 seconds before kswapd starts actually
reclaiming pages.

- So far as I can tell we're almost exclusively using 0 order allocations.
THP is disabled.

There's not much dirty memory involved. It's not flushing out writes while
reclaiming, it just kills off massive amount of cached memory.

Mapped file pages have to get scanned twice before they are reclaimed
because we don't have enough usage information after the first scan.
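(For context on the "scanned twice" behaviour, a trimmed sketch of the check in
question, paraphrased from the ~3.x-era mm/vmscan.c page_check_references() and
not line-exact:)

referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup, &vm_flags);
referenced_page = TestClearPageReferenced(page);

if (referenced_ptes) {
	if (PageSwapBacked(page))
		return PAGEREF_ACTIVATE;
	/*
	 * A mapped file page seen referenced for the first time: remember
	 * that with PG_referenced and leave it on the inactive list, so it
	 * is only reclaimed if it is still unused on the second scan.
	 */
	SetPageReferenced(page);
	if (referenced_page || referenced_ptes > 1)
		return PAGEREF_ACTIVATE;
	if (vm_flags & VM_EXEC)
		return PAGEREF_ACTIVATE;	/* the VM_EXEC case noted below */
	return PAGEREF_KEEP;
}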


It seems that just VM_EXEC mapped file pages are protected.
Issue in page reclaim subsystem:
static inline int page_is_file_cache(struct page *page)
{
return !PageSwapBacked(page);
}
AFAIK, PG_swapbacked is set when an anonymous page is added to the swap cache,
and cleared when it is removed from the swap cache. So anonymous pages which
are reclaimed and added to the swap cache won't have this flag, and then they
will be treated as file backed pages?  Is that a bug? In function
__add_to_swap_cache, if the page is added to the radix tree successfully,
NR_FILE_PAGES is increased. Why?


In your case, when you start this workload after a fresh boot or
dropping the caches, there will be 48G of mapped file pages that have
never been scanned before and that need to be looked at twice.

Unfortunately, if kswapd does not make progress (and it won't for some
time at first), it will scan more and more aggressively with


Why does kswapd not make progress for some time at first?


increasing scan priority.  And when the 48G of pages are finally
cycled, kswapd's scan window is a large percentage of your machine's
memory, and it will free every single page in it.

I think we should think about capping kswapd zone reclaim cycles just
as we do for direct reclaim.  It's a little ridiculous that it can run
unbounded and reclaim every page in a zone without ever checking back
against the watermark.  We still increase the scan window evenly when
we don't make forward progress, but we are more carefully inching zone
levels back toward the watermarks.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4883eb..8a4c446 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2645,10 +2645,11 @@ 


Re: [PATCH] add extra free kbytes tunable

2013-02-26 Thread Mel Gorman
On Tue, Feb 26, 2013 at 10:13:15AM -0500, Johannes Weiner wrote:
> > > 
> > > I think we should think about capping kswapd zone reclaim cycles just
> > > as we do for direct reclaim.  It's a little ridiculous that it can run
> > > unbounded and reclaim every page in a zone without ever checking back
> > > against the watermark.  We still increase the scan window evenly when
> > > we don't make forward progress, but we are more carefully inching zone
> > > levels back toward the watermarks.
> > > 
> > 
> > While on the surface I think this will appear to work, I worry that it
> > will cause kswapd's priorities to continually reset even when it's under
> > real pressure as opposed to "failing to reclaim because of use-once".
> > With nr_to_reclaim always at SWAP_CLUSTER_MAX, we'll hit this check and
> > reset after each zone scan.
> > 
> > if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> > break;
> 
> But we hit that check now as well...? 

Eventually yes.

> I.e. unless there is a hard to
> reclaim batch and kswapd is unable to make forward progress, priority
> levels will always get reset after we scanned all zones and reclaimed
> SWAP_CLUSTER_MAX or more in the process.
> 

The reset happens after it has reclaimed a lot of pages. I agree with
you that this is likely the wrong thing to do. I'm just pointing out
that this simple patch changes behaviour in a big way.

> All I'm arguing is that, if we hit a hard to reclaim batch we should
> continue to increase the number of pages to scan, but still bail out
> if we reclaimed a batch successfully.  It does make sense to me to
> look at more pages if we encounter unreclaimable ones.  It makes less
> sense to me, however, to increase the reclaim goal as well in that
> case.
> 

Bail out from the reclaim maybe but care should be taken to ensure we do
not hammer slab on each "bail" or reset the scanning priorities if the
watermark was not met by that batch of SWAP_CLUSTER_MAX reclaims.

We also have to think about what it means for pressure being applied
equally to each zone. We will still apply equal scanning pressure but
not necessarily reclaim pressure. Does that matter? I don't know.

> > It'll fail the watermark check and restart of course but it does mean we
> > would call shrink_slab() for every SWAP_CLUSTER_MAX*nr_unbalanced_zones
> > pages scanned which will have other consequences. It'll behave differently
> > but not necessarily better.
> 
> Right, I wasn't proposing to merge the patch as is.  But I do think
> it's not okay that a batch of immediately unreclaimable pages can
> cause kswapd to grow its reclaim target exponentially and we should
> probably think about capping it one way or another.
> 

I agree with you. MMtest results I looked at over the weekend showed
that kswapd tends to be extremely spiky. Doing nothing following by
reclaiming an excessive amount of memory and going back to doing
nothing. This partially explains it.

> shrink_slab()'s action is already based on the ratio between the
> number of scanned pages and the number of lru pages, so I don't see
> this as a fundamental issue, although it may require some tweaking.
> 
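(A condensed sketch of the ratio being referred to, paraphrased from the
~3.x-era mm/vmscan.c shrink_slab() and not line-exact:

	delta = (4 * nr_pages_scanned) / shrinker->seeks;
	delta *= max_pass;		/* max_pass: objects the shrinker reports */
	do_div(delta, lru_pages + 1);
	total_scan += delta;		/* on top of whatever was deferred earlier */
)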
> > In general, IO causing anonymous workloads to stall has gotten a lot worse
> > during the last few kernels without us properly realising it other than
> > interactivity in the presence of IO has gone down the crapper again. Late
> > last week I fixed up an mmtests that runs memcachetest as the primary
> > workload while doing varying amounts of IO in the background and found this
> > 
> > http://www.csn.ul.ie/~mel/postings/reclaim-20130221/global-dhp__parallelio-memcachetest-ext4/hydra/report.html
> > 
> > Snippet looks like this;
> >                            3.0.56             3.6.10              3.7.4          3.8.0-rc4
> >                          mainline           mainline           mainline           mainline
> > Ops memcachetest-0M     10125.00 (  0.00%) 10091.00 ( -0.34%) 11038.00 (  9.02%) 10864.00 (  7.30%)
> > Ops memcachetest-749M   10097.00 (  0.00%)  8546.00 (-15.36%)  8770.00 (-13.14%)  4872.00 (-51.75%)
> > Ops memcachetest-1623M  10161.00 (  0.00%)  3149.00 (-69.01%)  3645.00 (-64.13%)  2760.00 (-72.84%)
> > Ops memcachetest-2498M   8095.00 (  0.00%)  2527.00 (-68.78%)  2461.00 (-69.60%)  2282.00 (-71.81%)
> > Ops memcachetest-3372M   7814.00 (  0.00%)  2369.00 (-69.68%)  2396.00 (-69.34%)  2323.00 (-70.27%)
> > Ops memcachetest-4247M   3818.00 (  0.00%)  2366.00 (-38.03%)  2391.00 (-37.38%)  2274.00 (-40.44%)
> > Ops memcachetest-5121M   3852.00 (  0.00%)  2335.00 (-39.38%)  2384.00 (-38.11%)

Re: [PATCH] add extra free kbytes tunable

2013-02-26 Thread Johannes Weiner
On Tue, Feb 26, 2013 at 10:47:31AM +, Mel Gorman wrote:
> On Fri, Feb 22, 2013 at 12:56:34PM -0500, Johannes Weiner wrote:
> > > > 
> > > >
> > > > : We have a server workload wherein machines with 100G+ of "free" memory
> > > > : (used by page cache), scattered but frequent random io reads from 12+
> > > > : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
> > > > : in a few different ways.
> > > > :
> > > > : 1) It'll run into small amounts of reclaim randomly (a few hundred
> > > > : thousand).
> > > > :
> > > > : 2) A burst of reads or traffic can cause extra pressure, which kswapd
> > > > : occasionally responds to by freeing up 40g+ of the pagecache all at once
> > > > : (!) while pausing the system (Argh).
> > > > :
> > > > : 3) A blip in an upstream provider or failover from a peer causes the
> > > > : kernel to allocate massive amounts of memory for retransmission
> > > > : queues/etc, potentially along with buffered IO reads and (some, but not
> > > > : often a ton) of new allocations from an application. This paired with 2)
> > > > : can cause the box to stall for 15+ seconds.
> > > >
> > > > Can we prioritise these?  2) looks just awful - kswapd shouldn't just
> > > > go off and free 40G of pagecache.  Do you know what's actually in that
> > > > pagecache?  Large number of small files or small number of (very) large
> > > > files?
> > > 
> > > We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
> > > accessed via address. occasionally madvise (WILLNEED) applied to the
> > > address ranges before attempting to use them. There're a mix of other
> > > files but nothing significant. The mmap's are READONLY and writes are done
> > > via pwrite-ish functions.
> > > 
> > > I could use some guidance on inspecting/tracing the problem. I've been
> > > trying to reproduce it in a lab, and respecting to 2)'s issue I've found:
> > > 
> > > - The amount of memory freed back up is either a percentage of total
> > > memory or a percentage of free memory. (a machine with 48G of ram will
> > > "only" free up an extra 4-7g)
> > > 
> > > - It's most likely to happen after a fresh boot, or if "3 > drop_caches"
> > > is applied with the application down. As it fills it seems to get itself
> > > into trouble, but becomes more stable after that. Unfortunately 1) and 3)
> > > still apply to a stable instance.
> > > 
> > > - Protecting the DMA32 zone with something like "1 1 32" into
> > > lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
> > > 
> > > - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
> > > hundred thousand pages before finding anything it actually wants to
> > > reclaim (low vmeff). I've only been able to reproduce this from a clean
> > > start. It can take up to 3 seconds before kswapd starts actually
> > > reclaiming pages.
> > > 
> > > - So far as I can tell we're almost exclusively using 0 order allocations.
> > > THP is disabled.
> > > 
> > > There's not much dirty memory involved. It's not flushing out writes while
> > > reclaiming, it just kills off massive amount of cached memory.
> > 
> > Mapped file pages have to get scanned twice before they are reclaimed
> > because we don't have enough usage information after the first scan.
> > 
> > In your case, when you start this workload after a fresh boot or
> > dropping the caches, there will be 48G of mapped file pages that have
> > never been scanned before and that need to be looked at twice.
> > 
> > Unfortunately, if kswapd does not make progress (and it won't for some
> > time at first), it will scan more and more aggressively with
> > increasing scan priority.  And when the 48G of pages are finally
> > cycled, kswapd's scan window is a large percentage of your machine's
> > memory, and it will free every single page in it.
> > 
> > I think we should think about capping kswapd zone reclaim cycles just
> > as we do for direct reclaim.  It's a little ridiculous that it can run
> > unbounded and reclaim every page in a zone without ever checking back
> > against the watermark.  We still increase the scan window evenly when
> > we don't make forward progress, but we are more carefully inching zone
> > levels back toward the watermarks.
> > 
> 
> While on the surface I think this will appear to work, I worry that it
> will cause kswapd's priorities to continually reset even when it's under
> real pressure as opposed to "failing to reclaim because of use-once".
> With nr_to_reclaim always at SWAP_CLUSTER_MAX, we'll hit this check and
> reset after each zone scan.
> 
> if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> break;

But we hit that check now as well...?  I.e. unless there is a hard to
reclaim batch and kswapd is unable to make forward progress, priority
levels will always get reset after we scanned all zones and reclaimed
SWAP_CLUSTER_MAX or more in the process.

All I'm arguing is 

Re: [PATCH] add extra free kbytes tunable

2013-02-26 Thread Mel Gorman
On Fri, Feb 22, 2013 at 12:56:34PM -0500, Johannes Weiner wrote:
> > > 
> > >
> > > : We have a server workload wherein machines with 100G+ of "free" memory
> > > : (used by page cache), scattered but frequent random io reads from 12+
> > > : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
> > > : in a few different ways.
> > > :
> > > : 1) It'll run into small amounts of reclaim randomly (a few hundred
> > > : thousand).
> > > :
> > > : 2) A burst of reads or traffic can cause extra pressure, which kswapd
> > > : occasionally responds to by freeing up 40g+ of the pagecache all at once
> > > : (!) while pausing the system (Argh).
> > > :
> > > : 3) A blip in an upstream provider or failover from a peer causes the
> > > : kernel to allocate massive amounts of memory for retransmission
> > > : queues/etc, potentially along with buffered IO reads and (some, but not
> > > : often a ton) of new allocations from an application. This paired with 2)
> > > : can cause the box to stall for 15+ seconds.
> > >
> > > Can we prioritise these?  2) looks just awful - kswapd shouldn't just
> > > go off and free 40G of pagecache.  Do you know what's actually in that
> > > pagecache?  Large number of small files or small number of (very) large
> > > files?
> > 
> > We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
> > accessed via address. occasionally madvise (WILLNEED) applied to the
> > address ranges before attempting to use them. There're a mix of other
> > files but nothing significant. The mmap's are READONLY and writes are done
> > via pwrite-ish functions.
> > 
> > I could use some guidance on inspecting/tracing the problem. I've been
> > trying to reproduce it in a lab, and respecting to 2)'s issue I've found:
> > 
> > - The amount of memory freed back up is either a percentage of total
> > memory or a percentage of free memory. (a machine with 48G of ram will
> > "only" free up an extra 4-7g)
> > 
> > - It's most likely to happen after a fresh boot, or if "3 > drop_caches"
> > is applied with the application down. As it fills it seems to get itself
> > into trouble, but becomes more stable after that. Unfortunately 1) and 3)
> > still apply to a stable instance.
> > 
> > - Protecting the DMA32 zone with something like "1 1 32" into
> > lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
> > 
> > - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
> > hundred thousand pages before finding anything it actually wants to
> > reclaim (low vmeff). I've only been able to reproduce this from a clean
> > start. It can take up to 3 seconds before kswapd starts actually
> > reclaiming pages.
> > 
> > - So far as I can tell we're almost exclusively using 0 order allocations.
> > THP is disabled.
> > 
> > There's not much dirty memory involved. It's not flushing out writes while
> > reclaiming, it just kills off massive amount of cached memory.
> 
> Mapped file pages have to get scanned twice before they are reclaimed
> because we don't have enough usage information after the first scan.
> 
> In your case, when you start this workload after a fresh boot or
> dropping the caches, there will be 48G of mapped file pages that have
> never been scanned before and that need to be looked at twice.
> 
> Unfortunately, if kswapd does not make progress (and it won't for some
> time at first), it will scan more and more aggressively with
> increasing scan priority.  And when the 48G of pages are finally
> cycled, kswapd's scan window is a large percentage of your machine's
> memory, and it will free every single page in it.
> 
> I think we should think about capping kswapd zone reclaim cycles just
> as we do for direct reclaim.  It's a little ridiculous that it can run
> unbounded and reclaim every page in a zone without ever checking back
> against the watermark.  We still increase the scan window evenly when
> we don't make forward progress, but we are more carefully inching zone
> levels back toward the watermarks.
> 

While on the surface I think this will appear to work, I worry that it
will cause kswapd's priorities to continually reset even when it's under
real pressure as opposed to "failing to reclaim because of use-once".
With nr_to_reclaim always at SWAP_CLUSTER_MAX, we'll hit this check and
reset after each zone scan.

if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
break;

It'll fail the watermark check and restart of course but it does mean we
would call shrink_slab() for every SWAP_CLUSTER_MAX*nr_unbalanced_zones
pages scanned which will have other consequences. It'll behave differently
but not necessarily better.
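
To make that concrete, here is a toy, standalone model of the balancing loop
being discussed -- not the real mm/vmscan.c. The zone sizes, the watermarks and
the stub shrink_zone() below are invented; only the control flow (per-zone
shrink, shrink_slab() after each zone, the SWAP_CLUSTER_MAX bail-out and the
priority reset) is meant to mirror the concern:

/* toy model, not kernel code: zones and reclaim numbers are made up */
#include <stdio.h>
#include <stdbool.h>

#define DEF_PRIORITY     12
#define SWAP_CLUSTER_MAX 32UL

struct zone {
    const char *name;
    unsigned long lru_pages;   /* reclaimable pagecache on the LRU */
    unsigned long free;        /* free pages */
    unsigned long low_wmark;   /* pages_low watermark */
};

/* stub: scan window shrinks with priority, reclaim is capped per call */
static unsigned long shrink_zone(struct zone *z, int priority, unsigned long cap)
{
    unsigned long scan = z->lru_pages >> priority;
    unsigned long got = scan < cap ? scan : cap;

    z->lru_pages -= got;
    z->free += got;
    return got;
}

static bool zone_balanced(const struct zone *z)
{
    return z->free >= z->low_wmark;
}

int main(void)
{
    struct zone zones[] = {
        { "DMA32",  1UL << 18,  1000,  4000 },
        { "Normal", 1UL << 22, 60000, 64000 },
    };
    unsigned long slab_calls = 0, resets = 0, passes = 0;
    int priority = DEF_PRIORITY;
    bool balanced = false;

    while (!balanced && passes < 100000) {
        unsigned long nr_reclaimed = 0;

        passes++;
        balanced = true;
        for (int i = 0; i < 2; i++) {
            if (zone_balanced(&zones[i]))
                continue;
            balanced = false;
            nr_reclaimed += shrink_zone(&zones[i], priority, SWAP_CLUSTER_MAX);
            slab_calls++;                  /* shrink_slab() after each zone */
            if (nr_reclaimed >= SWAP_CLUSTER_MAX)
                break;                     /* the check quoted above */
        }
        if (nr_reclaimed >= SWAP_CLUSTER_MAX) {
            priority = DEF_PRIORITY;       /* "made progress": back to DEF_PRIORITY */
            resets++;
        } else if (priority > 1) {
            priority--;                    /* no progress: scan harder next time */
        }
    }
    printf("passes=%lu shrink_slab calls=%lu priority resets=%lu\n",
           passes, slab_calls, resets);
    return 0;
}

Running it shows a priority reset and a shrink_slab() call on nearly every
pass while the watermarks are still unmet, which is the extra slab pressure
worried about above.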

In general, IO causing anonymous workloads to stall has gotten a lot worse
during the last few kernels without us properly realising it other than
interactivity in the presence of IO has gone down the crapper again. Late
last week I fixed up an mmtests that runs memcachetest as the primary


Re: [PATCH] add extra free kbytes tunable

2013-02-26 Thread Johannes Weiner
On Tue, Feb 26, 2013 at 10:47:31AM +, Mel Gorman wrote:
 On Fri, Feb 22, 2013 at 12:56:34PM -0500, Johannes Weiner wrote:
SNIP
   
: We have a server workload wherein machines with 100G+ of "free" memory
: (used by page cache), scattered but frequent random io reads from 12+
: SSD's, and 5gbps+ of internet traffic, will frequently hit direct 
reclaim
: in a few different ways.
:
: 1) It'll run into small amounts of reclaim randomly (a few hundred
: thousand).
:
: 2) A burst of reads or traffic can cause extra pressure, which kswapd
: occasionally responds to by freeing up 40g+ of the pagecache all at 
once
: (!) while pausing the system (Argh).
:
: 3) A blip in an upstream provider or failover from a peer causes the
: kernel to allocate massive amounts of memory for retransmission
: queues/etc, potentially along with buffered IO reads and (some, but 
not
: often a ton) of new allocations from an application. This paired with 
2)
: can cause the box to stall for 15+ seconds.
   
Can we prioritise these?  2) looks just awful - kswapd shouldn't just
go off and free 40G of pagecache.  Do you know what's actually in that
pagecache?  Large number of small files or small number of (very) large
files?
   
   We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
   accessed via address. occasionally madvise (WILLNEED) applied to the
   address ranges before attempting to use them. There're a mix of other
   files but nothing significant. The mmap's are READONLY and writes are done
   via pwrite-ish functions.
   
   I could use some guidance on inspecting/tracing the problem. I've been
   trying to reproduce it in a lab, and respecting to 2)'s issue I've found:
   
   - The amount of memory freed back up is either a percentage of total
   memory or a percentage of free memory. (a machine with 48G of ram will
   "only" free up an extra 4-7g)
   
   - It's most likely to happen after a fresh boot, or if "3 > drop_caches"
   is applied with the application down. As it fills it seems to get itself
   into trouble, but becomes more stable after that. Unfortunately 1) and 3)
   still apply to a stable instance.
   
   - Protecting the DMA32 zone with something like "1 1 32" into
   lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
   
   - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
   hundred thousand pages before finding anything it actually wants to
   reclaim (low vmeff). I've only been able to reproduce this from a clean
   start. It can take up to 3 seconds before kswapd starts actually
   reclaiming pages.
   
   - So far as I can tell we're almost exclusively using 0 order allocations.
   THP is disabled.
   
   There's not much dirty memory involved. It's not flushing out writes while
   reclaiming, it just kills off massive amount of cached memory.
  
  Mapped file pages have to get scanned twice before they are reclaimed
  because we don't have enough usage information after the first scan.
  
  In your case, when you start this workload after a fresh boot or
  dropping the caches, there will be 48G of mapped file pages that have
  never been scanned before and that need to be looked at twice.
  
  Unfortunately, if kswapd does not make progress (and it won't for some
  time at first), it will scan more and more aggressively with
  increasing scan priority.  And when the 48G of pages are finally
  cycled, kswapd's scan window is a large percentage of your machine's
  memory, and it will free every single page in it.
  
  I think we should think about capping kswapd zone reclaim cycles just
  as we do for direct reclaim.  It's a little ridiculous that it can run
  unbounded and reclaim every page in a zone without ever checking back
  against the watermark.  We still increase the scan window evenly when
  we don't make forward progress, but we are more carefully inching zone
  levels back toward the watermarks.
  
 
 While on the surface I think this will appear to work, I worry that it
 will cause kswapd's priorities to continually reset even when it's under
 real pressure as opposed to "failing to reclaim because of use-once".
 With nr_to_reclaim always at SWAP_CLUSTER_MAX, we'll hit this check and
 reset after each zone scan.
 
 if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
 break;

But we hit that check now as well...?  I.e. unless there is a hard to
reclaim batch and kswapd is unable to make forward progress, priority
levels will always get reset after we scanned all zones and reclaimed
SWAP_CLUSTER_MAX or more in the process.

All I'm arguing is that, if we hit a hard to reclaim batch we should
continue to increase the number of pages to scan, but still bail out
if we reclaimed a batch successfully.  It does make sense to me to
look at more pages if we encounter unreclaimable ones.  It makes less
sense to me, 

Re: [PATCH] add extra free kbytes tunable

2013-02-26 Thread Mel Gorman
On Tue, Feb 26, 2013 at 10:13:15AM -0500, Johannes Weiner wrote:
   SNIP
   I think we should think about capping kswapd zone reclaim cycles just
   as we do for direct reclaim.  It's a little ridiculous that it can run
   unbounded and reclaim every page in a zone without ever checking back
   against the watermark.  We still increase the scan window evenly when
   we don't make forward progress, but we are more carefully inching zone
   levels back toward the watermarks.
   
  
  While on the surface I think this will appear to work, I worry that it
  will cause kswapd's priorities to continually reset even when it's under
  real pressure as opposed to "failing to reclaim because of use-once".
  With nr_to_reclaim always at SWAP_CLUSTER_MAX, we'll hit this check and
  reset after each zone scan.
  
  if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
  break;
 
 But we hit that check now as well...? 

Eventually yes.

 I.e. unless there is a hard to
 reclaim batch and kswapd is unable to make forward progress, priority
 levels will always get reset after we scanned all zones and reclaimed
 SWAP_CLUSTER_MAX or more in the process.
 

The reset happens after it has reclaimed a lot of pages. I agree with
you that this is likely the wrong thing to do. I'm just pointing out
that this simple patch changes behaviour in a big way.

 All I'm arguing is that, if we hit a hard to reclaim batch we should
 continue to increase the number of pages to scan, but still bail out
 if we reclaimed a batch successfully.  It does make sense to me to
 look at more pages if we encounter unreclaimable ones.  It makes less
 sense to me, however, to increase the reclaim goal as well in that
 case.
 

Bail out from the reclaim maybe but care should be taken to ensure we do
not hammer slab on each bail or reset the scanning priorities if the
watermark was not met by that batch of SWAP_CLUSTER_MAX reclaims.

We also have to think about what it means for pressure being applied
equally to each zone. We will still apply equal scanning pressure but
not necessarily reclaim pressure. Does that matter? I don't know.
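
One way to picture that distinction: the per-zone scan target comes from the
LRU size and the priority (roughly lru_size >> priority, as get_scan_count()
does), while a capped nr_to_reclaim only bounds how much actually gets freed
per pass. A small sketch with made-up zone sizes:

#include <stdio.h>

int main(void)
{
    unsigned long lru[2] = { 1UL << 18 /* "DMA32" */, 1UL << 22 /* "Normal" */ };
    unsigned long nr_to_reclaim = 32;   /* SWAP_CLUSTER_MAX */

    for (int priority = 12; priority >= 6; priority--)
        printf("prio %2d: scan targets %8lu / %8lu pages, reclaim cap %lu\n",
               priority, lru[0] >> priority, lru[1] >> priority, nr_to_reclaim);
    return 0;
}

Scanning pressure stays proportional to zone size at every priority; the cap
only limits what is freed once reclaimable pages are actually found.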

  It'll fail the watermark check and restart of course but it does mean we
  would call shrink_slab() for every SWAP_CLUSTER_MAX*nr_unbalanced_zones
  pages scanned which will have other consequences. It'll behave differently
  but not necessarily better.
 
 Right, I wasn't proposing to merge the patch as is.  But I do think
 it's not okay that a batch of immediately unreclaimable pages can
 cause kswapd to grow its reclaim target exponentially and we should
 probably think about capping it one way or another.
 

I agree with you. MMtest results I looked at over the weekend showed
that kswapd tends to be extremely spiky. Doing nothing following by
reclaiming an excessive amount of memory and going back to doing
nothing. This partially explains it.

 shrink_slab()'s action is already based on the ratio between the
 number of scanned pages and the number of lru pages, so I don't see
 this as a fundamental issue, although it may require some tweaking.
 
  In general, IO causing anonymous workloads to stall has gotten a lot worse
  during the last few kernels without us properly realising it other than
  interactivity in the presence of IO has gone down the crapper again. Late
  last week I fixed up an mmtests that runs memcachetest as the primary
  workload while doing varying amounts of IO in the background and found this
  
  http://www.csn.ul.ie/~mel/postings/reclaim-20130221/global-dhp__parallelio-memcachetest-ext4/hydra/report.html
  
  Snippet looks like this;
                                      3.0.56              3.6.10               3.7.4           3.8.0-rc4
                                    mainline            mainline            mainline            mainline
  Ops memcachetest-0M     10125.00 (  0.00%)  10091.00 ( -0.34%)  11038.00 (  9.02%)  10864.00 (  7.30%)
  Ops memcachetest-749M   10097.00 (  0.00%)   8546.00 (-15.36%)   8770.00 (-13.14%)   4872.00 (-51.75%)
  Ops memcachetest-1623M  10161.00 (  0.00%)   3149.00 (-69.01%)   3645.00 (-64.13%)   2760.00 (-72.84%)
  Ops memcachetest-2498M   8095.00 (  0.00%)   2527.00 (-68.78%)   2461.00 (-69.60%)   2282.00 (-71.81%)
  Ops memcachetest-3372M   7814.00 (  0.00%)   2369.00 (-69.68%)   2396.00 (-69.34%)   2323.00 (-70.27%)
  Ops memcachetest-4247M   3818.00 (  0.00%)   2366.00 (-38.03%)   2391.00 (-37.38%)   2274.00 (-40.44%)
  Ops memcachetest-5121M   3852.00 (  0.00%)   2335.00 (-39.38%)   2384.00 (-38.11%)   2233.00 (-42.03%)
  
  This is showing transactions/second -- more the better. 3.0.56 was pretty
  bad in itself because a 

Re: [PATCH] add extra free kbytes tunable

2013-02-22 Thread Johannes Weiner
On Tue, Feb 19, 2013 at 09:19:27PM -0800, dormando wrote:
> >
> > The problem is that adding this tunable will constrain future VM
> > implementations.  We will forever need to at least retain the
> > pseudo-file.  We will also need to make some effort to retain its
> > behaviour.
> >
> > It would of course be better to fix things so you don't need to tweak
> > VM internals to get acceptable behaviour.
> 
> I sympathize with this. It's presently all that keeps us afloat though.
> I'll whine about it again later if nothing else pans out.
> 
> > You said:
> >
> > : We have a server workload wherein machines with 100G+ of "free" memory
> > : (used by page cache), scattered but frequent random io reads from 12+
> > : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
> > : in a few different ways.
> > :
> > : 1) It'll run into small amounts of reclaim randomly (a few hundred
> > : thousand).
> > :
> > : 2) A burst of reads or traffic can cause extra pressure, which kswapd
> > : occasionally responds to by freeing up 40g+ of the pagecache all at once
> > : (!) while pausing the system (Argh).
> > :
> > : 3) A blip in an upstream provider or failover from a peer causes the
> > : kernel to allocate massive amounts of memory for retransmission
> > : queues/etc, potentially along with buffered IO reads and (some, but not
> > : often a ton) of new allocations from an application. This paired with 2)
> > : can cause the box to stall for 15+ seconds.
> >
> > Can we prioritise these?  2) looks just awful - kswapd shouldn't just
> > go off and free 40G of pagecache.  Do you know what's actually in that
> > pagecache?  Large number of small files or small number of (very) large
> > files?
> 
> We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
> accessed via address. occasionally madvise (WILLNEED) applied to the
> address ranges before attempting to use them. There're a mix of other
> files but nothing significant. The mmap's are READONLY and writes are done
> via pwrite-ish functions.
> 
> I could use some guidance on inspecting/tracing the problem. I've been
> trying to reproduce it in a lab, and respecting to 2)'s issue I've found:
> 
> - The amount of memory freed back up is either a percentage of total
> memory or a percentage of free memory. (a machine with 48G of ram will
> "only" free up an extra 4-7g)
> 
> - It's most likely to happen after a fresh boot, or if "3 > drop_caches"
> is applied with the application down. As it fills it seems to get itself
> into trouble, but becomes more stable after that. Unfortunately 1) and 3)
> still apply to a stable instance.
> 
> - Protecting the DMA32 zone with something like "1 1 32" into
> lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
> 
> - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
> hundred thousand pages before finding anything it actually wants to
> reclaim (low vmeff). I've only been able to reproduce this from a clean
> start. It can take up to 3 seconds before kswapd starts actually
> reclaiming pages.
> 
> - So far as I can tell we're almost exclusively using 0 order allocations.
> THP is disabled.
> 
> There's not much dirty memory involved. It's not flushing out writes while
> reclaiming, it just kills off massive amount of cached memory.

Mapped file pages have to get scanned twice before they are reclaimed
because we don't have enough usage information after the first scan.

In your case, when you start this workload after a fresh boot or
dropping the caches, there will be 48G of mapped file pages that have
never been scanned before and that need to be looked at twice.
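
The two-pass behaviour can be sketched roughly like this -- loosely modelled
on page_check_references(), with simplified, invented fields standing in for
the real struct page flags and the rmap walk:

#include <stdbool.h>
#include <stdio.h>

enum pageref { PAGEREF_RECLAIM, PAGEREF_KEEP, PAGEREF_ACTIVATE };

struct page_state {
    bool referenced_ptes;    /* accessed bit seen in the page tables this scan */
    bool referenced_flag;    /* PG_referenced: seen on an earlier scan */
};

static enum pageref check_references(struct page_state *p)
{
    bool seen_before = p->referenced_flag;

    if (p->referenced_ptes) {
        p->referenced_ptes = false;    /* the scan clears the accessed bits */
        p->referenced_flag = true;     /* remember it for the next pass */
        /* referenced on two scans in a row: promote instead of reclaiming */
        return seen_before ? PAGEREF_ACTIVATE : PAGEREF_KEEP;
    }
    return PAGEREF_RECLAIM;            /* no reference since the last scan */
}

int main(void)
{
    /* freshly faulted-in mapped file page: accessed bit is still set */
    struct page_state p = { .referenced_ptes = true, .referenced_flag = false };

    printf("first scan : %d (1 = keep)\n", check_references(&p));
    printf("second scan: %d (0 = reclaim)\n", check_references(&p));
    return 0;
}

A freshly faulted-in mapped page still has its accessed bit set, so the first
scan only marks it; it becomes reclaimable on the second scan if it has not
been touched again in between.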

Unfortunately, if kswapd does not make progress (and it won't for some
time at first), it will scan more and more aggressively with
increasing scan priority.  And when the 48G of pages are finally
cycled, kswapd's scan window is a large percentage of your machine's
memory, and it will free every single page in it.

I think we should think about capping kswapd zone reclaim cycles just
as we do for direct reclaim.  It's a little ridiculous that it can run
unbounded and reclaim every page in a zone without ever checking back
against the watermark.  We still increase the scan window evenly when
we don't make forward progress, but we are more carefully inching zone
levels back toward the watermarks.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4883eb..8a4c446 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2645,10 +2645,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, 
int order,
.may_unmap = 1,
.may_swap = 1,
/*
-* kswapd doesn't want to be bailed out while reclaim. because
-* we want to put equal scanning pressure on each zone.
+* Even kswapd zone scans want to be bailed out after
+* reclaiming a good chunk of pages.  It will just
+* come back if the watermarks are still not met.
 */
-   .nr_to_reclaim = ULONG_MAX,
+   .nr_to_reclaim = SWAP_CLUSTER_MAX,


Re: [PATCH] add extra free kbytes tunable

2013-02-19 Thread dormando
>
> The problem is that adding this tunable will constrain future VM
> implementations.  We will forever need to at least retain the
> pseudo-file.  We will also need to make some effort to retain its
> behaviour.
>
> It would of course be better to fix things so you don't need to tweak
> VM internals to get acceptable behaviour.

I sympathize with this. It's presently all that keeps us afloat though.
I'll whine about it again later if nothing else pans out.

> You said:
>
> : We have a server workload wherein machines with 100G+ of "free" memory
> : (used by page cache), scattered but frequent random io reads from 12+
> : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
> : in a few different ways.
> :
> : 1) It'll run into small amounts of reclaim randomly (a few hundred
> : thousand).
> :
> : 2) A burst of reads or traffic can cause extra pressure, which kswapd
> : occasionally responds to by freeing up 40g+ of the pagecache all at once
> : (!) while pausing the system (Argh).
> :
> : 3) A blip in an upstream provider or failover from a peer causes the
> : kernel to allocate massive amounts of memory for retransmission
> : queues/etc, potentially along with buffered IO reads and (some, but not
> : often a ton) of new allocations from an application. This paired with 2)
> : can cause the box to stall for 15+ seconds.
>
> Can we prioritise these?  2) looks just awful - kswapd shouldn't just
> go off and free 40G of pagecache.  Do you know what's actually in that
> pagecache?  Large number of small files or small number of (very) large
> files?

We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
accessed via address. occasionally madvise (WILLNEED) applied to the
address ranges before attempting to use them. There're a mix of other
files but nothing significant. The mmap's are READONLY and writes are done
via pwrite-ish functions.
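
For reference, a minimal sketch of that access pattern -- the path and the
WILLNEED range are placeholders and error handling is pared down:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/data/huge.dat";    /* placeholder path */
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0) {
        perror(path);
        return 1;
    }

    /* read-only shared mapping of the whole file */
    unsigned char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* hint that the first 1GB (or the whole file, if smaller) is wanted soon */
    size_t want = (size_t)st.st_size < (1UL << 30) ? (size_t)st.st_size : (1UL << 30);
    if (madvise(map, want, MADV_WILLNEED) != 0)
        perror("madvise");

    /* ... scattered reads through 'map' happen here ... */

    munmap(map, st.st_size);
    close(fd);
    return 0;
}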

I could use some guidance on inspecting/tracing the problem. I've been
trying to reproduce it in a lab, and respecting to 2)'s issue I've found:

- The amount of memory freed back up is either a percentage of total
memory or a percentage of free memory. (a machine with 48G of ram will
"only" free up an extra 4-7g)

- It's most likely to happen after a fresh boot, or if "3 > drop_caches"
is applied with the application down. As it fills it seems to get itself
into trouble, but becomes more stable after that. Unfortunately 1) and 3)
still apply to a stable instance.

- Protecting the DMA32 zone with something like "1 1 32" into
lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.

- While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
hundred thousand pages before finding anything it actually wants to
reclaim (low vmeff). I've only been able to reproduce this from a clean
start. It can take up to 3 seconds before kswapd starts actually
reclaiming pages.

- So far as I can tell we're almost exclusively using 0 order allocations.
THP is disabled.

There's not much dirty memory involved. It's not flushing out writes while
reclaiming, it just kills off massive amount of cached memory.

We're not running the machines particularly hard... Often less than 30%
CPU usage at peak.
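
A rough sketch of why the lowmem_reserve_ratio tweak mentioned above helps:
an allocation that could have been served from a higher zone must also clear
lowmem_reserve[classzone_idx] on top of the lower zone's watermark, and the
reserve is roughly the higher zones' size divided by the ratio, so lowering
the ratio to 1 makes the reserve cover essentially all of them. The numbers
below are invented and the check only mirrors __zone_watermark_ok() in
spirit:

#include <stdbool.h>
#include <stdio.h>

#define MAX_NR_ZONES 3    /* toy setup: DMA, DMA32, Normal */

struct zone {
    const char *name;
    unsigned long free_pages;
    unsigned long min_wmark;
    unsigned long lowmem_reserve[MAX_NR_ZONES];
};

/* roughly what the watermark check enforces for a fallback allocation */
static bool zone_watermark_ok(const struct zone *z, int classzone_idx)
{
    return z->free_pages >= z->min_wmark + z->lowmem_reserve[classzone_idx];
}

int main(void)
{
    unsigned long normal_pages = 11534336;    /* ~44GB worth of Normal zone, invented */
    struct zone dma32 = {
        .name = "DMA32",
        .free_pages = 300000,
        .min_wmark = 16384,
    };
    unsigned long ratios[] = { 256, 1 };      /* default vs. the "1 1 32" style tweak */

    for (int i = 0; i < 2; i++) {
        dma32.lowmem_reserve[2] = normal_pages / ratios[i];
        printf("ratio %3lu: Normal-targeted fallback into DMA32 allowed: %d\n",
               ratios[i], zone_watermark_ok(&dma32, 2));
    }
    return 0;
}

With the default ratio the fallback is allowed; with the ratio dropped to 1,
Normal-targeted allocations stop spilling into (and then draining) DMA32.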


Re: [PATCH] add extra free kbytes tunable

2013-02-19 Thread Andrew Morton
On Sun, 17 Feb 2013 15:48:31 -0800 (PST)
dormando  wrote:

> Add a userspace visible knob to tell the VM to keep an extra amount
> of memory free, by increasing the gap between each zone's min and
> low watermarks.

The problem is that adding this tunable will constrain future VM
implementations.  We will forever need to at least retain the
pseudo-file.  We will also need to make some effort to retain its
behaviour.

It would of course be better to fix things so you don't need to tweak
VM internals to get acceptable behaviour.

You said:

: We have a server workload wherein machines with 100G+ of "free" memory
: (used by page cache), scattered but frequent random io reads from 12+
: SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
: in a few different ways.
: 
: 1) It'll run into small amounts of reclaim randomly (a few hundred
: thousand).
: 
: 2) A burst of reads or traffic can cause extra pressure, which kswapd
: occasionally responds to by freeing up 40g+ of the pagecache all at once
: (!) while pausing the system (Argh).
: 
: 3) A blip in an upstream provider or failover from a peer causes the
: kernel to allocate massive amounts of memory for retransmission
: queues/etc, potentially along with buffered IO reads and (some, but not
: often a ton) of new allocations from an application. This paired with 2)
: can cause the box to stall for 15+ seconds.

Can we prioritise these?  2) looks just awful - kswapd shouldn't just
go off and free 40G of pagecache.  Do you know what's actually in that
pagecache?  Large number of small files or small number of (very) large
files?


[PATCH] add extra free kbytes tunable

2013-02-17 Thread dormando
From: Rik van Riel r...@redhat.com

Add a userspace visible knob to tell the VM to keep an extra amount
of memory free, by increasing the gap between each zone's min and
low watermarks.

This is useful for realtime applications that call system
calls and have a bound on the number of allocations that happen
in any short time period.  In this application, extra_free_kbytes
would be left at an amount equal to or larger than the
maximum number of allocations that happen in any burst.

It may also be useful to reduce the memory use of virtual
machines (temporarily?), in a way that does not cause memory
fragmentation like ballooning does.
---
 Documentation/sysctl/vm.txt |   16 
 include/linux/mmzone.h  |2 +-
 include/linux/swap.h|2 ++
 kernel/sysctl.c |   11 +--
 mm/page_alloc.c |   39 +--
 5 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 078701f..5d12bbd 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm:
 - dirty_writeback_centisecs
 - drop_caches
 - extfrag_threshold
+- extra_free_kbytes
 - hugepages_treat_as_movable
 - hugetlb_shm_group
 - laptop_mode
@@ -167,6 +168,21 @@ fragmentation index is <= extfrag_threshold. The default 
value is 500.

 ==

+extra_free_kbytes
+
+This parameter tells the VM to keep extra free memory between the threshold
+where background reclaim (kswapd) kicks in, and the threshold where direct
+reclaim (by allocating processes) kicks in.
+
+This is useful for workloads that require low latency memory allocations
+and have a bounded burstiness in memory allocations, for example a
+realtime application that receives and transmits network traffic
+(causing in-kernel memory allocations) with a maximum total message burst
+size of 200MB may need 200MB of extra free memory to avoid direct reclaim
+related latencies.
+
+==
+
 hugepages_treat_as_movable

 This parameter is only useful when kernelcore= is specified at boot time to
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 73b64a3..7f8f883 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -881,7 +881,7 @@ static inline int is_dma(struct zone *zone)

 /* These two functions are used to setup the per zone pages min values */
 struct ctl_table;
-int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
+int free_kbytes_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 68df9c1..66a12c4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -215,6 +215,8 @@ struct swap_list_t {
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
+extern int min_free_kbytes;
+extern int extra_free_kbytes;
 extern unsigned long dirty_balance_reserve;
 extern unsigned int nr_free_buffer_pages(void);
 extern unsigned int nr_free_pagecache_pages(void);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c88878d..102e9a1 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -104,7 +104,6 @@ extern char core_pattern[];
 extern unsigned int core_pipe_limit;
 #endif
 extern int pid_max;
-extern int min_free_kbytes;
 extern int pid_max_min, pid_max_max;
 extern int sysctl_drop_caches;
 extern int percpu_pagelist_fraction;
@@ -1246,10 +1245,18 @@ static struct ctl_table vm_table[] = {
.data   = &min_free_kbytes,
.maxlen = sizeof(min_free_kbytes),
.mode   = 0644,
-   .proc_handler   = min_free_kbytes_sysctl_handler,
+   .proc_handler   = free_kbytes_sysctl_handler,
.extra1 = &zero,
},
{
+   .procname   = "extra_free_kbytes",
+   .data   = &extra_free_kbytes,
+   .maxlen = sizeof(extra_free_kbytes),
+   .mode   = 0644,
+   .proc_handler   = free_kbytes_sysctl_handler,
+   .extra1 = &zero,
+   },
+   {
.procname   = "percpu_pagelist_fraction",
.data   = &percpu_pagelist_fraction,
.maxlen = sizeof(percpu_pagelist_fraction),
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9673d96..5380d84 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -194,8 +194,21 @@ static char * const zone_names[MAX_NR_ZONES] = {
 "Movable",
 };

+/*
+ * Try to keep at least this much lowmem free.  Do not allow normal
+ * allocations below this point, only high priority ones. Automatically
