Re: [patch 00/20] VM pageout scalability improvements

2007-12-27 Thread Matt Mackall

On Sun, 2007-12-23 at 20:11 -0500, Rik van Riel wrote:
> On Mon, 24 Dec 2007 04:29:36 +0530
> Balbir Singh <[EMAIL PROTECTED]> wrote:
> > Rik van Riel wrote:
> 
> > > In the real world, users with large JVMs on their servers, which
> > > sometimes go a little into swap, can trigger this state.  All of
> > > the CPUs end up scanning the active list, and all pages have the
> > > referenced bit set.  Even if the system eventually recovers, it
> > > might as well have been dead.
> > > 
> > > Going into swap a little should only take a little bit of time.
> > 
> > Very fascinating. So we need to scale better with larger memory.
> > I suspect part of the answer will lie in using large/huge pages.
> 
> Linus vetoed going to a larger soft page size, with good reason.
> 
> Just look at how much the 64kB page size on PPC64 sucks for most
> workloads - it works for PPC64 because people buy PPC64 monster
> systems for the kinds of monster workloads that work well with a
> large page size, but it definitely isn't general purpose.

Indeed, machines already exist with >> 1TB of RAM, so even going to 1MB
pages leaves these machines in trouble. Going to big pages a few years
ago would have pushed the problem back a few years, but now we need real
fixes.
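
(For scale, an illustrative comparison: a 4TB machine with 1MB pages
has roughly the same four million pages to manage as a 16GB machine
with 4kB pages, and 16GB is already enough to wedge the VM in some of
the test cases Rik describes below.)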

-- 
Mathematics is the supreme nostalgia of our time.


Re: [patch 00/20] VM pageout scalability improvements

2007-12-23 Thread Rik van Riel
On Mon, 24 Dec 2007 04:29:36 +0530
Balbir Singh <[EMAIL PROTECTED]> wrote:
> Rik van Riel wrote:

> > In the real world, users with large JVMs on their servers, which
> > sometimes go a little into swap, can trigger this state.  All of
> > the CPUs end up scanning the active list, and all pages have the
> > referenced bit set.  Even if the system eventually recovers, it
> > might as well have been dead.
> > 
> > Going into swap a little should only take a little bit of time.
> 
> Very fascinating. So we need to scale better with larger memory.
> I suspect part of the answer will lie in using large/huge pages.

Linus vetoed going to a larger soft page size, with good reason.

Just look at how much the 64kB page size on PPC64 sucks for most
workloads - it works for PPC64 because people buy PPC64 monster
systems for the kinds of monster workloads that work well with a
large page size, but it definitely isn't general purpose.
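
(One illustrative number on that, under assumed file sizes: a tree of
50,000 small files averaging 8kB occupies roughly 400MB of page cache
with 4kB pages, but roughly 3.2GB with 64kB pages.)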


Re: [patch 00/20] VM pageout scalability improvements

2007-12-23 Thread Balbir Singh
Rik van Riel wrote:
> On Sun, 23 Dec 2007 01:57:32 +0530
> Balbir Singh <[EMAIL PROTECTED]> wrote:
>> Rik van Riel wrote:
>>> On large memory systems, the VM can spend way too much time scanning
>>> through pages that it cannot (or should not) evict from memory. Not
>>> only does it use up CPU time, but it also provokes lock contention
>>> and can leave large systems under memory pressure in a catatonic state.
>> I remember you mentioning that by large memory systems you mean systems
>> with at least 128GB; does this definition still hold?
> 
> It depends on the workload.  Certain test cases can wedge the
> VM with as little as 16GB of RAM.  Other workloads cause trouble
> at 32 or 64GB, with the system sometimes hanging for several
> minutes, all the CPUs in the pageout code and no actual swap IO.
> 

Interesting, I have not run into it so far. But I have smaller machines,
typically 4-8GB.

> On systems of 128GB and more, we have seen systems hang in the
> pageout code overnight, without deciding what to swap out.
> 
>>> This patch series improves VM scalability by:
>>>
>>> 1) making the locking a little more scalable
>>>
>>> 2) putting filesystem-backed, swap-backed and non-reclaimable pages
>>>onto their own LRUs, so the system only scans the pages that it
>>>can/should evict from memory
>>>
>>> 3) switching to SEQ replacement for the anonymous LRUs, so the
>>>number of pages that need to be scanned when the system
>>>starts swapping is bound to a reasonable number
>>>
>>> The noreclaim patches come verbatim from Lee Schermerhorn and
>>> Nick Piggin.  I have not taken a detailed look at them yet and
>>> all I have done is fix the rejects against the latest -mm kernel.
>> Is there a consolidated patch available? It makes it easier to test.
> 
> I will make a big patch available with the next version.  I have
> to upgrade my patch set to newer noreclaim patches from Lee and
> add a few small cleanups elsewhere.
> 

That would be nice. I'll try to help out by testing the patches and
running them.

>>> I am posting this series now because I would like to get more
>>> feedback, while I am studying and improving the noreclaim patches
>>> myself.
>> What kind of tests show the problem? I'll try to review and test the code.
> 
> The easiest test possible simply allocates a ton of memory and
> then touches it all.  Enough memory that the system needs to go
> into swap.
> 
> Once memory is full, you will see the VM scan like mad, with a
> big CPU spike (clearing the referenced bits off all pages) before
> it starts swapping out anything.  That big CPU spike should be
> gone or greatly reduced with my patches.
> 
> On really huge systems, that big CPU spike can be enough for one
> CPU to spend so much time in the VM that all the other CPUs join
> it, and the system goes under in a big lock contention fest.
> 
> Besides, even single-threadedly clearing the referenced bits on
> 1TB worth of memory can't result in acceptable latencies :)
> 
> In the real world, users with large JVMs on their servers, which
> sometimes go a little into swap, can trigger this state.  All of
> the CPUs end up scanning the active list, and all pages have the
> referenced bit set.  Even if the system eventually recovers, it
> might as well have been dead.
> 
> Going into swap a little should only take a little bit of time.
> 

Very fascinating. So we need to scale better with larger memory.
I suspect part of the answer will lie in using large/huge pages.



-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [patch 00/20] VM pageout scalability improvements

2007-12-22 Thread Rik van Riel
On Sun, 23 Dec 2007 01:57:32 +0530
Balbir Singh <[EMAIL PROTECTED]> wrote:
> Rik van Riel wrote:
> > On large memory systems, the VM can spend way too much time scanning
> > through pages that it cannot (or should not) evict from memory. Not
> > only does it use up CPU time, but it also provokes lock contention
> > and can leave large systems under memory pressure in a catatonic state.
> 
> I remember you mentioning that by large memory systems you mean systems
> with at least 128GB; does this definition still hold?

It depends on the workload.  Certain test cases can wedge the
VM with as little as 16GB of RAM.  Other workloads cause trouble
at 32 or 64GB, with the system sometimes hanging for several
minutes, all the CPUs in the pageout code and no actual swap IO.

On systems of 128GB and more, we have seen systems hang in the
pageout code overnight, without deciding what to swap out.
 
> > This patch series improves VM scalability by:
> > 
> > 1) making the locking a little more scalable
> > 
> > 2) putting filesystem-backed, swap-backed and non-reclaimable pages
> >onto their own LRUs, so the system only scans the pages that it
> >can/should evict from memory
> > 
> > 3) switching to SEQ replacement for the anonymous LRUs, so the
> >number of pages that need to be scanned when the system
> >starts swapping is bound to a reasonable number
> > 
> > The noreclaim patches come verbatim from Lee Schermerhorn and
> > Nick Piggin.  I have not taken a detailed look at them yet and
> > all I have done is fix the rejects against the latest -mm kernel.
> 
> Is there a consolidated patch available? It makes it easier to test.

I will make a big patch available with the next version.  I have
to upgrade my patch set to newer noreclaim patches from Lee and
add a few small cleanups elsewhere.

> > I am posting this series now because I would like to get more
> > feedback, while I am studying and improving the noreclaim patches
> > myself.
> 
> What kind of tests show the problem? I'll try to review and test the code.

The easiest test possible simply allocates a ton of memory and
then touches it all.  Enough memory that the system needs to go
into swap.
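
Something like this minimal program is enough (a sketch; the 32GB
size is an assumption, use anything comfortably larger than physical
RAM):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	size_t total = (size_t)32 << 30;	/* assumption: > physical RAM */
	size_t page = 4096;			/* assumed page size */
	size_t off;
	char *mem = malloc(total);

	if (!mem) {
		perror("malloc");
		return 1;
	}

	/* Touch every page so it is really allocated; once RAM is
	 * exhausted this pushes the system into swap. */
	for (off = 0; off < total; off += page)
		mem[off] = 1;

	/* Touch everything again so every page has its referenced
	 * bit set when the VM starts scanning. */
	for (off = 0; off < total; off += page)
		mem[off]++;

	puts("done");
	return 0;
}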

Once memory is full, you will see the VM scan like mad, with a
big CPU spike (clearing the referenced bits off all pages) before
it starts swapping out anything.  That big CPU spike should be
gone or greatly reduced with my patches.

On really huge systems, that big CPU spike can be enough for one
CPU to spend so much time in the VM that all the other CPUs join
it, and the system goes under in a big lock contention fest.

Besides, even single-threadedly clearing the referenced bits on
1TB worth of memory can't result in acceptable latencies :)
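
(Back-of-envelope, with assumed numbers: at a 4kB page size, 1TB is
2^28, roughly 268 million, pages.  Even at an optimistic ~100ns to
visit a page and clear its referenced bit, a single full pass costs
about 27 seconds of CPU time.)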

In the real world, users with large JVMs on their servers, which
sometimes go a little into swap, can trigger this state.  All of
the CPUs end up scanning the active list, and all pages have the
referenced bit set.  Even if the system eventually recovers, it
might as well have been dead.

Going into swap a little should only take a little bit of time.


Re: [patch 00/20] VM pageout scalability improvements

2007-12-22 Thread Balbir Singh
Rik van Riel wrote:
> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory pressure in a catatonic state.
> 

Hi, Rik,

I remember you mentioning that by large memory systems you mean systems
with at least 128GB; does this definition still hold?

> This patch series improves VM scalability by:
> 
> 1) making the locking a little more scalable
> 
> 2) putting filesystem-backed, swap-backed and non-reclaimable pages
>onto their own LRUs, so the system only scans the pages that it
>can/should evict from memory
> 
> 3) switching to SEQ replacement for the anonymous LRUs, so the
>number of pages that need to be scanned when the system
>starts swapping is bound to a reasonable number
> 
> The noreclaim patches come verbatim from Lee Schermerhorn and
> Nick Piggin.  I have not taken a detailed look at them yet and
> all I have done is fix the rejects against the latest -mm kernel.
> 

Is there a consolidated patch available? It makes it easier to test.

> I am posting this series now because I would like to get more
> feedback, while I am studying and improving the noreclaim patches
> myself.
> 

What kind of tests show the problem? I'll try to review and test the code.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


[patch 00/20] VM pageout scalability improvements

2007-12-18 Thread Rik van Riel
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not
only does it use up CPU time, but it also provokes lock contention
and can leave large systems under memory pressure in a catatonic state.

This patch series improves VM scalability by:

1) making the locking a little more scalable

2) putting filesystem-backed, swap-backed and non-reclaimable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory (a sketch of the idea follows below)

3) switching to SEQ replacement for the anonymous LRUs, so the
   number of pages that need to be scanned when the system
   starts swapping is bound to a reasonable number
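
Roughly, the sketch promised in (2) (illustrative names only, not
the actual patch):

#include <linux/list.h>

/*
 * Illustration only: one LRU list per page class, so reclaim
 * never walks pages it cannot or should not evict.
 */
enum lru_list {
	LRU_INACTIVE_ANON,	/* swap backed */
	LRU_ACTIVE_ANON,
	LRU_INACTIVE_FILE,	/* filesystem backed */
	LRU_ACTIVE_FILE,
	LRU_NORECLAIM,		/* mlocked, ramfs, etc: never scanned */
	NR_LRU_LISTS
};

struct zone_lru {
	struct list_head list[NR_LRU_LISTS];
	unsigned long nr_pages[NR_LRU_LISTS];
};

/* Reclaim only ever looks at the evictable lists. */
static inline int lru_evictable(enum lru_list lru)
{
	return lru != LRU_NORECLAIM;
}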

The noreclaim patches come verbatim from Lee Schermerhorn and
Nick Piggin.  I have not taken a detailed look at them yet and
all I have done is fix the rejects against the latest -mm kernel.

I am posting this series now because I would like to get more
feedback, while I am studying and improving the noreclaim patches
myself.

-- 
All Rights Reversed
