Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-07 Thread Nate Diller

On 2/7/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Tue, 6 Feb 2007, Nate Diller wrote:
>
> > > The dirty ratio with the ZVCs would be
> > >
> > > NR_DIRTY + NR_UNSTABLE_NFS
> > > /
> > > NR_FREE_PAGES + NR_INACTIVE + NR_ACTIVE + NR_MLOCK
> >
> > I don't understand why you want to account mlocked pages in
> > dirty_ratio.  of course mlocked pages *can* be dirty, but they have no
> > relevance in the write throttling code.  the point of dirty ratio is
>
> mlocked pages can be counted as dirty pages. So if we do not include
> NR_MLOCK in the number of total pages that could be dirty then we may in
> some cases have >100% dirty pages.


unless we exclude mlocked dirty pages from NR_DIRTY accounting, which
is what i suggest should be done as part of this patch.
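
a minimal sketch of the two ratio definitions being debated, using the
ZVC names from this thread (NR_DIRTY follows the thread's naming; the
in-tree dirty counter at the time was NR_FILE_DIRTY.  neither function
is from any posted patch):

static unsigned long dirty_percent_per_christoph(void)
{
	unsigned long dirty = global_page_state(NR_DIRTY) +
			      global_page_state(NR_UNSTABLE_NFS);
	/* mlocked pages stay in the total, so the ratio cannot
	 * exceed 100% even if NR_DIRTY includes mlocked pages */
	unsigned long total = global_page_state(NR_FREE_PAGES) +
			      global_page_state(NR_INACTIVE) +
			      global_page_state(NR_ACTIVE) +
			      global_page_state(NR_MLOCK);

	return total ? dirty * 100 / total : 0;
}

static unsigned long dirty_percent_per_nate(void)
{
	/* assumes NR_DIRTY has been taught to skip mlocked pages */
	unsigned long dirty = global_page_state(NR_DIRTY) +
			      global_page_state(NR_UNSTABLE_NFS);
	unsigned long total = global_page_state(NR_FREE_PAGES) +
			      global_page_state(NR_INACTIVE) +
			      global_page_state(NR_ACTIVE);

	return total ? dirty * 100 / total : 0;
}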


> > to guarantee that there are some number of non-dirty, non-pinned,
> > non-mlocked pages on the LRU, to (try to) avoid deadlocks where the
> > writeback path needs to allocate pages, which many filesystems like to
> > do.  if an mlocked page is clean, there's still no way to free it up,
> > so it should not be treated as being on the LRU at all, for write
> > throttling.  the ideal (IMO) dirty ratio would be
>
> Hmmm... I think write throttling is different from reclaim. In write
> throttling the major objective is to decouple the applications from
> the physical I/O. So the dirty ratio specifies how much "buffer" space
> can be used for I/O. There is an issue that too many dirty pages will
> cause difficulty for reclaim because pages can only be reclaimed after
> writeback is complete.


NR_DIRTY is only used for write throttling, right?  well, and
reporting to user-space, but again, i suggest that user space should
get to see NR_MLOCKED as well.  would people flip out if NR_DIRTY
stopped showing pages that are mlocked, as long as a separate
NR_MLOCKED variable was present?


> And yes this is not true for mlocked pages.
>
> > NR_DIRTY - NR_DIRTY_MLOCKED + NR_UNSTABLE_NFS
> > /
> > NR_FREE_PAGES + NR_INACTIVE + NR_ACTIVE
> >
> > obviously it's kinda useless to keep an NR_DIRTY_MLOCKED counter, any
> > of these mlock accounting schemes could easily be modified to update
> > the NR_DIRTY counter so that it only reflects dirty unpinned pages,
> > and not mlocked ones.
>
> So you would be okay with dirty_ratio possibly being >100% if mlocked
> pages are dirty?
>
> > is that the only place you wanted to have an accurate mlocked page count?
>
> Rik had some other ideas on what to do with it. I also think we may end up
> checking for excessively high mlock counts in various tight VM situations.


i'd be wary of a VM algorithm that treated mlocked pages any
differently than, say, unreclaimable slab pages.  but there are no
concrete suggestions yet, so i won't comment further.

all this is not to say that i dislike the idea of keeping mlocked
pages off the LRU.  quite the opposite: i've been looking for this for a
while, and was hoping that Stone Wang's wired list patch
(http://lkml.org/lkml/2006/3/20/128) would get further than it did.
but i don't see the need to keep strict accounting if it hurts
performance in the common case.

NATE


Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-07 Thread Christoph Lameter
On Tue, 6 Feb 2007, Nate Diller wrote:

> > The dirty ratio with the ZVCs would be
> > 
> > NR_DIRTY + NR_UNSTABLE_NFS
> > /
> > NR_FREE_PAGES + NR_INACTIVE + NR_ACTIVE + NR_MLOCK
> 
> I don't understand why you want to account mlocked pages in
> dirty_ratio.  of course mlocked pages *can* be dirty, but they have no
> relevance in the write throttling code.  the point of dirty ratio is

mlocked pages can be counted as dirty pages. So if we do not include
NR_MLOCK in the number of total pages that could be dirty then we may in 
some cases have >100% dirty pages.

> to guarantee that there are some number of non-dirty, non-pinned,
> non-mlocked pages on the LRU, to (try to) avoid deadlocks where the
> writeback path needs to allocate pages, which many filesystems like to
> do.  if an mlocked page is clean, there's still no way to free it up,
> so it should not be treated as being on the LRU at all, for write
> throttling.  the ideal (IMO) dirty ratio would be

Hmmm... I think write throttling is different from reclaim. In write 
throttling the major objective is to decouple the applications from
the physical I/O. So the dirty ratio specifies how much "buffer" space
can be used for I/O. There is an issue that too many dirty pages will
cause difficulty for reclaim because pages can only be reclaimed after
writeback is complete.

And yes this is not true for mlocked pages.

> 
> NR_DIRTY - NR_DIRTY_MLOCKED + NR_UNSTABLE_NFS
>    /
> NR_FREE_PAGES + NR_INACTIVE + NR_ACTIVE
> 
> obviously it's kinda useless to keep an NR_DIRTY_MLOCKED counter, any
> of these mlock accounting schemes could easily be modified to update
> the NR_DIRTY counter so that it only reflects dirty unpinned pages,
> and not mlocked ones.

So you would be okay with dirty_ratio possibly being >100% if mlocked pages 
are dirty?

> is that the only place you wanted to have an accurate mlocked page count?

Rik had some other ideas on what to do with it. I also think we may end up 
checking for excessively high mlock counts in various tight VM situations.



Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-06 Thread Nate Diller

On 2/4/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Sun, 4 Feb 2007, Arjan van de Ven wrote:
>
> >
> > > Exclusion or inclusion of NR_MLOCK number is straightforward for the dirty
> > > ratio calculations. global_page_state(NR_MLOCK) f.e. would get us totals on
> > > mlocked pages per zone. node_page_state(NR_MLOCK) gives a node specific
> > > number of mlocked pages. The nice thing about ZVCs is that they allow
> > > easy access to counts on different levels.
> >
> > however... mlocked pages still can be dirty, and want to be written back
> > at some point ;)
>
> Yes that is why we need to add them to the count of total pages.
>
> > I can see the point of doing dirty ratio as percentage of the LRU size,
> > but in that case you don't need to track NR_MLOCK, only the total LRU
> > size. (And yes it'll be sometimes optimistic because not all mlock'd
> > pages are moved off the lru yet, but I doubt you'll have that as a
> > problem in practice)
>
> The dirty ratio with the ZVCs would be
>
> NR_DIRTY + NR_UNSTABLE_NFS
> /
> NR_FREE_PAGES + NR_INACTIVE + NR_ACTIVE + NR_MLOCK


I don't understand why you want to account mlocked pages in
dirty_ratio.  of course mlocked pages *can* be dirty, but they have no
relevance in the write throttling code.  the point of dirty ratio is
to guarantee that there are some number of non-dirty, non-pinned,
non-mlocked pages on the LRU, to (try to) avoid deadlocks where the
writeback path needs to allocate pages, which many filesystems like to
do.  if an mlocked page is clean, there's still no way to free it up,
so it should not be treated as being on the LRU at all, for write
throttling.  the ideal (IMO) dirty ratio would be

NR_DIRTY - NR_DIRTY_MLOCKED + NR_UNSTABLE_NFS
   /
NR_FREE_PAGES + NR_INACTIVE + NR_ACTIVE

obviously it's kinda useless to keep an NR_DIRTY_MLOCKED counter, any
of these mlock accounting schemes could easily be modified to update
the NR_DIRTY counter so that it only reflects dirty unpinned pages,
and not mlocked ones.
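
a hedged sketch of that NR_DIRTY adjustment (helper names invented, not
from any posted patch; PageMlocked() is the flag from Christoph's
patches elsewhere in this thread):

/* called wherever a page's dirty accounting changes */
static void account_page_dirtied(struct page *page)
{
	if (!PageMlocked(page))
		__inc_zone_page_state(page, NR_DIRTY);
}

static void account_page_cleaned(struct page *page)
{
	if (!PageMlocked(page))
		__dec_zone_page_state(page, NR_DIRTY);
}

mlock() of an already-dirty page would also have to decrement NR_DIRTY,
and munlock() of a still-dirty page to increment it, for the counter to
stay consistent across state changes.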

is that the only place you wanted to have an accurate mlocked page count?

NATE


Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-05 Thread Christoph Lameter
The patch seems to work and survives AIM7. However, we only know about 30% of 
the Mlocked pages after boot. With this additional patch to 
opportunistically move pages off the LRU immediately, I can get the counter 
to be accurate (for all practical purposes), like the non-lazy version:


Index: current/mm/memory.c
===
--- current.orig/mm/memory.c2007-02-05 10:44:10.0 -0800
+++ current/mm/memory.c 2007-02-05 11:01:46.0 -0800
@@ -919,6 +919,30 @@ void anon_add(struct vm_area_struct *vma
 }
 
 /*
+ * Opportunistically move the page off the LRU
+ * if possible. If we do not succeed then the LRU
+ * scans will take the page off.
+ */
+void try_to_set_mlocked(struct page *page)
+{
+   struct zone *zone;
+   unsigned long flags;
+
+   if (!PageLRU(page) || PageMlocked(page))
+   return;
+
+   zone = page_zone(page);
+   if (spin_trylock_irqsave(&zone->lru_lock, flags)) {
+   if (PageLRU(page) && !PageMlocked(page)) {
+   ClearPageLRU(page);
+   list_del(&page->lru);
+   SetPageMlocked(page);
+   __inc_zone_page_state(page, NR_MLOCK);
+   }
+   spin_unlock_irqrestore(&zone->lru_lock, flags);
+   }
+}
+/*
  * Do a quick page-table lookup for a single page.
  */
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
@@ -978,6 +1002,8 @@ struct page *follow_page(struct vm_area_
set_page_dirty(page);
mark_page_accessed(page);
}
+   if (vma->vm_flags & VM_LOCKED)
+   try_to_set_mlocked(page);
 unlock:
pte_unmap_unlock(ptep, ptl);
 out:
@@ -2271,6 +2297,8 @@ retry:
else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(new_page);
+   if (vma->vm_flags & VM_LOCKED)
+   try_to_set_mlocked(new_page);
if (write_access) {
dirty_page = new_page;
get_page(dirty_page);


Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-05 Thread Christoph Lameter
On Mon, 5 Feb 2007, Matt Mackall wrote:

> 2) lazy accounting - the same as above, with the work all moved to the
> LRU sweep
> 3) accounting with an extra page flag - still needs to scan VMAs on munmap
> 
> Christoph seems to prefer the third.

No, I am saying that 2 requires 3 to work reliably. The patch I posted last 
night does 2, but the approach needs a page flag in order to work 
correctly.



Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-05 Thread Christoph Lameter
On Mon, 5 Feb 2007, Arjan van de Ven wrote:

> > I still need a solution for the problem of not having enough page flag 
> > bits on i386 NUMA.
> 
> I still don't get why you *really* need such a bit. 

Because otherwise you cannot establish why a page was removed from the 
LRU. If a page is off the LRU for other reasons then one should not 
return the page to the LRU in zap_pte_range. How can this determination 
be made without a page flag?



Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-05 Thread Matt Mackall
On Mon, Feb 05, 2007 at 09:39:34AM +0100, Arjan van de Ven wrote:
> On Sun, 2007-02-04 at 23:57 -0800, Christoph Lameter wrote:
> > Hmmm.. I have had no time to test this one yet but I think this should 
> > work. It uses the delayed method and a new page flag PageMlocked() with 
> > different semantics. Fix for page migration is also included.
> > 
> > The patch avoids putting new anonymous mlocked pages on the LRU. Maybe the
> > same could be done for new pagecache pages?
> > 
> > I still need a solution for the problem of not having enough page flag 
> > bits on i386 NUMA.
> 
> I still don't get why you *really* need such a bit. 

There are three possibilities mentioned so far:

1) slow accounting - scan each attached VMA on each mmap/munmap
2) lazy accounting - the same as above, with the work all moved to the
LRU sweep
3) accounting with an extra page flag - still needs to scan VMAs on munmap

Christoph seems to prefer the third.

I wonder if we couldn't stick a rough counter in address_space to
fast-path the slow accounting - we'll typically only have 0 or 1 locks
active.
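
A kernel-style sketch of that fast path (the structure and helper are
invented for illustration; nothing like this was in any posted patch):

/* rough per-address_space count of VM_LOCKED vmas, bumped by
 * mlock()/mmap(VM_LOCKED) and dropped by munlock()/munmap() */
struct address_space_mlock {
	atomic_t nr_locked_vmas;
};

/* munlock fast path: if ours was the only locked vma, there is no
 * need to walk every vma that maps the file */
static inline int mapping_has_other_mlocks(struct address_space_mlock *m)
{
	return atomic_read(&m->nr_locked_vmas) > 1;
}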

-- 
Mathematics is the supreme nostalgia of our time.


Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-05 Thread Arjan van de Ven
On Sun, 2007-02-04 at 23:57 -0800, Christoph Lameter wrote:
> Hmmm.. I have had no time to test this one yet but I think this should 
> work. It uses the delayed method and a new page flag PageMlocked() with 
> different semantics. Fix for page migration is also included.
> 
> The patch avoids putting new anonymous mlocked pages on the LRU. Maybe the
> same could be done for new pagecache pages?
> 
> I still need a solution for the problem of not having enough page flag 
> bits on i386 NUMA.

I still don't get why you *really* need such a bit. 
-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via 
http://www.linuxfirmwarekit.org



Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-04 Thread Christoph Lameter
Hmmm.. I have had no time to test this one yet but I think this should 
work. It uses the delayed method and a new page flag PageMlocked() with 
different semantics. Fix for page migration is also included.

The patch avoids putting new anonymous mlocked pages on the LRU. Maybe the 
same could be done for new pagecache pages?

I still need a solution for the problem of not having enough page flag 
bits on i386 NUMA.


Index: current/mm/vmscan.c
===
--- current.orig/mm/vmscan.c2007-02-03 10:53:15.0 -0800
+++ current/mm/vmscan.c 2007-02-04 22:59:01.0 -0800
@@ -516,10 +516,11 @@ static unsigned long shrink_page_list(st
if (page_mapped(page) && mapping) {
switch (try_to_unmap(page, 0)) {
case SWAP_FAIL:
-   case SWAP_MLOCK:
goto activate_locked;
case SWAP_AGAIN:
goto keep_locked;
+   case SWAP_MLOCK:
+   goto mlocked;
case SWAP_SUCCESS:
; /* try to free the page below */
}
@@ -594,6 +595,14 @@ free_it:
__pagevec_release_nonlru(&freed_pvec);
continue;
 
+mlocked:
+   ClearPageActive(page);
+   unlock_page(page);
+   __inc_zone_page_state(page, NR_MLOCK);
+   smp_wmb();
+   SetPageMlocked(page);
+   continue;
+
 activate_locked:
SetPageActive(page);
pgactivate++;
Index: current/mm/memory.c
===
--- current.orig/mm/memory.c2007-02-03 10:52:37.0 -0800
+++ current/mm/memory.c 2007-02-04 23:48:36.0 -0800
@@ -682,6 +682,8 @@ static unsigned long zap_pte_range(struc
file_rss--;
}
page_remove_rmap(page, vma);
+   if (PageMlocked(page) && vma->vm_flags & VM_LOCKED)
+   lru_cache_add_mlock(page);
tlb_remove_page(tlb, page);
continue;
}
@@ -898,6 +900,25 @@ unsigned long zap_page_range(struct vm_a
 }
 
 /*
+ * Add a new anonymous page
+ */
+void anon_add(struct vm_area_struct *vma, struct page *page,
+   unsigned long address)
+{
+   inc_mm_counter(vma->vm_mm, anon_rss);
+   if (vma->vm_flags & VM_LOCKED) {
+   /*
+* Page is new and therefore not on the LRU
+* so we can directly mark it as mlocked
+*/
+   SetPageMlocked(page);
+   inc_zone_page_state(page, NR_MLOCK);
+   } else
+   lru_cache_add_active(page);
+   page_add_new_anon_rmap(page, vma, address);
+}
+
+/*
  * Do a quick page-table lookup for a single page.
  */
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
@@ -2101,9 +2122,7 @@ static int do_anonymous_page(struct mm_s
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!pte_none(*page_table))
goto release;
-   inc_mm_counter(mm, anon_rss);
-   lru_cache_add_active(page);
-   page_add_new_anon_rmap(page, vma, address);
+   anon_add(vma, page, address);
} else {
/* Map the ZERO_PAGE - vm_page_prot is readonly */
page = ZERO_PAGE(address);
@@ -2247,11 +2266,9 @@ retry:
if (write_access)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
set_pte_at(mm, address, page_table, entry);
-   if (anon) {
-   inc_mm_counter(mm, anon_rss);
-   lru_cache_add_active(new_page);
-   page_add_new_anon_rmap(new_page, vma, address);
-   } else {
+   if (anon)
+   anon_add(vma, new_page, address);
+   else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(new_page);
if (write_access) {
Index: current/drivers/base/node.c
===
--- current.orig/drivers/base/node.c2007-02-03 10:52:35.0 -0800
+++ current/drivers/base/node.c 2007-02-03 10:53:25.0 -0800
@@ -60,6 +60,7 @@ static ssize_t node_read_meminfo(struct 
   "Node %d FilePages:%8lu kB\n"
   "Node %d Mapped:   %8lu kB\n"
   "Node %d AnonPages:%8lu kB\n"
+  "Node %d Mlock:%8lu KB\n"
   "Node %d PageTables:   %8lu kB\n"
   

Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-04 Thread Christoph Lameter
On Sun, 4 Feb 2007, Arjan van de Ven wrote:

> 
> > Exclusion or inclusion of NR_MLOCK number is straightforward for the dirty 
> > ratio calculations. global_page_state(NR_MLOCK) f.e. would get us totals on 
> > mlocked pages per zone. node_page_state(NR_MLOCK) gives a node specific 
> > number of mlocked pages. The nice thing about ZVCs is that they allow
> > easy access to counts on different levels.
> 
> however... mlocked pages still can be dirty, and want to be written back
> at some point ;)

Yes that is why we need to add them to the count of total pages.
 
> I can see the point of doing dirty ratio as percentage of the LRU size,
> but in that case you don't need to track NR_MLOCK, only the total LRU
> size. (And yes it'll be sometimes optimistic because not all mlock'd
> pages are moved off the lru yet, but I doubt you'll have that as a
> problem in practice)

The dirty ratio with the ZVCs would be

NR_DIRTY + NR_UNSTABLE_NFS
/ 
NR_FREE_PAGES + NR_INACTIVE + NR_ACTIVE + NR_MLOCK


I think we need a PageMlocked after all for the delayed NR_MLOCK 
approach but it needs to have different semantics in order to make
it work. With the patch that I posted earlier we could actually return
a page to the LRU in zap_pte_range while something else is keeping
a page off the LRU (i.e. Page migration is occurring and suddenly the 
page is reclaimed since zap_pte_range put it back). So PageMlocked needs 
to be set in shrink_list() in order to clarify that the page was taken 
off the LRU lists due to it being mlocked and not for other reasons. 
zap_pte_range then needs to check for PageMlocked before putting the 
page onto the LRU.

If we do that then we can observe the state transitions and have an 
accurate count. The delayed accounting problem can probably be 
somewhat remedied by not putting new mlocked pages on the LRU and 
counting them directly. Page migration can simply clear the PageMlocked 
bit and then treat the page as if it was taken off the LRU.
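
A hedged sketch of the zap_pte_range rule described above, using the
PageMlocked flag and the lru_cache_add_mlock() helper from the posted
patch (the wrapper function itself is invented):

static void zap_return_if_mlocked(struct page *page,
				struct vm_area_struct *vma)
{
	/*
	 * PageMlocked is only ever set by shrink_list() (or at fault
	 * time for new mlocked pages), so it can only mean "off the
	 * LRU because of mlock".  A page that is off the LRU for any
	 * other reason, such as migration, is left alone here.
	 */
	if (PageMlocked(page) && (vma->vm_flags & VM_LOCKED))
		lru_cache_add_mlock(page);
}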



Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-04 Thread Arjan van de Ven

> Exclusion or inclusion of NR_MLOCK number is straightforward for the dirty 
> ratio calculations. global_page_state(NR_MLOCK) f.e. would get us totals on 
> mlocked pages per zone. node_page_state(NR_MLOCK) gives a node specific 
> number of mlocked pages. The nice thing about ZVCs is that they allow
> easy access to counts on different levels.

however... mlocked pages still can be dirty, and want to be written back
at some point ;)

I can see the point of doing dirty ratio as percentage of the LRU size,
but in that case you don't need to track NR_MLOCK, only the total LRU
size. (And yes it'll be sometimes optimistic because not all mlock'd
pages are moved off the lru yet, but I doubt you'll have that as a
problem in practice)
-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via 
http://www.linuxfirmwarekit.org



Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Christoph Lameter
On Sat, 3 Feb 2007, Andrew Morton wrote:

> Do we actually need NR_MLOCK?  Page reclaim tends to care more about the
> size of the LRUs and doesn't have much dependency on ->present_pages,

Yes, we'd be fine with general reclaim I think. But the calculation of the 
dirty ratio based on ZVCs would need it if we take the mlocked pages off. 
Otherwise we may have dirty ratios > 100%.

> I guess we could use NR_MLOCK for writeback threshold calculations, to
> force writeback earlier if there's a lot of mlocked memory in the affected
> zones.  But that code isn't zone-aware anyway, and we don't know how to make
> it zone aware in any sane fashion and making it cpuset-aware isn't very
> interesting or useful..

Exclusion or inclusion of NR_MLOCK number is straightforward for the dirty 
ratio calculations. global_page_state(NR_MLOCK) f.e. would get us totals on 
mlocked pages per zone. node_page_state(NR_MLOCK) gives a node specific 
number of mlocked pages. The nice thing about ZVCs is that they allow
easy access to counts on different levels.

> So..  Why do we want NR_MLOCK?

Rik also had some uses in mind for allocation?


Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Andrew Morton
On Sat, 3 Feb 2007 11:03:59 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> 
wrote:

> Here is the second piece removing mlock pages off the LRU during scanning. 
> I tried moving them to a separate list but then we run into issues with
> locking. We do not need ithe list though since we will encounter the
> page again anyways during zap_pte_range.
> 
> However, in zap_pte_range we then run into another problem. Multiple 
> zap_pte_ranges may handle the same page and without a page flag and 
> scanning all the vmas we cannot determine if the page should or should not 
> be moved back to the LRU. As a result this patch may decrement NR_MLOCK 
> too much so that it goes below zero. Any ideas on how to fix this without 
> a page flag and a scan over vmas?
> 
> Plus there is the issue of NR_MLOCK only being updated when we are 
> reclaiming and when we may already be in trouble. An app may mlock huge 
> amounts of memory and NR_MLOCK will stay low. If memory gets too low then
> NR_MLOCK suddenly becomes accurate and the VM is likely undergoing a 
> shock from that discovery (should we actually use NR_MLOCK elsewhere to 
> determine memory management behavior). Hopefully we will not fall over 
> then.

Do we actually need NR_MLOCK?  Page reclaim tends to care more about the
size of the LRUs and doesn't have much dependency on ->present_pages,
iirc.

I guess we could use NR_MLOCK for writeback threshold calculations, to
force writeback earlier if there's a lot of mlocked memory in the affected
zones.  But that code isn't zone-aware anyway, and we don't know how to make
it zone aware in any sane fashion and making it cpuset-aware isn't very
interesting or useful..

So..  Why do we want NR_MLOCK?


Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Nigel Cunningham
Hi.

On Fri, 2007-02-02 at 22:20 -0800, Christoph Lameter wrote:
> 1. We use the 21st page flag and we only have 20 on 32 bit NUMA platforms.

If it will help, I now have an implementation of the dynamically
allocated pageflags code I've posted in the past that is NUMA aware.
It's not memory hotplug aware yet, but that can be fixed. I can see if I
can find time this week to address that and send it to you if it will
help.

Regards,

Nigel



Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Nigel Cunningham
Hi again.

On Sat, 2007-02-03 at 00:53 -0800, Andrew Morton wrote:
> > 1. We use the 21st page flag and we only have 20 on 32 bit NUMA platforms.
> 
> Ow.  How were you thinking of fixing that?

Oh, guess the dyn_pageflags patch is not needed then - the dangers of
replying before reading a whole thread :)

Nigel



Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Christoph Lameter
Here is the second piece removing mlock pages off the LRU during scanning. 
I tried moving them to a separate list but then we run into issues with
locking. We do not need the list though since we will encounter the
page again anyway during zap_pte_range.

However, in zap_pte_range we then run into another problem. Multiple 
zap_pte_ranges may handle the same page and without a page flag and 
scanning all the vmas we cannot determine if the page should or should not 
be moved back to the LRU. As a result this patch may decrement NR_MLOCK 
too much so that it goes below zero. Any ideas on how to fix this without 
a page flag and a scan over vmas?

Plus there is the issue of NR_MLOCK only being updated when we are 
reclaiming and when we may already be in trouble. An app may mlock huge 
amounts of memory and NR_MLOCK will stay low. If memory gets too low then
NR_MLOCK suddenly becomes accurate and the VM is likely undergoing a 
shock from that discovery (should we actually use NR_MLOCK elsewhere to 
determine memory management behavior). Hopefully we will not fall over 
then.

Maybe the best would be to handle the counter separately via a page flag? 
But then we go back to ugly vma scans. Yuck.
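
A hedged sketch of how a page flag would make the decrement exact
(TestClearPageMlocked() is hypothetical, following the usual
test-and-clear page-flag pattern; it is not in the patch below):

static void zap_mlocked_page(struct page *page)
{
	/* only the first zap_pte_range to reach the page wins, so
	 * NR_MLOCK is decremented exactly once per counted page */
	if (TestClearPageMlocked(page)) {
		__dec_zone_page_state(page, NR_MLOCK);
		lru_cache_add_active(page);
	}
}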

Index: current/mm/vmscan.c
===
--- current.orig/mm/vmscan.c2007-02-03 10:53:15.0 -0800
+++ current/mm/vmscan.c 2007-02-03 10:53:25.0 -0800
@@ -516,10 +516,11 @@ static unsigned long shrink_page_list(st
if (page_mapped(page) && mapping) {
switch (try_to_unmap(page, 0)) {
case SWAP_FAIL:
-   case SWAP_MLOCK:
goto activate_locked;
case SWAP_AGAIN:
goto keep_locked;
+   case SWAP_MLOCK:
+   goto mlocked;
case SWAP_SUCCESS:
; /* try to free the page below */
}
@@ -594,6 +595,11 @@ free_it:
__pagevec_release_nonlru(&freed_pvec);
continue;
 
+mlocked:
+   unlock_page(page);
+   __inc_zone_page_state(page, NR_MLOCK);
+   continue;
+
 activate_locked:
SetPageActive(page);
pgactivate++;
Index: current/mm/memory.c
===
--- current.orig/mm/memory.c2007-02-03 10:52:37.0 -0800
+++ current/mm/memory.c 2007-02-03 10:53:25.0 -0800
@@ -682,6 +682,10 @@ static unsigned long zap_pte_range(struc
file_rss--;
}
page_remove_rmap(page, vma);
+   if (vma->vm_flags & VM_LOCKED) {
+   __dec_zone_page_state(page, NR_MLOCK);
+   lru_cache_add_active(page);
+   }
tlb_remove_page(tlb, page);
continue;
}
Index: current/drivers/base/node.c
===
--- current.orig/drivers/base/node.c2007-02-03 10:52:35.0 -0800
+++ current/drivers/base/node.c 2007-02-03 10:53:25.0 -0800
@@ -60,6 +60,7 @@ static ssize_t node_read_meminfo(struct 
   "Node %d FilePages:%8lu kB\n"
   "Node %d Mapped:   %8lu kB\n"
   "Node %d AnonPages:%8lu kB\n"
+  "Node %d Mlock:%8lu KB\n"
   "Node %d PageTables:   %8lu kB\n"
   "Node %d NFS_Unstable: %8lu kB\n"
   "Node %d Bounce:   %8lu kB\n"
@@ -82,6 +83,7 @@ static ssize_t node_read_meminfo(struct 
   nid, K(node_page_state(nid, NR_FILE_PAGES)),
   nid, K(node_page_state(nid, NR_FILE_MAPPED)),
   nid, K(node_page_state(nid, NR_ANON_PAGES)),
+  nid, K(node_page_state(nid, NR_MLOCK)),
   nid, K(node_page_state(nid, NR_PAGETABLE)),
   nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
   nid, K(node_page_state(nid, NR_BOUNCE)),
Index: current/fs/proc/proc_misc.c
===
--- current.orig/fs/proc/proc_misc.c2007-02-03 10:52:36.0 -0800
+++ current/fs/proc/proc_misc.c 2007-02-03 10:53:25.0 -0800
@@ -166,6 +166,7 @@ static int meminfo_read_proc(char *page,
"Writeback:%8lu kB\n"
"AnonPages:%8lu kB\n"
"Mapped:   %8lu kB\n"
+   "Mlock:%8lu KB\n"
"Slab: %8lu kB\n"
"SReclaimable: %8lu kB\n"
"SUnreclaim:   %8lu kB\n"
@@ -196,6 +197,7 @@ static int 

Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Christoph Lameter
On Sat, 3 Feb 2007, Arjan van de Ven wrote:

> Well.. That's the point! Only IF there is a reclaim scan do you move
> them out again. The fact that these pages are on the list isn't a
> problem. The fact that you keep encountering them over and over again
> during *scanning* is. So Andrew's suggestion makes them go away in the
> situations that actually matter.

In order to get this to work try_to_unmap() must be able to distinguish 
between failures due to MLOCK and otherwise. So I guess we need this patch:


[PATCH] Make try_to_unmap() return SWAP_MLOCK for mlocked pages

Modify try_to_unmap() so that we can distinguish between failing to
unmap because a page is mlocked from other causes.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: current/include/linux/rmap.h
===
--- current.orig/include/linux/rmap.h   2007-02-03 10:24:47.0 -0800
+++ current/include/linux/rmap.h2007-02-03 10:25:08.0 -0800
@@ -134,5 +134,6 @@ static inline int page_mkclean(struct pa
 #define SWAP_SUCCESS   0
 #define SWAP_AGAIN 1
 #define SWAP_FAIL  2
+#define SWAP_MLOCK 3
 
 #endif /* _LINUX_RMAP_H */
Index: current/mm/rmap.c
===
--- current.orig/mm/rmap.c  2007-02-03 10:24:47.0 -0800
+++ current/mm/rmap.c   2007-02-03 10:25:08.0 -0800
@@ -631,10 +631,16 @@ static int try_to_unmap_one(struct page 
 * If it's recently referenced (perhaps page_referenced
 * skipped over this mm) then we should reactivate it.
 */
-   if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-   (ptep_clear_flush_young(vma, address, pte)))) {
-   ret = SWAP_FAIL;
-   goto out_unmap;
+   if (!migration) {
+   if (vma->vm_flags & VM_LOCKED) {
+   ret = SWAP_MLOCK;
+   goto out_unmap;
+   }
+
+   if (ptep_clear_flush_young(vma, address, pte)) {
+   ret = SWAP_FAIL;
+   goto out_unmap;
+   }
}
 
/* Nuke the page table entry. */
@@ -799,7 +805,8 @@ static int try_to_unmap_anon(struct page
 
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
ret = try_to_unmap_one(page, vma, migration);
-   if (ret == SWAP_FAIL || !page_mapped(page))
+   if (ret == SWAP_FAIL || ret == SWAP_MLOCK ||
+   !page_mapped(page))
break;
}
spin_unlock(&anon_vma->lock);
@@ -830,7 +837,8 @@ static int try_to_unmap_file(struct page
spin_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
ret = try_to_unmap_one(page, vma, migration);
-   if (ret == SWAP_FAIL || !page_mapped(page))
+   if (ret == SWAP_FAIL || ret == SWAP_MLOCK ||
+   !page_mapped(page))
goto out;
}
 
@@ -913,6 +921,7 @@ out:
  * SWAP_SUCCESS- we succeeded in removing all mappings
  * SWAP_AGAIN  - we missed a mapping, try again later
  * SWAP_FAIL   - the page is unswappable
+ * SWAP_MLOCK  - the page is under mlock()
  */
 int try_to_unmap(struct page *page, int migration)
 {
Index: current/mm/vmscan.c
===
--- current.orig/mm/vmscan.c2007-02-03 10:25:00.0 -0800
+++ current/mm/vmscan.c 2007-02-03 10:25:12.0 -0800
@@ -516,6 +516,7 @@ static unsigned long shrink_page_list(st
if (page_mapped(page) && mapping) {
switch (try_to_unmap(page, 0)) {
case SWAP_FAIL:
+   case SWAP_MLOCK:
goto activate_locked;
case SWAP_AGAIN:
goto keep_locked;


Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Christoph Lameter
On Sat, 3 Feb 2007, Arjan van de Ven wrote:

> it's simpler. You only move them off when you encounter them during a
> scan. No walking early etc etc. Only do work when there is an actual
> situation where you do scan.

Yes but then you do not have an accurate count of MLOCKed pages. We will 
only have that count after reclaim has run and removed the pages from the 
LRU. That could take some time (or may never happen the way some people 
handle memory around here).

> Well.. That's the point! Only IF there is a reclaim scan do you move
> them out again. The fact that these pages are on the list isn't a
> problem. The fact that you keep encountering them over and over again
> during *scanning* is. So Andrew's suggestion makes them go away in the
> situations that actually matter.

I see... Okay that could be simple to address.

> who cares though.. just do it lazy.

The other issue that the patch should address is to allow the VM to have 
statistics that show the actual amount of MLOCKed pages. With the pure 
scanner based approach this would no longer work.



Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Arjan van de Ven
On Sat, 2007-02-03 at 09:56 -0800, Christoph Lameter wrote:
> On Sat, 3 Feb 2007, Andrew Morton wrote:
> 
> > I wonder if it can be simpler.  Make two changes:
> 
> Would be great if this could get simpler.
> 
> > a) If the scanner encounters an mlocked page on the LRU, take it off.
> 
> The current patch takes them off when mlock is set (which may not work 
> since the page may be off the LRU) and then has the scanner taking them 
> off. We could just remove the early one but what would this bring us?

it's simpler. You only move them off when you encounter them during a
scan. No walking early etc etc. Only do work when there is an actual
situation where you do scan.

> 
> > b) munlock() adds all affected pages to the LRU.
> 
> Hmmm... You mean without checking all the vmas of a page for VM_LOCKED? So 
> they 
> are going to be removed again on the next pass? Ok. I see that makes it 
> simpler but it requires another reclaim scan.

Well.. That's the point! Only IF there is a reclaim scan do you move
them out again. The fact that these pages are on the list isn't a
problem. The fact that you keep encountering them over and over again
during *scanning* is. So Andrew's suggestion makes them go away in the
situations that actually matter.

> The page flag allows a clean state transition of a page and accurate 
> keeping of statistics for MLOCKed pages. There were objections against the 
> fuzzy counting in the earlier incarnation and it was proposed that a page 
> flag be introduced. Without the flag we cannot know that the page is 
> already mapped by a VM_LOCKED vma without scanning over all vmas 
> referencing the page.

who cares though.. just do it lazy.
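
For illustration, a toy userspace model of the lazy scheme argued for here (all names invented, not kernel code): the scanner evicts mlocked pages only when it trips over them, and munlock() always puts the page back, letting the next scan re-evict it if another locker remains:

#include <stdbool.h>
#include <stdio.h>

/* Toy page: 'mlocked' stands in for "some VM_LOCKED vma maps it". */
struct page { int id; bool mlocked; bool on_lru; };

/* Scanner side: only when a scan trips over an mlocked page is it
 * taken off the LRU; no eager walk at mlock() time. */
static void scan_lru(struct page *pages, int n)
{
	for (int i = 0; i < n; i++)
		if (pages[i].on_lru && pages[i].mlocked) {
			pages[i].on_lru = false;
			printf("scan: page %d moved off LRU\n", pages[i].id);
		}
}

/* munlock side: always put the page back; if another locker remains,
 * the next reclaim scan will simply move it off again. */
static void munlock_page(struct page *p, bool still_locked_elsewhere)
{
	p->mlocked = still_locked_elsewhere;
	if (!p->on_lru) {
		p->on_lru = true;
		printf("munlock: page %d back on LRU\n", p->id);
	}
}

int main(void)
{
	struct page p = { .id = 1, .mlocked = true, .on_lru = true };

	scan_lru(&p, 1);		/* evicted lazily */
	munlock_page(&p, true);		/* a second locker still holds it */
	scan_lru(&p, 1);		/* re-evicted on the next pass */
	munlock_page(&p, false);	/* last locker gone */
	scan_lru(&p, 1);		/* nothing left to do */
	return 0;
}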




Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Christoph Lameter
On Sat, 3 Feb 2007, Martin J. Bligh wrote:

> Doesn't matter - you can just do it lazily. If you find a page that is
> locked, move it to the locked list. When unlocking a page you *always*
> move it back to the normal list. If someone else is still locking it,
> we'll move it back to the lock list on next reclaim pass.

Sounds similar to what Andrew is proposing.
 
> I have a half-finished patch from 6 months ago that does this, but never
> found time to complete it ;-(

Could I see that patch? Could have some good approaches in there that 
would be useful.



Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Christoph Lameter
On Sat, 3 Feb 2007, Andrew Morton wrote:

> I wonder if it can be simpler.  Make two changes:

Would be great if this could get simpler.

> a) If the scanner encounters an mlocked page on the LRU, take it off.

The current patch takes them off when mlock is set (which may not work 
since the page may be off the LRU) and then has the scanner taking them 
off. We could just remove the early one but what would this bring us?

> b) munlock() adds all affected pages to the LRU.

Hmmm... You mean without checking all the vmas of a page for VM_LOCKED? So they 
are going to be removed again on the next pass? Ok. I see that makes it 
simpler but it requires another reclaim scan.

> doesn't consume a page flag.  Optional (and arguable) extension: scan the
> vmas during munmap, don't add page to LRU if it's still mlocked.
> 
> Why _does_ your patch add a new page flag?  That info is available via a
> vma scan.

The page flag allows a clean state transition of a page and accurate 
keeping of statistics for MLOCKed pages. There were objections against the 
fuzzy counting in the earlier incarnation and it was proposed that a page 
flag be introduced. Without the flag we cannot know that the page is 
already mapped by a VM_LOCKED vma without scanning over all vmas 
referencing the page.
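
For illustration, the walk that the flag avoids looks roughly like this (a userspace mock with invented names; the real kernel walk goes through the anon_vma list or the i_mmap tree):

#include <stdbool.h>
#include <stdio.h>

#define VM_LOCKED 0x00002000UL	/* same bit value as the kernel flag */

/* Toy reverse map: each page knows the list of vmas mapping it. */
struct vma { unsigned long vm_flags; struct vma *next; };
struct page { struct vma *mappers; };

/* Without a PageMlocked bit, this walk over every mapping vma is the
 * only way to learn whether some VM_LOCKED vma still pins the page. */
static bool mapped_by_locked_vma(const struct page *page)
{
	for (const struct vma *v = page->mappers; v; v = v->next)
		if (v->vm_flags & VM_LOCKED)
			return true;
	return false;
}

int main(void)
{
	struct vma plain = { .vm_flags = 0, .next = NULL };
	struct vma locked = { .vm_flags = VM_LOCKED, .next = &plain };
	struct page page = { .mappers = &locked };

	printf("pinned: %d\n", mapped_by_locked_vma(&page));	/* 1 */
	locked.vm_flags = 0;					/* munlock */
	printf("pinned: %d\n", mapped_by_locked_vma(&page));	/* 0 */
	return 0;
}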

> > 1. We use the 21st page flag and we only have 20 on 32 bit NUMA platforms.
> 
> Ow.  How were you thinking of fixing that?

I thought someone else could come up with something. Maybe the one that 
told me to use another page flag?

> > 2. Since mlocked pages are now off the LRU, page migration will no longer
> >move them.
> 
> Ow.  That could be a right pain when we get around to using migration for
> memory-unplug?

We need to expand page migration anyway to allow the general migration 
of non-LRU pages.


Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Christoph Lameter
On Sat, 3 Feb 2007, Christoph Hellwig wrote:

> This patch seems to not handle the cases where more than one process mlocks
> a page and you really need a pincount in the page to not release it before
> all processes have munlocked it or died.  I did a similar patch a while
> ago and tried to handle it by overloading the lru list pointers with
> a pincount, but at some point I gave up because I couldn't get that part
> right.

It does handle that case. Before PageMlocked is cleared for a page we 
check for vmas referencing the page that have VM_LOCKED set. That logic 
makes the patch so big.



Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Martin J. Bligh

Christoph Hellwig wrote:

> On Fri, Feb 02, 2007 at 10:20:12PM -0800, Christoph Lameter wrote:
> > This is a new variation on the earlier RFC for tracking mlocked pages.
> > We now mark a mlocked page with a bit in the page flags and remove
> > them from the LRU. Pages get moved back when no vma that references
> > the page has VM_LOCKED set anymore.
> > 
> > This means that vmscan no longer uselessly cycles over large amounts
> > of mlocked memory should someone attempt to mlock large amounts of
> > memory (may even result in a livelock on large systems).
> > 
> > Synchronization is built around state changes of the PageMlocked bit.
> > The NR_MLOCK counter is incremented and decremented based on
> > state transitions of PageMlocked. So the count is accurate.
> > 
> > There is still some unfinished business:
> > 
> > 1. We use the 21st page flag and we only have 20 on 32 bit NUMA platforms.
> > 
> > 2. Since mlocked pages are now off the LRU, page migration will no longer
> >    move them.
> > 
> > 3. Use NR_MLOCK to tune various VM behaviors so that the VM no longer
> >    falls over due to too many mlocked pages in certain areas.
> 
> This patch seems to not handle the cases where more than one process mlocks
> a page and you really need a pincount in the page to not release it before
> all processes have munlocked it or died.  I did a similar patch a while
> ago and tried to handle it by overloading the lru list pointers with
> a pincount, but at some point I gave up because I couldn't get that part
> right.


Doesn't matter - you can just do it lazily. If you find a page that is
locked, move it to the locked list. When unlocking a page you *always*
move it back to the normal list. If someone else is still locking it,
we'll move it back to the lock list on next reclaim pass.

I have a half-finished patch from 6 months ago that does this, but never
found time to complete it ;-(

M.


Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Christoph Hellwig
On Fri, Feb 02, 2007 at 10:20:12PM -0800, Christoph Lameter wrote:
> This is a new variation on the earlier RFC for tracking mlocked pages.
> We now mark a mlocked page with a bit in the page flags and remove
> them from the LRU. Pages get moved back when no vma that references
> the page has VM_LOCKED set anymore.
> 
> This means that vmscan no longer uselessly cycles over large amounts
> of mlocked memory should someone attempt to mlock large amounts of
> memory (may even result in a livelock on large systems).
> 
> Synchronization is built around state changes of the PageMlocked bit.
> The NR_MLOCK counter is incremented and decremented based on
> state transitions of PageMlocked. So the count is accurate.
> 
> There is still some unfinished business:
> 
> 1. We use the 21st page flag and we only have 20 on 32 bit NUMA platforms.
> 
> 2. Since mlocked pages are now off the LRU, page migration will no longer
>    move them.
> 
> 3. Use NR_MLOCK to tune various VM behaviors so that the VM no longer
>    falls over due to too many mlocked pages in certain areas.

This patch seems to not handle the cases where more than one process mlocks
a page and you really need a pincount in the page to not release it before
all processes have munlocked it or died.  I did a similar patch a while
ago and tried to handle it by overloading the lru list pointers with
a pincount, but at some point I gave up because I couldn't get that part
right.



Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Andrew Morton
On Fri, 2 Feb 2007 22:20:12 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> 
wrote:

> This is a new variation on the earlier RFC for tracking mlocked pages.
> We now mark a mlocked page with a bit in the page flags and remove
> them from the LRU. Pages get moved back when no vma that references
> the page has VM_LOCKED set anymore.
> 
> This means that vmscan no longer uselessly cycles over large amounts
> of mlocked memory should someone attempt to mlock large amounts of
> memory (may even result in a livelock on large systems).
> 
> Synchronization is built around state changes of the PageMlocked bit.
> The NR_MLOCK counter is incremented and decremented based on
> state transitions of PageMlocked. So the count is accurate.

I wonder if it can be simpler.  Make two changes:

a) If the scanner encounters an mlocked page on the LRU, take it off.

b) munlock() adds all affected pages to the LRU.

And that's it.  Simpler, solves the uselessly-scan-lots-of-mlocked-pages
problem (which is the sole requirement according to your description) and
doesn't consume a page flag.  Optional (and arguable) extension: scan the
vmas during munmap, don't add page to LRU if it's still mlocked.

Why _does_ your patch add a new page flag?  That info is available via a
vma scan.

> There is still some unfinished business:
> 
> 1. We use the 21st page flag and we only have 20 on 32 bit NUMA platforms.

Ow.  How were you thinking of fixing that?

> 2. Since mlocked pages are now off the LRU page migration will no longer
>move them.

Ow.  That could be a right pain when we get around to using migration for
memory-unplug?



Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Christoph Lameter
Here is the second piece removing mlock pages off the LRU during scanning. 
I tried moving them to a separate list but then we run into issues with
locking. We do not need the list though since we will encounter the
page again anyway during zap_pte_range.

However, in zap_pte_range we then run into another problem. Multiple 
zap_pte_ranges may handle the same page and without a page flag and 
scanning all the vmas we cannot determine if the page should or should not 
be moved back to the LRU. As a result this patch may decrement NR_MLOCK 
too much so that it goes below zero. Any ideas on how to fix this without 
a page flag and a scan over vmas?

Plus there is the issue of NR_MLOCK only being updated when we are 
reclaiming and when we may already be in trouble. An app may mlock huge 
amounts of memory and NR_MLOCK will stay low. If memory gets too low then
NR_MLOCK suddenly becomes accurate and the VM is likely undergoing a 
shock from that discovery (should we actually use NR_MLOCK elsewhere to 
determine memory management behavior). Hopefully we will not fall over 
then.

Maybe the best would be to handle the counter separately via a page flag? 
But then we go back to ugly vma scans. Yuck.
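
A toy model of the underflow described above (plain C, invented names): the scanner counts the page once when it finds it, but each VM_LOCKED vma that zaps the page decrements once:

#include <stdio.h>

static long nr_mlock;	/* models the NR_MLOCK zone counter */

/* shrink_page_list() finds the mlocked page once and counts it once. */
static void scanner_sees_mlocked_page(void)
{
	nr_mlock++;
}

/* zap_pte_range() runs per-vma; with two VM_LOCKED vmas mapping the
 * same page it decrements twice against a single increment. */
static void zap_pte_range_locked_vma(void)
{
	nr_mlock--;
}

int main(void)
{
	scanner_sees_mlocked_page();	/* NR_MLOCK = 1 */
	zap_pte_range_locked_vma();	/* first vma unmapped: 0 */
	zap_pte_range_locked_vma();	/* second vma: -1, underflow */
	printf("NR_MLOCK = %ld\n", nr_mlock);
	return 0;
}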

Index: current/mm/vmscan.c
===
--- current.orig/mm/vmscan.c2007-02-03 10:53:15.0 -0800
+++ current/mm/vmscan.c 2007-02-03 10:53:25.0 -0800
@@ -516,10 +516,11 @@ static unsigned long shrink_page_list(st
if (page_mapped(page) && mapping) {
switch (try_to_unmap(page, 0)) {
case SWAP_FAIL:
-   case SWAP_MLOCK:
goto activate_locked;
case SWAP_AGAIN:
goto keep_locked;
+   case SWAP_MLOCK:
+   goto mlocked;
case SWAP_SUCCESS:
; /* try to free the page below */
}
@@ -594,6 +595,11 @@ free_it:
__pagevec_release_nonlru(freed_pvec);
continue;
 
+mlocked:
+   unlock_page(page);
+   __inc_zone_page_state(page, NR_MLOCK);
+   continue;
+
 activate_locked:
SetPageActive(page);
pgactivate++;
Index: current/mm/memory.c
===
--- current.orig/mm/memory.c2007-02-03 10:52:37.0 -0800
+++ current/mm/memory.c 2007-02-03 10:53:25.0 -0800
@@ -682,6 +682,10 @@ static unsigned long zap_pte_range(struc
file_rss--;
}
page_remove_rmap(page, vma);
+   if (vma->vm_flags & VM_LOCKED) {
+   __dec_zone_page_state(page, NR_MLOCK);
+   lru_cache_add_active(page);
+   }
tlb_remove_page(tlb, page);
continue;
}
Index: current/drivers/base/node.c
===
--- current.orig/drivers/base/node.c2007-02-03 10:52:35.0 -0800
+++ current/drivers/base/node.c 2007-02-03 10:53:25.0 -0800
@@ -60,6 +60,7 @@ static ssize_t node_read_meminfo(struct 
       "Node %d FilePages:    %8lu kB\n"
       "Node %d Mapped:       %8lu kB\n"
       "Node %d AnonPages:    %8lu kB\n"
+      "Node %d Mlock:        %8lu kB\n"
       "Node %d PageTables:   %8lu kB\n"
       "Node %d NFS_Unstable: %8lu kB\n"
       "Node %d Bounce:       %8lu kB\n"
@@ -82,6 +83,7 @@ static ssize_t node_read_meminfo(struct 
   nid, K(node_page_state(nid, NR_FILE_PAGES)),
   nid, K(node_page_state(nid, NR_FILE_MAPPED)),
   nid, K(node_page_state(nid, NR_ANON_PAGES)),
+  nid, K(node_page_state(nid, NR_MLOCK)),
   nid, K(node_page_state(nid, NR_PAGETABLE)),
   nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
   nid, K(node_page_state(nid, NR_BOUNCE)),
Index: current/fs/proc/proc_misc.c
===
--- current.orig/fs/proc/proc_misc.c2007-02-03 10:52:36.0 -0800
+++ current/fs/proc/proc_misc.c 2007-02-03 10:53:25.0 -0800
@@ -166,6 +166,7 @@ static int meminfo_read_proc(char *page,
	"Writeback:    %8lu kB\n"
	"AnonPages:    %8lu kB\n"
	"Mapped:       %8lu kB\n"
+	"Mlock:        %8lu kB\n"
	"Slab:         %8lu kB\n"
	"SReclaimable: %8lu kB\n"
	"SUnreclaim:   %8lu kB\n"
@@ -196,6 +197,7 @@ static int meminfo_read_proc(char *page,
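
For completeness, once a hunk like the above exports the counter, userspace can read it back like any other meminfo field; a minimal sketch, assuming the Mlock: label added by this patch:

#include <stdio.h>

/* Scan /proc/meminfo for the "Mlock:" line added by the patch above
 * and report its value in kB; prints -1 if the field is absent. */
int main(void)
{
	char line[128];
	long kb = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "Mlock: %ld", &kb) == 1)
			break;
	fclose(f);
	printf("Mlock: %ld kB\n", kb);
	return 0;
}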
 

Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Nigel Cunningham
Hi.

On Fri, 2007-02-02 at 22:20 -0800, Christoph Lameter wrote:
> 1. We use the 21st page flag and we only have 20 on 32 bit NUMA platforms.

If it will help, I now have an implementation of the dynamically
allocated pageflags code I've posted in the past that is NUMA aware.
It's not memory hotplug aware yet, but that can be fixed. I can see if I
can find time this week to address that and send it to you if it will
help.

Regards,

Nigel



Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Nigel Cunningham
Hi again.

On Sat, 2007-02-03 at 00:53 -0800, Andrew Morton wrote:
> > 1. We use the 21st page flag and we only have 20 on 32 bit NUMA platforms.
> 
> Ow.  How were you thinking of fixing that?

Oh, guess the dyn_pageflags patch is not needed then - the dangers of
replying before reading a whole thread :)

Nigel



Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Andrew Morton
On Sat, 3 Feb 2007 11:03:59 -0800 (PST) Christoph Lameter [EMAIL PROTECTED] 
wrote:

> Here is the second piece removing mlock pages off the LRU during scanning. 
> I tried moving them to a separate list but then we run into issues with
> locking. We do not need the list though since we will encounter the
> page again anyway during zap_pte_range.
> 
> However, in zap_pte_range we then run into another problem. Multiple 
> zap_pte_ranges may handle the same page and without a page flag and 
> scanning all the vmas we cannot determine if the page should or should not 
> be moved back to the LRU. As a result this patch may decrement NR_MLOCK 
> too much so that it goes below zero. Any ideas on how to fix this without 
> a page flag and a scan over vmas?
> 
> Plus there is the issue of NR_MLOCK only being updated when we are 
> reclaiming and when we may already be in trouble. An app may mlock huge 
> amounts of memory and NR_MLOCK will stay low. If memory gets too low then
> NR_MLOCK suddenly becomes accurate and the VM is likely undergoing a 
> shock from that discovery (should we actually use NR_MLOCK elsewhere to 
> determine memory management behavior). Hopefully we will not fall over 
> then.

Do we actually need NR_MLOCK?  Page reclaim tends to care more about the
size of the LRUs and doesn't have much dependency on ->present_pages,
iirc.

I guess we could use NR_MLOCK for writeback threshold calculations, to
force writeback earlier if there's a lot of mlocked memory in the affected
zones.  But that code isn't zone-aware anyway, and we don't know how to make
it zone aware in any sane fashion and making it cpuset-aware isn't very
interesting or useful..

So..  Why do we want NR_MLOCK?


Re: [RFC] Tracking mlocked pages and moving them off the LRU

2007-02-03 Thread Christoph Lameter
On Sat, 3 Feb 2007, Andrew Morton wrote:

> Do we actually need NR_MLOCK?  Page reclaim tends to care more about the
> size of the LRUs and doesn't have much dependency on ->present_pages,

Yes, we'd be fine with general reclaim I think. But the calculation of the 
dirty ratio based on ZVCs would need it if we take the mlocked pages off. 
Otherwise we may have dirty ratios > 100%.

> I guess we could use NR_MLOCK for writeback threshold calculations, to
> force writeback earlier if there's a lot of mlocked memory in the affected
> zones.  But that code isn't zone-aware anyway, and we don't know how to make
> it zone aware in any sane fashion and making it cpuset-aware isn't very
> interesting or useful..

Exclusion or inclusion of NR_MLOCK number is straightforward for the dirty 
ratio calculations. global_page_state(NR_MLOCK) f.e. would get us totals on 
mlocked pages per zone. node_page_state(NR_MLOCK) gives a node specific 
number of mlocked pages. The nice thing about ZVCs is that it allows
easy access to counts on different levels.

> So..  Why do we want NR_MLOCK?

Rik also had some uses in mind for allocation?