Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

Paul Mackerras wrote:
> Rik van Riel writes:
>
> > I guess we'll need to call tlb_remove_tlb_entry() inside the
> > MADV_FREE code to keep powerpc happy.
>
> I don't see why; once ptep_test_and_clear_young has returned, the
> entry in the hash table has already been removed.

OK, so this one won't be necessary. Good to know that.

Andrew, it looks like things won't be that bad :)

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Paul Mackerras
Rik van Riel writes:

> I guess we'll need to call tlb_remove_tlb_entry() inside the
> MADV_FREE code to keep powerpc happy.

I don't see why; once ptep_test_and_clear_young has returned, the
entry in the hash table has already been removed.  Adding the
tlb_remove_tlb_entry call certainly won't do anything on 64-bit
powerpc, since it expands to do {} while (0) there, and in fact it
won't do anything on 32-bit powerpc either.
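
For readers unfamiliar with the idiom: a hook that expands to do {} while (0) generates no code, yet still parses as a single statement, so callers need no #ifdefs. The following is only a user-space sketch of that idea; arch_tlb_hook, ARCH_HAS_TLB_HOOK and clear_entry are hypothetical names made up for illustration and are not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical per-architecture hook; not the kernel's real definition. */
#ifdef ARCH_HAS_TLB_HOOK
static unsigned long hook_calls;
#define arch_tlb_hook(addr) do { (void)(addr); hook_calls++; } while (0)
#else
/* No-op variant: the whole statement compiles away to nothing. */
#define arch_tlb_hook(addr) do { } while (0)
#endif

/* Clears one slot and invokes the hook; returns 1 if the slot was set. */
static int clear_entry(unsigned long *table, size_t idx)
{
	int was_set = table[idx] != 0;

	if (was_set)
		arch_tlb_hook(idx);	/* one statement, so the else still binds */
	else
		return 0;

	table[idx] = 0;
	return 1;
}
```

The do/while wrapper is what keeps the if/else above valid in both configurations; an empty macro or a bare brace block would leave a stray semicolon before the else.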

Paul.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Andrew Morton
On Mon, 23 Apr 2007 22:53:49 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:

> I don't see why we need the attached, but in case you find
> a good reason, here's my signed-off-by line for Andrew :)

Andrew is in a defensive crouch trying to work his way through all the bugs
he's been sent.  After I've managed to release 2.6.21-rc7-mm1 (say, December)
I expect I'll drop the MADV_FREE stuff, give you a run at creating a new
patch series.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

Nick Piggin wrote:

> What the tlb flush used to be able to assume is that the page
> has been removed from the pagetables when they are put in the
> tlb flush batch.

I think this is still the case, to a degree.  There should be
no harm in removing the TLB entries after the page table has
been unlocked, right?

Or is something like the attached really needed?

From what I can see, the page table lock should be enough
synchronization between unmap_mapping_range, MADV_FREE and
MADV_DONTNEED.

I don't see why we need the attached, but in case you find
a good reason, here's my signed-off-by line for Andrew :)

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
--- linux-2.6.20.x86_64/mm/memory.c.flushme	2007-04-23 22:26:06.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 22:42:06.0 -0400
@@ -628,6 +628,7 @@ static unsigned long zap_pte_range(struc
 long *zap_work, struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
+	unsigned long start_addr = addr;
 	pte_t *pte;
 	spinlock_t *ptl;
 	int file_rss = 0;
@@ -726,6 +727,11 @@ static unsigned long zap_pte_range(struc
 
 	add_mm_rss(mm, file_rss, anon_rss);
 	arch_leave_lazy_mmu_mode();
+	if (details && details->madv_free) {
+		/* Protect against MADV_DONTNEED or unmap_mapping_range */
+		tlb_finish_mmu(tlb, start_addr, addr);
+		tlb = tlb_gather_mmu(mm, 0);
+	}
 	pte_unmap_unlock(pte - 1, ptl);
 
 	return addr;


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Nick Piggin

Rik van Riel wrote:

This should fix the MADV_FREE code for PPC's hashed tlb.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
---

Nick Piggin wrote:


Nick Piggin wrote:


3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them




Yes, but I'm wondering if it is legal in all architectures.




It's similar to trying to access memory during an munmap.

You may be able to for a short time, but it'll come back to
haunt you.



The question is whether the architecture specific tlb
flushing code will break or not.



I guess we'll need to call tlb_remove_tlb_entry() inside the
MADV_FREE code to keep powerpc happy.

Thanks for pointing this one out.


Even then we do.  Each invocation of zap_pte_range() only touches
one page table page, and it flushes the TLB before releasing the
page table lock.



What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.



Oh dear.  I see it now...

The tlb end things inside zap_pte_range() are actually
noops and the actual tlb flush only happens inside
zap_page_range().

I guess the fact that munmap gets the mmap_sem for
writing should save us, though...


What about an unmap_mapping_range, or another MADV_FREE or
MADV_DONTNEED?






--- linux-2.6.20.x86_64/mm/memory.c.noppc	2007-04-23 21:50:09.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 21:48:59.0 -0400
@@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc
 	}
 	ptep_test_and_clear_dirty(vma, addr, pte);
 	ptep_test_and_clear_young(vma, addr, pte);
+	tlb_remove_tlb_entry(tlb, pte, addr);
 	SetPageLazyFree(page);
 	if (PageActive(page))
 		deactivate_tail_page(page);



--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

This should fix the MADV_FREE code for PPC's hashed tlb.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
---

Nick Piggin wrote:

Nick Piggin wrote:


3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them



Yes, but I'm wondering if it is legal in all architectures.



It's similar to trying to access memory during an munmap.

You may be able to for a short time, but it'll come back to
haunt you.


The question is whether the architecture specific tlb
flushing code will break or not.


I guess we'll need to call tlb_remove_tlb_entry() inside the
MADV_FREE code to keep powerpc happy.

Thanks for pointing this one out.


Even then we do.  Each invocation of zap_pte_range() only touches
one page table page, and it flushes the TLB before releasing the
page table lock.


What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.


Oh dear.  I see it now...

The tlb end things inside zap_pte_range() are actually
noops and the actual tlb flush only happens inside
zap_page_range().

I guess the fact that munmap gets the mmap_sem for
writing should save us, though...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
--- linux-2.6.20.x86_64/mm/memory.c.noppc	2007-04-23 21:50:09.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 21:48:59.0 -0400
@@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc
 	}
 	ptep_test_and_clear_dirty(vma, addr, pte);
 	ptep_test_and_clear_young(vma, addr, pte);
+	tlb_remove_tlb_entry(tlb, pte, addr);
 	SetPageLazyFree(page);
 	if (PageActive(page))
 		deactivate_tail_page(page);


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Nick Piggin

Rik van Riel wrote:

Use TLB batching for MADV_FREE.  Adds another 10-15% extra performance
to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
---

Nick Piggin wrote:


3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them



Yes, but I'm wondering if it is legal in all architectures.



It's similar to trying to access memory during an munmap.

You may be able to for a short time, but it'll come back to
haunt you.


The question is whether the architecture specific tlb
flushing code will break or not.



4) because we flush the tlb before releasing the page
   table lock, other CPUs cannot remove this page from
   the address space - they will block on the page
   table lock before looking at this pte



We don't when the ptl is split.



Even then we do.  Each invocation of zap_pte_range() only touches
one page table page, and it flushes the TLB before releasing the
page table lock.


What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.

--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

Rik van Riel wrote:


First some ebizzy runs...


This is interesting.  Ginormous speedups in ebizzy[1] on my quad core
test system.  The following numbers are the average of 10 runs, since
ebizzy shows some variability.

You can see a big influence from the tlb batching and from Nick's
madv_sem patch.  The reduction in system time from 100 seconds to
3 seconds is way more than I had expected, but I'm not complaining.
The 4 fold reduction in wall clock time is a nice bonus.

According to Val, ebizzy shows the weaknesses of Linux with a real
workload, so this could be a useful result.

kernel            user  system  wall clock  %CPU

vanilla           186s  101s    123s        230%
madv_free (madv)  175s  96s     120s        230%
mmap_sem (sem)    100s  40s     40s         370%
madv+sem          200s  140s    100s        393%
madv+sem+tlb      118s  3s      30s         395%
madv+tlb          150s  10s     50s         310%

[1] http://www.ussg.iu.edu/hypermail/linux/kernel/0604.2/1699.html
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

Use TLB batching for MADV_FREE.  Adds another 10-15% extra performance
to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
---

Nick Piggin wrote:


3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them


Yes, but I'm wondering if it is legal in all architectures.


It's similar to trying to access memory during an munmap.

You may be able to for a short time, but it'll come back to
haunt you.


4) because we flush the tlb before releasing the page
   table lock, other CPUs cannot remove this page from
   the address space - they will block on the page
   table lock before looking at this pte


We don't when the ptl is split.


Even then we do.  Each invocation of zap_pte_range() only touches
one page table page, and it flushes the TLB before releasing the
page table lock.


What the tlb flush used to be able to assume is that the page
has been removed from the pagetables when they are put in the
tlb flush batch.


All the tlb flush code seems to assume is that the tlb entries
should be invalidated.


I'm not saying there are any bugs, but just suggesting there
might be.


Jakub found a potential bug, in that I did not use an atomic
operation to clear the page table entries.  I've attached a
new patch which simply uses ptep_test_and_clear_dirty/young
to get rid of the dirty and accessed bits.

It uses the same atomic accesses we use elsewhere in the VM
and the code is a line shorter than before.

Andrew, please use this one.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
--- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.0 -0400
@@ -677,11 +677,14 @@ static unsigned long zap_pte_range(struc
 		remove_exclusive_swap_page(page);
 		unlock_page(page);
 	}
-	ptep_clear_flush_dirty(vma, addr, pte);
-	ptep_clear_flush_young(vma, addr, pte);
+	ptep_test_and_clear_dirty(vma, addr, pte);
+	ptep_test_and_clear_young(vma, addr, pte);
 	SetPageLazyFree(page);
 	if (PageActive(page))
 		deactivate_tail_page(page);
+	/* tlb_remove_page frees it again */
+	get_page(page);
+	tlb_remove_page(tlb, page);
 	continue;
 }
 			}


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Jakub Jelinek
On Mon, Apr 23, 2007 at 08:21:37PM +1000, Nick Piggin wrote:
> I guess it is a good idea to batch these things. But can you
> do that on all architectures? What happens if your tlb flush
> happens after another thread already accesses it again, or
> after it subsequently gets removed from the address space via
> another CPU?

Accessing the page from another thread before madvise (MADV_FREE)
returns is undefined behavior; it can act as if that access happened
right before the madvise (MADV_FREE) call or right after it.
That's OK for glibc and presumably any other malloc implementation:
glibc calls madvise (MADV_FREE) while holding the containing arena's lock,
and for any malloc implementation, madvise (MADV_FREE) would be
part of the free operation, and you definitely need some synchronization
between one thread freeing some memory and another thread deciding
to reuse that memory and return it from malloc/realloc/calloc/etc.
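
As an illustration of the user-space side, here is a minimal sketch of a free path that tries MADV_FREE and falls back to MADV_DONTNEED when the first call fails. The helper name release_range is hypothetical, and the sketch assumes the caller already holds whatever lock (for example an arena lock) protects the range; it is not the glibc patch discussed in this thread.

```c
#define _DEFAULT_SOURCE		/* for madvise() and MAP_ANONYMOUS */
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: tell the kernel that [addr, addr + len) is no
 * longer needed.  Tries the lazy MADV_FREE first; if the headers or the
 * running kernel do not support it, falls back to MADV_DONTNEED.
 * Returns 0 on success, -1 on error.  The caller is assumed to hold the
 * lock that keeps other threads from reusing the range meanwhile.
 */
static int release_range(void *addr, size_t len)
{
#ifdef MADV_FREE
	if (madvise(addr, len, MADV_FREE) == 0)
		return 0;
	/* MADV_FREE failed (e.g. unsupported kernel): fall through. */
#endif
	return madvise(addr, len, MADV_DONTNEED);
}
```

A real allocator would probably remember that MADV_FREE failed once and skip it afterwards; the sketch retries every time to stay short.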

My only concern is whether using a non-atomic update of the pte is
OK or not.
The ptep_test_and_clear_young/ptep_test_and_clear_dirty calls that Rik's
patch used before are done with atomic instructions, at least on x86_64.
The operation we want for MADV_FREE is: clear the young/dirty bits if they
were set on entry to the MADV_FREE madvise call, with undefined values
for these 2 bits if some other task modifies the young/dirty bits
concurrently with this MADV_FREE zap_page_range; but I'd say the other
bits need to stay unmodified.
Now, is there some kernel code which, while either not holding the
corresponding mmap_sem at all or holding it just with down_read, modifies
other bits in the pte?  If yes, we need to do this clearing atomically:
basically do a cmpxchg loop until we succeed in clearing the 2 bits, and
then flush the tlb if either of them was set before
(ptep_test_and_clear_dirty_and_young?);
if not, set_pte_at is ok and faster than a lock prefixed insn.
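
A cmpxchg loop of the kind suggested above could look roughly like this in portable C11. This is a user-space sketch only: the bit positions, the 64-bit pte type and the function are assumptions made for illustration, not an existing kernel helper.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define PTE_YOUNG ((uint64_t)1 << 5)
#define PTE_DIRTY ((uint64_t)1 << 6)

/*
 * Atomically clear the young and dirty bits of *pte while leaving every
 * other bit untouched, even if another CPU changes the entry in between.
 * Returns true if either bit was set, i.e. a TLB flush would be needed.
 */
static bool test_and_clear_dirty_and_young(_Atomic uint64_t *pte)
{
	uint64_t old = atomic_load(pte);

	/* On failure 'old' is refreshed with the current value; retry. */
	while (!atomic_compare_exchange_weak(pte, &old,
					     old & ~(PTE_YOUNG | PTE_DIRTY)))
		;

	return (old & (PTE_YOUNG | PTE_DIRTY)) != 0;
}
```

The loop only ever writes back the value it last observed with the two bits masked off, so a concurrent change to any other bit makes the compare fail and is picked up on the retry instead of being overwritten.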

Jakub


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Nick Piggin

Rik van Riel wrote:

Nick Piggin wrote:


It looks like the tlb flushes (and IPIs) from zap_pte_range()
could have been the problem.  They're gone now.



I guess it is a good idea to batch these things. But can you
do that on all architectures? What happens if your tlb flush
happens after another thread already accesses it again, or
after it subsequently gets removed from the address space via
another CPU?



I have thought about this a lot tonight, and have come to the conclusion
that they are ok.

The reason is simple:

1) we do the TLB flush before we return from the
   madvise(MADV_FREE) syscall.

2) anything that accesses the pages between the start
   and end of the MADV_FREE procedure does not know in
   which order we go through the pages, so it could hit
   a page either before or after we get to processing
   it

3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them


Yes, but I'm wondering if it is legal in all architectures.



4) because we flush the tlb before releasing the page
   table lock, other CPUs cannot remove this page from
   the address space - they will block on the page
   table lock before looking at this pte


We don't when the ptl is split.

What the tlb flush used to be able to assume is that the page
has been removed from the pagetables when they are put in the
tlb flush batch.

I'm not saying there are any bugs, but just suggesting there
might be.

--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

Nick Piggin wrote:


It looks like the tlb flushes (and IPIs) from zap_pte_range()
could have been the problem.  They're gone now.


I guess it is a good idea to batch these things. But can you
do that on all architectures? What happens if your tlb flush
happens after another thread already accesses it again, or
after it subsequently gets removed from the address space via
another CPU?


I have thought about this a lot tonight, and have come to the conclusion
that they are ok.

The reason is simple:

1) we do the TLB flush before we return from the
   madvise(MADV_FREE) syscall.

2) anything that accesses the pages between the start
   and end of the MADV_FREE procedure does not know in
   which order we go through the pages, so it could hit
   a page either before or after we get to processing
   it

3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them

4) because we flush the tlb before releasing the page
   table lock, other CPUs cannot remove this page from
   the address space - they will block on the page
   table lock before looking at this pte

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Nick Piggin

Rik van Riel wrote:

Use TLB batching for MADV_FREE.  Adds another 10-15% extra performance
to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
---
Rik van Riel wrote:


I've added a 5th column, with just your mmap_sem patch and
without my madv_free patch.  It is run with the glibc patch,
which should make it fall back to MADV_DONTNEED after the
first MADV_FREE call fails.



With the attached patch to make MADV_FREE use tlb batching, not
only do we gain an additional 10-15% performance but Nick's
mmap_sem patch also shows the performance increase that we
expected to see.

It looks like the tlb flushes (and IPIs) from zap_pte_range()
could have been the problem.  They're gone now.


I guess it is a good idea to batch these things. But can you
do that on all architectures? What happens if your tlb flush
happens after another thread already accesses it again, or
after it subsequently gets removed from the address space via
another CPU?



The second column from the right has Nick's patch and my own
two patches.  Performance with 16 threads is almost triple what
it used to be...

threads  vanilla  glibc  glibc      glibc      glibc     glibc      glibc
                         madv_free  madv_free            madv_free  madv_free
                                    mmap_sem   mmap_sem  mmap_sem
                                                         tlb batch  tlb_batch

 1       610      609    596        545        534       547        537
 2       1032     1136   1196       1200       1180      1293       1194
 4       1070     1128   2014       2024       2027      2248       2040
 8       1000     1088   1665       2087       2089      2314       1869
16       779      1073   1310       1999       2012      2214       1557



Now that I think about it - this is all with the rawhide kernel
configuration, which has an ungodly number of debug config
options enabled.

I should try this with a more normal kernel, on various different
systems.



This is for another day. :)

First some ebizzy runs...




--- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.0 -0400
@@ -677,11 +677,15 @@ static unsigned long zap_pte_range(struc
 		remove_exclusive_swap_page(page);
 		unlock_page(page);
 	}
-	ptep_clear_flush_dirty(vma, addr, pte);
-	ptep_clear_flush_young(vma, addr, pte);
 	SetPageLazyFree(page);
 	if (PageActive(page))
 		deactivate_tail_page(page);
+	ptent = *pte;
+	set_pte_at(mm, addr, pte,
+		pte_mkclean(pte_mkold(ptent)));
+	/* tlb_remove_page frees it again */
+	get_page(page);
+	tlb_remove_page(tlb, page);
 	continue;
 }
 			}



--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

Nick Piggin wrote:

> I haven't tested your MADV_FREE patch yet.

Good.  It turned out that one behaved a bit strangely without tlb batching
anyway.


I'm now running ebizzy across the whole set of kernels I tested before,
and will post the results in a bit.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Nick Piggin

Nick Piggin wrote:

Rik van Riel wrote:


I've added a 5th column, with just your mmap_sem patch and
without my madv_free patch.  It is run with the glibc patch,
which should make it fall back to MADV_DONTNEED after the
first MADV_FREE call fails.



Thanks! (I edited slightly so it doesn't wrap)



threads  vanilla  new glibc  madv_free  mmap_sem  both

 1       610      609        596        534       545
 2       1032     1136       1196       1180      1200
 4       1070     1128       2014       2027      2024
 8       1000     1088       1665       2089      2087
16       779      1073       1310       2012      1999


Not doing the mprotect calls is the big one I guess, especially
the fact that we don't need to take the mmap_sem for writing.



Yes.



With both our patches, single and two thread performance with
MySQL sysbench is somewhat better than with just your patch,
4 and 8 thread performance are basically the same and just
your patch gives a slight benefit with 16 threads.

I guess I should benchmark up to 64 or 128 threads tomorrow,
to see if this is just luck or if the cache benefit of doing
the page faults and reusing hot pages is faster than not
having page faults at all.

I should run some benchmarks on other systems, too.  Some of
these results could be an artifact of my quad core CPU.  The
results could be very different on other systems...



I'm getting the 16 core box out of retirement as we speak :)



OK, 10 runs at 1 client, 2.6.21-rc6, MySQL version 5.33, and Jakub's
new glibc give a 99.9% confidence interval of:

vanilla:  467.2 +/- 7.9 (tps)
mmap_sem: 470.5 +/- 9.3 (tps)

However, it seems those means jump around a bit from boot to boot,
so there could be some memory placement luck for cache and/or
NUMA goodness involved.

So I think it is safe to say that the mmap_sem patch doesn't hurt
single threaded performance (from looking at the numbers and the
patch). And that's the most important thing for that patch.

I'll post some scalability results tomorrow. From my first round
of tests, after new glibc and the mmap_sem patch, it doesn't seem
like rwsem improvements, private futexes, or avoiding zero_page
make any significant differences.

I haven't tested your MADV_FREE patch yet.

--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

Use TLB batching for MADV_FREE.  Adds another 10-15% extra performance
to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
---
Rik van Riel wrote:


I've added a 5th column, with just your mmap_sem patch and
without my madv_free patch.  It is run with the glibc patch,
which should make it fall back to MADV_DONTNEED after the
first MADV_FREE call fails.


With the attached patch to make MADV_FREE use tlb batching, not
only do we gain an additional 10-15% performance but Nick's
mmap_sem patch also shows the performance increase that we
expected to see.

It looks like the tlb flushes (and IPIs) from zap_pte_range()
could have been the problem.  They're gone now.
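
To make the effect of batching concrete, here is a toy user-space model that counts how many flushes are issued when pages are queued and flushed together instead of one at a time. It is only a sketch: struct flush_batch, BATCH_MAX and the helper functions are hypothetical stand-ins and do not reflect the kernel's actual gather structure or its batch size.

```c
#include <stddef.h>

#define BATCH_MAX 8	/* hypothetical batch capacity */

/* Toy stand-in for a TLB gather structure; not the kernel's own. */
struct flush_batch {
	unsigned long addrs[BATCH_MAX];
	size_t nr;		/* addresses currently queued */
	unsigned long flushes;	/* number of (expensive) flushes issued */
};

/* Issue one flush covering everything queued so far. */
static void batch_flush(struct flush_batch *b)
{
	if (b->nr == 0)
		return;
	b->flushes++;	/* stands in for one TLB flush plus its IPIs */
	b->nr = 0;
}

/* Queue one address; flush only when the batch fills up. */
static void batch_add(struct flush_batch *b, unsigned long addr)
{
	b->addrs[b->nr++] = addr;
	if (b->nr == BATCH_MAX)
		batch_flush(b);
}

/* Count the flushes needed to process npages pages through the batch. */
static unsigned long flushes_batched(size_t npages)
{
	struct flush_batch b = { .nr = 0, .flushes = 0 };

	for (size_t i = 0; i < npages; i++)
		batch_add(&b, (unsigned long)i * 4096);
	batch_flush(&b);	/* final flush before returning to the caller */
	return b.flushes;
}
```

In this model an unbatched loop would issue one flush per page, so 20 pages cost 20 flushes, while the batched version needs only 3 (8 + 8 + 4).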

The second column from the right has Nick's patch and my own
two patches.  Performance with 16 threads is almost triple what
it used to be...

threads  vanilla  glibc  glibc      glibc      glibc     glibc      glibc
                         madv_free  madv_free            madv_free  madv_free
                                    mmap_sem   mmap_sem  mmap_sem
                                                         tlb batch  tlb_batch

 1       610      609    596        545        534       547        537
 2       1032     1136   1196       1200       1180      1293       1194
 4       1070     1128   2014       2024       2027      2248       2040
 8       1000     1088   1665       2087       2089      2314       1869
16       779      1073   1310       1999       2012      2214       1557



Now that I think about it - this is all with the rawhide kernel
configuration, which has an ungodly number of debug config
options enabled.

I should try this with a more normal kernel, on various different
systems.


This is for another day. :)

First some ebizzy runs...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
--- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.0 -0400
@@ -677,11 +677,15 @@ static unsigned long zap_pte_range(struc
 		remove_exclusive_swap_page(page);
 		unlock_page(page);
 	}
-	ptep_clear_flush_dirty(vma, addr, pte);
-	ptep_clear_flush_young(vma, addr, pte);
 	SetPageLazyFree(page);
 	if (PageActive(page))
 		deactivate_tail_page(page);
+	ptent = *pte;
+	set_pte_at(mm, addr, pte,
+		pte_mkclean(pte_mkold(ptent)));
+	/* tlb_remove_page frees it again */
+	get_page(page);
+	tlb_remove_page(tlb, page);
 	continue;
 }
 			}


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

Use TLB batching for MADV_FREE.  Adds another 10-15% extra performance
to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel [EMAIL PROTECTED]
---
Rik van Riel wrote:


I've added a 5th column, with just your mmap_sem patch and
without my madv_free patch.  It is run with the glibc patch,
which should make it fall back to MADV_DONTNEED after the
first MADV_FREE call fails.


With the attached patch to make MADV_FREE use tlb batching, not
only do we gain an additional 10-15% performance but Nick's
mmap_sem patch also shows the performance increase that we
expected to see.

It looks like the tlb flushes (and IPIs) from zap_pte_range()
could have been the problem.  They're gone now.

The second column from the right has Nick's patch and my own
two patches.  Performance with 16 threads is almost triple what
it used to be...

vanilla   glibc  glibc  glibcglibc  glibc  glibc
 madv_free  madv_free   madv_free 
madv_free

mmap_sem mmap_sem   mmap_sem
tlb batch  tlb_batch
threads

 1 610 609 596 545 534 547 537
 21032113611961200118012931194
 41070112820142024202722482040
 81000108816652087208923141869
 16779107313101999201222141557



Now that I think about it - this is all with the rawhide kernel
configuration, which has an ungodly number of debug config
options enabled.

I should try this with a more normal kernel, on various different
systems.


This is for another day. :)

First some ebizzy runs...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
--- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.0 -0400
@@ -677,11 +677,15 @@ static unsigned long zap_pte_range(struc
 		remove_exclusive_swap_page(page);
 		unlock_page(page);
 	}
-	ptep_clear_flush_dirty(vma, addr, pte);
-	ptep_clear_flush_young(vma, addr, pte);
 	SetPageLazyFree(page);
 	if (PageActive(page))
 		deactivate_tail_page(page);
+	ptent = *pte;
+	set_pte_at(mm, addr, pte,
+		pte_mkclean(pte_mkold(ptent)));
+	/* tlb_remove_page frees it again */
+	get_page(page);
+	tlb_remove_page(tlb, page);
 	continue;
 }
 			}


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Nick Piggin

Nick Piggin wrote:

Rik van Riel wrote:


I've added a 5th column, with just your mmap_sem patch and
without my madv_free patch.  It is run with the glibc patch,
which should make it fall back to MADV_DONTNEED after the
first MADV_FREE call fails.



Thanks! (I edited slightly so it doesn't wrap)



  vanilla   new glibc   madv_freemmap_semboth
threads

1 610 609 596 534 545
210321136119611801200
410701128201420272024
810001088166520892087
167791073131020121999


Not doing the mprotect calls is the big one I guess, especially
the fact that we don't need to take the mmap_sem for writing.



Yes.



With both our patches, single and two thread performance with
MySQL sysbench is somewhat better than with just your patch,
4 and 8 thread performance are basically the same and just
your patch gives a slight benefit with 16 threads.

I guess I should benchmark up to 64 or 128 threads tomorrow,
to see if this is just luck or if the cache benefit of doing
the page faults and reusing hot pages is faster than not
having page faults at all.

I should run some benchmarks on other systems, too.  Some of
these results could be an artifact of my quad core CPU.  The
results could be very different on other systems...



I'm getting the 16 core box out of retirement as we speak :)



OK, 10 runs at 1 client, 2.6.21-rc6, MySQL version 5.33, and new
Jakub's glibc gives a 99.9% confidence of:

vanilla:  467.2 +/- 7.9 (tps)
mmap_sem: 470.5 +/- 9.3 (tps)

However, it seems those means jump around a bit from boot to boot,
so there could be some some memory placement luck for cache and/or
NUMA goodness involved.

So I think it is safe to say that the mmap_sem patch doesn't hurt
single threaded performance (from looking at the numbers and the
patch). And that's the most important thing for that patch.

I'll post some scalability results tomorrow. From my first round
of tests, after new glibc and the mmap_sem patch, it doesn't seem
like rwsem improvements, private futexes, or avoiding zero_page
make any significant differences.

I haven't tested your MADV_FREE patch yet.

--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

Nick Piggin wrote:


I haven't tested your MADV_FREE patch yet.


Good.  It turned out that one behaved a bit strangely without tlb batching
anyway.


I'm now running ebizzy across the whole set of kernels I tested before,
and will post the results in a bit.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Nick Piggin

Rik van Riel wrote:

Use TLB batching for MADV_FREE.  Adds another 10-15% extra performance
to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel [EMAIL PROTECTED]
---
Rik van Riel wrote:


I've added a 5th column, with just your mmap_sem patch and
without my madv_free patch.  It is run with the glibc patch,
which should make it fall back to MADV_DONTNEED after the
first MADV_FREE call fails.



With the attached patch to make MADV_FREE use tlb batching, not
only do we gain an additional 10-15% performance but Nick's
mmap_sem patch also shows the performance increase that we
expected to see.

It looks like the tlb flushes (and IPIs) from zap_pte_range()
could have been the problem.  They're gone now.
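
As a rough illustration of why batching matters here, a toy model in
plain user-space C that counts flushes with and without a batch.  This
is not the kernel's mmu_gather code; the struct, the function names and
the batch size of 8 are all made up for the example:

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 8   /* hypothetical batch capacity, chosen for illustration */

/* Toy stand-in for a flush batch; not the kernel's struct mmu_gather. */
struct flush_batch {
    unsigned long pending[BATCH_MAX]; /* addresses waiting for a flush */
    size_t nr;                        /* number of pending addresses */
    unsigned long flushes;            /* how many expensive flushes were issued */
};

/* Issue one flush that covers every pending address. */
static void batch_flush(struct flush_batch *b)
{
    if (b->nr == 0)
        return;
    b->flushes++;   /* stands for one TLB flush plus one round of IPIs */
    b->nr = 0;
}

/* Queue an address; flush only when the batch is full. */
static void batch_add(struct flush_batch *b, unsigned long addr)
{
    assert(b->nr < BATCH_MAX);
    b->pending[b->nr++] = addr;
    if (b->nr == BATCH_MAX)
        batch_flush(b);
}

/* Number of flushes needed for nr_pages pages when batching. */
static unsigned long flushes_batched(unsigned long nr_pages)
{
    struct flush_batch b = { .nr = 0, .flushes = 0 };

    for (unsigned long i = 0; i < nr_pages; i++)
        batch_add(&b, i * 4096UL);
    batch_flush(&b);   /* final flush before the call would return */
    return b.flushes;
}

/* Without batching, every page costs its own flush. */
static unsigned long flushes_unbatched(unsigned long nr_pages)
{
    return nr_pages;
}
```

In this model 20 pages cost 20 flushes unbatched but only 3 batched
(8 + 8 + 4), which is the kind of reduction in flushes and IPIs being
described.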


I guess it is a good idea to batch these things. But can you
do that on all architectures? What happens if your tlb flush
happens after another thread already accesses it again, or
after it subsequently gets removed from the address space via
another CPU?



The second column from the right has Nick's patch and my own
two patches.  Performance with 16 threads is almost triple what
it used to be...

threads  vanilla  glibc  glibc      glibc      glibc     glibc      glibc
                         madv_free  madv_free            madv_free  madv_free
                                    mmap_sem   mmap_sem  mmap_sem
                                                         tlb batch  tlb_batch

      1      610    609    596        545        534       547        537
      2     1032   1136   1196       1200       1180      1293       1194
      4     1070   1128   2014       2024       2027      2248       2040
      8     1000   1088   1665       2087       2089      2314       1869
     16      779   1073   1310       1999       2012      2214       1557
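
To put a number on "almost triple", a small worked example using the
16-thread figures from the table (vanilla vs. the column with both
madv_free patches plus the mmap_sem patch).  The macro and function
names are made up:

```c
#include <assert.h>

/* 16-thread sysbench throughput (transactions/second), copied from the table. */
#define TPS16_VANILLA        779.0
#define TPS16_MADV_SEM_TLB  2214.0   /* glibc + madv_free + mmap_sem + tlb batch */

/* How many times faster new_tps is than old_tps. */
static double speedup(double new_tps, double old_tps)
{
    assert(old_tps > 0.0);
    return new_tps / old_tps;
}
```

speedup(TPS16_MADV_SEM_TLB, TPS16_VANILLA) comes out at roughly 2.84.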



Now that I think about it - this is all with the rawhide kernel
configuration, which has an ungodly number of debug config
options enabled.

I should try this with a more normal kernel, on various different
systems.



This is for another day. :)

First some ebizzy runs...




--- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.0 -0400
@@ -677,11 +677,15 @@ static unsigned long zap_pte_range(struc
 		remove_exclusive_swap_page(page);
 		unlock_page(page);
 	}
-	ptep_clear_flush_dirty(vma, addr, pte);
-	ptep_clear_flush_young(vma, addr, pte);
 	SetPageLazyFree(page);
 	if (PageActive(page))
 		deactivate_tail_page(page);
+	ptent = *pte;
+	set_pte_at(mm, addr, pte,
+		pte_mkclean(pte_mkold(ptent)));
+	/* tlb_remove_page frees it again */
+	get_page(page);
+	tlb_remove_page(tlb, page);
 	continue;
 }
 }



--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

Nick Piggin wrote:


It looks like the tlb flushes (and IPIs) from zap_pte_range()
could have been the problem.  They're gone now.


I guess it is a good idea to batch these things. But can you
do that on all architectures? What happens if your tlb flush
happens after another thread already accesses it again, or
after it subsequently gets removed from the address space via
another CPU?


I have thought about this a lot tonight, and have come to the conclusion
that they are ok.

The reason is simple:

1) we do the TLB flush before we return from the
   madvise(MADV_FREE) syscall.

2) anything that accesses the pages between the start
   and end of the MADV_FREE procedure does not know in
   which order we go through the pages, so it could hit
   a page either before or after we get to processing
   it

3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them

4) because we flush the tlb before releasing the page
   table lock, other CPUs cannot remove this page from
   the address space - they will block on the page
   table lock before looking at this pte

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Nick Piggin

Rik van Riel wrote:

Nick Piggin wrote:


It looks like the tlb flushes (and IPIs) from zap_pte_range()
could have been the problem.  They're gone now.



I guess it is a good idea to batch these things. But can you
do that on all architectures? What happens if your tlb flush
happens after another thread already accesses it again, or
after it subsequently gets removed from the address space via
another CPU?



I have thought about this a lot tonight, and have come to the conclusion
that they are ok.

The reason is simple:

1) we do the TLB flush before we return from the
   madvise(MADV_FREE) syscall.

2) anything that accesses the pages between the start
   and end of the MADV_FREE procedure does not know in
   which order we go through the pages, so it could hit
   a page either before or after we get to processing
   it

3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them


Yes, but I'm wondering if it is legal in all architectures.



4) because we flush the tlb before releasing the page
   table lock, other CPUs cannot remove this page from
   the address space - they will block on the page
   table lock before looking at this pte


We don't when the ptl is split.

What the tlb flush used to be able to assume is that the page
has been removed from the pagetables when they are put in the
tlb flush batch.

I'm not saying there are any bugs, but just suggesting there
might be.

--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Jakub Jelinek
On Mon, Apr 23, 2007 at 08:21:37PM +1000, Nick Piggin wrote:
> I guess it is a good idea to batch these things. But can you
> do that on all architectures? What happens if your tlb flush
> happens after another thread already accesses it again, or
> after it subsequently gets removed from the address space via
> another CPU?

Accessing the page by another thread before madvise (MADV_FREE)
returns is undefined behavior; it can act as if that access happened
right before the madvise (MADV_FREE) call or right after it.
That's ok for glibc and presumably any other malloc implementation:
madvise (MADV_FREE) is called while holding the containing arena's lock,
and for whatever malloc implementation, madvise (MADV_FREE) would be
part of the free operation, and you definitely need some synchronization
between one thread freeing some memory and another thread deciding
to reuse that memory and return it from malloc/realloc/calloc/etc.

My only concern is whether using a non-atomic update of the pte is
ok or not.
The ptep_test_and_clear_young/ptep_test_and_clear_dirty calls Rik's patch
was doing before are done using atomic instructions, at least on x86_64.
The operation we want for MADV_FREE is: clear the young/dirty bits if they
have been set on entry to the MADV_FREE madvise call, with undefined values
for these 2 bits if some other task modifies the young/dirty bits
concurrently with this MADV_FREE zap_page_range, but I'd say the other
bits need to stay unmodified.
Now, is there some kernel code which, while either not holding the
corresponding mmap_sem at all or holding it just with down_read, modifies
other bits in the pte?  If yes, we need to do this clearing atomically,
basically do a cmpxchg loop until we succeed in clearing the 2 bits and then
flush the tlb if any of them was set before
(ptep_test_and_clear_dirty_and_young?); if not, set_pte_at is ok and faster
than a lock prefixed insn.

Jakub


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

Use TLB batching for MADV_FREE.  Adds another 10-15% extra performance
to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel [EMAIL PROTECTED]
---

Nick Piggin wrote:


3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them


Yes, but I'm wondering if it is legal in all architectures.


It's similar to trying to access memory during an munmap.

You may be able to for a short time, but it'll come back to
haunt you.


4) because we flush the tlb before releasing the page
   table lock, other CPUs cannot remove this page from
   the address space - they will block on the page
   table lock before looking at this pte


We don't when the ptl is split.


Even then we do.  Each invocation of zap_pte_range() only touches
one page table page, and it flushes the TLB before releasing the
page table lock.


What the tlb flush used to be able to assume is that the page
has been removed from the pagetables when they are put in the
tlb flush batch.


All the tlb flush code seems to assume is that the tlb entries
should be invalidated.


I'm not saying there are any bugs, but just suggesting there
might be.


Jakub found a potential bug, in that I did not use an atomic
operation to clear the page table entries.  I've attached a
new patch which simply uses ptep_test_and_clear_dirty/young
to get rid of the dirty and accessed bits.

It uses the same atomic accesses we use elsewhere in the VM
and the code is a line shorter than before.

Andrew, please use this one.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
--- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.0 -0400
@@ -677,11 +677,14 @@ static unsigned long zap_pte_range(struc
 		remove_exclusive_swap_page(page);
 		unlock_page(page);
 	}
-	ptep_clear_flush_dirty(vma, addr, pte);
-	ptep_clear_flush_young(vma, addr, pte);
+	ptep_test_and_clear_dirty(vma, addr, pte);
+	ptep_test_and_clear_young(vma, addr, pte);
 	SetPageLazyFree(page);
 	if (PageActive(page))
 		deactivate_tail_page(page);
+	/* tlb_remove_page frees it again */
+	get_page(page);
+	tlb_remove_page(tlb, page);
 	continue;
 }
 			}


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

Rik van Riel wrote:


First some ebizzy runs...


This is interesting.  Ginormous speedups in ebizzy[1] on my quad core
test system.  The following numbers are the average of 10 runs, since
ebizzy shows some variability.

You can see a big influence from the tlb batching and from Nick's
mmap_sem patch.  The reduction in system time from 100 seconds to
3 seconds is way more than I had expected, but I'm not complaining.
The 4 fold reduction in wall clock time is a nice bonus.

According to Val, ebizzy shows the weaknesses of Linux with a real
workload, so this could be a useful result.

kernel             user   system   wall clock   %CPU

vanilla            186s    101s       123s      230%
madv_free (madv)   175s     96s       120s      230%
mmap_sem (sem)     100s     40s        40s      370%
madv+sem           200s    140s       100s      393%
madv+sem+tlb       118s      3s        30s      395%
madv+tlb           150s     10s        50s      310%

[1] http://www.ussg.iu.edu/hypermail/linux/kernel/0604.2/1699.html
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Nick Piggin

Rik van Riel wrote:

Use TLB batching for MADV_FREE.  Adds another 10-15% extra performance
to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel [EMAIL PROTECTED]
---

Nick Piggin wrote:


3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them



Yes, but I'm wondering if it is legal in all architectures.



It's similar to trying to access memory during an munmap.

You may be able to for a short time, but it'll come back to
haunt you.


The question is whether the architecture specific tlb
flushing code will break or not.



4) because we flush the tlb before releasing the page
   table lock, other CPUs cannot remove this page from
   the address space - they will block on the page
   table lock before looking at this pte



We don't when the ptl is split.



Even then we do.  Each invocation of zap_pte_range() only touches
one page table page, and it flushes the TLB before releasing the
page table lock.


What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.

--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

This should fix the MADV_FREE code for PPC's hashed tlb.

Signed-off-by: Rik van Riel [EMAIL PROTECTED]
---

Nick Piggin wrote:

Nick Piggin wrote:


3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them



Yes, but I'm wondering if it is legal in all architectures.



It's similar to trying to access memory during an munmap.

You may be able to for a short time, but it'll come back to
haunt you.


The question is whether the architecture specific tlb
flushing code will break or not.


I guess we'll need to call tlb_remove_tlb_entry() inside the
MADV_FREE code to keep powerpc happy.

Thanks for pointing this one out.


Even then we do.  Each invocation of zap_pte_range() only touches
one page table page, and it flushes the TLB before releasing the
page table lock.


What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.


Oh dear.  I see it now...

The tlb end things inside zap_pte_range() are actually
noops and the actual tlb flush only happens inside
zap_page_range().

I guess the fact that munmap gets the mmap_sem for
writing should save us, though...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
--- linux-2.6.20.x86_64/mm/memory.c.noppc	2007-04-23 21:50:09.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 21:48:59.0 -0400
@@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc
 	}
 	ptep_test_and_clear_dirty(vma, addr, pte);
 	ptep_test_and_clear_young(vma, addr, pte);
+	tlb_remove_tlb_entry(tlb, pte, addr);
 	SetPageLazyFree(page);
 	if (PageActive(page))
 		deactivate_tail_page(page);


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Nick Piggin

Rik van Riel wrote:

This should fix the MADV_FREE code for PPC's hashed tlb.

Signed-off-by: Rik van Riel [EMAIL PROTECTED]
---

Nick Piggin wrote:


Nick Piggin wrote:


3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them




Yes, but I'm wondering if it is legal in all architectures.




It's similar to trying to access memory during an munmap.

You may be able to for a short time, but it'll come back to
haunt you.



The question is whether the architecture specific tlb
flushing code will break or not.



I guess we'll need to call tlb_remove_tlb_entry() inside the
MADV_FREE code to keep powerpc happy.

Thanks for pointing this one out.


Even then we do.  Each invocation of zap_pte_range() only touches
one page table page, and it flushes the TLB before releasing the
page table lock.



What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.



Oh dear.  I see it now...

The tlb end things inside zap_pte_range() are actually
noops and the actual tlb flush only happens inside
zap_page_range().

I guess the fact that munmap gets the mmap_sem for
writing should save us, though...


What about an unmap_mapping_range, or another MADV_FREE or
MADV_DONTNEED?






--- linux-2.6.20.x86_64/mm/memory.c.noppc	2007-04-23 21:50:09.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 21:48:59.0 -0400
@@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc
 	}
 	ptep_test_and_clear_dirty(vma, addr, pte);
 	ptep_test_and_clear_young(vma, addr, pte);
+	tlb_remove_tlb_entry(tlb, pte, addr);
 	SetPageLazyFree(page);
 	if (PageActive(page))
 		deactivate_tail_page(page);



--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

Nick Piggin wrote:


What the tlb flush used to be able to assume is that the page
has been removed from the pagetables when they are put in the
tlb flush batch.


I think this is still the case, to a degree.  There should be
no harm in removing the TLB entries after the page table has
been unlocked, right?

Or is something like the attached really needed?

From what I can see, the page table lock should be enough
synchronization between unmap_mapping_range, MADV_FREE and
MADV_DONTNEED.

I don't see why we need the attached, but in case you find
a good reason, here's my signed-off-by line for Andrew :)

Signed-off-by: Rik van Riel [EMAIL PROTECTED]

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
--- linux-2.6.20.x86_64/mm/memory.c.flushme	2007-04-23 22:26:06.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 22:42:06.0 -0400
@@ -628,6 +628,7 @@ static unsigned long zap_pte_range(struc
 long *zap_work, struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
+	unsigned long start_addr = addr;
 	pte_t *pte;
 	spinlock_t *ptl;
 	int file_rss = 0;
@@ -726,6 +727,11 @@ static unsigned long zap_pte_range(struc
 
 	add_mm_rss(mm, file_rss, anon_rss);
 	arch_leave_lazy_mmu_mode();
+	if (details && details->madv_free) {
+		/* Protect against MADV_DONTNEED or unmap_mapping_range */
+		tlb_finish_mmu(tlb, start_addr, addr);
+		tlb = tlb_gather_mmu(mm, 0);
+	}
 	pte_unmap_unlock(pte - 1, ptl);
 
 	return addr;


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Andrew Morton
On Mon, 23 Apr 2007 22:53:49 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:

> I don't see why we need the attached, but in case you find
> a good reason, here's my signed-off-by line for Andrew :)

Andrew is in a defensive crouch trying to work his way through all the bugs
he's been sent.  After I've managed to release 2.6.21-rc7-mm1 (say, December)
I expect I'll drop the MADV_FREE stuff, give you a run at creating a new
patch series.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Paul Mackerras
Rik van Riel writes:

> I guess we'll need to call tlb_remove_tlb_entry() inside the
> MADV_FREE code to keep powerpc happy.

I don't see why; once ptep_test_and_clear_young has returned, the
entry in the hash table has already been removed.  Adding the
tlb_remove_tlb_entry call certainly won't do anything on 64-bit
powerpc, since it expands to do {} while (0) there, and in fact it
won't do anything on 32-bit powerpc either.

Paul.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-23 Thread Rik van Riel

Paul Mackerras wrote:

Rik van Riel writes:


I guess we'll need to call tlb_remove_tlb_entry() inside the
MADV_FREE code to keep powerpc happy.


I don't see why; once ptep_test_and_clear_young has returned, the
entry in the hash table has already been removed. 


OK, so this one won't be necessary. Good to know that.

Andrew, it looks like things won't be that bad :)

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-22 Thread Nick Piggin

Jakub Jelinek wrote:

On Fri, Apr 20, 2007 at 07:52:44PM -0400, Rik van Riel wrote:


It turns out that Nick's patch does not improve peak
performance much, but it does prevent the decline when
running with 16 threads on my quad core CPU!

We _definitely_ want both patches, there's a huge benefit
in having them both.

Here are the transactions/seconds for each combination:

threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem

   1        610        609            596                 545
   2       1032       1136           1196                1200
   4       1070       1128           2014                2024
   8       1000       1088           1665                2087
  16        779       1073           1310                1999



FYI, I have uploaded a testing glibc that uses MADV_FREE and falls back
to MADV_DONTNEED if MADV_FREE is not available, to
http://people.redhat.com/jakub/glibc/2.5.90-21.1/
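
A minimal sketch of what such a fallback could look like in user
space, assuming a Linux system whose headers may or may not define
MADV_FREE.  This is not the code from the glibc build above;
lazy_free() is a made-up helper name:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hint that [addr, addr + len) is no longer needed.
 * Tries MADV_FREE first; once the kernel rejects it as unknown, later
 * calls go straight to MADV_DONTNEED.
 * Returns the advice value that succeeded, or -1 with errno set.
 */
static int lazy_free(void *addr, size_t len)
{
    assert(addr != NULL);

#ifdef MADV_FREE
    static int have_madv_free = 1;   /* not thread-safe; a sketch only */

    if (have_madv_free) {
        if (madvise(addr, len, MADV_FREE) == 0)
            return MADV_FREE;
        if (errno == EINVAL || errno == ENOSYS)
            have_madv_free = 0;      /* remember the failure */
        /* fall through and try MADV_DONTNEED for this call */
    }
#endif
    if (madvise(addr, len, MADV_DONTNEED) == 0)
        return MADV_DONTNEED;
    return -1;
}
```

A malloc implementation would call something like this from its free
path, on page-aligned ranges and under its own arena lock.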


Hmm, I wonder how glibc malloc stacks up to tcmalloc on this test
(after the mmap_sem patch as well).

I'll try running that as well!

--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-22 Thread Rik van Riel

Nick Piggin wrote:


So where is the down_write coming from in this workload, I wonder?
Heap management? What syscalls?


Trying to answer this question, I straced the mysql threads that
showed up in top when running a single threaded sysbench workload.

There were no mmap, munmap, brk, mprotect or madvise system calls
in the trace.

MySQL has me puzzled, but it seems to have some other people
interested too.

I think I'll go play a bit with ebizzy now, to see how other
workloads are affected by our kernel changes.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-22 Thread Nick Piggin

Rik van Riel wrote:


I've added a 5th column, with just your mmap_sem patch and
without my madv_free patch.  It is run with the glibc patch,
which should make it fall back to MADV_DONTNEED after the
first MADV_FREE call fails.


Thanks! (I edited slightly so it doesn't wrap)



threads   vanilla   new glibc   madv_free   mmap_sem   both

   1        610        609         596         534      545
   2       1032       1136        1196        1180     1200
   4       1070       1128        2014        2027     2024
   8       1000       1088        1665        2089     2087
  16        779       1073        1310        2012     1999


Not doing the mprotect calls is the big one I guess, especially
the fact that we don't need to take the mmap_sem for writing.


Yes.



With both our patches, single and two thread performance with
MySQL sysbench is somewhat better than with just your patch,
4 and 8 thread performance are basically the same and just
your patch gives a slight benefit with 16 threads.

I guess I should benchmark up to 64 or 128 threads tomorrow,
to see if this is just luck or if the cache benefit of doing
the page faults and reusing hot pages is faster than not
having page faults at all.

I should run some benchmarks on other systems, too.  Some of
these results could be an artifact of my quad core CPU.  The
results could be very different on other systems...


I'm getting the 16 core box out of retirement as we speak :)

--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-22 Thread Rik van Riel

Rik van Riel wrote:

Nick Piggin wrote:

Rik van Riel wrote:

Nick Piggin wrote:


Rik van Riel wrote:



Here are the transactions/seconds for each combination:


I've added a 5th column, with just your mmap_sem patch and
without my madv_free patch.  It is run with the glibc patch,
which should make it fall back to MADV_DONTNEED after the
first MADV_FREE call fails.

threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem   mmap_sem

   1        610        609            596                 545               534
   2       1032       1136           1196                1200              1180
   4       1070       1128           2014                2024              2027
   8       1000       1088           1665                2087              2089
  16        779       1073           1310                1999              2012


Now that I think about it - this is all with the rawhide kernel
configuration, which has an ungodly number of debug config
options enabled.

I should try this with a more normal kernel, on various different
systems.

It would also be helpful if other people tried this same benchmark,
and others, on their systems.



Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-22 Thread Rik van Riel

Nick Piggin wrote:

Rik van Riel wrote:

Nick Piggin wrote:


Rik van Riel wrote:



Here are the transactions/seconds for each combination:


I've added a 5th column, with just your mmap_sem patch and
without my madv_free patch.  It is run with the glibc patch,
which should make it fall back to MADV_DONTNEED after the
first MADV_FREE call fails.


threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem   mmap_sem

   1        610        609            596                 545               534
   2       1032       1136           1196                1200              1180
   4       1070       1128           2014                2024              2027
   8       1000       1088           1665                2087              2089
  16        779       1073           1310                1999              2012


Not doing the mprotect calls is the big one I guess, especially
the fact that we don't need to take the mmap_sem for writing.

With both our patches, single and two thread performance with
MySQL sysbench is somewhat better than with just your patch,
4 and 8 thread performance are basically the same and just
your patch gives a slight benefit with 16 threads.

I guess I should benchmark up to 64 or 128 threads tomorrow,
to see if this is just luck or if the cache benefit of doing
the page faults and reusing hot pages is faster than not
having page faults at all.

I should run some benchmarks on other systems, too.  Some of
these results could be an artifact of my quad core CPU.  The
results could be very different on other systems...


Yeah. That's funny, because it means either there is some
contention on the mmap_sem (or ptl) at 1 thread, or that my
patch alters the uncontended performance.


Maybe MySQL has various different threads to do
different tasks.  Something to look into...


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-22 Thread Nick Piggin

Rik van Riel wrote:

Nick Piggin wrote:


Rik van Riel wrote:



Here are the transactions/seconds for each combination:

threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem

   1        610        609            596                 545
   2       1032       1136           1196                1200
   4       1070       1128           2014                2024
   8       1000       1088           1665                2087
  16        779       1073           1310                1999




Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?



No, that's just the glibc change, with a vanilla kernel.


OK. That would be interesting to see with the mmap_sem change,
because that should increase scalability.



The third column is glibc change + mmap_sem patch.

The fourth column has your patch in it, too.


The strange thing with your madv_free kernel is that it doesn't
help single-threaded performance at all. So that work to avoid
zeroing the new page is not a win at all there (maybe due to the
cache effects I was worried about?).



Well, your patch causes the performance to drop from
596 transactions/second to 545.  Your patch is the only
difference between the third and the fourth column.


Yeah. That's funny, because it means either there is some
contention on the mmap_sem (or ptl) at 1 thread, or that my
patch alters the uncontended performance.



However MADV_FREE does improve scalability, which is interesting.
The most likely reason I can see why that may be the case is that
it avoids mmap_sem when faulting pages back in (I doubt it is due
to avoiding the page allocator, but maybe?).

So where is the down_write coming from in this workload, I wonder?
Heap management? What syscalls?



I wonder if the increased parallelism simply caused
more cache line bouncing, with bounces happening in
some inner loop instead of an outer loop.

Btw, it is quite possible that the MySQL sysbench
thing gives different results on your system.  It
would be good to know what it does on a real SMP
system, vs. a single quad-core chip :)

Other architectures would be interesting to know,
too.


I don't see why parallelism should come into it at 1 thread, unless
MySQL is parallelising individual transactions. Anyway, I'll try to do
some more digging.

--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-22 Thread Ulrich Drepper

On 4/22/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> Why isn't MADV_FREE defined to 5 for linux?  It's our first free madv
> value?  Also the behaviour should better match the one in solaris or BSD,
> the last thing we need is slightly different behaviour from operating
> systems supporting this for ages.


The behavior should indeed be identical.  Both implementations
restrict MADV_FREE to anonymous memory, and it is unspecified
whether a renewed access yields a newly zeroed page or whether
the old content is still there.  So, just use 0x5 for both the
Linux and Solaris version on sparc.
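
A minimal C sketch of those semantics as seen from an application, assuming a Linux system whose kernel and headers provide MADV_FREE. The helper name lazy_free_and_reuse is hypothetical. It deliberately never reads the pages between the madvise() call and the next write, because the contents in that window are unspecified.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map anonymous private memory, dirty it, mark it
 * lazily freeable, then reuse it.
 * Returns 0 on success, 1 if MADV_FREE is not available, -1 on error.
 */
int lazy_free_and_reuse(size_t len)
{
    assert(len > 0);

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAA, len);           /* dirty the pages */

    int rc = 0;
#ifdef MADV_FREE
    if (madvise(p, len, MADV_FREE) != 0)
        rc = (errno == EINVAL) ? 1 : -1;   /* EINVAL: advice not supported */
#else
    rc = 1;                         /* headers predate MADV_FREE */
#endif

    if (rc == 0) {
        /*
         * From here until the next write, a read may see either the old
         * 0xAA bytes or zeroes.  Writing makes the contents defined again.
         */
        memset(p, 0x55, len);
        if (p[0] != 0x55 || p[len - 1] != 0x55)
            rc = -1;
    }

    munmap(p, len);
    return rc;
}
```

Calling lazy_free_and_reuse(16384) should return 0 on a kernel with MADV_FREE and 1 on one without it.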


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-22 Thread Christoph Hellwig
On Sun, Apr 22, 2007 at 01:18:10AM -0700, Andrew Morton wrote:
> On Tue, 17 Apr 2007 03:15:51 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
> 
> > Make it possible for applications to have the kernel free memory
> > lazily.  This reduces a repeated free/malloc cycle from freeing
> > pages and allocating them, to just marking them freeable.  If the
> > application wants to reuse them before the kernel needs the memory,
> > not even a page fault will happen.
> > 
> > This patch, together with Ulrich's glibc change, increases
> > MySQL sysbench performance by a factor of 2 on my quad core
> > test system.
> > 
> 
> In file included from include/linux/mman.h:4,
>  from arch/sparc64/kernel/sys_sparc.c:19:
> include/asm/mman.h:36:1: "MADV_FREE" redefined
> In file included from include/asm/mman.h:5,
>  from include/linux/mman.h:4,
>  from arch/sparc64/kernel/sys_sparc.c:19:
> include/asm-generic/mman.h:32:1: this is the location of the previous definition
> 
> sparc32 and sparc64 already defined MADV_FREE:
> 
> 
> #define MADV_FREE   0x5 /* (Solaris) contents can be freed */
> 
> I'll remove the sparc definitions for now, but we need to work out what
> we're going to do here.  Your patch changes the values of MADV_FREE on
> sparc.
> 
> Perhaps this should be renamed to MADV_FREE_LINUX and given a different
> number.  It depends on how close your proposed behaviour is to Solaris's.

Why isn't MADV_FREE defined to 5 for linux?  It's our first free madv
value?  Also the behaviour should better match the one in solaris or BSD,
the last thing we need is slightly different behaviour from operating
systems supporting this for ages.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-22 Thread Andrew Morton
On Tue, 17 Apr 2007 03:15:51 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:

> Make it possible for applications to have the kernel free memory
> lazily.  This reduces a repeated free/malloc cycle from freeing
> pages and allocating them, to just marking them freeable.  If the
> application wants to reuse them before the kernel needs the memory,
> not even a page fault will happen.
> 
> This patch, together with Ulrich's glibc change, increases
> MySQL sysbench performance by a factor of 2 on my quad core
> test system.
> 

In file included from include/linux/mman.h:4,
 from arch/sparc64/kernel/sys_sparc.c:19:
include/asm/mman.h:36:1: "MADV_FREE" redefined
In file included from include/asm/mman.h:5,
 from include/linux/mman.h:4,
 from arch/sparc64/kernel/sys_sparc.c:19:
include/asm-generic/mman.h:32:1: this is the location of the previous definition

sparc32 and sparc64 already defined MADV_FREE:


#define MADV_FREE   0x5 /* (Solaris) contents can be freed */

I'll remove the sparc definitions for now, but we need to work out what
we're going to do here.  Your patch changes the values of MADV_FREE on
sparc.

Perhaps this should be renamed to MADV_FREE_LINUX and given a different
number.  It depends on how close your proposed behaviour is to Solaris's.



Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-22 Thread Rik van Riel

Nick Piggin wrote:

Rik van Riel wrote:

Andrew Morton wrote:


On Fri, 20 Apr 2007 17:38:06 -0400
Rik van Riel <[EMAIL PROTECTED]> wrote:


Andrew Morton wrote:


I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".

- Nick's patch also will help this problem.  It could be that your patch
  no longer offers a 2x speedup when combined with Nick's patch.

  It could well be that the combination of the two is even better, but it
  would be nice to firm that up a bit.


I'll test that.



Thanks.



Well, good news.

It turns out that Nick's patch does not improve peak
performance much, but it does prevent the decline when
running with 16 threads on my quad core CPU!

We _definitely_ want both patches, there's a huge benefit
in having them both.

Here are the transactions/seconds for each combination:

threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem

      1       610         609                596                    545
      2      1032        1136               1196                   1200
      4      1070        1128               2014                   2024
      8      1000        1088               1665                   2087
     16       779        1073               1310                   1999



Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?


No, that's just the glibc change, with a vanilla kernel.

The third column is glibc change + mmap_sem patch.

The fourth column has your patch in it, too.


The strange thing with your madv_free kernel is that it doesn't
help single-threaded performance at all. So that work to avoid
zeroing the new page is not a win at all there (maybe due to the
cache effects I was worried about?).


Well, your patch causes the performance to drop from
596 transactions/second to 545.  Your patch is the only
difference between the third and the fourth column.


However MADV_FREE does improve scalability, which is interesting.
The most likely reason I can see why that may be the case is that
it avoids mmap_sem when faulting pages back in (I doubt it is due
to avoiding the page allocator, but maybe?).

So where is the down_write coming from in this workload, I wonder?
Heap management? What syscalls?


I wonder if the increased parallelism simply caused
more cache line bouncing, with bounces happening in
some inner loop instead of an outer loop.

Btw, it is quite possible that the MySQL sysbench
thing gives different results on your system.  It
would be good to know what it does on a real SMP
system, vs. a single quad-core chip :)

Other architectures would be interesting to know,
too.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-22 Thread Rik van Riel

Nick Piggin wrote:

Rik van Riel wrote:

Nick Piggin wrote:


Rik van Riel wrote:



Here are the transactions/seconds for each combination:


I've added a 5th column, with just your mmap_sem patch and
without my madv_free patch.  It is run with the glibc patch,
which should make it fall back to MADV_DONTNEED after the
first MADV_FREE call fails.
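
A minimal C sketch of that fallback, under stated assumptions: the helper release_pages is hypothetical, and although a variable called no_madv_free appears in the glibc ChangeLog posted in this thread, the logic below is an illustration and not the actual glibc code. It assumes an unsupported advice value is reported as EINVAL. The flag is a plain static, so concurrent first calls may each try MADV_FREE once, which is harmless here.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Set once the kernel has rejected MADV_FREE, so later calls skip it.
 * Name borrowed from the ChangeLog; the surrounding logic is an assumption. */
static int no_madv_free;

/*
 * Hypothetical helper: tell the kernel that [addr, addr + len) is no
 * longer needed, preferring lazy freeing when it is available.
 * Returns the advice value that was applied, or -1 on failure.
 */
int release_pages(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);

#ifdef MADV_FREE
    if (!no_madv_free) {
        if (madvise(addr, len, MADV_FREE) == 0)
            return MADV_FREE;
        if (errno != EINVAL)
            return -1;              /* a real error, not "unsupported" */
        no_madv_free = 1;           /* remember and fall through */
    }
#endif
    return madvise(addr, len, MADV_DONTNEED) == 0 ? MADV_DONTNEED : -1;
}
```

On a kernel without MADV_FREE only the first call pays for the failed attempt; every later call goes straight to MADV_DONTNEED.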


threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem   mmap_sem

      1       610         609                596                    545        534
      2      1032        1136               1196                   1200       1180
      4      1070        1128               2014                   2024       2027
      8      1000        1088               1665                   2087       2089
     16       779        1073               1310                   1999       2012


Not doing the mprotect calls is the big one I guess, especially
the fact that we don't need to take the mmap_sem for writing.

With both our patches, single and two thread performance with
MySQL sysbench is somewhat better than with just your patch,
4 and 8 thread performance are basically the same and just
your patch gives a slight benefit with 16 threads.

I guess I should benchmark up to 64 or 128 threads tomorrow,
to see if this is just luck or if the cache benefit of doing
the page faults and reusing hot pages is faster than not
having page faults at all.

I should run some benchmarks on other systems, too.  Some of
these results could be an artifact of my quad core CPU.  The
results could be very different on other systems...


Yeah. That's funny, because it means either there is some
contention on the mmap_sem (or ptl) at 1 thread, or that my
patch alters the uncontended performance.


Maybe MySQL has various different threads to do
different tasks.  Something to look into...


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-22 Thread Rik van Riel

Rik van Riel wrote:

Nick Piggin wrote:

Rik van Riel wrote:

Nick Piggin wrote:


Rik van Riel wrote:



Here are the transactions/seconds for each combination:


I've added a 5th column, with just your mmap_sem patch and
without my madv_free patch.  It is run with the glibc patch,
which should make it fall back to MADV_DONTNEED after the
first MADV_FREE call fails.

threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem   mmap_sem

      1       610         609                596                    545        534
      2      1032        1136               1196                   1200       1180
      4      1070        1128               2014                   2024       2027
      8      1000        1088               1665                   2087       2089
     16       779        1073               1310                   1999       2012


Now that I think about it - this is all with the rawhide kernel
configuration, which has an ungodly number of debug config
options enabled.

I should try this with a more normal kernel, on various different
systems.

It would also be helpful if other people tried this same benchmark,
and others, on their systems.



Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-22 Thread Nick Piggin

Rik van Riel wrote:


I've added a 5th column, with just your mmap_sem patch and
without my madv_free patch.  It is run with the glibc patch,
which should make it fall back to MADV_DONTNEED after the
first MADV_FREE call fails.


Thanks! (I edited slightly so it doesn't wrap)



threads   vanilla   new glibc   madv_free   mmap_sem   both

      1       610         609         596        534    545
      2      1032        1136        1196       1180   1200
      4      1070        1128        2014       2027   2024
      8      1000        1088        1665       2089   2087
     16       779        1073        1310       2012   1999
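
To put the "factor of 2" claim next to these figures, here is a small C sketch that divides the "both" column by the "vanilla" column. The numbers are copied from the table above; the function names ratio and print_ratios are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define NROWS 5

/* Transactions/second, copied from the table above. */
static const int threads[NROWS] = {1, 2, 4, 8, 16};
static const int vanilla[NROWS] = {610, 1032, 1070, 1000, 779};
static const int both[NROWS]    = {545, 1200, 2024, 2087, 1999};

/* Throughput of a patched configuration relative to a baseline. */
double ratio(int base, int patched)
{
    assert(base > 0);
    return (double)patched / (double)base;
}

/* Print one line per thread count with the both/vanilla ratio. */
void print_ratios(void)
{
    for (int i = 0; i < NROWS; i++)
        printf("%2d threads: both/vanilla = %.2f\n",
               threads[i], ratio(vanilla[i], both[i]));
}
```

The ratio is below 1 for a single thread (545 against 610) and a little above 2 at 8 threads (2087 against 1000), so the doubling only shows up once the workload is parallel.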


Not doing the mprotect calls is the big one I guess, especially
the fact that we don't need to take the mmap_sem for writing.


Yes.



With both our patches, single and two thread performance with
MySQL sysbench is somewhat better than with just your patch,
4 and 8 thread performance are basically the same and just
your patch gives a slight benefit with 16 threads.

I guess I should benchmark up to 64 or 128 threads tomorrow,
to see if this is just luck or if the cache benefit of doing
the page faults and reusing hot pages is faster than not
having page faults at all.

I should run some benchmarks on other systems, too.  Some of
these results could be an artifact of my quad core CPU.  The
results could be very different on other systems...


I'm getting the 16 core box out of retirement as we speak :)

--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-22 Thread Rik van Riel

Nick Piggin wrote:


So where is the down_write coming from in this workload, I wonder?
Heap management? What syscalls?


Trying to answer this question, I straced the mysql threads that
showed up in top when running a single threaded sysbench workload.

There were no mmap, munmap, brk, mprotect or madvise system calls
in the trace.

MySQL has me puzzled, but it seems to have some other people
interested too.

I think I'll go play a bit with ebizzy now, to see how other
workloads are affected by our kernel changes.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-22 Thread Nick Piggin

Jakub Jelinek wrote:

On Fri, Apr 20, 2007 at 07:52:44PM -0400, Rik van Riel wrote:


It turns out that Nick's patch does not improve peak
performance much, but it does prevent the decline when
running with 16 threads on my quad core CPU!

We _definitely_ want both patches, there's a huge benefit
in having them both.

Here are the transactions/seconds for each combination:

threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem

      1       610         609                596                    545
      2      1032        1136               1196                   1200
      4      1070        1128               2014                   2024
      8      1000        1088               1665                   2087
     16       779        1073               1310                   1999



FYI, I have uploaded a testing glibc that uses MADV_FREE and falls back
to MADV_DONTNEED if MADV_FREE is not available, to
http://people.redhat.com/jakub/glibc/2.5.90-21.1/


Hmm, I wonder how glibc malloc stacks up to tcmalloc on this test
(after the mmap_sem patch as well).

I'll try running that as well!

--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-21 Thread Nick Piggin

Nick Piggin wrote:

Rik van Riel wrote:


Andrew Morton wrote:


On Fri, 20 Apr 2007 17:38:06 -0400
Rik van Riel <[EMAIL PROTECTED]> wrote:


Andrew Morton wrote:


I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".

- Nick's patch also will help this problem.  It could be that your patch
  no longer offers a 2x speedup when combined with Nick's patch.

  It could well be that the combination of the two is even better, but it
  would be nice to firm that up a bit.



I'll test that.




Thanks.




Well, good news.

It turns out that Nick's patch does not improve peak
performance much, but it does prevent the decline when
running with 16 threads on my quad core CPU!

We _definitely_ want both patches, there's a huge benefit
in having them both.

Here are the transactions/seconds for each combination:

threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem

      1       610         609                596                    545
      2      1032        1136               1196                   1200
      4      1070        1128               2014                   2024
      8      1000        1088               1665                   2087
     16       779        1073               1310                   1999




Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?

The strange thing with your madv_free kernel is that it doesn't
help single-threaded performance at all. So that work to avoid
zeroing the new page is not a win at all there (maybe due to the
cache effects I was worried about?).

However MADV_FREE does improve scalability, which is interesting.
The most likely reason I can see why that may be the case is that
it avoids mmap_sem when faulting pages back in (I doubt it is due
to avoiding the page allocator, but maybe?).

So where is the down_write coming from in this workload, I wonder?
Heap management? What syscalls?

x86_64's rwsems are crap under heavy parallelism (even read-only),
as I fixed in my recent generic rwsems patch. I don't expect MySQL
to be such a mmap_sem microbenchmark, but I wonder how much this
would help?

What if we ran the private futexes patch to further cut down
mmap_sem contention?


Hmm, without the MADV_FREE patch, I wonder if it isn't doing something
silly like read-faulting in a ZERO_PAGE then write faulting a new page
straight afterwards.. I'll have to try a few tests.

--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-21 Thread Nick Piggin

Rik van Riel wrote:

Andrew Morton wrote:


On Fri, 20 Apr 2007 17:38:06 -0400
Rik van Riel <[EMAIL PROTECTED]> wrote:


Andrew Morton wrote:


I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".

- Nick's patch also will help this problem.  It could be that your patch
  no longer offers a 2x speedup when combined with Nick's patch.

  It could well be that the combination of the two is even better, but it
  would be nice to firm that up a bit.


I'll test that.



Thanks.



Well, good news.

It turns out that Nick's patch does not improve peak
performance much, but it does prevent the decline when
running with 16 threads on my quad core CPU!

We _definitely_ want both patches, there's a huge benefit
in having them both.

Here are the transactions/seconds for each combination:

threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem

      1       610         609                596                    545
      2      1032        1136               1196                   1200
      4      1070        1128               2014                   2024
      8      1000        1088               1665                   2087
     16       779        1073               1310                   1999



Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?

The strange thing with your madv_free kernel is that it doesn't
help single-threaded performance at all. So that work to avoid
zeroing the new page is not a win at all there (maybe due to the
cache effects I was worried about?).

However MADV_FREE does improve scalability, which is interesting.
The most likely reason I can see why that may be the case is that
it avoids mmap_sem when faulting pages back in (I doubt it is due
to avoiding the page allocator, but maybe?).

So where is the down_write coming from in this workload, I wonder?
Heap management? What syscalls?

x86_64's rwsems are crap under heavy parallelism (even read-only),
as I fixed in my recent generic rwsems patch. I don't expect MySQL
to be such a mmap_sem microbenchmark, but I wonder how much this
would help?

What if we ran the private futexes patch to further cut down
mmap_sem contention?

--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-21 Thread Rik van Riel

Hugh Dickins wrote:

On Fri, 20 Apr 2007, Rik van Riel wrote:

Andrew Morton wrote:


  I do go on about that.  But we're adding page flags at about one per
  year, and when we run out we're screwed - we'll need to grow the
  pageframe.

If you want, I can take a look at folding this into the
->mapping pointer.  I can guarantee you it won't be
pretty, though :)


Please don't.  If we're going to stuff another pageflag into there,
let it be PageSwapCache the natural partner of PageAnon, rather than
whatever our latest pageflag happens to be. 


I looked at doing what Andrew wanted, and it did indeed not
look like the right thing to do.  The locking on page->mapping
is the kind of locking we want to avoid during zap_page_range
and in the pageout code.

I like your suggestion better.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2

2007-04-21 Thread Ulrich Drepper

On 4/21/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:

> But the Linux MADV_DONTNEED does throw away
> data from a PROT_WRITE,MAP_PRIVATE mapping (or brk or stack) - those
> changes are discarded, and a subsequent access will revert to zeroes
> or the underlying mapped file.  Been like that since before 2.4.0.


I didn't say it changed.  I just said that there is a hole in the
current implementation: it does not allow POSIX_MADV_DONTNEED to be
implemented as anything but a no-op.  The POSIX_MADV_DONTNEED
behavior is useful, and IMO something should be added to allow
implementing it.
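
A minimal C sketch of the POSIX side of this, assuming a libc that provides posix_madvise(). The helper name posix_dontneed_keeps_data is hypothetical. Because POSIX_MADV_DONTNEED must never lose data, the bytes written before the call are expected to still be there afterwards.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: dirty an anonymous private mapping, give the
 * POSIX "don't need" hint, and check whether the data survived.
 * Returns 1 if the data is intact, 0 if it was lost, -1 on setup error.
 */
int posix_dontneed_keeps_data(size_t len)
{
    assert(len > 0);

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0x5A, len);

    /* posix_madvise() returns an error number directly, not -1/errno. */
    if (posix_madvise(p, len, POSIX_MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    int intact = (p[0] == 0x5A && p[len - 1] == 0x5A);
    munmap(p, len);
    return intact;
}
```

posix_dontneed_keeps_data(8192) should return 1 on a conforming implementation, whether the hint is a no-op or a real operation.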


Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2

2007-04-21 Thread Hugh Dickins
On Fri, 20 Apr 2007, Ulrich Drepper wrote:
> 
> Just for reference: the MADV_CURRENT behavior is to throw away data in
> the range.

Not exactly.  The Linux MADV_DONTNEED never throws away data from a
PROT_WRITE,MAP_SHARED mapping (or shm) - it propagates the dirty bit,
the page will eventually get written out to file, and can be retrieved
later by subsequent access.  But the Linux MADV_DONTNEED does throw away
data from a PROT_WRITE,MAP_PRIVATE mapping (or brk or stack) - those
changes are discarded, and a subsequent access will revert to zeroes
or the underlying mapped file.  Been like that since before 2.4.0.
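
The MAP_PRIVATE case can be sketched in C as follows, assuming Linux semantics for MADV_DONTNEED on an anonymous private mapping. The helper name dontneed_discards_private is hypothetical. Other systems may treat the same advice value differently.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: dirty an anonymous MAP_PRIVATE mapping, apply
 * the Linux MADV_DONTNEED, and check what a later read sees.
 * Returns 1 if the pages read back as zeroes, 0 if the old data is
 * still visible, -1 on error.
 */
int dontneed_discards_private(size_t len)
{
    assert(len > 0);

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0x5A, len);           /* private changes to be discarded */

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    /* The next access faults in fresh zero-filled pages. */
    int zeroed = (p[0] == 0 && p[len - 1] == 0);
    munmap(p, len);
    return zeroed;
}
```

On Linux dontneed_discards_private(8192) is expected to return 1. For a MAP_SHARED file mapping the dirty data would be written back instead of being discarded.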

> The POSIX_MADV_DONTNEED behavior is to never lose data.
> I.e., file backed data is written back, anon data is at most swapped
> out.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-21 Thread Hugh Dickins
On Fri, 20 Apr 2007, Rik van Riel wrote:
> Andrew Morton wrote:
> 
> >   I do go on about that.  But we're adding page flags at about one per
> >   year, and when we run out we're screwed - we'll need to grow the
> >   pageframe.
> 
> If you want, I can take a look at folding this into the
> ->mapping pointer.  I can guarantee you it won't be
> pretty, though :)

Please don't.  If we're going to stuff another pageflag into there,
let it be PageSwapCache the natural partner of PageAnon, rather than
whatever our latest pageflag happens to be.  I'll look into it - but
do keep an eye on me, I've developed a dubious track record of
obstructing other people's attempts to save pageflags.

Hugh


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-21 Thread Jakub Jelinek
On Fri, Apr 20, 2007 at 07:52:44PM -0400, Rik van Riel wrote:
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
> 
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
>
> Here are the transactions/seconds for each combination:
>
> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>       1      610        609               596                   545
>       2     1032       1136              1196                  1200
>       4     1070       1128              2014                  2024
>       8     1000       1088              1665                  2087
>      16      779       1073              1310                  1999

FYI, I have uploaded a testing glibc that uses MADV_FREE and falls back
to MADV_DONTNEED if MADV_FREE is not available, to
http://people.redhat.com/jakub/glibc/2.5.90-21.1/
and I'm also attaching the glibc patch for those who want to build it
themselves:

2007-04-19  Ulrich Drepper  <[EMAIL PROTECTED]>
Jakub Jelinek  <[EMAIL PROTECTED]>

* malloc/arena.c (heap_info): Add mprotect_size field, adjust pad.
(new_heap): Initialize mprotect_size.
(no_madv_free): New variable.
(grow_heap): When growing, only mprotect from mprotect_size till
new_size if mprotect_size is smaller.  When shrinking, use PROT_NONE
MMAP for __libc_enable_secure only, otherwise if MADV_FREE is
available use it and fall back to MADV_DONTNEED.
* sysdeps/unix/sysv/linux/alpha/bits/mman.h (MADV_FREE): Define.
* sysdeps/unix/sysv/linux/ia64/bits/mman.h (MADV_FREE): Likewise.
* sysdeps/unix/sysv/linux/i386/bits/mman.h (MADV_FREE): Likewise.
* sysdeps/unix/sysv/linux/s390/bits/mman.h (MADV_FREE): Likewise.
* sysdeps/unix/sysv/linux/powerpc/bits/mman.h (MADV_FREE): Likewise.
* sysdeps/unix/sysv/linux/x86_64/bits/mman.h (MADV_FREE): Likewise.
* sysdeps/unix/sysv/linux/sparc/bits/mman.h (MADV_FREE): Likewise.
* sysdeps/unix/sysv/linux/sh/bits/mman.h (MADV_FREE): Likewise.

--- libc/malloc/arena.c.jj  2006-10-31 23:05:31.0 +0100
+++ libc/malloc/arena.c 2007-04-19 18:54:20.0 +0200
@@ -1,5 +1,6 @@
 /* Malloc implementation for multiple threads without lock contention.
-   Copyright (C) 2001,2002,2003,2004,2005,2006 Free Software Foundation, Inc.
+   Copyright (C) 2001,2002,2003,2004,2005,2006,2007
+   Free Software Foundation, Inc.
This file is part of the GNU C Library.
Contributed by Wolfram Gloger <[EMAIL PROTECTED]>, 2001.
 
@@ -59,10 +60,12 @@ typedef struct _heap_info {
   mstate ar_ptr; /* Arena for this heap. */
   struct _heap_info *prev; /* Previous heap. */
   size_t size;   /* Current size in bytes. */
+  size_t mprotect_size;/* Size in bytes that has been mprotected
+  PROT_READ|PROT_WRITE.  */
   /* Make sure the following data is properly aligned, particularly
  that sizeof (heap_info) + 2 * SIZE_SZ is a multiple of
- MALLOG_ALIGNMENT. */
-  char pad[-5 * SIZE_SZ & MALLOC_ALIGN_MASK];
+ MALLOC_ALIGNMENT. */
+  char pad[-6 * SIZE_SZ & MALLOC_ALIGN_MASK];
 } heap_info;
 
 /* Get a compile-time error if the heap_info padding is not correct
@@ -692,10 +695,15 @@ new_heap(size, top_pad) size_t size, top
   }
   h = (heap_info *)p2;
   h->size = size;
+  h->mprotect_size = size;
   THREAD_STAT(stat_n_heaps++);
   return h;
 }
 
+#if defined _LIBC && defined MADV_FREE
+static int no_madv_free;
+#endif
+
 /* Grow or shrink a heap.  size is automatically rounded up to a
multiple of the page size if it is positive. */
 
@@ -714,17 +722,49 @@ grow_heap(h, diff) heap_info *h; long di
 new_size = (long)h->size + diff;
 if((unsigned long) new_size > (unsigned long) HEAP_MAX_SIZE)
   return -1;
-if(mprotect((char *)h + h->size, diff, PROT_READ|PROT_WRITE) != 0)
-  return -2;
+if((unsigned long) new_size > h->mprotect_size) {
+  if (mprotect((char *)h + h->mprotect_size,
+  (unsigned long) new_size - h->mprotect_size,
+  PROT_READ|PROT_WRITE) != 0)
+   return -2;
+  h->mprotect_size = new_size;
+}
   } else {
 new_size = (long)h->size + diff;
 if(new_size < (long)sizeof(*h))
   return -1;
 /* Try to re-map the extra heap space freshly to save memory, and
make it inaccessible. */
-if((char *)MMAP((char *)h + new_size, -diff, PROT_NONE,
-MAP_PRIVATE|MAP_FIXED) == (char *) MAP_FAILED)
-  return -2;
+#ifdef _LIBC
+if (__builtin_expect (__libc_enable_secure, 0))
+#else
+if (1)
+#endif
+  {
+   if((char *)MMAP((char *)h + new_size, -diff, PROT_NONE,
+   MAP_PRIVATE|MAP_FIXED) == (char *) MAP_FAILED)
+ return -2;
+   h->mprotect_size = new_size;
+  }
+#ifdef _LIBC
+else
+  {
+# ifdef MADV_FREE
+   if (!__builtin_expect (no_madv_free, 0))

Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-21 Thread Jakub Jelinek
On Fri, Apr 20, 2007 at 07:52:44PM -0400, Rik van Riel wrote:
 It turns out that Nick's patch does not improve peak
 performance much, but it does prevent the decline when
 running with 16 threads on my quad core CPU!
 
 We _definately_ want both patches, there's a huge benefit
 in having them both.
 
 Here are the transactions/seconds for each combination:
 
vanilla   new glibc  madv_free kernel   madv_free + mmap_sem
 threads
 
 1 610 609 596545
 2103211361196   1200
 4107011282014   2024
 8100010881665   2087
 1677910731310   1999

FYI, I have uploaded a testing glibc that uses MADV_FREE and falls back
to MADV_DONTUSE if MADV_FREE is not available, to
http://people.redhat.com/jakub/glibc/2.5.90-21.1/
and I'm also attaching the glibc patch for those who want to build it
themselves:

2007-04-19  Ulrich Drepper  [EMAIL PROTECTED]
Jakub Jelinek  [EMAIL PROTECTED]

* malloc/arena.c (heap_info): Add mprotect_size field, adjust pad.
(new_heap): Initialize mprotect_size.
(no_madv_free): New variable.
(grow_heap): When growing, only mprotect from mprotect_size till
new_size if mprotect_size is smaller.  When shrinking, use PROT_NONE
MMAP for __libc_enable_secure only, otherwise if MADV_FREE is
available use it and fall back to MADV_DONTNEED.
* sysdeps/unix/sysv/linux/alpha/bits/mman.h (MADV_FREE): Define.
* sysdeps/unix/sysv/linux/ia64/bits/mman.h (MADV_FREE): Likewise.
* sysdeps/unix/sysv/linux/i386/bits/mman.h (MADV_FREE): Likewise.
* sysdeps/unix/sysv/linux/s390/bits/mman.h (MADV_FREE): Likewise.
* sysdeps/unix/sysv/linux/powerpc/bits/mman.h (MADV_FREE): Likewise.
* sysdeps/unix/sysv/linux/x86_64/bits/mman.h (MADV_FREE): Likewise.
* sysdeps/unix/sysv/linux/sparc/bits/mman.h (MADV_FREE): Likewise.
* sysdeps/unix/sysv/linux/sh/bits/mman.h (MADV_FREE): Likewise.

--- libc/malloc/arena.c.jj  2006-10-31 23:05:31.0 +0100
+++ libc/malloc/arena.c 2007-04-19 18:54:20.0 +0200
@@ -1,5 +1,6 @@
 /* Malloc implementation for multiple threads without lock contention.
-   Copyright (C) 2001,2002,2003,2004,2005,2006 Free Software Foundation, Inc.
+   Copyright (C) 2001,2002,2003,2004,2005,2006,2007
+   Free Software Foundation, Inc.
This file is part of the GNU C Library.
Contributed by Wolfram Gloger [EMAIL PROTECTED], 2001.
 
@@ -59,10 +60,12 @@ typedef struct _heap_info {
   mstate ar_ptr; /* Arena for this heap. */
   struct _heap_info *prev; /* Previous heap. */
   size_t size;   /* Current size in bytes. */
+  size_t mprotect_size;/* Size in bytes that has been mprotected
+  PROT_READ|PROT_WRITE.  */
   /* Make sure the following data is properly aligned, particularly
  that sizeof (heap_info) + 2 * SIZE_SZ is a multiple of
- MALLOG_ALIGNMENT. */
-  char pad[-5 * SIZE_SZ  MALLOC_ALIGN_MASK];
+ MALLOC_ALIGNMENT. */
+  char pad[-6 * SIZE_SZ  MALLOC_ALIGN_MASK];
 } heap_info;
 
 /* Get a compile-time error if the heap_info padding is not correct
@@ -692,10 +695,15 @@ new_heap(size, top_pad) size_t size, top
   }
   h = (heap_info *)p2;
   h-size = size;
+  h-mprotect_size = size;
   THREAD_STAT(stat_n_heaps++);
   return h;
 }
 
+#if defined _LIBC  defined MADV_FREE
+static int no_madv_free;
+#endif
+
 /* Grow or shrink a heap.  size is automatically rounded up to a
multiple of the page size if it is positive. */
 
@@ -714,17 +722,49 @@ grow_heap(h, diff) heap_info *h; long di
 new_size = (long)h-size + diff;
 if((unsigned long) new_size  (unsigned long) HEAP_MAX_SIZE)
   return -1;
-if(mprotect((char *)h + h-size, diff, PROT_READ|PROT_WRITE) != 0)
-  return -2;
+if((unsigned long) new_size  h-mprotect_size) {
+  if (mprotect((char *)h + h-mprotect_size,
+  (unsigned long) new_size - h-mprotect_size,
+  PROT_READ|PROT_WRITE) != 0)
+   return -2;
+  h-mprotect_size = new_size;
+}
   } else {
 new_size = (long)h-size + diff;
 if(new_size  (long)sizeof(*h))
   return -1;
 /* Try to re-map the extra heap space freshly to save memory, and
make it inaccessible. */
-if((char *)MMAP((char *)h + new_size, -diff, PROT_NONE,
-MAP_PRIVATE|MAP_FIXED) == (char *) MAP_FAILED)
-  return -2;
+#ifdef _LIBC
+if (__builtin_expect (__libc_enable_secure, 0))
+#else
+if (1)
+#endif
+  {
+   if((char *)MMAP((char *)h + new_size, -diff, PROT_NONE,
+   MAP_PRIVATE|MAP_FIXED) == (char *) MAP_FAILED)
+ return -2;
+   h-mprotect_size = new_size;
+  }
+#ifdef _LIBC
+else
+  {
+# ifdef MADV_FREE
+   if (!__builtin_expect (no_madv_free, 0))
+ {
+   if 

Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-21 Thread Hugh Dickins
On Fri, 20 Apr 2007, Rik van Riel wrote:
 Andrew Morton wrote:
 
I do go on about that.  But we're adding page flags at about one per
year, and when we run out we're screwed - we'll need to grow the
pageframe.
 
 If you want, I can take a look at folding this into the
 -mapping pointer.  I can guarantee you it won't be
 pretty, though :)

Please don't.  If we're going to stuff another pageflag into there,
let it be PageSwapCache the natural partner of PageAnon, rather than
whatever our latest pageflag happens to be.  I'll look into it - but
do keep an eye on me, I've developed a dubious track record of
obstructing other people's attempts to save pageflags.

Hugh
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2

2007-04-21 Thread Hugh Dickins
On Fri, 20 Apr 2007, Ulrich Drepper wrote:
 
 Just for reference: the MADV_CURRENT behavior is to throw away data in
 the range.

Not exactly.  The Linux MADV_DONTNEED never throws away data from a
PROT_WRITE,MAP_SHARED mapping (or shm) - it propagates the dirty bit,
the page will eventually get written out to file, and can be retrieved
later by subsequent access.  But the Linux MADV_DONTNEED does throw away
data from a PROT_WRITE,MAP_PRIVATE mapping (or brk or stack) - those
changes are discarded, and a subsequent access will revert to zeroes
or the underlying mapped file.  Been like that since before 2.4.0.

 The POSIX_MADV_DONTNEED behavior is to never lose data.
 I.e., file backed data is written back, anon data is at most swapped
 out.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2

2007-04-21 Thread Ulrich Drepper

On 4/21/07, Hugh Dickins [EMAIL PROTECTED] wrote:

But the Linux MADV_DONTNEED does throw away
data from a PROT_WRITE,MAP_PRIVATE mapping (or brk or stack) - those
changes are discarded, and a subsequent access will revert to zeroes
or the underlying mapped file.  Been like that since before 2.4.0.


I didn't say it changed.  I just say that there is a hole in the
current implementation as it does not allow to implement
POSIX_MADV_DONTNEED with anything but a no-op.  The
POSIX_MADV_DONTNEED behavior is useful and something IMO should be
added to allow implementing it.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-21 Thread Rik van Riel

Hugh Dickins wrote:
> On Fri, 20 Apr 2007, Rik van Riel wrote:
> > Andrew Morton wrote:
> >
> > >   I do go on about that.  But we're adding page flags at about one per
> > >   year, and when we run out we're screwed - we'll need to grow the
> > >   pageframe.
> >
> > If you want, I can take a look at folding this into the
> > ->mapping pointer.  I can guarantee you it won't be
> > pretty, though :)
>
> Please don't.  If we're going to stuff another pageflag into there,
> let it be PageSwapCache the natural partner of PageAnon, rather than
> whatever our latest pageflag happens to be.

I looked at doing what Andrew wanted, and it did indeed not
look like the right thing to do.  The locking on page->mapping
is the kind of locking we want to avoid during zap_page_range
and in the pageout code.

I like your suggestion better.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-21 Thread Nick Piggin

Rik van Riel wrote:
> Andrew Morton wrote:
> > On Fri, 20 Apr 2007 17:38:06 -0400
> > Rik van Riel <[EMAIL PROTECTED]> wrote:
> >
> > > Andrew Morton wrote:
> > >
> > > > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > > >
> > > > - Nick's patch also will help this problem.  It could be that your patch
> > > >   no longer offers a 2x speedup when combined with Nick's patch.
> > > >
> > > >   It could well be that the combination of the two is even better, but it
> > > >   would be nice to firm that up a bit.
> > >
> > > I'll test that.
> >
> > Thanks.
>
> Well, good news.
>
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
>
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
>
> Here are the transactions/seconds for each combination:
>
> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>       1      610        609               596                   545
>       2     1032       1136              1196                  1200
>       4     1070       1128              2014                  2024
>       8     1000       1088              1665                  2087
>      16      779       1073              1310                  1999



Is new glibc meaning MADV_DONTNEED + kernel with mmap_sem patch?

The strange thing with your madv_free kernel is that it doesn't
help single-threaded performance at all. So that work to avoid
zeroing the new page is not a win at all there (maybe due to the
cache effects I was worried about?).

However MADV_FREE does improve scalability, which is interesting.
The most likely reason I can see why that may be the case is that
it avoids mmap_sem when faulting pages back in (I doubt it is due
to avoiding the page allocator, but maybe?).

So where is the down_write coming from in this workload, I wonder?
Heap management? What syscalls?

x86_64's rwsems are crap under heavy parallelism (even read-only),
as I fixed in my recent generic rwsems patch. I don't expect MySQL
to be such a mmap_sem microbenchmark, but I wonder how much this
would help?

What if we ran the private futexes patch to further cut down
mmap_sem contention?

--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-21 Thread Nick Piggin

Nick Piggin wrote:
> Rik van Riel wrote:
> > Andrew Morton wrote:
> > > On Fri, 20 Apr 2007 17:38:06 -0400
> > > Rik van Riel <[EMAIL PROTECTED]> wrote:
> > >
> > > > Andrew Morton wrote:
> > > >
> > > > > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > > > >
> > > > > - Nick's patch also will help this problem.  It could be that your patch
> > > > >   no longer offers a 2x speedup when combined with Nick's patch.
> > > > >
> > > > >   It could well be that the combination of the two is even better, but it
> > > > >   would be nice to firm that up a bit.
> > > >
> > > > I'll test that.
> > >
> > > Thanks.
> >
> > Well, good news.
> >
> > It turns out that Nick's patch does not improve peak
> > performance much, but it does prevent the decline when
> > running with 16 threads on my quad core CPU!
> >
> > We _definitely_ want both patches, there's a huge benefit
> > in having them both.
> >
> > Here are the transactions/seconds for each combination:
> >
> > threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
> >       1      610        609               596                   545
> >       2     1032       1136              1196                  1200
> >       4     1070       1128              2014                  2024
> >       8     1000       1088              1665                  2087
> >      16      779       1073              1310                  1999
>
> Is new glibc meaning MADV_DONTNEED + kernel with mmap_sem patch?
>
> The strange thing with your madv_free kernel is that it doesn't
> help single-threaded performance at all. So that work to avoid
> zeroing the new page is not a win at all there (maybe due to the
> cache effects I was worried about?).
>
> However MADV_FREE does improve scalability, which is interesting.
> The most likely reason I can see why that may be the case is that
> it avoids mmap_sem when faulting pages back in (I doubt it is due
> to avoiding the page allocator, but maybe?).
>
> So where is the down_write coming from in this workload, I wonder?
> Heap management? What syscalls?
>
> x86_64's rwsems are crap under heavy parallelism (even read-only),
> as I fixed in my recent generic rwsems patch. I don't expect MySQL
> to be such a mmap_sem microbenchmark, but I wonder how much this
> would help?
>
> What if we ran the private futexes patch to further cut down
> mmap_sem contention?


Hmm, without the MADV_FREE patch, I wonder if it isn't doing something
silly like read-faulting in a ZERO_PAGE then write faulting a new page
straight afterwards.. I'll have to try a few tests.

--
SUSE Labs, Novell Inc.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-20 Thread Rik van Riel

Eric Dumazet wrote:
> Rik van Riel wrote:
> > Andrew Morton wrote:
> > > On Fri, 20 Apr 2007 17:38:06 -0400
> > > Rik van Riel <[EMAIL PROTECTED]> wrote:
> > >
> > > > Andrew Morton wrote:
> > > >
> > > > > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > > > >
> > > > > - Nick's patch also will help this problem.  It could be that your patch
> > > > >   no longer offers a 2x speedup when combined with Nick's patch.
> > > > >
> > > > >   It could well be that the combination of the two is even better, but it
> > > > >   would be nice to firm that up a bit.
> > > >
> > > > I'll test that.
> > >
> > > Thanks.
> >
> > Well, good news.
> >
> > It turns out that Nick's patch does not improve peak
> > performance much, but it does prevent the decline when
> > running with 16 threads on my quad core CPU!
> >
> > We _definitely_ want both patches, there's a huge benefit
> > in having them both.
> >
> > Here are the transactions/seconds for each combination:
> >
> > threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
> >       1      610        609               596                   545
>
> 545 tps versus 610 tps for one thread? It seems quite bad, no?
>
> Could you please find an explanation for this?

I have no idea why this happens.  Especially the last one,
going from a write lock to a read lock on the mmap_sem
should not make ANY difference whatsoever since we're
running single threaded!

> >       2     1032       1136              1196                  1200
> >       4     1070       1128              2014                  2024
> >       8     1000       1088              1665                  2087
> >      16      779       1073              1310                  1999


Performance with 2 database threads is way better though,
and performance with 4 or more threads more than doubles...

If you have an explanation on why single threaded performance
went down a little on my quad core system, please let me know.

Does performance suffer at all on a real UP system?
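
To put numbers on the comparison, here is a small sketch that computes the throughput of the "madv_free + mmap_sem" column relative to vanilla. The figures are the ones from the table in this thread, read as five columns per row; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the benchmark table: transactions per second. */
struct result {
    int threads;
    int vanilla;
    int new_glibc;
    int madv_free;
    int madv_free_mmap_sem;
};

const struct result results[] = {
    {  1,  610,  609,  596,  545 },
    {  2, 1032, 1136, 1196, 1200 },
    {  4, 1070, 1128, 2014, 2024 },
    {  8, 1000, 1088, 1665, 2087 },
    { 16,  779, 1073, 1310, 1999 },
};
const int n_results = sizeof results / sizeof results[0];

/* Throughput with both patches relative to vanilla, in percent,
   rounded down by integer division. */
int relative_percent(const struct result *r)
{
    assert(r->vanilla > 0);
    return r->madv_free_mmap_sem * 100 / r->vanilla;
}

/* Print one line per thread count. */
void print_relative(void)
{
    for (int i = 0; i < n_results; i++)
        printf("%2d threads: %d%% of vanilla\n",
               results[i].threads, relative_percent(&results[i]));
}
```

By this arithmetic the single-threaded case lands at 89% of vanilla, 4 threads at 189% (just short of doubling), and 8 and 16 threads at 208% and 256%.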

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-20 Thread Eric Dumazet

Rik van Riel wrote:
> Andrew Morton wrote:
> > On Fri, 20 Apr 2007 17:38:06 -0400
> > Rik van Riel <[EMAIL PROTECTED]> wrote:
> >
> > > Andrew Morton wrote:
> > >
> > > > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > > >
> > > > - Nick's patch also will help this problem.  It could be that your patch
> > > >   no longer offers a 2x speedup when combined with Nick's patch.
> > > >
> > > >   It could well be that the combination of the two is even better, but it
> > > >   would be nice to firm that up a bit.
> > >
> > > I'll test that.
> >
> > Thanks.
>
> Well, good news.
>
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
>
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
>
> Here are the transactions/seconds for each combination:
>
> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>       1      610        609               596                   545

545 tps versus 610 tps for one thread? It seems quite bad, no?

Could you please find an explanation for this?

>       2     1032       1136              1196                  1200
>       4     1070       1128              2014                  2024
>       8     1000       1088              1665                  2087
>      16      779       1073              1310                  1999




Thank you


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-20 Thread Rik van Riel

Andrew Morton wrote:
> On Fri, 20 Apr 2007 17:38:06 -0400
> Rik van Riel <[EMAIL PROTECTED]> wrote:
>
> > Andrew Morton wrote:
> >
> > > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > >
> > > - Nick's patch also will help this problem.  It could be that your patch
> > >   no longer offers a 2x speedup when combined with Nick's patch.
> > >
> > >   It could well be that the combination of the two is even better, but it
> > >   would be nice to firm that up a bit.
> >
> > I'll test that.
>
> Thanks.

Well, good news.

It turns out that Nick's patch does not improve peak
performance much, but it does prevent the decline when
running with 16 threads on my quad core CPU!

We _definitely_ want both patches, there's a huge benefit
in having them both.

Here are the transactions/seconds for each combination:

threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
      1      610        609               596                   545
      2     1032       1136              1196                  1200
      4     1070       1128              2014                  2024
      8     1000       1088              1665                  2087
     16      779       1073              1310                  1999


--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-20 Thread Andrew Morton
On Fri, 20 Apr 2007 17:38:06 -0400
Rik van Riel <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> 
> > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > 
> > - Nick's patch also will help this problem.  It could be that your patch
> >   no longer offers a 2x speedup when combined with Nick's patch.
> > 
> >   It could well be that the combination of the two is even better, but it
> >   would be nice to firm that up a bit.  
> 
> I'll test that.

Thanks.

> >   I do go on about that.  But we're adding page flags at about one per
> >   year, and when we run out we're screwed - we'll need to grow the
> >   pageframe.
> 
> If you want, I can take a look at folding this into the
> ->mapping pointer.  I can guarantee you it won't be
> pretty, though :)

Well, let's see how fugly it ends up looking?

> > - I need to update your patch for Nick's patch.  Please confirm that
> >   down_read(mmap_sem) is sufficient for MADV_FREE.
> 
> It is.  MADV_FREE needs no more protection than MADV_DONTNEED.
> 
> > Stylistic nit:
> > 
> >> +  if (PageLazyFree(page) && !migration) {
> >> +  /* There is new data in the page.  Reinstate it. */
> >> +  if (unlikely(pte_dirty(pteval))) {
> >> +  set_pte_at(mm, address, pte, pteval);
> >> +  ret = SWAP_FAIL;
> >> +  goto out_unmap;
> >> +  }
> > 
> > The comment should be inside the second `if' statement.  As it is, it
> > looks like we reinstate the page if (PageLazyFree(page) && !migration).
> 
> Want me to move it?

I did that, thanks.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-20 Thread Rik van Riel

Andrew Morton wrote:

> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>
> - Nick's patch also will help this problem.  It could be that your patch
>   no longer offers a 2x speedup when combined with Nick's patch.
>
>   It could well be that the combination of the two is even better, but it
>   would be nice to firm that up a bit.

I'll test that.

>   I do go on about that.  But we're adding page flags at about one per
>   year, and when we run out we're screwed - we'll need to grow the
>   pageframe.

If you want, I can take a look at folding this into the
->mapping pointer.  I can guarantee you it won't be
pretty, though :)

> - I need to update your patch for Nick's patch.  Please confirm that
>   down_read(mmap_sem) is sufficient for MADV_FREE.

It is.  MADV_FREE needs no more protection than MADV_DONTNEED.

> Stylistic nit:
>
> > +        if (PageLazyFree(page) && !migration) {
> > +                /* There is new data in the page.  Reinstate it. */
> > +                if (unlikely(pte_dirty(pteval))) {
> > +                        set_pte_at(mm, address, pte, pteval);
> > +                        ret = SWAP_FAIL;
> > +                        goto out_unmap;
> > +                }
>
> The comment should be inside the second `if' statement.  As it is, it
> looks like we reinstate the page if (PageLazyFree(page) && !migration).

Want me to move it?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2

2007-04-20 Thread Ulrich Drepper

On 4/20/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> OK, we need to flesh this out a lot please.  People often get confused
> about what our MADV_DONTNEED behaviour is.


Well, there's not really much to flesh out.  The current MADV_DONTNEED
is useful in some situations.  The behavior cannot be changed; even
glibc will rely on it for the case when MADV_FREE is not supported.

What might be nice to have is a POSIX-compliant POSIX_MADV_DONTNEED
implementation.  We currently do nothing, which is OK since no test
suite can detect that.  But some code might want to use the real
behavior and we're missing an optimization possibility.

Just for reference: the MADV_CURRENT behavior is to throw away data in
the range.  The POSIX_MADV_DONTNEED behavior is to never lose data.
I.e., file backed data is written back, anon data is at most swapped
out.
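
A small sketch of the "never lose data" contract, assuming a Linux system whose libc provides posix_madvise(). The function name data_survives_posix_dontneed is made up for illustration. Whether the call is a no-op, as described above for glibc, or a real POSIX-style implementation, the check should hold.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Dirty one private anonymous page, call
   posix_madvise(POSIX_MADV_DONTNEED) on it, and report whether the data
   is still there.  Returns 1 if intact, 0 if lost, -1 on setup failure.
   (Hypothetical helper, for illustration only.) */
int data_survives_posix_dontneed(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0x5A, len);
    assert(p[0] == 0x5A);          /* sanity check: the write landed */

    /* posix_madvise() returns an error number directly rather than
       returning -1 and setting errno. */
    if (posix_madvise(p, len, POSIX_MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    int intact = (p[0] == 0x5A && p[len - 1] == 0x5A);
    munmap(p, len);
    return intact;
}
```

The contrast with the raw madvise(MADV_DONTNEED) call is the point: the same page passed to the Linux-specific advice would read back as zeroes.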


Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2

2007-04-20 Thread Andrew Morton
On Thu, 19 Apr 2007 17:15:28 -0400
Rik van Riel <[EMAIL PROTECTED]> wrote:

> Restore MADV_DONTNEED to its original Linux behaviour.  This is still
> not the same behaviour as POSIX, but applications may be depending on
> the Linux behaviour already. Besides, glibc catches POSIX_MADV_DONTNEED
> and makes sure nothing is done...

OK, we need to flesh this out a lot please.  People often get confused
about what our MADV_DONTNEED behaviour is.  I regularly forget, then look
at the code, then get it wrong.  That's for mainline, let alone older
kernels whose behaviour is gawd-knows-what.

So...  For the changelog (and the manpage) could we please have a full
description of the 2.6.21 behaviour and the 2.6.21-post-rik behaviour (and
the 2.4 behaviour, if it differs at all)?  Also some code comments to
demystify all of this once and for all?

Thanks.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-20 Thread Andrew Morton
On Tue, 17 Apr 2007 03:15:51 -0400
Rik van Riel <[EMAIL PROTECTED]> wrote:

> Make it possible for applications to have the kernel free memory
> lazily.  This reduces a repeated free/malloc cycle from freeing
> pages and allocating them, to just marking them freeable.  If the
> application wants to reuse them before the kernel needs the memory,
> not even a page fault will happen.
> 
> This patch, together with Ulrich's glibc change, increases
> MySQL sysbench performance by a factor of 2 on my quad core
> test system.
> 
> Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
> 
> ---
> Ulrich Drepper has test glibc RPMS for this functionality at:
> 
>  http://people.redhat.com/drepper/rpms
> 
> Andrew, I have stress tested this patch for a few days now and
> have not been able to find any more bugs.  I believe it is ready
> to be merged in -mm, and upstream at the next merge window.
> 
> When the patch goes upstream, I will submit a small follow-up
> patch to revert MADV_DONTNEED behaviour to what it did previously
> and have the new behaviour trigger only on MADV_FREE: at that
> point people will have to get new test RPMs of glibc.
> 
> 

I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".

- Nick's patch also will help this problem.  It could be that your patch
  no longer offers a 2x speedup when combined with Nick's patch.

  It could well be that the combination of the two is even better, but it
  would be nice to firm that up a bit.  Chewing a page flag is an expensive
  thing to do.

  I do go on about that.  But we're adding page flags at about one per
  year, and when we run out we're screwed - we'll need to grow the
  pageframe.

- I need to update your patch for Nick's patch.  Please confirm that
  down_read(mmap_sem) is sufficient for MADV_FREE.


Stylistic nit:

> + if (PageLazyFree(page) && !migration) {
> + /* There is new data in the page.  Reinstate it. */
> + if (unlikely(pte_dirty(pteval))) {
> + set_pte_at(mm, address, pte, pteval);
> + ret = SWAP_FAIL;
> + goto out_unmap;
> + }

The comment should be inside the second `if' statement.  As it is, it
looks like we reinstate the page if (PageLazyFree(page) && !migration).



Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2

2007-04-20 Thread Andrew Morton
On Thu, 19 Apr 2007 17:15:28 -0400
Rik van Riel <[EMAIL PROTECTED]> wrote:

> Restore MADV_DONTNEED to its original Linux behaviour.  This is still
> not the same behaviour as POSIX, but applications may be depending on
> the Linux behaviour already. Besides, glibc catches POSIX_MADV_DONTNEED
> and makes sure nothing is done...

OK, we need to flesh this out a lot please.  People often get confused
about what our MADV_DONTNEED behaviour is.  I regularly forget, then look
at the code, then get it wrong.  That's for mainline, let alone older
kernels whose behaviour is gawd-knows-what.

So...  For the changelog (and the manpage) could we please have a full
description of the 2.6.21 behaviour and the 2.6.21-post-rik behaviour (and
the 2.4 behaviour, if it differs at all)?  Also some code comments to
demystify all of this once and for all?

Thanks.


Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2

2007-04-20 Thread Ulrich Drepper

On 4/20/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> OK, we need to flesh this out a lot please.  People often get confused
> about what our MADV_DONTNEED behaviour is.


Well, there's not really much to flesh out.  The current MADV_DONTNEED
is useful in some situations.  The behavior cannot be changed, even
glibc will rely on it for the case when MADV_FREE is not supported.

What might be nice is a POSIX-compliant POSIX_MADV_DONTNEED
implementation.  We currently do nothing, which is OK since no test
suite can detect that.  But some code might want the real behavior,
and we're missing an optimization possibility.

Just for reference: the current MADV_DONTNEED behavior is to throw away
data in the range.  The POSIX_MADV_DONTNEED behavior is to never lose
data.  I.e., file-backed data is written back, and anonymous data is at
most swapped out.
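
To make the "throw away data" behaviour concrete, here is a minimal
sketch in C.  It assumes a Linux system and a private anonymous mapping;
the helper name byte_after_dontneed() is made up for illustration and is
not taken from glibc or the kernel.  It fills a page with a marker,
calls madvise(MADV_DONTNEED), and reports what the first byte reads back
as afterwards.

```c
#define _DEFAULT_SOURCE 1
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a marker byte into a private anonymous page,
 * call madvise(MADV_DONTNEED) on it, and return the byte read back.
 * Returns -1 if sysconf(), mmap() or madvise() fails.
 */
int byte_after_dontneed(unsigned char marker)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int result;

	if (page_size <= 0)
		return -1;

	p = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, marker, (size_t)page_size);
	assert(p[0] == marker);

	if (madvise(p, (size_t)page_size, MADV_DONTNEED) != 0) {
		munmap(p, (size_t)page_size);
		return -1;
	}

	/* On Linux the old contents are gone; this touch faults in a zero page. */
	result = p[0];
	munmap(p, (size_t)page_size);
	return result;
}
```

On Linux, byte_after_dontneed(0xAB) returns 0 rather than 0xAB.  Under
the POSIX reading described above the data would have to survive, so the
same call would return 0xAB.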


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-20 Thread Rik van Riel

Andrew Morton wrote:


> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>
> - Nick's patch also will help this problem.  It could be that your patch
>   no longer offers a 2x speedup when combined with Nick's patch.
>
>   It could well be that the combination of the two is even better, but it
>   would be nice to firm that up a bit.

I'll test that.

>   I do go on about that.  But we're adding page flags at about one per
>   year, and when we run out we're screwed - we'll need to grow the
>   pageframe.


If you want, I can take a look at folding this into the
->mapping pointer.  I can guarantee you it won't be
pretty, though :)
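
For readers unfamiliar with the trick: the kernel already keeps one flag
(PAGE_MAPPING_ANON) in the low bit of page->mapping, which works because
the pointed-to structures are aligned.  The sketch below shows the
general idea of stealing a second low bit, in plain userspace C.  The
struct, the bit value and all fake_* names are hypothetical and only
illustrate the technique; this is not the code that was proposed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page; only the field of interest. */
struct fake_page {
	void *mapping;
};

/* Assumption: the pointee is at least 4-byte aligned, so bits 0-1 are free. */
#define FAKE_MAPPING_LAZYFREE	((uintptr_t)2)	/* made-up flag bit */
#define FAKE_MAPPING_FLAGS	((uintptr_t)3)	/* all stolen low bits */

void fake_set_mapping(struct fake_page *page, void *mapping)
{
	/* A misaligned pointer would collide with the flag bits. */
	assert(((uintptr_t)mapping & FAKE_MAPPING_FLAGS) == 0);
	page->mapping = mapping;
}

void fake_set_lazyfree(struct fake_page *page)
{
	page->mapping = (void *)((uintptr_t)page->mapping | FAKE_MAPPING_LAZYFREE);
}

void fake_clear_lazyfree(struct fake_page *page)
{
	page->mapping = (void *)((uintptr_t)page->mapping & ~FAKE_MAPPING_LAZYFREE);
}

bool fake_page_lazyfree(const struct fake_page *page)
{
	return ((uintptr_t)page->mapping & FAKE_MAPPING_LAZYFREE) != 0;
}

/* Every reader must mask the flag bits off before using the pointer. */
void *fake_page_mapping(const struct fake_page *page)
{
	return (void *)((uintptr_t)page->mapping & ~FAKE_MAPPING_FLAGS);
}
```

The cost is visible in fake_page_mapping(): any code that dereferences
the pointer without masking breaks, which is presumably why this would
not be pretty.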


> - I need to update your patch for Nick's patch.  Please confirm that
>   down_read(mmap_sem) is sufficient for MADV_FREE.

It is.  MADV_FREE needs no more protection than MADV_DONTNEED.

> Stylistic nit:
>
> > +	if (PageLazyFree(page) && !migration) {
> > +		/* There is new data in the page.  Reinstate it. */
> > +		if (unlikely(pte_dirty(pteval))) {
> > +			set_pte_at(mm, address, pte, pteval);
> > +			ret = SWAP_FAIL;
> > +			goto out_unmap;
> > +		}
>
> The comment should be inside the second `if' statement.  As it is, it
> looks like we reinstate the page if (PageLazyFree(page) && !migration).


Want me to move it?



Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-20 Thread Andrew Morton
On Fri, 20 Apr 2007 17:38:06 -0400
Rik van Riel <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
>
> > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> >
> > - Nick's patch also will help this problem.  It could be that your patch
> >   no longer offers a 2x speedup when combined with Nick's patch.
> >
> >   It could well be that the combination of the two is even better, but it
> >   would be nice to firm that up a bit.
>
> I'll test that.

Thanks.

> >   I do go on about that.  But we're adding page flags at about one per
> >   year, and when we run out we're screwed - we'll need to grow the
> >   pageframe.
>
> If you want, I can take a look at folding this into the
> ->mapping pointer.  I can guarantee you it won't be
> pretty, though :)

Well, let's see how fugly it ends up looking?

> > - I need to update your patch for Nick's patch.  Please confirm that
> >   down_read(mmap_sem) is sufficient for MADV_FREE.
>
> It is.  MADV_FREE needs no more protection than MADV_DONTNEED.
>
> > Stylistic nit:
> >
> > > +	if (PageLazyFree(page) && !migration) {
> > > +		/* There is new data in the page.  Reinstate it. */
> > > +		if (unlikely(pte_dirty(pteval))) {
> > > +			set_pte_at(mm, address, pte, pteval);
> > > +			ret = SWAP_FAIL;
> > > +			goto out_unmap;
> > > +		}
> >
> > The comment should be inside the second `if' statement.  As it is, it
> > looks like we reinstate the page if (PageLazyFree(page) && !migration).
>
> Want me to move it?

I did that, thanks.


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-20 Thread Rik van Riel

Andrew Morton wrote:
> On Fri, 20 Apr 2007 17:38:06 -0400
> Rik van Riel <[EMAIL PROTECTED]> wrote:
>
> > Andrew Morton wrote:
> >
> > > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > >
> > > - Nick's patch also will help this problem.  It could be that your patch
> > >   no longer offers a 2x speedup when combined with Nick's patch.
> > >
> > >   It could well be that the combination of the two is even better, but it
> > >   would be nice to firm that up a bit.
> >
> > I'll test that.
>
> Thanks.

Well, good news.

It turns out that Nick's patch does not improve peak
performance much, but it does prevent the decline when
running with 16 threads on my quad core CPU!

We _definitely_ want both patches, there's a huge benefit
in having them both.

Here are the transactions per second for each combination:

threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem

      1       610         609                596                    545
      2      1032        1136               1196                   1200
      4      1070        1128               2014                   2024
      8      1000        1088               1665                   2087
     16       779        1073               1310                   1999




Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-20 Thread Eric Dumazet

Rik van Riel wrote:
> Andrew Morton wrote:
> > On Fri, 20 Apr 2007 17:38:06 -0400
> > Rik van Riel <[EMAIL PROTECTED]> wrote:
> >
> > > Andrew Morton wrote:
> > >
> > > > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > > >
> > > > - Nick's patch also will help this problem.  It could be that your
> > > >   patch no longer offers a 2x speedup when combined with Nick's patch.
> > > >
> > > >   It could well be that the combination of the two is even better,
> > > >   but it would be nice to firm that up a bit.
> > >
> > > I'll test that.
> >
> > Thanks.
>
> Well, good news.
>
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
>
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
>
> Here are the transactions per second for each combination:
>
> threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem
>
>       1       610         609                596                    545

545 tps versus 610 tps for one thread?  That seems quite bad, no?

Could you please find an explanation for this?

>       2      1032        1136               1196                   1200
>       4      1070        1128               2014                   2024
>       8      1000        1088               1665                   2087
>      16       779        1073               1310                   1999

Thank you


Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-20 Thread Rik van Riel

Eric Dumazet wrote:
> Rik van Riel wrote:
> > Andrew Morton wrote:
> > > On Fri, 20 Apr 2007 17:38:06 -0400
> > > Rik van Riel <[EMAIL PROTECTED]> wrote:
> > >
> > > > Andrew Morton wrote:
> > > >
> > > > > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > > > >
> > > > > - Nick's patch also will help this problem.  It could be that your
> > > > >   patch no longer offers a 2x speedup when combined with Nick's patch.
> > > > >
> > > > >   It could well be that the combination of the two is even better,
> > > > >   but it would be nice to firm that up a bit.
> > > >
> > > > I'll test that.
> > >
> > > Thanks.
> >
> > Well, good news.
> >
> > It turns out that Nick's patch does not improve peak
> > performance much, but it does prevent the decline when
> > running with 16 threads on my quad core CPU!
> >
> > We _definitely_ want both patches, there's a huge benefit
> > in having them both.
> >
> > Here are the transactions per second for each combination:
> >
> > threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem
> >
> >       1       610         609                596                    545
>
> 545 tps versus 610 tps for one thread?  That seems quite bad, no?
>
> Could you please find an explanation for this?

I have no idea why this happens.  Especially the last one:
going from a write lock to a read lock on the mmap_sem
should not make ANY difference whatsoever since we're
running single threaded!

> >       2      1032        1136               1196                   1200
> >       4      1070        1128               2014                   2024
> >       8      1000        1088               1665                   2087
> >      16       779        1073               1310                   1999

Performance with 2 database threads is way better though,
and performance with 4 or more threads more than doubles...

If you have an explanation on why single threaded performance
went down a little on my quad core system, please let me know.

Does performance suffer at all on a real UP system?



Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2

2007-04-19 Thread Rik van Riel

Restore MADV_DONTNEED to its original Linux behaviour.  This is still
not the same behaviour as POSIX, but applications may be depending on
the Linux behaviour already. Besides, glibc catches POSIX_MADV_DONTNEED
and makes sure nothing is done...

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>

---
This is to be applied on top of the original MADV_FREE patch.
It turns out that the current glibc patch already falls back
to MADV_DONTNEED if it gets an -EINVAL.
--- linux-2.6.20.x86_64/mm/madvise.c.madv_free	2007-04-19 16:46:22.0 -0400
+++ linux-2.6.20.x86_64/mm/madvise.c	2007-04-19 16:52:19.0 -0400
@@ -130,7 +130,8 @@ static long madvise_willneed(struct vm_a
  */
 static long madvise_dontneed(struct vm_area_struct * vma,
 			 struct vm_area_struct ** prev,
-			 unsigned long start, unsigned long end)
+			 unsigned long start, unsigned long end,
+			 int behavior)
 {
 	*prev = vma;
 	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
@@ -142,12 +143,14 @@ static long madvise_dontneed(struct vm_a
 			.last_index = ULONG_MAX,
 		};
 		zap_page_range(vma, start, end - start, &details);
-	} else {
+	} else if (behavior == MADV_FREE) {
 		struct zap_details details = {
 			.madv_free = 1,
 		};
 		zap_page_range(vma, start, end - start, &details);
-	}
+	} else /* behavior == MADV_DONTNEED */
+		zap_page_range(vma, start, end - start, NULL);
+
 	return 0;
 }
 
@@ -219,10 +222,9 @@ madvise_vma(struct vm_area_struct *vma, 
 		error = madvise_willneed(vma, prev, start, end);
 		break;
 
-	/* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
 	case MADV_DONTNEED:
 	case MADV_FREE:
-		error = madvise_dontneed(vma, prev, start, end);
+		error = madvise_dontneed(vma, prev, start, end, behavior);
 		break;
 
 	default:
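
The -EINVAL fallback mentioned in the notes above can be sketched from
the userspace side as follows.  This is an illustration only, not the
code in Ulrich's glibc RPMs: release_range() is a hypothetical helper,
and it assumes the C library headers define MADV_FREE (when they do not,
it goes straight to MADV_DONTNEED).

```c
#define _DEFAULT_SOURCE 1
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper, not glibc's actual code: tell the kernel that the
 * contents of [addr, addr + len) are no longer needed.  Try the lazy
 * MADV_FREE first and fall back to MADV_DONTNEED when the kernel rejects
 * the advice with EINVAL.
 * Returns 0 on success, -1 with errno set on failure.
 */
int release_range(void *addr, size_t len)
{
	assert(addr != NULL);

#ifdef MADV_FREE
	if (madvise(addr, len, MADV_FREE) == 0)
		return 0;
	if (errno != EINVAL)
		return -1;
#endif
	/* Kernel (or headers) without MADV_FREE: use the eager variant. */
	return madvise(addr, len, MADV_DONTNEED);
}
```

A caller such as a malloc implementation would use this on page-aligned
free chunks.  Either way the contents of the range must be treated as
undefined afterwards.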


[PATCH] lazy freeing of memory through MADV_FREE

2007-04-17 Thread Rik van Riel

Make it possible for applications to have the kernel free memory
lazily.  This reduces a repeated free/malloc cycle from freeing
pages and allocating them, to just marking them freeable.  If the
application wants to reuse them before the kernel needs the memory,
not even a page fault will happen.

This patch, together with Ulrich's glibc change, increases
MySQL sysbench performance by a factor of 2 on my quad core
test system.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>

---
Ulrich Drepper has test glibc RPMS for this functionality at:

http://people.redhat.com/drepper/rpms

Andrew, I have stress tested this patch for a few days now and
have not been able to find any more bugs.  I believe it is ready
to be merged in -mm, and upstream at the next merge window.

When the patch goes upstream, I will submit a small follow-up
patch to revert MADV_DONTNEED behaviour to what it did previously
and have the new behaviour trigger only on MADV_FREE: at that
point people will have to get new test RPMs of glibc.

--- linux-2.6.21-rc6-mm1/include/asm-parisc/mman.h.madv_free	2007-04-17 02:17:19.0 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-parisc/mman.h	2007-04-17 02:22:46.0 -0400
@@ -38,6 +38,7 @@
 #define MADV_SPACEAVAIL 5   /* insure that resources are reserved */
 #define MADV_VPS_PURGE  6   /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7  /* Inherit parents page size */
+#define MADV_FREE	8		/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/asm-mips/mman.h.madv_free	2007-04-17 02:17:19.0 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-mips/mman.h	2007-04-17 02:22:46.0 -0400
@@ -65,6 +65,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/asm-xtensa/mman.h.madv_free	2007-04-17 02:17:19.0 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-xtensa/mman.h	2007-04-17 02:22:46.0 -0400
@@ -72,6 +72,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/linux/swap.h.madv_free	2007-04-17 02:17:43.0 -0400
+++ linux-2.6.21-rc6-mm1/include/linux/swap.h	2007-04-17 02:22:46.0 -0400
@@ -182,6 +182,7 @@ extern void FASTCALL(lru_cache_add(struc
 extern void FASTCALL(lru_cache_add_active(struct page *));
 extern void FASTCALL(lru_cache_add_tail(struct page *));
 extern void FASTCALL(activate_page(struct page *));
+extern void FASTCALL(deactivate_tail_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
 extern int lru_add_drain_all(void);
--- linux-2.6.21-rc6-mm1/include/linux/mm.h.madv_free	2007-04-17 02:17:43.0 -0400
+++ linux-2.6.21-rc6-mm1/include/linux/mm.h	2007-04-17 02:22:46.0 -0400
@@ -767,6 +767,7 @@ struct zap_details {
 	pgoff_t last_index;			/* Highest page->index to unmap */
 	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
+	short madv_free;			/* MADV_FREE anonymous memory */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
--- linux-2.6.21-rc6-mm1/include/linux/page-flags.h.madv_free	2007-04-17 02:17:43.0 -0400
+++ linux-2.6.21-rc6-mm1/include/linux/page-flags.h	2007-04-17 02:23:16.0 -0400
@@ -91,6 +91,7 @@
 #define PG_booked		20	/* Has blocks reserved on-disk */
 
 #define PG_readahead		21	/* Reminder to do read-ahead */
+#define PG_lazyfree		22	/* MADV_FREE potential throwaway */
 
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
@@ -216,6 +217,11 @@ static inline void SetPageUptodate(struc
 #define ClearPageReclaim(page)	clear_bit(PG_reclaim, &(page)->flags)
 #define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags)
 
+#define PageLazyFree(page)	test_bit(PG_lazyfree, &(page)->flags)
+#define SetPageLazyFree(page)	set_bit(PG_lazyfree, &(page)->flags)
+#define ClearPageLazyFree(page)	clear_bit(PG_lazyfree, &(page)->flags)
+#define __ClearPageLazyFree(page) __clear_bit(PG_lazyfree, &(page)->flags)
+
 #define PageCompound(page)	test_bit(PG_compound, &(page)->flags)
 #define 

[PATCH] lazy freeing of memory through MADV_FREE

2007-04-17 Thread Rik van Riel

Make it possible for applications to have the kernel free memory
lazily.  This reduces a repeated free/malloc cycle from freeing
pages and allocating them, to just marking them freeable.  If the
application wants to reuse them before the kernel needs the memory,
not even a page fault will happen.

This patch, together with Ulrich's glibc change, increases
MySQL sysbench performance by a factor of 2 on my quad core
test system.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>

---
Ulrich Drepper has test glibc RPMS for this functionality at:

http://people.redhat.com/drepper/rpms

Andrew, I have stress tested this patch for a few days now and
have not been able to find any more bugs.  I believe it is ready
to be merged in -mm, and upstream at the next merge window.

When the patch goes upstream, I will submit a small follow-up
patch to revert MADV_DONTNEED behaviour to what it did previously
and have the new behaviour trigger only on MADV_FREE: at that
point people will have to get new test RPMs of glibc.

--- linux-2.6.21-rc6-mm1/include/asm-parisc/mman.h.madv_free	2007-04-17 02:17:19.0 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-parisc/mman.h	2007-04-17 02:22:46.0 -0400
@@ -38,6 +38,7 @@
 #define MADV_SPACEAVAIL 5   /* insure that resources are reserved */
 #define MADV_VPS_PURGE  6   /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7  /* Inherit parents page size */
+#define MADV_FREE	8		/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/asm-mips/mman.h.madv_free	2007-04-17 02:17:19.0 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-mips/mman.h	2007-04-17 02:22:46.0 -0400
@@ -65,6 +65,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/asm-xtensa/mman.h.madv_free	2007-04-17 02:17:19.0 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-xtensa/mman.h	2007-04-17 02:22:46.0 -0400
@@ -72,6 +72,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/linux/swap.h.madv_free	2007-04-17 02:17:43.0 -0400
+++ linux-2.6.21-rc6-mm1/include/linux/swap.h	2007-04-17 02:22:46.0 -0400
@@ -182,6 +182,7 @@ extern void FASTCALL(lru_cache_add(struc
 extern void FASTCALL(lru_cache_add_active(struct page *));
 extern void FASTCALL(lru_cache_add_tail(struct page *));
 extern void FASTCALL(activate_page(struct page *));
+extern void FASTCALL(deactivate_tail_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
 extern int lru_add_drain_all(void);
--- linux-2.6.21-rc6-mm1/include/linux/mm.h.madv_free	2007-04-17 02:17:43.0 -0400
+++ linux-2.6.21-rc6-mm1/include/linux/mm.h	2007-04-17 02:22:46.0 -0400
@@ -767,6 +767,7 @@ struct zap_details {
 	pgoff_t last_index;			/* Highest page->index to unmap */
 	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
+	short madv_free;			/* MADV_FREE anonymous memory */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
--- linux-2.6.21-rc6-mm1/include/linux/page-flags.h.madv_free	2007-04-17 02:17:43.0 -0400
+++ linux-2.6.21-rc6-mm1/include/linux/page-flags.h	2007-04-17 02:23:16.0 -0400
@@ -91,6 +91,7 @@
 #define PG_booked		20	/* Has blocks reserved on-disk */
 
 #define PG_readahead		21	/* Reminder to do read-ahead */
+#define PG_lazyfree		22	/* MADV_FREE potential throwaway */
 
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
@@ -216,6 +217,11 @@ static inline void SetPageUptodate(struc
 #define ClearPageReclaim(page)	clear_bit(PG_reclaim, &(page)->flags)
 #define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags)
 
+#define PageLazyFree(page)	test_bit(PG_lazyfree, &(page)->flags)
+#define SetPageLazyFree(page)	set_bit(PG_lazyfree, &(page)->flags)
+#define ClearPageLazyFree(page)	clear_bit(PG_lazyfree, &(page)->flags)
+#define __ClearPageLazyFree(page) __clear_bit(PG_lazyfree, &(page)->flags)
+
 #define PageCompound(page)	test_bit(PG_compound, &(page)->flags)
 #define __SetPageCompound(page)