Re: process creation time increases linearly with shmem

2005-08-29 Thread Linus Torvalds


On Tue, 30 Aug 2005, Nick Piggin wrote:
> 
> Andrew, did you pick up the patch or should I resend to someone?

I picked it up. If it causes performance regressions, we can fix them, and
if it causes other problems then that will be interesting in itself.

Linus


Re: process creation time increases linearly with shmem

2005-08-29 Thread Nick Piggin

Ray Fucillo wrote:

> Nick Piggin wrote:
>
> > How does the following look? (I changed the comment a bit). Andrew, please
> > apply if nobody objects.
>
> Nick, I applied this latest patch to a 2.6.12 kernel and found that it
> does resolve the problem.  Prior to the patch on this machine, I was
> seeing about 23ms spent in fork for every 100MB of shared memory
> segment.  After applying the patch, fork is taking about 1ms
> regardless of the shared memory size.

Hi Ray,
That's good news. I think we should probably consider putting the patch in
2.6.14 or, if not, then definitely 2.6.15.

Andrew, did you pick up the patch or should I resend to someone?

I think the fork latency alone is enough to justify inclusion... however,
did you actually see increased aggregate throughput of your database (or
at least not a _decreased_ throughput)?

> Many thanks to everyone for your help on this.

Well, thank you very much for breaking the kernel and telling us about it! :)

Nick




Re: process creation time increases linearly with shmem

2005-08-29 Thread Linus Torvalds


On Mon, 29 Aug 2005, Ray Fucillo wrote:
> 
> FWIW, an interesting side effect of this occurs when I run the database 
> with this patch internally on a Linux server that uses NIS.  Its an 
> unrelated problem and not a kernel problem.  Its due to the children 
> calling initgroups()...  apparently when you have many processes making 
> simultaneous initgroups() calls something starts imposing very long 
> waits in increments of 3 seconds

Sounds like something is backing off by waiting for three seconds whenever
some lock failure occurs. I don't see what locking the code might want to
do (it should just do the NIS equivalent of reading /etc/groups and do a
"setgroups()" system call), but I assume that the NIS server ends up
having some strange locking.

You might do an "ltrace testcase" (and, probably, the nis server) to see
if you can see where it happens, and bug the appropriate maintainers.
Especially if you have a repeatable test-case (where "repeatable" isn't
just for your particular machine: it's probably timing-related), somebody
might even fix it ;)

Linus


Re: process creation time increases linearly with shmem

2005-08-29 Thread Ray Fucillo

Nick Piggin wrote:

> How does the following look? (I changed the comment a bit). Andrew, please
> apply if nobody objects.

Nick, I applied this latest patch to a 2.6.12 kernel and found that it
does resolve the problem.  Prior to the patch on this machine, I was
seeing about 23ms spent in fork for every 100MB of shared memory segment.
After applying the patch, fork is taking about 1ms regardless of the
shared memory size.


Many thanks to everyone for your help on this.

FWIW, an interesting side effect of this occurs when I run the database 
with this patch internally on a Linux server that uses NIS.  It's an 
unrelated problem and not a kernel problem.  It's due to the children 
calling initgroups()...  apparently when you have many processes making 
simultaneous initgroups() calls, something starts imposing very long 
waits in increments of 3 seconds, so some processes return from 
initgroups() in a few ms and other processes complete in 3, 6, 9, up to 
21 seconds (plus a few ms).  I'm not sure what the story is with that, 
though it's clearly not a kernel issue.  If someone happens to have the 
answer or a suggestion, great; otherwise I'll pursue that elsewhere as 
necessary.  (I can reproduce this by simply adding an initgroups() call 
in the child of the forktest program that I sent earlier.)
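
For concreteness, here is a minimal sketch of a forktest along the lines
Ray describes (his original program is not reproduced in this thread, so
the segment size, user name and timing details below are assumptions,
not Ray's actual code):

    #include <grp.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            size_t size = 400UL << 20;  /* assumed: 400MB; may need SHMMAX raised */
            int shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
            char *p = shmat(shmid, NULL, 0);
            struct timeval t0, t1;
            pid_t pid;

            if (shmid == -1 || p == (char *)-1) {
                    perror("shm setup");
                    return 1;
            }
            memset(p, 1, size);         /* touch every page in the parent */

            gettimeofday(&t0, NULL);
            pid = fork();
            if (pid == 0) {
                    /* hypothetical user; this is the call that stalls on NIS */
                    initgroups("someuser", getgid());
                    _exit(0);
            }
            gettimeofday(&t1, NULL);    /* parent resumes once fork() is done */
            printf("fork took %ld us\n",
                   (t1.tv_sec - t0.tv_sec) * 1000000L +
                   (t1.tv_usec - t0.tv_usec));
            waitpid(pid, NULL, 0);
            shmctl(shmid, IPC_RMID, NULL);
            return 0;
    }

Run as root (initgroups() needs privilege to call setgroups()).  Before
the patch the reported fork time grows with the segment size; with it,
the time should stay roughly constant.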





Re: process creation time increases linearly with shmem

2005-08-28 Thread Nick Piggin

Hugh Dickins wrote:

> On Sun, 28 Aug 2005, Nick Piggin wrote:
>
> > This is the condition I ended up with. Any good?
> >
> > if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
> >         if (vma->vm_flags & VM_MAYSHARE)
> >                 return 0;
> >         if (vma->vm_file && !vma->anon_vma)
> >                 return 0;
> > }
>
> It's not bad, and practical timings are unlikely to differ, but your
> VM_MAYSHARE test is redundant (VM_MAYSHARE areas don't have anon_vmas *),
> and your vm_file test is unnecessary, excluding pure anonymous areas
> which haven't yet taken a fault.

Haven't taken a _write_ fault? Hmm, OK, that would seem to be a good
optimisation as well: we don't need to copy anon memory with only
ZERO_PAGE mappings... well, good as in "nice and logical" if not so
much "will make a difference"!

> Please do send Andrew the patch for -mm, Nick: you were one of the
> creators of this (don't omit credit to Ray, Parag, Andi, Rik, Linus),
> much better that it go in your name (heh, heh, heh, can you trust me?)

Well, Andi and I seemed to have the idea independently, Linus thought
private would be a good idea (I agree), you came up with the complete
patch with others contributing bits and pieces, and most importantly
Ray brought our attention to the possible deficiency in our mm.

> Hugh
>
> * That's ignoring, as we do everywhere else, the case which came up
> a couple of weeks back in discussions with Linus, ptrace writing to
> an area the process does not have write access to, creating an anon
> page within a shared vma: that's an awkward case currently mishandled,
> but the patch below does it no harm.

And in that case maybe your patch works better anyway, because the child
will inherit that page from the parent.

How does the following look? (I changed the comment a bit). Andrew, please
apply if nobody objects.

--
SUSE Labs, Novell Inc.

Defer copying of ptes until fault time when it is possible to reconstruct
the pte from backing store. Idea from Andi Kleen and Nick Piggin.

Thanks to input from Rik van Riel and Linus and to Hugh for correcting
my blundering.

[ Note to akpm: Ray Fucillo <[EMAIL PROTECTED]>'s results go here ]

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c  2005-08-13 11:16:34.0 +1000
+++ linux-2.6/mm/memory.c   2005-08-28 16:41:32.0 +1000
@@ -498,6 +498,17 @@ int copy_page_range(struct mm_struct *ds
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;
 
+   /*
+* Don't copy ptes where a page fault will fill them correctly.
+* Fork becomes much lighter when there are big shared or private
+* readonly mappings. The tradeoff is that copy_page_range is more
+* efficient than faulting.
+*/
+   if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
+   if (!vma->anon_vma)
+   return 0;
+   }
+
if (is_vm_hugetlb_page(vma))
return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 



Re: process creation time increases linearly with shmem

2005-08-27 Thread Hugh Dickins
On Sun, 28 Aug 2005, Nick Piggin wrote:
> 
> This is the condition I ended up with. Any good?
> 
> if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
>         if (vma->vm_flags & VM_MAYSHARE)
>                 return 0;
>         if (vma->vm_file && !vma->anon_vma)
>                 return 0;
> }

It's not bad, and practical timings are unlikely to differ, but your
VM_MAYSHARE test is redundant (VM_MAYSHARE areas don't have anon_vmas *),
and your vm_file test is unnecessary, excluding pure anonymous areas
which haven't yet taken a fault.
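
(For reference, the reason !vma->anon_vma is a reliable "no private
pages yet" test: the anon_vma is only allocated when the first private
page is about to be installed.  A condensed paraphrase of the 2.6-era
do_anonymous_page() in mm/memory.c, not a verbatim quote:)

    if (write_access) {
            /* First private page in this vma: allocate its anon_vma
             * now; until this point vma->anon_vma stays NULL. */
            if (unlikely(anon_vma_prepare(vma)))
                    goto no_mem;
            page = alloc_zeroed_user_highpage(vma, addr);
            ...
    } else {
            /* Read fault: map the shared ZERO_PAGE; no anon_vma needed. */
            entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr),
                                         vma->vm_page_prot));
    }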

Please do send Andrew the patch for -mm, Nick: you were one of the
creators of this (don't omit credit to Ray, Parag, Andi, Rik, Linus),
much better that it go in your name (heh, heh, heh, can you trust me?)

Hugh

* That's ignoring, as we do everywhere else, the case which came up
a couple of weeks back in discussions with Linus, ptrace writing to
an area the process does not have write access to, creating an anon
page within a shared vma: that's an awkward case currently mishandled,
but the patch below does it no harm.

--- 2.6.13-rc7/mm/memory.c  2005-08-24 11:13:41.0 +0100
+++ linux/mm/memory.c   2005-08-28 04:48:34.0 +0100
@@ -498,6 +498,15 @@ int copy_page_range(struct mm_struct *ds
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;
 
+   /*
+* Assume the fork will probably exec: don't waste time copying
+* ptes where a page fault will fill them correctly afterwards.
+*/
+   if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
+   if (!vma->anon_vma)
+   return 0;
+   }
+
if (is_vm_hugetlb_page(vma))
return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 


Re: process creation time increases linearly with shmem

2005-08-27 Thread Nick Piggin

Linus Torvalds wrote:

> On Fri, 26 Aug 2005, Rik van Riel wrote:
>
> > On Fri, 26 Aug 2005, Hugh Dickins wrote:
> >
> > > Well, I still don't think we need to test vm_file.  We can add an
> > > anon_vma test if you like, if we really want to minimize the fork
> > > overhead, in favour of later faults.  Do we?
> >
> > When you consider NUMA placement (the child process may
> > end up running elsewhere), allocating things like page
> > tables lazily may well end up being a performance win.
>
> It should be easy enough to benchmark something like kernel compiles etc,
> which are reasonably fork-rich and should show a good mix for something
> like this. Or even just something like "time to restart an X session" after
> you've brought it into memory once.




2.6.13-rc7-git2
kbuild (make -j4) on a dual G5.

plain
228.85user 19.90system 2:06.50elapsed 196%CPU (3725666minor)
228.91user 19.90system 2:06.07elapsed 197%CPU (3721353minor)
229.00user 19.78system 2:06.20elapsed 197%CPU (3721345minor)
228.81user 19.94system 2:06.05elapsed 197%CPU (3723791minor)

nocopy shared
229.28user 19.76system 2:06.24elapsed 197%CPU (3725661minor)
229.04user 19.91system 2:06.92elapsed 196%CPU (3718904minor)
228.97user 20.06system 2:06.46elapsed 196%CPU (3723807minor)
229.24user 19.84system 2:06.13elapsed 197%CPU (3723793minor)

nocopy all
228.74user 19.87system 2:06.27elapsed 196%CPU (3819927minor)
228.89user 19.81system 2:05.89elapsed 197%CPU (3822943minor)
228.77user 19.73system 2:06.23elapsed 196%CPU (3820517minor)
228.93user 19.70system 2:05.84elapsed 197%CPU (3822935minor)

I'd say the full test (including anon_vma) is maybe slightly
faster on this test, though maybe it isn't significant.

It is doing around 2.5% more minor faults, though the profiles
say copy_page_range time is reduced, as one would expect.

I think that if all else (ie. final performance) is equal, then
faulting is better than copying because the work is being
deferred until it is needed, and we dodge some pathological
cases like Ray's database taking 100s of ms to fork (we hope!)

However it will always depend on workload.

This is the condition I ended up with. Any good?

  if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
if (vma->vm_flags & VM_MAYSHARE)
  return 0;
if (vma->vm_file && !vma->anon_vma)
  return 0;
  }

--
SUSE Labs, Novell Inc.



Re: process creation time increases linearly with shmem

2005-08-26 Thread Linus Torvalds


On Fri, 26 Aug 2005, Rik van Riel wrote:
> On Fri, 26 Aug 2005, Hugh Dickins wrote:
> 
> > Well, I still don't think we need to test vm_file.  We can add an
> > anon_vma test if you like, if we really want to minimize the fork
> > overhead, in favour of later faults.  Do we?
> 
> When you consider NUMA placement (the child process may
> end up running elsewhere), allocating things like page
> tables lazily may well end up being a performance win.

It should be easy enough to benchmark something like kernel compiles etc, 
which are reasonably fork-rich and should show a good mix for something 
like this. Or even just something like "time to restart an X session" after 
you've brought it into memory once.

Linus


Re: process creation time increases linearly with shmem

2005-08-26 Thread Rik van Riel
On Fri, 26 Aug 2005, Hugh Dickins wrote:

> Well, I still don't think we need to test vm_file.  We can add an
> anon_vma test if you like, if we really want to minimize the fork
> overhead, in favour of later faults.  Do we?

When you consider NUMA placement (the child process may
end up running elsewhere), allocating things like page
tables lazily may well end up being a performance win.

-- 
All Rights Reversed


Re: process creation time increases linearly with shmem

2005-08-26 Thread Linus Torvalds


On Fri, 26 Aug 2005, Hugh Dickins wrote:
>
> Well, I still don't think we need to test vm_file.  We can add an
> anon_vma test if you like, if we really want to minimize the fork
> overhead, in favour of later faults.  Do we?

I think we might want to do it in -mm for testing. Because quite frankly, 
otherwise the new fork() logic won't get a lot of testing. Shared memory 
isn't that common.

Linus


Re: process creation time increases linearly with shmem

2005-08-26 Thread Hugh Dickins
On Fri, 26 Aug 2005, Ross Biro wrote:
> On 8/26/05, Rik van Riel <[EMAIL PROTECTED]> wrote:
> > 
> > Filling in all the page table entries at the first fault to
> > a VMA doesn't make much sense, IMHO.
> > 
> > I suspect we would be better off without that extra complexity,
> > unless there is a demonstrated benefit to it.
> 
> You are probably right, but do you want to put in a patch that might
> have a big performance impact in either direction without verifying
> it?
> 
> My suggestion is safe, but most likely sub-optimal.  What everyone
> else is suggesting may be far better, but needs to be verified first.

It all has to be verified, and the problem will be that some things
fare well and others badly: how to reach a balanced decision?
Following your suggestion is no more safe than not following it.

> I'm suggesting that we change the code to do the same work fork would
> have done on the first page fault immediately, since it's easy to
> argue that it's not much worse than we have now and much better in
> many cases, and then try to experiment and figure out what the
> correct solution is.

We don't know what work fork would have done, that information was in
the ptes we decided not to bother to copy.  Perhaps every pte of the
vma was set, perhaps none, perhaps only one.

Also, doing it at fault time has significantly more work to do than
just zipping along the ptes incrementing page counts and clearing bits,
I think; but probably much less extra work than I originally imagined,
since Andrew gave us the gang lookup of the page cache.
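
(As an illustration of that idea - not code from this thread - a hedged
sketch of faultahead built on the gang lookup, using find_get_pages()
from 2.6's mm/filemap.c:)

    struct page *pages[16];
    unsigned int i, got;

    /* One radix-tree walk collects whatever file pages are already
     * resident near the faulting offset... */
    got = find_get_pages(mapping, pgoff, 16, pages);

    /* ...and each resident page could then be pre-mapped into the
     * faulting mm, instead of costing one minor fault apiece later.
     * Here the sketch just drops the references again. */
    for (i = 0; i < got; i++)
            page_cache_release(pages[i]);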

All the same, I'm with Rik: one of the great virtues of the original
idea was its simplicity; I'd prefer not to add complexity.

Hugh


Re: process creation time increases linearly with shmem

2005-08-26 Thread Hugh Dickins
On Fri, 26 Aug 2005, Linus Torvalds wrote:
> On Fri, 26 Aug 2005, Hugh Dickins wrote:
> > 
> > I see some flaws in the various patches posted, including Rik's.
> > Here's another version - doing it inside copy_page_range, so this
> > kind of vma special-casing is over in mm/ rather than kernel/.
> 
> I like this approach better, but I don't understand your particular 
> choice of bits.
> 
> > +* Assume the fork will probably exec: don't waste time copying
> > +* ptes where a page fault will fill them correctly afterwards.
> > +*/
> > +   if ((vma->vm_flags & (VM_MAYSHARE|VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))
> > +   == VM_MAYSHARE)
> > +   return 0;
> > +
> > if (is_vm_hugetlb_page(vma))
> > return copy_hugetlb_page_range(dst_mm, src_mm, vma);
> 
> First off, if you just did it below the hugetlb check, you'd not need to
> check hugetlb again.

Yes: I wanted to include VM_HUGETLB in the list as documentation really;
and it costs nothing to test it along with the other flags - or are there
architectures where the more bits you test, the costlier?

> And while I understand VM_NONLINEAR and VM_RESERVED,
> can you please comment on why VM_MAYSHARE is so important, and why no
> other information matters.

The VM_MAYSHARE one isn't terribly important, there's no correctness
reason to replace VM_SHARED there.  It's just that do_mmap_pgoff takes
VM_SHARED and VM_MAYWRITE off a MAP_SHARED mapping of a file which was
not opened for writing.  We can safely avoid copying the ptes of such a
vma, just as with the writable ones, but the VM_MAYSHARE test catches
them where the VM_SHARED test does not.
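
(For reference, a condensed paraphrase of that do_mmap_pgoff logic from
2.6's mm/mmap.c, not a verbatim quote:)

    case MAP_SHARED:
            if ((prot & PROT_WRITE) && !(file->f_mode & FMODE_WRITE))
                    return -EACCES;
            ...
            vm_flags |= VM_SHARED | VM_MAYSHARE;
            if (!(file->f_mode & FMODE_WRITE))
                    /* Read-only file: writable sharing is impossible,
                     * so VM_SHARED and VM_MAYWRITE come off, but
                     * VM_MAYSHARE survives to mark the vma as a
                     * shared file mapping. */
                    vm_flags &= ~(VM_SHARED | VM_MAYWRITE);
            break;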

> Now, VM_MAYSHARE is a sign of the mapping being a shared mapping. Fair 
> enough. But afaik, a shared anonymous mapping absolutely needs its page 
> tables copied, because those page tables contains either the pointers to 
> the shared pages, or the swap entries.
> 
> So I really think you need to verify that it's a file mapping too.

Either I'm misunderstanding, or you're remembering back to how shared
anonymous was done in 2.2 (perhaps).  In 2.4 and 2.6, shared anonymous
is "backed" by a shared memory object, created by shmem_zero_setup:
which sets vm_file even though we came into do_mmap_pgoff with no file.

> Also, arguably, there are other cases that may or may not be worth 
> worrying about. What about non-shared non-writable file mappings? What 
> about private mappings that haven't been COW'ed? 

Non-shared non-currently-writable file mappings might have been writable
and modified in the past, so we cannot necessarily skip those.

We could, and I did, consider testing whether the vma has an anon_vma:
we always allocate a vma's anon_vma just before first allocating it a
private page, and it's a good test which swapoff uses to narrow its
search.

But partly I thought that a little too tricksy, and hard to explain;
and partly I thought it was liable to catch the executable text,
some of which is most likely to be needed in between fork and exec.

> So I think that in addition to your tests, you should test for
> "vma->vm_file", and you could toy with testing for "vma->anon_vma"  being
> NULL (the latter will cause a _lot_ of hits, because any read-only private
> mapping will trigger, but it's a good stress-test and conceptually
> interesting, even if I suspect it will kill any performance gain through
> extra minor faults in the child).

Ah yes, I wrote the paragraph above before reading this one, honest!

Well, I still don't think we need to test vm_file.  We can add an
anon_vma test if you like, if we really want to minimize the fork
overhead, in favour of later faults.  Do we?

Hugh


Re: process creation time increases linearly with shmem

2005-08-26 Thread Ross Biro
On 8/26/05, Rik van Riel <[EMAIL PROTECTED]> wrote:
> 
> Filling in all the page table entries at the first fault to
> a VMA doesn't make much sense, IMHO.
> 
> 
> I suspect we would be better off without that extra complexity,
> unless there is a demonstrated benefit to it.

You are probably right, but do you want to put in a patch that might
have a big performance impact in either direction without verifying
it?

My suggestion is safe, but most likely sub-optimal.  What everyone
else is suggesting may be far better, but needs to be verified first.

I'm suggesting that we change the code to do the same work fork would
have done on the first page fault immediately, since it's easy to
argue that it's not much worse than we have now and much better in
many cases, and then try to experiment and figure out what the
correct solution is.

Ross


Re: process creation time increases linearly with shmem

2005-08-26 Thread Linus Torvalds


On Fri, 26 Aug 2005, Hugh Dickins wrote:
> 
> I see some flaws in the various patches posted, including Rik's.
> Here's another version - doing it inside copy_page_range, so this
> kind of vma special-casing is over in mm/ rather than kernel/.

I like this approach better, but I don't understand your particular 
choice of bits.

> +  * Assume the fork will probably exec: don't waste time copying
> +  * ptes where a page fault will fill them correctly afterwards.
> +  */
> + if ((vma->vm_flags & (VM_MAYSHARE|VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))
> + == VM_MAYSHARE)
> + return 0;
> +
>   if (is_vm_hugetlb_page(vma))
>   return copy_hugetlb_page_range(dst_mm, src_mm, vma);

First off, if you just did it below the hugetlb check, you'd not need to
check hugetlb again. And while I understand VM_NONLINEAR and VM_RESERVED,
can you please comment on why VM_MAYSHARE is so important, and why no
other information matters.

Now, VM_MAYSHARE is a sign of the mapping being a shared mapping. Fair 
enough. But afaik, a shared anonymous mapping absolutely needs its page 
tables copied, because those page tables contain either the pointers to 
the shared pages, or the swap entries.

So I really think you need to verify that it's a file mapping too.

Also, arguably, there are other cases that may or may not be worth 
worrying about. What about non-shared non-writable file mappings? What 
about private mappings that haven't been COW'ed? 

So I think that in addition to your tests, you should test for
"vma->vm_file", and you could toy with testing for "vma->anon_vma"  being
NULL (the latter will cause a _lot_ of hits, because any read-only private
mapping will trigger, but it's a good stress-test and conceptually
interesting, even if I suspect it will kill any performance gain through
extra minor faults in the child).

Linus


Re: process creation time increases linearly with shmem

2005-08-26 Thread Rik van Riel
On Fri, 26 Aug 2005, Ray Fucillo wrote:

> However, there is still a need that the child, once successfully forked, is
> operational reasonably quickly.  I suspect that Ross's idea of paging in
> everything after the first fault would not be optimal for us, because we'd
> still be talking about hundreds of ms of work done before the child does
> anything useful. 

Simply skipping the page table setup of MAP_SHARED regions
should be enough to fix this issue.

> It would still be far better than the behavior we have today because 
> that time would no longer be synchronous with the fork().

Filling in all the page table entries at the first fault to
a VMA doesn't make much sense, IMHO.

The reason I think this is that people have experimented
with prefaulting already resident pages at page fault time,
and those experiments have never shown a conclusive benefit.

Now, if doing such prefaulting for normal processes does not
show a benefit - why would it be beneficial to recently forked
processes with a huge SHM area?

I suspect we would be better off without that extra complexity,
unless there is a demonstrated benefit to it.

-- 
All Rights Reversed


Re: process creation time increases linearly with shmem

2005-08-26 Thread Ray Fucillo

Nick Piggin wrote:

> OK let's see how Ray goes, and try it when 2.6.14 opens...

Working on that now - I'll let you know.

> Yeah I guess that's a good idea. Patch looks pretty good.
> Just a minor issue with the comment, it is not strictly
> just assuming the child will exec... IMO it is worthwhile
> in Ray's case even if his forked process _eventually_ ends
> up touching all the shared memory pages, it is better to
> avoid many ms of fork overhead.

Yes, in our database system the child will immediately touch some shmem 
pages, and may eventually touch most of them (and would almost never 
exec()).  Fork performance is critical in usage scenarios where an 
end-user database request forks a new server process from one master 
server process.

However, there is still a need that the child, once successfully forked, 
is operational reasonably quickly.  I suspect that Ross's idea of paging 
in everything after the first fault would not be optimal for us, because 
we'd still be talking about hundreds of ms of work done before the child 
does anything useful.  It would still be far better than the behavior we 
have today because that time would no longer be synchronous with the 
fork().  Of course, it sounds like our app might be able to make use of 
the hugetlb stuff to mitigate this problem in the future...



Re: process creation time increases linearly with shmem

2005-08-26 Thread Ross Biro
On 8/26/05, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> On Fri, 26 Aug 2005, Ross Biro wrote:
> > On 8/26/05, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > >
> > > The refaulting will hurt the performance of something: let's
> > > just hope that something doesn't turn out to be a show-stopper.
> >
> > Why not just fault in all the pages on the first fault?  Then the performance
> > loss is a single page fault (the page table copy that would have happened at
> > fork time now happens at fault time) and you get the big win for processes
> > that do fork/exec.
> 
> "all" might be very many more pages than were ever mapped in the parent,
> and not be a win.  Some faultahead might work better.  Might, might, ...

If you reduce "all" to whatever would have been done in fork
originially, then you've got a big win in some cases and a minimal
loss in others, and it's easy to argue you've got something better.

Now changng "all" to something even less might be an even bigger win,
but that requires a lot of benchmarking to justify.

Ross


Re: process creation time increases linearly with shmem

2005-08-26 Thread Hugh Dickins
On Fri, 26 Aug 2005, Ross Biro wrote:
> On 8/26/05, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > 
> > The refaulting will hurt the performance of something: let's
> > just hope that something doesn't turn out to be a show-stopper.
> 
> Why not just fault in all the pages on the first fault?  Then the performance
> loss is a single page fault (the page table copy that would have happened at
> fork time now happens at fault time) and you get the big win for processes
> that do fork/exec.

"all" might be very many more pages than were ever mapped in the parent,
and not be a win.  Some faultahead might work better.  Might, might, ...

Hugh


Re: process creation time increases linearly with shmem

2005-08-26 Thread Nick Piggin

Hugh Dickins wrote:

> On Thu, 25 Aug 2005, Linus Torvalds wrote:
>
> > That said, I think it's a valid optimization. Especially as the child
> > _probably_ doesn't need it (ie there's at least some likelihood of an
> > execve() or similar).
>
> I agree, seems a great idea to me (sulking because I was too dumb
> to get it, even when Nick and Andi first posted their patches).
>
> It won't just save on the copying at fork time, it'll save on
> undoing it all again when the child mm is torn down for exec.
>
> The refaulting will hurt the performance of something: let's
> just hope that something doesn't turn out to be a show-stopper.

OK, let's see how Ray goes, and try it when 2.6.14 opens...

> I see some flaws in the various patches posted, including Rik's.
> Here's another version - doing it inside copy_page_range, so this
> kind of vma special-casing is over in mm/ rather than kernel/.

Yeah, I guess that's a good idea. Patch looks pretty good.
Just a minor issue with the comment: it is not strictly
just assuming the child will exec... IMO it is worthwhile
in Ray's case even if his forked process _eventually_ ends
up touching all the shared memory pages, it is better to
avoid many ms of fork overhead.

Also, on NUMA systems this will help get page tables allocated
on the right nodes, which is not an insignificant problem for
big HPC jobs.

--
SUSE Labs, Novell Inc.



Re: process creation time increases linearly with shmem

2005-08-26 Thread Hugh Dickins
On Thu, 25 Aug 2005, Linus Torvalds wrote:
> On Fri, 26 Aug 2005, Nick Piggin wrote:
> > 
> > > Skipping MAP_SHARED in fork() sounds like a good idea to me...
> > 
> > Indeed. Linus, can you remember why we haven't done this before?
> 
> Hmm. Historical reasons. Also, if the child ends up needing it, it will 
> now have to fault them in.
> 
> That said, I think it's a valid optimization. Especially as the child 
> _probably_ doesn't need it (ie there's at least some likelihood of an 
> execve() or similar).

I agree, seems a great idea to me (sulking because I was too dumb
to get it, even when Nick and Andi first posted their patches).

It won't just save on the copying at fork time, it'll save on
undoing it all again when the child mm is torn down for exec.

The refaulting will hurt the performance of something: let's
just hope that something doesn't turn out to be a show-stopper.

I see some flaws in the various patches posted, including Rik's.
Here's another version - doing it inside copy_page_range, so this
kind of vma special-casing is over in mm/ rather than kernel/.

No point in testing vm_file, the vm_flags cover the cases.
Test VM_MAYSHARE rather than VM_SHARED to include the never-can-be-
written MAP_SHARED cases too.  Must exclude VM_NONLINEAR, their ptes
are essential for defining the file offsets.  Must exclude VM_RESERVED,
faults on remap_pfn_range areas would usually put in anon zeroed pages
instead of the driver pages - or perhaps would be better as a test
against VM_IO, or vma->vm_ops->nopage?

Having to exclude the VM_NONLINEAR seems rather a shame, since those
are always shared and likely enormous.  The InfiniBand people's idea 
of a way for the app to set VM_DONTCOPY (to avoid rdma get_user_pages
problems) becomes attractive as a way for apps to speed their forks.
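
(To make the VM_NONLINEAR point concrete, a small illustrative userspace
example - not from this thread, and the file name is hypothetical.
After remap_file_pages() the per-page file offset is recorded only in
the pte, so a fork that skipped the pte copy could never reconstruct it
from the vma:)

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long pg = sysconf(_SC_PAGESIZE);
            int fd = open("datafile", O_RDWR);  /* needs >= 6 pages */
            char *p;

            if (fd < 0)
                    return 1;
            p = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    return 1;
            /* Rebind the vma's first page to file page 5: that offset
             * now lives only in the pte, not in vma->vm_pgoff. */
            if (remap_file_pages(p, pg, 0, 5, 0) != 0)
                    perror("remap_file_pages");
            printf("first byte of file page 5: %d\n", p[0]);
            return 0;
    }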

Hugh

--- 2.6.13-rc7/mm/memory.c  2005-08-24 11:13:41.0 +0100
+++ linux/mm/memory.c   2005-08-26 10:09:50.0 +0100
@@ -498,6 +498,14 @@ int copy_page_range(struct mm_struct *ds
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;
 
+   /*
+* Assume the fork will probably exec: don't waste time copying
+* ptes where a page fault will fill them correctly afterwards.
+*/
+   if ((vma->vm_flags & (VM_MAYSHARE|VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))
+   == VM_MAYSHARE)
+   return 0;
+
if (is_vm_hugetlb_page(vma))
return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: process creation time increases linearly with shmem

2005-08-26 Thread Hugh Dickins
On Thu, 25 Aug 2005, Linus Torvalds wrote:
 On Fri, 26 Aug 2005, Nick Piggin wrote:
  
   Skipping MAP_SHARED in fork() sounds like a good idea to me...
  
  Indeed. Linus, can you remember why we haven't done this before?
 
 Hmm. Historical reasons. Also, if the child ends up needing it, it will 
 now have to fault them in.
 
 That said, I think it's a valid optimization. Especially as the child 
 _probably_ doesn't need it (ie there's at least some likelihood of an 
 execve() or similar).

I agree, seems a great idea to me (sulking because I was too dumb
to get it, even when Nick and Andi first posted their patches).

It won't just save on the copying at fork time, it'll save on
undoing it all again when the child mm is torn down for exec.

The refaulting will hurt the performance of something: let's
just hope that something doesn't turn out to be a show-stopper.

I see some flaws in the various patches posted, including Rik's.
Here's another version - doing it inside copy_page_range, so this
kind of vma special-casing is over in mm/ rather than kernel/.

No point in testing vm_file, the vm_flags cover the cases.
Test VM_MAYSHARE rather than VM_SHARED to include the never-can-be-
written MAP_SHARED cases too.  Must exclude VM_NONLINEAR, their ptes
are essential for defining the file offsets.  Must exclude VM_RESERVED,
faults on remap_pfn_range areas would usually put in anon zeroed pages
instead of the driver pages - or perhaps would be better as a test
against VM_IO, or vma-vm_ops-nopage?

Having to exclude the VM_NONLINEAR seems rather a shame, since those
are always shared and likely enormous.  The InfiniBand people's idea 
of a way for the app to set VM_DONTCOPY (to avoid rdma get_user_pages
problems) becomes attractive as a way for apps to speed their forks.

Hugh

--- 2.6.13-rc7/mm/memory.c  2005-08-24 11:13:41.0 +0100
+++ linux/mm/memory.c   2005-08-26 10:09:50.0 +0100
@@ -498,6 +498,14 @@ int copy_page_range(struct mm_struct *ds
unsigned long addr = vma-vm_start;
unsigned long end = vma-vm_end;
 
+   /*
+* Assume the fork will probably exec: don't waste time copying
+* ptes where a page fault will fill them correctly afterwards.
+*/
+   if ((vma-vm_flags  (VM_MAYSHARE|VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))
+   == VM_MAYSHARE)
+   return 0;
+
if (is_vm_hugetlb_page(vma))
return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: process creation time increases linearly with shmem

2005-08-26 Thread Nick Piggin

Hugh Dickins wrote:

On Thu, 25 Aug 2005, Linus Torvalds wrote:


That said, I think it's a valid optimization. Especially as the child 
_probably_ doesn't need it (ie there's at least some likelihood of an 
execve() or similar).



I agree, seems a great idea to me (sulking because I was too dumb
to get it, even when Nick and Andi first posted their patches).

It won't just save on the copying at fork time, it'll save on
undoing it all again when the child mm is torn down for exec.

The refaulting will hurt the performance of something: let's
just hope that something doesn't turn out to be a show-stopper.



OK let's see how Ray goes, and try it when 2.6.14 opens...


I see some flaws in the various patches posted, including Rik's.
Here's another version - doing it inside copy_page_range, so this
kind of vma special-casing is over in mm/ rather than kernel/.



Yeah I guess that's a good idea. Patch looks pretty good.
Just a minor issue with the comment, it is not strictly
just assuming the child will exec... IMO it is worthwhile
in Ray's case even if his forked process _eventually_ ends
up touching all the shared memory pages, it is better to
avoid many ms of fork overhead.

Also, on NUMA systems this will help get page tables allocated
on the right nodes, which is not an insignificant problem for
big HPC jobs.

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: process creation time increases linearly with shmem

2005-08-26 Thread Hugh Dickins
On Fri, 26 Aug 2005, Ross Biro wrote:
 On 8/26/05, Hugh Dickins [EMAIL PROTECTED] wrote:
  
  The refaulting will hurt the performance of something: let's
  just hope that something doesn't turn out to be a show-stopper.
 
 Why not just fault in all the pages on the first fault. Then the performance 
 loss is a single page fault (the page table copy that would have happened a 
 fork time now happens at fault time) and you get the big win for processes 
 that do fork/exec.

all might be very many more pages than were ever mapped in the parent,
and not be a win.  Some faultahead might work better.  Might, might, ...

Hugh
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: process creation time increases linearly with shmem

2005-08-26 Thread Ross Biro
On 8/26/05, Hugh Dickins [EMAIL PROTECTED] wrote:
 On Fri, 26 Aug 2005, Ross Biro wrote:
  On 8/26/05, Hugh Dickins [EMAIL PROTECTED] wrote:
  
   The refaulting will hurt the performance of something: let's
   just hope that something doesn't turn out to be a show-stopper.
 
  Why not just fault in all the pages on the first fault. Then the performance
  loss is a single page fault (the page table copy that would have happened a
  fork time now happens at fault time) and you get the big win for processes
  that do fork/exec.
 
 all might be very many more pages than were ever mapped in the parent,
 and not be a win.  Some faultahead might work better.  Might, might, ...

If you reduce all to whatever would have been done in fork
originially, then you've got a big win in some cases and a minimal
loss in others, and it's easy to argue you've got something better.

Now changng all to something even less might be an even bigger win,
but that requires a lot of benchmarking to justify.

Ross
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: process creation time increases linearly with shmem

2005-08-26 Thread Ray Fucillo

Nick Piggin wrote:

OK let's see how Ray goes, and try it when 2.6.14 opens...


Working on that now - I'll let you know.


Yeah I guess that's a good idea. Patch looks pretty good.
Just a minor issue with the comment, it is not strictly
just assuming the child will exec... IMO it is worthwhile
in Ray's case even if his forked process _eventually_ ends
up touching all the shared memory pages, it is better to
avoid many ms of fork overhead.


Yes, in our database system the child will immediately touch some shmem 
pages, and may eventually touch most of them (and would almost never 
exec()).  Fork performance is critical in usage scenarios where an 
end-user database request forks a new server process from one master 
server process.


However, there is still a need that the child, once successfully forked, 
is operational reasonably quickly.  I suspect that Ross's idea of paging 
in everything after the first fault would not be optimal for us, because 
we'd still be talking about hundreds of ms of work done before the child 
does anything useful.  It would still be far better than the behavior we 
have today because that time would no longer be synchronous with the 
fork().  Of course, it sounds like our app might be able to make use of 
the hugetlb stuff can mitigate this problem in the future...

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: process creation time increases linearly with shmem

2005-08-26 Thread Rik van Riel
On Fri, 26 Aug 2005, Ray Fucillo wrote:

 However, there is still a need that the child, once successfully forked, is
 operational reasonably quickly.  I suspect that Ross's idea of paging in
 everything after the first fault would not be optimal for us, because we'd
 still be talking about hundreds of ms of work done before the child does
 anything useful. 

Simply skipping the page table setup of MAP_SHARED regions
should be enough to fix this issue.

 It would still be far better than the behavior we have today because 
 that time would no longer be synchronous with the fork().

Filling in all the page table entries at the first fault to
a VMA doesn't make much sense, IMHO.

The reason I think this is that people have experimented
with prefaulting already resident pages at page fault time,
and those experiments have never shown a conclusive benefit.

Now, if doing such prefaulting for normal processes does not
show a benefit - why would it be beneficial to recently forked
processes with a huge SHM area ?

I suspect we would be better off without that extra complexity,
unless there is a demonstrated benefit to it.

-- 
All Rights Reversed
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: process creation time increases linearly with shmem

2005-08-26 Thread Linus Torvalds


On Fri, 26 Aug 2005, Hugh Dickins wrote:
> 
> I see some flaws in the various patches posted, including Rik's.
> Here's another version - doing it inside copy_page_range, so this
> kind of vma special-casing is over in mm/ rather than kernel/.

I like this approach better, but I don't understand your particular 
choice of bits.

> +	 * Assume the fork will probably exec: don't waste time copying
> +	 * ptes where a page fault will fill them correctly afterwards.
> +	 */
> +	if ((vma->vm_flags & (VM_MAYSHARE|VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))
> +			== VM_MAYSHARE)
> +		return 0;
> +
> 	if (is_vm_hugetlb_page(vma))
> 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);

First off, if you just did it below the hugetlb check, you'd not need to
check hugetlb again. And while I understand VM_NONLINEAR and VM_RESERVED,
can you please comment on why VM_MAYSHARE is so important, and why no
other information matters.

Now, VM_MAYSHARE is a sign of the mapping being a shared mapping. Fair 
enough. But afaik, a shared anonymous mapping absolutely needs its page 
tables copied, because those page tables contain either the pointers to 
the shared pages, or the swap entries.

So I really think you need to verify that it's a file mapping too.

Also, arguably, there are other cases that may or may not be worth 
worrying about. What about non-shared non-writable file mappings? What 
about private mappings that haven't been COW'ed? 

So I think that in addition to your tests, you should test for
vma->vm_file, and you could toy with testing for vma->anon_vma being
NULL (the latter will cause a _lot_ of hits, because any read-only private
mapping will trigger, but it's a good stress-test and conceptually
interesting, even if I suspect it will kill any performance gain through
extra minor faults in the child).

Linus


Re: process creation time increases linearly with shmem

2005-08-26 Thread Ross Biro
On 8/26/05, Rik van Riel <[EMAIL PROTECTED]> wrote:
> 
> Filling in all the page table entries at the first fault to
> a VMA doesn't make much sense, IMHO.
> 
> 
> I suspect we would be better off without that extra complexity,
> unless there is a demonstrated benefit to it.

You are probably right, but do you want to put in a patch that might
have a big performance impact in either direction without verifying
it?

My suggestion is safe, but most likely sub-optimal.  What everyone
else is suggesting may be far better, but needs to be verified first.

I'm suggesting that we change the code to do the same work fork would
have done on the first page fault immediately, since it's easy to
argue that it's not much worse than we have now and much better in
many cases, and then try to experiment and figure out  what the
correct solution is.

Ross


Re: process creation time increases linearly with shmem

2005-08-26 Thread Hugh Dickins
On Fri, 26 Aug 2005, Linus Torvalds wrote:
> On Fri, 26 Aug 2005, Hugh Dickins wrote:
> > 
> > I see some flaws in the various patches posted, including Rik's.
> > Here's another version - doing it inside copy_page_range, so this
> > kind of vma special-casing is over in mm/ rather than kernel/.
> 
> I like this approach better, but I don't understand your particular 
> choice of bits.
> 
> > +	 * Assume the fork will probably exec: don't waste time copying
> > +	 * ptes where a page fault will fill them correctly afterwards.
> > +	 */
> > +	if ((vma->vm_flags & (VM_MAYSHARE|VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))
> > +			== VM_MAYSHARE)
> > +		return 0;
> > +
> > 	if (is_vm_hugetlb_page(vma))
> > 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
> 
> First off, if you just did it below the hugetlb check, you'd not need to
> check hugetlb again.

Yes: I wanted to include VM_HUGETLB in the list as documentation really;
and it costs nothing to test it along with the other flags - or are there
architectures where the more bits you test, the costlier?

> And while I understand VM_NONLINEAR and VM_RESERVED,
> can you please comment on why VM_MAYSHARE is so important, and why no
> other information matters.

The VM_MAYSHARE one isn't terribly important, there's no correctness
reason to replace VM_SHARED there.   It's just that do_mmap_pgoff takes
VM_SHARED and VM_MAYWRITE off a MAP_SHARED mapping of a file which was
not opened for writing.  We can safely avoid copying the ptes of such a
vma, just as with the writable ones, but the VM_MAYSHARE test catches
them where the VM_SHARED test does not.
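[For reference, a paraphrased sketch of the do_mmap_pgoff behaviour Hugh
describes - reconstructed from his description, not quoted from the 2.6
source, so treat the details as approximate:]

	/* MAP_SHARED in do_mmap_pgoff (sketch): start with both bits set */
	vm_flags |= VM_SHARED | VM_MAYSHARE;
	if (!(file->f_mode & FMODE_WRITE))
		/* file not opened for writing: VM_SHARED and VM_MAYWRITE
		 * are taken off, but VM_MAYSHARE stays set */
		vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

	/* so in fork, a VM_MAYSHARE test also catches read-only MAP_SHARED
	 * vmas whose ptes are equally safe to skip */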

> Now, VM_MAYSHARE is a sign of the mapping being a shared mapping. Fair 
> enough. But afaik, a shared anonymous mapping absolutely needs its page 
> tables copied, because those page tables contain either the pointers to 
> the shared pages, or the swap entries.
> 
> So I really think you need to verify that it's a file mapping too.

Either I'm misunderstanding, or you're remembering back to how shared
anonymous was done in 2.2 (perhaps).  In 2.4 and 2.6, shared anonymous
is backed by a shared memory object, created by shmem_zero_setup:
which sets vm_file even though we came into do_mmap_pgoff with no file.

> Also, arguably, there are other cases that may or may not be worth 
> worrying about. What about non-shared non-writable file mappings? What 
> about private mappings that haven't been COW'ed? 

Non-shared non-currently-writable file mappings might have been writable
and modified in the past, so we cannot necessarily skip those.

We could, and I did, consider testing whether the vma has an anon_vma:
we always allocate a vma's anon_vma just before first allocating it a
private page, and it's a good test which swapoff uses to narrow its
search.

But partly I thought that a little too tricksy, and hard to explain;
and partly I thought it was liable to catch the executable text,
some of which is most likely to be needed in between fork and exec.

> So I think that in addition to your tests, you should test for
> vma->vm_file, and you could toy with testing for vma->anon_vma being
> NULL (the latter will cause a _lot_ of hits, because any read-only private
> mapping will trigger, but it's a good stress-test and conceptually
> interesting, even if I suspect it will kill any performance gain through
> extra minor faults in the child).

Ah yes, I wrote the paragraph above before reading this one, honest!

Well, I still don't think we need to test vm_file.  We can add an
anon_vma test if you like, if we really want to minimize the fork
overhead, in favour of later faults.  Do we?

Hugh
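
[A minimal sketch of the anon_vma test discussed above, assuming it would
sit next to the flags check in copy_page_range - illustrative only, not a
patch from the thread:]

	/* No anon_vma means no private page was ever allocated in this
	 * vma, so the fault path can refill every pte from the backing
	 * object: skip the copy at fork. */
	if (!vma->anon_vma)
		return 0;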


Re: process creation time increases linearly with shmem

2005-08-26 Thread Hugh Dickins
On Fri, 26 Aug 2005, Ross Biro wrote:
> On 8/26/05, Rik van Riel <[EMAIL PROTECTED]> wrote:
> > 
> > Filling in all the page table entries at the first fault to
> > a VMA doesn't make much sense, IMHO.
> > 
> > I suspect we would be better off without that extra complexity,
> > unless there is a demonstrated benefit to it.
> 
> You are probably right, but do you want to put in a patch that might
> have a big performance impact in either direction without verifying
> it?
> 
> My suggestion is safe, but most likely sub-optimal.  What everyone
> else is suggesting may be far better, but needs to be verified first.

It all has to be verified, and the problem will be that some things
fare well and others badly: how to reach a balanced decision?
Following your suggestion is no more safe than not following it.

> I'm suggesting that we change the code to do the same work fork would
> have done on the first page fault immediately, since it's easy to
> argue that it's not much worse than we have now and much better in
> many cases, and then try to experiment and figure out what the
> correct solution is.

We don't know what work fork would have done, that information was in
the ptes we decided not to bother to copy.  Perhaps every pte of the
vma was set, perhaps none, perhaps only one.

Also, doing it at fault time has significantly more work to do than
just zipping along the ptes incrementing page counts and clearing bits.
I think; but probably much less extra work than I originally imagined,
since Andrew gave us the gang lookup of the page cache.

All the same, I'm with Rik: one of the great virtues of the original
idea was its simplicity; I'd prefer not to add complexity.

Hugh


Re: process creation time increases linearly with shmem

2005-08-26 Thread Linus Torvalds


On Fri, 26 Aug 2005, Hugh Dickins wrote:

> Well, I still don't think we need to test vm_file.  We can add an
> anon_vma test if you like, if we really want to minimize the fork
> overhead, in favour of later faults.  Do we?

I think we might want to do it in -mm for testing. Because quite frankly, 
otherwise the new fork() logic won't get a lot of testing. Shared memory 
isn't that common.

Linus


Re: process creation time increases linearly with shmem

2005-08-26 Thread Rik van Riel
On Fri, 26 Aug 2005, Hugh Dickins wrote:

> Well, I still don't think we need to test vm_file.  We can add an
> anon_vma test if you like, if we really want to minimize the fork
> overhead, in favour of later faults.  Do we?

When you consider NUMA placement (the child process may
end up running elsewhere), allocating things like page
tables lazily may well end up being a performance win.

-- 
All Rights Reversed


Re: process creation time increases linearly with shmem

2005-08-26 Thread Linus Torvalds


On Fri, 26 Aug 2005, Rik van Riel wrote:
> On Fri, 26 Aug 2005, Hugh Dickins wrote:
> > 
> > Well, I still don't think we need to test vm_file.  We can add an
> > anon_vma test if you like, if we really want to minimize the fork
> > overhead, in favour of later faults.  Do we?
> 
> When you consider NUMA placement (the child process may
> end up running elsewhere), allocating things like page
> tables lazily may well end up being a performance win.

It should be easy enough to benchmark something like kernel compiles etc, 
which are reasonably fork-rich and should show a good mix for something 
like this. Or even just something like the time to restart an X session 
after you've brought it into memory once.

Linus


Re: process creation time increases linearly with shmem

2005-08-25 Thread Linus Torvalds


On Fri, 26 Aug 2005, Nick Piggin wrote:
> 
> > Skipping MAP_SHARED in fork() sounds like a good idea to me...
> > 
> 
> Indeed. Linus, can you remember why we haven't done this before?

Hmm. Historical reasons. Also, if the child ends up needing it, it will 
now have to fault them in.

That said, I think it's a valid optimization. Especially as the child 
_probably_ doesn't need it (ie there's at least some likelihood of an 
execve() or similar).

Linus


Re: process creation time increases linearly with shmem

2005-08-25 Thread Rik van Riel
On Fri, 26 Aug 2005, Nick Piggin wrote:

> > Skipping MAP_SHARED in fork() sounds like a good idea to me...
> 
> Indeed. Linus, can you remember why we haven't done this before?

Where "this" looks something like the patch below, shamelessly
merging Nick's and Andi's patches and adding the initialization
of retval.

I suspect this may be a measurable win on database servers with
a web frontend, where the connections to the database server are
set up basically for each individual query, and don't stick around
for a long time.

No, I haven't actually tested this patch - but feel free to go
wild while I sign off for the night.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>

--- linux-2.6.12/kernel/fork.c.mapshared	2005-08-25 18:40:44.000000000 -0400
+++ linux-2.6.12/kernel/fork.c	2005-08-25 18:47:16.000000000 -0400
@@ -184,7 +184,7 @@
 {
 	struct vm_area_struct * mpnt, *tmp, **pprev;
 	struct rb_node **rb_link, *rb_parent;
-	int retval;
+	int retval = 0;
 	unsigned long charge;
 	struct mempolicy *pol;
 
@@ -265,7 +265,10 @@
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, current->mm, tmp);
+		/* Skip pte copying if page faults can take care of things. */
+		if (!file || !(tmp->vm_flags & VM_SHARED) ||
+				is_vm_hugetlb_page(vma))
+			retval = copy_page_range(mm, current->mm, tmp);
 		spin_unlock(&mm->page_table_lock);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)


Re: process creation time increases linearly with shmem

2005-08-25 Thread Nick Piggin

Rik van Riel wrote:

> On Thu, 25 Aug 2005, Nick Piggin wrote:
> 
> > fork() can be changed so as not to set up page tables for
> > MAP_SHARED mappings. I think that has other tradeoffs like
> > initially causing several unavoidable faults reading
> > libraries and program text.
> 
> Actually, libraries and program text are usually mapped
> MAP_PRIVATE, so those would still be copied.

Yep, that seems to be the case here.

> Skipping MAP_SHARED in fork() sounds like a good idea to me...

Indeed. Linus, can you remember why we haven't done this before?

--
SUSE Labs, Novell Inc.



Re: process creation time increases linearly with shmem

2005-08-25 Thread Rik van Riel
On Thu, 25 Aug 2005, Nick Piggin wrote:

> fork() can be changed so as not to set up page tables for
> MAP_SHARED mappings. I think that has other tradeoffs like
> initially causing several unavoidable faults reading
> libraries and program text.

Actually, libraries and program text are usually mapped
MAP_PRIVATE, so those would still be copied.

Skipping MAP_SHARED in fork() sounds like a good idea to me...

-- 
All Rights Reversed


Re: process creation time increases linearly with shmem

2005-08-25 Thread Andi Kleen
On Thursday 25 August 2005 16:47, Parag Warudkar wrote:

> Exactly - one problem is that this forces all of the hugetlb users to go
> the lazy faulting way. 
Actually I disabled it for hugetlbfs (... !is_huge...vma). The reason 
is that lazy faulting for huge pages is still not in mainline.

-Andi


Re: process creation time increases linearly with shmem

2005-08-25 Thread Parag Warudkar
On Thu, 2005-08-25 at 16:22 +0200, Andi Kleen wrote:
> But I'm not sure it's a good idea in all cases. Would need a lot of 
> benchmarking  at least.
> 
> -Andi
> 

Exactly - one problem is that this forces all of the hugetlb users to go
the lazy faulting way. This is more or less similar to the original
problem the fork() forces everything to be mapped and some apps don't
like it. Same way, some apps may not want hugetlb pages to be all
pre-mapped. 

That's why I was alluding towards having the user specify MAP_SHARED|
MAP_LAZY or something to that tune and then have fork() honor it. So
people who want all things pre-mapped will not specify MAP_LAZY, just
MAP_SHARED. 

Now I don't even know if the above is possible and workable for all
scenarios, but that's why I was asking.. :)

Parag



Re: process creation time increases linearly with shmem

2005-08-25 Thread Nick Piggin

Andi Kleen wrote:

> > Would it be worth trying to do something like this?
> 
> Maybe. Shouldn't be very hard though - you just need to check if the VMA is 
> backed by an object and if yes don't call copy_page_range for it.
> 
> I think it just needs (untested) 

I think you need to check for MAP_SHARED as well, because
MAP_PRIVATE mapping of a file could be modified in parent.

See patch I posted just now.

Also, do you need any special case for hugetlb?

> Index: linux-2.6.13-rc5-misc/kernel/fork.c
> ===================================================================
> --- linux-2.6.13-rc5-misc.orig/kernel/fork.c
> +++ linux-2.6.13-rc5-misc/kernel/fork.c
> @@ -265,7 +265,8 @@ static inline int dup_mmap(struct mm_str
>  		rb_parent = &tmp->vm_rb;
>  
>  		mm->map_count++;
> -		retval = copy_page_range(mm, current->mm, tmp);
> +		if (!file && !is_vm_hugetlb_page(vma))
> +			retval = copy_page_range(mm, current->mm, tmp);
>  		spin_unlock(&mm->page_table_lock);
>  
>  		if (tmp->vm_ops && tmp->vm_ops->open)
> 
> But I'm not sure it's a good idea in all cases. Would need a lot of 
> benchmarking at least.


Yep. I'm sure it must have come up in the past, and Linus
must have said something about best-for-most.

--
SUSE Labs, Novell Inc.



Re: process creation time increases linearly with shmem

2005-08-25 Thread Nick Piggin

Ray Fucillo wrote:

> Nick Piggin wrote:
> 
> > fork() can be changed so as not to set up page tables for
> > MAP_SHARED mappings. I think that has other tradeoffs like
> > initially causing several unavoidable faults reading
> > libraries and program text.
> > 
> > What kind of application are you using?
> 
> The application is a database system called Caché.  We allocate a large 
> shared memory segment for database cache, which in a large production 
> environment may realistically be 1+GB on 32-bit platforms and much 
> larger on 64-bit.  At these sizes fork() is taking hundreds of 
> milliseconds, which can become a noticeable bottleneck for us.  This 
> performance characteristic seems to be unique to Linux vs other Unix 
> implementations.

As Andi said, hugepages might be a very nice feature for you guys
to look into and might potentially give a performance increase with
reduced TLB pressure, not only your immediate fork problem.

Anyway, the attached patch is something you could try testing. If
you do so, then I would be very interested to see performance results.

Thanks,
Nick

--
SUSE Labs, Novell Inc.

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2005-08-04 15:24:36.000000000 +1000
+++ linux-2.6/kernel/fork.c	2005-08-26 00:20:50.000000000 +1000
@@ -256,7 +256,6 @@ static inline int dup_mmap(struct mm_str
 		 * Note that, exceptionally, here the vma is inserted
 		 * without holding mm->mmap_sem.
 		 */
-		spin_lock(&mm->page_table_lock);
 		*pprev = tmp;
 		pprev = &tmp->vm_next;
 
@@ -265,8 +264,11 @@ static inline int dup_mmap(struct mm_str
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, current->mm, tmp);
-		spin_unlock(&mm->page_table_lock);
+		if (!(file && (tmp->vm_flags & VM_SHARED))) {
+			spin_lock(&mm->page_table_lock);
+			retval = copy_page_range(mm, current->mm, tmp);
+			spin_unlock(&mm->page_table_lock);
+		}
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);


Re: process creation time increases linearly with shmem

2005-08-25 Thread Andi Kleen

> Would it be worth trying to do something like this?

Maybe. Shouldn't be very hard though - you just need to check if the VMA is 
backed by an object and if yes don't call copy_page_range for it.

I think it just needs (untested) 

Index: linux-2.6.13-rc5-misc/kernel/fork.c
===================================================================
--- linux-2.6.13-rc5-misc.orig/kernel/fork.c
+++ linux-2.6.13-rc5-misc/kernel/fork.c
@@ -265,7 +265,8 @@ static inline int dup_mmap(struct mm_str
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, current->mm, tmp);
+		if (!file && !is_vm_hugetlb_page(vma))
+			retval = copy_page_range(mm, current->mm, tmp);
 		spin_unlock(&mm->page_table_lock);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)

But I'm not sure it's a good idea in all cases. Would need a lot of 
benchmarking  at least.

-Andi


Re: process creation time increases linearly with shmem

2005-08-25 Thread Parag Warudkar
> Ray Fucillo <[EMAIL PROTECTED]> writes:
> > 
> > The application is a database system called Caché.  We allocate a
> > large shared memory segment for database cache, which in a large
> > production environment may realistically be 1+GB on 32-bit platforms
> > and much larger on 64-bit.  At these sizes fork() is taking hundreds
> > of milliseconds, which can become a noticeable bottleneck for us.  This
> > performance characteristic seems to be unique to Linux vs other Unix
> > implementations.
> 
> You could set up hugetlbfs and use large pages for the SHM (with SHM_HUGETLB);
> then the overhead of walking the pages of it at fork would be much lower.
> 
> -Andi
> -

Why isn't the page walk for the shared memory done lazily, though? That would 
be better, since applications most likely do not want to page in all of the 
shared memory at once. Program logic/requirements should dictate this instead 
of fork making it compulsory. I think this is because we don't distinguish 
between shared libraries, program text and explicitly shared memory as the 
above application does - everything is MAP_SHARED.

As someone mentioned, this causes unavoidable faults for reading in shared 
libraries and program text. But if there were a MAP_SHARED|MAP_LAZY - could 
fork() then be set up not to set up page tables for such mappings, and still 
continue to map the plain MAP_SHARED ones so program text and libraries don't 
cause faults? Applications could then specify MAP_SHARED|MAP_LAZY and not 
incur the overhead of the page table walk for the shared memory all at once.

Would it be worth trying to do something like this?

Parag
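
[A purely hypothetical sketch of the interface suggested above - MAP_LAZY
does not exist in any kernel, and the flag value below is invented for
illustration:]

#include <sys/mman.h>

#ifndef MAP_LAZY
#define MAP_LAZY 0x8000000      /* made-up value, not a real kernel flag */
#endif

static void *map_cache_lazily(int shm_fd, size_t cache_size)
{
        /* fork() would skip pte copying only for vmas mapped with MAP_LAZY,
         * so plain MAP_SHARED/MAP_PRIVATE text and libraries would still
         * be pre-mapped as they are today */
        return mmap(NULL, cache_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_LAZY, shm_fd, 0);
}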





Re: process creation time increases linearly with shmem

2005-08-25 Thread Andi Kleen
Ray Fucillo <[EMAIL PROTECTED]> writes:
> 
> The application is a database system called Caché.  We allocate a
> large shared memory segment for database cache, which in a large
> production environment may realistically be 1+GB on 32-bit platforms
> and much larger on 64-bit.  At these sizes fork() is taking hundreds
> of milliseconds, which can become a noticeable bottleneck for us.  This
> performance characteristic seems to be unique to Linux vs other Unix
> implementations.

You could set up hugetlbfs and use large pages for the SHM (with SHM_HUGETLB);
then the overhead of walking the pages of it at fork would be much lower.

-Andi
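
[A minimal, illustrative use of the SHM_HUGETLB route Andi suggests - sizes
and permissions are arbitrary, and huge pages must be reserved first, e.g.
via /proc/sys/vm/nr_hugepages:]

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000       /* from <linux/shm.h>, if libc headers lack it */
#endif

int main(void)
{
        /* size must be a multiple of the huge page size */
        size_t size = 256UL << 20;
        int id = shmget(IPC_PRIVATE, size,
                        SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
        if (id < 0) {
                perror("shmget(SHM_HUGETLB)");
                return 1;
        }
        char *p = shmat(id, NULL, 0);
        if (p == (char *)-1) {
                perror("shmat");
                return 1;
        }
        /* each huge page needs one entry instead of hundreds of ptes, so
         * the fork-time walk over this segment is correspondingly shorter */
        shmdt(p);
        shmctl(id, IPC_RMID, NULL);
        return 0;
}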


Re: process creation time increases linearly with shmem

2005-08-25 Thread Ray Fucillo

Nick Piggin wrote:

> fork() can be changed so as not to set up page tables for
> MAP_SHARED mappings. I think that has other tradeoffs like
> initially causing several unavoidable faults reading
> libraries and program text.
> 
> What kind of application are you using?

The application is a database system called Caché.  We allocate a large 
shared memory segment for database cache, which in a large production 
environment may realistically be 1+GB on 32-bit platforms and much 
larger on 64-bit.  At these sizes fork() is taking hundreds of 
milliseconds, which can become a noticeable bottleneck for us.  This 
performance characteristic seems to be unique to Linux vs other Unix 
implementations.




Re: process creation time increases linearly with shmem

2005-08-24 Thread Nick Piggin

Ray Fucillo wrote:

> I am seeing process creation time increase linearly with the size of the 
> shared memory segment that the parent touches.  The attached forktest.c 
> is a very simple user program that illustrates this behavior, which I 
> have tested on various kernel versions from 2.4 through 2.6.  Is this a 
> known issue, and is it solvable?




fork() can be changed so as not to set up page tables for
MAP_SHARED mappings. I think that has other tradeoffs like
initially causing several unavoidable faults reading
libraries and program text.

What kind of application are you using?

--
SUSE Labs, Novell Inc.
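
[The forktest.c attachment itself is not preserved in this archive; the
following is a reconstruction of the test as described above - names, sizes
and iteration counts are invented, this is not Ray's original code:]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        size_t mb = argc > 1 ? strtoul(argv[1], NULL, 0) : 100;
        size_t size = mb << 20;
        struct timeval t0, t1;
        int i, iters = 10;

        int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
        if (id < 0) {
                perror("shmget");
                return 1;
        }
        char *p = shmat(id, NULL, 0);
        shmctl(id, IPC_RMID, NULL);     /* segment is freed on last detach */
        if (p == (char *)-1) {
                perror("shmat");
                return 1;
        }
        memset(p, 1, size);             /* parent touches every page */

        gettimeofday(&t0, NULL);
        for (i = 0; i < iters; i++) {
                pid_t pid = fork();
                if (pid == 0)
                        _exit(0);       /* child exits immediately */
                waitpid(pid, NULL, 0);
        }
        gettimeofday(&t1, NULL);
        printf("%zu MB touched: %.1f ms per fork\n", mb,
               ((t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_usec - t0.tv_usec)) / iters / 1000.0);
        return 0;
}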
