Re: process creation time increases linearly with shmem
On Tue, 30 Aug 2005, Nick Piggin wrote:
>
> Andrew, did you pick up the patch or should I resend to someone?

I picked it up. If it causes performance regressions, we can fix them, and if it causes other problems then that will be interesting in itself.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: process creation time increases linearly with shmem
Ray Fucillo wrote:
> Nick Piggin wrote:
>> How does the following look? (I changed the comment a bit). Andrew,
>> please apply if nobody objects.
>
> Nick, I applied this latest patch to a 2.6.12 kernel and found that it
> does resolve the problem. Prior to the patch on this machine, I was
> seeing about 23ms spent in fork for every 100MB of shared memory
> segment. After applying the patch, fork is taking about 1ms regardless
> of the shared memory size.

Hi Ray,

That's good news. I think we should probably consider putting the patch in 2.6.14, or if not, then definitely 2.6.15.

Andrew, did you pick up the patch or should I resend to someone?

I think the fork latency alone is enough to justify inclusion... however, did you actually see increased aggregate throughput of your database (or at least not a _decreased_ throughput)?

> Many thanks to everyone for your help on this.

Well thank you very much for breaking the kernel and telling us about it! :)

Nick
Re: process creation time increases linearly with shmem
On Mon, 29 Aug 2005, Ray Fucillo wrote:
>
> FWIW, an interesting side effect of this occurs when I run the database
> with this patch internally on a Linux server that uses NIS. It's an
> unrelated problem and not a kernel problem. It's due to the children
> calling initgroups()... apparently when you have many processes making
> simultaneous initgroups() calls something starts imposing very long
> waits in increments of 3 seconds

Sounds like something is backing off by waiting for three seconds whenever some lock failure occurs. I don't see what locking the code might want to do (it should just do the NIS equivalent of reading /etc/groups and do a "setgroups()" system call), but I assume that the NIS server ends up having some strange locking.

You might do an "ltrace testcase" (and, probably, the NIS server) to see if you can see where it happens, and bug the appropriate maintainers. Especially if you have a repeatable test-case (where "repeatable" isn't just for your particular machine: it's probably timing-related), somebody might even fix it ;)

Linus
Re: process creation time increases linearly with shmem
Nick Piggin wrote:
> How does the following look? (I changed the comment a bit). Andrew,
> please apply if nobody objects.

Nick, I applied this latest patch to a 2.6.12 kernel and found that it does resolve the problem. Prior to the patch on this machine, I was seeing about 23ms spent in fork for every 100MB of shared memory segment. After applying the patch, fork is taking about 1ms regardless of the shared memory size. Many thanks to everyone for your help on this.

FWIW, an interesting side effect of this occurs when I run the database with this patch internally on a Linux server that uses NIS. It's an unrelated problem and not a kernel problem. It's due to the children calling initgroups()... apparently when you have many processes making simultaneous initgroups() calls, something starts imposing very long waits in increments of 3 seconds, so some processes return from initgroups() in a few ms and other processes complete in 3, 6, 9, up to 21 seconds (plus a few ms). I'm not sure what the story is with that, though it's clearly not a kernel issue. If someone happens to have the answer or a suggestion, great; otherwise I'll pursue that elsewhere as necessary. (I can reproduce this by simply adding a call to initgroups() in the child of the forktest program that I sent earlier.)
Re: process creation time increases linearly with shmem
Hugh Dickins wrote:
> On Sun, 28 Aug 2005, Nick Piggin wrote:
>>
>> This is the condition I ended up with. Any good?
>>
>> if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
>> 	if (vma->vm_flags & VM_MAYSHARE)
>> 		return 0;
>> 	if (vma->vm_file && !vma->anon_vma)
>> 		return 0;
>> }
>
> It's not bad, and practical timings are unlikely to differ, but your
> VM_MAYSHARE test is redundant (VM_MAYSHARE areas don't have anon_vmas *),
> and your vm_file test is unnecessary, excluding pure anonymous areas
> which haven't yet taken a fault.

Haven't taken a _write_ fault? Hmm, OK, that would seem to be a good optimisation as well: we don't need to copy anon memory with only ZERO_PAGE mappings... well, good as in "nice and logical" if not so much "will make a difference"!

> Please do send Andrew the patch for -mm, Nick: you were one of the
> creators of this (don't omit credit to Ray, Parag, Andi, Rik, Linus),
> much better that it go in your name (heh, heh, heh, can you trust me?)

Well, Andi and I seemed to have the idea independently, Linus thought private would be a good idea (I agree), you came up with the complete patch with others contributing bits and pieces, and most importantly Ray brought our attention to the possible deficiency in our mm.

> Hugh
>
> * That's ignoring, as we do everywhere else, the case which came up
> a couple of weeks back in discussions with Linus, ptrace writing to
> an area the process does not have write access to, creating an anon
> page within a shared vma: that's an awkward case currently mishandled,
> but the patch below does it no harm.

And in that case maybe your patch works better anyway, because the child will inherit that page from the parent.

How does the following look? (I changed the comment a bit). Andrew, please apply if nobody objects.

--
SUSE Labs, Novell Inc.

Defer copying of ptes until fault time when it is possible to reconstruct the pte from backing store. Idea from Andi Kleen and Nick Piggin. Thanks to input from Rik van Riel and Linus and to Hugh for correcting my blundering.

[ Note to akpm: Ray Fucillo <[EMAIL PROTECTED]>'s results go here ]

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2005-08-13 11:16:34.0 +1000
+++ linux-2.6/mm/memory.c	2005-08-28 16:41:32.0 +1000
@@ -498,6 +498,17 @@ int copy_page_range(struct mm_struct *ds
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
 
+	/*
+	 * Don't copy ptes where a page fault will fill them correctly.
+	 * Fork becomes much lighter when there are big shared or private
+	 * readonly mappings. The tradeoff is that copy_page_range is more
+	 * efficient than faulting.
+	 */
+	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
+		if (!vma->anon_vma)
+			return 0;
+	}
+
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
Re: process creation time increases linearly with shmem
On Sun, 28 Aug 2005, Nick Piggin wrote:
>
> This is the condition I ended up with. Any good?
>
> if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
> 	if (vma->vm_flags & VM_MAYSHARE)
> 		return 0;
> 	if (vma->vm_file && !vma->anon_vma)
> 		return 0;
> }

It's not bad, and practical timings are unlikely to differ, but your VM_MAYSHARE test is redundant (VM_MAYSHARE areas don't have anon_vmas *), and your vm_file test is unnecessary, excluding pure anonymous areas which haven't yet taken a fault.

Please do send Andrew the patch for -mm, Nick: you were one of the creators of this (don't omit credit to Ray, Parag, Andi, Rik, Linus), much better that it go in your name (heh, heh, heh, can you trust me?)

Hugh

* That's ignoring, as we do everywhere else, the case which came up a couple of weeks back in discussions with Linus, ptrace writing to an area the process does not have write access to, creating an anon page within a shared vma: that's an awkward case currently mishandled, but the patch below does it no harm.

--- 2.6.13-rc7/mm/memory.c	2005-08-24 11:13:41.0 +0100
+++ linux/mm/memory.c	2005-08-28 04:48:34.0 +0100
@@ -498,6 +498,15 @@ int copy_page_range(struct mm_struct *ds
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
 
+	/*
+	 * Assume the fork will probably exec: don't waste time copying
+	 * ptes where a page fault will fill them correctly afterwards.
+	 */
+	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
+		if (!vma->anon_vma)
+			return 0;
+	}
+
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
Re: process creation time increases linearly with shmem
Linus Torvalds wrote:
> On Fri, 26 Aug 2005, Rik van Riel wrote:
>> On Fri, 26 Aug 2005, Hugh Dickins wrote:
>>> Well, I still don't think we need to test vm_file. We can add an
>>> anon_vma test if you like, if we really want to minimize the fork
>>> overhead, in favour of later faults. Do we?
>>
>> When you consider NUMA placement (the child process may end up
>> running elsewhere), allocating things like page tables lazily may
>> well end up being a performance win.
>
> It should be easy enough to benchmark something like kernel compiles
> etc, which are reasonably fork-rich and should show a good mix for
> something like this. Or even just something like "time to restart an
> X session" after you've brought it into memory once.

2.6.13-rc7-git2 kbuild (make -j4) on dual G5.

plain
228.85user 19.90system 2:06.50elapsed 196%CPU (3725666minor)
228.91user 19.90system 2:06.07elapsed 197%CPU (3721353minor)
229.00user 19.78system 2:06.20elapsed 197%CPU (3721345minor)
228.81user 19.94system 2:06.05elapsed 197%CPU (3723791minor)

nocopy shared
229.28user 19.76system 2:06.24elapsed 197%CPU (3725661minor)
229.04user 19.91system 2:06.92elapsed 196%CPU (3718904minor)
228.97user 20.06system 2:06.46elapsed 196%CPU (3723807minor)
229.24user 19.84system 2:06.13elapsed 197%CPU (3723793minor)

nocopy all
228.74user 19.87system 2:06.27elapsed 196%CPU (3819927minor)
228.89user 19.81system 2:05.89elapsed 197%CPU (3822943minor)
228.77user 19.73system 2:06.23elapsed 196%CPU (3820517minor)
228.93user 19.70system 2:05.84elapsed 197%CPU (3822935minor)

I'd say the full test (including anon_vma) is maybe slightly faster on this test, though maybe it isn't significant. It is doing around 2.5% more minor faults, though the profiles say copy_page_range time is reduced as one would expect.

I think that if all else (ie. final performance) is equal, then faulting is better than copying, because the work is being deferred until it is needed, and we dodge some pathological cases like Ray's database taking 100s of ms to fork (we hope!)

However it will always depend on workload.

This is the condition I ended up with. Any good?

if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
	if (vma->vm_flags & VM_MAYSHARE)
		return 0;
	if (vma->vm_file && !vma->anon_vma)
		return 0;
}

--
SUSE Labs, Novell Inc.
Re: process creation time increases linearly with shmem
On Fri, 26 Aug 2005, Rik van Riel wrote:
> On Fri, 26 Aug 2005, Hugh Dickins wrote:
>
>> Well, I still don't think we need to test vm_file. We can add an
>> anon_vma test if you like, if we really want to minimize the fork
>> overhead, in favour of later faults. Do we?
>
> When you consider NUMA placement (the child process may
> end up running elsewhere), allocating things like page
> tables lazily may well end up being a performance win.

It should be easy enough to benchmark something like kernel compiles etc, which are reasonably fork-rich and should show a good mix for something like this. Or even just something like "time to restart an X session" after you've brought it into memory once.

Linus
Re: process creation time increases linearly with shmem
On Fri, 26 Aug 2005, Hugh Dickins wrote:

> Well, I still don't think we need to test vm_file. We can add an
> anon_vma test if you like, if we really want to minimize the fork
> overhead, in favour of later faults. Do we?

When you consider NUMA placement (the child process may end up running elsewhere), allocating things like page tables lazily may well end up being a performance win.

--
All Rights Reversed
Re: process creation time increases linearly with shmem
On Fri, 26 Aug 2005, Hugh Dickins wrote:
>
> Well, I still don't think we need to test vm_file. We can add an
> anon_vma test if you like, if we really want to minimize the fork
> overhead, in favour of later faults. Do we?

I think we might want to do it in -mm for testing. Because quite frankly, otherwise the new fork() logic won't get a lot of testing. Shared memory isn't that common.

Linus
Re: process creation time increases linearly with shmem
On Fri, 26 Aug 2005, Ross Biro wrote:
> On 8/26/05, Rik van Riel <[EMAIL PROTECTED]> wrote:
>>
>> Filling in all the page table entries at the first fault to
>> a VMA doesn't make much sense, IMHO.
>>
>> I suspect we would be better off without that extra complexity,
>> unless there is a demonstrated benefit to it.
>
> You are probably right, but do you want to put in a patch that might
> have a big performance impact in either direction without verifying
> it?
>
> My suggestion is safe, but most likely sub-optimal. What everyone
> else is suggesting may be far better, but needs to be verified first.

It all has to be verified, and the problem will be that some things fare well and others badly: how to reach a balanced decision? Following your suggestion is no more safe than not following it.

> I'm suggesting that we change the code to do the same work fork would
> have done on the first page fault immediately, since it's easy to
> argue that it's not much worse than we have now and much better in
> many cases, and then try to experiment and figure out what the
> correct solution is.

We don't know what work fork would have done; that information was in the ptes we decided not to bother to copy. Perhaps every pte of the vma was set, perhaps none, perhaps only one. Also, doing it at fault time has significantly more work to do than just zipping along the ptes incrementing page counts and clearing bits. I think; but probably much less extra work than I originally imagined, since Andrew gave us the gang lookup of the page cache.

All the same, I'm with Rik: one of the great virtues of the original idea was its simplicity; I'd prefer not to add complexity.

Hugh
Re: process creation time increases linearly with shmem
On Fri, 26 Aug 2005, Linus Torvalds wrote:
> On Fri, 26 Aug 2005, Hugh Dickins wrote:
>>
>> I see some flaws in the various patches posted, including Rik's.
>> Here's another version - doing it inside copy_page_range, so this
>> kind of vma special-casing is over in mm/ rather than kernel/.
>
> I like this approach better, but I don't understand your particular
> choice of bits.
>
>> +	 * Assume the fork will probably exec: don't waste time copying
>> +	 * ptes where a page fault will fill them correctly afterwards.
>> +	 */
>> +	if ((vma->vm_flags & (VM_MAYSHARE|VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))
>> +			== VM_MAYSHARE)
>> +		return 0;
>> +
>> 	if (is_vm_hugetlb_page(vma))
>> 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
>
> First off, if you just did it below the hugetlb check, you'd not need to
> check hugetlb again.

Yes: I wanted to include VM_HUGETLB in the list as documentation really; and it costs nothing to test it along with the other flags - or are there architectures where the more bits you test, the costlier?

> And while I understand VM_NONLINEAR and VM_RESERVED,
> can you please comment on why VM_MAYSHARE is so important, and why no
> other information matters.

The VM_MAYSHARE one isn't terribly important, there's no correctness reason to replace VM_SHARED there. It's just that do_mmap_pgoff takes VM_SHARED and VM_MAYWRITE off a MAP_SHARED mapping of a file which was not opened for writing. We can safely avoid copying the ptes of such a vma, just as with the writable ones, but the VM_MAYSHARE test catches them where the VM_SHARED test does not.

> Now, VM_MAYSHARE is a sign of the mapping being a shared mapping. Fair
> enough. But afaik, a shared anonymous mapping absolutely needs its page
> tables copied, because those page tables contains either the pointers to
> the shared pages, or the swap entries.
>
> So I really think you need to verify that it's a file mapping too.

Either I'm misunderstanding, or you're remembering back to how shared anonymous was done in 2.2 (perhaps). In 2.4 and 2.6, shared anonymous is "backed" by a shared memory object, created by shmem_zero_setup: which sets vm_file even though we came into do_mmap_pgoff with no file.

> Also, arguably, there are other cases that may or may not be worth
> worrying about. What about non-shared non-writable file mappings? What
> about private mappings that haven't been COW'ed?

Non-shared non-currently-writable file mappings might have been writable and modified in the past, so we cannot necessarily skip those. We could, and I did, consider testing whether the vma has an anon_vma: we always allocate a vma's anon_vma just before first allocating it a private page, and it's a good test which swapoff uses to narrow its search. But partly I thought that a little too tricksy, and hard to explain; and partly I thought it was liable to catch the executable text, some of which is most likely to be needed in between fork and exec.

> So I think that in addition to your tests, you should test for
> "vma->vm_file", and you could toy with testing for "vma->anon_vma" being
> NULL (the latter will cause a _lot_ of hits, because any read-only private
> mapping will trigger, but it's a good stress-test and conceptually
> interesting, even if I suspect it will kill any performance gain through
> extra minor faults in the child).

Ah yes, I wrote the paragraph above before reading this one, honest! Well, I still don't think we need to test vm_file. We can add an anon_vma test if you like, if we really want to minimize the fork overhead, in favour of later faults. Do we?

Hugh
Re: process creation time increases linearly with shmem
On 8/26/05, Rik van Riel <[EMAIL PROTECTED]> wrote:
>
> Filling in all the page table entries at the first fault to
> a VMA doesn't make much sense, IMHO.
>
> I suspect we would be better off without that extra complexity,
> unless there is a demonstrated benefit to it.

You are probably right, but do you want to put in a patch that might
have a big performance impact in either direction without verifying it?
My suggestion is safe, but most likely sub-optimal.  What everyone else
is suggesting may be far better, but needs to be verified first.

I'm suggesting that we change the code to do the same work fork would
have done on the first page fault immediately, since it's easy to argue
that it's not much worse than we have now and much better in many cases,
and then try to experiment and figure out what the correct solution is.

Ross
Re: process creation time increases linearly with shmem
On Fri, 26 Aug 2005, Hugh Dickins wrote:
>
> I see some flaws in the various patches posted, including Rik's.
> Here's another version - doing it inside copy_page_range, so this
> kind of vma special-casing is over in mm/ rather than kernel/.

I like this approach better, but I don't understand your particular
choice of bits.

> +	/*
> +	 * Assume the fork will probably exec: don't waste time copying
> +	 * ptes where a page fault will fill them correctly afterwards.
> +	 */
> +	if ((vma->vm_flags & (VM_MAYSHARE|VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))
> +			== VM_MAYSHARE)
> +		return 0;
> +
>  	if (is_vm_hugetlb_page(vma))
>  		return copy_hugetlb_page_range(dst_mm, src_mm, vma);

First off, if you just did it below the hugetlb check, you'd not need to
check hugetlb again.  And while I understand VM_NONLINEAR and VM_RESERVED,
can you please comment on why VM_MAYSHARE is so important, and why no
other information matters.

Now, VM_MAYSHARE is a sign of the mapping being a shared mapping.  Fair
enough.  But afaik, a shared anonymous mapping absolutely needs its page
tables copied, because those page tables contain either the pointers to
the shared pages, or the swap entries.

So I really think you need to verify that it's a file mapping too.

Also, arguably, there are other cases that may or may not be worth
worrying about.  What about non-shared non-writable file mappings?  What
about private mappings that haven't been COW'ed?

So I think that in addition to your tests, you should test for
"vma->vm_file", and you could toy with testing for "vma->anon_vma" being
NULL (the latter will cause a _lot_ of hits, because any read-only private
mapping will trigger, but it's a good stress-test and conceptually
interesting, even if I suspect it will kill any performance gain through
extra minor faults in the child).

Linus
Re: process creation time increases linearly with shmem
On Fri, 26 Aug 2005, Ray Fucillo wrote:
>
> However, there is still a need that the child, once successfully forked,
> is operational reasonably quickly.  I suspect that Ross's idea of paging
> in everything after the first fault would not be optimal for us, because
> we'd still be talking about hundreds of ms of work done before the child
> does anything useful.

Simply skipping the page table setup of MAP_SHARED regions
should be enough to fix this issue.

> It would still be far better than the behavior we have today because
> that time would no longer be synchronous with the fork().

Filling in all the page table entries at the first fault to
a VMA doesn't make much sense, IMHO.

The reason I think this is that people have experimented with
prefaulting already resident pages at page fault time, and those
experiments have never shown a conclusive benefit.

Now, if doing such prefaulting for normal processes does not show a
benefit - why would it be beneficial to recently forked processes with
a huge SHM area ?

I suspect we would be better off without that extra complexity,
unless there is a demonstrated benefit to it.

--
All Rights Reversed
Re: process creation time increases linearly with shmem
Nick Piggin wrote:
> OK let's see how Ray goes, and try it when 2.6.14 opens...

Working on that now - I'll let you know.

> Yeah I guess that's a good idea.  Patch looks pretty good.  Just a minor
> issue with the comment, it is not strictly just assuming the child will
> exec... IMO it is worthwhile in Ray's case even if his forked process
> _eventually_ ends up touching all the shared memory pages, it is better
> to avoid many ms of fork overhead.

Yes, in our database system the child will immediately touch some shmem
pages, and may eventually touch most of them (and would almost never
exec()).  Fork performance is critical in usage scenarios where an
end-user database request forks a new server process from one master
server process.  However, there is still a need that the child, once
successfully forked, is operational reasonably quickly.  I suspect that
Ross's idea of paging in everything after the first fault would not be
optimal for us, because we'd still be talking about hundreds of ms of
work done before the child does anything useful.  It would still be far
better than the behavior we have today because that time would no longer
be synchronous with the fork().

Of course, it sounds like our app might be able to make use of the
hugetlb stuff, which can mitigate this problem in the future...
Re: process creation time increases linearly with shmem
On 8/26/05, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> On Fri, 26 Aug 2005, Ross Biro wrote:
> > On 8/26/05, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > >
> > > The refaulting will hurt the performance of something: let's
> > > just hope that something doesn't turn out to be a show-stopper.
> >
> > Why not just fault in all the pages on the first fault.  Then the
> > performance loss is a single page fault (the page table copy that would
> > have happened at fork time now happens at fault time) and you get the
> > big win for processes that do fork/exec.
>
> "all" might be very many more pages than were ever mapped in the parent,
> and not be a win.  Some faultahead might work better.  Might, might, ...

If you reduce "all" to whatever would have been done in fork originally,
then you've got a big win in some cases and a minimal loss in others, and
it's easy to argue you've got something better.  Now changing "all" to
something even less might be an even bigger win, but that requires a lot
of benchmarking to justify.

Ross
Re: process creation time increases linearly with shmem
On Fri, 26 Aug 2005, Ross Biro wrote:
> On 8/26/05, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> >
> > The refaulting will hurt the performance of something: let's
> > just hope that something doesn't turn out to be a show-stopper.
>
> Why not just fault in all the pages on the first fault.  Then the
> performance loss is a single page fault (the page table copy that would
> have happened at fork time now happens at fault time) and you get the
> big win for processes that do fork/exec.

"all" might be very many more pages than were ever mapped in the parent,
and not be a win.  Some faultahead might work better.  Might, might, ...

Hugh
Re: process creation time increases linearly with shmem
Hugh Dickins wrote:
> On Thu, 25 Aug 2005, Linus Torvalds wrote:
> >
> > That said, I think it's a valid optimization.  Especially as the child
> > _probably_ doesn't need it (ie there's at least some likelihood of an
> > execve() or similar).
>
> I agree, seems a great idea to me (sulking because I was too dumb to get
> it, even when Nick and Andi first posted their patches).  It won't just
> save on the copying at fork time, it'll save on undoing it all again
> when the child mm is torn down for exec.
>
> The refaulting will hurt the performance of something: let's
> just hope that something doesn't turn out to be a show-stopper.

OK let's see how Ray goes, and try it when 2.6.14 opens...

> I see some flaws in the various patches posted, including Rik's.
> Here's another version - doing it inside copy_page_range, so this
> kind of vma special-casing is over in mm/ rather than kernel/.

Yeah I guess that's a good idea.  Patch looks pretty good.  Just a minor
issue with the comment, it is not strictly just assuming the child will
exec... IMO it is worthwhile in Ray's case even if his forked process
_eventually_ ends up touching all the shared memory pages, it is better
to avoid many ms of fork overhead.

Also, on NUMA systems this will help get page tables allocated on the
right nodes, which is not an insignificant problem for big HPC jobs.

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com
Re: process creation time increases linearly with shmem
On Thu, 25 Aug 2005, Linus Torvalds wrote:
> On Fri, 26 Aug 2005, Nick Piggin wrote:
> >
> > > Skipping MAP_SHARED in fork() sounds like a good idea to me...
> >
> > Indeed.  Linus, can you remember why we haven't done this before?
>
> Hmm.  Historical reasons.  Also, if the child ends up needing it, it will
> now have to fault them in.
>
> That said, I think it's a valid optimization.  Especially as the child
> _probably_ doesn't need it (ie there's at least some likelihood of an
> execve() or similar).

I agree, seems a great idea to me (sulking because I was too dumb to get
it, even when Nick and Andi first posted their patches).  It won't just
save on the copying at fork time, it'll save on undoing it all again
when the child mm is torn down for exec.

The refaulting will hurt the performance of something: let's
just hope that something doesn't turn out to be a show-stopper.

I see some flaws in the various patches posted, including Rik's.
Here's another version - doing it inside copy_page_range, so this
kind of vma special-casing is over in mm/ rather than kernel/.

No point in testing vm_file, the vm_flags cover the cases.  Test
VM_MAYSHARE rather than VM_SHARED to include the never-can-be-written
MAP_SHARED cases too.  Must exclude VM_NONLINEAR, their ptes are
essential for defining the file offsets.  Must exclude VM_RESERVED,
faults on remap_pfn_range areas would usually put in anon zeroed pages
instead of the driver pages - or perhaps would be better as a test
against VM_IO, or vma->vm_ops->nopage?

Having to exclude the VM_NONLINEAR seems rather a shame, since those are
always shared and likely enormous.  The InfiniBand people's idea of a way
for the app to set VM_DONTCOPY (to avoid rdma get_user_pages problems)
becomes attractive as a way for apps to speed their forks.

Hugh

--- 2.6.13-rc7/mm/memory.c	2005-08-24 11:13:41.000000000 +0100
+++ linux/mm/memory.c	2005-08-26 10:09:50.000000000 +0100
@@ -498,6 +498,14 @@ int copy_page_range(struct mm_struct *ds
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
 
+	/*
+	 * Assume the fork will probably exec: don't waste time copying
+	 * ptes where a page fault will fill them correctly afterwards.
+	 */
+	if ((vma->vm_flags & (VM_MAYSHARE|VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))
+			== VM_MAYSHARE)
+		return 0;
+
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
Re: process creation time increases linearly with shmem
On Fri, 26 Aug 2005, Ross Biro wrote:
> On 8/26/05, Rik van Riel <[EMAIL PROTECTED]> wrote:
> >
> > Filling in all the page table entries at the first fault to
> > a VMA doesn't make much sense, IMHO.
> >
> > I suspect we would be better off without that extra complexity,
> > unless there is a demonstrated benefit to it.
>
> You are probably right, but do you want to put in a patch that might
> have a big performance impact in either direction without verifying it?
> My suggestion is safe, but most likely sub-optimal.  What everyone else
> is suggesting may be far better, but needs to be verified first.

It all has to be verified, and the problem will be that some things fare
well and others badly: how to reach a balanced decision?  Following your
suggestion is no more safe than not following it.

> I'm suggesting that we change the code to do the same work fork would
> have done on the first page fault immediately, since it's easy to argue
> that it's not much worse than we have now and much better in many cases,
> and then try to experiment and figure out what the correct solution is.

We don't know what work fork would have done, that information was in
the ptes we decided not to bother to copy.  Perhaps every pte of the
vma was set, perhaps none, perhaps only one.

Also, doing it at fault time has significantly more work to do than
just zipping along the ptes incrementing page counts and clearing bits.
I think; but probably much less extra work than I originally imagined,
since Andrew gave us the gang lookup of the page cache.

All the same, I'm with Rik: one of the great virtues of the original
idea was its simplicity; I'd prefer not to add complexity.

Hugh
Re: process creation time increases linearly with shmem
On Fri, 26 Aug 2005, Hugh Dickins wrote:
>
> Well, I still don't think we need to test vm_file.  We can add an
> anon_vma test if you like, if we really want to minimize the fork
> overhead, in favour of later faults.  Do we?

I think we might want to do it in -mm for testing.  Because quite
frankly, otherwise the new fork() logic won't get a lot of testing.
Shared memory isn't that common.

Linus
Re: process creation time increases linearly with shmem
On Fri, 26 Aug 2005, Hugh Dickins wrote:
>
> Well, I still don't think we need to test vm_file.  We can add an
> anon_vma test if you like, if we really want to minimize the fork
> overhead, in favour of later faults.  Do we?

When you consider NUMA placement (the child process may end up running
elsewhere), allocating things like page tables lazily may well end up
being a performance win.

--
All Rights Reversed
Re: process creation time increases linearly with shmem
On Fri, 26 Aug 2005, Rik van Riel wrote:
> On Fri, 26 Aug 2005, Hugh Dickins wrote:
> > Well, I still don't think we need to test vm_file.  We can add an
> > anon_vma test if you like, if we really want to minimize the fork
> > overhead, in favour of later faults.  Do we?
>
> When you consider NUMA placement (the child process may end up
> running elsewhere), allocating things like page tables lazily may
> well end up being a performance win.

It should be easy enough to benchmark something like kernel compiles
etc, which are reasonably fork-rich and should show a good mix for
something like this.

Or even just something like time to restart an X session after you've
brought it into memory once.

Linus
Re: process creation time increases linearly with shmem
On Fri, 26 Aug 2005, Nick Piggin wrote:
> > > Skipping MAP_SHARED in fork() sounds like a good idea to me...
> >
> > Indeed. Linus, can you remember why we haven't done this before?

Hmm. Historical reasons. Also, if the child ends up needing it, it
will now have to fault them in.

That said, I think it's a valid optimization. Especially as the child
_probably_ doesn't need it (ie there's at least some likelihood of an
execve() or similar).

Linus
Re: process creation time increases linearly with shmem
On Fri, 26 Aug 2005, Nick Piggin wrote:
> > Skipping MAP_SHARED in fork() sounds like a good idea to me...
>
> Indeed. Linus, can you remember why we haven't done this before?

Where "this" looks something like the patch below, shamelessly merging
Nick's and Andy's patches and adding the initialization of retval.

I suspect this may be a measurable win on database servers with a web
frontend, where the connections to the database server are set up
basically for each individual query, and don't stick around for a long
time.

No, I haven't actually tested this patch - but feel free to go wild
while I sign off for the night.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>

--- linux-2.6.12/kernel/fork.c.mapshared	2005-08-25 18:40:44.0 -0400
+++ linux-2.6.12/kernel/fork.c	2005-08-25 18:47:16.0 -0400
@@ -184,7 +184,7 @@
 {
 	struct vm_area_struct * mpnt, *tmp, **pprev;
 	struct rb_node **rb_link, *rb_parent;
-	int retval;
+	int retval = 0;
 	unsigned long charge;
 	struct mempolicy *pol;
@@ -265,7 +265,10 @@
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, current->mm, tmp);
+		/* Skip pte copying if page faults can take care of things. */
+		if (!file || !(tmp->vm_flags & VM_SHARED) ||
+				is_vm_hugetlb_page(vma))
+			retval = copy_page_range(mm, current->mm, tmp);
 		spin_unlock(&mm->page_table_lock);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
Re: process creation time increases linearly with shmem
Rik van Riel wrote:
> On Thu, 25 Aug 2005, Nick Piggin wrote:
> > fork() can be changed so as not to set up page tables for
> > MAP_SHARED mappings. I think that has other tradeoffs like
> > initially causing several unavoidable faults reading
> > libraries and program text.
>
> Actually, libraries and program text are usually mapped
> MAP_PRIVATE, so those would still be copied.

Yep, that seems to be the case here.

> Skipping MAP_SHARED in fork() sounds like a good idea to me...

Indeed. Linus, can you remember why we haven't done this before?

-- 
SUSE Labs, Novell Inc.
Re: process creation time increases linearly with shmem
On Thu, 25 Aug 2005, Nick Piggin wrote:
> fork() can be changed so as not to set up page tables for
> MAP_SHARED mappings. I think that has other tradeoffs like
> initially causing several unavoidable faults reading
> libraries and program text.

Actually, libraries and program text are usually mapped MAP_PRIVATE,
so those would still be copied.

Skipping MAP_SHARED in fork() sounds like a good idea to me...

-- 
All Rights Reversed
Re: process creation time increases linearly with shmem
On Thursday 25 August 2005 16:47, Parag Warudkar wrote:
> Exactly - one problem is that this forces all of the hugetlb users to
> go the lazy faulting way.

Actually I disabled it for hugetlbfs (... !is_huge...vma). The reason
is that lazy faulting for huge pages is still not in mainline.

-Andi
Re: process creation time increases linearly with shmem
On Thu, 2005-08-25 at 16:22 +0200, Andi Kleen wrote:
> But I'm not sure it's a good idea in all cases. Would need a lot of
> benchmarking at least.
>
> -Andi

Exactly - one problem is that this forces all of the hugetlb users to
go the lazy faulting way. This is more or less similar to the original
problem where fork() forces everything to be mapped and some apps
don't like it. In the same way, some apps may not want hugetlb pages
to be all pre-mapped.

That's why I was alluding towards having the user specify
MAP_SHARED|MAP_LAZY or something to that tune and then have fork()
honor it. So people who want all things pre-mapped will not specify
MAP_LAZY, just MAP_SHARED.

Now I don't even know if the above is possible and workable for all
scenarios but that's why I was asking.. :)

Parag
Re: process creation time increases linearly with shmem
Andi Kleen wrote:
> > Would it be worth trying to do something like this?
>
> Maybe. Shouldn't be very hard though - you just need to check if the
> VMA is backed by an object and if yes don't call copy_page_range for
> it. I think it just needs (untested)

I think you need to check for MAP_SHARED as well, because a MAP_PRIVATE
mapping of a file could be modified in the parent. See the patch I
posted just now. Also, do you need any special case for hugetlb?

> Index: linux-2.6.13-rc5-misc/kernel/fork.c
> ===================================================================
> --- linux-2.6.13-rc5-misc.orig/kernel/fork.c
> +++ linux-2.6.13-rc5-misc/kernel/fork.c
> @@ -265,7 +265,8 @@ static inline int dup_mmap(struct mm_str
>  		rb_parent = &tmp->vm_rb;
>  
>  		mm->map_count++;
> -		retval = copy_page_range(mm, current->mm, tmp);
> +		if (!file && !is_vm_hugetlb_page(vma))
> +			retval = copy_page_range(mm, current->mm, tmp);
>  		spin_unlock(&mm->page_table_lock);
>  
>  		if (tmp->vm_ops && tmp->vm_ops->open)
>
> But I'm not sure it's a good idea in all cases. Would need a lot of
> benchmarking at least.

Yep. I'm sure it must have come up in the past, and Linus must have
said something about best-for-most.

-- 
SUSE Labs, Novell Inc.
Re: process creation time increases linearly with shmem
Ray Fucillo wrote:
> Nick Piggin wrote:
> > fork() can be changed so as not to set up page tables for
> > MAP_SHARED mappings. I think that has other tradeoffs like
> > initially causing several unavoidable faults reading
> > libraries and program text.
>
> The application is a database system called Caché. We allocate a
> large shared memory segment for database cache, which in a large
> production environment may realistically be 1+GB on 32-bit platforms
> and much larger on 64-bit. At these sizes fork() is taking hundreds
> of milliseconds, which can become a noticeable bottleneck for us.
> This performance characteristic seems to be unique to Linux vs other
> Unix implementations.

As Andi said, hugepages might be a very nice feature for you guys to
look into and might potentially give a performance increase with
reduced TLB pressure, not only your immediate fork problem.

Anyway, the attached patch is something you could try testing. If you
do so, then I would be very interested to see performance results.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2005-08-04 15:24:36.0 +1000
+++ linux-2.6/kernel/fork.c	2005-08-26 00:20:50.0 +1000
@@ -256,7 +256,6 @@ static inline int dup_mmap(struct mm_str
 		 * Note that, exceptionally, here the vma is inserted
 		 * without holding mm->mmap_sem.
 		 */
-		spin_lock(&mm->page_table_lock);
 		*pprev = tmp;
 		pprev = &tmp->vm_next;
@@ -265,8 +264,11 @@ static inline int dup_mmap(struct mm_str
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, current->mm, tmp);
-		spin_unlock(&mm->page_table_lock);
+		if (!(file && (tmp->vm_flags & VM_SHARED))) {
+			spin_lock(&mm->page_table_lock);
+			retval = copy_page_range(mm, current->mm, tmp);
+			spin_unlock(&mm->page_table_lock);
+		}
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
Re: process creation time increases linearly with shmem
> Would it be worth trying to do something like this?

Maybe. Shouldn't be very hard though - you just need to check if the
VMA is backed by an object and if yes don't call copy_page_range for
it. I think it just needs (untested)

Index: linux-2.6.13-rc5-misc/kernel/fork.c
===================================================================
--- linux-2.6.13-rc5-misc.orig/kernel/fork.c
+++ linux-2.6.13-rc5-misc/kernel/fork.c
@@ -265,7 +265,8 @@ static inline int dup_mmap(struct mm_str
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, current->mm, tmp);
+		if (!file && !is_vm_hugetlb_page(vma))
+			retval = copy_page_range(mm, current->mm, tmp);
 		spin_unlock(&mm->page_table_lock);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)

But I'm not sure it's a good idea in all cases. Would need a lot of
benchmarking at least.

-Andi
Re: process creation time increases linearly with shmem
> Ray Fucillo <[EMAIL PROTECTED]> writes:
> >
> > The application is a database system called Caché. We allocate a
> > large shared memory segment for database cache, which in a large
> > production environment may realistically be 1+GB on 32-bit platforms
> > and much larger on 64-bit. At these sizes fork() is taking hundreds
> > of milliseconds, which can become a noticeable bottleneck for us.
> > This performance characteristic seems to be unique to Linux vs
> > other Unix implementations.
>
> You could set up hugetlbfs and use large pages for the SHM (with
> SHM_HUGETLB); then the overhead of walking the pages of it at fork
> would be much lower.
>
> -Andi

Why isn't the page walk for the shared memory done lazily though? It
would be better in that applications most likely may not want to page
in all of the shared memory at once. Program logic/requirements should
dictate this instead of fork making it compulsory.

I think this is because we don't distinguish between shared libraries,
program text and explicitly shared memory as the above application
does - everything is MAP_SHARED. As someone mentioned, this causes
unavoidable faults for reading in shared libraries and program text.

But if there was a MAP_SHARED|MAP_LAZY - could fork() then be set up
not to set up page tables for such mappings and still continue to map
the plain MAP_SHARED ones so program text and libraries don't cause
faults? Applications could then specify MAP_SHARED|MAP_LAZY and not
incur the overhead of the page table walk for the shared memory all at
once.

Would it be worth trying to do something like this?

Parag
Re: process creation time increases linearly with shmem
Ray Fucillo <[EMAIL PROTECTED]> writes:
> The application is a database system called Caché. We allocate a
> large shared memory segment for database cache, which in a large
> production environment may realistically be 1+GB on 32-bit platforms
> and much larger on 64-bit. At these sizes fork() is taking hundreds
> of milliseconds, which can become a noticeable bottleneck for us.
> This performance characteristic seems to be unique to Linux vs other
> Unix implementations.

You could set up hugetlbfs and use large pages for the SHM (with
SHM_HUGETLB); then the overhead of walking the pages of it at fork
would be much lower.

-Andi
Re: process creation time increases linearly with shmem
Nick Piggin wrote:
> fork() can be changed so as not to set up page tables for MAP_SHARED
> mappings. I think that has other tradeoffs like initially causing
> several unavoidable faults reading libraries and program text.
>
> What kind of application are you using?

The application is a database system called Caché. We allocate a large
shared memory segment for database cache, which in a large production
environment may realistically be 1+GB on 32-bit platforms and much
larger on 64-bit. At these sizes fork() is taking hundreds of
milliseconds, which can become a noticeable bottleneck for us. This
performance characteristic seems to be unique to Linux vs other Unix
implementations.
Re: process creation time increases linearly with shmem
Ray Fucillo wrote:
> I am seeing process creation time increase linearly with the size of
> the shared memory segment that the parent touches. The attached
> forktest.c is a very simple user program that illustrates this
> behavior, which I have tested on various kernel versions from 2.4
> through 2.6. Is this a known issue, and is it solvable?

fork() can be changed so as not to set up page tables for MAP_SHARED
mappings. I think that has other tradeoffs like initially causing
several unavoidable faults reading libraries and program text.

What kind of application are you using?

-- 
SUSE Labs, Novell Inc.
process creation time increases linearly with shmem
I am seeing process creation time increase linearly with the size of the shared memory segment that the parent touches. The attached forktest.c is a very simple user program that illustrates this behavior, which I have tested on various kernel versions from 2.4 through 2.6. Is this a known issue, and is it solvable? TIA, Ray #include #include #include #include #include #include #include #include #include #include #define MAXJOBS 50 #define MAXMALLOC 1024 #define USESIGCHLDHND /* USESIGCHLDHND feature code changes how the parent waits for the children. When this feature code is on we define a signal handler for SIGCHLD and call waitpid to clean up the child process. If this feature code is off, we wait until all children are forked and then loop through the array of child pids and call waitpid() on each. The purpose of this feature code was to see if there is any difference in timing based on cleaning up zombies faster. Test have shown no appreciable difference. */ /* Return a floating point number of seconds since the start time in the timeval structure pointed to by starttv */ float elapsedtime(struct timeval *starttv) { struct timeval currenttime; gettimeofday(,NULL); return ((currenttime.tv_sec - starttv->tv_sec) + ((float)(currenttime.tv_usec - starttv->tv_usec)/100)); } #ifdef USESIGCHLDHND int childexitcnt = 0; void sigchldhnd(int signum) { if (waitpid(-1,NULL,WNOHANG)) ++childexitcnt; return; } #endif int main(void) { pid_t childpid[MAXJOBS]; int x,i; int childcnt = 0; float endfork, endwait; struct shmid_ds myshmid_ds; unsigned int mb; int myshmid; key_t mykey = 0xf00df00d; char *mymem = 0; struct timeval starttime; #ifdef USESIGCHLDHND struct sigaction sa; sa.sa_handler = sigchldhnd; sigemptyset(_mask); sa.sa_flags = SA_RESTART; if (sigaction(SIGCHLD, , NULL) == -1) { printf("sigaction() failed, errno %d - exiting\n",errno); exit(1); } #endif printf("\nNumber of jobs to fork (max %d): ",MAXJOBS); scanf("%d",); if ((x < 1) || (x > MAXJOBS)) { 
printf("\ninvalid input - exiting\n"); exit(1); } printf("\nNumber of MB to allocate (0-%d): ",MAXMALLOC); scanf("%d",); if (mb > MAXMALLOC) { printf("\ninvalid input - exiting\n"); exit(1); } /* allocate and initialize shared memory if number of MB is not zero */ if (mb) { myshmid = shmget(mykey,mb*1024*1024,IPC_CREAT|0777); if (myshmid == -1) { printf("\nshmget() failed, errno %d. - exiting\n",errno); exit(1); } mymem = (char *) shmat(myshmid,0,0); if (mymem == (char *) -1) { printf("\nshmat() failed, errno %d. - exiting\n",errno); exit(1); } if (shmctl(myshmid,IPC_STAT,_ds)) { printf("\nshmctl() failed, errno %d. - exiting\n",errno); exit(1); } /* write a pattern in the new shmem segment*/ for (i=0; i < (mb*1024*1024); i+=32) mymem[i]='R'; } printf("\nStarting %d jobs. time:0.0", x); fflush(stdout); gettimeofday(,NULL); for (i=0; i
process creation time increases linearly with shmem
I am seeing process creation time increase linearly with the size of the shared memory segment that the parent touches. The attached forktest.c is a very simple user program that illustrates this behavior, which I have tested on various kernel versions from 2.4 through 2.6. Is this a known issue, and is it solvable?

TIA,
Ray

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/time.h>
#include <errno.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <signal.h>

#define MAXJOBS 50
#define MAXMALLOC 1024
#define USESIGCHLDHND

/* USESIGCHLDHND changes how the parent waits for the children.  When
   this feature code is on we define a signal handler for SIGCHLD and
   call waitpid() to clean up the child process.  If this feature code
   is off, we wait until all children are forked and then loop through
   the array of child pids and call waitpid() on each.  The purpose of
   this feature code was to see if there is any difference in timing
   based on cleaning up zombies faster.  Tests have shown no
   appreciable difference. */

/* Return a floating point number of seconds since the start time in
   the timeval structure pointed to by starttv */
float elapsedtime(struct timeval *starttv)
{
    struct timeval currenttime;
    gettimeofday(&currenttime, NULL);
    return ((currenttime.tv_sec - starttv->tv_sec) +
            ((float)(currenttime.tv_usec - starttv->tv_usec) / 1000000));
}

#ifdef USESIGCHLDHND
int childexitcnt = 0;
void sigchldhnd(int signum)
{
    if (waitpid(-1, NULL, WNOHANG) > 0)
        ++childexitcnt;
}
#endif

int main(void)
{
    pid_t childpid[MAXJOBS];
    int x, i;
    int childcnt = 0;
    float endfork, endwait;
    struct shmid_ds myshmid_ds;
    unsigned int mb;
    int myshmid = -1;
    key_t mykey = 0xf00df00d;
    char *mymem = 0;
    struct timeval starttime;

#ifdef USESIGCHLDHND
    struct sigaction sa;
    sa.sa_handler = sigchldhnd;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, NULL) == -1) {
        printf("sigaction() failed, errno %d - exiting\n", errno);
        exit(1);
    }
#endif

    printf("\nNumber of jobs to fork (max %d): ", MAXJOBS);
    scanf("%d", &x);
    if ((x < 1) || (x > MAXJOBS)) {
        printf("\ninvalid input - exiting\n");
        exit(1);
    }
    printf("\nNumber of MB to allocate (0-%d): ", MAXMALLOC);
    scanf("%u", &mb);
    if (mb > MAXMALLOC) {
        printf("\ninvalid input - exiting\n");
        exit(1);
    }

    /* allocate and initialize shared memory if number of MB is not zero */
    if (mb) {
        myshmid = shmget(mykey, mb * 1024 * 1024, IPC_CREAT | 0777);
        if (myshmid == -1) {
            printf("\nshmget() failed, errno %d. - exiting\n", errno);
            exit(1);
        }
        mymem = (char *) shmat(myshmid, 0, 0);
        if (mymem == (char *) -1) {
            printf("\nshmat() failed, errno %d. - exiting\n", errno);
            exit(1);
        }
        if (shmctl(myshmid, IPC_STAT, &myshmid_ds)) {
            printf("\nshmctl() failed, errno %d. - exiting\n", errno);
            exit(1);
        }
        /* write a pattern in the new shmem segment */
        for (i = 0; i < (mb * 1024 * 1024); i += 32)
            mymem[i] = 'R';
    }

    printf("\nStarting %d jobs. time:0.0", x);
    fflush(stdout);
    gettimeofday(&starttime, NULL);
    for (i = 0; i < x; i++) {
        childpid[i] = fork();
        if (!childpid[i]) {
            /* child process */
            printf("\n - Child %d time:%f", i, elapsedtime(&starttime));
            exit(1);
        } else if (childpid[i] == -1) {
            /* failure */
            printf("\nfork failed, errno = %d", errno);
        } else
            childcnt++;
    }
    endfork = elapsedtime(&starttime);

#ifndef USESIGCHLDHND
    for (i = 0; i < x; i++)
        waitpid(childpid[i], 0, 0);
#else
    while (childexitcnt < childcnt) {
        if (waitpid(-1, NULL, 0) > 0)
            ++childexitcnt;
    }
#endif
    endwait = elapsedtime(&starttime);
    printf("\nTime to fork all processes in seconds: %f", endfork);
    printf("\nTime for all processes to complete: %f\n", endwait);

    /* remove the shmem segment */
    if (mb && shmctl(myshmid, IPC_RMID, &myshmid_ds)) {
        printf("\nshmctl() failed, errno %d. - exiting\n", errno);
        exit(1);
    }
    return 0;
}
Re: process creation time increases linearly with shmem
Ray Fucillo wrote:
> I am seeing process creation time increase linearly with the size of
> the shared memory segment that the parent touches. The attached
> forktest.c is a very simple user program that illustrates this
> behavior, which I have tested on various kernel versions from 2.4
> through 2.6. Is this a known issue, and is it solvable?

fork() can be changed so as not to set up page tables for MAP_SHARED mappings. I think that has other tradeoffs like initially causing several unavoidable faults reading libraries and program text.

What kind of application are you using?

--
SUSE Labs, Novell Inc.