Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Wed, 15 Jan 2014 12:47:00 -0800 Davidlohr Bueso wrote:

> > Well, we're mainly looking for bugfixes this last in the cycle.
> > "[PATCH v3 03/14] mm, hugetlb: protect region tracking via newly
> > introduced resv_map lock" fixes a bug, but I'd assumed that it depended
> > on earlier patches.
>
> It doesn't seem to depend on anything. All 1-7 patches apply cleanly on
> linux-next, the last change to mm/hugetlb.c was commit 3ebac7fa (mm:
> dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE).
>
> > If we think that one is serious then it would be
> > better to cook up a minimal fix which is backportable into 3.12 and
> > eariler?
>
> I don't think it's too serious, afaik it's a theoretical race and I
> haven't seen any bug reports for it. So we can probably just wait for
> 3.14, as you say, it's already late in the cycle anyways.

OK, thanks.

> Just let me
> know what you want to do so we can continue working on the actual
> performance issue.

A resend after -rc1 would suit.
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Tue, 2014-01-14 at 20:56 -0800, Andrew Morton wrote:

> On Tue, 14 Jan 2014 20:37:49 -0800 Davidlohr Bueso wrote:
>
> > On Tue, 2014-01-14 at 19:08 -0800, David Rientjes wrote:
> > > On Mon, 6 Jan 2014, Davidlohr Bueso wrote:
> > >
> > > > > If Andrew agree, It would be great to merge 1-7 patches into mainline
> > > > > before your mutex approach. There are some of clean-up patches and, IMO,
> > > > > it makes the code more readable and maintainable, so it is worth to merge
> > > > > separately.
> > > >
> > > > Fine by me.
> > >
> > > It appears like patches 1-7 are still missing from linux-next, would you
> > > mind posting them in a series with your approach?
> >
> > I haven't looked much into patches 4-7, but at least the first three are
> > ok. I was waiting for Andrew to take all seven for linux-next and then
> > I'd rebase my approach on top. Anyway, unless Andrew has any
> > preferences, if by later this week they're not picked up, I'll resend
> > everything.
>
> Well, we're mainly looking for bugfixes this last in the cycle.
> "[PATCH v3 03/14] mm, hugetlb: protect region tracking via newly
> introduced resv_map lock" fixes a bug, but I'd assumed that it depended
> on earlier patches.

It doesn't seem to depend on anything. All 1-7 patches apply cleanly on
linux-next, the last change to mm/hugetlb.c was commit 3ebac7fa (mm:
dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE).

> If we think that one is serious then it would be
> better to cook up a minimal fix which is backportable into 3.12 and
> eariler?

I don't think it's too serious, afaik it's a theoretical race and I
haven't seen any bug reports for it. So we can probably just wait for
3.14, as you say, it's already late in the cycle anyways. Just let me
know what you want to do so we can continue working on the actual
performance issue.

Thanks,
Davidlohr
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Tue, 14 Jan 2014 20:37:49 -0800 Davidlohr Bueso wrote:

> On Tue, 2014-01-14 at 19:08 -0800, David Rientjes wrote:
> > On Mon, 6 Jan 2014, Davidlohr Bueso wrote:
> >
> > > > If Andrew agree, It would be great to merge 1-7 patches into mainline
> > > > before your mutex approach. There are some of clean-up patches and, IMO,
> > > > it makes the code more readable and maintainable, so it is worth to merge
> > > > separately.
> > >
> > > Fine by me.
> >
> > It appears like patches 1-7 are still missing from linux-next, would you
> > mind posting them in a series with your approach?
>
> I haven't looked much into patches 4-7, but at least the first three are
> ok. I was waiting for Andrew to take all seven for linux-next and then
> I'd rebase my approach on top. Anyway, unless Andrew has any
> preferences, if by later this week they're not picked up, I'll resend
> everything.

Well, we're mainly looking for bugfixes this last in the cycle.

"[PATCH v3 03/14] mm, hugetlb: protect region tracking via newly
introduced resv_map lock" fixes a bug, but I'd assumed that it depended
on earlier patches.

If we think that one is serious then it would be
better to cook up a minimal fix which is backportable into 3.12 and
eariler?
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Tue, 2014-01-14 at 19:08 -0800, David Rientjes wrote:

> On Mon, 6 Jan 2014, Davidlohr Bueso wrote:
>
> > > If Andrew agree, It would be great to merge 1-7 patches into mainline
> > > before your mutex approach. There are some of clean-up patches and, IMO,
> > > it makes the code more readable and maintainable, so it is worth to merge
> > > separately.
> >
> > Fine by me.
>
> It appears like patches 1-7 are still missing from linux-next, would you
> mind posting them in a series with your approach?

I haven't looked much into patches 4-7, but at least the first three are
ok. I was waiting for Andrew to take all seven for linux-next and then
I'd rebase my approach on top. Anyway, unless Andrew has any
preferences, if by later this week they're not picked up, I'll resend
everything.
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Mon, 6 Jan 2014, Davidlohr Bueso wrote:

> > If Andrew agree, It would be great to merge 1-7 patches into mainline
> > before your mutex approach. There are some of clean-up patches and, IMO,
> > it makes the code more readable and maintainable, so it is worth to merge
> > separately.
>
> Fine by me.

It appears like patches 1-7 are still missing from linux-next, would you
mind posting them in a series with your approach?
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Tue, 2014-01-07 at 10:57 +0900, Joonsoo Kim wrote: > On Mon, Jan 06, 2014 at 04:19:05AM -0800, Davidlohr Bueso wrote: > > On Mon, 2014-01-06 at 09:19 +0900, Joonsoo Kim wrote: > > > On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote: > > > > Hi Joonsoo, > > > > > > > > Sorry about the delay... > > > > > > > > On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote: > > > > > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote: > > > > > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote: > > > > > > > On Fri, 2013-12-20 at 14:01 +, Mel Gorman wrote: > > > > > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote: > > > > > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > If parallel fault occur, we can fail to allocate a hugepage, > > > > > > > > > > because many threads dequeue a hugepage to handle a fault > > > > > > > > > > of same address. > > > > > > > > > > This makes reserved pool shortage just for a little while > > > > > > > > > > and this cause > > > > > > > > > > faulting thread who can get hugepages to get a SIGBUS > > > > > > > > > > signal. > > > > > > > > > > > > > > > > > > > > To solve this problem, we already have a nice solution, > > > > > > > > > > that is, > > > > > > > > > > a hugetlb_instantiation_mutex. This blocks other threads to > > > > > > > > > > dive into > > > > > > > > > > a fault handler. This solve the problem clearly, but it > > > > > > > > > > introduce > > > > > > > > > > performance degradation, because it serialize all fault > > > > > > > > > > handling. > > > > > > > > > > > > > > > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get > > > > > > > > > > rid of > > > > > > > > > > performance degradation. > > > > > > > > > > > > > > > > > > So the whole point of the patch is to improve performance, > > > > > > > > > but the > > > > > > > > > changelog doesn't include any performance measurements! > > > > > > > > > > > > > > > > > > > > > > > > > I don't really deal with hugetlbfs any more and I have not > > > > > > > > examined this > > > > > > > > series but I remember why I never really cared about this > > > > > > > > mutex. It wrecks > > > > > > > > fault scalability but AFAIK fault scalability almost never > > > > > > > > mattered for > > > > > > > > workloads using hugetlbfs. The most common user of hugetlbfs > > > > > > > > by far is > > > > > > > > sysv shared memory. The memory is faulted early in the lifetime > > > > > > > > of the > > > > > > > > workload and after that it does not matter. At worst, it hurts > > > > > > > > application > > > > > > > > startup time but that is still poor motivation for putting a > > > > > > > > lot of work > > > > > > > > into removing the mutex. > > > > > > > > > > > > > > Yep, important hugepage workloads initially pound heavily on this > > > > > > > lock, > > > > > > > then it naturally decreases. > > > > > > > > > > > > > > > Microbenchmarks will be able to trigger problems in this area > > > > > > > > but it'd > > > > > > > > be important to check if any workload that matters is actually > > > > > > > > hitting > > > > > > > > that problem. > > > > > > > > > > > > > > I was thinking of writing one to actually get some numbers for > > > > > > > this > > > > > > > patchset -- I don't know of any benchmark that might stress this > > > > > > > lock. 
> > > > > > > > > > > > > > However I first measured the amount of cycles it costs to start an > > > > > > > Oracle DB and things went south with these changes. A simple > > > > > > > 'startup > > > > > > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla > > > > > > > kernel, this > > > > > > > costs ~7.5 billion cycles and with this patchset it goes up to > > > > > > > ~27.1 > > > > > > > billion. While there is naturally a fair amount of variation, > > > > > > > these > > > > > > > changes do seem to do more harm than good, at least in real world > > > > > > > scenarios. > > > > > > > > > > > > Hello, > > > > > > > > > > > > I think that number of cycles is not proper to measure this > > > > > > patchset, > > > > > > because cycles would be wasted by fault handling failure. Instead, > > > > > > it > > > > > > targeted improved elapsed time. > > > > > > > > Fair enough, however the fact of the matter is this approach does en up > > > > hurting performance. Regarding total startup time, I didn't see hardly > > > > any differences, with both vanilla and this patchset it takes close to > > > > 33.5 seconds. > > > > > > > > > Could you tell me how long it > > > > > > takes to fault all of it's hugepages? > > > > > > > > > > > > Anyway, this order of magnitude still seems a problem. :/ > > > > > > > > > > > > I guess that cycles are wasted by zeroing hugepage in fault-path > > > > > > like as > > > > > > Andrew pointed out. > > > > > > > > > > > > I will send another patches to fix
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Mon, Jan 06, 2014 at 04:19:05AM -0800, Davidlohr Bueso wrote: > On Mon, 2014-01-06 at 09:19 +0900, Joonsoo Kim wrote: > > On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote: > > > Hi Joonsoo, > > > > > > Sorry about the delay... > > > > > > On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote: > > > > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote: > > > > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote: > > > > > > On Fri, 2013-12-20 at 14:01 +, Mel Gorman wrote: > > > > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote: > > > > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim > > > > > > > > wrote: > > > > > > > > > > > > > > > > > If parallel fault occur, we can fail to allocate a hugepage, > > > > > > > > > because many threads dequeue a hugepage to handle a fault of > > > > > > > > > same address. > > > > > > > > > This makes reserved pool shortage just for a little while and > > > > > > > > > this cause > > > > > > > > > faulting thread who can get hugepages to get a SIGBUS signal. > > > > > > > > > > > > > > > > > > To solve this problem, we already have a nice solution, that > > > > > > > > > is, > > > > > > > > > a hugetlb_instantiation_mutex. This blocks other threads to > > > > > > > > > dive into > > > > > > > > > a fault handler. This solve the problem clearly, but it > > > > > > > > > introduce > > > > > > > > > performance degradation, because it serialize all fault > > > > > > > > > handling. > > > > > > > > > > > > > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid > > > > > > > > > of > > > > > > > > > performance degradation. > > > > > > > > > > > > > > > > So the whole point of the patch is to improve performance, but > > > > > > > > the > > > > > > > > changelog doesn't include any performance measurements! > > > > > > > > > > > > > > > > > > > > > > I don't really deal with hugetlbfs any more and I have not > > > > > > > examined this > > > > > > > series but I remember why I never really cared about this mutex. > > > > > > > It wrecks > > > > > > > fault scalability but AFAIK fault scalability almost never > > > > > > > mattered for > > > > > > > workloads using hugetlbfs. The most common user of hugetlbfs by > > > > > > > far is > > > > > > > sysv shared memory. The memory is faulted early in the lifetime > > > > > > > of the > > > > > > > workload and after that it does not matter. At worst, it hurts > > > > > > > application > > > > > > > startup time but that is still poor motivation for putting a lot > > > > > > > of work > > > > > > > into removing the mutex. > > > > > > > > > > > > Yep, important hugepage workloads initially pound heavily on this > > > > > > lock, > > > > > > then it naturally decreases. > > > > > > > > > > > > > Microbenchmarks will be able to trigger problems in this area but > > > > > > > it'd > > > > > > > be important to check if any workload that matters is actually > > > > > > > hitting > > > > > > > that problem. > > > > > > > > > > > > I was thinking of writing one to actually get some numbers for this > > > > > > patchset -- I don't know of any benchmark that might stress this > > > > > > lock. > > > > > > > > > > > > However I first measured the amount of cycles it costs to start an > > > > > > Oracle DB and things went south with these changes. A simple > > > > > > 'startup > > > > > > immediate' calls hugetlb_fault() ~5000 times. 
For a vanilla kernel, > > > > > > this > > > > > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1 > > > > > > billion. While there is naturally a fair amount of variation, these > > > > > > changes do seem to do more harm than good, at least in real world > > > > > > scenarios. > > > > > > > > > > Hello, > > > > > > > > > > I think that number of cycles is not proper to measure this patchset, > > > > > because cycles would be wasted by fault handling failure. Instead, it > > > > > targeted improved elapsed time. > > > > > > Fair enough, however the fact of the matter is this approach does en up > > > hurting performance. Regarding total startup time, I didn't see hardly > > > any differences, with both vanilla and this patchset it takes close to > > > 33.5 seconds. > > > > > > > Could you tell me how long it > > > > > takes to fault all of it's hugepages? > > > > > > > > > > Anyway, this order of magnitude still seems a problem. :/ > > > > > > > > > > I guess that cycles are wasted by zeroing hugepage in fault-path like > > > > > as > > > > > Andrew pointed out. > > > > > > > > > > I will send another patches to fix this problem. > > > > > > > > Hello, Davidlohr. > > > > > > > > Here goes the fix on top of this series. > > > > > > ... and with this patch we go from 27 down to 11 billion cycles, so this > > > approach still costs more than what we currently have. A perf stat shows > > > that an entire 1Gb huge page aware DB startup costs around ~30 billion
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Mon, 2014-01-06 at 09:19 +0900, Joonsoo Kim wrote: > On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote: > > Hi Joonsoo, > > > > Sorry about the delay... > > > > On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote: > > > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote: > > > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote: > > > > > On Fri, 2013-12-20 at 14:01 +, Mel Gorman wrote: > > > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote: > > > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim > > > > > > > wrote: > > > > > > > > > > > > > > > If parallel fault occur, we can fail to allocate a hugepage, > > > > > > > > because many threads dequeue a hugepage to handle a fault of > > > > > > > > same address. > > > > > > > > This makes reserved pool shortage just for a little while and > > > > > > > > this cause > > > > > > > > faulting thread who can get hugepages to get a SIGBUS signal. > > > > > > > > > > > > > > > > To solve this problem, we already have a nice solution, that is, > > > > > > > > a hugetlb_instantiation_mutex. This blocks other threads to > > > > > > > > dive into > > > > > > > > a fault handler. This solve the problem clearly, but it > > > > > > > > introduce > > > > > > > > performance degradation, because it serialize all fault > > > > > > > > handling. > > > > > > > > > > > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of > > > > > > > > performance degradation. > > > > > > > > > > > > > > So the whole point of the patch is to improve performance, but the > > > > > > > changelog doesn't include any performance measurements! > > > > > > > > > > > > > > > > > > > I don't really deal with hugetlbfs any more and I have not examined > > > > > > this > > > > > > series but I remember why I never really cared about this mutex. It > > > > > > wrecks > > > > > > fault scalability but AFAIK fault scalability almost never mattered > > > > > > for > > > > > > workloads using hugetlbfs. The most common user of hugetlbfs by > > > > > > far is > > > > > > sysv shared memory. The memory is faulted early in the lifetime of > > > > > > the > > > > > > workload and after that it does not matter. At worst, it hurts > > > > > > application > > > > > > startup time but that is still poor motivation for putting a lot of > > > > > > work > > > > > > into removing the mutex. > > > > > > > > > > Yep, important hugepage workloads initially pound heavily on this > > > > > lock, > > > > > then it naturally decreases. > > > > > > > > > > > Microbenchmarks will be able to trigger problems in this area but > > > > > > it'd > > > > > > be important to check if any workload that matters is actually > > > > > > hitting > > > > > > that problem. > > > > > > > > > > I was thinking of writing one to actually get some numbers for this > > > > > patchset -- I don't know of any benchmark that might stress this > > > > > lock. > > > > > > > > > > However I first measured the amount of cycles it costs to start an > > > > > Oracle DB and things went south with these changes. A simple 'startup > > > > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, > > > > > this > > > > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1 > > > > > billion. While there is naturally a fair amount of variation, these > > > > > changes do seem to do more harm than good, at least in real world > > > > > scenarios. 
> > > > > > > > Hello, > > > > > > > > I think that number of cycles is not proper to measure this patchset, > > > > because cycles would be wasted by fault handling failure. Instead, it > > > > targeted improved elapsed time. > > > > Fair enough, however the fact of the matter is this approach does en up > > hurting performance. Regarding total startup time, I didn't see hardly > > any differences, with both vanilla and this patchset it takes close to > > 33.5 seconds. > > > > > Could you tell me how long it > > > > takes to fault all of it's hugepages? > > > > > > > > Anyway, this order of magnitude still seems a problem. :/ > > > > > > > > I guess that cycles are wasted by zeroing hugepage in fault-path like as > > > > Andrew pointed out. > > > > > > > > I will send another patches to fix this problem. > > > > > > Hello, Davidlohr. > > > > > > Here goes the fix on top of this series. > > > > ... and with this patch we go from 27 down to 11 billion cycles, so this > > approach still costs more than what we currently have. A perf stat shows > > that an entire 1Gb huge page aware DB startup costs around ~30 billion > > cycles on a vanilla kernel, so the impact of hugetlb_fault() is > > definitely non trivial and IMO worth considering. > > Really thanks for your help. :) > > > > > Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a > > ride and things do look quite better, which is basically what Andrew was > > suggesting previously
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote:
> Hi Joonsoo,
>
> Sorry about the delay...
>
> On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > > On Fri, 2013-12-20 at 14:01 +0000, Mel Gorman wrote:
> > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim wrote:
> > > > > >
> > > > > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > > > > because many threads dequeue a hugepage to handle a fault of same address.
> > > > > > > This makes reserved pool shortage just for a little while and this cause
> > > > > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > > > >
> > > > > > > To solve this problem, we already have a nice solution, that is,
> > > > > > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > > > > > a fault handler. This solve the problem clearly, but it introduce
> > > > > > > performance degradation, because it serialize all fault handling.
> > > > > > >
> > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > > > > > performance degradation.
> > > > > >
> > > > > > So the whole point of the patch is to improve performance, but the
> > > > > > changelog doesn't include any performance measurements!
> > > > >
> > > > > I don't really deal with hugetlbfs any more and I have not examined this
> > > > > series but I remember why I never really cared about this mutex. It wrecks
> > > > > fault scalability but AFAIK fault scalability almost never mattered for
> > > > > workloads using hugetlbfs. The most common user of hugetlbfs by far is
> > > > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > > > workload and after that it does not matter. At worst, it hurts application
> > > > > startup time but that is still poor motivation for putting a lot of work
> > > > > into removing the mutex.
> > > >
> > > > Yep, important hugepage workloads initially pound heavily on this lock,
> > > > then it naturally decreases.
> > > >
> > > > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > > > be important to check if any workload that matters is actually hitting
> > > > > that problem.
> > > >
> > > > I was thinking of writing one to actually get some numbers for this
> > > > patchset -- I don't know of any benchmark that might stress this lock.
> > > >
> > > > However I first measured the amount of cycles it costs to start an
> > > > Oracle DB and things went south with these changes. A simple 'startup
> > > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > > billion. While there is naturally a fair amount of variation, these
> > > > changes do seem to do more harm than good, at least in real world
> > > > scenarios.
> > >
> > > Hello,
> > >
> > > I think that number of cycles is not proper to measure this patchset,
> > > because cycles would be wasted by fault handling failure. Instead, it
> > > targeted improved elapsed time.
>
> Fair enough, however the fact of the matter is this approach does en up
> hurting performance. Regarding total startup time, I didn't see hardly
> any differences, with both vanilla and this patchset it takes close to
> 33.5 seconds.
>
> > > Could you tell me how long it
> > > takes to fault all of it's hugepages?
> > >
> > > Anyway, this order of magnitude still seems a problem. :/
> > >
> > > I guess that cycles are wasted by zeroing hugepage in fault-path like as
> > > Andrew pointed out.
> > >
> > > I will send another patches to fix this problem.
> >
> > Hello, Davidlohr.
> >
> > Here goes the fix on top of this series.
>
> ... and with this patch we go from 27 down to 11 billion cycles, so this
> approach still costs more than what we currently have. A perf stat shows
> that an entire 1Gb huge page aware DB startup costs around ~30 billion
> cycles on a vanilla kernel, so the impact of hugetlb_fault() is
> definitely non trivial and IMO worth considering.

Really thanks for your help. :)

> Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a
> ride and things do look quite better, which is basically what Andrew was
> suggesting previously anyway. With the hash table approach the startup
> time did go down to ~25.1 seconds, which is a nice -24.7% time
> reduction, with hugetlb_fault() consuming roughly 5.3 billion cycles.
> This hash table was on a 80 core system, so since we do the power of two
> round up we end up with 256 entries -- I think we can do better if we
> enlarger further, maybe something like statically 1024, or probably
> better, 8-ish * nr cpus. Thoughts? Is there any reason why we cannot go
> with this instead? Yes, we still keep the mutex, but the approach is (1)
> proven better for performance on real world workloads and (2) far less
> invasive.

I have no more idea to improve my patches now, so I agree with your
approach. When I reviewed your
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
Hi Joonsoo,

Sorry about the delay...

On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > On Fri, 2013-12-20 at 14:01 +0000, Mel Gorman wrote:
> > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim wrote:
> > > > >
> > > > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > > > because many threads dequeue a hugepage to handle a fault of same address.
> > > > > > This makes reserved pool shortage just for a little while and this cause
> > > > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > > >
> > > > > > To solve this problem, we already have a nice solution, that is,
> > > > > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > > > > a fault handler. This solve the problem clearly, but it introduce
> > > > > > performance degradation, because it serialize all fault handling.
> > > > > >
> > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > > > > performance degradation.
> > > > >
> > > > > So the whole point of the patch is to improve performance, but the
> > > > > changelog doesn't include any performance measurements!
> > > >
> > > > I don't really deal with hugetlbfs any more and I have not examined this
> > > > series but I remember why I never really cared about this mutex. It wrecks
> > > > fault scalability but AFAIK fault scalability almost never mattered for
> > > > workloads using hugetlbfs. The most common user of hugetlbfs by far is
> > > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > > workload and after that it does not matter. At worst, it hurts application
> > > > startup time but that is still poor motivation for putting a lot of work
> > > > into removing the mutex.
> > >
> > > Yep, important hugepage workloads initially pound heavily on this lock,
> > > then it naturally decreases.
> > >
> > > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > > be important to check if any workload that matters is actually hitting
> > > > that problem.
> > >
> > > I was thinking of writing one to actually get some numbers for this
> > > patchset -- I don't know of any benchmark that might stress this lock.
> > >
> > > However I first measured the amount of cycles it costs to start an
> > > Oracle DB and things went south with these changes. A simple 'startup
> > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > billion. While there is naturally a fair amount of variation, these
> > > changes do seem to do more harm than good, at least in real world
> > > scenarios.
> >
> > Hello,
> >
> > I think that number of cycles is not proper to measure this patchset,
> > because cycles would be wasted by fault handling failure. Instead, it
> > targeted improved elapsed time.

Fair enough, however the fact of the matter is this approach does en up
hurting performance. Regarding total startup time, I didn't see hardly
any differences, with both vanilla and this patchset it takes close to
33.5 seconds.

> > Could you tell me how long it
> > takes to fault all of it's hugepages?
> >
> > Anyway, this order of magnitude still seems a problem. :/
> >
> > I guess that cycles are wasted by zeroing hugepage in fault-path like as
> > Andrew pointed out.
> >
> > I will send another patches to fix this problem.
>
> Hello, Davidlohr.
>
> Here goes the fix on top of this series.

... and with this patch we go from 27 down to 11 billion cycles, so this
approach still costs more than what we currently have. A perf stat shows
that an entire 1Gb huge page aware DB startup costs around ~30 billion
cycles on a vanilla kernel, so the impact of hugetlb_fault() is
definitely non trivial and IMO worth considering.

Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a
ride and things do look quite better, which is basically what Andrew was
suggesting previously anyway. With the hash table approach the startup
time did go down to ~25.1 seconds, which is a nice -24.7% time
reduction, with hugetlb_fault() consuming roughly 5.3 billion cycles.
This hash table was on a 80 core system, so since we do the power of two
round up we end up with 256 entries -- I think we can do better if we
enlarger further, maybe something like statically 1024, or probably
better, 8-ish * nr cpus. Thoughts? Is there any reason why we cannot go
with this instead? Yes, we still keep the mutex, but the approach is (1)
proven better for performance on real world workloads and (2) far less
invasive.

Thanks,
Davidlohr
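The "hash table approach" above replaces the single hugetlb_instantiation_mutex
with a fixed, power-of-two sized array of mutexes, so faults on unrelated huge
pages no longer serialize against each other while concurrent faults on the same
page still do. The sketch below is only a user-space illustration of the sizing
and indexing scheme being discussed (the power-of-two round-up and the suggested
"8-ish * nr cpus" table size); it is not the code from the linked patchset, and
the names and the (mapping, page index) hash key are invented for the example.

/*
 * Illustrative sketch only -- not the patchset referenced above.
 * A table of mutexes sized to roundup_pow_of_two(8 * nr_cpus), indexed by
 * a hash of the faulting mapping and huge-page index, so racing faults on
 * the same page pick the same mutex while unrelated faults usually don't.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t *fault_mutex_table;
static unsigned long num_fault_mutexes;	/* always a power of two */

static unsigned long roundup_pow_of_two(unsigned long n)
{
	unsigned long r = 1;

	while (r < n)
		r <<= 1;
	return r;
}

static void fault_mutex_table_init(void)
{
	long cpus = sysconf(_SC_NPROCESSORS_ONLN);
	unsigned long i;

	num_fault_mutexes = roundup_pow_of_two(8 * (cpus > 0 ? cpus : 1));
	fault_mutex_table = malloc(num_fault_mutexes * sizeof(*fault_mutex_table));
	if (!fault_mutex_table)
		abort();
	for (i = 0; i < num_fault_mutexes; i++)
		pthread_mutex_init(&fault_mutex_table[i], NULL);
}

/* Pick one mutex per (mapping, huge-page index) pair. */
static pthread_mutex_t *fault_mutex(const void *mapping, uint64_t pgoff)
{
	uint64_t key = (uint64_t)(uintptr_t)mapping ^ (pgoff * 0x9e3779b97f4a7c15ULL);

	/* power-of-two table size lets us mask instead of taking a modulo */
	return &fault_mutex_table[key & (num_fault_mutexes - 1)];
}

int main(void)
{
	fault_mutex_table_init();
	printf("table entries: %lu\n", num_fault_mutexes);

	pthread_mutex_t *m = fault_mutex((void *)0x1234, 42);
	pthread_mutex_lock(m);	/* the "fault" for this page would run here */
	pthread_mutex_unlock(m);
	return 0;
}

On an 80-core box this sizing gives 1024 entries rather than the 256 mentioned
above, which is the enlargement Davidlohr is suggesting; the trade-off is purely
memory for the extra mutexes versus fewer false collisions between unrelated faults.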
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote: > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote: > > On Fri, 2013-12-20 at 14:01 +, Mel Gorman wrote: > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote: > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim > > > > wrote: > > > > > > > > > If parallel fault occur, we can fail to allocate a hugepage, > > > > > because many threads dequeue a hugepage to handle a fault of same > > > > > address. > > > > > This makes reserved pool shortage just for a little while and this > > > > > cause > > > > > faulting thread who can get hugepages to get a SIGBUS signal. > > > > > > > > > > To solve this problem, we already have a nice solution, that is, > > > > > a hugetlb_instantiation_mutex. This blocks other threads to dive into > > > > > a fault handler. This solve the problem clearly, but it introduce > > > > > performance degradation, because it serialize all fault handling. > > > > > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of > > > > > performance degradation. > > > > > > > > So the whole point of the patch is to improve performance, but the > > > > changelog doesn't include any performance measurements! > > > > > > > > > > I don't really deal with hugetlbfs any more and I have not examined this > > > series but I remember why I never really cared about this mutex. It wrecks > > > fault scalability but AFAIK fault scalability almost never mattered for > > > workloads using hugetlbfs. The most common user of hugetlbfs by far is > > > sysv shared memory. The memory is faulted early in the lifetime of the > > > workload and after that it does not matter. At worst, it hurts application > > > startup time but that is still poor motivation for putting a lot of work > > > into removing the mutex. > > > > Yep, important hugepage workloads initially pound heavily on this lock, > > then it naturally decreases. > > > > > Microbenchmarks will be able to trigger problems in this area but it'd > > > be important to check if any workload that matters is actually hitting > > > that problem. > > > > I was thinking of writing one to actually get some numbers for this > > patchset -- I don't know of any benchmark that might stress this lock. > > > > However I first measured the amount of cycles it costs to start an > > Oracle DB and things went south with these changes. A simple 'startup > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1 > > billion. While there is naturally a fair amount of variation, these > > changes do seem to do more harm than good, at least in real world > > scenarios. > > Hello, > > I think that number of cycles is not proper to measure this patchset, > because cycles would be wasted by fault handling failure. Instead, it > targeted improved elapsed time. Could you tell me how long it > takes to fault all of it's hugepages? > > Anyway, this order of magnitude still seems a problem. :/ > > I guess that cycles are wasted by zeroing hugepage in fault-path like as > Andrew pointed out. > > I will send another patches to fix this problem. Hello, Davidlohr. Here goes the fix on top of this series. Thanks. -->8--- >From 5f20459d90dfa2f7cd28d62194ce22bd9a0df0f5 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Mon, 23 Dec 2013 10:32:04 +0900 Subject: [PATCH] mm, hugetlb: optimize zeroing hugepage When parallel faults occur, someone would be failed. 
In that case, the CPU cycles spent zeroing the failed hugepage are wasted. To reduce this overhead, mark the hugepage as zeroed after zeroing it, and unmark it once it is really used. If it isn't used for any reason, it returns to the hugepage pool and will be reused later; at that point we see the zeroed-page marker and can skip the zeroing.

Signed-off-by: Joonsoo Kim

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6edf423..b90b792 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -582,6 +582,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 				1 << PG_private | 1 << PG_writeback);
 	}
 	VM_BUG_ON(hugetlb_cgroup_from_page(page));
+	ClearPageActive(page);
 	set_compound_page_dtor(page, NULL);
 	set_page_refcounted(page);
 	arch_release_hugepage(page);
@@ -2715,6 +2716,7 @@ retry_avoidcopy:
 	spin_lock(ptl);
 	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
+		ClearPageActive(new_page);
 		ClearPagePrivate(new_page);
 
 		/* Break COW */
@@ -2834,7 +2836,10 @@ retry:
 			}
 			goto out;
 		}
-		clear_huge_page(page, address, pages_per_huge_page(h));
+		if (!PageActive(page)) {
+			clear_huge_page(page, address, pages_per_huge_page(h));
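The last hunk above is truncated in the archive. A plausible completion, consistent with the changelog's "mark after zeroing, skip zeroing when the marker is present" description, is sketched below; this is an assumption, not the author's verbatim patch.

	/*
	 * Assumed completion of the truncated hugetlb_no_page() hunk:
	 * only zero the page when it does not already carry the
	 * "already zeroed" marker (PageActive here), and set the
	 * marker once zeroing is done.  Hypothetical reconstruction.
	 */
	if (!PageActive(page)) {
		clear_huge_page(page, address, pages_per_huge_page(h));
		SetPageActive(page);	/* remember this page is zeroed */
	}

The ClearPageActive() calls added in update_and_free_page() and in the COW path would then drop the marker when the page is freed back or actually mapped, matching the description in the changelog.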
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote: > On Fri, 2013-12-20 at 14:01 +, Mel Gorman wrote: > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote: > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim > > > wrote: > > > > > > > If parallel fault occur, we can fail to allocate a hugepage, > > > > because many threads dequeue a hugepage to handle a fault of same > > > > address. > > > > This makes reserved pool shortage just for a little while and this cause > > > > faulting thread who can get hugepages to get a SIGBUS signal. > > > > > > > > To solve this problem, we already have a nice solution, that is, > > > > a hugetlb_instantiation_mutex. This blocks other threads to dive into > > > > a fault handler. This solve the problem clearly, but it introduce > > > > performance degradation, because it serialize all fault handling. > > > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of > > > > performance degradation. > > > > > > So the whole point of the patch is to improve performance, but the > > > changelog doesn't include any performance measurements! > > > > > > > I don't really deal with hugetlbfs any more and I have not examined this > > series but I remember why I never really cared about this mutex. It wrecks > > fault scalability but AFAIK fault scalability almost never mattered for > > workloads using hugetlbfs. The most common user of hugetlbfs by far is > > sysv shared memory. The memory is faulted early in the lifetime of the > > workload and after that it does not matter. At worst, it hurts application > > startup time but that is still poor motivation for putting a lot of work > > into removing the mutex. > > Yep, important hugepage workloads initially pound heavily on this lock, > then it naturally decreases. > > > Microbenchmarks will be able to trigger problems in this area but it'd > > be important to check if any workload that matters is actually hitting > > that problem. > > I was thinking of writing one to actually get some numbers for this > patchset -- I don't know of any benchmark that might stress this lock. > > However I first measured the amount of cycles it costs to start an > Oracle DB and things went south with these changes. A simple 'startup > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1 > billion. While there is naturally a fair amount of variation, these > changes do seem to do more harm than good, at least in real world > scenarios. Hello, I think that number of cycles is not proper to measure this patchset, because cycles would be wasted by fault handling failure. Instead, it targeted improved elapsed time. Could you tell me how long it takes to fault all of it's hugepages? Anyway, this order of magnitude still seems a problem. :/ I guess that cycles are wasted by zeroing hugepage in fault-path like as Andrew pointed out. I will send another patches to fix this problem. Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Fri, 2013-12-20 at 14:01 +, Mel Gorman wrote: > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote: > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim > > wrote: > > > > > If parallel fault occur, we can fail to allocate a hugepage, > > > because many threads dequeue a hugepage to handle a fault of same address. > > > This makes reserved pool shortage just for a little while and this cause > > > faulting thread who can get hugepages to get a SIGBUS signal. > > > > > > To solve this problem, we already have a nice solution, that is, > > > a hugetlb_instantiation_mutex. This blocks other threads to dive into > > > a fault handler. This solve the problem clearly, but it introduce > > > performance degradation, because it serialize all fault handling. > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of > > > performance degradation. > > > > So the whole point of the patch is to improve performance, but the > > changelog doesn't include any performance measurements! > > > > I don't really deal with hugetlbfs any more and I have not examined this > series but I remember why I never really cared about this mutex. It wrecks > fault scalability but AFAIK fault scalability almost never mattered for > workloads using hugetlbfs. The most common user of hugetlbfs by far is > sysv shared memory. The memory is faulted early in the lifetime of the > workload and after that it does not matter. At worst, it hurts application > startup time but that is still poor motivation for putting a lot of work > into removing the mutex. Yep, important hugepage workloads initially pound heavily on this lock, then it naturally decreases. > Microbenchmarks will be able to trigger problems in this area but it'd > be important to check if any workload that matters is actually hitting > that problem. I was thinking of writing one to actually get some numbers for this patchset -- I don't know of any benchmark that might stress this lock. However I first measured the amount of cycles it costs to start an Oracle DB and things went south with these changes. A simple 'startup immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this costs ~7.5 billion cycles and with this patchset it goes up to ~27.1 billion. While there is naturally a fair amount of variation, these changes do seem to do more harm than good, at least in real world scenarios. Thanks, Davidlohr -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote: > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim wrote: > > > If parallel fault occur, we can fail to allocate a hugepage, > > because many threads dequeue a hugepage to handle a fault of same address. > > This makes reserved pool shortage just for a little while and this cause > > faulting thread who can get hugepages to get a SIGBUS signal. > > > > To solve this problem, we already have a nice solution, that is, > > a hugetlb_instantiation_mutex. This blocks other threads to dive into > > a fault handler. This solve the problem clearly, but it introduce > > performance degradation, because it serialize all fault handling. > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of > > performance degradation. > > So the whole point of the patch is to improve performance, but the > changelog doesn't include any performance measurements! > I don't really deal with hugetlbfs any more and I have not examined this series but I remember why I never really cared about this mutex. It wrecks fault scalability but AFAIK fault scalability almost never mattered for workloads using hugetlbfs. The most common user of hugetlbfs by far is sysv shared memory. The memory is faulted early in the lifetime of the workload and after that it does not matter. At worst, it hurts application startup time but that is still poor motivation for putting a lot of work into removing the mutex. Microbenchmarks will be able to trigger problems in this area but it'd be important to check if any workload that matters is actually hitting that problem. -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Thu, Dec 19, 2013 at 06:15:20PM -0800, Andrew Morton wrote: > On Fri, 20 Dec 2013 10:58:10 +0900 Joonsoo Kim wrote: > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote: > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim > > > wrote: > > > > > > > If parallel fault occur, we can fail to allocate a hugepage, > > > > because many threads dequeue a hugepage to handle a fault of same > > > > address. > > > > This makes reserved pool shortage just for a little while and this cause > > > > faulting thread who can get hugepages to get a SIGBUS signal. > > > > > > > > > > So if I'm understanding this correctly... if N threads all generate a > > > fault against the same address, they will all dive in and allocate a > > > hugepage, will then do an enormous memcpy into that page and will then > > > attempt to instantiate the page in pagetables. All threads except one > > > will lose the race and will free the page again! This sounds terribly > > > inefficient; it would be useful to write a microbenchmark which > > > triggers this scenario so we can explore the impact. > > > > Yes, you understand correctly, I think. > > > > I have an idea to prevent this overhead. It is that marking page when it > > is zeroed and unmarking when it is mapped to page table. If page mapping > > is failed due to current thread, the zeroed page will keep the marker and > > later we can determine if it is zeroed or not. > > Well OK, but the other threads will need to test that in-progress flag > and then do . Where will involve some form of > open-coded sleep/wakeup thing. To avoid all that wheel-reinventing we > can avoid using an internal flag and use an external flag instead. > There's one in struct mutex! My idea consider only hugetlb_no_page() and doesn't need a sleep. It just set page flag after zeroing and if some thread takes the page with this flag when faulting, simply use it without zeroing. > > I doubt if the additional complexity of the external flag is worth it, > but convincing performance testing results would sway me ;) Please have > a think about it all. > > > If you want to include this functionality in this series, I can do it ;) > > Please let me know your decision. > > > > > I'm wondering if a better solution to all of this would be to make > > > hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it > > > with a hash of the faulting address. That will 99.9% solve the > > > performance issue which you believe exists without introducing this new > > > performance issue? > > > > Yes, that approach would solve the performance issue. > > IIRC, you already suggested this idea roughly 6 months ago and it is > > implemented by Davidlohr. I remembered that there is a race issue on > > COW case with this approach. See following link for more information. > > https://lkml.org/lkml/2013/8/7/142 > > That seems to be unrelated to hugetlb_instantiation_mutex? Yes, it is related to hugetlb_instantiation_mutex. In the link, I mentioned about race condition of table mutex patches which is for replacing hugetlb_instantiation_mutex, although conversation isn't easy to follow-up. > > > And we need 1-3 patches to prevent other theorectical race issue > > regardless any approaches. > > Yes, I'll be going through patches 1-12 very soon, thanks. Okay. Thanks :) > > > And to reiterate: I'm very uncomfortable mucking around with > performance patches when we have run no tests to measure their > magnitude, or even whether they are beneficial at all! Okay. I will keep in mind it. :) Thanks. 
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
Hello, Davidlohr. On Thu, Dec 19, 2013 at 06:31:21PM -0800, Davidlohr Bueso wrote: > On Thu, 2013-12-19 at 17:02 -0800, Andrew Morton wrote: > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim > > wrote: > > > > > If parallel fault occur, we can fail to allocate a hugepage, > > > because many threads dequeue a hugepage to handle a fault of same address. > > > This makes reserved pool shortage just for a little while and this cause > > > faulting thread who can get hugepages to get a SIGBUS signal. > > > > > > To solve this problem, we already have a nice solution, that is, > > > a hugetlb_instantiation_mutex. This blocks other threads to dive into > > > a fault handler. This solve the problem clearly, but it introduce > > > performance degradation, because it serialize all fault handling. > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of > > > performance degradation. > > > > So the whole point of the patch is to improve performance, but the > > changelog doesn't include any performance measurements! > > > > Please, run some quantitative tests and include a nice summary of the > > results in the changelog. > > I was actually spending this afternoon testing these patches with Oracle > (I haven't seen any issues so far) and unless Joonsoo already did so, I > want to run these by the libhugetlb test cases - I got side tracked by > futexes though :/ Really thanks for your time to test these patches. I already did libhugetlbfs test cases and passed it. > > Please do consider that performance wise I haven't seen much in > particular. The thing is, I started dealing with this mutex once I > noticed it as the #1 hot lock in Oracle DB starts, but then once the > faults are done, it really goes away. So I wouldn't say that the mutex > is a bottleneck except for the first few minutes. What I want to be sure is for the first few minutes you mentioned. If possible, let me know the result like as following link. https://lkml.org/lkml/2013/7/12/428 Thanks in advance. :) > > > > This is terribly important, because if the performance benefit is > > infinitesimally small or negative, the patch goes into the bit bucket ;) > > Well, this mutex is infinitesimally ugly and needs to die (as long as > performance isn't hurt). Yes, I agreed. Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Thu, 2013-12-19 at 17:02 -0800, Andrew Morton wrote: > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim wrote: > > > If parallel fault occur, we can fail to allocate a hugepage, > > because many threads dequeue a hugepage to handle a fault of same address. > > This makes reserved pool shortage just for a little while and this cause > > faulting thread who can get hugepages to get a SIGBUS signal. > > > > To solve this problem, we already have a nice solution, that is, > > a hugetlb_instantiation_mutex. This blocks other threads to dive into > > a fault handler. This solve the problem clearly, but it introduce > > performance degradation, because it serialize all fault handling. > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of > > performance degradation. > > So the whole point of the patch is to improve performance, but the > changelog doesn't include any performance measurements! > > Please, run some quantitative tests and include a nice summary of the > results in the changelog. I was actually spending this afternoon testing these patches with Oracle (I haven't seen any issues so far) and unless Joonsoo already did so, I want to run these by the libhugetlb test cases - I got side tracked by futexes though :/ Please do consider that performance wise I haven't seen much in particular. The thing is, I started dealing with this mutex once I noticed it as the #1 hot lock in Oracle DB starts, but then once the faults are done, it really goes away. So I wouldn't say that the mutex is a bottleneck except for the first few minutes. > > This is terribly important, because if the performance benefit is > infinitesimally small or negative, the patch goes into the bit bucket ;) Well, this mutex is infinitesimally ugly and needs to die (as long as performance isn't hurt). Thanks, Davidlohr -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Fri, 20 Dec 2013 10:58:10 +0900 Joonsoo Kim wrote: > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote: > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim > > wrote: > > > > > If parallel fault occur, we can fail to allocate a hugepage, > > > because many threads dequeue a hugepage to handle a fault of same address. > > > This makes reserved pool shortage just for a little while and this cause > > > faulting thread who can get hugepages to get a SIGBUS signal. > > > > > > > So if I'm understanding this correctly... if N threads all generate a > > fault against the same address, they will all dive in and allocate a > > hugepage, will then do an enormous memcpy into that page and will then > > attempt to instantiate the page in pagetables. All threads except one > > will lose the race and will free the page again! This sounds terribly > > inefficient; it would be useful to write a microbenchmark which > > triggers this scenario so we can explore the impact. > > Yes, you understand correctly, I think. > > I have an idea to prevent this overhead. It is that marking page when it > is zeroed and unmarking when it is mapped to page table. If page mapping > is failed due to current thread, the zeroed page will keep the marker and > later we can determine if it is zeroed or not. Well OK, but the other threads will need to test that in-progress flag and then do something. Where something will involve some form of open-coded sleep/wakeup thing. To avoid all that wheel-reinventing we can avoid using an internal flag and use an external flag instead. There's one in struct mutex! I doubt if the additional complexity of the external flag is worth it, but convincing performance testing results would sway me ;) Please have a think about it all. > If you want to include this functionality in this series, I can do it ;) > Please let me know your decision. > > > I'm wondering if a better solution to all of this would be to make > > hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it > > with a hash of the faulting address. That will 99.9% solve the > > performance issue which you believe exists without introducing this new > > performance issue? > > Yes, that approach would solve the performance issue. > IIRC, you already suggested this idea roughly 6 months ago and it is > implemented by Davidlohr. I remembered that there is a race issue on > COW case with this approach. See following link for more information. > https://lkml.org/lkml/2013/8/7/142 That seems to be unrelated to hugetlb_instantiation_mutex? > And we need 1-3 patches to prevent other theorectical race issue > regardless any approaches. Yes, I'll be going through patches 1-12 very soon, thanks. And to reiterate: I'm very uncomfortable mucking around with performance patches when we have run no tests to measure their magnitude, or even whether they are beneficial at all! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote: > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim wrote: > > > If parallel fault occur, we can fail to allocate a hugepage, > > because many threads dequeue a hugepage to handle a fault of same address. > > This makes reserved pool shortage just for a little while and this cause > > faulting thread who can get hugepages to get a SIGBUS signal. > > > > To solve this problem, we already have a nice solution, that is, > > a hugetlb_instantiation_mutex. This blocks other threads to dive into > > a fault handler. This solve the problem clearly, but it introduce > > performance degradation, because it serialize all fault handling. > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of > > performance degradation. > > So the whole point of the patch is to improve performance, but the > changelog doesn't include any performance measurements! > > Please, run some quantitative tests and include a nice summary of the > results in the changelog. > > This is terribly important, because if the performance benefit is > infinitesimally small or negative, the patch goes into the bit bucket ;) Hello, Andrew, Davidlohr. Yes, I should include the peformance measurements. I can measure it on artificial circumstance, but I think that the best is Davidlohr who reported the issue checks the performance improvement. https://lkml.org/lkml/2013/7/12/428 Davidlohr, could you measure it on your testing environment where you originally reported the issue? > > > For achieving it, at first, we should ensure that > > no one get a SIGBUS if there are enough hugepages. > > > > For this purpose, if we fail to allocate a new hugepage when there is > > concurrent user, we return just 0, instead of VM_FAULT_SIGBUS. With this, > > these threads defer to get a SIGBUS signal until there is no > > concurrent user, and so, we can ensure that no one get a SIGBUS if there > > are enough hugepages. > > So if I'm understanding this correctly... if N threads all generate a > fault against the same address, they will all dive in and allocate a > hugepage, will then do an enormous memcpy into that page and will then > attempt to instantiate the page in pagetables. All threads except one > will lose the race and will free the page again! This sounds terribly > inefficient; it would be useful to write a microbenchmark which > triggers this scenario so we can explore the impact. Yes, you understand correctly, I think. I have an idea to prevent this overhead. It is that marking page when it is zeroed and unmarking when it is mapped to page table. If page mapping is failed due to current thread, the zeroed page will keep the marker and later we can determine if it is zeroed or not. If you want to include this functionality in this series, I can do it ;) Please let me know your decision. > I'm wondering if a better solution to all of this would be to make > hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it > with a hash of the faulting address. That will 99.9% solve the > performance issue which you believe exists without introducing this new > performance issue? Yes, that approach would solve the performance issue. IIRC, you already suggested this idea roughly 6 months ago and it is implemented by Davidlohr. I remembered that there is a race issue on COW case with this approach. See following link for more information. https://lkml.org/lkml/2013/8/7/142 And we need 1-3 patches to prevent other theorectical race issue regardless any approaches. 
Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim wrote: > If parallel fault occur, we can fail to allocate a hugepage, > because many threads dequeue a hugepage to handle a fault of same address. > This makes reserved pool shortage just for a little while and this cause > faulting thread who can get hugepages to get a SIGBUS signal. > > To solve this problem, we already have a nice solution, that is, > a hugetlb_instantiation_mutex. This blocks other threads to dive into > a fault handler. This solve the problem clearly, but it introduce > performance degradation, because it serialize all fault handling. > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of > performance degradation. So the whole point of the patch is to improve performance, but the changelog doesn't include any performance measurements! Please, run some quantitative tests and include a nice summary of the results in the changelog. This is terribly important, because if the performance benefit is infinitesimally small or negative, the patch goes into the bit bucket ;) > For achieving it, at first, we should ensure that > no one get a SIGBUS if there are enough hugepages. > > For this purpose, if we fail to allocate a new hugepage when there is > concurrent user, we return just 0, instead of VM_FAULT_SIGBUS. With this, > these threads defer to get a SIGBUS signal until there is no > concurrent user, and so, we can ensure that no one get a SIGBUS if there > are enough hugepages. So if I'm understanding this correctly... if N threads all generate a fault against the same address, they will all dive in and allocate a hugepage, will then do an enormous memcpy into that page and will then attempt to instantiate the page in pagetables. All threads except one will lose the race and will free the page again! This sounds terribly inefficient; it would be useful to write a microbenchmark which triggers this scenario so we can explore the impact. I'm wondering if a better solution to all of this would be to make hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it with a hash of the faulting address. That will 99.9% solve the performance issue which you believe exists without introducing this new performance issue? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
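To make the behaviour the quoted changelog describes concrete, here is a rough sketch of the "return 0 instead of SIGBUS when there is a concurrent user" logic in the allocation failure path. This is an illustration only; the per-hstate counter name (nr_concurrent_faults) and the exact placement in hugetlb_no_page() are assumptions, not the code from the series.

	/*
	 * Sketch of the retry-instead-of-SIGBUS idea from the changelog.
	 * Hypothetical: assumes some per-hstate count of in-flight faults
	 * (nr_concurrent_faults) is maintained elsewhere in the fault path.
	 */
	page = alloc_huge_page(vma, address, 0);
	if (IS_ERR(page)) {
		if (atomic_read(&h->nr_concurrent_faults) > 0) {
			/*
			 * Another thread may return its hugepage to the
			 * reserved pool shortly; let this thread retry the
			 * fault rather than delivering SIGBUS now.
			 */
			ret = 0;
			goto out;
		}
		ret = VM_FAULT_SIGBUS;
		goto out;
	}

Returning 0 from the fault handler makes the task re-execute the faulting instruction, so as long as there are enough hugepages overall, a thread entitled to a page is not killed by a transient shortage caused by parallel dequeues.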
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote: If parallel fault occur, we can fail to allocate a hugepage, because many threads dequeue a hugepage to handle a fault of same address. This makes reserved pool shortage just for a little while and this cause faulting thread who can get hugepages to get a SIGBUS signal. To solve this problem, we already have a nice solution, that is, a hugetlb_instantiation_mutex. This blocks other threads to dive into a fault handler. This solve the problem clearly, but it introduce performance degradation, because it serialize all fault handling. Now, I try to remove a hugetlb_instantiation_mutex to get rid of performance degradation. So the whole point of the patch is to improve performance, but the changelog doesn't include any performance measurements! Please, run some quantitative tests and include a nice summary of the results in the changelog. This is terribly important, because if the performance benefit is infinitesimally small or negative, the patch goes into the bit bucket ;) For achieving it, at first, we should ensure that no one get a SIGBUS if there are enough hugepages. For this purpose, if we fail to allocate a new hugepage when there is concurrent user, we return just 0, instead of VM_FAULT_SIGBUS. With this, these threads defer to get a SIGBUS signal until there is no concurrent user, and so, we can ensure that no one get a SIGBUS if there are enough hugepages. So if I'm understanding this correctly... if N threads all generate a fault against the same address, they will all dive in and allocate a hugepage, will then do an enormous memcpy into that page and will then attempt to instantiate the page in pagetables. All threads except one will lose the race and will free the page again! This sounds terribly inefficient; it would be useful to write a microbenchmark which triggers this scenario so we can explore the impact. I'm wondering if a better solution to all of this would be to make hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it with a hash of the faulting address. That will 99.9% solve the performance issue which you believe exists without introducing this new performance issue? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote: On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote: If parallel fault occur, we can fail to allocate a hugepage, because many threads dequeue a hugepage to handle a fault of same address. This makes reserved pool shortage just for a little while and this cause faulting thread who can get hugepages to get a SIGBUS signal. To solve this problem, we already have a nice solution, that is, a hugetlb_instantiation_mutex. This blocks other threads to dive into a fault handler. This solve the problem clearly, but it introduce performance degradation, because it serialize all fault handling. Now, I try to remove a hugetlb_instantiation_mutex to get rid of performance degradation. So the whole point of the patch is to improve performance, but the changelog doesn't include any performance measurements! Please, run some quantitative tests and include a nice summary of the results in the changelog. This is terribly important, because if the performance benefit is infinitesimally small or negative, the patch goes into the bit bucket ;) Hello, Andrew, Davidlohr. Yes, I should include the peformance measurements. I can measure it on artificial circumstance, but I think that the best is Davidlohr who reported the issue checks the performance improvement. https://lkml.org/lkml/2013/7/12/428 Davidlohr, could you measure it on your testing environment where you originally reported the issue? For achieving it, at first, we should ensure that no one get a SIGBUS if there are enough hugepages. For this purpose, if we fail to allocate a new hugepage when there is concurrent user, we return just 0, instead of VM_FAULT_SIGBUS. With this, these threads defer to get a SIGBUS signal until there is no concurrent user, and so, we can ensure that no one get a SIGBUS if there are enough hugepages. So if I'm understanding this correctly... if N threads all generate a fault against the same address, they will all dive in and allocate a hugepage, will then do an enormous memcpy into that page and will then attempt to instantiate the page in pagetables. All threads except one will lose the race and will free the page again! This sounds terribly inefficient; it would be useful to write a microbenchmark which triggers this scenario so we can explore the impact. Yes, you understand correctly, I think. I have an idea to prevent this overhead. It is that marking page when it is zeroed and unmarking when it is mapped to page table. If page mapping is failed due to current thread, the zeroed page will keep the marker and later we can determine if it is zeroed or not. If you want to include this functionality in this series, I can do it ;) Please let me know your decision. I'm wondering if a better solution to all of this would be to make hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it with a hash of the faulting address. That will 99.9% solve the performance issue which you believe exists without introducing this new performance issue? Yes, that approach would solve the performance issue. IIRC, you already suggested this idea roughly 6 months ago and it is implemented by Davidlohr. I remembered that there is a race issue on COW case with this approach. See following link for more information. https://lkml.org/lkml/2013/8/7/142 And we need 1-3 patches to prevent other theorectical race issue regardless any approaches. Thanks. 
-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Fri, 20 Dec 2013 10:58:10 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote: On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote: On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote: If parallel fault occur, we can fail to allocate a hugepage, because many threads dequeue a hugepage to handle a fault of same address. This makes reserved pool shortage just for a little while and this cause faulting thread who can get hugepages to get a SIGBUS signal. So if I'm understanding this correctly... if N threads all generate a fault against the same address, they will all dive in and allocate a hugepage, will then do an enormous memcpy into that page and will then attempt to instantiate the page in pagetables. All threads except one will lose the race and will free the page again! This sounds terribly inefficient; it would be useful to write a microbenchmark which triggers this scenario so we can explore the impact. Yes, you understand correctly, I think. I have an idea to prevent this overhead. It is that marking page when it is zeroed and unmarking when it is mapped to page table. If page mapping is failed due to current thread, the zeroed page will keep the marker and later we can determine if it is zeroed or not. Well OK, but the other threads will need to test that in-progress flag and then do something. Where something will involve some form of open-coded sleep/wakeup thing. To avoid all that wheel-reinventing we can avoid using an internal flag and use an external flag instead. There's one in struct mutex! I doubt if the additional complexity of the external flag is worth it, but convincing performance testing results would sway me ;) Please have a think about it all. If you want to include this functionality in this series, I can do it ;) Please let me know your decision. I'm wondering if a better solution to all of this would be to make hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it with a hash of the faulting address. That will 99.9% solve the performance issue which you believe exists without introducing this new performance issue? Yes, that approach would solve the performance issue. IIRC, you already suggested this idea roughly 6 months ago and it is implemented by Davidlohr. I remembered that there is a race issue on COW case with this approach. See following link for more information. https://lkml.org/lkml/2013/8/7/142 That seems to be unrelated to hugetlb_instantiation_mutex? And we need 1-3 patches to prevent other theorectical race issue regardless any approaches. Yes, I'll be going through patches 1-12 very soon, thanks. And to reiterate: I'm very uncomfortable mucking around with performance patches when we have run no tests to measure their magnitude, or even whether they are beneficial at all! -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Thu, 2013-12-19 at 17:02 -0800, Andrew Morton wrote: On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote: If parallel fault occur, we can fail to allocate a hugepage, because many threads dequeue a hugepage to handle a fault of same address. This makes reserved pool shortage just for a little while and this cause faulting thread who can get hugepages to get a SIGBUS signal. To solve this problem, we already have a nice solution, that is, a hugetlb_instantiation_mutex. This blocks other threads to dive into a fault handler. This solve the problem clearly, but it introduce performance degradation, because it serialize all fault handling. Now, I try to remove a hugetlb_instantiation_mutex to get rid of performance degradation. So the whole point of the patch is to improve performance, but the changelog doesn't include any performance measurements! Please, run some quantitative tests and include a nice summary of the results in the changelog. I was actually spending this afternoon testing these patches with Oracle (I haven't seen any issues so far) and unless Joonsoo already did so, I want to run these by the libhugetlb test cases - I got side tracked by futexes though :/ Please do consider that performance wise I haven't seen much in particular. The thing is, I started dealing with this mutex once I noticed it as the #1 hot lock in Oracle DB starts, but then once the faults are done, it really goes away. So I wouldn't say that the mutex is a bottleneck except for the first few minutes. This is terribly important, because if the performance benefit is infinitesimally small or negative, the patch goes into the bit bucket ;) Well, this mutex is infinitesimally ugly and needs to die (as long as performance isn't hurt). Thanks, Davidlohr -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
Hello, Davidlohr. On Thu, Dec 19, 2013 at 06:31:21PM -0800, Davidlohr Bueso wrote: On Thu, 2013-12-19 at 17:02 -0800, Andrew Morton wrote: On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote: If parallel fault occur, we can fail to allocate a hugepage, because many threads dequeue a hugepage to handle a fault of same address. This makes reserved pool shortage just for a little while and this cause faulting thread who can get hugepages to get a SIGBUS signal. To solve this problem, we already have a nice solution, that is, a hugetlb_instantiation_mutex. This blocks other threads to dive into a fault handler. This solve the problem clearly, but it introduce performance degradation, because it serialize all fault handling. Now, I try to remove a hugetlb_instantiation_mutex to get rid of performance degradation. So the whole point of the patch is to improve performance, but the changelog doesn't include any performance measurements! Please, run some quantitative tests and include a nice summary of the results in the changelog. I was actually spending this afternoon testing these patches with Oracle (I haven't seen any issues so far) and unless Joonsoo already did so, I want to run these by the libhugetlb test cases - I got side tracked by futexes though :/ Really thanks for your time to test these patches. I already did libhugetlbfs test cases and passed it. Please do consider that performance wise I haven't seen much in particular. The thing is, I started dealing with this mutex once I noticed it as the #1 hot lock in Oracle DB starts, but then once the faults are done, it really goes away. So I wouldn't say that the mutex is a bottleneck except for the first few minutes. What I want to be sure is for the first few minutes you mentioned. If possible, let me know the result like as following link. https://lkml.org/lkml/2013/7/12/428 Thanks in advance. :) This is terribly important, because if the performance benefit is infinitesimally small or negative, the patch goes into the bit bucket ;) Well, this mutex is infinitesimally ugly and needs to die (as long as performance isn't hurt). Yes, I agreed. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
On Thu, Dec 19, 2013 at 06:15:20PM -0800, Andrew Morton wrote:
> On Fri, 20 Dec 2013 10:58:10 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote:
>
> > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote:
> > >
> > > > If parallel faults occur, we can fail to allocate a hugepage, because
> > > > many threads dequeue a hugepage to handle a fault at the same address.
> > > > This causes a reserved pool shortage just for a little while, and a
> > > > faulting thread that could otherwise get a hugepage receives a SIGBUS
> > > > signal instead.
> > >
> > > So if I'm understanding this correctly... if N threads all generate a
> > > fault against the same address, they will all dive in and allocate a
> > > hugepage, will then do an enormous memcpy into that page and will then
> > > attempt to instantiate the page in pagetables. All threads except one
> > > will lose the race and will free the page again! This sounds terribly
> > > inefficient; it would be useful to write a microbenchmark which
> > > triggers this scenario so we can explore the impact.
> >
> > Yes, you understand correctly, I think.
> >
> > I have an idea to prevent this overhead: mark the page when it is
> > zeroed and unmark it when it is mapped into the page table. If the page
> > mapping fails for the current thread, the zeroed page keeps the marker,
> > and later we can determine whether it has already been zeroed.
>
> Well OK, but the other threads will need to test that in-progress flag
> and then do something. Where "something" will involve some form of
> open-coded sleep/wakeup thing. To avoid all that wheel-reinventing we
> can avoid using an internal flag and use an external flag instead.
> There's one in struct mutex!

My idea only concerns hugetlb_no_page() and doesn't need a sleep. It
just sets some page flag after zeroing, and if some thread takes a page
with this flag set when faulting, it simply uses it without zeroing.

> I doubt if the additional complexity of the external flag is worth it,
> but convincing performance testing results would sway me ;) Please have
> a think about it all.

If you want to include this functionality in this series, I can do it ;)
Please let me know your decision.

> > > I'm wondering if a better solution to all of this would be to make
> > > hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index
> > > it with a hash of the faulting address. That will 99.9% solve the
> > > performance issue which you believe exists without introducing this
> > > new performance issue?
> >
> > Yes, that approach would solve the performance issue. IIRC, you already
> > suggested this idea roughly 6 months ago and it was implemented by
> > Davidlohr. I remember that there is a race issue in the COW case with
> > this approach. See the following link for more information.
> >
> > https://lkml.org/lkml/2013/8/7/142
>
> That seems to be unrelated to hugetlb_instantiation_mutex?

Yes, it is related to the hugetlb_instantiation_mutex. In the link, I
mentioned a race condition in the mutex table patches, which are meant
to replace the hugetlb_instantiation_mutex, although the conversation
isn't easy to follow.

> > And we need patches 1-3 to prevent other theoretical race issues
> > regardless of the approach.
>
> Yes, I'll be going through patches 1-12 very soon, thanks.

Okay. Thanks :)

> And to reiterate: I'm very uncomfortable mucking around with
> performance patches when we have run no tests to measure their
> magnitude, or even whether they are beneficial at all!

Okay, I will keep it in mind. :)

Thanks.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
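The page-marking idea Joonsoo describes in the exchange above, where a page zeroed by a losing thread keeps a marker so the next faulting thread can skip the clear, can be sketched roughly as follows. This is a hypothetical userspace illustration: the prezeroed flag, the helper names, and the 2MB page size are assumptions made for the example, not an existing kernel interface.

/*
 * Sketch of the "mark the page once it has been zeroed" idea: the thread
 * that zeroes a freshly dequeued page sets a hypothetical 'prezeroed'
 * flag; if it then loses the race to map the page and the page goes back
 * to the pool, the next faulting thread can skip the redundant clear.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

struct hpage {
	bool prezeroed;			/* set after clearing, cleared on map */
	char data[HPAGE_SIZE];
};

static struct hpage pool_page;		/* stand-in for a page in the reserve pool */

static void prepare_new_page(struct hpage *p)
{
	if (!p->prezeroed) {
		memset(p->data, 0, sizeof(p->data));	/* the expensive part */
		p->prezeroed = true;
	}
}

static void map_page(struct hpage *p)
{
	/* Page becomes visible to userspace, so it is no longer pristine. */
	p->prezeroed = false;
}

int main(void)
{
	prepare_new_page(&pool_page);	/* losing thread zeroes the page ...    */
	/* ... fails to map it and returns it to the pool unmodified ...         */
	prepare_new_page(&pool_page);	/* next fault reuses it without zeroing  */
	map_page(&pool_page);
	printf("mapped, prezeroed=%d\n", pool_page.prezeroed);
	return 0;
}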
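Andrew's suggestion of replacing the single mutex with an array of, say, 1024 mutexes indexed by a hash of the faulting address might look roughly like the userspace sketch below. The table size comes from the suggestion above; the hash function, the 2MB alignment, and all names are illustrative assumptions.

/*
 * Userspace sketch of a table of mutexes indexed by a hash of the
 * faulting address. Only faults that hash to the same bucket serialize.
 */
#include <pthread.h>
#include <stdio.h>

#define FAULT_MUTEX_TABLE_SIZE	1024	/* power of two */

static pthread_mutex_t fault_mutex_table[FAULT_MUTEX_TABLE_SIZE];

/* Hash the huge-page index of the faulting address into a table slot. */
static unsigned int fault_mutex_hash(unsigned long addr)
{
	unsigned long hpage_idx = addr >> 21;	/* assume 2MB huge pages */

	return (hpage_idx * 2654435761UL) & (FAULT_MUTEX_TABLE_SIZE - 1);
}

static void handle_fault(unsigned long addr)
{
	pthread_mutex_t *lock = &fault_mutex_table[fault_mutex_hash(addr)];

	pthread_mutex_lock(lock);
	/*
	 * Threads faulting on the same huge page still exclude each other,
	 * but faults on different pages almost always take different locks
	 * and can proceed in parallel.
	 */
	printf("instantiating hugepage for %#lx\n", addr);
	pthread_mutex_unlock(lock);
}

int main(void)
{
	for (int i = 0; i < FAULT_MUTEX_TABLE_SIZE; i++)
		pthread_mutex_init(&fault_mutex_table[i], NULL);

	handle_fault(0x200000UL);
	handle_fault(0x400000UL);	/* different huge page, different lock */
	return 0;
}

A real implementation would likely fold the mapping and file offset into the hash as well, and the COW race Joonsoo points to above is the kind of corner case such a table still has to handle.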