Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2014-01-15, Andrew Morton
On Wed, 15 Jan 2014 12:47:00 -0800 Davidlohr Bueso  wrote:

> > Well, we're mainly looking for bugfixes this late in the cycle.
> > "[PATCH v3 03/14] mm, hugetlb: protect region tracking via newly
> > introduced resv_map lock" fixes a bug, but I'd assumed that it depended
> > on earlier patches. 
> 
> It doesn't seem to depend on anything. Patches 1-7 all apply cleanly on
> linux-next; the last change to mm/hugetlb.c was commit 3ebac7fa (mm:
> dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE).
> 
> >  If we think that one is serious then it would be
> > better to cook up a minimal fix which is backportable into 3.12 and
> > earlier?
> 
> I don't think it's too serious; AFAIK it's a theoretical race and I
> haven't seen any bug reports for it. So we can probably just wait for
> 3.14 -- as you say, it's already late in the cycle anyway.

OK, thanks.

> Just let me
> know what you want to do so we can continue working on the actual
> performance issue.

A resend after -rc1 would suit.


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2014-01-15, Davidlohr Bueso
On Tue, 2014-01-14 at 20:56 -0800, Andrew Morton wrote:
> On Tue, 14 Jan 2014 20:37:49 -0800 Davidlohr Bueso  wrote:
> 
> > On Tue, 2014-01-14 at 19:08 -0800, David Rientjes wrote:
> > > On Mon, 6 Jan 2014, Davidlohr Bueso wrote:
> > > 
> > > > > If Andrew agrees, it would be great to merge patches 1-7 into mainline
> > > > > before your mutex approach. Some of them are clean-up patches and, IMO,
> > > > > they make the code more readable and maintainable, so they are worth
> > > > > merging separately.
> > > > 
> > > > Fine by me.
> > > > 
> > > 
> > > It appears that patches 1-7 are still missing from linux-next; would you
> > > mind posting them in a series with your approach?
> > 
> > I haven't looked much into patches 4-7, but at least the first three are
> > ok. I was waiting for Andrew to take all seven for linux-next and then
> > I'd rebase my approach on top. Anyway, unless Andrew has any
> > preferences, if by later this week they're not picked up, I'll resend
> > everything.
> 
> Well, we're mainly looking for bugfixes this late in the cycle.
> "[PATCH v3 03/14] mm, hugetlb: protect region tracking via newly
> introduced resv_map lock" fixes a bug, but I'd assumed that it depended
> on earlier patches. 

It doesn't seem to depend on anything. Patches 1-7 all apply cleanly on
linux-next; the last change to mm/hugetlb.c was commit 3ebac7fa (mm:
dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE).

>  If we think that one is serious then it would be
> better to cook up a minimal fix which is backportable into 3.12 and
> earlier?

I don't think it's too serious; AFAIK it's a theoretical race and I
haven't seen any bug reports for it. So we can probably just wait for
3.14 -- as you say, it's already late in the cycle anyway. Just let me
know what you want to do so we can continue working on the actual
performance issue.

Thanks,
Davidlohr



Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2014-01-14, Andrew Morton
On Tue, 14 Jan 2014 20:37:49 -0800 Davidlohr Bueso  wrote:

> On Tue, 2014-01-14 at 19:08 -0800, David Rientjes wrote:
> > On Mon, 6 Jan 2014, Davidlohr Bueso wrote:
> > 
> > > > If Andrew agrees, it would be great to merge patches 1-7 into mainline
> > > > before your mutex approach. Some of them are clean-up patches and, IMO,
> > > > they make the code more readable and maintainable, so they are worth
> > > > merging separately.
> > > 
> > > Fine by me.
> > > 
> > 
> > It appears that patches 1-7 are still missing from linux-next; would you
> > mind posting them in a series with your approach?
> 
> I haven't looked much into patches 4-7, but at least the first three are
> ok. I was waiting for Andrew to take all seven for linux-next and then
> I'd rebase my approach on top. Anyway, unless Andrew has any
> preferences, if by later this week they're not picked up, I'll resend
> everything.

Well, we're mainly looking for bugfixes this late in the cycle.
"[PATCH v3 03/14] mm, hugetlb: protect region tracking via newly
introduced resv_map lock" fixes a bug, but I'd assumed that it depended
on earlier patches.  If we think that one is serious then it would be
better to cook up a minimal fix which is backportable into 3.12 and
earlier?


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2014-01-14, Davidlohr Bueso
On Tue, 2014-01-14 at 19:08 -0800, David Rientjes wrote:
> On Mon, 6 Jan 2014, Davidlohr Bueso wrote:
> 
> > > If Andrew agrees, it would be great to merge patches 1-7 into mainline
> > > before your mutex approach. Some of them are clean-up patches and, IMO,
> > > they make the code more readable and maintainable, so they are worth
> > > merging separately.
> > 
> > Fine by me.
> > 
> 
> It appears that patches 1-7 are still missing from linux-next; would you
> mind posting them in a series with your approach?

I haven't looked much into patches 4-7, but at least the first three are
ok. I was waiting for Andrew to take all seven for linux-next and then
I'd rebase my approach on top. Anyway, unless Andrew has any
preferences, if by later this week they're not picked up, I'll resend
everything.



Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2014-01-14, David Rientjes
On Mon, 6 Jan 2014, Davidlohr Bueso wrote:

> > If Andrew agrees, it would be great to merge patches 1-7 into mainline
> > before your mutex approach. Some of them are clean-up patches and, IMO,
> > they make the code more readable and maintainable, so they are worth
> > merging separately.
> 
> Fine by me.
> 

It appears that patches 1-7 are still missing from linux-next; would you
mind posting them in a series with your approach?


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2014-01-06, Davidlohr Bueso
On Tue, 2014-01-07 at 10:57 +0900, Joonsoo Kim wrote:
> On Mon, Jan 06, 2014 at 04:19:05AM -0800, Davidlohr Bueso wrote:
> > On Mon, 2014-01-06 at 09:19 +0900, Joonsoo Kim wrote:
> > > On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote:
> > > > Hi Joonsoo,
> > > > 
> > > > Sorry about the delay...
> > > > 
> > > > On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> > > > > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > > > > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > > > > > On Fri, 2013-12-20 at 14:01 +, Mel Gorman wrote:
> > > > > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim 
> > > > > > > > >  wrote:
> > > > > > > > > 
> > > > > > > > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > > > > > > > because many threads dequeue a hugepage to handle a fault 
> > > > > > > > > > of same address.
> > > > > > > > > > This makes reserved pool shortage just for a little while 
> > > > > > > > > > and this cause
> > > > > > > > > > faulting thread who can get hugepages to get a SIGBUS 
> > > > > > > > > > signal.
> > > > > > > > > > 
> > > > > > > > > > To solve this problem, we already have a nice solution, 
> > > > > > > > > > that is,
> > > > > > > > > > a hugetlb_instantiation_mutex. This blocks other threads to 
> > > > > > > > > > dive into
> > > > > > > > > > a fault handler. This solve the problem clearly, but it 
> > > > > > > > > > introduce
> > > > > > > > > > performance degradation, because it serialize all fault 
> > > > > > > > > > handling.
> > > > > > > > > > 
> > > > > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get 
> > > > > > > > > > rid of
> > > > > > > > > > performance degradation.
> > > > > > > > > 
> > > > > > > > > So the whole point of the patch is to improve performance, 
> > > > > > > > > but the
> > > > > > > > > changelog doesn't include any performance measurements!
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > I don't really deal with hugetlbfs any more and I have not 
> > > > > > > > examined this
> > > > > > > > series but I remember why I never really cared about this 
> > > > > > > > mutex. It wrecks
> > > > > > > > fault scalability but AFAIK fault scalability almost never 
> > > > > > > > mattered for
> > > > > > > > workloads using hugetlbfs.  The most common user of hugetlbfs 
> > > > > > > > by far is
> > > > > > > > sysv shared memory. The memory is faulted early in the lifetime 
> > > > > > > > of the
> > > > > > > > workload and after that it does not matter. At worst, it hurts 
> > > > > > > > application
> > > > > > > > startup time but that is still poor motivation for putting a 
> > > > > > > > lot of work
> > > > > > > > into removing the mutex.
> > > > > > > 
> > > > > > > Yep, important hugepage workloads initially pound heavily on this 
> > > > > > > lock,
> > > > > > > then it naturally decreases.
> > > > > > > 
> > > > > > > > Microbenchmarks will be able to trigger problems in this area 
> > > > > > > > but it'd
> > > > > > > > be important to check if any workload that matters is actually 
> > > > > > > > hitting
> > > > > > > > that problem.
> > > > > > > 
> > > > > > > I was thinking of writing one to actually get some numbers for 
> > > > > > > this
> > > > > > > patchset -- I don't know of any benchmark that might stress this 
> > > > > > > lock. 
> > > > > > > 
> > > > > > > However I first measured the amount of cycles it costs to start an
> > > > > > > Oracle DB and things went south with these changes. A simple 
> > > > > > > 'startup
> > > > > > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla 
> > > > > > > kernel, this
> > > > > > > costs ~7.5 billion cycles and with this patchset it goes up to 
> > > > > > > ~27.1
> > > > > > > billion. While there is naturally a fair amount of variation, 
> > > > > > > these
> > > > > > > changes do seem to do more harm than good, at least in real world
> > > > > > > scenarios.
> > > > > > 
> > > > > > Hello,
> > > > > > 
> > > > > > I think that number of cycles is not proper to measure this 
> > > > > > patchset,
> > > > > > because cycles would be wasted by fault handling failure. Instead, 
> > > > > > it
> > > > > > targeted improved elapsed time. 
> > > > 
> > > > Fair enough, however the fact of the matter is this approach does en up
> > > > hurting performance. Regarding total startup time, I didn't see hardly
> > > > any differences, with both vanilla and this patchset it takes close to
> > > > 33.5 seconds.
> > > > 
> > > > > Could you tell me how long it
> > > > > > takes to fault all of it's hugepages?
> > > > > > 
> > > > > > Anyway, this order of magnitude still seems a problem. :/
> > > > > > 
> > > > > > I guess that cycles are wasted by zeroing hugepage in fault-path 
> > > > > > like as
> > > > > > Andrew pointed out.
> > > > > > 
> > > > > > I will send another patches to fix 

Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2014-01-06, Joonsoo Kim
On Mon, Jan 06, 2014 at 04:19:05AM -0800, Davidlohr Bueso wrote:
> On Mon, 2014-01-06 at 09:19 +0900, Joonsoo Kim wrote:
> > On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote:
> > > Hi Joonsoo,
> > > 
> > > Sorry about the delay...
> > > 
> > > On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> > > > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > > > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > > > > On Fri, 2013-12-20 at 14:01 +, Mel Gorman wrote:
> > > > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim 
> > > > > > > >  wrote:
> > > > > > > > 
> > > > > > > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > > > > > > because many threads dequeue a hugepage to handle a fault of 
> > > > > > > > > same address.
> > > > > > > > > This makes reserved pool shortage just for a little while and 
> > > > > > > > > this cause
> > > > > > > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > > > > > > 
> > > > > > > > > To solve this problem, we already have a nice solution, that 
> > > > > > > > > is,
> > > > > > > > > a hugetlb_instantiation_mutex. This blocks other threads to 
> > > > > > > > > dive into
> > > > > > > > > a fault handler. This solve the problem clearly, but it 
> > > > > > > > > introduce
> > > > > > > > > performance degradation, because it serialize all fault 
> > > > > > > > > handling.
> > > > > > > > > 
> > > > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid 
> > > > > > > > > of
> > > > > > > > > performance degradation.
> > > > > > > > 
> > > > > > > > So the whole point of the patch is to improve performance, but 
> > > > > > > > the
> > > > > > > > changelog doesn't include any performance measurements!
> > > > > > > > 
> > > > > > > 
> > > > > > > I don't really deal with hugetlbfs any more and I have not 
> > > > > > > examined this
> > > > > > > series but I remember why I never really cared about this mutex. 
> > > > > > > It wrecks
> > > > > > > fault scalability but AFAIK fault scalability almost never 
> > > > > > > mattered for
> > > > > > > workloads using hugetlbfs.  The most common user of hugetlbfs by 
> > > > > > > far is
> > > > > > > sysv shared memory. The memory is faulted early in the lifetime 
> > > > > > > of the
> > > > > > > workload and after that it does not matter. At worst, it hurts 
> > > > > > > application
> > > > > > > startup time but that is still poor motivation for putting a lot 
> > > > > > > of work
> > > > > > > into removing the mutex.
> > > > > > 
> > > > > > Yep, important hugepage workloads initially pound heavily on this 
> > > > > > lock,
> > > > > > then it naturally decreases.
> > > > > > 
> > > > > > > Microbenchmarks will be able to trigger problems in this area but 
> > > > > > > it'd
> > > > > > > be important to check if any workload that matters is actually 
> > > > > > > hitting
> > > > > > > that problem.
> > > > > > 
> > > > > > I was thinking of writing one to actually get some numbers for this
> > > > > > patchset -- I don't know of any benchmark that might stress this 
> > > > > > lock. 
> > > > > > 
> > > > > > However I first measured the amount of cycles it costs to start an
> > > > > > Oracle DB and things went south with these changes. A simple 
> > > > > > 'startup
> > > > > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, 
> > > > > > this
> > > > > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > > > > billion. While there is naturally a fair amount of variation, these
> > > > > > changes do seem to do more harm than good, at least in real world
> > > > > > scenarios.
> > > > > 
> > > > > Hello,
> > > > > 
> > > > > I think that number of cycles is not proper to measure this patchset,
> > > > > because cycles would be wasted by fault handling failure. Instead, it
> > > > > targeted improved elapsed time. 
> > > 
> > > Fair enough, however the fact of the matter is this approach does en up
> > > hurting performance. Regarding total startup time, I didn't see hardly
> > > any differences, with both vanilla and this patchset it takes close to
> > > 33.5 seconds.
> > > 
> > > > Could you tell me how long it
> > > > > takes to fault all of it's hugepages?
> > > > > 
> > > > > Anyway, this order of magnitude still seems a problem. :/
> > > > > 
> > > > > I guess that cycles are wasted by zeroing hugepage in fault-path like 
> > > > > as
> > > > > Andrew pointed out.
> > > > > 
> > > > > I will send another patches to fix this problem.
> > > > 
> > > > Hello, Davidlohr.
> > > > 
> > > > Here goes the fix on top of this series.
> > > 
> > > ... and with this patch we go from 27 down to 11 billion cycles, so this
> > > approach still costs more than what we currently have. A perf stat shows
> > > that an entire 1Gb huge page aware DB startup costs around ~30 billion

Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2014-01-06, Davidlohr Bueso
On Mon, 2014-01-06 at 09:19 +0900, Joonsoo Kim wrote:
> On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote:
> > Hi Joonsoo,
> > 
> > Sorry about the delay...
> > 
> > On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> > > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > > > On Fri, 2013-12-20 at 14:01 +, Mel Gorman wrote:
> > > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim 
> > > > > > >  wrote:
> > > > > > > 
> > > > > > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > > > > > because many threads dequeue a hugepage to handle a fault of 
> > > > > > > > same address.
> > > > > > > > This makes reserved pool shortage just for a little while and 
> > > > > > > > this cause
> > > > > > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > > > > > 
> > > > > > > > To solve this problem, we already have a nice solution, that is,
> > > > > > > > a hugetlb_instantiation_mutex. This blocks other threads to 
> > > > > > > > dive into
> > > > > > > > a fault handler. This solve the problem clearly, but it 
> > > > > > > > introduce
> > > > > > > > performance degradation, because it serialize all fault 
> > > > > > > > handling.
> > > > > > > > 
> > > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > > > > > > performance degradation.
> > > > > > > 
> > > > > > > So the whole point of the patch is to improve performance, but the
> > > > > > > changelog doesn't include any performance measurements!
> > > > > > > 
> > > > > > 
> > > > > > I don't really deal with hugetlbfs any more and I have not examined 
> > > > > > this
> > > > > > series but I remember why I never really cared about this mutex. It 
> > > > > > wrecks
> > > > > > fault scalability but AFAIK fault scalability almost never mattered 
> > > > > > for
> > > > > > workloads using hugetlbfs.  The most common user of hugetlbfs by 
> > > > > > far is
> > > > > > sysv shared memory. The memory is faulted early in the lifetime of 
> > > > > > the
> > > > > > workload and after that it does not matter. At worst, it hurts 
> > > > > > application
> > > > > > startup time but that is still poor motivation for putting a lot of 
> > > > > > work
> > > > > > into removing the mutex.
> > > > > 
> > > > > Yep, important hugepage workloads initially pound heavily on this 
> > > > > lock,
> > > > > then it naturally decreases.
> > > > > 
> > > > > > Microbenchmarks will be able to trigger problems in this area but 
> > > > > > it'd
> > > > > > be important to check if any workload that matters is actually 
> > > > > > hitting
> > > > > > that problem.
> > > > > 
> > > > > I was thinking of writing one to actually get some numbers for this
> > > > > patchset -- I don't know of any benchmark that might stress this 
> > > > > lock. 
> > > > > 
> > > > > However I first measured the amount of cycles it costs to start an
> > > > > Oracle DB and things went south with these changes. A simple 'startup
> > > > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, 
> > > > > this
> > > > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > > > billion. While there is naturally a fair amount of variation, these
> > > > > changes do seem to do more harm than good, at least in real world
> > > > > scenarios.
> > > > 
> > > > Hello,
> > > > 
> > > > I think that number of cycles is not proper to measure this patchset,
> > > > because cycles would be wasted by fault handling failure. Instead, it
> > > > targeted improved elapsed time. 
> > 
> > Fair enough, however the fact of the matter is this approach does en up
> > hurting performance. Regarding total startup time, I didn't see hardly
> > any differences, with both vanilla and this patchset it takes close to
> > 33.5 seconds.
> > 
> > > Could you tell me how long it
> > > > takes to fault all of it's hugepages?
> > > > 
> > > > Anyway, this order of magnitude still seems a problem. :/
> > > > 
> > > > I guess that cycles are wasted by zeroing hugepage in fault-path like as
> > > > Andrew pointed out.
> > > > 
> > > > I will send another patches to fix this problem.
> > > 
> > > Hello, Davidlohr.
> > > 
> > > Here goes the fix on top of this series.
> > 
> > ... and with this patch we go from 27 down to 11 billion cycles, so this
> > approach still costs more than what we currently have. A perf stat shows
> > that an entire 1Gb huge page aware DB startup costs around ~30 billion
> > cycles on a vanilla kernel, so the impact of hugetlb_fault() is
> > definitely non trivial and IMO worth considering.
> 
> Really thanks for your help. :)
> 
> > 
> > Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a
> > ride and things do look quite better, which is basically what Andrew was
> > suggesting previously 

Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2014-01-05, Joonsoo Kim
On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote:
> Hi Joonsoo,
> 
> Sorry about the delay...
> 
> On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > > On Fri, 2013-12-20 at 14:01 +, Mel Gorman wrote:
> > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim 
> > > > > >  wrote:
> > > > > > 
> > > > > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > > > > because many threads dequeue a hugepage to handle a fault of same 
> > > > > > > address.
> > > > > > > This makes reserved pool shortage just for a little while and 
> > > > > > > this cause
> > > > > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > > > > 
> > > > > > > To solve this problem, we already have a nice solution, that is,
> > > > > > > a hugetlb_instantiation_mutex. This blocks other threads to dive 
> > > > > > > into
> > > > > > > a fault handler. This solve the problem clearly, but it introduce
> > > > > > > performance degradation, because it serialize all fault handling.
> > > > > > > 
> > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > > > > > performance degradation.
> > > > > > 
> > > > > > So the whole point of the patch is to improve performance, but the
> > > > > > changelog doesn't include any performance measurements!
> > > > > > 
> > > > > 
> > > > > I don't really deal with hugetlbfs any more and I have not examined 
> > > > > this
> > > > > series but I remember why I never really cared about this mutex. It 
> > > > > wrecks
> > > > > fault scalability but AFAIK fault scalability almost never mattered 
> > > > > for
> > > > > workloads using hugetlbfs.  The most common user of hugetlbfs by far 
> > > > > is
> > > > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > > > workload and after that it does not matter. At worst, it hurts 
> > > > > application
> > > > > startup time but that is still poor motivation for putting a lot of 
> > > > > work
> > > > > into removing the mutex.
> > > > 
> > > > Yep, important hugepage workloads initially pound heavily on this lock,
> > > > then it naturally decreases.
> > > > 
> > > > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > > > be important to check if any workload that matters is actually hitting
> > > > > that problem.
> > > > 
> > > > I was thinking of writing one to actually get some numbers for this
> > > > patchset -- I don't know of any benchmark that might stress this lock. 
> > > > 
> > > > However I first measured the amount of cycles it costs to start an
> > > > Oracle DB and things went south with these changes. A simple 'startup
> > > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > > billion. While there is naturally a fair amount of variation, these
> > > > changes do seem to do more harm than good, at least in real world
> > > > scenarios.
> > > 
> > > Hello,
> > > 
> > > I think that number of cycles is not proper to measure this patchset,
> > > because cycles would be wasted by fault handling failure. Instead, it
> > > targeted improved elapsed time. 
> 
> Fair enough, however the fact of the matter is this approach does en up
> hurting performance. Regarding total startup time, I didn't see hardly
> any differences, with both vanilla and this patchset it takes close to
> 33.5 seconds.
> 
> > Could you tell me how long it
> > > takes to fault all of it's hugepages?
> > > 
> > > Anyway, this order of magnitude still seems a problem. :/
> > > 
> > > I guess that cycles are wasted by zeroing hugepage in fault-path like as
> > > Andrew pointed out.
> > > 
> > > I will send another patches to fix this problem.
> > 
> > Hello, Davidlohr.
> > 
> > Here goes the fix on top of this series.
> 
> ... and with this patch we go from 27 down to 11 billion cycles, so this
> approach still costs more than what we currently have. A perf stat shows
> that an entire 1Gb huge page aware DB startup costs around ~30 billion
> cycles on a vanilla kernel, so the impact of hugetlb_fault() is
> definitely non trivial and IMO worth considering.

Really thanks for your help. :)

> 
> Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a
> ride and things do look quite better, which is basically what Andrew was
> suggesting previously anyway. With the hash table approach the startup
> time did go down to ~25.1 seconds, which is a nice -24.7% time
> reduction, with hugetlb_fault() consuming roughly 5.3 billion cycles.
> This hash table was on a 80 core system, so since we do the power of two
> round up we end up with 256 entries -- I think we can do better if we
> enlarger further, maybe something 

Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2014-01-03, Davidlohr Bueso
Hi Joonsoo,

Sorry about the delay...

On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > On Fri, 2013-12-20 at 14:01 +, Mel Gorman wrote:
> > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim 
> > > > >  wrote:
> > > > > 
> > > > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > > > because many threads dequeue a hugepage to handle a fault of same 
> > > > > > address.
> > > > > > This makes reserved pool shortage just for a little while and this 
> > > > > > cause
> > > > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > > > 
> > > > > > To solve this problem, we already have a nice solution, that is,
> > > > > > a hugetlb_instantiation_mutex. This blocks other threads to dive 
> > > > > > into
> > > > > > a fault handler. This solve the problem clearly, but it introduce
> > > > > > performance degradation, because it serialize all fault handling.
> > > > > > 
> > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > > > > performance degradation.
> > > > > 
> > > > > So the whole point of the patch is to improve performance, but the
> > > > > changelog doesn't include any performance measurements!
> > > > > 
> > > > 
> > > > I don't really deal with hugetlbfs any more and I have not examined this
> > > > series but I remember why I never really cared about this mutex. It 
> > > > wrecks
> > > > fault scalability but AFAIK fault scalability almost never mattered for
> > > > workloads using hugetlbfs.  The most common user of hugetlbfs by far is
> > > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > > workload and after that it does not matter. At worst, it hurts 
> > > > application
> > > > startup time but that is still poor motivation for putting a lot of work
> > > > into removing the mutex.
> > > 
> > > Yep, important hugepage workloads initially pound heavily on this lock,
> > > then it naturally decreases.
> > > 
> > > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > > be important to check if any workload that matters is actually hitting
> > > > that problem.
> > > 
> > > I was thinking of writing one to actually get some numbers for this
> > > patchset -- I don't know of any benchmark that might stress this lock. 
> > > 
> > > However I first measured the amount of cycles it costs to start an
> > > Oracle DB and things went south with these changes. A simple 'startup
> > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > billion. While there is naturally a fair amount of variation, these
> > > changes do seem to do more harm than good, at least in real world
> > > scenarios.
> > 
> > Hello,
> > 
> > I think that number of cycles is not proper to measure this patchset,
> > because cycles would be wasted by fault handling failure. Instead, it
> > targeted improved elapsed time. 

Fair enough, however the fact of the matter is that this approach does end
up hurting performance. Regarding total startup time, I hardly saw any
difference: with both vanilla and this patchset it takes close to
33.5 seconds.

> > Could you tell me how long it
> > takes to fault all of it's hugepages?
> > 
> > Anyway, this order of magnitude still seems a problem. :/
> > 
> > I guess that cycles are wasted by zeroing hugepage in fault-path like as
> > Andrew pointed out.
> > 
> > I will send another patches to fix this problem.
> 
> Hello, Davidlohr.
> 
> Here goes the fix on top of this series.

... and with this patch we go from 27 down to 11 billion cycles, so this
approach still costs more than what we currently have. A perf stat shows
that an entire 1GB-hugepage-aware DB startup costs around ~30 billion
cycles on a vanilla kernel, so the impact of hugetlb_fault() is
definitely non-trivial and IMO worth considering.

Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a
ride and things do look quite a bit better, which is basically what Andrew was
suggesting previously anyway. With the hash table approach the startup
time went down to ~25.1 seconds, which is a nice -24.7% time
reduction, with hugetlb_fault() consuming roughly 5.3 billion cycles.
This hash table was on an 80-core system, so since we do the power-of-two
round-up we end up with 256 entries -- I think we can do better if we
enlarge it further, maybe something like a static 1024 or, probably
better, 8-ish * nr cpus.
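
For illustration, here is a minimal user-space sketch of that hashed-mutex-table
idea, not the kernel code: the sizing (8 mutexes per CPU rounded up to a power of
two, per the suggestion above), the 2MB hugepage shift and the multiplicative hash
are assumptions for the example, and error handling is omitted. Only faults that
hash to the same entry serialize, so the single-address race stays closed while
unrelated faults proceed in parallel.

    /* User-space model of a table of mutexes indexed by a hash of the
     * faulting address; names and constants are illustrative only. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define HPAGE_SHIFT 21  /* assume 2MB hugepages */

    static pthread_mutex_t *fault_mutex_table;
    static unsigned long fault_mutex_mask;  /* table size - 1 (power of two) */

    static unsigned long roundup_pow2(unsigned long x)
    {
        unsigned long r = 1;

        while (r < x)
            r <<= 1;
        return r;
    }

    static void fault_mutex_table_init(void)
    {
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        unsigned long size = roundup_pow2(8 * (ncpus > 0 ? ncpus : 1));
        unsigned long i;

        fault_mutex_table = malloc(size * sizeof(*fault_mutex_table));
        for (i = 0; i < size; i++)
            pthread_mutex_init(&fault_mutex_table[i], NULL);
        fault_mutex_mask = size - 1;
    }

    /* Same hugepage -> same mutex; different hugepages mostly don't collide. */
    static pthread_mutex_t *fault_mutex(unsigned long address)
    {
        unsigned long idx;

        /* simple multiplicative hash of the hugepage index */
        idx = ((address >> HPAGE_SHIFT) * 2654435761UL) & fault_mutex_mask;
        return &fault_mutex_table[idx];
    }

    static void handle_fault(unsigned long address)
    {
        pthread_mutex_t *m = fault_mutex(address);

        pthread_mutex_lock(m);
        /* ... allocate and instantiate the hugepage for 'address' here ... */
        pthread_mutex_unlock(m);
    }

    int main(void)
    {
        fault_mutex_table_init();
        handle_fault(0x40000000UL);  /* example faulting address */
        printf("table entries: %lu\n", fault_mutex_mask + 1);
        return 0;
    }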

Thoughts? Is there any reason why we cannot go with this instead? Yes,
we still keep the mutex, but the approach is (1) proven better for
performance on real world workloads and (2) far less invasive. 

Thanks,
Davidlohr

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-22 Thread Joonsoo Kim
On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > On Fri, 2013-12-20 at 14:01 +, Mel Gorman wrote:
> > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim  
> > > > wrote:
> > > > 
> > > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > > because many threads dequeue a hugepage to handle a fault of same 
> > > > > address.
> > > > > This makes reserved pool shortage just for a little while and this 
> > > > > cause
> > > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > > 
> > > > > To solve this problem, we already have a nice solution, that is,
> > > > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > > > a fault handler. This solve the problem clearly, but it introduce
> > > > > performance degradation, because it serialize all fault handling.
> > > > > 
> > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > > > performance degradation.
> > > > 
> > > > So the whole point of the patch is to improve performance, but the
> > > > changelog doesn't include any performance measurements!
> > > > 
> > > 
> > > I don't really deal with hugetlbfs any more and I have not examined this
> > > series but I remember why I never really cared about this mutex. It wrecks
> > > fault scalability but AFAIK fault scalability almost never mattered for
> > > workloads using hugetlbfs.  The most common user of hugetlbfs by far is
> > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > workload and after that it does not matter. At worst, it hurts application
> > > startup time but that is still poor motivation for putting a lot of work
> > > into removing the mutex.
> > 
> > Yep, important hugepage workloads initially pound heavily on this lock,
> > then it naturally decreases.
> > 
> > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > be important to check if any workload that matters is actually hitting
> > > that problem.
> > 
> > I was thinking of writing one to actually get some numbers for this
> > patchset -- I don't know of any benchmark that might stress this lock. 
> > 
> > However I first measured the amount of cycles it costs to start an
> > Oracle DB and things went south with these changes. A simple 'startup
> > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > billion. While there is naturally a fair amount of variation, these
> > changes do seem to do more harm than good, at least in real world
> > scenarios.
> 
> Hello,
> 
> I think that number of cycles is not proper to measure this patchset,
> because cycles would be wasted by fault handling failure. Instead, it
> targeted improved elapsed time. Could you tell me how long it
> takes to fault all of it's hugepages?
> 
> Anyway, this order of magnitude still seems a problem. :/
> 
> I guess that cycles are wasted by zeroing hugepage in fault-path like as
> Andrew pointed out.
> 
> I will send another patches to fix this problem.

Hello, Davidlohr.

Here goes the fix on top of this series.
Thanks.

-->8---
From 5f20459d90dfa2f7cd28d62194ce22bd9a0df0f5 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Mon, 23 Dec 2013 10:32:04 +0900
Subject: [PATCH] mm, hugetlb: optimize zeroing hugepage

When parallel faults occur, some of them will fail. In that case, the CPU
cycles spent zeroing the failed hugepage are wasted. To reduce this overhead,
mark the hugepage as zeroed after zeroing it and unmark it once it is really
used. If it ends up unused for any reason, it returns to the hugepage pool
and will be reused sometime later. At that point, we see the zeroed-page
marker and skip the zeroing.

Signed-off-by: Joonsoo Kim 

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6edf423..b90b792 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -582,6 +582,7 @@ static void update_and_free_page(struct hstate *h, struct 
page *page)
1 << PG_private | 1 << PG_writeback);
}
VM_BUG_ON(hugetlb_cgroup_from_page(page));
+   ClearPageActive(page);
set_compound_page_dtor(page, NULL);
set_page_refcounted(page);
arch_release_hugepage(page);
@@ -2715,6 +2716,7 @@ retry_avoidcopy:
spin_lock(ptl);
ptep = huge_pte_offset(mm, address & huge_page_mask(h));
if (likely(pte_same(huge_ptep_get(ptep), pte))) {
+   ClearPageActive(new_page);
ClearPagePrivate(new_page);
 
/* Break COW */
@@ -2834,7 +2836,10 @@ retry:
}
goto out;
}
-   clear_huge_page(page, address, pages_per_huge_page(h));
+   if (!PageActive(page)) {
+   clear_huge_page(page, address, pages_per_huge_page(h));

Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-22 Thread Joonsoo Kim
On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> On Fri, 2013-12-20 at 14:01 +, Mel Gorman wrote:
> > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim  
> > > wrote:
> > > 
> > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > because many threads dequeue a hugepage to handle a fault of same 
> > > > address.
> > > > This makes reserved pool shortage just for a little while and this cause
> > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > 
> > > > To solve this problem, we already have a nice solution, that is,
> > > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > > a fault handler. This solve the problem clearly, but it introduce
> > > > performance degradation, because it serialize all fault handling.
> > > > 
> > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > > performance degradation.
> > > 
> > > So the whole point of the patch is to improve performance, but the
> > > changelog doesn't include any performance measurements!
> > > 
> > 
> > I don't really deal with hugetlbfs any more and I have not examined this
> > series but I remember why I never really cared about this mutex. It wrecks
> > fault scalability but AFAIK fault scalability almost never mattered for
> > workloads using hugetlbfs.  The most common user of hugetlbfs by far is
> > sysv shared memory. The memory is faulted early in the lifetime of the
> > workload and after that it does not matter. At worst, it hurts application
> > startup time but that is still poor motivation for putting a lot of work
> > into removing the mutex.
> 
> Yep, important hugepage workloads initially pound heavily on this lock,
> then it naturally decreases.
> 
> > Microbenchmarks will be able to trigger problems in this area but it'd
> > be important to check if any workload that matters is actually hitting
> > that problem.
> 
> I was thinking of writing one to actually get some numbers for this
> patchset -- I don't know of any benchmark that might stress this lock. 
> 
> However I first measured the amount of cycles it costs to start an
> Oracle DB and things went south with these changes. A simple 'startup
> immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> billion. While there is naturally a fair amount of variation, these
> changes do seem to do more harm than good, at least in real world
> scenarios.

Hello,

I think that the number of cycles is not the right metric for this patchset,
because cycles would be wasted on fault handling failures. Instead, it
targets improved elapsed time. Could you tell me how long it
takes to fault in all of its hugepages?

Anyway, this order of magnitude still seems a problem. :/

I guess that the cycles are wasted on zeroing hugepages in the fault path, as
Andrew pointed out.

I will send more patches to fix this problem.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-20 Thread Davidlohr Bueso
On Fri, 2013-12-20 at 14:01 +, Mel Gorman wrote:
> On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim  
> > wrote:
> > 
> > > If parallel fault occur, we can fail to allocate a hugepage,
> > > because many threads dequeue a hugepage to handle a fault of same address.
> > > This makes reserved pool shortage just for a little while and this cause
> > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > 
> > > To solve this problem, we already have a nice solution, that is,
> > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > a fault handler. This solve the problem clearly, but it introduce
> > > performance degradation, because it serialize all fault handling.
> > > 
> > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > performance degradation.
> > 
> > So the whole point of the patch is to improve performance, but the
> > changelog doesn't include any performance measurements!
> > 
> 
> I don't really deal with hugetlbfs any more and I have not examined this
> series but I remember why I never really cared about this mutex. It wrecks
> fault scalability but AFAIK fault scalability almost never mattered for
> workloads using hugetlbfs.  The most common user of hugetlbfs by far is
> sysv shared memory. The memory is faulted early in the lifetime of the
> workload and after that it does not matter. At worst, it hurts application
> startup time but that is still poor motivation for putting a lot of work
> into removing the mutex.

Yep, important hugepage workloads initially pound heavily on this lock,
then it naturally decreases.

> Microbenchmarks will be able to trigger problems in this area but it'd
> be important to check if any workload that matters is actually hitting
> that problem.

I was thinking of writing one to actually get some numbers for this
patchset -- I don't know of any benchmark that might stress this lock. 

However, I first measured the number of cycles it costs to start an
Oracle DB and things went south with these changes. A simple 'startup
immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
billion. While there is naturally a fair amount of variation, these
changes do seem to do more harm than good, at least in real world
scenarios.

Thanks,
Davidlohr

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-20 Thread Mel Gorman
On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim  wrote:
> 
> > If parallel fault occur, we can fail to allocate a hugepage,
> > because many threads dequeue a hugepage to handle a fault of same address.
> > This makes reserved pool shortage just for a little while and this cause
> > faulting thread who can get hugepages to get a SIGBUS signal.
> > 
> > To solve this problem, we already have a nice solution, that is,
> > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > a fault handler. This solve the problem clearly, but it introduce
> > performance degradation, because it serialize all fault handling.
> > 
> > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > performance degradation.
> 
> So the whole point of the patch is to improve performance, but the
> changelog doesn't include any performance measurements!
> 

I don't really deal with hugetlbfs any more and I have not examined this
series but I remember why I never really cared about this mutex. It wrecks
fault scalability but AFAIK fault scalability almost never mattered for
workloads using hugetlbfs.  The most common user of hugetlbfs by far is
sysv shared memory. The memory is faulted early in the lifetime of the
workload and after that it does not matter. At worst, it hurts application
startup time but that is still poor motivation for putting a lot of work
into removing the mutex.

Microbenchmarks will be able to trigger problems in this area but it'd
be important to check if any workload that matters is actually hitting
that problem.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-19 Thread Joonsoo Kim
On Thu, Dec 19, 2013 at 06:15:20PM -0800, Andrew Morton wrote:
> On Fri, 20 Dec 2013 10:58:10 +0900 Joonsoo Kim  wrote:
> 
> > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim  
> > > wrote:
> > > 
> > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > because many threads dequeue a hugepage to handle a fault of same 
> > > > address.
> > > > This makes reserved pool shortage just for a little while and this cause
> > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > 
> > > 
> > > So if I'm understanding this correctly...  if N threads all generate a
> > > fault against the same address, they will all dive in and allocate a
> > > hugepage, will then do an enormous memcpy into that page and will then
> > > attempt to instantiate the page in pagetables.  All threads except one
> > > will lose the race and will free the page again!  This sounds terribly
> > > inefficient; it would be useful to write a microbenchmark which
> > > triggers this scenario so we can explore the impact.
> > 
> > Yes, you understand correctly, I think.
> > 
> > I have an idea to prevent this overhead. It is that marking page when it
> > is zeroed and unmarking when it is mapped to page table. If page mapping
> > is failed due to current thread, the zeroed page will keep the marker and
> > later we can determine if it is zeroed or not.
> 
> Well OK, but the other threads will need to test that in-progress flag
> and then do something.  Where something will involve some form of
> open-coded sleep/wakeup thing.  To avoid all that wheel-reinventing we
> can avoid using an internal flag and use an external flag instead. 
> There's one in struct mutex!

My idea considers only hugetlb_no_page() and doesn't need a sleep.
It just sets a page flag after zeroing, and if some thread takes
the page with this flag set when faulting, it simply uses it without zeroing.

> 
> I doubt if the additional complexity of the external flag is worth it,
> but convincing performance testing results would sway me ;) Please have
> a think about it all.
> 
> > If you want to include this functionality in this series, I can do it ;)
> > Please let me know your decision.
> > 
> > > I'm wondering if a better solution to all of this would be to make
> > > hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it
> > > with a hash of the faulting address.  That will 99.9% solve the
> > > performance issue which you believe exists without introducing this new
> > > performance issue?
> > 
> > Yes, that approach would solve the performance issue.
> > IIRC, you already suggested this idea roughly 6 months ago and it is
> > implemented by Davidlohr. I remembered that there is a race issue on
> > COW case with this approach. See following link for more information.
> > https://lkml.org/lkml/2013/8/7/142
> 
> That seems to be unrelated to hugetlb_instantiation_mutex?

Yes, it is related to the hugetlb_instantiation_mutex. In the link, I mentioned
a race condition in the mutex table patches meant to replace the
hugetlb_instantiation_mutex, although the conversation isn't easy to follow.

> 
> > And we need 1-3 patches to prevent other theorectical race issue
> > regardless any approaches.
> 
> Yes, I'll be going through patches 1-12 very soon, thanks.

Okay. Thanks :)

> 
> 
> And to reiterate: I'm very uncomfortable mucking around with
> performance patches when we have run no tests to measure their
> magnitude, or even whether they are beneficial at all!

Okay. I will keep it in mind. :)

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-19 Thread Joonsoo Kim
Hello, Davidlohr.

On Thu, Dec 19, 2013 at 06:31:21PM -0800, Davidlohr Bueso wrote:
> On Thu, 2013-12-19 at 17:02 -0800, Andrew Morton wrote:
> > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim  
> > wrote:
> > 
> > > If parallel fault occur, we can fail to allocate a hugepage,
> > > because many threads dequeue a hugepage to handle a fault of same address.
> > > This makes reserved pool shortage just for a little while and this cause
> > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > 
> > > To solve this problem, we already have a nice solution, that is,
> > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > a fault handler. This solve the problem clearly, but it introduce
> > > performance degradation, because it serialize all fault handling.
> > > 
> > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > performance degradation.
> > 
> > So the whole point of the patch is to improve performance, but the
> > changelog doesn't include any performance measurements!
> > 
> > Please, run some quantitative tests and include a nice summary of the
> > results in the changelog.
> 
> I was actually spending this afternoon testing these patches with Oracle
> (I haven't seen any issues so far) and unless Joonsoo already did so, I
> want to run these by the libhugetlb test cases - I got side tracked by
> futexes though :/

Thanks a lot for taking the time to test these patches.
I already ran the libhugetlbfs test cases and they passed.

> 
> Please do consider that performance wise I haven't seen much in
> particular. The thing is, I started dealing with this mutex once I
> noticed it as the #1 hot lock in Oracle DB starts, but then once the
> faults are done, it really goes away. So I wouldn't say that the mutex
> is a bottleneck except for the first few minutes.

What I want to be sure about is those first few minutes you mentioned.
If possible, let me know the result in a form like the following link.
https://lkml.org/lkml/2013/7/12/428

Thanks in advance. :)

> > 
> > This is terribly important, because if the performance benefit is
> > infinitesimally small or negative, the patch goes into the bit bucket ;)
> 
> Well, this mutex is infinitesimally ugly and needs to die (as long as
> performance isn't hurt).

Yes, I agreed.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-19 Thread Davidlohr Bueso
On Thu, 2013-12-19 at 17:02 -0800, Andrew Morton wrote:
> On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim  wrote:
> 
> > If parallel fault occur, we can fail to allocate a hugepage,
> > because many threads dequeue a hugepage to handle a fault of same address.
> > This makes reserved pool shortage just for a little while and this cause
> > faulting thread who can get hugepages to get a SIGBUS signal.
> > 
> > To solve this problem, we already have a nice solution, that is,
> > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > a fault handler. This solve the problem clearly, but it introduce
> > performance degradation, because it serialize all fault handling.
> > 
> > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > performance degradation.
> 
> So the whole point of the patch is to improve performance, but the
> changelog doesn't include any performance measurements!
> 
> Please, run some quantitative tests and include a nice summary of the
> results in the changelog.

I actually spent this afternoon testing these patches with Oracle
(I haven't seen any issues so far) and, unless Joonsoo already did so, I
want to run these by the libhugetlbfs test cases - I got sidetracked by
futexes though :/

Please do consider that, performance-wise, I haven't seen much in
particular. The thing is, I started dealing with this mutex once I
noticed it as the #1 hot lock during Oracle DB startup, but once the
faults are done, it really goes away. So I wouldn't say that the mutex
is a bottleneck except for the first few minutes.

> 
> This is terribly important, because if the performance benefit is
> infinitesimally small or negative, the patch goes into the bit bucket ;)

Well, this mutex is infinitesimally ugly and needs to die (as long as
performance isn't hurt).

Thanks,
Davidlohr

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-19 Thread Andrew Morton
On Fri, 20 Dec 2013 10:58:10 +0900 Joonsoo Kim  wrote:

> On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim  
> > wrote:
> > 
> > > If parallel fault occur, we can fail to allocate a hugepage,
> > > because many threads dequeue a hugepage to handle a fault of same address.
> > > This makes reserved pool shortage just for a little while and this cause
> > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > 
> > 
> > So if I'm understanding this correctly...  if N threads all generate a
> > fault against the same address, they will all dive in and allocate a
> > hugepage, will then do an enormous memcpy into that page and will then
> > attempt to instantiate the page in pagetables.  All threads except one
> > will lose the race and will free the page again!  This sounds terribly
> > inefficient; it would be useful to write a microbenchmark which
> > triggers this scenario so we can explore the impact.
> 
> Yes, you understand correctly, I think.
> 
> I have an idea to prevent this overhead. It is that marking page when it
> is zeroed and unmarking when it is mapped to page table. If page mapping
> is failed due to current thread, the zeroed page will keep the marker and
> later we can determine if it is zeroed or not.

Well OK, but the other threads will need to test that in-progress flag
and then do something.  Where something will involve some form of
open-coded sleep/wakeup thing.  To avoid all that wheel-reinventing we
can avoid using an internal flag and use an external flag instead. 
There's one in struct mutex!

I doubt if the additional complexity of the external flag is worth it,
but convincing performance testing results would sway me ;) Please have
a think about it all.

> If you want to include this functionality in this series, I can do it ;)
> Please let me know your decision.
> 
> > I'm wondering if a better solution to all of this would be to make
> > hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it
> > with a hash of the faulting address.  That will 99.9% solve the
> > performance issue which you believe exists without introducing this new
> > performance issue?
> 
> Yes, that approach would solve the performance issue.
> IIRC, you already suggested this idea roughly 6 months ago and it is
> implemented by Davidlohr. I remembered that there is a race issue on
> COW case with this approach. See following link for more information.
> https://lkml.org/lkml/2013/8/7/142

That seems to be unrelated to hugetlb_instantiation_mutex?

> And we need 1-3 patches to prevent other theorectical race issue
> regardless any approaches.

Yes, I'll be going through patches 1-12 very soon, thanks.


And to reiterate: I'm very uncomfortable mucking around with
performance patches when we have run no tests to measure their
magnitude, or even whether they are beneficial at all!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-19 Thread Joonsoo Kim
On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim  wrote:
> 
> > If parallel fault occur, we can fail to allocate a hugepage,
> > because many threads dequeue a hugepage to handle a fault of same address.
> > This makes reserved pool shortage just for a little while and this cause
> > faulting thread who can get hugepages to get a SIGBUS signal.
> > 
> > To solve this problem, we already have a nice solution, that is,
> > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > a fault handler. This solve the problem clearly, but it introduce
> > performance degradation, because it serialize all fault handling.
> > 
> > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > performance degradation.
> 
> So the whole point of the patch is to improve performance, but the
> changelog doesn't include any performance measurements!
> 
> Please, run some quantitative tests and include a nice summary of the
> results in the changelog.
> 
> This is terribly important, because if the performance benefit is
> infinitesimally small or negative, the patch goes into the bit bucket ;)

Hello, Andrew, Davidlohr.

Yes, I should include the performance measurements.
I can measure it in an artificial setting, but I think it would be best if
Davidlohr, who reported the issue, checked the performance improvement.
https://lkml.org/lkml/2013/7/12/428

Davidlohr, could you measure it on your testing environment where you
originally reported the issue?

> 
> > For achieving it, at first, we should ensure that
> > no one get a SIGBUS if there are enough hugepages.
> > 
> > For this purpose, if we fail to allocate a new hugepage when there is
> > concurrent user, we return just 0, instead of VM_FAULT_SIGBUS. With this,
> > these threads defer to get a SIGBUS signal until there is no
> > concurrent user, and so, we can ensure that no one get a SIGBUS if there
> > are enough hugepages.
> 
> So if I'm understanding this correctly...  if N threads all generate a
> fault against the same address, they will all dive in and allocate a
> hugepage, will then do an enormous memcpy into that page and will then
> attempt to instantiate the page in pagetables.  All threads except one
> will lose the race and will free the page again!  This sounds terribly
> inefficient; it would be useful to write a microbenchmark which
> triggers this scenario so we can explore the impact.

Yes, I think you understand correctly.

I have an idea to prevent this overhead: mark the page when it is zeroed
and unmark it when it is mapped into the page table. If the page mapping
fails for the current thread, the zeroed page keeps the marker and later
we can determine whether it has already been zeroed.
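
For illustration, a minimal user-space model of that marker idea (the patch
later in this thread uses the PageActive flag as the marker); the hpage struct,
the stand-in for the pool and the flag name are assumptions for the sketch, not
kernel code:

    /* Toy model: a page that was zeroed but lost the mapping race goes back
     * to the pool with its marker still set, so the next fault skips the
     * expensive zeroing. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define HPAGE_SIZE (2UL * 1024 * 1024)

    struct hpage {
        bool zeroed;                /* the marker */
        unsigned char data[HPAGE_SIZE];
    };

    static void fault_in(struct hpage *page, bool lost_race)
    {
        if (!page->zeroed) {
            memset(page->data, 0, sizeof(page->data));
            page->zeroed = true;    /* mark: zeroing already done */
        }

        if (lost_race) {
            /* Another thread mapped the page first: it returns to the pool
             * with the marker still set, so zeroing is not repeated. */
            return;
        }

        page->zeroed = false;       /* unmark once the page is really used */
    }

    int main(void)
    {
        static struct hpage page;   /* stands in for one pool entry */

        fault_in(&page, true);      /* race loser: zeroes once, keeps marker */
        fault_in(&page, false);     /* next fault: skips zeroing, clears marker */
        printf("marker after real use: %d\n", page.zeroed);
        return 0;
    }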

If you want to include this functionality in this series, I can do it ;)
Please let me know your decision.

> I'm wondering if a better solution to all of this would be to make
> hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it
> with a hash of the faulting address.  That will 99.9% solve the
> performance issue which you believe exists without introducing this new
> performance issue?

Yes, that approach would solve the performance issue.
IIRC, you already suggested this idea roughly 6 months ago and it was
implemented by Davidlohr. I remember that there is a race issue in the
COW case with this approach. See the following link for more information.
https://lkml.org/lkml/2013/8/7/142

And we need patches 1-3 to prevent other theoretical race issues
regardless of the approach.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-19 Thread Andrew Morton
On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim  wrote:

> If parallel fault occur, we can fail to allocate a hugepage,
> because many threads dequeue a hugepage to handle a fault of same address.
> This makes reserved pool shortage just for a little while and this cause
> faulting thread who can get hugepages to get a SIGBUS signal.
> 
> To solve this problem, we already have a nice solution, that is,
> a hugetlb_instantiation_mutex. This blocks other threads to dive into
> a fault handler. This solve the problem clearly, but it introduce
> performance degradation, because it serialize all fault handling.
> 
> Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> performance degradation.

So the whole point of the patch is to improve performance, but the
changelog doesn't include any performance measurements!

Please, run some quantitative tests and include a nice summary of the
results in the changelog.

This is terribly important, because if the performance benefit is
infinitesimally small or negative, the patch goes into the bit bucket ;)

> For achieving it, at first, we should ensure that
> no one get a SIGBUS if there are enough hugepages.
> 
> For this purpose, if we fail to allocate a new hugepage when there is
> concurrent user, we return just 0, instead of VM_FAULT_SIGBUS. With this,
> these threads defer to get a SIGBUS signal until there is no
> concurrent user, and so, we can ensure that no one get a SIGBUS if there
> are enough hugepages.

So if I'm understanding this correctly...  if N threads all generate a
fault against the same address, they will all dive in and allocate a
hugepage, will then do an enormous memcpy into that page and will then
attempt to instantiate the page in pagetables.  All threads except one
will lose the race and will free the page again!  This sounds terribly
inefficient; it would be useful to write a microbenchmark which
triggers this scenario so we can explore the impact.
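
A rough user-space sketch of such a microbenchmark (compile with -pthread): it
assumes 2MB hugepages are reserved via vm.nr_hugepages and that MAP_HUGETLB is
available; the thread count is arbitrary and the timing is coarse, including
thread start-up overhead.

    /* N threads race to fault the same hugepage address. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <time.h>

    #define NTHREADS 32
    #define HPAGE_SIZE (2UL * 1024 * 1024)

    static char *region;
    static pthread_barrier_t barrier;

    static void *toucher(void *arg)
    {
        (void)arg;
        pthread_barrier_wait(&barrier); /* line everyone up */
        region[0] = 1;                  /* all threads fault the same hugepage */
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTHREADS];
        struct timespec t0, t1;
        int i;

        region = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (region == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }

        pthread_barrier_init(&barrier, NULL, NTHREADS);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, toucher, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%d threads, same-address hugepage fault: %.3f ms\n", NTHREADS,
               (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
        munmap(region, HPAGE_SIZE);
        return 0;
    }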

I'm wondering if a better solution to all of this would be to make
hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it
with a hash of the faulting address.  That will 99.9% solve the
performance issue which you believe exists without introducing this new
performance issue?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-19 Thread Andrew Morton
On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote:

 If parallel fault occur, we can fail to allocate a hugepage,
 because many threads dequeue a hugepage to handle a fault of same address.
 This makes reserved pool shortage just for a little while and this cause
 faulting thread who can get hugepages to get a SIGBUS signal.
 
 To solve this problem, we already have a nice solution, that is,
 a hugetlb_instantiation_mutex. This blocks other threads to dive into
 a fault handler. This solve the problem clearly, but it introduce
 performance degradation, because it serialize all fault handling.
 
 Now, I try to remove a hugetlb_instantiation_mutex to get rid of
 performance degradation.

So the whole point of the patch is to improve performance, but the
changelog doesn't include any performance measurements!

Please, run some quantitative tests and include a nice summary of the
results in the changelog.

This is terribly important, because if the performance benefit is
infinitesimally small or negative, the patch goes into the bit bucket ;)

 For achieving it, at first, we should ensure that
 no one get a SIGBUS if there are enough hugepages.
 
 For this purpose, if we fail to allocate a new hugepage when there is
 concurrent user, we return just 0, instead of VM_FAULT_SIGBUS. With this,
 these threads defer to get a SIGBUS signal until there is no
 concurrent user, and so, we can ensure that no one get a SIGBUS if there
 are enough hugepages.

So if I'm understanding this correctly...  if N threads all generate a
fault against the same address, they will all dive in and allocate a
hugepage, will then do an enormous memcpy into that page and will then
attempt to instantiate the page in pagetables.  All threads except one
will lose the race and will free the page again!  This sounds terribly
inefficient; it would be useful to write a microbenchmark which
triggers this scenario so we can explore the impact.

I'm wondering if a better solution to all of this would be to make
hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it
with a hash of the faulting address.  That will 99.9% solve the
performance issue which you believe exists without introducing this new
performance issue?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-19 Thread Joonsoo Kim
On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
 On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote:
 
  If parallel fault occur, we can fail to allocate a hugepage,
  because many threads dequeue a hugepage to handle a fault of same address.
  This makes reserved pool shortage just for a little while and this cause
  faulting thread who can get hugepages to get a SIGBUS signal.
  
  To solve this problem, we already have a nice solution, that is,
  a hugetlb_instantiation_mutex. This blocks other threads to dive into
  a fault handler. This solve the problem clearly, but it introduce
  performance degradation, because it serialize all fault handling.
  
  Now, I try to remove a hugetlb_instantiation_mutex to get rid of
  performance degradation.
 
 So the whole point of the patch is to improve performance, but the
 changelog doesn't include any performance measurements!
 
 Please, run some quantitative tests and include a nice summary of the
 results in the changelog.
 
 This is terribly important, because if the performance benefit is
 infinitesimally small or negative, the patch goes into the bit bucket ;)

Hello, Andrew, Davidlohr.

Yes, I should include the peformance measurements.
I can measure it on artificial circumstance, but I think that the best is
Davidlohr who reported the issue checks the performance improvement.
https://lkml.org/lkml/2013/7/12/428

Davidlohr, could you measure it on your testing environment where you
originally reported the issue?

 
  For achieving it, at first, we should ensure that
  no one get a SIGBUS if there are enough hugepages.
  
  For this purpose, if we fail to allocate a new hugepage when there is
  concurrent user, we return just 0, instead of VM_FAULT_SIGBUS. With this,
  these threads defer to get a SIGBUS signal until there is no
  concurrent user, and so, we can ensure that no one get a SIGBUS if there
  are enough hugepages.
 
 So if I'm understanding this correctly...  if N threads all generate a
 fault against the same address, they will all dive in and allocate a
 hugepage, will then do an enormous memcpy into that page and will then
 attempt to instantiate the page in pagetables.  All threads except one
 will lose the race and will free the page again!  This sounds terribly
 inefficient; it would be useful to write a microbenchmark which
 triggers this scenario so we can explore the impact.

Yes, you understand correctly, I think.

I have an idea to prevent this overhead. It is that marking page when it
is zeroed and unmarking when it is mapped to page table. If page mapping
is failed due to current thread, the zeroed page will keep the marker and
later we can determine if it is zeroed or not.

If you want to include this functionality in this series, I can do it ;)
Please let me know your decision.

 I'm wondering if a better solution to all of this would be to make
 hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it
 with a hash of the faulting address.  That will 99.9% solve the
 performance issue which you believe exists without introducing this new
 performance issue?

Yes, that approach would solve the performance issue.
IIRC, you already suggested this idea roughly 6 months ago, and it was
implemented by Davidlohr. I recall that there is a race issue in the
COW case with this approach. See the following link for more information.
https://lkml.org/lkml/2013/8/7/142

And we need patches 1-3 to prevent other theoretical race issues,
regardless of which approach we take.

Thanks.


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-19 Thread Andrew Morton
On Fri, 20 Dec 2013 10:58:10 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote:

 On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
  On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim iamjoonsoo@lge.com 
  wrote:
  
   If parallel faults occur, we can fail to allocate a hugepage,
   because many threads dequeue a hugepage to handle a fault at the same address.
   This makes the reserved pool run short just for a little while, and this causes
   a faulting thread that could have gotten a hugepage to receive a SIGBUS signal.
   
  
  So if I'm understanding this correctly...  if N threads all generate a
  fault against the same address, they will all dive in and allocate a
  hugepage, will then do an enormous memcpy into that page and will then
  attempt to instantiate the page in pagetables.  All threads except one
  will lose the race and will free the page again!  This sounds terribly
  inefficient; it would be useful to write a microbenchmark which
  triggers this scenario so we can explore the impact.
 
 Yes, you understand correctly, I think.
 
 I have an idea to prevent this overhead: mark the page when it is zeroed
 and unmark it when it is mapped into the page table. If the current thread
 fails to map the page, the zeroed page keeps the marker, so later we can
 determine whether it has already been zeroed.

Well OK, but the other threads will need to test that in-progress flag
and then do something.  Where something will involve some form of
open-coded sleep/wakeup thing.  To avoid all that wheel-reinventing we
can avoid using an internal flag and use an external flag instead. 
There's one in struct mutex!

I doubt if the additional complexity of the external flag is worth it,
but convincing performance testing results would sway me ;) Please have
a think about it all.
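
To make the contrast concrete, a rough sketch; the flag helpers and the wait
queue below are hypothetical names, not existing kernel symbols:

	/* internal flag: every other thread has to test it and then
	 * open-code its own sleep/wakeup around it */
	wait_event(hugetlb_zeroing_wq, !PageHugeZeroing(page));

	/* external flag: the locked state of a mutex already is the flag,
	 * and the sleep/wakeup machinery comes with it */
	mutex_lock(&hugetlb_fault_mutex);
	/* the first thread zeroes and maps; later threads find the work done */
	mutex_unlock(&hugetlb_fault_mutex);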

 If you want to include this functionality in this series, I can do it ;)
 Please let me know your decision.
 
  I'm wondering if a better solution to all of this would be to make
  hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it
  with a hash of the faulting address.  That will 99.9% solve the
  performance issue which you believe exists without introducing this new
  performance issue?
 
 Yes, that approach would solve the performance issue.
 IIRC, you already suggested this idea roughly 6 months ago, and it was
 implemented by Davidlohr. I recall that there is a race issue in the
 COW case with this approach. See the following link for more information.
 https://lkml.org/lkml/2013/8/7/142

That seems to be unrelated to hugetlb_instantiation_mutex?

 And we need patches 1-3 to prevent other theoretical race issues,
 regardless of which approach we take.

Yes, I'll be going through patches 1-12 very soon, thanks.


And to reiterate: I'm very uncomfortable mucking around with
performance patches when we have run no tests to measure their
magnitude, or even whether they are beneficial at all!


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-19 Thread Davidlohr Bueso
On Thu, 2013-12-19 at 17:02 -0800, Andrew Morton wrote:
 On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote:
 
  If parallel faults occur, we can fail to allocate a hugepage,
  because many threads dequeue a hugepage to handle a fault at the same address.
  This makes the reserved pool run short just for a little while, and this causes
  a faulting thread that could have gotten a hugepage to receive a SIGBUS signal.

  To solve this problem, we already have a nice solution, that is,
  the hugetlb_instantiation_mutex. It blocks other threads from diving into
  the fault handler. This solves the problem cleanly, but it introduces
  performance degradation, because it serializes all fault handling.

  Now, I try to remove the hugetlb_instantiation_mutex to get rid of
  this performance degradation.
 
 So the whole point of the patch is to improve performance, but the
 changelog doesn't include any performance measurements!
 
 Please, run some quantitative tests and include a nice summary of the
 results in the changelog.

I actually spent this afternoon testing these patches with Oracle
(I haven't seen any issues so far), and unless Joonsoo has already done so, I
want to run them through the libhugetlbfs test cases - I got sidetracked by
futexes though :/

Please do consider that, performance-wise, I haven't seen much in
particular. The thing is, I started dealing with this mutex once I
noticed it was the #1 hot lock during Oracle DB startup, but once the
faults are done, the contention really goes away. So I wouldn't say the
mutex is a bottleneck except for the first few minutes.

 
 This is terribly important, because if the performance benefit is
 infinitesimally small or negative, the patch goes into the bit bucket ;)

Well, this mutex is infinitesimally ugly and needs to die (as long as
performance isn't hurt).

Thanks,
Davidlohr



Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-19 Thread Joonsoo Kim
Hello, Davidlohr.

On Thu, Dec 19, 2013 at 06:31:21PM -0800, Davidlohr Bueso wrote:
 On Thu, 2013-12-19 at 17:02 -0800, Andrew Morton wrote:
  On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim iamjoonsoo@lge.com 
  wrote:
  
   If parallel faults occur, we can fail to allocate a hugepage,
   because many threads dequeue a hugepage to handle a fault at the same address.
   This makes the reserved pool run short just for a little while, and this causes
   a faulting thread that could have gotten a hugepage to receive a SIGBUS signal.

   To solve this problem, we already have a nice solution, that is,
   the hugetlb_instantiation_mutex. It blocks other threads from diving into
   the fault handler. This solves the problem cleanly, but it introduces
   performance degradation, because it serializes all fault handling.

   Now, I try to remove the hugetlb_instantiation_mutex to get rid of
   this performance degradation.
  
  So the whole point of the patch is to improve performance, but the
  changelog doesn't include any performance measurements!
  
  Please, run some quantitative tests and include a nice summary of the
  results in the changelog.
 
 I actually spent this afternoon testing these patches with Oracle
 (I haven't seen any issues so far), and unless Joonsoo has already done so, I
 want to run them through the libhugetlbfs test cases - I got sidetracked by
 futexes though :/

Thanks a lot for taking the time to test these patches.
I already ran the libhugetlbfs test cases, and they all passed.

 
 Please do consider that, performance-wise, I haven't seen much in
 particular. The thing is, I started dealing with this mutex once I
 noticed it was the #1 hot lock during Oracle DB startup, but once the
 faults are done, the contention really goes away. So I wouldn't say the
 mutex is a bottleneck except for the first few minutes.

What I want to be sure about is the first few minutes you mentioned.
If possible, please share results along the lines of the following link.
https://lkml.org/lkml/2013/7/12/428

Thanks in advance. :)

  
  This is terribly important, because if the performance benefit is
  infinitesimally small or negative, the patch goes into the bit bucket ;)
 
 Well, this mutex is infinitesimally ugly and needs to die (as long as
 performance isn't hurt).

Yes, agreed.

Thanks.


Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

2013-12-19 Thread Joonsoo Kim
On Thu, Dec 19, 2013 at 06:15:20PM -0800, Andrew Morton wrote:
 On Fri, 20 Dec 2013 10:58:10 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote:
 
  On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
   On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim iamjoonsoo@lge.com 
   wrote:
   
If parallel faults occur, we can fail to allocate a hugepage,
because many threads dequeue a hugepage to handle a fault at the same
address.
This makes the reserved pool run short just for a little while, and this causes
a faulting thread that could have gotten a hugepage to receive a SIGBUS signal.

   
   So if I'm understanding this correctly...  if N threads all generate a
   fault against the same address, they will all dive in and allocate a
   hugepage, will then do an enormous memcpy into that page and will then
   attempt to instantiate the page in pagetables.  All threads except one
   will lose the race and will free the page again!  This sounds terribly
   inefficient; it would be useful to write a microbenchmark which
   triggers this scenario so we can explore the impact.
  
  Yes, you understand correctly, I think.
  
  I have an idea to prevent this overhead: mark the page when it is zeroed
  and unmark it when it is mapped into the page table. If the current thread
  fails to map the page, the zeroed page keeps the marker, so later we can
  determine whether it has already been zeroed.
 
 Well OK, but the other threads will need to test that in-progress flag
 and then do something.  Where something will involve some form of
 open-coded sleep/wakeup thing.  To avoid all that wheel-reinventing we
 can avoid using an internal flag and use an external flag instead. 
 There's one in struct mutex!

My idea considers only hugetlb_no_page() and doesn't need a sleep.
It just sets a page flag after zeroing, and if some thread takes a page with
this flag set when faulting, it simply uses the page without zeroing it again.

 
 I doubt if the additional complexity of the external flag is worth it,
 but convincing performance testing results would sway me ;) Please have
 a think about it all.
 
  If you want to include this functionality in this series, I can do it ;)
  Please let me know your decision.
  
   I'm wondering if a better solution to all of this would be to make
   hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it
   with a hash of the faulting address.  That will 99.9% solve the
   performance issue which you believe exists without introducing this new
   performance issue?
  
  Yes, that approach would solve the performance issue.
  IIRC, you already suggested this idea roughly 6 months ago, and it was
  implemented by Davidlohr. I recall that there is a race issue in the
  COW case with this approach. See the following link for more information.
  https://lkml.org/lkml/2013/8/7/142
 
 That seems to be unrelated to hugetlb_instantiation_mutex?

Yes, it is related to the hugetlb_instantiation_mutex. In that link, I
mentioned a race condition in the mutex-table patches meant to replace the
hugetlb_instantiation_mutex, although the conversation isn't easy to follow.

 
  And we need patches 1-3 to prevent other theoretical race issues,
  regardless of which approach we take.
 
 Yes, I'll be going through patches 1-12 very soon, thanks.

Okay. Thanks :)

 
 
 And to reiterate: I'm very uncomfortable mucking around with
 performance patches when we have run no tests to measure their
 magnitude, or even whether they are beneficial at all!

Okay. I will keep it in mind. :)

Thanks.