Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed, 5 Dec 2018, Andrea Arcangeli wrote:
> > I must have said this at least six or seven times: fault latency is
>
> In your original regression report in this thread to Linus:
>
> https://lkml.kernel.org/r/alpine.deb.2.21.1811281504030.231...@chino.kir.corp.google.com
>
> you said "On a fragmented host, the change itself showed a 13.9% access latency regression on Haswell and up to 40% allocation latency regression. This is more substantial on Naples and Rome. I also measured similar numbers to this for Haswell."
>
> > secondary to the *access* latency. We want to try hard for MADV_HUGEPAGE users to do synchronous compaction and try to make a hugepage available.
>
> I'm glad you said it six or seven times now, because you forgot to mention in the above email that the "40% allocation/fault latency regression" you reported above is actually a secondary concern, because those must be long lived allocations and we can't yet generate compound pages for free after all..

I've been referring to the long history of this discussion, namely my explicit Nacked-by in https://marc.info/?l=linux-kernel&m=153868420126775 two months ago citing the 13.9% access latency regression. The patch was nonetheless merged, I proposed the revert for the same chief complaint, and it was reverted. I brought up the access latency issue three months ago in https://marc.info/?l=linux-kernel&m=153661012118046 and said allocation latency was a secondary concern, specifically that our users of MADV_HUGEPAGE are willing to accept the increased allocation latency for local hugepages.

> BTW, I never bothered to ask yet, but, did you enable NUMA balancing in your benchmarks? NUMA balancing would fix the access latency very easily too, so that 13.9% access latency must quickly disappear if you correctly have NUMA balancing enabled in a NUMA system.

No, we do not have CONFIG_NUMA_BALANCING enabled.
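For readers following the thread: whether automatic NUMA balancing is compiled in and currently active can be checked from userspace. A minimal inspection sketch (standard Linux paths; the runtime toggle requires root):

```shell
# Check whether the running kernel was built with CONFIG_NUMA_BALANCING.
grep CONFIG_NUMA_BALANCING= "/boot/config-$(uname -r)"

# Runtime state: 1 = automatic NUMA balancing enabled, 0 = disabled.
cat /proc/sys/kernel/numa_balancing

# Enable it (as root):
# echo 1 > /proc/sys/kernel/numa_balancing
```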
The __GFP_THISNODE behavior for hugepages was added in 4.0 for the PPC usecase, not by me. That had nothing to do with the madvise mode: the initial documentation referred to the mode as a way to prevent an increase in rss for configs where "enabled" was set to madvise. The allocation policy was never about MADV_HUGEPAGE in any 4.x kernel; it was only an indication for certain defrag settings of how much work should be done to allocate *local* hugepages at fault.

If you are saying that the change in allocator policy came in a patch from Aneesh almost four years ago and went unreported by anybody until a few months ago, I can understand the frustration. I do, however, support the __GFP_THISNODE change he made because his data shows the same results as mine.

I've suggested a very simple extension, specifically a prctl() mode that is inherited across fork, that would allow a workload to specify that it prefers remote allocations over local compaction/reclaim because it is too large to fit on a single node. I'd value your feedback on that suggestion to fix your usecase.
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed, Dec 05, 2018 at 02:10:47PM -0800, David Rientjes wrote:
> I must have said this at least six or seven times: fault latency is

In your original regression report in this thread to Linus:

https://lkml.kernel.org/r/alpine.deb.2.21.1811281504030.231...@chino.kir.corp.google.com

you said "On a fragmented host, the change itself showed a 13.9% access latency regression on Haswell and up to 40% allocation latency regression. This is more substantial on Naples and Rome. I also measured similar numbers to this for Haswell."

> secondary to the *access* latency. We want to try hard for MADV_HUGEPAGE users to do synchronous compaction and try to make a hugepage available.

I'm glad you said it six or seven times now, because you forgot to mention in the above email that the "40% allocation/fault latency regression" you reported above is actually a secondary concern, because those must be long lived allocations and we can't yet generate compound pages for free after all..

> We really want to be backed by hugepages, but certainly not when the access latency becomes 13.9% worse as a result compared to local pages of the native page size.

Yes, the only regression you measure that isn't merely a secondary concern is that 13.9% access latency caused by the lack of immediate NUMA locality.

BTW, I never bothered to ask yet, but did you enable NUMA balancing in your benchmarks? NUMA balancing would fix the access latency very easily too, so that 13.9% access latency must quickly disappear if you correctly have NUMA balancing enabled in a NUMA system.

Furthermore, NUMA balancing is guaranteed to fully converge if the workload can fit in a single node (your case, or __GFP_THISNODE would hardly fly in the first place). It'll work even better for you because you copy all MAP_PRIVATE binaries into MAP_ANON to make them THP backed, so they can also be replicated per node and they won't increase NUMA balancing false sharing.
And khugepaged always remains NUMA agnostic, so it won't risk stepping on NUMA balancing's toes no matter how we tweak the MADV_HUGEPAGE behavior.

> This is not a system-wide configuration detail, it is specific to the workload: does it span more than one node or not? No workload that can fit into a single node, which you also say is going to be the majority of workloads on today's platforms, is going to want to revert __GFP_THISNODE behavior of the past almost four years. It perfectly makes sense, however, to be a new mempolicy mode, a new madvise mode, or a prctl.

qemu has been using MADV_HUGEPAGE since the below commit in Oct 2012.

From ad0b5321f1f797274603ebbe20108b0750baee94 Mon Sep 17 00:00:00 2001
From: Luiz Capitulino
Date: Fri, 5 Oct 2012 16:47:57 -0300
Subject: [PATCH 1/1] Call MADV_HUGEPAGE for guest RAM allocations

This makes it possible for QEMU to use transparent huge pages (THP) when transparent_hugepage/enabled=madvise. Otherwise THP is only used when it's enabled system wide.
Signed-off-by: Luiz Capitulino
Signed-off-by: Anthony Liguori
Signed-off-by: Andrea Arcangeli
---
 exec.c  | 1 +
 osdep.h | 5 +
 2 files changed, 6 insertions(+)

diff --git a/exec.c b/exec.c
index c4ed6fdef1..750008c4e1 100644
--- a/exec.c
+++ b/exec.c
@@ -2571,6 +2571,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
     cpu_physical_memory_set_dirty_range(new_block->offset, size, 0xff);
     qemu_ram_setup_dump(new_block->host, size);
+    qemu_madvise(new_block->host, size, QEMU_MADV_HUGEPAGE);
     if (kvm_enabled())
         kvm_setup_guest_memory(new_block->host, size);
diff --git a/osdep.h b/osdep.h
index cb213e0295..c5fd3d91ff 100644
--- a/osdep.h
+++ b/osdep.h
@@ -108,6 +108,11 @@ void qemu_vfree(void *ptr);
 #else
 #define QEMU_MADV_DONTDUMP QEMU_MADV_INVALID
 #endif
+#ifdef MADV_HUGEPAGE
+#define QEMU_MADV_HUGEPAGE MADV_HUGEPAGE
+#else
+#define QEMU_MADV_HUGEPAGE QEMU_MADV_INVALID
+#endif
 #elif defined(CONFIG_POSIX_MADVISE)

> Please: if we wish to change behavior from February 2015, let's extend the API to allow for remote allocations in several of the ways we have already brainstormed rather than cause regressions.

So if I understand correctly, you broke QEMU 3 years ago (again, only noticed in Jan 2018 because it requires >2 nodes and a VM larger than one node, which isn't common).

QEMU before your regression in April 2015:

Finished: 30 GB mapped, 10.163159s elapsed, 2.95GB/s
Finished: 34 GB mapped, 11.806526s elapsed, 2.88GB/s
Finished: 38 GB mapped, 10.369081s elapsed, 3.66GB/s
Finished: 42 GB mapped, 12.357719s elapsed, 3.40GB/s

QEMU after your regression in April 2015:

Finished: 30 GB mapped, 8.821632s elapsed, 3.40GB/s
Finished: 34 GB mapped, 341.979543s elapsed, 0.10GB/s
Finished: 38 GB mapped, 761.933231s elapsed, 0.05GB/s
Finished: 42 GB mapped, 1188.409235s elapsed, 0.04GB/s

And now you ask open source upstream QEMU
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed, 5 Dec 2018, Andrea Arcangeli wrote:
> > High thp utilization is not always better, especially when those hugepages are accessed remotely and introduce the regressions that I've reported. Seeking high thp utilization at all costs is not the goal if it causes workloads to regress.
>
> Is it possible what you need is a defrag=compactonly_thisnode to set instead of the default defrag=madvise? The fact you seem concerned about page fault latencies doesn't make your workload an obvious candidate for MADV_HUGEPAGE to begin with. At least unless you decide to smooth the MADV_HUGEPAGE behavior with an mbind that will simply add __GFP_THISNODE to the allocations; perhaps you'll be even faster if you invoke reclaim in the local node for 4k allocations too.

I must have said this at least six or seven times: fault latency is secondary to the *access* latency. We want to try hard for MADV_HUGEPAGE users to do synchronous compaction and try to make a hugepage available. We really want to be backed by hugepages, but certainly not when the access latency becomes 13.9% worse as a result compared to local pages of the native page size.

This is not a system-wide configuration detail, it is specific to the workload: does it span more than one node or not? No workload that can fit into a single node, which you also say is going to be the majority of workloads on today's platforms, is going to want to revert the __GFP_THISNODE behavior of the past almost four years. It perfectly makes sense, however, to be a new mempolicy mode, a new madvise mode, or a prctl.

> It looks like for your workload THP is a nice to have add-on, which is practically true of all workloads (with a few corner cases that must use MADV_NOHUGEPAGE), and it's what the defrag= default is about.
>
> Is it possible that you just don't want to shut off completely compaction in the page fault and if you're ok to do it for your library, you may be ok with that for all other apps too?
We enable synchronous compaction for MADV_HUGEPAGE users, yes, because we are not concerned with the fault latency but rather the access latency.

> That's a different stance from other MADV_HUGEPAGE users because you don't seem to mind a severely crippled THP utilization in your app.

If access latency is really better for local pages of the native page size, we of course want to fault those instead. For almost the past four years, the behavior of MADV_HUGEPAGE has been to compact and possibly reclaim locally and then fall back to local pages. It is exactly what our users of MADV_HUGEPAGE want; I did not introduce this NUMA locality restriction, but our users have relied on it.

Please: if we wish to change behavior from February 2015, let's extend the API to allow for remote allocations in several of the ways we have already brainstormed rather than cause regressions.
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed, Dec 05, 2018 at 11:49:26AM -0800, David Rientjes wrote:
> High thp utilization is not always better, especially when those hugepages are accessed remotely and introduce the regressions that I've reported. Seeking high thp utilization at all costs is not the goal if it causes workloads to regress.

Is it possible what you need is a defrag=compactonly_thisnode to set instead of the default defrag=madvise? The fact you seem concerned about page fault latencies doesn't make your workload an obvious candidate for MADV_HUGEPAGE to begin with. At least unless you decide to smooth the MADV_HUGEPAGE behavior with an mbind that will simply add __GFP_THISNODE to the allocations; perhaps you'll be even faster if you invoke reclaim in the local node for 4k allocations too.

It looks like for your workload THP is a nice-to-have add-on, which is practically true of all workloads (with a few corner cases that must use MADV_NOHUGEPAGE), and it's what the defrag= default is about.

Is it possible that you just don't want to shut off compaction completely in the page fault, and if you're ok to do it for your library, you may be ok with that for all other apps too?

That's a different stance from other MADV_HUGEPAGE users, because you don't seem to mind a severely crippled THP utilization in your app. With your patch the utilization will go down a lot compared to the previous __GFP_THISNODE swap-storm-capable behavior, and you're still very fine with that. The fact you're fine with that points in the direction of changing the default tuning for defrag= to something stronger than madvise (that is precisely the default setting that is forcing you to use MADV_HUGEPAGE to get a chance at some THP once in a while during the page fault, after some uptime).
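For context, the defrag= modes discussed here live in sysfs; a quick way to inspect and change the current policy (standard THP sysfs paths; writing requires root):

```shell
# Config fragment: inspect the THP defrag policy.
# Brackets mark the active mode, e.g.: always defer defer+madvise [madvise] never
cat /sys/kernel/mm/transparent_hugepage/defrag

# Select a stronger default (as root):
# echo always > /sys/kernel/mm/transparent_hugepage/defrag
```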
Considering mbind surprisingly isn't privileged (so I suppose it may cause swap storms equivalent to __GFP_THISNODE if maliciously used after all), you could even consider a defrag=thisnode to force compaction+defrag local to the node, to retain your THP+NUMA dynamic partitioning behavior that ends up swapping heavily in the local node.
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed, 5 Dec 2018, Michal Hocko wrote:
> > As we've been over countless times, this is the desired effect for workloads that fit on a single node. We want local pages of the native page size because they (1) are accessed faster than remote hugepages and (2) are candidates for collapse by khugepaged.
> >
> > For applications that do not fit in a single node, we have discussed possible ways to extend the API to allow remote faulting of hugepages, absent remote fragmentation as well, then the long-standing behavior is preserved and large applications can use the API to increase their thp success rate.
>
> OK, I just give up. This doesn't lead anywhere. You keep repeating the same stuff over and over, neglect other usecases and actually force them to do something special just to keep your very specific usecase which you clearly refuse to abstract into a form other people can experiment with or at least provide more detailed broken down numbers for a more serious analysis. Fault latency is only a part of the picture which is much more complex. Look at Mel's report to get an impression of what might be really useful for a _productive_ discussion.

The other usecases are part of patch 2/2 in this series, which is functionally similar to the __GFP_COMPACT_ONLY patch that Andrea proposed. We can also work to extend the API to allow remote thp allocations.

Patch 1/2 reverts the behavior of commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings"), which added NUMA locality on top of an already conflated madvise mode. Prior to this commit, which was merged for 4.20, *all* thp faults were constrained to the local node; this has been the case for three years and even prior to that in other kernels. It turns out that allowing remote allocations introduces access latency in the presence of local fragmentation.
The solution is not to conflate MADV_HUGEPAGE with any semantic that suggests it allows remote thp allocations, especially when that changes long-standing behavior, regresses my usecase, and regresses the kernel test robot. I'll change patch 1/2 to not touch new_page() so that we are only addressing thp faults, and post a v2.
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed 05-12-18 11:49:26, David Rientjes wrote:
> On Wed, 5 Dec 2018, Michal Hocko wrote:
> > > The revert is certainly needed to prevent the regression, yes, but I anticipate that Andrea will report back that patch 2 at least improves the situation for the problem that he was addressing, specifically that it is pointless to thrash any node or reclaim unnecessarily when compaction has already failed. This is what setting __GFP_NORETRY for all thp fault allocations fixes.
> >
> > Yes but earlier numbers from Mel and repeated again [1] simply show that the swap storms are only handled in favor of an absolute drop of THP success rate.
>
> As we've been over countless times, this is the desired effect for workloads that fit on a single node. We want local pages of the native page size because they (1) are accessed faster than remote hugepages and (2) are candidates for collapse by khugepaged.
>
> For applications that do not fit in a single node, we have discussed possible ways to extend the API to allow remote faulting of hugepages, absent remote fragmentation as well, then the long-standing behavior is preserved and large applications can use the API to increase their thp success rate.

OK, I just give up. This doesn't lead anywhere. You keep repeating the same stuff over and over, neglect other usecases and actually force them to do something special just to keep your very specific usecase, which you clearly refuse to abstract into a form other people can experiment with, or at least provide more detailed broken down numbers for a more serious analysis. Fault latency is only a part of the picture, which is much more complex. Look at Mel's report to get an impression of what might be really useful for a _productive_ discussion.
--
Michal Hocko
SUSE Labs
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed, 5 Dec 2018, Michal Hocko wrote:
> > The revert is certainly needed to prevent the regression, yes, but I anticipate that Andrea will report back that patch 2 at least improves the situation for the problem that he was addressing, specifically that it is pointless to thrash any node or reclaim unnecessarily when compaction has already failed. This is what setting __GFP_NORETRY for all thp fault allocations fixes.
>
> Yes but earlier numbers from Mel and repeated again [1] simply show that the swap storms are only handled in favor of an absolute drop of THP success rate.

As we've been over countless times, this is the desired effect for workloads that fit on a single node. We want local pages of the native page size because they (1) are accessed faster than remote hugepages and (2) are candidates for collapse by khugepaged.

For applications that do not fit in a single node, we have discussed possible ways to extend the API to allow remote faulting of hugepages, absent remote fragmentation as well, then the long-standing behavior is preserved and large applications can use the API to increase their thp success rate.

> Yes, this is understood. So we are getting worst of both. We have a numa locality side effect of MADV_HUGEPAGE and we have a poor THP utilization. So how come this is an improvement. Especially when the reported regression hasn't been demonstrated on a real or repeatable workload but rather a very vague presumably worst case behavior where the access penalty is absolutely prevailing.

High thp utilization is not always better, especially when those hugepages are accessed remotely and introduce the regressions that I've reported. Seeking high thp utilization at all costs is not the goal if it causes workloads to regress.
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed, 5 Dec 2018, Mel Gorman wrote:
> > This is a single MADV_HUGEPAGE usecase, there is nothing special about it. It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and faulted the memory with a fragmented local node and then measured the remote access latency to the remote hugepage that occurs without setting __GFP_THISNODE. You can also measure the remote allocation latency by fragmenting the entire system and then faulting.
>
> I'll make the same point as before: the form the fragmentation takes matters, as well as the types of pages that are resident and whether they are active or not. It affects the level of work the system does as well as the overall success rate of operations (be it reclaim, THP allocation, compaction, whatever). This is why a reproduction case that is representative of the problem you're facing on the real workload would have been helpful, because then any alternative proposal could have taken your workload into account during testing.

We know from Andrea's report that compaction is failing, and repeatedly failing, because otherwise we would not need excessive swapping to make it work. That can mean one of two things: (1) a general low-on-memory situation that causes us repeatedly to be under the watermarks that deem compaction suitable (isolate_freepages() will be too painful) or (2) compaction has the memory that it needs but is failing to make a hugepage available because all pages from a pageblock cannot be migrated.

If (1), perhaps in the presence of an antagonist that is quickly allocating the memory before compaction can pass watermark checks, further reclaim is not beneficial: the allocation is becoming too expensive and there is no guarantee that compaction can find this reclaimed memory in isolate_freepages().
I chose to duplicate (2) by synthetically introducing fragmentation (high-order slab, free every other one) locally to test the patch that does not set __GFP_THISNODE. The result is a remote transparent hugepage, but we do not even need to get to the point of local compaction for that fallback to happen. And this is where I measure the 13.9% access latency regression for the lifetime of the binary as a result of this patch.

If local compaction works the first time, great! But that is not what is happening in Andrea's report, and as a result of not setting __GFP_THISNODE we are *guaranteed* worse access latency and may encounter even worse allocation latency if the remote memory is fragmented as well. So while I'm only testing the functional behavior of the patch itself, I cannot speak to the nature of the local fragmentation on Andrea's systems.
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Tue, Dec 04, 2018 at 02:25:54PM -0800, David Rientjes wrote:
> On Tue, 4 Dec 2018, Michal Hocko wrote:
> > > This fixes a 13.9% remote memory access regression and 40% remote memory allocation regression on Haswell when the local node is fragmented for hugepage sized pages and memory is being faulted with either the thp defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
> > >
> > > The usecase that initially identified this issue were binaries that mremap their .text segment to be backed by transparent hugepages on startup. They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().
> >
> > Do you have something you can share with so that other people can play and try to reproduce?
>
> This is a single MADV_HUGEPAGE usecase, there is nothing special about it. It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and faulted the memory with a fragmented local node and then measured the remote access latency to the remote hugepage that occurs without setting __GFP_THISNODE. You can also measure the remote allocation latency by fragmenting the entire system and then faulting.

I'll make the same point as before: the form the fragmentation takes matters, as well as the types of pages that are resident and whether they are active or not. It affects the level of work the system does as well as the overall success rate of operations (be it reclaim, THP allocation, compaction, whatever). This is why a reproduction case that is representative of the problem you're facing on the real workload would have been helpful, because then any alternative proposal could have taken your workload into account during testing.
--
Mel Gorman
SUSE Labs
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Tue 04-12-18 14:04:10, David Rientjes wrote:
> On Tue, 4 Dec 2018, Vlastimil Babka wrote:
> > So, AFAIK, the situation is:
> >
> > - commit 5265047ac301 in 4.1 introduced __GFP_THISNODE for THP. The intention came a bit earlier in 4.0 commit 077fcf116c8c. (I admit acking both as it seemed to make sense).
>
> Yes, both are based on the preference to fault local thp and fallback to local pages before allocating remotely because it does not cause the performance regression introduced by not setting __GFP_THISNODE.
>
> > - The resulting node-reclaim-like behavior regressed Andrea's KVM workloads, but reverting it (only for madvised or non-default defrag=always THP by commit ac5b2c18911f) would regress David's workloads starting with 4.20 to pre-4.1 levels.
>
> Almost, but the defrag=always case had the subtle difference of also setting __GFP_NORETRY whereas MADV_HUGEPAGE did not. This was different than the comment in __alloc_pages_slowpath() that expected thp fault allocations to be caught by checking __GFP_NORETRY.
>
> > If the decision is that it's too late to revert a 4.1 regression for one kind of workload in 4.20 because it causes regression for another workload, then I guess we just revert ac5b2c18911f (patch 1) for 4.20 and don't rush a different fix (patch 2) to 4.20. It's not a big difference if a 4.1 regression is fixed in 4.20 or 4.21?
>
> The revert is certainly needed to prevent the regression, yes, but I anticipate that Andrea will report back that patch 2 at least improves the situation for the problem that he was addressing, specifically that it is pointless to thrash any node or reclaim unnecessarily when compaction has already failed. This is what setting __GFP_NORETRY for all thp fault allocations fixes.

Yes, but the earlier numbers from Mel, repeated again in [1], simply show that the swap storms are only handled in favor of an absolute drop of THP success rate.
> > Because there might be other unexpected consequences of patch 2 that testing won't be able to catch in the remaining 4.20 rc's. And I'm not even sure if it will fix Andrea's workloads. While it should prevent node-reclaim-like thrashing, it will still mean that KVM (or anyone) won't be able to allocate THP's remotely, even if the local node is exhausted of both huge and base pages.
>
> Patch 2 does nothing with respect to the remote allocation policy; it simply prevents reclaim (and potentially thrashing). Patch 1 sets __GFP_THISNODE to prevent the remote allocation.

Yes, this is understood. So we are getting the worst of both: we have a numa locality side effect of MADV_HUGEPAGE and we have poor THP utilization. So how come this is an improvement? Especially when the reported regression hasn't been demonstrated on a real or repeatable workload but rather a very vague, presumably worst case behavior where the access penalty is absolutely prevailing.

[1] http://lkml.kernel.org/r/20181204104558.gv23...@techsingularity.net
--
Michal Hocko
SUSE Labs
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Tue 04-12-18 14:25:54, David Rientjes wrote:
> On Tue, 4 Dec 2018, Michal Hocko wrote:
> > > This fixes a 13.9% remote memory access regression and 40% remote memory allocation regression on Haswell when the local node is fragmented for hugepage sized pages and memory is being faulted with either the thp defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
> > >
> > > The usecase that initially identified this issue were binaries that mremap their .text segment to be backed by transparent hugepages on startup. They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().
> >
> > Do you have something you can share with so that other people can play and try to reproduce?
>
> This is a single MADV_HUGEPAGE usecase, there is nothing special about it. It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and faulted the memory with a fragmented local node and then measured the remote access latency to the remote hugepage that occurs without setting __GFP_THISNODE. You can also measure the remote allocation latency by fragmenting the entire system and then faulting.
>
> (Remapping the text segment only involves parsing /proc/self/exe, mmap, madvise, memcpy, and mremap.)

How does this reflect your real workload and the regressions in it? It certainly shows the worst case behavior where the access penalty is prevalent, while there are no other metrics which might be interesting or even important, e.g. page table savings or the TLB pressure in general when THP fail too eagerly. As Andrea mentioned, there are really valid cases where the remote latency pays off. Have you actually seen the advertised regression in the real workload, and do you have any means to simulate that workload?

> > > This requires a full revert and partial revert of commits merged during the 4.20 rc cycle.
> > > The full revert, of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large amounts of swap activity on the local zone when faulting hugepages by falling back to remote memory. This remote allocation causes the access regression and, if fragmented, the allocation regression.
> >
> > Have you tried to measure any of the workloads Mel and Andrea have pointed out during the previous review discussion? In other words, what is the impact on the THP success rate and allocation latencies for other usecases?
>
> It isn't a property of the workload, it's a property of how fragmented both local and remote memory are. In Andrea's case, I believe he has stated that memory compaction has failed locally and the resulting reclaim activity ends up looping and causing it to thrash the local node whereas 75% of remote memory is free and not fragmented. So we have local fragmentation, and reclaim is very expensive to enable compaction to succeed, if it ever does succeed[*], and mostly free remote memory.
>
> If remote memory is also fragmented, Andrea's case will run into a much more severe swap storm as a result of not setting __GFP_THISNODE. The premise of the entire change is that his remote memory is mostly free so fallback results in a quick allocation. For balanced nodes, that's not going to be the case. The fix to prevent the heavy reclaim activity is to set __GFP_NORETRY as the page allocator suspects, which patch 2 here does.

You are not answering my question, and I take it that you haven't actually tested this in a variety of workloads and base your assumptions on an artificial worst case scenario. Please correct me if I am wrong.
--
Michal Hocko
SUSE Labs
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Tue, 4 Dec 2018, Michal Hocko wrote:
> > This fixes a 13.9% remote memory access regression and 40% remote memory allocation regression on Haswell when the local node is fragmented for hugepage sized pages and memory is being faulted with either the thp defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
> >
> > The usecase that initially identified this issue were binaries that mremap their .text segment to be backed by transparent hugepages on startup. They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().
>
> Do you have something you can share with so that other people can play and try to reproduce?

This is a single MADV_HUGEPAGE usecase, there is nothing special about it. It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and faulted the memory with a fragmented local node and then measured the remote access latency to the remote hugepage that occurs without setting __GFP_THISNODE. You can also measure the remote allocation latency by fragmenting the entire system and then faulting.

(Remapping the text segment only involves parsing /proc/self/exe, mmap, madvise, memcpy, and mremap.)

> > This requires a full revert and partial revert of commits merged during the 4.20 rc cycle. The full revert, of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large amounts of swap activity on the local zone when faulting hugepages by falling back to remote memory. This remote allocation causes the access regression and, if fragmented, the allocation regression.
>
> Have you tried to measure any of the workloads Mel and Andrea have pointed out during the previous review discussion? In other words, what is the impact on the THP success rate and allocation latencies for other usecases?

It isn't a property of the workload, it's a property of how fragmented both local and remote memory are.
In Andrea's case, I believe he has stated that memory compaction has
failed locally and the resulting reclaim activity ends up looping and
thrashing the local node, whereas 75% of remote memory is free and not
fragmented.  So we have local fragmentation, where reclaim is very
expensive relative to any benefit it provides to compaction, if compaction
ever does succeed[*], and mostly free remote memory.

If remote memory is also fragmented, Andrea's case will run into a much
more severe swap storm as a result of not setting __GFP_THISNODE.  The
premise of the entire change is that his remote memory is mostly free, so
fallback results in a quick allocation.  For balanced nodes, that's not
going to be the case.  That's an interesting memory state to be in.

The fix to prevent the heavy reclaim activity is to set __GFP_NORETRY, as
the page allocator already expects for thp fault allocations, which
patch 2 here does.

 [*] Reclaim here would only be beneficial if we fail the order-0
     watermark check in __compaction_suitable() *and* the reclaimed memory
     can be accessed during isolate_freepages().
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Tue, 4 Dec 2018, Vlastimil Babka wrote:

> So, AFAIK, the situation is:
>
> - commit 5265047ac301 in 4.1 introduced __GFP_THISNODE for THP.  The
>   intention came a bit earlier in 4.0 commit 077fcf116c8c.  (I admit
>   acking both as it seemed to make sense.)

Yes, both are based on the preference to fault local thp and fall back to
local pages before allocating remotely, because remote allocation causes
the performance regression introduced by not setting __GFP_THISNODE.

> - The resulting node-reclaim-like behavior regressed Andrea's KVM
>   workloads, but reverting it (only for madvised or non-default
>   defrag=always THP by commit ac5b2c18911f) would regress David's
>   workloads starting with 4.20 to pre-4.1 levels.

Almost, but the defrag=always case had the subtle difference of also
setting __GFP_NORETRY, whereas MADV_HUGEPAGE did not.  This was different
from the comment in __alloc_pages_slowpath() that expected thp fault
allocations to be caught by checking __GFP_NORETRY.

> If the decision is that it's too late to revert a 4.1 regression for one
> kind of workload in 4.20 because it causes regression for another
> workload, then I guess we just revert ac5b2c18911f (patch 1) for 4.20
> and don't rush a different fix (patch 2) to 4.20.  It's not a big
> difference if a 4.1 regression is fixed in 4.20 or 4.21?

The revert is certainly needed to prevent the regression, yes, but I
anticipate that Andrea will report back that patch 2 at least improves the
situation for the problem that he was addressing, specifically that it is
pointless to thrash any node or reclaim unnecessarily when compaction has
already failed.  This is what setting __GFP_NORETRY for all thp fault
allocations fixes.

> Because there might be other unexpected consequences of patch 2 that
> testing won't be able to catch in the remaining 4.20 rc's.  And I'm not
> even sure if it will fix Andrea's workloads.
> While it should prevent node-reclaim-like thrashing, it will still mean
> that KVM (or anyone) won't be able to allocate THPs remotely, even if
> the local node is exhausted of both huge and base pages.

Patch 2 does nothing with respect to the remote allocation policy; it
simply prevents reclaim (and, potentially, thrashing).  Patch 1 sets
__GFP_THISNODE to prevent the remote allocation.

Note that the commit to be reverted in patch 1, if not reverted, would
cause an even more severe regression in Andrea's case if remote memory is
fragmented as well: it opens the door to thrashing both local and remote
memory instead of only local memory.  I measured this as a 40% allocation
latency regression when purposefully fragmenting all nodes and faulting
without __GFP_THISNODE, and that was on Haswell; I can imagine it would be
even greater on Rome.
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On 12/4/18 12:50 AM, David Rientjes wrote:

> This fixes a 13.9% remote memory access regression and a 40% remote
> memory allocation regression on Haswell when the local node is fragmented
> for hugepage sized pages and memory is being faulted with either the thp
> defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
>
> The usecase that initially identified this issue was binaries that mremap
> their .text segment to be backed by transparent hugepages on startup.
> They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().
>
> This requires a full revert and a partial revert of commits merged during
> the 4.20 rc cycle.  The full revert, of ac5b2c18911f ("mm: thp: relax
> __GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large
> amounts of swap activity on the local node when faulting hugepages by
> falling back to remote memory.  This remote allocation causes the access
> regression and, if memory is fragmented, the allocation regression.
>
> This patchset also fixes that issue by not attempting direct reclaim at
> all when compaction fails to free a hugepage.  Note that if remote memory
> were also low or fragmented, ac5b2c18911f ("mm: thp: relax __GFP_THISNODE
> for MADV_HUGEPAGE mappings") would only have compounded the problem it
> attempts to address by thrashing all nodes instead of only the local node.
>
> The reverts for the stable trees will be different: just a straight revert
> of commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE
> mappings") is likely needed.
>
> Cross compiled for architectures with thp support and thp enabled:
> arc (with ISA_ARCV2), arm (with ARM_LPAE), arm64, i386, mips64, powerpc,
> s390, sparc, x86_64.
>
> Andrea, is this acceptable?

So, AFAIK, the situation is:

- commit 5265047ac301 in 4.1 introduced __GFP_THISNODE for THP.  The
  intention came a bit earlier in 4.0 commit 077fcf116c8c.  (I admit
  acking both as it seemed to make sense.)
- The resulting node-reclaim-like behavior regressed Andrea's KVM
  workloads, but reverting it (only for madvised or non-default
  defrag=always THP by commit ac5b2c18911f) would regress David's
  workloads starting with 4.20 to pre-4.1 levels.

If the decision is that it's too late to revert a 4.1 regression for one
kind of workload in 4.20 because it causes a regression for another
workload, then I guess we just revert ac5b2c18911f (patch 1) for 4.20 and
don't rush a different fix (patch 2) to 4.20.  It's not a big difference
if a 4.1 regression is fixed in 4.20 or 4.21?

Because there might be other unexpected consequences of patch 2 that
testing won't be able to catch in the remaining 4.20 rc's.  And I'm not
even sure if it will fix Andrea's workloads.  While it should prevent
node-reclaim-like thrashing, it will still mean that KVM (or anyone)
won't be able to allocate THPs remotely, even if the local node is
exhausted of both huge and base pages.

> ---
>  drivers/gpu/drm/ttm/ttm_page_alloc.c     |  8 +++---
>  drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |  3 --
>  include/linux/gfp.h                      |  3 +-
>  include/linux/mempolicy.h                |  2 -
>  mm/huge_memory.c                         | 41 +++
>  mm/mempolicy.c                           |  7 +++--
>  mm/page_alloc.c                          | 16
>  7 files changed, 42 insertions(+), 38 deletions(-)
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Mon 03-12-18 15:50:18, David Rientjes wrote:

> This fixes a 13.9% remote memory access regression and a 40% remote
> memory allocation regression on Haswell when the local node is fragmented
> for hugepage sized pages and memory is being faulted with either the thp
> defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
>
> The usecase that initially identified this issue was binaries that mremap
> their .text segment to be backed by transparent hugepages on startup.
> They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().

Do you have something you can share so that other people can play with it
and try to reproduce?

> This requires a full revert and a partial revert of commits merged during
> the 4.20 rc cycle.  The full revert, of ac5b2c18911f ("mm: thp: relax
> __GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large
> amounts of swap activity on the local node when faulting hugepages by
> falling back to remote memory.  This remote allocation causes the access
> regression and, if memory is fragmented, the allocation regression.

Have you tried to measure any of the workloads Mel and Andrea have
pointed out during the previous review discussion?  In other words, what
is the impact on the THP success rate and allocation latencies for other
usecases?

--
Michal Hocko
SUSE Labs
[patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
This fixes a 13.9% remote memory access regression and a 40% remote
memory allocation regression on Haswell when the local node is fragmented
for hugepage sized pages and memory is being faulted with either the thp
defrag setting of "always" or has been madvised with MADV_HUGEPAGE.

The usecase that initially identified this issue was binaries that mremap
their .text segment to be backed by transparent hugepages on startup.
They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().

This requires a full revert and a partial revert of commits merged during
the 4.20 rc cycle.  The full revert, of ac5b2c18911f ("mm: thp: relax
__GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large
amounts of swap activity on the local node when faulting hugepages by
falling back to remote memory.  This remote allocation causes the access
regression and, if memory is fragmented, the allocation regression.

This patchset also fixes that issue by not attempting direct reclaim at
all when compaction fails to free a hugepage.  Note that if remote memory
were also low or fragmented, ac5b2c18911f ("mm: thp: relax __GFP_THISNODE
for MADV_HUGEPAGE mappings") would only have compounded the problem it
attempts to address by thrashing all nodes instead of only the local node.

The reverts for the stable trees will be different: just a straight revert
of commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE
mappings") is likely needed.

Cross compiled for architectures with thp support and thp enabled:
arc (with ISA_ARCV2), arm (with ARM_LPAE), arm64, i386, mips64, powerpc,
s390, sparc, x86_64.

Andrea, is this acceptable?
---
 drivers/gpu/drm/ttm/ttm_page_alloc.c     |  8 +++---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |  3 --
 include/linux/gfp.h                      |  3 +-
 include/linux/mempolicy.h                |  2 -
 mm/huge_memory.c                         | 41 +++
 mm/mempolicy.c                           |  7 +++--
 mm/page_alloc.c                          | 16
 7 files changed, 42 insertions(+), 38 deletions(-)