Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed, 5 Dec 2018, Andrea Arcangeli wrote:
> > I must have said this at least six or seven times: fault latency is
>
> In your original regression report in this thread to Linus:
>
> https://lkml.kernel.org/r/alpine.deb.2.21.1811281504030.231...@chino.kir.corp.google.com
>
> you said "On a fragmented host, the change itself showed a 13.9% access latency regression on Haswell and up to 40% allocation latency regression. This is more substantial on Naples and Rome. I also measured similar numbers to this for Haswell."
>
> > secondary to the *access* latency. We want to try hard for MADV_HUGEPAGE users to do synchronous compaction and try to make a hugepage available.
>
> I'm glad you said it six or seven times now, because you forgot to mention in the above email that the "40% allocation/fault latency regression" you reported above is actually a secondary concern, because those must be long lived allocations and we can't yet generate compound pages for free after all..

I've been referring to the long history of this discussion, namely my explicit Nacked-by in https://marc.info/?l=linux-kernel&m=153868420126775 two months ago citing the 13.9% access latency regression. The patch was nonetheless merged, I proposed the revert for the same chief complaint, and it was reverted. I brought up the access latency issue three months ago in https://marc.info/?l=linux-kernel&m=153661012118046 and said allocation latency was a secondary concern, specifically that our users of MADV_HUGEPAGE are willing to accept the increased allocation latency for local hugepages.

> BTW, I never bothered to ask yet, but, did you enable NUMA balancing in your benchmarks? NUMA balancing would fix the access latency very easily too, so that 13.9% access latency must quickly disappear if you correctly have NUMA balancing enabled in a NUMA system.

No, we do not have CONFIG_NUMA_BALANCING enabled.
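For readers following the thread: whether automatic NUMA balancing is compiled in and currently active can be checked from userspace. A minimal inspection sketch (standard Linux paths; the runtime toggle requires root):

```shell
# Check whether the running kernel was built with CONFIG_NUMA_BALANCING.
grep CONFIG_NUMA_BALANCING= "/boot/config-$(uname -r)"

# Runtime state: 1 = automatic NUMA balancing enabled, 0 = disabled.
cat /proc/sys/kernel/numa_balancing

# Enable it (as root):
# echo 1 > /proc/sys/kernel/numa_balancing
```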
The __GFP_THISNODE behavior for hugepages was added in 4.0 for the PPC usecase, not by me. That had nothing to do with the madvise mode: the initial documentation referred to the mode as a way to prevent an increase in rss for configs where "enabled" was set to madvise. The allocation policy was never about MADV_HUGEPAGE in any 4.x kernel; it was only an indication for certain defrag settings of how much work should be done to allocate *local* hugepages at fault.

If you are saying that the change in allocator policy came in a patch from Aneesh almost four years ago and went unreported by anybody until a few months ago, I can understand the frustration. I do, however, support the __GFP_THISNODE change he made because his data shows the same results as mine.

I've suggested a very simple extension, specifically a prctl() mode that is inherited across fork, that would allow a workload to specify that it prefers remote allocations over local compaction/reclaim because it is too large to fit on a single node. I'd value your feedback on that suggestion to fix your usecase.
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed, Dec 05, 2018 at 02:10:47PM -0800, David Rientjes wrote:
> I must have said this at least six or seven times: fault latency is

In your original regression report in this thread to Linus:

https://lkml.kernel.org/r/alpine.deb.2.21.1811281504030.231...@chino.kir.corp.google.com

you said "On a fragmented host, the change itself showed a 13.9% access latency regression on Haswell and up to 40% allocation latency regression. This is more substantial on Naples and Rome. I also measured similar numbers to this for Haswell."

> secondary to the *access* latency. We want to try hard for MADV_HUGEPAGE users to do synchronous compaction and try to make a hugepage available.

I'm glad you said it six or seven times now, because you forgot to mention in the above email that the "40% allocation/fault latency regression" you reported above is actually a secondary concern, because those must be long lived allocations and we can't yet generate compound pages for free after all..

> We really want to be backed by hugepages, but certainly not when the access latency becomes 13.9% worse as a result compared to local pages of the native page size.

Yes, the only regression you measure that isn't merely a secondary concern is that 13.9% access latency caused by the lack of immediate NUMA locality.

BTW, I never bothered to ask yet, but did you enable NUMA balancing in your benchmarks? NUMA balancing would fix the access latency very easily too, so that 13.9% access latency must quickly disappear if you correctly have NUMA balancing enabled in a NUMA system.

Furthermore, NUMA balancing is guaranteed to fully converge if the workload can fit in a single node (your case, or __GFP_THISNODE would hardly fly in the first place). It'll work even better for you because you copy all MAP_PRIVATE binaries into MAP_ANON to make them THP backed, so they can also be replicated per node and they won't increase NUMA balancing false sharing.
And khugepaged always remains NUMA agnostic, so it won't risk stepping on NUMA balancing's toes no matter how we tweak the MADV_HUGEPAGE behavior.

> This is not a system-wide configuration detail, it is specific to the workload: does it span more than one node or not? No workload that can fit into a single node, which you also say is going to be the majority of workloads on today's platforms, is going to want to revert __GFP_THISNODE behavior of the past almost four years. It perfectly makes sense, however, to be a new mempolicy mode, a new madvise mode, or a prctl.

qemu has been using MADV_HUGEPAGE since the below commit in Oct 2012.

From ad0b5321f1f797274603ebbe20108b0750baee94 Mon Sep 17 00:00:00 2001
From: Luiz Capitulino
Date: Fri, 5 Oct 2012 16:47:57 -0300
Subject: [PATCH 1/1] Call MADV_HUGEPAGE for guest RAM allocations

This makes it possible for QEMU to use transparent huge pages (THP) when transparent_hugepage/enabled=madvise. Otherwise THP is only used when it's enabled system wide.
Signed-off-by: Luiz Capitulino
Signed-off-by: Anthony Liguori
Signed-off-by: Andrea Arcangeli
---
 exec.c  | 1 +
 osdep.h | 5 +
 2 files changed, 6 insertions(+)

diff --git a/exec.c b/exec.c
index c4ed6fdef1..750008c4e1 100644
--- a/exec.c
+++ b/exec.c
@@ -2571,6 +2571,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
     cpu_physical_memory_set_dirty_range(new_block->offset, size, 0xff);
     qemu_ram_setup_dump(new_block->host, size);
+    qemu_madvise(new_block->host, size, QEMU_MADV_HUGEPAGE);
     if (kvm_enabled())
         kvm_setup_guest_memory(new_block->host, size);
diff --git a/osdep.h b/osdep.h
index cb213e0295..c5fd3d91ff 100644
--- a/osdep.h
+++ b/osdep.h
@@ -108,6 +108,11 @@ void qemu_vfree(void *ptr);
 #else
 #define QEMU_MADV_DONTDUMP QEMU_MADV_INVALID
 #endif
+#ifdef MADV_HUGEPAGE
+#define QEMU_MADV_HUGEPAGE MADV_HUGEPAGE
+#else
+#define QEMU_MADV_HUGEPAGE QEMU_MADV_INVALID
+#endif
 #elif defined(CONFIG_POSIX_MADVISE)

> Please: if we wish to change behavior from February 2015, let's extend the API to allow for remote allocations in several of the ways we have already brainstormed rather than cause regressions.

So if I understand correctly, you broke QEMU 3 years ago (again, only noticed in Jan 2018 because it requires >2 nodes and a VM larger than one node, which isn't common).

QEMU before your regression in April 2015:

Finished: 30 GB mapped, 10.163159s elapsed, 2.95GB/s
Finished: 34 GB mapped, 11.806526s elapsed, 2.88GB/s
Finished: 38 GB mapped, 10.369081s elapsed, 3.66GB/s
Finished: 42 GB mapped, 12.357719s elapsed, 3.40GB/s

QEMU after your regression in April 2015:

Finished: 30 GB mapped, 8.821632s elapsed, 3.40GB/s
Finished: 34 GB mapped, 341.979543s elapsed, 0.10GB/s
Finished: 38 GB mapped, 761.933231s elapsed, 0.05GB/s
Finished: 42 GB mapped, 1188.409235s elapsed, 0.04GB/s

And now you ask open source upstream QEMU
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed, 5 Dec 2018, Andrea Arcangeli wrote:
> > High thp utilization is not always better, especially when those hugepages are accessed remotely and introduce the regressions that I've reported. Seeking high thp utilization at all costs is not the goal if it causes workloads to regress.
>
> Is it possible what you need is a defrag=compactonly_thisnode to set instead of the default defrag=madvise? The fact you seem concerned about page fault latencies doesn't make your workload an obvious candidate for MADV_HUGEPAGE to begin with. At least unless you decide to smooth the MADV_HUGEPAGE behavior with an mbind that will simply add __GFP_THISNODE to the allocations; perhaps you'll be even faster if you invoke reclaim in the local node for 4k allocations too.

I must have said this at least six or seven times: fault latency is secondary to the *access* latency. We want to try hard for MADV_HUGEPAGE users to do synchronous compaction and try to make a hugepage available. We really want to be backed by hugepages, but certainly not when the access latency becomes 13.9% worse as a result compared to local pages of the native page size.

This is not a system-wide configuration detail, it is specific to the workload: does it span more than one node or not? No workload that can fit into a single node, which you also say is going to be the majority of workloads on today's platforms, is going to want to revert the __GFP_THISNODE behavior of the past almost four years. It perfectly makes sense, however, to be a new mempolicy mode, a new madvise mode, or a prctl.

> It looks like for your workload THP is a nice to have add-on, which is practically true of all workloads (with a few corner cases that must use MADV_NOHUGEPAGE), and it's what the defrag= default is about.
>
> Is it possible that you just don't want to shut off completely compaction in the page fault and if you're ok to do it for your library, you may be ok with that for all other apps too?
We enable synchronous compaction for MADV_HUGEPAGE users, yes, because we are not concerned with the fault latency but rather the access latency.

> That's a different stance from other MADV_HUGEPAGE users because you don't seem to mind a severely crippled THP utilization in your app.

If access latency is really better for local pages of the native page size, we of course want to fault those instead. For almost the past four years, the behavior of MADV_HUGEPAGE has been to compact and possibly reclaim locally and then fall back to local pages. It is exactly what our users of MADV_HUGEPAGE want; I did not introduce this NUMA locality restriction, but our users have relied on it.

Please: if we wish to change behavior from February 2015, let's extend the API to allow for remote allocations in several of the ways we have already brainstormed rather than cause regressions.
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed, Dec 05, 2018 at 11:49:26AM -0800, David Rientjes wrote:
> High thp utilization is not always better, especially when those hugepages are accessed remotely and introduce the regressions that I've reported. Seeking high thp utilization at all costs is not the goal if it causes workloads to regress.

Is it possible what you need is a defrag=compactonly_thisnode to set instead of the default defrag=madvise? The fact you seem concerned about page fault latencies doesn't make your workload an obvious candidate for MADV_HUGEPAGE to begin with. At least unless you decide to smooth the MADV_HUGEPAGE behavior with an mbind that will simply add __GFP_THISNODE to the allocations; perhaps you'll be even faster if you invoke reclaim in the local node for 4k allocations too.

It looks like for your workload THP is a nice-to-have add-on, which is practically true of all workloads (with a few corner cases that must use MADV_NOHUGEPAGE), and it's what the defrag= default is about.

Is it possible that you just don't want to shut off compaction completely in the page fault, and if you're ok to do it for your library, you may be ok with that for all other apps too?

That's a different stance from other MADV_HUGEPAGE users, because you don't seem to mind a severely crippled THP utilization in your app. With your patch the utilization will go down a lot compared to the previous __GFP_THISNODE swap-storm-capable behavior, and you're still very fine with that. The fact you're fine with that points in the direction of changing the default tuning for defrag= to something stronger than madvise (that is precisely the default setting that is forcing you to use MADV_HUGEPAGE to get a chance at some THP once in a while during the page fault, after some uptime).
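For context, the defrag= modes discussed here live in sysfs; a quick way to inspect and change the current policy (standard THP sysfs paths; writing requires root):

```shell
# Config fragment: inspect the THP defrag policy.
# Brackets mark the active mode, e.g.: always defer defer+madvise [madvise] never
cat /sys/kernel/mm/transparent_hugepage/defrag

# Select a stronger default (as root):
# echo always > /sys/kernel/mm/transparent_hugepage/defrag
```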
Considering mbind surprisingly isn't privileged (so I suppose it may cause swap storms equivalent to __GFP_THISNODE if maliciously used after all), you could even consider a defrag=thisnode to force compaction+defrag local to the node, to retain your THP+NUMA dynamic partitioning behavior that ends up swapping heavily in the local node.
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed, 5 Dec 2018, Michal Hocko wrote:
> > As we've been over countless times, this is the desired effect for workloads that fit on a single node. We want local pages of the native page size because they (1) are accessed faster than remote hugepages and (2) are candidates for collapse by khugepaged.
> >
> > For applications that do not fit in a single node, we have discussed possible ways to extend the API to allow remote faulting of hugepages, absent remote fragmentation as well, then the long-standing behavior is preserved and large applications can use the API to increase their thp success rate.
>
> OK, I just give up. This doesn't lead anywhere. You keep repeating the same stuff over and over, neglect other usecases and actually force them to do something special just to keep your very specific usecase which you clearly refuse to abstract into a form other people can experiment with or at least provide more detailed broken down numbers for a more serious analysis. Fault latency is only a part of the picture which is much more complex. Look at Mel's report to get an impression of what might be really useful for a _productive_ discussion.

The other usecases are part of patch 2/2 in this series, which is functionally similar to the __GFP_COMPACT_ONLY patch that Andrea proposed. We can also work to extend the API to allow remote thp allocations.

Patch 1/2 reverts the behavior of commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings"), which added NUMA locality on top of an already conflated madvise mode. Prior to this commit, which was merged for 4.20, *all* thp faults were constrained to the local node; this has been the case for three years and even prior to that in other kernels. It turns out that allowing remote allocations introduces access latency in the presence of local fragmentation.
The solution is not to conflate MADV_HUGEPAGE with any semantic that suggests it allows remote thp allocations, especially when that changes long-standing behavior, regresses my usecase, and regresses the kernel test robot. I'll change patch 1/2 to not touch new_page() so that we are only addressing thp faults, and post a v2.
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed 05-12-18 11:49:26, David Rientjes wrote:
> On Wed, 5 Dec 2018, Michal Hocko wrote:
> > > The revert is certainly needed to prevent the regression, yes, but I anticipate that Andrea will report back that patch 2 at least improves the situation for the problem that he was addressing, specifically that it is pointless to thrash any node or reclaim unnecessarily when compaction has already failed. This is what setting __GFP_NORETRY for all thp fault allocations fixes.
> >
> > Yes but earlier numbers from Mel and repeated again [1] simply show that the swap storms are only handled in favor of an absolute drop of THP success rate.
>
> As we've been over countless times, this is the desired effect for workloads that fit on a single node. We want local pages of the native page size because they (1) are accessed faster than remote hugepages and (2) are candidates for collapse by khugepaged.
>
> For applications that do not fit in a single node, we have discussed possible ways to extend the API to allow remote faulting of hugepages, absent remote fragmentation as well, then the long-standing behavior is preserved and large applications can use the API to increase their thp success rate.

OK, I just give up. This doesn't lead anywhere. You keep repeating the same stuff over and over, neglect other usecases and actually force them to do something special just to keep your very specific usecase, which you clearly refuse to abstract into a form other people can experiment with, or at least provide more detailed broken down numbers for a more serious analysis. Fault latency is only a part of the picture, which is much more complex. Look at Mel's report to get an impression of what might be really useful for a _productive_ discussion.
--
Michal Hocko
SUSE Labs
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed, 5 Dec 2018, Michal Hocko wrote:
> > The revert is certainly needed to prevent the regression, yes, but I anticipate that Andrea will report back that patch 2 at least improves the situation for the problem that he was addressing, specifically that it is pointless to thrash any node or reclaim unnecessarily when compaction has already failed. This is what setting __GFP_NORETRY for all thp fault allocations fixes.
>
> Yes but earlier numbers from Mel and repeated again [1] simply show that the swap storms are only handled in favor of an absolute drop of THP success rate.

As we've been over countless times, this is the desired effect for workloads that fit on a single node. We want local pages of the native page size because they (1) are accessed faster than remote hugepages and (2) are candidates for collapse by khugepaged.

For applications that do not fit in a single node, we have discussed possible ways to extend the API to allow remote faulting of hugepages, absent remote fragmentation as well, then the long-standing behavior is preserved and large applications can use the API to increase their thp success rate.

> Yes, this is understood. So we are getting worst of both. We have a numa locality side effect of MADV_HUGEPAGE and we have a poor THP utilization. So how come this is an improvement. Especially when the reported regression hasn't been demonstrated on a real or repeatable workload but rather a very vague presumably worst case behavior where the access penalty is absolutely prevailing.

High thp utilization is not always better, especially when those hugepages are accessed remotely and introduce the regressions that I've reported. Seeking high thp utilization at all costs is not the goal if it causes workloads to regress.
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Wed, 5 Dec 2018, Mel Gorman wrote:
> > This is a single MADV_HUGEPAGE usecase, there is nothing special about it. It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and faulted the memory with a fragmented local node and then measured the remote access latency to the remote hugepage that occurs without setting __GFP_THISNODE. You can also measure the remote allocation latency by fragmenting the entire system and then faulting.
>
> I'll make the same point as before: the form the fragmentation takes matters, as well as the types of pages that are resident and whether they are active or not. It affects the level of work the system does as well as the overall success rate of operations (be it reclaim, THP allocation, compaction, whatever). This is why a reproduction case that is representative of the problem you're facing on the real workload would have been helpful, because then any alternative proposal could have taken your workload into account during testing.

We know from Andrea's report that compaction is failing, and repeatedly failing, because otherwise we would not need excessive swapping to make it work. That can mean one of two things: (1) a general low-on-memory situation that causes us repeatedly to be under the watermarks that deem compaction suitable (isolate_freepages() will be too painful) or (2) compaction has the memory that it needs but is failing to make a hugepage available because all pages from a pageblock cannot be migrated.

If (1), perhaps in the presence of an antagonist that is quickly allocating the memory before compaction can pass watermark checks, further reclaim is not beneficial: the allocation is becoming too expensive and there is no guarantee that compaction can find this reclaimed memory in isolate_freepages().
I chose to duplicate (2) by synthetically introducing fragmentation (high-order slab, free every other one) locally to test the patch that does not set __GFP_THISNODE. The result is a remote transparent hugepage, but we do not even need to get to the point of local compaction for that fallback to happen. And this is where I measure the 13.9% access latency regression for the lifetime of the binary as a result of this patch.

If local compaction works the first time, great! But that is not what is happening in Andrea's report, and as a result of not setting __GFP_THISNODE we are *guaranteed* worse access latency and may encounter even worse allocation latency if the remote memory is fragmented as well. So while I'm only testing the functional behavior of the patch itself, I cannot speak to the nature of the local fragmentation on Andrea's systems.
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Tue, Dec 04, 2018 at 02:25:54PM -0800, David Rientjes wrote:
> On Tue, 4 Dec 2018, Michal Hocko wrote:
> > > This fixes a 13.9% remote memory access regression and 40% remote memory allocation regression on Haswell when the local node is fragmented for hugepage sized pages and memory is being faulted with either the thp defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
> > >
> > > The usecase that initially identified this issue were binaries that mremap their .text segment to be backed by transparent hugepages on startup. They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().
> >
> > Do you have something you can share with so that other people can play and try to reproduce?
>
> This is a single MADV_HUGEPAGE usecase, there is nothing special about it. It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and faulted the memory with a fragmented local node and then measured the remote access latency to the remote hugepage that occurs without setting __GFP_THISNODE. You can also measure the remote allocation latency by fragmenting the entire system and then faulting.

I'll make the same point as before: the form the fragmentation takes matters, as well as the types of pages that are resident and whether they are active or not. It affects the level of work the system does as well as the overall success rate of operations (be it reclaim, THP allocation, compaction, whatever). This is why a reproduction case that is representative of the problem you're facing on the real workload would have been helpful, because then any alternative proposal could have taken your workload into account during testing.
--
Mel Gorman
SUSE Labs
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Tue 04-12-18 14:04:10, David Rientjes wrote:
> On Tue, 4 Dec 2018, Vlastimil Babka wrote:
> > So, AFAIK, the situation is:
> >
> > - commit 5265047ac301 in 4.1 introduced __GFP_THISNODE for THP. The intention came a bit earlier in 4.0 commit 077fcf116c8c. (I admit acking both as it seemed to make sense).
>
> Yes, both are based on the preference to fault local thp and fallback to local pages before allocating remotely because it does not cause the performance regression introduced by not setting __GFP_THISNODE.
>
> > - The resulting node-reclaim-like behavior regressed Andrea's KVM workloads, but reverting it (only for madvised or non-default defrag=always THP by commit ac5b2c18911f) would regress David's workloads starting with 4.20 to pre-4.1 levels.
>
> Almost, but the defrag=always case had the subtle difference of also setting __GFP_NORETRY whereas MADV_HUGEPAGE did not. This was different than the comment in __alloc_pages_slowpath() that expected thp fault allocations to be caught by checking __GFP_NORETRY.
>
> > If the decision is that it's too late to revert a 4.1 regression for one kind of workload in 4.20 because it causes regression for another workload, then I guess we just revert ac5b2c18911f (patch 1) for 4.20 and don't rush a different fix (patch 2) to 4.20. It's not a big difference if a 4.1 regression is fixed in 4.20 or 4.21?
>
> The revert is certainly needed to prevent the regression, yes, but I anticipate that Andrea will report back that patch 2 at least improves the situation for the problem that he was addressing, specifically that it is pointless to thrash any node or reclaim unnecessarily when compaction has already failed. This is what setting __GFP_NORETRY for all thp fault allocations fixes.

Yes, but the earlier numbers from Mel, repeated again in [1], simply show that the swap storms are only handled in favor of an absolute drop of THP success rate.
> > Because there might be other unexpected consequences of patch 2 that testing won't be able to catch in the remaining 4.20 rc's. And I'm not even sure if it will fix Andrea's workloads. While it should prevent node-reclaim-like thrashing, it will still mean that KVM (or anyone) won't be able to allocate THP's remotely, even if the local node is exhausted of both huge and base pages.
>
> Patch 2 does nothing with respect to the remote allocation policy; it simply prevents reclaim (and potentially thrashing). Patch 1 sets __GFP_THISNODE to prevent the remote allocation.

Yes, this is understood. So we are getting the worst of both: we have a numa locality side effect of MADV_HUGEPAGE and we have poor THP utilization. So how come this is an improvement? Especially when the reported regression hasn't been demonstrated on a real or repeatable workload but rather a very vague, presumably worst case behavior where the access penalty is absolutely prevailing.

[1] http://lkml.kernel.org/r/20181204104558.gv23...@techsingularity.net
--
Michal Hocko
SUSE Labs
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Tue 04-12-18 14:25:54, David Rientjes wrote:
> On Tue, 4 Dec 2018, Michal Hocko wrote:
> > > This fixes a 13.9% remote memory access regression and 40% remote memory allocation regression on Haswell when the local node is fragmented for hugepage sized pages and memory is being faulted with either the thp defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
> > >
> > > The usecase that initially identified this issue were binaries that mremap their .text segment to be backed by transparent hugepages on startup. They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().
> >
> > Do you have something you can share with so that other people can play and try to reproduce?
>
> This is a single MADV_HUGEPAGE usecase, there is nothing special about it. It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and faulted the memory with a fragmented local node and then measured the remote access latency to the remote hugepage that occurs without setting __GFP_THISNODE. You can also measure the remote allocation latency by fragmenting the entire system and then faulting.
>
> (Remapping the text segment only involves parsing /proc/self/exe, mmap, madvise, memcpy, and mremap.)

How does this reflect your real workload and the regressions in it? It certainly shows the worst case behavior where the access penalty is prevalent, while there are no other metrics which might be interesting or even important, e.g. page table savings or the TLB pressure in general when THP fail too eagerly. As Andrea mentioned, there are really valid cases where the remote latency pays off. Have you actually seen the advertised regression in the real workload, and do you have any means to simulate that workload?

> > > This requires a full revert and partial revert of commits merged during the 4.20 rc cycle.
> > > The full revert, of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large amounts of swap activity on the local zone when faulting hugepages by falling back to remote memory. This remote allocation causes the access regression and, if fragmented, the allocation regression.
> >
> > Have you tried to measure any of the workloads Mel and Andrea have pointed out during the previous review discussion? In other words, what is the impact on the THP success rate and allocation latencies for other usecases?
>
> It isn't a property of the workload, it's a property of how fragmented both local and remote memory are. In Andrea's case, I believe he has stated that memory compaction has failed locally and the resulting reclaim activity ends up looping and causing it to thrash the local node whereas 75% of remote memory is free and not fragmented. So we have local fragmentation, and reclaim is very expensive to enable compaction to succeed, if it ever does succeed[*], and mostly free remote memory.
>
> If remote memory is also fragmented, Andrea's case will run into a much more severe swap storm as a result of not setting __GFP_THISNODE. The premise of the entire change is that his remote memory is mostly free so fallback results in a quick allocation. For balanced nodes, that's not going to be the case. The fix to prevent the heavy reclaim activity is to set __GFP_NORETRY as the page allocator suspects, which patch 2 here does.

You are not answering my question, and I take it that you haven't actually tested this in a variety of workloads and base your assumptions on an artificial worst case scenario. Please correct me if I am wrong.
--
Michal Hocko
SUSE Labs
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Tue, 4 Dec 2018, Michal Hocko wrote:
> > This fixes a 13.9% remote memory access regression and 40% remote memory allocation regression on Haswell when the local node is fragmented for hugepage sized pages and memory is being faulted with either the thp defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
> >
> > The usecase that initially identified this issue were binaries that mremap their .text segment to be backed by transparent hugepages on startup. They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().
>
> Do you have something you can share with so that other people can play and try to reproduce?

This is a single MADV_HUGEPAGE usecase, there is nothing special about it. It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and faulted the memory with a fragmented local node and then measured the remote access latency to the remote hugepage that occurs without setting __GFP_THISNODE. You can also measure the remote allocation latency by fragmenting the entire system and then faulting.

(Remapping the text segment only involves parsing /proc/self/exe, mmap, madvise, memcpy, and mremap.)

> > This requires a full revert and partial revert of commits merged during the 4.20 rc cycle. The full revert, of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large amounts of swap activity on the local zone when faulting hugepages by falling back to remote memory. This remote allocation causes the access regression and, if fragmented, the allocation regression.
>
> Have you tried to measure any of the workloads Mel and Andrea have pointed out during the previous review discussion? In other words, what is the impact on the THP success rate and allocation latencies for other usecases?

It isn't a property of the workload, it's a property of how fragmented both local and remote memory are.
In Andrea's case, I believe he has stated that memory compaction has
failed locally and the resulting reclaim activity ends up looping and
thrashing the local node, whereas 75% of remote memory is free and not
fragmented.  So we have local fragmentation, where reclaim is very
expensive relative to any benefit it provides to compaction, if compaction
ever does succeed[*], and mostly free remote memory.

If remote memory is also fragmented, Andrea's case will run into a much
more severe swap storm as a result of not setting __GFP_THISNODE.  The
premise of the entire change is that his remote memory is mostly free, so
fallback results in a quick allocation.  For balanced nodes, that's not
going to be the case.  That's an interesting memory state to be in.

The fix to prevent the heavy reclaim activity is to set __GFP_NORETRY, as
the page allocator already expects for thp fault allocations, which
patch 2 here does.

 [*] Reclaim here would only be beneficial if we fail the order-0
     watermark check in __compaction_suitable() *and* the reclaimed memory
     can be accessed during isolate_freepages().
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Tue, 4 Dec 2018, Vlastimil Babka wrote:

> So, AFAIK, the situation is:
>
> - commit 5265047ac301 in 4.1 introduced __GFP_THISNODE for THP.  The
>   intention came a bit earlier in 4.0 commit 077fcf116c8c.  (I admit
>   acking both as it seemed to make sense.)

Yes, both are based on the preference to fault local thp and fall back to
local pages before allocating remotely, because remote allocation causes
the performance regression introduced by not setting __GFP_THISNODE.

> - The resulting node-reclaim-like behavior regressed Andrea's KVM
>   workloads, but reverting it (only for madvised or non-default
>   defrag=always THP by commit ac5b2c18911f) would regress David's
>   workloads starting with 4.20 to pre-4.1 levels.

Almost, but the defrag=always case had the subtle difference of also
setting __GFP_NORETRY, whereas MADV_HUGEPAGE did not.  This was different
from the comment in __alloc_pages_slowpath() that expected thp fault
allocations to be caught by checking __GFP_NORETRY.

> If the decision is that it's too late to revert a 4.1 regression for one
> kind of workload in 4.20 because it causes regression for another
> workload, then I guess we just revert ac5b2c18911f (patch 1) for 4.20
> and don't rush a different fix (patch 2) to 4.20.  It's not a big
> difference if a 4.1 regression is fixed in 4.20 or 4.21?

The revert is certainly needed to prevent the regression, yes, but I
anticipate that Andrea will report back that patch 2 at least improves the
situation for the problem that he was addressing, specifically that it is
pointless to thrash any node or reclaim unnecessarily when compaction has
already failed.  This is what setting __GFP_NORETRY for all thp fault
allocations fixes.

> Because there might be other unexpected consequences of patch 2 that
> testing won't be able to catch in the remaining 4.20 rc's.  And I'm not
> even sure if it will fix Andrea's workloads.
> While it should prevent node-reclaim-like thrashing, it will still mean
> that KVM (or anyone) won't be able to allocate THPs remotely, even if
> the local node is exhausted of both huge and base pages.

Patch 2 does nothing with respect to the remote allocation policy; it
simply prevents reclaim (and, potentially, thrashing).  Patch 1 sets
__GFP_THISNODE to prevent the remote allocation.

Note that the commit to be reverted in patch 1, if not reverted, would
cause an even more severe regression in Andrea's case if remote memory is
fragmented as well: it opens the door to thrashing both local and remote
memory instead of only local memory.  I measured this as a 40% allocation
latency regression when purposefully fragmenting all nodes and faulting
without __GFP_THISNODE, and that was on Haswell; I can imagine it would be
even greater on Rome.
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On 12/4/18 12:50 AM, David Rientjes wrote:

> This fixes a 13.9% remote memory access regression and a 40% remote
> memory allocation regression on Haswell when the local node is fragmented
> for hugepage sized pages and memory is being faulted with either the thp
> defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
>
> The usecase that initially identified this issue was binaries that mremap
> their .text segment to be backed by transparent hugepages on startup.
> They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().
>
> This requires a full revert and a partial revert of commits merged during
> the 4.20 rc cycle.  The full revert, of ac5b2c18911f ("mm: thp: relax
> __GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large
> amounts of swap activity on the local node when faulting hugepages by
> falling back to remote memory.  This remote allocation causes the access
> regression and, if memory is fragmented, the allocation regression.
>
> This patchset also fixes that issue by not attempting direct reclaim at
> all when compaction fails to free a hugepage.  Note that if remote memory
> were also low or fragmented, ac5b2c18911f ("mm: thp: relax __GFP_THISNODE
> for MADV_HUGEPAGE mappings") would only have compounded the problem it
> attempts to address by thrashing all nodes instead of only the local node.
>
> The reverts for the stable trees will be different: just a straight revert
> of commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE
> mappings") is likely needed.
>
> Cross compiled for architectures with thp support and thp enabled:
> arc (with ISA_ARCV2), arm (with ARM_LPAE), arm64, i386, mips64, powerpc,
> s390, sparc, x86_64.
>
> Andrea, is this acceptable?

So, AFAIK, the situation is:

- commit 5265047ac301 in 4.1 introduced __GFP_THISNODE for THP.  The
  intention came a bit earlier in 4.0 commit 077fcf116c8c.  (I admit
  acking both as it seemed to make sense.)
- The resulting node-reclaim-like behavior regressed Andrea's KVM
  workloads, but reverting it (only for madvised or non-default
  defrag=always THP by commit ac5b2c18911f) would regress David's
  workloads starting with 4.20 to pre-4.1 levels.

If the decision is that it's too late to revert a 4.1 regression for one
kind of workload in 4.20 because it causes a regression for another
workload, then I guess we just revert ac5b2c18911f (patch 1) for 4.20 and
don't rush a different fix (patch 2) to 4.20.  It's not a big difference
if a 4.1 regression is fixed in 4.20 or 4.21?

Because there might be other unexpected consequences of patch 2 that
testing won't be able to catch in the remaining 4.20 rc's.  And I'm not
even sure if it will fix Andrea's workloads.  While it should prevent
node-reclaim-like thrashing, it will still mean that KVM (or anyone)
won't be able to allocate THPs remotely, even if the local node is
exhausted of both huge and base pages.

> ---
>  drivers/gpu/drm/ttm/ttm_page_alloc.c     |  8 +++---
>  drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |  3 --
>  include/linux/gfp.h                      |  3 +-
>  include/linux/mempolicy.h                |  2 -
>  mm/huge_memory.c                         | 41 +++
>  mm/mempolicy.c                           |  7 +++--
>  mm/page_alloc.c                          | 16
>  7 files changed, 42 insertions(+), 38 deletions(-)
Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
On Mon 03-12-18 15:50:18, David Rientjes wrote:

> This fixes a 13.9% remote memory access regression and a 40% remote
> memory allocation regression on Haswell when the local node is fragmented
> for hugepage sized pages and memory is being faulted with either the thp
> defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
>
> The usecase that initially identified this issue was binaries that mremap
> their .text segment to be backed by transparent hugepages on startup.
> They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().

Do you have something you can share so that other people can play with it
and try to reproduce?

> This requires a full revert and a partial revert of commits merged during
> the 4.20 rc cycle.  The full revert, of ac5b2c18911f ("mm: thp: relax
> __GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large
> amounts of swap activity on the local node when faulting hugepages by
> falling back to remote memory.  This remote allocation causes the access
> regression and, if memory is fragmented, the allocation regression.

Have you tried to measure any of the workloads Mel and Andrea have
pointed out during the previous review discussion?  In other words, what
is the impact on the THP success rate and allocation latencies for other
usecases?

--
Michal Hocko
SUSE Labs
[patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions
This fixes a 13.9% remote memory access regression and a 40% remote
memory allocation regression on Haswell when the local node is fragmented
for hugepage sized pages and memory is being faulted with either the thp
defrag setting of "always" or has been madvised with MADV_HUGEPAGE.

The usecase that initially identified this issue was binaries that mremap
their .text segment to be backed by transparent hugepages on startup.
They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().

This requires a full revert and a partial revert of commits merged during
the 4.20 rc cycle.  The full revert, of ac5b2c18911f ("mm: thp: relax
__GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large
amounts of swap activity on the local node when faulting hugepages by
falling back to remote memory.  This remote allocation causes the access
regression and, if memory is fragmented, the allocation regression.

This patchset also fixes that issue by not attempting direct reclaim at
all when compaction fails to free a hugepage.  Note that if remote memory
were also low or fragmented, ac5b2c18911f ("mm: thp: relax __GFP_THISNODE
for MADV_HUGEPAGE mappings") would only have compounded the problem it
attempts to address by thrashing all nodes instead of only the local node.

The reverts for the stable trees will be different: just a straight revert
of commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE
mappings") is likely needed.

Cross compiled for architectures with thp support and thp enabled:
arc (with ISA_ARCV2), arm (with ARM_LPAE), arm64, i386, mips64, powerpc,
s390, sparc, x86_64.

Andrea, is this acceptable?
---
 drivers/gpu/drm/ttm/ttm_page_alloc.c     |  8 +++---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |  3 --
 include/linux/gfp.h                      |  3 +-
 include/linux/mempolicy.h                |  2 -
 mm/huge_memory.c                         | 41 +++
 mm/mempolicy.c                           |  7 +++--
 mm/page_alloc.c                          | 16
 7 files changed, 42 insertions(+), 38 deletions(-)