Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Thu 06-12-18 15:43:26, David Rientjes wrote:
> On Wed, 5 Dec 2018, Linus Torvalds wrote:
> > > Ok, I've applied David's latest patch.
> > >
> > > I'm not at all objecting to tweaking this further, I just didn't want
> > > to have this regression stand.
> >
> > Hmm. Can somebody (David?) also perhaps try to state what the
> > different latency impacts end up being? I suspect it's been mentioned
> > several times during the argument, but it would be nice to have a
> > "going forward, this is what I care about" kind of setup for good
> > default behavior.
>
> I'm in the process of writing a more complete test case for this but I
> benchmarked a few platforms based solely on local hugepages vs local
> small pages vs remote hugepages.  My previous numbers were based on data
> from actual workloads.

Has this materialized into anything we can use? We plan to discuss this
particular topic at LSF/MM this year and it would be great to have
something to play with. I am quite nervous that we have left quite a
common case with bad performance based on a complaint that we cannot
really reproduce, so it is really hard to move on.
-- 
Michal Hocko
SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Fri, Dec 21, 2018 at 02:18:45PM -0800, David Rientjes wrote:
> On Fri, 14 Dec 2018, Vlastimil Babka wrote:
> > > It would be interesting to know if anybody has tried using the per-zone
> > > free_area's to determine migration targets and set a bit if it should be
> > > considered a migration source or a migration target.  If all pages for a
> > > pageblock are not on free_areas, they are fully used.
> >
> > Repurposing/adding a new pageblock bit was in my mind to help multiple
> > compactors not undo each other's work in the scheme where there's no
> > free page scanner, but I didn't implement it yet.
>
> It looks like Mel has a series posted that is still implemented with
> linear scans through memory, so I'm happy to move the discussion there; I
> think the goal for compaction with regard to this thread is determining
> whether reclaim in the page allocator would actually be useful, since
> targeted reclaim to make memory available for isolate_freepages() could
> be expensive.  I'd hope that we could move in a direction where
> compaction doesn't care where the pageblock is and does the minimal
> amount of work possible to make a high-order page available; I'm not
> sure if that's possible with a linear scan.  I'll take a look at Mel's
> series though.

That series has evolved significantly because there were a lot of missing
pieces.  While it's somewhat ready other than badly written changelogs, I
didn't post it because I'm going offline and wouldn't be able to respond
to feedback, and I imagine others are offline too and unavailable for
review.  Besides, the merge window is about to open and I know there are
patches in Andrew's tree for mainline that should be taken into account.
The series is now 25 patches long and covers a lot of prerequisites that
would be necessary before removing the linear scanner.  What is critical
for a purely free-list-based scanner is that the exit conditions are
identified, and the series provides a lot of the pieces.
For example, a non-linear scanner must properly control skip bits and
isolate pageblocks from multiple compaction instances, which this series
does.  The main takeaway from the series is that it reduces system CPU
usage by 17%, reduces free scan rates by 99.5% and increases THP
allocation success rates by 33%, giving almost 99% allocation success
rates.  It also:

o Isolates pageblocks for a single compaction instance
o Synchronises async/sync scanners when appropriate to reduce rescanning
o Identifies when a pageblock is being rescanned and is "sticky" and
  makes forward progress instead of looping excessively
o Smarter logic when clearing pageblock skip bits to reduce scanning
o Various different methods for reducing unnecessary scanning
o Better handling of contention
o Avoids compaction of remote nodes in direct compaction context

If you do not want to wait until the new year, it's at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-fast-compact-v2r15

Preliminary results based on thpscale using MADV_HUGEPAGE to allocate
huge pages on a fragmented system.
thpscale Fault Latencies
                                4.20.0-rc6             4.20.0-rc6
                            mmotm-20181210         noremote-v2r14
Amean     fault-both-1      864.83 (   0.00%)    1006.88 * -16.43%*
Amean     fault-both-3     3566.05 (   0.00%)    2460.97 *  30.99%*
Amean     fault-both-5     5685.02 (   0.00%)    4052.92 *  28.71%*
Amean     fault-both-7     7289.40 (   0.00%)    5929.65 (  18.65%)
Amean     fault-both-12   10937.46 (   0.00%)    8870.53 (  18.90%)
Amean     fault-both-18   15440.48 (   0.00%)   11464.86 *  25.75%*
Amean     fault-both-24   15345.83 (   0.00%)   13040.01 *  15.03%*
Amean     fault-both-30   20159.73 (   0.00%)   16618.73 *  17.56%*
Amean     fault-both-32   20843.51 (   0.00%)   14401.25 *  30.91%*

Fault latency (either huge or base) is mostly improved, even when 32
tasks are trying to allocate huge pages on an 8-CPU single socket
machine where contention is a factor.

thpscale Percentage Faults Huge
                                4.20.0-rc6             4.20.0-rc6
                            mmotm-20181210         noremote-v2r14
Percentage huge-1       96.03 (   0.00%)      96.94 (   0.95%)
Percentage huge-3       71.43 (   0.00%)      95.43 (  33.60%)
Percentage huge-5       70.44 (   0.00%)      96.85 (  37.48%)
Percentage huge-7       70.39 (   0.00%)      94.77 (  34.63%)
Percentage huge-12      71.53 (   0.00%)      98.07 (  37.11%)
Percentage huge-18      70.61 (   0.00%)      98.42 (  39.38%)
Percentage huge-24      71.84 (   0.00%)      97.85 (  36.20%)
Percentage huge-30      69.94 (   0.00%)      98.13 (  40.31%)
Percentage huge-32      66.92 (   0.00%)      97.79 (  46.13%)

96-98% of THP requests get huge pages on request.

                 4.20.0-rc6      4.20.0-rc6
             mmotm-20181210  noremote-v2r14
User              27.30           27.86
System           192.70          159.42
Elapsed          580.13          571.98
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Fri, 14 Dec 2018, Vlastimil Babka wrote:
> > It would be interesting to know if anybody has tried using the per-zone
> > free_area's to determine migration targets and set a bit if it should be
> > considered a migration source or a migration target.  If all pages for a
> > pageblock are not on free_areas, they are fully used.
>
> Repurposing/adding a new pageblock bit was in my mind to help multiple
> compactors not undo each other's work in the scheme where there's no
> free page scanner, but I didn't implement it yet.

It looks like Mel has a series posted that is still implemented with
linear scans through memory, so I'm happy to move the discussion there; I
think the goal for compaction with regard to this thread is determining
whether reclaim in the page allocator would actually be useful, since
targeted reclaim to make memory available for isolate_freepages() could
be expensive.  I'd hope that we could move in a direction where
compaction doesn't care where the pageblock is and does the minimal
amount of work possible to make a high-order page available; I'm not
sure if that's possible with a linear scan.  I'll take a look at Mel's
series though.
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Fri, 14 Dec 2018, Mel Gorman wrote:
> > In other words, I think there is a lot of potential stranding that occurs
> > for both scanners that could otherwise result in completely free
> > pageblocks.  If there is a single movable page present near the end of
> > the zone in an otherwise fully free pageblock, surely we can do better
> > than the current implementation that would never consider this very easy
> > to compact memory.
>
> While it's somewhat premature, I posted a series before I had a full set
> of results because it uses free lists to reduce searches and reduces
> interference between multiple scanners.  Preliminary results indicated it
> boosted allocation success rates by 20%ish, reduced migration scanning
> by 99% and free scanning by 27%.

Always good to have code to look at, I'll take a closer look.  I've
unfortunately been distracted with other kernel issues lately :/
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Fri, Dec 14, 2018 at 01:04:11PM -0800, David Rientjes wrote:
> On Wed, 12 Dec 2018, Vlastimil Babka wrote:
> > > Regarding the role of direct reclaim in the allocator, I think we
> > > need work on the feedback from compaction to determine whether it's
> > > worthwhile.  That's difficult because of the point I continue to
> > > bring up: isolate_freepages() is not necessarily always able to
> > > access this freed memory.
> >
> > That's one of the *many* reasons why having free base pages doesn't
> > guarantee compaction success.  We can and will improve on that.  But I
> > don't think it would be e.g. practical to check the pfns of free pages
> > wrt compaction scanner positions and decide based on that.
>
> Yeah, agreed.  Rather than proposing that memory is only reclaimed if
> it's known that it can be accessible to isolate_freepages(), I'm
> wondering about the implementation of the freeing scanner entirely.
>
> In other words, I think there is a lot of potential stranding that
> occurs for both scanners that could otherwise result in completely free
> pageblocks.  If there is a single movable page present near the end of
> the zone in an otherwise fully free pageblock, surely we can do better
> than the current implementation that would never consider this very
> easy to compact memory.

While it's somewhat premature, I posted a series before I had a full set
of results because it uses free lists to reduce searches and reduces
interference between multiple scanners.  Preliminary results indicated it
boosted allocation success rates by 20%ish, reduced migration scanning
by 99% and free scanning by 27%.

> The same problem occurs for the migration scanner, where we can iterate
> over a ton of free memory that is never considered a suitable migration
> target.  The implementation that attempts to migrate all memory toward
> the end of the zone penalizes the freeing scanner when it is reset: we
> just iterate over a ton of used pages.
>

Yes, partially addressed in the series.  It can be improved significantly
but it hit a boundary condition near the point where the compaction
scanners meet.  I dropped the patch in question as it needs more thought
on how to deal with the boundary condition without remigrating the blocks
close to it.  Besides, at 14 patches, it would probably be best to get
that reviewed and finalised before building upon it further, so review
would be welcome.

> Has anybody tried a migration scanner that isn't linearly based, rather
> finding the highest-order free page of the same migratetype, iterating
> the pages of its pageblock, and using this to determine whether the
> actual migration will be worthwhile or not?  I could imagine
> pageblock_skip being repurposed for this as the heuristic.

Yes, but it has downsides.  Redoing the same work on pageblocks, tracking
state and tracking the exit conditions are tricky.  I think it's best to
squeeze the most out of the linear scanning first, and the series is the
first step in that.

> It would be interesting to know if anybody has tried using the per-zone
> free_area's to determine migration targets and set a bit if it should be
> considered a migration source or a migration target.  If all pages for a
> pageblock are not on free_areas, they are fully used.

The series has patches which implement something similar to this idea.
-- 
Mel Gorman
SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On 12/14/18 10:04 PM, David Rientjes wrote:
> On Wed, 12 Dec 2018, Vlastimil Babka wrote:

...

> Reclaim likely could be deterministically useful if we consider a
> redesign of how migration sources and targets are determined in
> compaction.
>
> Has anybody tried a migration scanner that isn't linearly based, rather
> finding the highest-order free page of the same migratetype, iterating
> the pages of its pageblock, and using this to determine whether the
> actual migration will be worthwhile or not?

Not exactly that AFAIK, but a year ago in my series [1], patch 6 made the
migration scanner 'prescan' the block of the requested order before
actually trying to isolate anything for migration.

> I could imagine pageblock_skip being
> repurposed for this as the heuristic.
>
> Finding migration targets would be more tricky, but if we iterate the
> pages of the pageblock for low-order free pages and find them to be
> mostly used, that seems more appropriate than just pushing all memory
> to the end of the zone?

Agree.  That was patch 8/8 of the same series [1].

> It would be interesting to know if anybody has tried using the per-zone
> free_area's to determine migration targets and set a bit if it should be
> considered a migration source or a migration target.  If all pages for a
> pageblock are not on free_areas, they are fully used.

Repurposing/adding a new pageblock bit was in my mind to help multiple
compactors not undo each other's work in the scheme where there's no
free page scanner, but I didn't implement it yet.

>>> otherwise we fail and defer because it wasn't able
>>> to make a hugepage available.
>>
>> Note that THP fault compaction doesn't actually defer itself, which I
>> think is a weakness of the current implementation and hope that patch 3
>> in my series from yesterday [1] can address that.  Because deferring is
>> the general feedback mechanism that we have for suppressing compaction
>> (and thus associated reclaim) in cases it fails for any reason, not
>> just the one you mention.  Instead of inspecting failure conditions in
>> detail, which would be costly, it's a simple statistical approach.  And
>> when compaction is improved to fail less, deferring automatically also
>> happens less.
>
> I couldn't get the link to work, unfortunately, I don't think the patch
> series made it to LKML :/  I do see it archived for linux-mm, though,
> so I'll take a look, thanks!

Yeah, I forgot to Cc: LKML, but you were also in direct To: so you should
have received them directly.  Also the above-mentioned series, but that's
a year ago.  My fault for not returning to it after being done with the
Meltdown fun.  I hope to do that soon.

[1] https://marc.info/?l=linux-mm&m=151315560308753

>> [1] https://lkml.kernel.org/r/20181211142941.20500-1-vba...@suse.cz
>>
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, 12 Dec 2018, Vlastimil Babka wrote:
> > Regarding the role of direct reclaim in the allocator, I think we need
> > work on the feedback from compaction to determine whether it's
> > worthwhile.  That's difficult because of the point I continue to bring
> > up: isolate_freepages() is not necessarily always able to access this
> > freed memory.
>
> That's one of the *many* reasons why having free base pages doesn't
> guarantee compaction success.  We can and will improve on that.  But I
> don't think it would be e.g. practical to check the pfns of free pages
> wrt compaction scanner positions and decide based on that.

Yeah, agreed.  Rather than proposing that memory is only reclaimed if
it's known that it can be accessible to isolate_freepages(), I'm
wondering about the implementation of the freeing scanner entirely.

In other words, I think there is a lot of potential stranding that occurs
for both scanners that could otherwise result in completely free
pageblocks.  If there is a single movable page present near the end of
the zone in an otherwise fully free pageblock, surely we can do better
than the current implementation that would never consider this very easy
to compact memory.

For hugepages, we don't care what pageblock we allocate from.  There are
requirements for MAX_ORDER-1, but I assume we shouldn't optimize for
these cases (and if CMA has requirements for a migration/freeing scanner
redesign, I think that can be special cased).

The same problem occurs for the migration scanner, where we can iterate
over a ton of free memory that is never considered a suitable migration
target.  The implementation that attempts to migrate all memory toward
the end of the zone penalizes the freeing scanner when it is reset: we
just iterate over a ton of used pages.

Reclaim likely could be deterministically useful if we consider a
redesign of how migration sources and targets are determined in
compaction.
Has anybody tried a migration scanner that isn't linearly based, rather
finding the highest-order free page of the same migratetype, iterating
the pages of its pageblock, and using this to determine whether the
actual migration will be worthwhile or not?  I could imagine
pageblock_skip being repurposed for this as the heuristic.

Finding migration targets would be more tricky, but if we iterate the
pages of the pageblock for low-order free pages and find them to be
mostly used, that seems more appropriate than just pushing all memory to
the end of the zone?

It would be interesting to know if anybody has tried using the per-zone
free_area's to determine migration targets and set a bit if it should be
considered a migration source or a migration target.  If all pages for a
pageblock are not on free_areas, they are fully used.

> > otherwise we fail and defer because it wasn't able
> > to make a hugepage available.
>
> Note that THP fault compaction doesn't actually defer itself, which I
> think is a weakness of the current implementation and hope that patch 3
> in my series from yesterday [1] can address that.  Because deferring is
> the general feedback mechanism that we have for suppressing compaction
> (and thus associated reclaim) in cases it fails for any reason, not
> just the one you mention.  Instead of inspecting failure conditions in
> detail, which would be costly, it's a simple statistical approach.  And
> when compaction is improved to fail less, deferring automatically also
> happens less.

I couldn't get the link to work, unfortunately, I don't think the patch
series made it to LKML :/  I do see it archived for linux-mm, though, so
I'll take a look, thanks!

> [1] https://lkml.kernel.org/r/20181211142941.20500-1-vba...@suse.cz
>
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed 12-12-18 12:00:16, Andrea Arcangeli wrote:
[...]
> Adding MADV_THISNODE/MADV_NODE_RECLAIM will guarantee his proprietary
> software binary will run at maximum performance without cache
> interference, and he's happy to accept the risk of massive slowdown in
> case the local node is truly OOM.  The fallback, despite being very
> inefficient, will still happen without the OOM killer triggering.

I believe this fits much better into an MPOL_$FOO rather than an
MADV_$FOO.  But other than that I fully agree.  There are reasonable
usecases for node-reclaim-like behavior.  As a bonus you do not get the
local node only, but all nodes within the reclaim distance as well.
-- 
Michal Hocko
SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, Dec 12, 2018 at 10:50:51AM +0100, Michal Hocko wrote:
> I can be convinced that larger pages really require a different behavior
> than base pages but you should better show _real_ numbers on a wider
> variety workloads to back your claims. I have only heard hand waving and

I agree with your point about node_reclaim, and I think David's complaint
of "I got remote THP instead of local 4k" with our proposed fix is going
to morph into "I got remote 4k instead of local 4k" with his favorite
fix.  Because David stopped calling reclaim with __GFP_THISNODE, the
moment the node is full of pagecache the node_reclaim behavior will go
away and even 4k pages will start to be allocated remote (and because of
__GFP_THISNODE set in the THP allocation, all readily available or
trivial-to-compact remote THP will be ignored too).

What David needs, I think, is a way to set __GFP_THISNODE for THP *and
4k* allocations, and if both fail in a row with __GFP_THISNODE set, we
need to repeat the whole thing without __GFP_THISNODE set (ideally with a
mask to skip the node that we already scraped down to the bottom during
the initial __GFP_THISNODE pass).  This way his proprietary software
binary will work even better than before when the local node is
fragmented, and he'll finally be able to get the speedup from remote THP
too in case the local node is truly OOM but all other nodes are full of
readily available THP.

To achieve this without a new MADV_THISNODE/MADV_NODE_RECLAIM, we'd need
a way to start with __GFP_THISNODE and then draw the line in reclaim and
decide to drop __GFP_THISNODE when too much pressure mounts in the local
node.  But, like you said, that becomes like node_reclaim, and it would
be better if it can be done with an opt-in like MADV_HUGEPAGE, because
not all workloads would benefit from such extra pagecache reclaim cost
(just like not all workloads benefit from synchronous compaction).
I think some NUMA-reclaim-mode semantics ended up being embedded and
hidden in the THP MADV_HUGEPAGE, but they imposed a massive slowdown on
all workloads that can't cope with the node_reclaim mode behavior
because they don't fit in a node.

Adding MADV_THISNODE/MADV_NODE_RECLAIM will guarantee his proprietary
software binary will run at maximum performance without cache
interference, and he's happy to accept the risk of massive slowdown in
case the local node is truly OOM.  The fallback, despite being very
inefficient, will still happen without the OOM killer triggering.

Thanks,
Andrea
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Hello,

I now found a two socket EPYC (is this Naples?) to try to confirm the
THP effect of intra-socket vs inter-socket THP.

CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        8
NUMA node0 CPU(s):   0-7,64-71
NUMA node1 CPU(s):   8-15,72-79
NUMA node2 CPU(s):   16-23,80-87
NUMA node3 CPU(s):   24-31,88-95
NUMA node4 CPU(s):   32-39,96-103
NUMA node5 CPU(s):   40-47,104-111
NUMA node6 CPU(s):   48-55,112-119
NUMA node7 CPU(s):   56-63,120-127

# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 32658 MB
node 0 free: 31554 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 32767 MB
node 1 free: 31854 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 32767 MB
node 2 free: 31535 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 32767 MB
node 3 free: 31777 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 32767 MB
node 4 free: 31949 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 32767 MB
node 5 free: 31957 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 32767 MB
node 6 free: 31945 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 32767 MB
node 7 free: 31958 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

# for i in 0 8 16 24 32 40 48 56; do numactl -m 0 -C $i /tmp/numa-thp-bench; done

cpu 0 (local):
random writes MADV_HUGEPAGE   17622885 usec
random writes MADV_NOHUGEPAGE 25316593 usec
random writes MADV_NOHUGEPAGE 25291927 usec
random writes MADV_HUGEPAGE   17672446 usec

cpu 8 (intrasocket remote):
random writes MADV_HUGEPAGE   25698555 usec
random writes MADV_NOHUGEPAGE 36413941 usec
random writes MADV_NOHUGEPAGE 36402155 usec
random writes MADV_HUGEPAGE   25689574 usec

cpu 16 (intrasocket remote):
random writes MADV_HUGEPAGE   25136558 usec
random writes MADV_NOHUGEPAGE 35562724 usec
random writes MADV_NOHUGEPAGE 35504708 usec
random writes MADV_HUGEPAGE   25123186 usec

cpu 24 (intrasocket remote):
random writes MADV_HUGEPAGE   25137002 usec
random writes MADV_NOHUGEPAGE 35577429 usec
random writes MADV_NOHUGEPAGE 35582865 usec
random writes MADV_HUGEPAGE   25116561 usec

cpu 32 (intersocket remote):
random writes MADV_HUGEPAGE   40281721 usec
random writes MADV_NOHUGEPAGE 56891233 usec
random writes MADV_NOHUGEPAGE 56924134 usec
random writes MADV_HUGEPAGE   40286512 usec

cpu 40 (intersocket remote):
random writes MADV_HUGEPAGE   40377662 usec
random writes MADV_NOHUGEPAGE 56731400 usec
random writes MADV_NOHUGEPAGE 56443959 usec
random writes MADV_HUGEPAGE   40379022 usec

cpu 48 (intersocket remote, node 6):
random writes MADV_HUGEPAGE   33907588 usec
random writes MADV_NOHUGEPAGE 47609976 usec
random writes MADV_NOHUGEPAGE 47523481 usec
random writes MADV_HUGEPAGE   33881974 usec

cpu 56 (intersocket remote):
random writes MADV_HUGEPAGE   40809719 usec
random writes MADV_NOHUGEPAGE 57148321 usec
random writes MADV_NOHUGEPAGE 57164499 usec
random writes MADV_HUGEPAGE   40802979 usec

# grep EPYC /proc/cpuinfo | head -1
model name      : AMD EPYC 7601 32-Core Processor

I suppose nodes 0-1-2-3 are socket 0 and nodes 4-5-6-7 are socket 1.
With the RAM kept in nodeid 0, cpuid 0 is NUMA local, cpuids 8, 16, 24
are NUMA intrasocket remote, and cpuids 32, 40, 48, 56 are NUMA
intersocket remote.

local 4k -> local THP: +43.6% improvement

local 4k -> intrasocket remote THP: -1.4%
intrasocket remote 4k -> intrasocket remote THP: +41.6%
local 4k -> intrasocket remote 4k: -30.4%
local THP -> intrasocket remote THP: -31.4%

local 4k -> intersocket remote THP: -37.15% (-25% on node 6?)
intersocket remote 4k -> intersocket remote THP: +41.23%
local 4k -> intersocket remote 4k: -55.5% (-46% on node 6?)
local THP -> intersocket remote THP: -56.25% (-47% on node 6?)

In short, intersocket is a whole lot more expensive (4k -55%, THP -56%)
than intrasocket (4k -30%, THP -31%)... as expected.

The benefits of THP vs 4k remain about the same for intrasocket
(+41.6%), intersocket (+41.23%) and local (+43.6%), also as expected.
The above was measured on bare metal; in guests the impact of THP as
usual will be multiplied (I can try to measure that another time).

So while before I couldn't confirm whether THP helped across sockets, I
think I can now confirm on this architecture that it helps just like
intrasocket and local.  Especially intrasocket, the slowdown from remote
THP compared to local 4k is a tiny -1%, so in theory __GFP_THISNODE
would at least need to become a __GFP_THISSOCKET for this architecture
(I'm not suggesting that, I'm talking in theory).  Intrasocket is in
fact even more favorable to remote THP than a 2 node 1 socket
threadripper and a 2 node (2 sockets?) skylake, even on bare metal.
Losing the +41% THP benefit
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On 12/12/18 1:37 AM, David Rientjes wrote:
> Regarding the role of direct reclaim in the allocator, I think we need
> work on the feedback from compaction to determine whether it's
> worthwhile.  That's difficult because of the point I continue to bring
> up: isolate_freepages() is not necessarily always able to access this
> freed memory.

That's one of the *many* reasons why having free base pages doesn't
guarantee compaction success.  We can and will improve on that.  But I
don't think it would be e.g. practical to check the pfns of free pages
wrt compaction scanner positions and decide based on that.  Also, when
you invoke reclaim, you can't tell those pfns in advance, so I'm not
sure how the better feedback from compaction to reclaim for this
specific aspect would be supposed to work?

> But for cases where we get COMPACT_SKIPPED because the order-0
> watermarks are failing, reclaim *is* likely to have an impact in the
> success of compaction,

Yes, that's the heuristic we rely on.

> otherwise we fail and defer because it wasn't able
> to make a hugepage available.

Note that THP fault compaction doesn't actually defer itself, which I
think is a weakness of the current implementation and hope that patch 3
in my series from yesterday [1] can address that.  Because deferring is
the general feedback mechanism that we have for suppressing compaction
(and thus associated reclaim) in cases it fails for any reason, not just
the one you mention.  Instead of inspecting failure conditions in
detail, which would be costly, it's a simple statistical approach.  And
when compaction is improved to fail less, deferring automatically also
happens less.

> [ If we run compaction regardless of the order-0 watermark check and
>   find a pageblock where we can likely free a hugepage because it is
>   fragmented movable pages, this is a pretty good indication that
>   reclaim is worthwhile iff the reclaimed memory is beyond the
>   migration scanner. ]

I don't think that would be a good direction to pursue, to let scanning
happen even without having the free pages.  Also, as I've mentioned
above, LRU-based reclaim cannot satisfy your 'iff' condition unless it
inspected the pfns it freed, and continued reclaiming until enough of
those beyond the migration scanner were freed.  Instead, IMHO we should
look again into replacing the free scanner with direct allocation from
the freelists.

[1] https://lkml.kernel.org/r/20181211142941.20500-1-vba...@suse.cz
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Tue 11-12-18 16:37:22, David Rientjes wrote:
[...]
> Since it depends on the workload, specifically workloads that fit within
> a single node, I think the reasonable approach would be to have a sane
> default regardless of the use of MADV_HUGEPAGE or thp defrag settings,
> and then optimize for the minority of cases where the workload does not
> fit in a single node.  I'm assuming there is no debate about these
> larger workloads being in the minority, although we have single
> machines where this encompasses the totality of their workloads.

Your assumption is wrong, I believe.  This is the fundamental
disagreement we are discussing here.  You are essentially arguing for
node_reclaim (formerly zone_reclaim) behavior for THP pages.  All that
without any actual data on a wider variety of workloads.  As a matter of
_fact_ we know that node_reclaim behavior is not a suitable default.  We
made that mistake in the past and we had to revert the default _exactly_
because a wider variety of workloads suffered from over-reclaim and
performance issues as a result of constant reclaim.

You also haven't explained why you care so much about remote THP while
you do not care about remote base pages (the page allocator falls back
to those as soon as kswapd doesn't keep pace with the allocation rate;
THP, or high-order pages in general, are analogous, with kcompactd doing
pro-active compaction).  Like base pages, we do not want larger pages to
fall back to a remote node too easily.  There is no question about that,
I believe.

I can be convinced that larger pages really require a different behavior
than base pages, but you should show _real_ numbers on a wider variety
of workloads to back your claims.  I have only heard hand waving and
very vague and quite doubtful numbers for a non-disclosed benchmark
without a clear indication of how it relates to real-world workloads.
So color me unconvinced.
-- 
Michal Hocko
SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Sun, 9 Dec 2018, Andrea Arcangeli wrote:
> You didn't release the proprietary software that depends on
> __GFP_THISNODE behavior and that you're afraid is getting a
> regression.
>
> Could you at least release with an open source license the benchmark
> software that you must have used to do the above measurement to
> understand why it gives such a weird result on remote THP?

Hi Andrea,

As I said in response to Linus, I'm in the process of writing a more
complete benchmarking test across all of our platforms for access and
allocation latency for x86 (both Intel and AMD), POWER8/9, and arm64,
and doing so on a kernel with minimum overhead (for the allocation
latency, I want to remove things like mem cgroup overhead from the
result).

> On skylake and on the threadripper I can't confirm that there isn't a
> significant benefit from cross socket hugepage over cross socket small
> page.
>
> Skylake Xeon(R) Gold 5115:
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
> node 0 size: 15602 MB
> node 0 free: 14077 MB
> node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
> node 1 size: 16099 MB
> node 1 free: 15949 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> # numactl -m 0 -C 0 ./numa-thp-bench
> random writes MADV_HUGEPAGE 10109753 usec
> random writes MADV_NOHUGEPAGE 13682041 usec
> random writes MADV_NOHUGEPAGE 13704208 usec
> random writes MADV_HUGEPAGE 10120405 usec
> # numactl -m 0 -C 10 ./numa-thp-bench
> random writes MADV_HUGEPAGE 15393923 usec
> random writes MADV_NOHUGEPAGE 19644793 usec
> random writes MADV_NOHUGEPAGE 19671287 usec
> random writes MADV_HUGEPAGE 15495281 usec
> # grep Xeon /proc/cpuinfo | head -1
> model name      : Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz
>
> local 4k -> local 2m: +35%
> local 4k -> remote 2m: -11%
> remote 4k -> remote 2m: +26%
>
> threadripper 1950x:
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> node 0 size: 15982 MB
> node 0 free: 14422 MB
> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> node 1 size: 16124 MB
> node 1 free: 5357 MB
> node distances:
> node   0   1
>   0:  10  16
>   1:  16  10
> # numactl -m 0 -C 0 /tmp/numa-thp-bench
> random writes MADV_HUGEPAGE 12902667 usec
> random writes MADV_NOHUGEPAGE 17543070 usec
> random writes MADV_NOHUGEPAGE 17568858 usec
> random writes MADV_HUGEPAGE 12896588 usec
> # numactl -m 0 -C 8 /tmp/numa-thp-bench
> random writes MADV_HUGEPAGE 19663515 usec
> random writes MADV_NOHUGEPAGE 27819864 usec
> random writes MADV_NOHUGEPAGE 27844066 usec
> random writes MADV_HUGEPAGE 19662706 usec
> # grep Threadripper /proc/cpuinfo | head -1
> model name      : AMD Ryzen Threadripper 1950X 16-Core Processor
>
> local 4k -> local 2m: +35%
> local 4k -> remote 2m: -10%
> remote 4k -> remote 2m: +41%
>
> Or if you prefer it reversed in terms of compute time (a negative
> percentage is better in this case):
>
> local 4k -> local 2m: -26%
> local 4k -> remote 2m: +12%
> remote 4k -> remote 2m: -29%
>
> It's true that local 4k is generally a win vs remote THP when the
> workload is memory bound also for the threadripper; the threadripper
> seems even more favorable to remote THP than the skylake Xeon is.

My results are organized slightly differently, since they consider local
hugepages as the baseline, which is what we optimize for: on Broadwell,
I've obtained more accurate results that show local small pages at
+3.8%, remote hugepages at +12.8% and remote small pages at +18.8%.  I
think we both agree that the locality preference for workloads that fit
within a single node is local hugepage -> local small page -> remote
hugepage -> remote small page, and that has been unchanged in any of the
benchmarking results for either of us.

> The above is the host bare metal result.  Now let's try guest mode on
> the threadripper.
The last two lines seems more reliable (the first > two lines also needs to fault in the guest RAM because the guest > was fresh booted). > > guest backed by local 2M pages: > > random writes MADV_HUGEPAGE 16025855 usec > random writes MADV_NOHUGEPAGE 21903002 usec > random writes MADV_NOHUGEPAGE 19762767 usec > random writes MADV_HUGEPAGE 15189231 usec > > guest backed by remote 2M pages: > > random writes MADV_HUGEPAGE 25434251 usec > random writes MADV_NOHUGEPAGE 32404119 usec > random writes MADV_NOHUGEPAGE 31455592 usec > random writes MADV_HUGEPAGE 22248304 usec > > guest backed by local 4k pages: > > random writes MADV_HUGEPAGE 28945251 usec > random writes MADV_NOHUGEPAGE 32217690 usec > random writes MADV_NOHUGEPAGE 30664731 usec > random writes MADV_HUGEPAGE 22981082 usec > > guest backed by remote 4k pages: > > random writes MADV_HUGEPAGE 43772939 usec > random writes MADV_NOHUGEPAGE 52745664 usec > random writes MADV_NOHUGEPAGE 51632065 usec > random writes MADV_HUGEPAGE 40263194 usec > > I haven't yet tried the guest mode on the skylake nor haswell/broadwell.
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Hello, On Sun, Dec 09, 2018 at 04:29:13PM -0800, David Rientjes wrote: > [..] on this platform, at least, hugepages are > preferred on the same socket but there isn't a significant benefit from > getting a cross socket hugepage over small page. [..] You didn't release the proprietary software that depends on __GFP_THISNODE behavior and that you're afraid is getting a regression. Could you at least release with an open source license the benchmark software that you must have used to do the above measurement to understand why it gives such a weird result on remote THP? On skylake and on the threadripper I can't confirm that there isn't a significant benefit from cross socket hugepage over cross socket small page. Skylake Xeon(R) Gold 5115: # numactl --hardware available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29 node 0 size: 15602 MB node 0 free: 14077 MB node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39 node 1 size: 16099 MB node 1 free: 15949 MB node distances: node 0 1 0: 10 21 1: 21 10 # numactl -m 0 -C 0 ./numa-thp-bench random writes MADV_HUGEPAGE 10109753 usec random writes MADV_NOHUGEPAGE 13682041 usec random writes MADV_NOHUGEPAGE 13704208 usec random writes MADV_HUGEPAGE 10120405 usec # numactl -m 0 -C 10 ./numa-thp-bench random writes MADV_HUGEPAGE 15393923 usec random writes MADV_NOHUGEPAGE 19644793 usec random writes MADV_NOHUGEPAGE 19671287 usec random writes MADV_HUGEPAGE 15495281 usec # grep Xeon /proc/cpuinfo |head -1 model name : Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz local 4k -> local 2m: +35% local 4k -> remote 2m: -11% remote 4k -> remote 2m: +26% threadripper 1950x: # numactl --hardware available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23 node 0 size: 15982 MB node 0 free: 14422 MB node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31 node 1 size: 16124 MB node 1 free: 5357 MB node distances: node 0 1 0: 10 16 1: 16 10 # numactl -m 0 -C 0 
/tmp/numa-thp-bench random writes MADV_HUGEPAGE 12902667 usec random writes MADV_NOHUGEPAGE 17543070 usec random writes MADV_NOHUGEPAGE 17568858 usec random writes MADV_HUGEPAGE 12896588 usec # numactl -m 0 -C 8 /tmp/numa-thp-bench random writes MADV_HUGEPAGE 19663515 usec random writes MADV_NOHUGEPAGE 27819864 usec random writes MADV_NOHUGEPAGE 27844066 usec random writes MADV_HUGEPAGE 19662706 usec # grep Threadripper /proc/cpuinfo |head -1 model name : AMD Ryzen Threadripper 1950X 16-Core Processor local 4k -> local 2m: +35% local 4k -> remote 2m: -10% remote 4k -> remote 2m: +41% Or if you prefer reversed in terms of compute time (negative percentage is better in this case): local 4k -> local 2m: -26% local 4k -> remote 2m: +12% remote 4k -> remote 2m: -29% It's true that local 4k is generally a win vs remote THP when the workload is memory bound also for the threadripper, the threadripper seems even more favorable to remote THP than skylake Xeon is. The above is the host bare metal result. Now let's try guest mode on the threadripper. The last two lines seems more reliable (the first two lines also needs to fault in the guest RAM because the guest was fresh booted). 
guest backed by local 2M pages: random writes MADV_HUGEPAGE 16025855 usec random writes MADV_NOHUGEPAGE 21903002 usec random writes MADV_NOHUGEPAGE 19762767 usec random writes MADV_HUGEPAGE 15189231 usec guest backed by remote 2M pages: random writes MADV_HUGEPAGE 25434251 usec random writes MADV_NOHUGEPAGE 32404119 usec random writes MADV_NOHUGEPAGE 31455592 usec random writes MADV_HUGEPAGE 22248304 usec guest backed by local 4k pages: random writes MADV_HUGEPAGE 28945251 usec random writes MADV_NOHUGEPAGE 32217690 usec random writes MADV_NOHUGEPAGE 30664731 usec random writes MADV_HUGEPAGE 22981082 usec guest backed by remote 4k pages: random writes MADV_HUGEPAGE 43772939 usec random writes MADV_NOHUGEPAGE 52745664 usec random writes MADV_NOHUGEPAGE 51632065 usec random writes MADV_HUGEPAGE 40263194 usec I haven't yet tried the guest mode on the skylake nor haswell/broadwell. I can do that too but I don't expect a significant difference. On a threadripper guest, the remote 2m is practically identical to local 4k. So shutting down compaction to try to generate local 4k memory looks a sure loss. Even if we ignore the guest mode results completely, if we don't make assumption on the workload to be able to fit in the node, if I use MADV_HUGEPAGE I think I'd prefer the risk of a -10% slowdown if the THP page ends up in a remote node, than not getting the +41% THP speedup on remote memory if the pagetable ends up being remote or the 4k page itself ends up being remote over time. The cons left from your latest patch is that you eventually also lose the +35% speedup when compaction is clogged by COMPACT_SKIPPED, which for a guest mode computation translates in losing the +59% speedup of having host local THP (when guest uses 4k pages).
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Thu, 6 Dec 2018, Linus Torvalds wrote: > > On Broadwell, the access latency to local small pages was +5.6%, remote > > hugepages +16.4%, and remote small pages +19.9%. > > > > On Naples, the access latency to local small pages was +4.9%, intrasocket > > hugepages +10.5%, intrasocket small pages +19.6%, intersocket small pages > > +26.6%, and intersocket hugepages +29.2% > > Are those two last numbers transposed? > > Or why would small page accesses be *faster* than hugepages for the > intersocket case? > > Of course, depending on testing, maybe the page itself was remote, but > the page tables were random, and you happened to get a remote page > table for the hugepage case? > Yes, looks like that was the case, if the page tables were from the same node as the intersocket remote hugepage it looks like a ~0.1% increase accessing small pages, so basically unchanged. So this complicates the allocation strategy somewhat; on this platform, at least, hugepages are preferred on the same socket but there isn't a significant benefit from getting a cross socket hugepage over small page. The typical way this is resolved is based on the SLIT and how the kernel defines RECLAIM_DISTANCE. I'm not sure that we can expect the distances between proximity domains to be defined according to this value for a one-size-fits-all solution. I've always thought that RECLAIM_DISTANCE should be configurable so that initscripts can actually determine its ideal value when using vm.zone_reclaim_mode. > > So it *appears* from the x86 platforms that NUMA matters much more > > significantly than hugeness, but remote hugepages are a slight win over > > remote small pages. PPC appeared the same wrt the local node but then > > prefers hugeness over affinity when it comes to remote pages. > > I do think POWER at least historically has much weaker TLB fills, but > also very costly page table creation/teardown. 
Constant-time O(1) > arguments about hash lookups are only worth so much when the constant > time is pretty big. They've been working on it. > > So at least on POWER, afaik one issue is literally that hugepages made > the hash setup and teardown situation much better. > I'm still working on the more elaborate test case that will generate these results because I think I can use it at boot to determine an ideal RECLAIM_DISTANCE. I can also get numbers for hash vs radix MMU if you're interested. > One thing that might be worth looking at is whether the process itself > is all that node-local. Maybe we could aim for a policy that says > "prefer local memory, but if we notice that the accesses to this vma > aren't all that local, then who cares?". > > IOW, the default could be something more dynamic than just "always use > __GFP_THISNODE". It could be more along the lines of "start off using > __GFP_THISNODE, but for longer-lived processes that bounce around > across nodes, maybe relax it?" > It would allow the use of MPOL_PREFERRED for an exact preference if they are known to not be bounced around. This would be required for processes that are bound to the cpus of a single node through cpuset or sched_setaffinity() but unconstrained as far as memory is concerned. The goal of __GFP_THISNODE being the default for thp, however, is that we *know* we're going to be accessing it locally at least in the short term, perhaps forever. Any other default would assume the remotely allocated hugepage would eventually be accessed locally, otherwise we would have been much better off just failing the hugepage allocation and accessing small pages. You could make an assumption that's the case iff the process does not fit in its local node, and I think that would be the minority of applications. I guess there could be some heuristic that could determine this based on MM_ANONPAGES of Andrea's qemu and zone->zone_pgdat->node_present_pages. 
It feels like something that should be more exactly defined, though, for the application to say that it prefers remote hugepages over local small pages because it can't access either locally forever anyway. This was where I suggested a new prctl() mode so that an application can prefer remote hugepages because it knows it's larger than the single node and that requires no change to the binary itself because it is inherited across fork. The sane default, though, seems to always prefer local allocation, whether hugepages or small pages, for the majority of workloads since that's where the lowest access latency is. > Honestly, I think things like vm_policy etc should not be the solution > - yes, some people may know *exactly* what access patterns they want, > but for most situations, I think the policy should be that defaults > "just work". > > In fact, I wish even MADV_HUGEPAGE itself were to approach being a > no-op with THP. > Besides the NUMA locality of the allocations, we still have the allocation latency concern that MADV_HUGEPAGE changes. The madvise mode
Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)
On Fri, 7 Dec 2018, Vlastimil Babka wrote: > >> But *that* in turn makes for other possible questions: > >> > >> - if the reason we couldn't get a local hugepage is that we're simply > >> out of local memory (huge *or* small), then maybe a remote hugepage is > >> better. > >> > >> Note that this now implies that the choice can be an issue of "did > >> the hugepage allocation fail due to fragmentation, or due to the node > >> being low of memory" > > How exactly do you tell? Many systems are simply low on memory due to > > caching. A clean pagecache is quite cheap to reclaim but it can be more > > expensive to fault in. Do we consider it to be a viable target? > > Compaction can report if it failed (more precisely: was skipped) due to > low memory, or for other reasons. It doesn't distinguish how easily > reclaimable is the memory, but I don't think we should reclaim anything > (see below). > Note that just reclaiming when the order-0 watermark in __compaction_suitable() fails is unfortunately not always sufficient: it needs to be accessible to isolate_freepages(). For order-9 memory, it's possible for isolate_migratepages_block() to skip over a ton of free pages that were just reclaimed if there are unmovable pages preventing the entire pageblock from being freed.
Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)
On 12/7/18 8:49 AM, Michal Hocko wrote: >> But *that* in turn makes for other possible questions: >> >> - if the reason we couldn't get a local hugepage is that we're simply >> out of local memory (huge *or* small), then maybe a remote hugepage is >> better. >> >>Note that this now implies that the choice can be an issue of "did >> the hugepage allocation fail due to fragmentation, or due to the node >> being low of memory" > How exactly do you tell? Many systems are simply low on memory due to > caching. A clean pagecache is quite cheap to reclaim but it can be more > expensive to fault in. Do we consider it to be a viable target? Compaction can report if it failed (more precisely: was skipped) due to low memory, or for other reasons. It doesn't distinguish how easily reclaimable is the memory, but I don't think we should reclaim anything (see below). >> and there is the other question that I asked in the other thread >> (before subject edit): >> >> - how local is the load to begin with? >> >>Relatively shortlived processes - or processes that are explicitly >> bound to a node - might have different preferences than some >> long-lived process where the CPU bounces around, and might have >> different trade-offs for the local vs remote question too. > Agreed > >> So just based on David's numbers, and some wild handwaving on my part, >> a slightly more complex, but still very sensible default might be >> something like >> >> 1) try to do a cheap local node hugepage allocation >> >> Rationale: everybody agrees this is the best case. >> >> But if that fails: >> >> 2) look at compacting and the local node, but not very hard. >> >> If there's lots of memory on the local node, but synchronous >> compaction doesn't do anything easily, just fall back to small pages. > Do we reclaim at this stage or this is mostly GFP_NOWAIT attempt? I would expect no reclaim, because for non-THP faults we also don't reclaim the local node before trying to allocate from remote node. 
If somebody wants such behavior they can enable the node reclaim mode. THP faults shouldn't be different in this regard, right?
Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)
On Thu 06-12-18 20:31:46, Linus Torvalds wrote: > [ Oops. different thread for me due to edited subject, so I saw this > after replying to the earlier email by David ] Sorry about that but I really wanted to make the actual discussion about the semantics clearly distinguished because the thread had just grown too large with back and forth that didn't lead anywhere. > On Thu, Dec 6, 2018 at 1:14 AM Michal Hocko wrote: > > > > MADV_HUGEPAGE changes the picture because the caller expressed a need > > for THP and is willing to go extra mile to get it. > > Actually, I think MADV_HUGEPAGE should just be > "TRANSPARENT_HUGEPAGE_ALWAYS but only for this vma". Yes, that is the case and I didn't want to make the description more complicated than necessary so I've focused only on the current default. But historically we have treated defrag=always and MADV_HUGEPAGE the same. [...] > > I believe that something like the below would be sensible > > 1) THP on a local node with compaction not giving up too early > > 2) THP on a remote node in NOWAIT mode - so no direct > > compaction/reclaim (trigger kswapd/kcompactd only for > > defrag=defer+madvise) > > 3) fallback to the base page allocation > > That doesn't sound insane to me. That said, the numbers David quoted > do fairly strongly imply that local small-pages are actually preferred > to any remote THP pages. As I and others pointed out elsewhere the remote penalty is just a part of the picture and on its own might be quite misleading. There are other aspects (TLB pressure, page table overhead, etc.) that might amortize the access latency. > But *that* in turn makes for other possible questions: > > - if the reason we couldn't get a local hugepage is that we're simply > out of local memory (huge *or* small), then maybe a remote hugepage is > better. > > Note that this now implies that the choice can be an issue of "did > the hugepage allocation fail due to fragmentation, or due to the node > being low of memory" How exactly do you tell?
Many systems are simply low on memory due to caching. A clean pagecache is quite cheap to reclaim but it can be more expensive to fault in. Do we consider it to be a viable target? > > and there is the other question that I asked in the other thread > (before subject edit): > > - how local is the load to begin with? > >Relatively shortlived processes - or processes that are explicitly > bound to a node - might have different preferences than some > long-lived process where the CPU bounces around, and might have > different trade-offs for the local vs remote question too. Agreed > So just based on David's numbers, and some wild handwaving on my part, > a slightly more complex, but still very sensible default might be > something like > > 1) try to do a cheap local node hugepage allocation > > Rationale: everybody agrees this is the best case. > > But if that fails: > > 2) look at compacting and the local node, but not very hard. > > If there's lots of memory on the local node, but synchronous > compaction doesn't do anything easily, just fall back to small pages. Do we reclaim at this stage or this is mostly GFP_NOWAIT attempt? > Rationale: local memory is generally more important than THP. > > If that fails (ie local node is simply low on memory): > > 3) Try to do remote THP allocation > > Rationale: Ok, we simply didn't have a lot of local memory, so > it's not just a question of fragmentation. If it *had* been > fragmentation, lots of small local pages would have been better than a > remote THP page. > > Oops, remote THP allocation failed (possibly after synchronous > remote compaction, but maybe this is where we do kcompactd). > > 4) Just do any small page, and do reclaim etc. THP isn't happening, > and it's not a priority when you're starting to feel memory pressure. If 2) doesn't reclaim heavily (e.g. only try to reclaim clean page cache) or even do NOWAIT (which would be even better) then I _think_ this sounds sane. 
> In general, I really would want to avoid magic kernel command lines > (or sysfs settings, or whatever) making a huge difference in behavior. > So I really wish people would see the whole > 'transparent_hugepage_flags' thing as a way for kernel developers to > try different settings, not as a way for users to tune their loads. > > Our default should work as sane defaults, we shouldn't have a "ok, > let's have this sysfs tunable and let people make their own > decisions". That's a cop-out. Agreed. I cannot say I am happy with all the ways THP can be tuned. It is quite confusing to say the least. -- Michal Hocko SUSE Labs
Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)
On Thu 06-12-18 15:49:04, David Rientjes wrote: > On Thu, 6 Dec 2018, Michal Hocko wrote: > > > MADV_HUGEPAGE changes the picture because the caller expressed a need > > for THP and is willing to go extra mile to get it. That involves > > allocation latency and as of now also a potential remote access. We do > > not have complete agreement on the latter but the prevailing argument is > > that any strong NUMA locality is just reinventing node-reclaim story > > again or makes THP success rate down the toilet (to quote Mel). I agree > > that we do not want to fall back to a remote node overeagerly. I believe > > that something like the below would be sensible > > 1) THP on a local node with compaction not giving up too early > > 2) THP on a remote node in NOWAIT mode - so no direct > > compaction/reclaim (trigger kswapd/kcompactd only for > > defrag=defer+madvise) > > 3) fallback to the base page allocation > > > > I disagree that MADV_HUGEPAGE should take on any new semantic that > overrides the preference of node local memory for a hugepage, which is the > nearly four year behavior. The order of MADV_HUGEPAGE preferences listed > above would cause current users to regress who rely on local small page > fallback rather than remote hugepages because the access latency is much > better. I think the preference of remote hugepages over local small pages > needs to be expressed differently to prevent regression. Such a model would be broken. It doesn't provide a consistent semantic and leads to surprising results. MADV_HUGEPAGE with local node binding will not prevent remote base pages from being used and you are back to square one. It has been a huge mistake to merge your __GFP_THISNODE patch back then in 4.1. Especially with an absolute lack of numbers for a variety of workloads. I still believe we can do better, offer a sane mem policy to help workloads with higher locality demands but it is outright wrong to conflate demand for THP with the locality semantic.
If this is an absolute no-go then we need a MADV_HUGEPAGE_SANE... -- Michal Hocko SUSE Labs
Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)
[ Oops. different thread for me due to edited subject, so I saw this after replying to the earlier email by David ] On Thu, Dec 6, 2018 at 1:14 AM Michal Hocko wrote: > > MADV_HUGEPAGE changes the picture because the caller expressed a need > for THP and is willing to go extra mile to get it. Actually, I think MADV_HUGEPAGE should just be "TRANSPARENT_HUGEPAGE_ALWAYS but only for this vma". So MADV_HUGEPAGE shouldn't change any behavior at all, if the kernel was built with TRANSPARENT_HUGEPAGE_ALWAYS. Put another way: even if you decide to run a kernel that does *not* have that "always THP" (because you presumably think that it's too blunt an instrument), then MADV_HUGEPAGE says "for _this_ vma, do the 'always THP' behavior". I think those semantics would be a whole lot easier to explain to people, and perhaps more importantly, starting off from that kind of mindset also gives good guidance to what MADV_HUGEPAGE behavior should be: it should be sane enough that it makes sense as the _default_ behavior for the TRANSPARENT_HUGEPAGE_ALWAYS configuration. But that also means that no, MADV_HUGEPAGE doesn't really change the picture. All it does is say "I know that for this vma, THP really does make sense as a default". It doesn't say "I _have_ to have THP", exactly like TRANSPARENT_HUGEPAGE_ALWAYS does not mean that every allocation should strive to be THP. > I believe that something like the below would be sensible > 1) THP on a local node with compaction not giving up too early > 2) THP on a remote node in NOWAIT mode - so no direct > compaction/reclaim (trigger kswapd/kcompactd only for > defrag=defer+madvise) > 3) fallback to the base page allocation That doesn't sound insane to me. That said, the numbers David quoted do fairly strongly imply that local small-pages are actually preferred to any remote THP pages.
But *that* in turn makes for other possible questions: - if the reason we couldn't get a local hugepage is that we're simply out of local memory (huge *or* small), then maybe a remote hugepage is better. Note that this now implies that the choice can be an issue of "did the hugepage allocation fail due to fragmentation, or due to the node being low of memory" and there is the other question that I asked in the other thread (before subject edit): - how local is the load to begin with? Relatively shortlived processes - or processes that are explicitly bound to a node - might have different preferences than some long-lived process where the CPU bounces around, and might have different trade-offs for the local vs remote question too. So just based on David's numbers, and some wild handwaving on my part, a slightly more complex, but still very sensible default might be something like 1) try to do a cheap local node hugepage allocation Rationale: everybody agrees this is the best case. But if that fails: 2) look at compacting and the local node, but not very hard. If there's lots of memory on the local node, but synchronous compaction doesn't do anything easily, just fall back to small pages. Rationale: local memory is generally more important than THP. If that fails (ie local node is simply low on memory): 3) Try to do remote THP allocation Rationale: Ok, we simply didn't have a lot of local memory, so it's not just a question of fragmentation. If it *had* been fragmentation, lots of small local pages would have been better than a remote THP page. Oops, remote THP allocation failed (possibly after synchronous remote compaction, but maybe this is where we do kcompactd). 4) Just do any small page, and do reclaim etc. THP isn't happening, and it's not a priority when you're starting to feel memory pressure. In general, I really would want to avoid magic kernel command lines (or sysfs settings, or whatever) making a huge difference in behavior. 
So I really wish people would see the whole 'transparent_hugepage_flags' thing as a way for kernel developers to try different settings, not as a way for users to tune their loads. Our default should work as sane defaults, we shouldn't have a "ok, let's have this sysfs tunable and let people make their own decisions". That's a cop-out. Btw, don't get me wrong: I'm not suggesting removing the sysfs knob. As a debug tool, it's great, where you can ask "ok, do things work better if you set THP-defrag to defer+madvise". I'm just saying that we should *not* use that sysfs flag as an excuse for "ok, if we get the default wrong, people can make their own defaults". We should strive to do well enough that it really shouldn't be an issue in normal situations. Linus
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Thu, Dec 6, 2018 at 3:43 PM David Rientjes wrote: > > On Broadwell, the access latency to local small pages was +5.6%, remote > hugepages +16.4%, and remote small pages +19.9%. > > On Naples, the access latency to local small pages was +4.9%, intrasocket > hugepages +10.5%, intrasocket small pages +19.6%, intersocket small pages > +26.6%, and intersocket hugepages +29.2% Are those last two numbers transposed? Or why would small page accesses be *faster* than hugepages for the intersocket case? Of course, depending on testing, maybe the page itself was remote, but the page tables were random, and you happened to get a remote page table for the hugepage case? > The results on Murano were similar, which is why I suspect Aneesh > introduced the __GFP_THISNODE requirement for thp in 4.0, which preferred, > in order, local small pages, remote 1-hop hugepages, remote 2-hop > hugepages, remote 1-hop small pages, remote 2-hop small pages. It sounds like on the whole the TLB advantage of hugepages is smaller than the locality advantage. Which doesn't surprise me on x86, because TLB costs really are fairly low. Very good TLB fills, relative to what I've seen elsewhere. > So it *appears* from the x86 platforms that NUMA matters much more > significantly than hugeness, but remote hugepages are a slight win over > remote small pages. PPC appeared the same wrt the local node but then > prefers hugeness over affinity when it comes to remote pages. I do think POWER at least historically has much weaker TLB fills, but also very costly page table creation/teardown.
Maybe we could aim for a policy that says "prefer local memory, but if we notice that the accesses to this vma aren't all that local, then who cares?". IOW, the default could be something more dynamic than just "always use __GFP_THISNODE". It could be more along the lines of "start off using __GFP_THISNODE, but for longer-lived processes that bounce around across nodes, maybe relax it?" I don't think we have that kind of information right now, though, do we? Honestly, I think things like vm_policy etc should not be the solution - yes, some people may know *exactly* what access patterns they want, but for most situations, I think the policy should be that defaults "just work". In fact, I wish even MADV_HUGEPAGE itself were to approach being a no-op with THP. We already have TRANSPARENT_HUGEPAGE_ALWAYS being the default kconfig option (but I think it's a bit debatable, because I'm not sure everybody always agrees about memory use), so on the whole MADV_HUGEPAGE shouldn't really *do* anything. Linus
Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)
On Thu, 6 Dec 2018, Michal Hocko wrote: > MADV_HUGEPAGE changes the picture because the caller expressed a need > for THP and is willing to go extra mile to get it. That involves > allocation latency and as of now also a potential remote access. We do > not have complete agreement on the latter but the prevailing argument is > that any strong NUMA locality is just reinventing node-reclaim story > again or makes THP success rate down the toilet (to quote Mel). I agree > that we do not want to fall back to a remote node overeagerly. I believe > that something like the below would be sensible > 1) THP on a local node with compaction not giving up too early > 2) THP on a remote node in NOWAIT mode - so no direct > compaction/reclaim (trigger kswapd/kcompactd only for > defrag=defer+madvise) > 3) fallback to the base page allocation > I disagree that MADV_HUGEPAGE should take on any new semantic that overrides the preference of node local memory for a hugepage, which is the nearly four year behavior. The order of MADV_HUGEPAGE preferences listed above would cause current users to regress who rely on local small page fallback rather than remote hugepages because the access latency is much better. I think the preference of remote hugepages over local small pages needs to be expressed differently to prevent regression.
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, 5 Dec 2018, Linus Torvalds wrote: > > Ok, I've applied David's latest patch. > > > > I'm not at all objecting to tweaking this further, I just didn't want > > to have this regression stand. > > Hmm. Can somebody (David?) also perhaps try to state what the > different latency impacts end up being? I suspect it's been mentioned > several times during the argument, but it would be nice to have a > "going forward, this is what I care about" kind of setup for good > default behavior. > I'm in the process of writing a more complete test case for this but I benchmarked a few platforms based solely on local small pages vs remote hugepages vs remote small pages. My previous numbers were based on data from actual workloads. For all platforms, local hugepages are the premium, of course. On Broadwell, the access latency to local small pages was +5.6%, remote hugepages +16.4%, and remote small pages +19.9%. On Naples, the access latency to local small pages was +4.9%, intrasocket hugepages +10.5%, intrasocket small pages +19.6%, intersocket small pages +26.6%, and intersocket hugepages +29.2%. The results on Murano were similar, which is why I suspect Aneesh introduced the __GFP_THISNODE requirement for thp in 4.0, which preferred, in order, local small pages, remote 1-hop hugepages, remote 2-hop hugepages, remote 1-hop small pages, remote 2-hop small pages. So it *appears* from the x86 platforms that NUMA matters much more significantly than hugeness, but remote hugepages are a slight win over remote small pages. PPC appeared the same wrt the local node but then prefers hugeness over affinity when it comes to remote pages. Of course this could be much different on platforms I have not tested. I can look at POWER9 but I suspect it will be similar to Murano. > How much of the problem ends up being about the cost of compaction vs > the cost of getting a remote node bigpage? > > That would seem to be a fairly major issue, but __GFP_THISNODE affects > both. 
It limits compaction to just this now, in addition to obviously > limiting the allocation result. > > I realize that we probably do want to just have explicit policies that > do not exist right now, but what are (a) sane defaults, and (b) sane > policies? > The common case is that local node allocation, whether huge or small, is *always* better. After that, I assume that some actual measurement of access latency at boot would be better than hardcoding a single policy in the page allocator for everybody. On my x86 platforms, it's always a simple preference of "try huge, try small, go to the next nearest node, repeat". On my PPC platforms, it's "try local huge, try local small, try huge from remaining nodes, try small from remaining nodes." > For example, if we cannot get a hugepage on this node, but we *do* get > a node-local small page, is the local memory advantage simply better > than the possible TLB advantage? > > Because if that's the case (at least commonly), then that in itself is > a fairly good argument for "hugepage allocations should always be > THISNODE". > > But David also did mention the actual allocation overhead itself in > the commit, and maybe the math is more "try to get a local hugepage, > but if no such thing exists, see if you can get a remote hugepage > _cheaply_". > > So another model can be "do local-only compaction, but allow non-local > allocation if the local node doesn't have anything". IOW, if other > nodes have hugepages available, pick them up, but don't try to compact > other nodes to do so? > It would be nice if there was a specific policy that was optimal on all platforms; since that's not the case, introducing a sane default policy is going to require some complexity. It would likely always make sense to allocate huge over small pages remotely when local allocation is not possible both for MADV_HUGEPAGE users and non-MADV_HUGEPAGE users. 
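[Editorial note] The two per-platform fallback orders David describes ("try huge, try small, next nearest node, repeat" on x86 vs. "local huge, local small, remote huge, remote small" on PPC) can be sketched as below. This is a plain userspace C illustration with invented struct and function names (the kernel has no such helpers); it assumes nodes[] is already sorted by distance with the local node first.

```c
#include <assert.h>
#include <stddef.h>

struct attempt { int node; int huge; };

/* x86-style order: at each node, in distance order, try huge then small. */
static size_t x86_order(const int *nodes, size_t n, struct attempt *out)
{
	size_t i, k = 0;

	for (i = 0; i < n; i++) {
		out[k].node = nodes[i]; out[k++].huge = 1;
		out[k].node = nodes[i]; out[k++].huge = 0;
	}
	return k;
}

/* PPC-style order: local huge, local small, then huge from all remaining
 * nodes, then small from all remaining nodes (hugeness beats affinity
 * once off-node). */
static size_t ppc_order(const int *nodes, size_t n, struct attempt *out)
{
	size_t i, k = 0;

	out[k].node = nodes[0]; out[k++].huge = 1;
	out[k].node = nodes[0]; out[k++].huge = 0;
	for (i = 1; i < n; i++) {
		out[k].node = nodes[i]; out[k++].huge = 1;
	}
	for (i = 1; i < n; i++) {
		out[k].node = nodes[i]; out[k++].huge = 0;
	}
	return k;
}
```

With three nodes the two orders first diverge at the third attempt: x86 tries a huge page on the next node only after the local small page, while PPC defers all remote small pages to the very end.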
That would require a restructuring of how thp fallback is done which, today, is try to allocate huge locally and fail so handle_pte_fault() can take it from there and would obviously touch more than just the page allocator. I *suspect* that's not all that common because it's easier to reclaim some pages and fault local small pages instead, which always has better access latency. What's different in this discussion thus far is workloads that do not fit into a single node so allocating remote hugepages is actually better than constantly reclaiming and compacting locally. Mempolicies are interesting, but I worry about the interaction they would have with small page policies because you can only define one mode: we may have a combination of default, interleave, bind, and preferred policies for huge and small memory and that may become overly complex. Since these workloads are in the minority and it seems, to me at least, that it's a property of the size of the workload rather than a general desire for remote hugepages over small pages
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On 12/6/18 1:54 AM, Andrea Arcangeli wrote: > On Wed, Dec 05, 2018 at 04:18:14PM -0800, David Rientjes wrote: >> On Wed, 5 Dec 2018, Andrea Arcangeli wrote: >> >> Note that in addition to COMPACT_SKIPPED that you mention, compaction can >> fail with COMPACT_COMPLETE, meaning the full scan has finished without >> freeing a hugepage, or COMPACT_DEFERRED, meaning that doing another scan >> is unlikely to produce a different result. COMPACT_SKIPPED makes sense to >> do reclaim if it can become accessible to isolate_freepages() and >> hopefully another allocator does not allocate from these newly freed pages >> before compaction can scan the zone again. For COMPACT_COMPLETE and >> COMPACT_DEFERRED, reclaim is unlikely to ever help. > > COMPACT_COMPLETE (and COMPACT_PARTIAL_SKIPPED for that matter) > seem just a mistake in the max() evaluation in try_to_compact_pages() > that lets it return COMPACT_COMPLETE and COMPACT_PARTIAL_SKIPPED. I > think it should just return COMPACT_DEFERRED in those two cases and it > should be enforced for all prios. > > There are really only 3 cases that matter for the caller: > > 1) succeed -> we got the page > 2) defer -> we failed (caller won't care about why) > 3) skipped -> failed because not enough 4k freed -> reclaim must be invoked, > then compaction can be retried > > PARTIAL_SKIPPED/COMPLETE both fall into 2) above so for the caller > they should be treated the same way. It doesn't seem very concerning > that it may act as if it succeeded and do a spurious single reclaim > invocation, but it's good to fix this and take the COMPACT_DEFERRED > nopage path in the __GFP_NORETRY case. Yeah, good point. 
I wouldn't change the general logic of try_to_compact_pages() though, but the condition for __GFP_NORETRY can simply change to: if (compact_result != COMPACT_SKIPPED) goto nopage; I can make a patch ASAP together with a few others I think are needed, that should hopefully avoid the need for __GFP_COMPACT_ONLY or checks based on order. What's probably unavoidable though is adding back __GFP_NORETRY for madvised allocations (i.e. partially reverting 2516035499b95), but David was fine with that and your __GFP_COMPACT_ONLY approach effectively did it too.
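[Editorial note] The proposed condition can be modeled in isolation. This is a toy userspace sketch, not the kernel's actual __alloc_pages_slowpath(): the enum values and the should_goto_nopage() helper are invented stand-ins, and only the decision rule quoted above is from the mail.

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's compact_result values. */
enum compact_result {
	COMPACT_SKIPPED,	/* not enough order-0 memory free: reclaim can help */
	COMPACT_DEFERRED,	/* recent failures: retrying now is pointless */
	COMPACT_COMPLETE,	/* full scan finished without forming a huge page */
	COMPACT_PARTIAL_SKIPPED,
};

/*
 * For a __GFP_NORETRY (madvised THP) allocation, only COMPACT_SKIPPED
 * justifies one round of reclaim followed by another compaction
 * attempt; every other failure goes straight to the nopage exit.
 */
static int should_goto_nopage(enum compact_result compact_result)
{
	return compact_result != COMPACT_SKIPPED;
}
```

The point of the change is visible in the helper: the earlier behavior bailed out only on an explicit defer, whereas this rule bails out on everything except the one result where reclaim is known to unblock compaction.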
MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)
On Wed 05-12-18 16:58:02, Linus Torvalds wrote: [...] > I realize that we probably do want to just have explicit policies that > do not exist right now, but what are (a) sane defaults, and (b) sane > policies? I would focus on the current default first (which is defrag=madvise). This means that we only try the cheapest possible THP without MADV_HUGEPAGE. If there is none we simply fallback. We do restrict to the local node. I guess there is a general agreement that this is a sane default. MADV_HUGEPAGE changes the picture because the caller expressed a need for THP and is willing to go extra mile to get it. That involves allocation latency and as of now also a potential remote access. We do not have complete agreement on the latter but the prevailing argument is that any strong NUMA locality is just reinventing node-reclaim story again or makes THP success rate down the toilet (to quote Mel). I agree that we do not want to fallback to a remote node overeagerly. I believe that something like the below would be sensible 1) THP on a local node with compaction not giving up too early 2) THP on a remote node in NOWAIT mode - so no direct compaction/reclaim (trigger kswapd/kcompactd only for defrag=defer+madvise) 3) fallback to the base page allocation This would allow both full memory utilization and try to be as local as possible. Whoever strongly prefers NUMA locality should be using MPOL_NODE_RECLAIM (or similar) and that would skip 2) and make 1) and 3) use more aggressive compaction and reclaim. This will also fit into our existing NUMA API. MPOL_NODE_RECLAIM wouldn't be restricted to THP obviously. It would act on base pages as well and it would basically use the same implementation as we have for the global node_reclaim and make it usable again. Does this sound at least remotely sane? -- Michal Hocko SUSE Labs
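[Editorial note] Michal's three-step fallback can be sketched as a table of attempts. The flag bits, struct, and array below are invented for illustration (they are deliberately not the kernel's gfp flags); they only encode the stated intent: local THP with real compaction first, then remote THP without any direct reclaim, then a base page.

```c
#include <assert.h>

#define SK_THISNODE        0x1u	/* restrict the attempt to the local node */
#define SK_DIRECT_RECLAIM  0x2u	/* may reclaim/compact synchronously */
#define SK_KSWAPD_RECLAIM  0x4u	/* may only wake background kswapd/kcompactd */

struct thp_attempt {
	int huge;		/* try a THP (1) or fall back to a base page (0) */
	unsigned int flags;
};

static const struct thp_attempt madv_hugepage_order[] = {
	/* 1) THP on the local node, compaction not giving up too early */
	{ 1, SK_THISNODE | SK_DIRECT_RECLAIM },
	/* 2) THP on a remote node in NOWAIT mode: no direct compaction/reclaim */
	{ 1, SK_KSWAPD_RECLAIM },
	/* 3) fallback to the base page allocation */
	{ 0, SK_DIRECT_RECLAIM },
};
```

The key property is in step 2: the node restriction is dropped at the same moment direct reclaim is dropped, so a remote node is only used when it can satisfy the THP cheaply, which is exactly the middle ground the proposal aims for.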
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, Dec 5, 2018 at 3:51 PM Linus Torvalds wrote: > > Ok, I've applied David's latest patch. > > I'm not at all objecting to tweaking this further, I just didn't want > to have this regression stand. Hmm. Can somebody (David?) also perhaps try to state what the different latency impacts end up being? I suspect it's been mentioned several times during the argument, but it would be nice to have a "going forward, this is what I care about" kind of setup for good default behavior. How much of the problem ends up being about the cost of compaction vs the cost of getting a remote node bigpage? That would seem to be a fairly major issue, but __GFP_THISNODE affects both. It limits compaction to just this now, in addition to obviously limiting the allocation result. I realize that we probably do want to just have explicit policies that do not exist right now, but what are (a) sane defaults, and (b) sane policies? For example, if we cannot get a hugepage on this node, but we *do* get a node-local small page, is the local memory advantage simply better than the possible TLB advantage? Because if that's the case (at least commonly), then that in itself is a fairly good argument for "hugepage allocations should always be THISNODE". But David also did mention the actual allocation overhead itself in the commit, and maybe the math is more "try to get a local hugepage, but if no such thing exists, see if you can get a remote hugepage _cheaply_". So another model can be "do local-only compaction, but allow non-local allocation if the local node doesn't have anything". IOW, if other nodes have hugepages available, pick them up, but don't try to compact other nodes to do so? And yet another model might be "do a least-effort thing, give me a local hugepage if it exists, otherwise fall back to small pages". So there are different combinations of "try compaction" vs "local-remote". Linus
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, Dec 05, 2018 at 04:18:14PM -0800, David Rientjes wrote: > On Wed, 5 Dec 2018, Andrea Arcangeli wrote: > > > __GFP_COMPACT_ONLY gave hope it could give some middle ground but > > it shows awful compaction results, it basically destroys compaction > > effectiveness and we know why (COMPACT_SKIPPED must call reclaim or > > compaction can't succeed because there's not enough free memory in the > > node). If somebody used MADV_HUGEPAGE compaction should still work and > > not fail like that. Compaction would fail to be effective even in the > > local node where __GFP_THISNODE didn't fail. Worst of all it'd fail > > even on non-NUMA systems (that would be easy to fix though by making > > the HPAGE_PMD_ORDER check conditional to NUMA being enabled at > > runtime). > > > > Note that in addition to COMPACT_SKIPPED that you mention, compaction can > fail with COMPACT_COMPLETE, meaning the full scan has finished without > freeing a hugepage, or COMPACT_DEFERRED, meaning that doing another scan > is unlikely to produce a different result. COMPACT_SKIPPED makes sense to > do reclaim if it can become accessible to isolate_freepages() and > hopefully another allocator does not allocate from these newly freed pages > before compaction can scan the zone again. For COMPACT_COMPLETE and > COMPACT_DEFERRED, reclaim is unlikely to ever help. COMPACT_COMPLETE (and COMPACT_PARTIAL_SKIPPED for that matter) seem just a mistake in the max() evaluation in try_to_compact_pages() that lets it return COMPACT_COMPLETE and COMPACT_PARTIAL_SKIPPED. I think it should just return COMPACT_DEFERRED in those two cases and it should be enforced for all prios. 
There are really only 3 cases that matter for the caller:

1) succeed -> we got the page
2) defer -> we failed (caller won't care about why)
3) skipped -> failed because not enough 4k freed -> reclaim must be invoked, then compaction can be retried

PARTIAL_SKIPPED/COMPLETE both fall into 2) above so for the caller they should be treated the same way. It doesn't seem very concerning that it may act as if it succeeded and do a spurious single reclaim invocation, but it's good to fix this and take the COMPACT_DEFERRED nopage path in the __GFP_NORETRY case.
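[Editorial note] Andrea's three-way contract for the caller can be written down as a small classification function. This is an illustrative userspace model: the enum values stand in for the kernel's compact_result, and for_caller() is an invented helper rather than try_to_compact_pages(); only the three-way classification itself comes from the mail.

```c
#include <assert.h>

/* Stand-ins for the kernel's compaction outcomes. */
enum compact_result {
	COMPACT_SUCCESS,
	COMPACT_SKIPPED,
	COMPACT_DEFERRED,
	COMPACT_COMPLETE,
	COMPACT_PARTIAL_SKIPPED,
};

enum caller_action {
	GOT_PAGE,		/* 1) succeed */
	FAIL_NOPAGE,		/* 2) defer: caller won't care why */
	RECLAIM_THEN_RETRY,	/* 3) skipped: reclaim, then retry compaction */
};

static enum caller_action for_caller(enum compact_result r)
{
	switch (r) {
	case COMPACT_SUCCESS:
		return GOT_PAGE;
	case COMPACT_SKIPPED:
		/* not enough 4k pages free: reclaim can unblock compaction */
		return RECLAIM_THEN_RETRY;
	default:
		/* DEFERRED, COMPLETE and PARTIAL_SKIPPED all mean "defer" */
		return FAIL_NOPAGE;
	}
}
```

Collapsing COMPLETE and PARTIAL_SKIPPED into the defer case is the whole fix being discussed: it stops the caller from running a reclaim pass that cannot change the outcome.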
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, 5 Dec 2018, Andrea Arcangeli wrote: > __GFP_COMPACT_ONLY gave hope it could give some middle ground but > it shows awful compaction results, it basically destroys compaction > effectiveness and we know why (COMPACT_SKIPPED must call reclaim or > compaction can't succeed because there's not enough free memory in the > node). If somebody used MADV_HUGEPAGE compaction should still work and > not fail like that. Compaction would fail to be effective even in the > local node where __GFP_THISNODE didn't fail. Worst of all it'd fail > even on non-NUMA systems (that would be easy to fix though by making > the HPAGE_PMD_ORDER check conditional to NUMA being enabled at > runtime). > Note that in addition to COMPACT_SKIPPED that you mention, compaction can fail with COMPACT_COMPLETE, meaning the full scan has finished without freeing a hugepage, or COMPACT_DEFERRED, meaning that doing another scan is unlikely to produce a different result. COMPACT_SKIPPED makes sense to do reclaim if it can become accessible to isolate_freepages() and hopefully another allocator does not allocate from these newly freed pages before compaction can scan the zone again. For COMPACT_COMPLETE and COMPACT_DEFERRED, reclaim is unlikely to ever help.
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Hello, On Wed, Dec 05, 2018 at 01:59:32PM -0800, David Rientjes wrote: > [..] and the kernel test robot has reported, [..] Just for completeness you may have missed one email: https://lkml.kernel.org/r/87tvk1yjkp@yhuang-dev.intel.com 'So I think the report should have been a "performance improvement" instead of "performance regression".'
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, Dec 5, 2018 at 3:36 PM Andrea Arcangeli wrote: > > Like said earlier still better to apply __GFP_COMPACT_ONLY or David's > patch than to return to v4.18 though. Ok, I've applied David's latest patch. I'm not at all objecting to tweaking this further, I just didn't want to have this regression stand. Linus
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, Dec 05, 2018 at 02:03:10PM -0800, Linus Torvalds wrote: > On Wed, Dec 5, 2018 at 12:40 PM Andrea Arcangeli wrote: > > > > So ultimately we decided that the saner behavior that gives the least > > risk of regression for the short term, until we can do something > > better, was the one that is already applied upstream. > > You're ignoring the fact that people *did* report things regressed. I don't ignore regressions. After all, the only reason I touched this code is that I have been asked to fix a regression that made the upstream kernel unusable in some enterprise workloads with very large processes. Enterprise releases don't happen every year so it's normal that we only noticed a 3-year-old regression last January. The fact it's an old regression doesn't make it any less relevant. It took until August until I had the time to track down this specific regression, which artificially delayed this by another 8 months. With regard to David's specific regression I didn't ignore it either, I just prioritize which regression has to be fixed with the most urgency, and David's regression is less severe than the one we're fixing here. I posted below the numbers for the regression that is more urgent to fix. Now suppose (like I think is likely) David may be better off setting __GFP_THISNODE across the board including for 4k pages, not just for THP. I don't think anybody would be ok if we set __GFP_THISNODE on 4k pages too unless it's done under a very specific new MPOL. It'll probably work even better for him (the cache will be pushed into remote nodes by 4k allocations too, and even more of the app data and executable will be in the local NUMA node). 
But that's unusable for anything except his specialized workload that tends to fit in a single node and can accept paying an incredible slowdown if it ever spills over (as long as the process is not getting OOM killed he's ok, because it's such an uncommon occurrence for him that he can pay an extreme cost just to avoid OOM killing). It's totally fine to optimize such things with an opt-in like a new MPOL that makes those assumptions about process size, but that's an unacceptable assumption to impose on all workloads, because it breaks the VM badly for any workload that can't fit in a single NUMA node. > That's the part I find unacceptable. You're saying "we picked > something that minimized regressions". > > No it didn't. The regression is present and real, and is on a real > load, not a benchmark. > > So that argument is clearly bogus. Note that by "this gives the least risk of regression" I never meant the risk is zero. Obviously we know it's higher than zero. Otherwise David would have no regression in the first place. So I stand by my argument that this is what gives the least risk of regression if you're given any workload you know nothing about that uses MADV_HUGEPAGE and is benefiting from it, and you don't know beforehand if it can fit or not fit in a single NUMA node. If you knew for sure it could fit in a single NUMA node, __GFP_THISNODE would be better, obviously, but the same applies to 4k pages too... and we're not setting __GFP_THISNODE on 4k allocations under MPOL_DEFAULT. So I'm all for fixing David's workload, but here we're trying to generalize an ad-hoc NUMA optimization that isn't necessarily only applicable to THP order allocations either, like it's a generic good thing when it isn't. 
__GFP_COMPACT_ONLY gave hope it could give some middle ground but it shows awful compaction results, it basically destroys compaction effectiveness and we know why (COMPACT_SKIPPED must call reclaim or compaction can't succeed because there's not enough free memory in the node). If somebody used MADV_HUGEPAGE compaction should still work and not fail like that. Compaction would fail to be effective even in the local node where __GFP_THISNODE didn't fail. Worst of all it'd fail even on non-NUMA systems (that would be easy to fix though by making the HPAGE_PMD_ORDER check conditional to NUMA being enabled at runtime). Like said earlier, still better to apply __GFP_COMPACT_ONLY or David's patch than to return to v4.18 though. === From: Andrea Arcangeli To: Andrew Morton Cc: linux...@kvack.org, Alex Williamson , David Rientjes , Vlastimil Babka Subject: [PATCH 1/1] mm: thp: fix transparent_hugepage/defrag = madvise || always Date: Sun, 19 Aug 2018 23:26:40 -0400 qemu uses MADV_HUGEPAGE which allows direct compaction (i.e. __GFP_DIRECT_RECLAIM is set). The problem is that direct compaction combined with the NUMA __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very hard the local node, instead of failing the allocation if there's no THP available in the local node. Such logic was ok until __GFP_THISNODE was added to the THP allocation path even with MPOL_DEFAULT. The idea behind the __GFP_THISNODE addition, is that it is better to provide local memory in PAGE_SIZE units than to use remote NUMA THP b
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, 5 Dec 2018, Linus Torvalds wrote: > > So ultimately we decided that the saner behavior that gives the least > > risk of regression for the short term, until we can do something > > better, was the one that is already applied upstream. > > You're ignoring the fact that people *did* report things regressed. > > That's the part I find unacceptable. You're saying "we picked > something that minimized regressions". > > No it didn't. The regression is present and real, and is on a real > load, not a benchmark. > > So that argument is clearly bogus. > > I'm going to revert the commit since people apparently seem to be > ignoring this fundamental issue. > > Real workloads regressed. The regressions got reported. Ignoring that > isn't acceptable. > Please allow me to prepare my v2 because it's not a clean revert due to the follow-up 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask") and will incorporate the feedback from Michal to not change anything outside of the thp fault path.
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, Dec 5, 2018 at 12:40 PM Andrea Arcangeli wrote: > > So ultimately we decided that the saner behavior that gives the least > risk of regression for the short term, until we can do something > better, was the one that is already applied upstream. You're ignoring the fact that people *did* report things regressed. That's the part I find unacceptable. You're saying "we picked something that minimized regressions". No it didn't. The regression is present and real, and is on a real load, not a benchmark. So that argument is clearly bogus. I'm going to revert the commit since people apparently seem to be ignoring this fundamental issue. Real workloads regressed. The regressions got reported. Ignoring that isn't acceptable. Linus
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, 5 Dec 2018, Andrea Arcangeli wrote:
> > thpscale Percentage Faults Huge
> >                               4.20.0-rc4          4.20.0-rc4
> >                           mmots-20181130    gfpthisnode-v1r1
> > Percentage huge-3      95.14 (  0.00%)       7.94 ( -91.65%)
> > Percentage huge-5      91.28 (  0.00%)       5.00 ( -94.52%)
> > Percentage huge-7      86.87 (  0.00%)       9.36 ( -89.22%)
> > Percentage huge-12     83.36 (  0.00%)      21.03 ( -74.78%)
> > Percentage huge-18     83.04 (  0.00%)      30.73 ( -63.00%)
> > Percentage huge-24     83.74 (  0.00%)      27.47 ( -67.20%)
> > Percentage huge-30     83.66 (  0.00%)      31.85 ( -61.93%)
> > Percentage huge-32     83.89 (  0.00%)      29.09 ( -65.32%)
> >
> > They're down the toilet. 3 threads are able to get 95% of the requested > > THP pages with Andrew's tree as of Nov 30th. David's patch drops that to > > 8% success rate. > This is the downside of David's patch very well exposed above. And > this will make non-NUMA systems regress like above too despite they > have no issue to begin with (which is probably why nobody noticed the > trouble with __GFP_THISNODE reclaim until recently, combined with the > fact most workloads can fit in a single NUMA node). > > So we're effectively crippling down MADV_HUGEPAGE effectiveness on > non-NUMA (where it cannot help to do so) and on NUMA (as a workaround > for the false positive swapout storms) because in some workload and > system THP improvements are less significant than NUMA improvements. > For context, you're referring to the patch I posted that is similar to __GFP_COMPACT_ONLY and patch 2/2 in my series. It's not referring to the revert of the 4.20-rc commit that relaxes the __GFP_THISNODE restriction on thp faults and conflates MADV_HUGEPAGE with NUMA locality. For 4.20, I believe at minimum that patch 1/2 should be merged to restore what we have had for three years, stop piling more semantics on top of the intent (or perceived intent) of MADV_HUGEPAGE, and address the swap storm issue separately. 
> The higher fault latency is generally the higher cost you pay to get > the good initial THP utilization for apps that do long lived > allocations and in turn can use MADV_HUGEPAGE without downsides. The > cost of compaction pays off over time. > > Short lived allocations sensitive to the allocation latency should not > use MADV_HUGEPAGE in the first place. If you don't want high latency > you shouldn't use MADV_HUGEPAGE and khugepaged already uses > __GFP_THISNODE but it replaces memory so it has a neutral memory > footprint, so it's ok with regard to reclaim. > Completely agreed, and is why we want to try synchronous memory compaction to try to allocate hugepages locally in our usecases as well. We aren't particularly concerned about the allocation latency, that is secondary to the long-lived access latency regression that occurs when you do not set __GFP_THISNODE. > In my view David's workload is the outlier that uses MADV_HUGEPAGE but > expects low latency and NUMA-local behavior as first priority. If > your workload fits in the per-socket CPU cache it doesn't matter which > node it is but it totally matters if you have 2M or 4k TLBs. I'm not even > talking about KVM where THP has a multiplier effect with EPT. > Hm, no, we do not mind the high allocation latency for MADV_HUGEPAGE users. We *do* care about access latency and that is due to NUMA locality. Before commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings"), *all* thp faults were done with __GFP_THISNODE and had been for at least three years. That commit conflates MADV_HUGEPAGE with a new semantic that it allows remote allocation instead of what it has done for three years: try harder synchronously to allocate hugepages locally. We obviously need to address the problem in another way and not change long-standing behavior that causes regressions. Either my patch 2/2, __GFP_COMPACT_ONLY, a new mempolicy mode, new madvise mode, prctl, etc. 
> Even if you make the __GFP_NORETRY change for the HPAGE_PMD_ORDER to > skip reclaim in David's patch conditional on NUMA being enabled in the > host (so that it won't cripple THP utilization also on non-NUMA > systems), imagine that you go in the bios, turn off interleaving to > enable host NUMA and THP utilization unexpectedly drops significantly > for your VM. > What's needed is appropriate feedback from memory compaction to determine if reclaim is worthwhile: checking only COMPACT_DEFERRED is insufficient. We need to determine if compaction has failed due to order-0 low watermark checks or whether it simply failed to defragment memory so a hugepage could be allocated. Determining if compaction has failed due to order-0 low watermark checks is harder than it seems because the reclaimed memory may not be accessible by isolate_freepages(); we don't have the ability to only reclaim
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Hello, Sorry, it has been challenging to keep up with all the fast replies, so I'll start by answering the critical result below: On Tue, Dec 04, 2018 at 10:45:58AM +, Mel Gorman wrote:
> thpscale Percentage Faults Huge
>                               4.20.0-rc4          4.20.0-rc4
>                           mmots-20181130    gfpthisnode-v1r1
> Percentage huge-3      95.14 (  0.00%)       7.94 ( -91.65%)
> Percentage huge-5      91.28 (  0.00%)       5.00 ( -94.52%)
> Percentage huge-7      86.87 (  0.00%)       9.36 ( -89.22%)
> Percentage huge-12     83.36 (  0.00%)      21.03 ( -74.78%)
> Percentage huge-18     83.04 (  0.00%)      30.73 ( -63.00%)
> Percentage huge-24     83.74 (  0.00%)      27.47 ( -67.20%)
> Percentage huge-30     83.66 (  0.00%)      31.85 ( -61.93%)
> Percentage huge-32     83.89 (  0.00%)      29.09 ( -65.32%)
>
> They're down the toilet. 3 threads are able to get 95% of the requested > THP pages with Andrew's tree as of Nov 30th. David's patch drops that to > 8% success rate. This is the downside of David's patch very well exposed above. And this will make non-NUMA systems regress like above too despite they have no issue to begin with (which is probably why nobody noticed the trouble with __GFP_THISNODE reclaim until recently, combined with the fact most workloads can fit in a single NUMA node). So we're effectively crippling down MADV_HUGEPAGE effectiveness on non-NUMA (where it cannot help to do so) and on NUMA (as a workaround for the false positive swapout storms) because in some workload and system THP improvements are less significant than NUMA improvements. The higher fault latency is generally the higher cost you pay to get the good initial THP utilization for apps that do long lived allocations and in turn can use MADV_HUGEPAGE without downsides. The cost of compaction pays off over time. Short lived allocations sensitive to the allocation latency should not use MADV_HUGEPAGE in the first place. 
If you don't want high latency you shouldn't use MADV_HUGEPAGE, and khugepaged already uses __GFP_THISNODE but it replaces memory so it has a neutral memory footprint, so it's ok with regard to reclaim. In my view David's workload is the outlier that uses MADV_HUGEPAGE but expects low latency and NUMA-local behavior as first priority. If your workload fits in the per-socket CPU cache it doesn't matter which node it is but it totally matters if you have 2M or 4k TLBs. I'm not even talking about KVM where THP has a multiplier effect with EPT. Even if you make the __GFP_NORETRY change for the HPAGE_PMD_ORDER to skip reclaim in David's patch conditional on NUMA being enabled in the host (so that it won't cripple THP utilization also on non-NUMA systems), imagine that you go in the bios, turn off interleaving to enable host NUMA and THP utilization unexpectedly drops significantly for your VM. The Rome Ryzen architecture has been mentioned several times by David, but on my Threadripper (not Rome, as that's supposed to be available in 2019 only AFAIK) enabling THP made a measurable difference for me for some workloads. By contrast, if I turn off NUMA by setting up interleaving in the DIMMs I get a barely measurable slowdown. So I'm surprised in Rome there's such a radical difference in behavior. Like Mel said, we need to work towards a more complete solution than putting __GFP_THISNODE from the outside and then turning off reclaim from the inside. Mel made examples of things that should happen, that won't increase allocation latency and that can't happen with __GFP_THISNODE. I'll try to describe again what's going on: 1: The allocator is being asked through __GFP_THISNODE "ignore all remote nodes for all reclaim and compaction" from the outside. 
Compaction then returns COMPACT_SKIPPED and tells the allocator "I can generate many more huge pages if you reclaim/swapout 2M of anon memory in this node, the only reason I failed to compact memory is because there aren't enough 4k fragmented pages free in this zone". The allocator then goes ahead and swaps 2M and invokes compaction again that succeeds the order 9 allocation fine. Goto 1; The above keeps running in a loop at every additional page fault of the app using MADV_HUGEPAGE until all RAM of the node is swapped out and replaced by THP and all other nodes have 100% free memory, potentially 100% order 9, but the allocator completely ignored all other nodes. That is the thing that we're fixing here, because such swap storms caused massive slowdowns. If the workload can't fit in a single node, it's like running with only a fraction of the RAM. So David's patch (and __GFP_COMPACT_ONLY) to fix the above swap storm, inside the allocator skips reclaim entirely when compaction tells "I can generate one more HPAGE_PMD_ORDER compound page if you reclaim/swap 2M", if __GFP_NORETRY is set (and makes sure __GFP_NORETRY is always set for THP). And that however prevents generating any more THP globally
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, 5 Dec 2018, Michal Hocko wrote: > > It isn't specific to MADV_HUGEPAGE, it is the policy for all transparent > > hugepage allocations, including defrag=always. We agree that > > MADV_HUGEPAGE is not exactly defined: does it mean try harder to allocate > > a hugepage locally, try compaction synchronous to the fault, allow remote > > fallback? It's undefined. > > Yeah, it is certainly underdefined. One thing is clear though. Using > MADV_HUGEPAGE implies that the specific mapping benefits from THPs and > is willing to pay associated init cost. This doesn't imply anything > regarding NUMA locality and as we have NUMA API it shouldn't even > attempt to do so because it would be conflating two things. This is exactly why we use MADV_HUGEPAGE when remapping our text segment to be backed by transparent hugepages, we want to pay the cost at startup to fault thp and that involves synchronous memory compaction rather than quickly falling back to remote memory. This is making the case for me. > > So to answer "what is so different about THP?", it's the performance data. > > The NUMA locality matters more than whether the pages are huge or not. We > > also have the added benefit of khugepaged being able to collapse pages > > locally if fragmentation improves rather than being stuck accessing a > > remote hugepage forever. > > Please back your claims by a variety of workloads. Including mentioned > KVMs one. You keep hand waving about access latency completely ignoring > all other aspects and that makes my suspicious that you do not really > appreciate all the complexity here even stronger. > I discussed the tradeoff of local hugepages vs local pages vs remote hugepages in https://marc.info/?l=linux-kernel&m=154077010828940 on Broadwell, Haswell, and Rome. When a single application does not fit on a single node, we obviously need to extend the API to allow it to fault remotely. 
We can do that without changing the long-standing behavior that prefers to fault only locally, and without causing real-world users to regress. Your suggestions about how we can extend the API are all very logical.

[ Note that this is not the regression being addressed here, however: that is the massive swap storms due to a fragmented local node, which is why the __GFP_COMPACT_ONLY patch was also proposed by Andrea. The ability to prefer faulting remotely is a worthwhile extension, but it does no good whatsoever if we can encounter massive swap storms because we didn't set __GFP_NORETRY appropriately (which both of our patches do), both locally and now remotely. ]
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed 05-12-18 10:43:43, Mel Gorman wrote: > On Wed, Dec 05, 2018 at 10:08:56AM +0100, Michal Hocko wrote: > > On Tue 04-12-18 16:47:23, David Rientjes wrote: > > > On Tue, 4 Dec 2018, Mel Gorman wrote: > > > > > > > What should also be kept in mind is that we should avoid conflating > > > > locality preferences with THP preferences which is separate from THP > > > > allocation latencies. The whole __GFP_THISNODE approach is pushing too > > > > hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag > > > > are used which is very unfortunate given that MADV_HUGEPAGE in itself > > > > says > > > > nothing about locality -- that is the business of other madvise flags or > > > > a specific policy. > > > > > > We currently lack those other madvise modes or mempolicies: mbind() is > > > not > > > a viable alternative because we do not want to oom kill when local memory > > > is depleted, we want to fallback to remote memory. > > > > Yes, there was a clear agreement that there is no suitable mempolicy > > right now and there were proposals to introduce MPOL_NODE_RECLAIM to > > introduce that behavior. This would be an improvement regardless of THP > > because global node-reclaim policy was simply a disaster we had to turn > > off by default and the global semantic was a reason people just gave up > > using it completely. > > > > The alternative is to define a clear semantic for THP allocation > requests that are considered "light" regardless of whether that needs a > GFP flag or not. A sensible default might be > > o Allocate THP local if the amount of work is light or non-existant. 
> o Allocate THP remote if one is freely available with no additional work
>   (maybe kick remote kcompactd)
> o Allocate base page local if the amount of work is light or non-existent
> o Allocate base page remote if the amount of work is light or non-existent
> o Do heavy work in zonelist order until a base page is allocated somewhere

I am not sure about the ordering without deeper consideration, but I think THP should reflect the approach we have for base pages.

> It's not something that could be clearly expressed with either NORETRY or
> THISNODE but longer-term might be saner than chopping and changing on
> which flags are more important and which workload is most relevant. That
> runs the risk of a revert-loop where each person targeting one workload
> reverts one patch to insert another until someone throws up their hands
> in frustration and just carries patches out-of-tree long-term.

Fully agreed!

> I'm not going to prototype something along these lines for now as
> fundamentally a better compaction could cut out part of the root cause
> of pain.

Yes, there is some groundwork to be done first.

--
Michal Hocko
SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, Dec 05, 2018 at 10:08:56AM +0100, Michal Hocko wrote:
> On Tue 04-12-18 16:47:23, David Rientjes wrote:
> > On Tue, 4 Dec 2018, Mel Gorman wrote:
> > > What should also be kept in mind is that we should avoid conflating
> > > locality preferences with THP preferences which is separate from THP
> > > allocation latencies. The whole __GFP_THISNODE approach is pushing too
> > > hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag
> > > are used which is very unfortunate given that MADV_HUGEPAGE in itself
> > > says nothing about locality -- that is the business of other madvise
> > > flags or a specific policy.
> >
> > We currently lack those other madvise modes or mempolicies: mbind() is
> > not a viable alternative because we do not want to oom kill when local
> > memory is depleted, we want to fallback to remote memory.
>
> Yes, there was a clear agreement that there is no suitable mempolicy
> right now and there were proposals to introduce MPOL_NODE_RECLAIM to
> introduce that behavior. This would be an improvement regardless of THP
> because global node-reclaim policy was simply a disaster we had to turn
> off by default and the global semantic was a reason people just gave up
> using it completely.

The alternative is to define a clear semantic for THP allocation requests that are considered "light" regardless of whether that needs a GFP flag or not. A sensible default might be

o Allocate THP local if the amount of work is light or non-existent.
o Allocate THP remote if one is freely available with no additional work
  (maybe kick remote kcompactd)
o Allocate base page local if the amount of work is light or non-existent
o Allocate base page remote if the amount of work is light or non-existent
o Do heavy work in zonelist order until a base page is allocated somewhere

It's not something that could be clearly expressed with either NORETRY or THISNODE, but longer-term it might be saner than chopping and changing which flags are more important and which workload is most relevant. That runs the risk of a revert-loop where each person targeting one workload reverts one patch to insert another until someone throws up their hands in frustration and just carries patches out-of-tree long-term.

I'm not going to prototype something along these lines for now as fundamentally a better compaction could cut out part of the root cause of pain.

--
Mel Gorman
SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Tue 04-12-18 16:07:27, David Rientjes wrote: > On Tue, 4 Dec 2018, Michal Hocko wrote: > > > The thing I am really up to here is that reintroduction of > > __GFP_THISNODE, which you are pushing for, will conflate madvise mode > > resp. defrag=always with a numa placement policy because the allocation > > doesn't fallback to a remote node. > > > > It isn't specific to MADV_HUGEPAGE, it is the policy for all transparent > hugepage allocations, including defrag=always. We agree that > MADV_HUGEPAGE is not exactly defined: does it mean try harder to allocate > a hugepage locally, try compaction synchronous to the fault, allow remote > fallback? It's undefined. Yeah, it is certainly underdefined. One thing is clear though. Using MADV_HUGEPAGE implies that the specific mapping benefits from THPs and is willing to pay associated init cost. This doesn't imply anything regarding NUMA locality and as we have NUMA API it shouldn't even attempt to do so because it would be conflating two things. [...] > > And that is a fundamental problem and the antipattern I am talking > > about. Look at it this way. All normal allocations are utilizing all the > > available memory even though they might hit a remote latency penalty. If > > you do care about NUMA placement you have an API to enforce a specific > > placement. What is so different about THP to behave differently. Do > > we really want to later invent an API to actually allow to utilize all > > the memory? There are certainly usecases (that triggered the discussion > > previously) that do not mind the remote latency because all other > > benefits simply outweight it? > > > > What is different about THP is that on every platform I have measured, > NUMA matters more than hugepages. Obviously if on Broadwell, Haswell, and > Rome, remote hugepages were a performance win over local pages, this > discussion would not be happening. Faulting local pages rather than > local hugepages, if possible, is easy and doesn't require reclaim. 
> Faulting remote pages rather than reclaiming local pages is easy in your
> scenario, it's non-disruptive.

You keep ignoring all the other usecases mentioned before, and that is not really helpful. Access cost can be amortized by other savings. Not to mention NUMA balancing moving hot THPs with remote accesses around.

> So to answer "what is so different about THP?", it's the performance data.
> The NUMA locality matters more than whether the pages are huge or not. We
> also have the added benefit of khugepaged being able to collapse pages
> locally if fragmentation improves rather than being stuck accessing a
> remote hugepage forever.

Please back your claims with a variety of workloads, including the mentioned KVM one. You keep hand-waving about access latency while completely ignoring all other aspects, and that makes my suspicion even stronger that you do not really appreciate all the complexity here.

If there was a general consensus that we want to make THP very special wrt. NUMA locality, I could live with that. It would be an inconsistency in the API and as such something that will kick us sooner or later. But it seems that _you_ are the only one pushing in that direction, and you keep ignoring all other usecases consistently throughout all the discussions we have had so far. Several people keep pointing out that this is a wrong direction, but that seems to be completely ignored.

I believe that the only way forward is to back your claims with numbers covering a larger set of THP users and prove that remote THP is a wrong default behavior. But you cannot really push that through based on a single usecase of yours which you refuse to describe beyond a simple access latency metric.

--
Michal Hocko
SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Tue, Dec 04, 2018 at 10:45:58AM +, Mel Gorman wrote:
> I have *one* result of the series on a 1-socket machine running
> "thpscale". It creates a file, punches holes in it to create a
> very light form of fragmentation and then tries THP allocations
> using madvise measuring latency and success rates. It's the
> global-dhp__workload_thpscale-madvhugepage in mmtests using XFS as the
> filesystem.
>
> thpscale Fault Latencies
>                                  4.20.0-rc4       4.20.0-rc4
>                              mmots-20181130 gfpthisnode-v1r1
> Amean  fault-base-3    5358.54 (   0.00%)  2408.93 *  55.04%*
> Amean  fault-base-5    9742.30 (   0.00%)  3035.25 *  68.84%*
> Amean  fault-base-7   13069.18 (   0.00%)  4362.22 *  66.62%*
> Amean  fault-base-12  14882.53 (   0.00%)  9424.38 *  36.67%*
> Amean  fault-base-18  15692.75 (   0.00%) 16280.03 (  -3.74%)
> Amean  fault-base-24  28775.11 (   0.00%) 18374.84 *  36.14%*
> Amean  fault-base-30  42056.32 (   0.00%) 21984.55 *  47.73%*
> Amean  fault-base-32  38634.26 (   0.00%) 22199.49 *  42.54%*
> Amean  fault-huge-1       0.00 (   0.00%)     0.00 (   0.00%)
> Amean  fault-huge-3    3628.86 (   0.00%)   963.45 *  73.45%*
> Amean  fault-huge-5    4926.42 (   0.00%)  2959.85 *  39.92%*
> Amean  fault-huge-7    6717.15 (   0.00%)  3828.68 *  43.00%*
> Amean  fault-huge-12  11393.47 (   0.00%)  5772.92 *  49.33%*
> Amean  fault-huge-18  16979.38 (   0.00%)  4435.95 *  73.87%*
> Amean  fault-huge-24  16558.00 (   0.00%)  4416.46 *  73.33%*
> Amean  fault-huge-30  20351.46 (   0.00%)  5099.73 *  74.94%*
> Amean  fault-huge-32  23332.54 (   0.00%)  6524.73 *  72.04%*
>
> So, looks like massive latency improvements but then the THP allocation
> success rates
>
> thpscale Percentage Faults Huge
>                        4.20.0-rc4       4.20.0-rc4
>                    mmots-20181130 gfpthisnode-v1r1
> Percentage huge-3   95.14 (   0.00%)  7.94 ( -91.65%)
> Percentage huge-5   91.28 (   0.00%)  5.00 ( -94.52%)
> Percentage huge-7   86.87 (   0.00%)  9.36 ( -89.22%)
> Percentage huge-12  83.36 (   0.00%) 21.03 ( -74.78%)
> Percentage huge-18  83.04 (   0.00%) 30.73 ( -63.00%)
> Percentage huge-24  83.74 (   0.00%) 27.47 ( -67.20%)
> Percentage huge-30  83.66 (   0.00%) 31.85 ( -61.93%)
> Percentage huge-32  83.89 (   0.00%) 29.09 ( -65.32%)
>

Other results arrived once the grid caught up and it's a mixed bag of gains and losses roughly along the lines predicted by the discussion already -- namely locality is better as long as the workload fits, compaction is reduced, reclaim is reduced, THP allocation success rates are reduced but latencies are often better. Whether this is "good" or "bad" depends on whether you have a workload that benefits, because it's neither universally good nor bad.

It would still be nice to hear how Andrea fared but I think we'll reach the same conclusion -- the patches shuffle the problem around with limited effort to address the root causes, so all we end up changing is the identity of the person who complains about their workload. One might be tempted to think that the reduced latencies in some cases are great, but not if the workload is one that benefits from longer startup costs in exchange for lower runtime costs in the active phase.

For the much longer answer, I'll focus on the two-socket results because they are more relevant to the current discussion. The workloads are not realistic in the slightest, they just happen to trigger some of the interesting corner cases.

global-dhp__workload_usemem-stress-numa-compact
o Plain anonymous faulting workload
o defrag=always (not representative, simply triggers a bad case)

                        4.20.0-rc4       4.20.0-rc4
                    mmots-20181130 gfpthisnode-v1r1
Amean  Elapsd-1  26.79 (   0.00%) 34.92 * -30.37%*
Amean  Elapsd-3   7.32 (   0.00%)  8.10 * -10.61%*
Amean  Elapsd-4   5.53 (   0.00%)  5.64 (  -1.94%)

Units are seconds, time to complete; 30.37% worse for the single-threaded case. No direct reclaim activity, but other activity is interesting and I'll pick out snippets:

                               4.20.0-rc4       4.20.0-rc4
                           mmots-20181130 gfpthisnode-v1r1
Swap Ins                                8                0
Swap Outs                            1546                0
Allocation stalls                       0                0
Fragmentation stalls                    0             2022
Direct pages scanned                    0                0
Kswapd pages scanned                42719             1078
Kswapd pages reclaimed              41082             1049
Page writes by reclaim               1546                0
Page writes file                        0
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Tue 04-12-18 16:47:23, David Rientjes wrote:
> On Tue, 4 Dec 2018, Mel Gorman wrote:
> > What should also be kept in mind is that we should avoid conflating
> > locality preferences with THP preferences which is separate from THP
> > allocation latencies. The whole __GFP_THISNODE approach is pushing too
> > hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag
> > are used which is very unfortunate given that MADV_HUGEPAGE in itself
> > says nothing about locality -- that is the business of other madvise
> > flags or a specific policy.
>
> We currently lack those other madvise modes or mempolicies: mbind() is
> not a viable alternative because we do not want to oom kill when local
> memory is depleted, we want to fallback to remote memory.

Yes, there was a clear agreement that there is no suitable mempolicy right now and there were proposals to introduce MPOL_NODE_RECLAIM to introduce that behavior. This would be an improvement regardless of THP because global node-reclaim policy was simply a disaster we had to turn off by default and the global semantic was a reason people just gave up using it completely.

[...]

> Sure, but not at the cost of regressing real-world workloads; what is
> being asked for here is legitimate and worthy of an extension, but since
> the long-standing behavior has been to use __GFP_THISNODE and people
> depend on that for NUMA locality,

Well, your patch has altered the semantic and has introduced a subtle and _undocumented_ NUMA policy into MADV_HUGEPAGE. All that without any workload numbers. It would be preferable to have a simulator of those real-world workloads of course, but even some more detailed metrics would help (e.g. without the patch we have THP utilization X and runtime characteristics Y, while with the patch X1 and Y1).

> can we not fix the swap storm and look
> to extending the API to include workloads that span multiple nodes?

Yes, we definitely want to address swap storms. No question about that.
But our established approach for NUMA policy has been to fall back to other nodes, and everybody focused on NUMA latency should use the NUMA API to achieve that. Not vice versa. As I've said in the other thread, I am OK with restoring __GFP_THISNODE for now, but we should really have a very good plan for further steps. And that involves an agreed NUMA behavior. I haven't seen any widespread agreement on that yet though.

[...]

> > I would also re-emphasise that a major problem with addressing this
> > problem is that we do not have a general reproducible test case for
> > David's scenario where as we do have reproduction cases for the others.
> > They're not related to KVM but that doesn't matter because it's enough
> > to have a memory hog try allocating more memory than fits on a single
> > node.
>
> It's trivial to reproduce this issue: fragment all local memory that
> compaction cannot resolve, do posix_memalign() for hugepage aligned
> memory, and measure the access latency. To fragment all local memory, you
> can simply insert a kernel module and allocate high-order memory (just do
> kmem_cache_alloc_node() or get_page() to pin it so compaction fails or
> punch holes in the file as you did above). You can do this for all memory
> rather than the local node to measure the even more severe allocation
> latency regression that not setting __GFP_THISNODE introduces.

Sure, but can we get some numbers from a real workload rather than an artificial worst case? The utilization issue Mel pointed out before and here again is a real concern IMHO. We definitely need a better picture to make an educated decision.

--
Michal Hocko
SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Tue, 4 Dec 2018, Mel Gorman wrote: > What should also be kept in mind is that we should avoid conflating > locality preferences with THP preferences which is separate from THP > allocation latencies. The whole __GFP_THISNODE approach is pushing too > hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag > are used which is very unfortunate given that MADV_HUGEPAGE in itself says > nothing about locality -- that is the business of other madvise flags or > a specific policy. We currently lack those other madvise modes or mempolicies: mbind() is not a viable alternative because we do not want to oom kill when local memory is depleted, we want to fallback to remote memory. In my response to Michal, I noted three possible usecases that MADV_HUGEPAGE either currently has or has taken before: direct compaction/reclaim, avoid increased rss, and allow fallback to remote memory. It's certainly not the business of one madvise mode to define this. Thus, I'm trying to return to the behavior that was in 4.1 and what was restored three years ago because suddenly changing the behavior to allow remote allocation causes real-world regressions. > Using remote nodes is bad but reclaiming excessively > and pushing data out of memory is worse as the latency to fault data back > from disk is higher than a remote access. > That's discussing two things at the same time: local fragmentation and local low-on-memory conditions. If compaction quickly fails and local pages are available as fallback, that requires no reclaim. If we're truly low-on-memory locally then it is obviously better to allocate remotely than aggressively reclaim. > Andrea already pointed it out -- workloads that fit within a node are happy > to reclaim local memory, particularly in the case where the existing data > is old which is the ideal for David. 
Workloads that do not fit within a > node will often prefer using remote memory -- either THP or base pages > in the general case and THP for definite in the KVM case. While KVM > might not like remote memory, using THP at least reduces the page table > access overhead even if the access is remote and eventually automatic > NUMA balancing might intervene. > Sure, but not at the cost of regressing real-world workloads; what is being asked for here is legitimate and worthy of an extension, but since the long-standing behavior has been to use __GFP_THISNODE and people depend on that for NUMA locality, can we not fix the swap storm and look to extending the API to include workloads that span multiple nodes? > I have *one* result of the series on a 1-socket machine running > "thpscale". It creates a file, punches holes in it to create a > very light form of fragmentation and then tries THP allocations > using madvise measuring latency and success rates. It's the > global-dhp__workload_thpscale-madvhugepage in mmtests using XFS as the > filesystem. 
>
> thpscale Fault Latencies
>                                  4.20.0-rc4       4.20.0-rc4
>                              mmots-20181130 gfpthisnode-v1r1
> Amean  fault-base-3    5358.54 (   0.00%)  2408.93 *  55.04%*
> Amean  fault-base-5    9742.30 (   0.00%)  3035.25 *  68.84%*
> Amean  fault-base-7   13069.18 (   0.00%)  4362.22 *  66.62%*
> Amean  fault-base-12  14882.53 (   0.00%)  9424.38 *  36.67%*
> Amean  fault-base-18  15692.75 (   0.00%) 16280.03 (  -3.74%)
> Amean  fault-base-24  28775.11 (   0.00%) 18374.84 *  36.14%*
> Amean  fault-base-30  42056.32 (   0.00%) 21984.55 *  47.73%*
> Amean  fault-base-32  38634.26 (   0.00%) 22199.49 *  42.54%*
> Amean  fault-huge-1       0.00 (   0.00%)     0.00 (   0.00%)
> Amean  fault-huge-3    3628.86 (   0.00%)   963.45 *  73.45%*
> Amean  fault-huge-5    4926.42 (   0.00%)  2959.85 *  39.92%*
> Amean  fault-huge-7    6717.15 (   0.00%)  3828.68 *  43.00%*
> Amean  fault-huge-12  11393.47 (   0.00%)  5772.92 *  49.33%*
> Amean  fault-huge-18  16979.38 (   0.00%)  4435.95 *  73.87%*
> Amean  fault-huge-24  16558.00 (   0.00%)  4416.46 *  73.33%*
> Amean  fault-huge-30  20351.46 (   0.00%)  5099.73 *  74.94%*
> Amean  fault-huge-32  23332.54 (   0.00%)  6524.73 *  72.04%*
>
> So, looks like massive latency improvements but then the THP allocation
> success rates
>
> thpscale Percentage Faults Huge
>                        4.20.0-rc4       4.20.0-rc4
>                    mmots-20181130 gfpthisnode-v1r1
> Percentage huge-3   95.14 (   0.00%)  7.94 ( -91.65%)
> Percentage huge-5   91.28 (   0.00%)  5.00 ( -94.52%)
> Percentage huge-7   86.87 (   0.00%)  9.36 ( -89.22%)
> Percentage huge-12  83.36 (   0.00%) 21.03 ( -74.78%)
> Percentage huge-18  83.04 (   0.00%) 30.73 ( -63.00%)
> Percentage huge-24  83.74 (   0.00%) 27.47 ( -67.20%
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Tue, 4 Dec 2018, Michal Hocko wrote:
> The thing I am really up to here is that reintroduction of
> __GFP_THISNODE, which you are pushing for, will conflate madvise mode
> resp. defrag=always with a numa placement policy because the allocation
> doesn't fallback to a remote node.

It isn't specific to MADV_HUGEPAGE, it is the policy for all transparent hugepage allocations, including defrag=always. We agree that MADV_HUGEPAGE is not exactly defined: does it mean try harder to allocate a hugepage locally, try compaction synchronous to the fault, allow remote fallback? It's undefined.

The original intent was for it to be used when thp is disabled system-wide (enabled set to "madvise") because it's possible the rss of the process increases if backed by thp. That occurs either if faulting on a hugepage-aligned area or based on max_ptes_none. So we have at least three possible policies that have evolved over time: preventing increased rss, direct compaction, remote fallback. Certainly not something that fits under a single madvise mode.

> And that is a fundamental problem and the antipattern I am talking
> about. Look at it this way. All normal allocations are utilizing all the
> available memory even though they might hit a remote latency penalty. If
> you do care about NUMA placement you have an API to enforce a specific
> placement. What is so different about THP that it should behave
> differently? Do we really want to later invent an API to actually allow
> to utilize all the memory? There are certainly usecases (that triggered
> the discussion previously) that do not mind the remote latency because
> all other benefits simply outweigh it?

What is different about THP is that on every platform I have measured, NUMA matters more than hugepages. Obviously if, on Broadwell, Haswell, and Rome, remote hugepages were a performance win over local pages, this discussion would not be happening.
Faulting local pages rather than local hugepages, if possible, is easy and doesn't require reclaim. Faulting remote pages rather than reclaiming local pages is easy in your scenario, it's non-disruptive. So to answer "what is so different about THP?", it's the performance data. The NUMA locality matters more than whether the pages are huge or not. We also have the added benefit of khugepaged being able to collapse pages locally if fragmentation improves rather than being stuck accessing a remote hugepage forever. > That being said what should users who want to use all the memory do to > use as many THPs as possible? If those users want to accept the performance degradation of allocating remote hugepages instead of local pages, that should likely be an extension, either madvise or prctl. That's not necessarily the usecase Andrea would have, I don't believe: he'd still prefer to compact memory locally and avoid the swap storm than allocate remotely. If impossible to reclaim locally for regular pages, remote hugepages may be more beneficial than remote pages.
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Much of this thread is a rehash of previous discussions, so as a result I glossed over parts of it and there will be a degree of error. Very preliminary results from David's approach are below, and the bottom line is that it might fix some latency and locality issues at the cost of a high degree of THP allocation failure.

On Tue, Dec 04, 2018 at 10:22:26AM +0100, Vlastimil Babka wrote:
> > +               if (order == pageblock_order &&
> > +                               !(current->flags & PF_KTHREAD))
> > +                       goto nopage;
> >
> > and just goes "Eww".
> >
> > But I think the real problem is that it's the "goto nopage" thing that
> > makes _sense_, and the current cases for "let's try compaction" that
>
> More precisely it's "let's try reclaim + compaction".

The original intent has been muddied and special-cased, but the general idea was that compaction needs space to work with, both to succeed and to avoid excessive scanning -- particularly in direct context that is visible to the application. Before compaction, linear reclaim (aka lumpy reclaim) was used, but this caused both page-age inversion issues and excessive thrashing.

In Andrew's tree, there are patches that also do small amounts of reclaim in response to fragmentation, which in some cases alleviates the need for the reclaim + compaction step as the reclaim has sometimes already happened. This has reduced latencies and increased THP allocation success rates, but not by enough, so further work is needed.

Parts of compaction are in need of a revisit. I'm in the process of doing that, but it's time-consuming because of the level of testing required at every step. The prototype currently is 12 patches and growing, and I'm not sure what the final series will look like or how far it'll go. At this point, I believe that even when it's finished, the concept of "do some reclaim and try compaction" will remain. I'm focused primarily on the compaction core at the moment rather than the outer part in the page allocator.
> > are the odd ones, and then David adds one new special case for the > > sensible behavior. > > > > For example, why would COMPACT_DEFERRED mean "don't bother", but not > > all the other reasons it didn't really make sense? > > COMPACT_DEFERRED means that compaction was failing recently, even with > sufficient free pages (e.g. freed by direct reclaim), so it doesn't make > sense to continue. Yes, the intent is that recent failures should not incur more useless scanning and stalls. As it is, the latencies are too high and too often it's useless work. Historically, this was put into place as the time spent in compaction was too high and the THP allocation latencies were so bad that it was preferred to disable THP entirely. This has improved in recent years with general improvements and changes to defaults but there is room to improve. Again, it's something I'm looking into but it's slow. > > > > So does it really make sense to fall through AT ALL to that "retry" > > case, when we explicitly already had (gfp_mask & __GFP_NORETRY)? > > Well if there was no free memory to begin with, and thus compaction > returned COMPACT_SKIPPED, then we didn't really "try" anything yet, so > there's nothing to "not retry". > What should also be kept in mind is that we should avoid conflating locality preferences with THP preferences which is separate from THP allocation latencies. The whole __GFP_THISNODE approach is pushing too hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag are used which is very unfortunate given that MADV_HUGEPAGE in itself says nothing about locality -- that is the business of other madvise flags or a specific policy. Using remote nodes is bad but reclaiming excessively and pushing data out of memory is worse as the latency to fault data back from disk is higher than a remote access. 
Andrea already pointed it out -- workloads that fit within a node are happy to reclaim local memory, particularly in the case where the existing data is old which is the ideal for David. Workloads that do not fit within a node will often prefer using remote memory -- either THP or base pages in the general case and THP for definite in the KVM case. While KVM might not like remote memory, using THP at least reduces the page table access overhead even if the access is remote and eventually automatic NUMA balancing might intervene. > > Maybe the real fix is to instead of adding yet another special case > > for "goto nopage", it should just be unconditional: simply don't try > > to compact large-pages if __GFP_NORETRY was set. > > I think that would destroy THP success rates too much, in situations > where reclaim and compaction would succeed, because there's enough > easily reclaimable and migratable memory. > Tests are in progress but yes, this is the primary risk of abandoning the allocation request too early. I've already found during developing the prototype series
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On 12/3/18 11:27 PM, Linus Torvalds wrote: > On Mon, Dec 3, 2018 at 2:04 PM Linus Torvalds > wrote: >> >> so I think all of David's patch is somewhat sensible, even if that >> specific "order == pageblock_order" test really looks like it might >> want to be clarified. > > Side note: I think maybe people should just look at that whole > compaction logic for that block, because it doesn't make much sense to > me: > > /* > * Checks for costly allocations with __GFP_NORETRY, which > * includes THP page fault allocations > */ > if (costly_order && (gfp_mask & __GFP_NORETRY)) { > /* > * If compaction is deferred for high-order > allocations, > * it is because sync compaction recently failed. If > * this is the case and the caller requested a THP > * allocation, we do not want to heavily disrupt the > * system, so we fail the allocation instead of > entering > * direct reclaim. > */ > if (compact_result == COMPACT_DEFERRED) > goto nopage; > > /* > * Looks like reclaim/compaction is worth trying, but > * sync compaction could be very expensive, so keep > * using async compaction. > */ > compact_priority = INIT_COMPACT_PRIORITY; > } > > this is where David wants to add *his* odd test, and I think everybody > looks at that added case > > + if (order == pageblock_order && > + !(current->flags & PF_KTHREAD)) > + goto nopage; > > and just goes "Eww". > > But I think the real problem is that it's the "goto nopage" thing that > makes _sense_, and the current cases for "let's try compaction" that More precisely it's "let's try reclaim + compaction". > are the odd ones, and then David adds one new special case for the > sensible behavior. > > For example, why would COMPACT_DEFERRED mean "don't bother", but not > all the other reasons it didn't really make sense? COMPACT_DEFERRED means that compaction was failing recently, even with sufficient free pages (e.g. freed by direct reclaim), so it doesn't make sense to continue. What are "all the other reasons"? 
__alloc_pages_direct_compact() could have also returned COMPACT_SKIPPED, which means compaction actually didn't happen at all, because there are not enough free pages. > So does it really make sense to fall through AT ALL to that "retry" > case, when we explicitly already had (gfp_mask & __GFP_NORETRY)? Well if there was no free memory to begin with, and thus compaction returned COMPACT_SKIPPED, then we didn't really "try" anything yet, so there's nothing to "not retry". > Maybe the real fix is to instead of adding yet another special case > for "goto nopage", it should just be unconditional: simply don't try > to compact large-pages if __GFP_NORETRY was set. I think that would destroy THP success rates too much, in situations where reclaim and compaction would succeed, because there's enough easily reclaimable and migratable memory. > Hmm? I dunno. Right now - for 4.20, I'd obviously want to keep changes > smallish, so a hacky added special case might be the right thing to > do. But the code does look odd, doesn't it? > > I think part of it comes from the fact that we *used* to do the > compaction first, and then we did the reclaim, and then it was > re-organized to do reclaim first, but it tried to keep semantic > changes minimal and some of the above comes from that re-org. IIRC the point of the reorg was that in the typical case we actually do want to try the reclaim first (or only), and the exceptions are those THP-ish allocations where typically the problem is fragmentation, and not the number of free pages, so we check first if we can defragment the memory, or whether it makes sense to free pages in case the defragmentation is expected to help afterwards. It seemed better to put this special case out of the main reclaim/compaction retry-with-increasing-priority loop for non-costly-order allocations that in general can't fail. Vlastimil > I think. > > Linus >
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon 03-12-18 13:53:21, David Rientjes wrote: > On Mon, 3 Dec 2018, Michal Hocko wrote: > > > > I think extending functionality so thp can be allocated remotely if truly > > > desired is worthwhile > > > > This is a complete NUMA policy antipattern that we have for all other > > user memory allocations. So far you have to be explicit for your numa > > requirements. You are trying to conflate NUMA api with MADV and that is > > just conflating two orthogonal things and that is just wrong. > > > > No, the page allocator change for both my patch and __GFP_COMPACT_ONLY has > nothing to do with any madvise() mode. It has to do with where thp > allocations are preferred. Yes, this is different than other memory > allocations where it doesn't cause a 13.9% access latency regression for > the lifetime of a binary for users who back their text with hugepages. > MADV_HUGEPAGE still has its purpose to try synchronous memory compaction > at fault time under all thp defrag modes other than "never". The specific > problem being reported here, and that both my patch and __GFP_COMPACT_ONLY > address, is the pointless reclaim activity that does not assist in making > compaction more successful. You do not address my concern though. Sure there are reclaim related issues. Nobody is questioning that. But that is only half of the problem. The thing I am really up to here is that the reintroduction of __GFP_THISNODE, which you are pushing for, will conflate madvise mode resp. defrag=always with a numa placement policy because the allocation doesn't fall back to a remote node. And that is a fundamental problem and the antipattern I am talking about. Look at it this way. All normal allocations are utilizing all the available memory even though they might hit a remote latency penalty. If you do care about NUMA placement you have an API to enforce a specific placement. What is so different about THP that it should behave differently? 
Do we really want to later invent an API to actually allow utilizing all the memory? There are certainly usecases (that triggered the discussion previously) that do not mind the remote latency because all other benefits simply outweigh it. That being said, what should users who want to use all the memory do to use as many THPs as possible? -- Michal Hocko SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon, 3 Dec 2018, Linus Torvalds wrote: > Side note: I think maybe people should just look at that whole > compaction logic for that block, because it doesn't make much sense to > me: > > /* > * Checks for costly allocations with __GFP_NORETRY, which > * includes THP page fault allocations > */ > if (costly_order && (gfp_mask & __GFP_NORETRY)) { > /* > * If compaction is deferred for high-order > allocations, > * it is because sync compaction recently failed. If > * this is the case and the caller requested a THP > * allocation, we do not want to heavily disrupt the > * system, so we fail the allocation instead of > entering > * direct reclaim. > */ > if (compact_result == COMPACT_DEFERRED) > goto nopage; > > /* > * Looks like reclaim/compaction is worth trying, but > * sync compaction could be very expensive, so keep > * using async compaction. > */ > compact_priority = INIT_COMPACT_PRIORITY; > } > > this is where David wants to add *his* odd test, and I think everybody > looks at that added case > > + if (order == pageblock_order && > + !(current->flags & PF_KTHREAD)) > + goto nopage; > > and just goes "Eww". > > But I think the real problem is that it's the "goto nopage" thing that > makes _sense_, and the current cases for "let's try compaction" that > are the odd ones, and then David adds one new special case for the > sensible behavior. > > For example, why would COMPACT_DEFERRED mean "don't bother", but not > all the other reasons it didn't really make sense? > > So does it really make sense to fall through AT ALL to that "retry" > case, when we explicitly already had (gfp_mask & __GFP_NORETRY)? > > Maybe the real fix is to instead of adding yet another special case > for "goto nopage", it should just be unconditional: simply don't try > to compact large-pages if __GFP_NORETRY was set. 
> I think what is intended, which may not be represented by the code, is that if compaction is not suitable (__compaction_suitable() returns COMPACT_SKIPPED because of failing watermarks), then for non-hugepage allocations reclaim may be useful. We just want to reclaim memory so that memory compaction has pages available for migration targets. Note the same caveat I keep bringing up still applies, though: if reclaim frees memory that is iterated over by the compaction migration scanner, it was pointless. That is a memory compaction implementation detail and can lead to a lot of unnecessary reclaim (or even thrashing) if unmovable page fragmentation causes compaction to fail even after it has migrated everything it could. I think the likelihood of that happening increases with the allocation order.
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon, Dec 3, 2018 at 2:04 PM Linus Torvalds wrote: > > so I think all of David's patch is somewhat sensible, even if that > specific "order == pageblock_order" test really looks like it might > want to be clarified. Side note: I think maybe people should just look at that whole compaction logic for that block, because it doesn't make much sense to me: /* * Checks for costly allocations with __GFP_NORETRY, which * includes THP page fault allocations */ if (costly_order && (gfp_mask & __GFP_NORETRY)) { /* * If compaction is deferred for high-order allocations, * it is because sync compaction recently failed. If * this is the case and the caller requested a THP * allocation, we do not want to heavily disrupt the * system, so we fail the allocation instead of entering * direct reclaim. */ if (compact_result == COMPACT_DEFERRED) goto nopage; /* * Looks like reclaim/compaction is worth trying, but * sync compaction could be very expensive, so keep * using async compaction. */ compact_priority = INIT_COMPACT_PRIORITY; } this is where David wants to add *his* odd test, and I think everybody looks at that added case + if (order == pageblock_order && + !(current->flags & PF_KTHREAD)) + goto nopage; and just goes "Eww". But I think the real problem is that it's the "goto nopage" thing that makes _sense_, and the current cases for "let's try compaction" that are the odd ones, and then David adds one new special case for the sensible behavior. For example, why would COMPACT_DEFERRED mean "don't bother", but not all the other reasons it didn't really make sense? So does it really make sense to fall through AT ALL to that "retry" case, when we explicitly already had (gfp_mask & __GFP_NORETRY)? Maybe the real fix is to instead of adding yet another special case for "goto nopage", it should just be unconditional: simply don't try to compact large-pages if __GFP_NORETRY was set. Hmm? I dunno. 
Right now - for 4.20, I'd obviously want to keep changes smallish, so a hacky added special case might be the right thing to do. But the code does look odd, doesn't it? I think part of it comes from the fact that we *used* to do the compaction first, and then we did the reclaim, and then it was re-organized to do reclaim first, but it tried to keep semantic changes minimal and some of the above comes from that re-org. I think. Linus
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon, Dec 3, 2018 at 12:12 PM Andrea Arcangeli wrote: > > On Mon, Dec 03, 2018 at 11:28:07AM -0800, Linus Torvalds wrote: > > > > One is the patch posted by Andrea earlier in this thread, which seems > > to target just this known regression. > > For the short term the important thing is to fix the VM regression one > way or another, I don't personally mind which way. > > > The other seems to be to revert commit ac5b2c1891 and instead apply > > > > > > https://lore.kernel.org/lkml/alpine.deb.2.21.1810081303060.221...@chino.kir.corp.google.com/ > > > > which also seems to be sensible. > > In my earlier review of David's patch, it looked runtime equivalent to > the __GFP_COMPACT_ONLY solution. It has the only advantage of adding a I think there's a missing "not" in the above. > new gfpflag until we're sure we need it but it's the worst solution > available for the long term in my view. It'd be ok to apply it as > stop-gap measure though. So I have no really strong opinions either way. Looking at the two options, I think I'd personally have a slight preference for that patch by David, not so much because it doesn't add a new GFP flag, but because it seems to make it a lot more explicit that GFP_TRANSHUGE_LIGHT automatically implies __GFP_NORETRY. I think that makes a whole lot of conceptual sense with the whole meaning of GFP_TRANSHUGE_LIGHT. It's all about "no reclaim/compaction", but honestly, doesn't __GFP_NORETRY make sense? So I look at David's patch, and I go "that makes sense", and then I compare it with ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") and that makes me go "ok, that's a hack". So *if* reverting ac5b2c18911f and applying David's patch instead fixes the KVM latency issues (which I assume it really should do, simply thanks to __GFP_NORETRY), then I think that makes more sense. That said, I do agree that the if (order == pageblock_order ...) 
test in __alloc_pages_slowpath() in David's patch then argues for "that looks hacky". But that code *is* inside the test for if (costly_order && (gfp_mask & __GFP_NORETRY)) { so within the context of that (not visible in the patch itself), it looks like a sensible model. The whole point of that block is, as the comment above it says /* * Checks for costly allocations with __GFP_NORETRY, which * includes THP page fault allocations */ so I think all of David's patch is somewhat sensible, even if that specific "order == pageblock_order" test really looks like it might want to be clarified. BUT. With all that said, I really don't mind that __GFP_COMPACT_ONLY approach either. I think David's patch makes sense in a bigger context, while the __GFP_COMPACT_ONLY patch makes sense in the context of "let's just fix this _particular_ special case". As long as both work (and apparently they do), either is perfectly fine by me. Some kind of "Thunderdome for patches" is needed, with an epic soundtrack. "Two patches enter, one patch leaves!" I don't so much care which one. Linus
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon, 3 Dec 2018, Michal Hocko wrote: > > I think extending functionality so thp can be allocated remotely if truly > > desired is worthwhile > > This is a complete NUMA policy antipatern that we have for all other > user memory allocations. So far you have to be explicit for your numa > requirements. You are trying to conflate NUMA api with MADV and that is > just conflating two orthogonal things and that is just wrong. > No, the page allocator change for both my patch and __GFP_COMPACT_ONLY has nothing to do with any madvise() mode. It has to do with where thp allocations are preferred. Yes, this is different than other memory allocations where it doesn't cause a 13.9% access latency regression for the lifetime of a binary for users who back their text with hugepages. MADV_HUGEPAGE still has its purpose to try synchronous memory compaction at fault time under all thp defrag modes other than "never". The specific problem being reported here, and that both my patch and __GFP_COMPACT_ONLY address, is the pointless reclaim activity that does not assist in making compaction more successful. > Let's put the __GFP_THISNODE issue aside. I do not remember you > confirming that __GFP_COMPACT_ONLY patch is OK for you (sorry it might > got lost in the emails storm from back then) but if that is the only > agreeable solution for now then I can live with that. The discussion between my patch and Andrea's patch seemed to only be about whether this should be a gfp bit or not > __GFP_NORETRY hack > was shown to not work properly by Mel AFAIR. Again if I misremember then > I am sorry and I can live with that. Andrea's patch as posted in this thread sets __GFP_NORETRY for __GFP_ONLY_COMPACT, so both my patch and his patch require it. His patch gets this behavior for page faults by way of alloc_pages_vma(), mine gets it from modifying GFP_TRANSHUGE. 
> But conflating MADV_TRANSHUGE with > an implicit numa placement policy and/or adding an opt-in for remote > NUMA placing is completely backwards and a broken API which will likely > bites us later. I sincerely hope we are not going to repeat mistakes > from the past. Assuming s/MADV_TRANSHUGE/MADV_HUGEPAGE/. Again, this is *not* about the madvise(); it's specifically about the role of direct reclaim in the allocation of a transparent hugepage at fault time regardless of any madvise() because you can get the same behavior with defrag=always (and the inconsistent use of __GFP_NORETRY there that is fixed by both of our patches).
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon 03-12-18 12:39:34, David Rientjes wrote: > On Mon, 3 Dec 2018, Michal Hocko wrote: > > > I have merely said that a better THP locality needs more work and during > > the review discussion I have even volunteered to work on that. There > > are other reclaim related fixes under work right now. All I am saying > > is that MADV_TRANSHUGE having numa locality implications cannot satisfy > > all the usecases and it is particularly KVM that suffers from it. > > I think extending functionality so thp can be allocated remotely if truly > desired is worthwhile This is a complete NUMA policy antipattern that we have for all other user memory allocations. So far you have to be explicit for your numa requirements. You are trying to conflate NUMA api with MADV and that is just conflating two orthogonal things and that is just wrong. Let's put the __GFP_THISNODE issue aside. I do not remember you confirming that the __GFP_COMPACT_ONLY patch is OK for you (sorry, it might have got lost in the email storm from back then) but if that is the only agreeable solution for now then I can live with that. The __GFP_NORETRY hack was shown to not work properly by Mel AFAIR. Again, if I misremember then I am sorry and I can live with that. But conflating MADV_TRANSHUGE with an implicit numa placement policy and/or adding an opt-in for remote NUMA placing is completely backwards and a broken API which will likely bite us later. I sincerely hope we are not going to repeat mistakes from the past. -- Michal Hocko SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon, 3 Dec 2018, Michal Hocko wrote: > I have merely said that a better THP locality needs more work and during > the review discussion I have even volunteered to work on that. There > are other reclaim related fixes under work right now. All I am saying > is that MADV_TRANSHUGE having numa locality implications cannot satisfy > all the usecases and it is particurarly KVM that suffers from it. I think extending functionality so thp can be allocated remotely if truly desired is worthwhile just so long as it does not cause regressions for other users. I think that is separate from the swap storm regression that Andrea is reporting, however, since that would also exist even if we allowed remote thp allocations when the host is fully fragmented rather than only locally fragmented.
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon, 3 Dec 2018, Andrea Arcangeli wrote: > In my earlier review of David's patch, it looked runtime equivalent to > the __GFP_COMPACT_ONLY solution. It has the only advantage of adding a > new gfpflag until we're sure we need it but it's the worst solution > available for the long term in my view. It'd be ok to apply it as > stop-gap measure though. > > The "order == pageblock_order" hardcoding inside the allocator to > workaround the __GFP_THISNODE flag passed from outside the allocator > in the THP MADV_HUGEPAGE case, didn't look very attractive because > it's not just THP allocating order >0 pages. > We have two different things to consider: NUMA locality and the order of the allocation. THP is preferred locally and we know the order. For the other high-order pages you're referring to, I don't know if they are using __GFP_THISNODE or not (likely not). I see them as two separate issues. For thp on all platforms I have measured it on specifically for this patch (Broadwell, Haswell, Rome) there is a clear advantage to faulting local pages of the native page size over remote hugepages. It also has the added effect of allowing khugepaged to collapse it into a hugepage later if fragmentation allows (the reason why khugepaged cares about NUMA locality, the same reason I do). This is the rationale for __GFP_THISNODE for thp allocations. For order == pageblock_order (or more correctly order >= pageblock_order), this is not based on NUMA whatsoever but is rather based on the implementation of memory compaction. If it has already failed (or was deferred for order-HPAGE_PMD_ORDER), reclaim cannot be shown to help if memory compaction cannot utilize the freed memory in isolate_freepages(), so that reclaim has been pointless. If compaction fails for other reasons (any unmovable page preventing a pageblock from becoming free), *all* reclaim activity has been pointless. 
> It'd be nicer if whatever compaction latency optimization that applies > to THP could also apply to all other allocation orders too and the > hardcoding of the THP order prevents that. > > On the same lines if __GFP_THISNODE is so badly needed by > MADV_HUGEPAGE, all other larger order allocations should also be able > to take advantage of __GFP_THISNODE without ending in the same VM > corner cases that required the "order == pageblock_order" hardcoding > inside the allocator. > > If you prefer David's patch I would suggest pageblock_order to be > replaced with HPAGE_PMD_ORDER so it's more likely to match the THP > order in all archs. > That sounds fine and I will do that in my v2.
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon, 3 Dec 2018, Andrea Arcangeli wrote: > It's trivial to reproduce the badness by running a memhog process that > allocates more than the RAM of 1 NUMA node, under defrag=always > setting (or by changing memhog to use MADV_HUGEPAGE) and it'll create > swap storms despite 75% of the RAM is completely free in a 4 node NUMA > (or 50% of RAM free in a 2 node NUMA) etc.. > > How can it be ok to push the system into gigabytes of swap by default > without any special capability despite 50% - 75% or more of the RAM is > free? That's the downside of the __GFP_THISNODE optimizaton. > The swap storm is the issue that is being addressed. If your remote memory is as low as local memory, the patch to clear __GFP_THISNODE has done nothing to fix it: you still get swap storms and memory compaction can still fail if the per-zone freeing scanner cannot utilize the reclaimed memory. Recall that this patch to clear __GFP_THISNODE was measured by me to have a 40% increase in allocation latency for fragmented remote memory on Haswell. It makes the problem much, much worse. > __GFP_THISNODE helps increasing NUMA locality if your app can fit in a > single node which is the common David's workload. But if his workload > would more often than not fit in a single node, he would also run into > an unacceptable slowdown because of the __GFP_THISNODE. > Which is why I have suggested that we do not do direct reclaim, as the page allocator implementation expects all thp page fault allocations to have __GFP_NORETRY set, because no amount of reclaim can be shown to be useful to the memory compaction freeing scanner if it is iterated over by the migration scanner. > I think there's lots of room for improvement for the future, but in my > view that __GFP_THISNODE as it was implemented was an incomplete hack, > that opened the door for bad VM corner cases that should not happen. 
> __GFP_THISNODE is intended specifically to avoid the remote access latency increase that is incurred if you fault remote hugepages instead of local pages of the native page size.
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon, Dec 03, 2018 at 11:28:07AM -0800, Linus Torvalds wrote: > On Mon, Dec 3, 2018 at 10:59 AM Michal Hocko wrote: > > > > You are misinterpreting my words. I haven't dismissed anything. I do > > recognize both usecases under discussion. > > > > I have merely said that a better THP locality needs more work and during > > the review discussion I have even volunteered to work on that. > > We have two known patches that seem to have no real downsides. > > One is the patch posted by Andrea earlier in this thread, which seems > to target just this known regression. For the short term the important thing is to fix the VM regression one way or another, I don't personally mind which way. > The other seems to be to revert commit ac5b2c1891 and instead apply > > > https://lore.kernel.org/lkml/alpine.deb.2.21.1810081303060.221...@chino.kir.corp.google.com/ > > which also seems to be sensible. In my earlier review of David's patch, it looked runtime equivalent to the __GFP_COMPACT_ONLY solution. It has the only advantage of adding a new gfpflag until we're sure we need it but it's the worst solution available for the long term in my view. It'd be ok to apply it as stop-gap measure though. The "order == pageblock_order" hardcoding inside the allocator to workaround the __GFP_THISNODE flag passed from outside the allocator in the THP MADV_HUGEPAGE case, didn't look very attractive because it's not just THP allocating order >0 pages. It'd be nicer if whatever compaction latency optimization that applies to THP could also apply to all other allocation orders too and the hardcoding of the THP order prevents that. On the same lines if __GFP_THISNODE is so badly needed by MADV_HUGEPAGE, all other larger order allocations should also be able to take advantage of __GFP_THISNODE without ending in the same VM corner cases that required the "order == pageblock_order" hardcoding inside the allocator. 
If you prefer David's patch I would suggest pageblock_order to be replaced with HPAGE_PMD_ORDER so it's more likely to match the THP order in all archs. Thanks, Andrea
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon, Dec 3, 2018 at 10:59 AM Michal Hocko wrote: > > You are misinterpreting my words. I haven't dismissed anything. I do > recognize both usecases under discussion. > > I have merely said that a better THP locality needs more work and during > the review discussion I have even volunteered to work on that. We have two known patches that seem to have no real downsides. One is the patch posted by Andrea earlier in this thread, which seems to target just this known regression. The other seems to be to revert commit ac5b2c1891 and instead apply https://lore.kernel.org/lkml/alpine.deb.2.21.1810081303060.221...@chino.kir.corp.google.com/ which also seems to be sensible. I'm not seeing why the KVM load would react badly to either of those models, and they are known to fix the google local-node issue. Linus
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon, Dec 03, 2018 at 07:59:54PM +0100, Michal Hocko wrote: > I have merely said that a better THP locality needs more work and during > the review discussion I have even volunteered to work on that. There > are other reclaim related fixes under work right now. All I am saying > is that MADV_TRANSHUGE having numa locality implications cannot satisfy > all the usecases and it is particularly KVM that suffers from it. I'd like to clarify that it's not just KVM; we found it with KVM because for KVM it's fairly common to create VMs that won't possibly fit in a single node, while most other apps don't tend to allocate that much memory. It's trivial to reproduce the badness by running a memhog process that allocates more than the RAM of 1 NUMA node, under the defrag=always setting (or by changing memhog to use MADV_HUGEPAGE), and it'll create swap storms even though 75% of the RAM is completely free in a 4 node NUMA (or 50% of RAM free in a 2 node NUMA) etc.. How can it be ok to push the system into gigabytes of swap by default, without any special capability, while 50% - 75% or more of the RAM is free? That's the downside of the __GFP_THISNODE optimization. __GFP_THISNODE helps increase NUMA locality if your app can fit in a single node, which is David's common workload. But if his workload would more often than not fail to fit in a single node, he would also run into an unacceptable slowdown because of the __GFP_THISNODE. I think there's lots of room for improvement for the future, but in my view __GFP_THISNODE as it was implemented was an incomplete hack that opened the door for bad VM corner cases that should not happen. It also would be nice to have a reproducer for David's workload; the software to run the binary on THP is not released either. We have lots of reproducers for the corner case introduced by the __GFP_THISNODE trick. So this is basically a revert of the commit that made MADV_HUGEPAGE with __GFP_THISNODE behave like a privileged (although not as static) mbind. 
I provided an alternative but we weren't sure if that was the best long term solution that could satisfy everyone because it does have some drawback too. Thanks, Andrea
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon 03-12-18 10:45:35, Linus Torvalds wrote: > On Mon, Dec 3, 2018 at 10:30 AM Michal Hocko wrote: > > > > I do not get it. 5265047ac301 which this patch effectively reverts has > > regressed kvm workloads. People started to notice only later because > > they were not running on kernels with that commit until later. We have > > 4.4 based kernels reports. What do you propose to do for those people? > > We have at least two patches that others claim to fix things. > > You dismissed them and said "can't be done". You are misinterpreting my words. I haven't dismissed anything. I do recognize both usecases under discussion. I have merely said that a better THP locality needs more work and during the review discussion I have even volunteered to work on that. There are other reclaim related fixes under work right now. All I am saying is that MADV_TRANSHUGE having numa locality implications cannot satisfy all the usecases and it is particularly KVM that suffers from it. -- Michal Hocko SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon, Dec 3, 2018 at 10:30 AM Michal Hocko wrote: > > I do not get it. 5265047ac301 which this patch effectively reverts has > regressed kvm workloads. People started to notice only later because > they were not running on kernels with that commit until later. We have > 4.4 based kernels reports. What do you propose to do for those people? We have at least two patches that others claim to fix things. You dismissed them and said "can't be done". As a result, I'm not really interested in this discussion. Linus
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon 03-12-18 10:19:55, Linus Torvalds wrote: > On Mon, Dec 3, 2018 at 10:15 AM Michal Hocko wrote: > > > > The thing is that there is no universal win here. There are two > > different types of workloads and we cannot satisfy both. > > Ok, if that's the case, then I'll just revert the commit. > > Michal, our rules are very simple: we don't generate regressions. It's > better to have old reliable behavior than to start creating *new* > problems. I do not get it. 5265047ac301, which this patch effectively reverts, has regressed kvm workloads. People started to notice only later because they were not running on kernels with that commit until later. We have reports from 4.4 based kernels. What do you propose to do for those people? Let me remind you that it was David who introduced 5265047ac301, presumably because his workload benefits from it. -- Michal Hocko SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon, Dec 3, 2018 at 10:15 AM Michal Hocko wrote: > > The thing is that there is no universal win here. There are two > different types of workloads and we cannot satisfy both. Ok, if that's the case, then I'll just revert the commit. Michal, our rules are very simple: we don't generate regressions. It's better to have old reliable behavior than to start creating *new* problems. Linus
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon 03-12-18 10:01:18, Linus Torvalds wrote: > On Wed, Nov 28, 2018 at 8:48 AM Linus Torvalds > wrote: > > > > On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying wrote: > > > > > > In general, memory allocation fairness among processes should be a good > > > thing. So I think the report should have been a "performance > > > improvement" instead of "performance regression". > > > > Hey, when you put it that way... > > > > Let's ignore this issue for now, and see if it shows up in some real > > workload and people complain. > > Well, David Rientjes points out that it *does* cause real problems in > real workloads, so it's not just this benchmark run that shows the > issue. The thing is that there is no universal win here. There are two different types of workloads and we cannot satisfy both. This has been discussed at length during the review process. David's workload makes some assumptions about the MADV_HUGEPAGE NUMA placement. There are other workloads like KVM setups which do not really require that, and those are the ones which regressed. The prevalent consensus was that NUMA placement is not really implied by MADV_HUGEPAGE because a) this has never been documented or intended behavior and b) it is not a universal win (basically the same as node/zone_reclaim which used to be on by default on some NUMA setups). Reverting the patch would regress another class of workloads. As we cannot satisfy both, I believe we should make the API clear and favor the more relaxed workloads. Those with special requirements should have a proper API to reflect that (this is our general NUMA policy pattern already). -- Michal Hocko SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, Nov 28, 2018 at 8:48 AM Linus Torvalds wrote: > > On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying wrote: > > > > In general, memory allocation fairness among processes should be a good > > thing. So I think the report should have been a "performance > > improvement" instead of "performance regression". > > Hey, when you put it that way... > > Let's ignore this issue for now, and see if it shows up in some real > workload and people complain. Well, David Rientjes points out that it *does* cause real problems in real workloads, so it's not just this benchmark run that shows the issue. So I guess we should revert, or at least fix. David, please post your numbers again in public along with your suggested solution... Linus
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, 28 Nov 2018, Linus Torvalds wrote: > On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying wrote: > > > > From the above data, for the parent commit 3 processes exited within > > 14s, another 3 exited within 100s. For this commit, the first process > > exited at 203s. That is, this commit makes memory allocation more fair > > among processes, so that processes proceeded at more similar speed. But > > this raises system memory footprint too, so triggered much more swap, > > thus lower benchmark score. > > > > In general, memory allocation fairness among processes should be a good > > thing. So I think the report should have been a "performance > > improvement" instead of "performance regression". > > Hey, when you put it that way... > > Let's ignore this issue for now, and see if it shows up in some real > workload and people complain. > Well, I originally complained[*] when the change was first proposed and when the stable backports were proposed[**]. On a fragmented host, the change itself showed a 13.9% access latency regression on Haswell and up to 40% allocation latency regression. This is more substantial on Naples and Rome. I also measured similar numbers to this for Haswell. We are particularly hit hard by this because we have libraries that remap the text segment of binaries to hugepages; hugetlbfs is not widely used so this normally falls back to transparent hugepages. We mmap(), madvise(MADV_HUGEPAGE), memcpy(), mremap(). We fully accept the latency to do this when the binary starts because the access latency at runtime is so much better. With this change, however, we have no userspace workaround other than mbind() to prefer the local node. On all of our platforms, native sized pages are always a win over remote hugepages and it leaves open the opportunity that we collapse memory into hugepages later by khugepaged if fragmentation is the issue. 
mbind() is not viable if the local node is saturated: we are ok with falling back to remote pages of the native page size when the local node is oom, whereas using mbind() to retain the old behavior would result in an oom kill instead. Given this severe access and allocation latency regression, we must revert this patch in our own kernel; there is simply no path forward without doing so. [*] https://marc.info/?l=linux-kernel&m=153868420126775 [**] https://marc.info/?l=linux-kernel&m=154269994800842
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Wed, Nov 28, 2018 at 08:48:46AM -0800, Linus Torvalds wrote: > On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying wrote: > > > > From the above data, for the parent commit 3 processes exited within > > 14s, another 3 exited within 100s. For this commit, the first process > > exited at 203s. That is, this commit makes memory allocation more fair > > among processes, so that processes proceeded at more similar speed. But > > this raises system memory footprint too, so triggered much more swap, > > thus lower benchmark score. Ok, so it was the previous, more unfair behavior that increased overall performance. It was also unclear to me that this was a full swap storm test. > > In general, memory allocation fairness among processes should be a good > > thing. So I think the report should have been a "performance > > improvement" instead of "performance regression". > > Hey, when you put it that way... > > Let's ignore this issue for now, and see if it shows up in some real > workload and people complain. Agreed. With regard to the other question about 4.4 backports, 4.0 didn't have __GFP_THISNODE, so this will still revert to the previous behavior and it won't risk landing in uncharted territory. So there should be no major concern for the backports. We should still work on improving this area; for now the idea was to apply a strict hotfix that would just revert to the previous behavior without introducing new features and new APIs, which also has the side effect of diminishing THP utilization under MADV_HUGEPAGE. Thanks! Andrea
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying wrote: > > From the above data, for the parent commit 3 processes exited within > 14s, another 3 exited within 100s. For this commit, the first process > exited at 203s. That is, this commit makes memory allocation more fair > among processes, so that processes proceeded at more similar speed. But > this raises system memory footprint too, so triggered much more swap, > thus lower benchmark score. > > In general, memory allocation fairness among processes should be a good > thing. So I think the report should have been a "performance > improvement" instead of "performance regression". Hey, when you put it that way... Let's ignore this issue for now, and see if it shows up in some real workload and people complain. Linus
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Tue 27-11-18 14:50:05, Linus Torvalds wrote: > On Tue, Nov 27, 2018 at 12:57 PM Andrea Arcangeli wrote: > > > > This difference can only happen with defrag=always, and that's not the > > current upstream default. > > Ok, thanks. That makes it a bit less critical. > > > That MADV_HUGEPAGE causes flights with NUMA balancing is not great > > indeed, qemu needs NUMA locality too, but then the badness caused by > > __GFP_THISNODE was a larger regression in the worst case for qemu. > [...] > > So the short term alternative again would be the alternate patch that > > does __GFP_THISNODE|GFP_ONLY_COMPACT appended below. > > Sounds like we should probably do this. Particularly since Vlastimil > pointed out that we'd otherwise have issues with the back-port for 4.4 > where that "defrag=always" was the default. > > The patch doesn't look horrible, and it directly addresses this > particular issue. > > Is there some reason we wouldn't want to do it? We have discussed it previously and the biggest concern was that it introduces a new GFP flag with a very weird and one-off semantic. Anytime we have done that in the past it basically kicked back because people started to use such a flag and any further changes were really hard to do. So I would really prefer some more systematic solution. And I believe we can do that here. MADV_HUGEPAGE (resp. THP always enabled) has gained a local memory policy with the patch which got effectively reverted. I do believe that conflating "I want THP" with "I want them local" is just wrong from the API point of view. There are different classes of usecases which obviously disagree on the latter. So I believe that a long term solution should introduce a MPOL_NODE_RECLAIM kind of policy. It would effectively reclaim local nodes (within NODE_RECLAIM distance) before falling back to other nodes. Apart from that we need a less disruptive reclaim driven by compaction, and Mel is already working on that AFAIK. -- Michal Hocko SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Andrea Arcangeli writes: > Hi Linus, > > On Tue, Nov 27, 2018 at 09:08:50AM -0800, Linus Torvalds wrote: >> On Mon, Nov 26, 2018 at 10:24 PM kernel test robot >> wrote: >> > >> > FYI, we noticed a -61.3% regression of vm-scalability.throughput due >> > to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for >> > MADV_HUGEPAGE mappings") >> >> Well, that's certainly noticeable and not good. > > Noticed this email too. > > This difference can only happen with defrag=always, and that's not the > current upstream default. > > thp_enabled: always > thp_defrag: always > ^^ emulates MADV_HUGEPAGE always set > >> Andrea, I suspect it might be causing fights with auto numa migration.. > > That MADV_HUGEPAGE causes flights with NUMA balancing is not great > indeed, qemu needs NUMA locality too, but then the badness caused by > __GFP_THISNODE was a larger regression in the worst case for qemu. > > When the kernel wants to allocate a THP from node A, if there are no > THP generated on node A but there are in node B, they'll be picked from > node B now. > > __GFP_THISNODE previously prevented any THP to be allocated from any > node except A. This introduces a higher chance of initial misplacement > which NUMA balancing will correct over time, but it should only apply > to long lived allocations under MADV_HUGEPAGE. Perhaps the workload > here uses short lived allocations and sets defrag=always which is not > optimal to begin with? > > The motivation of the fix, is that the previous code invoked reclaim > with __GFP_THISNODE set. That looked insecure and such behavior should > only have been possible under a mlock/mempolicy > capability. __GFP_THISNODE is like a transient but still privileged > mbind for reclaim. > > Before the fix, __GFP_THISNODE would end up swapping out everything > from node A to free 4k pages from node A, despite perhaps there were > gigabytes of memory free in node B. 
> That caused severe regression to > threaded workloads whose memory spanned more than one NUMA node. So > again going back doesn't sound great for NUMA in general. > > The vmscalability test is most certainly not including any > multithreaded process whose memory doesn't fit in a single NUMA node > or we'd see also the other side of the tradeoff. It'd be nice to add > such a test to be sure that the old __GFP_THISNODE behavior won't > happen again for apps that don't fit in a single node. The test case tests the swap subsystem: tens of processes (32 in the test job) are run to eat memory and trigger swapping to an NVMe disk. The memory size to eat is almost the same for this commit and its parent. But I found that much more swap is triggered for this commit. 70934968 ± 10% +51.7% 1.076e+08 ± 3% proc-vmstat.pswpout One possibility is that for the parent commit, some processes exit much earlier than other processes, so the total memory requirement of the whole system is much lower. So I dug more into the test log and found the following. For the parent commit,

$ grep 'usecs =' vm-scalability
24573771360 bytes / 13189705 usecs = 1819435 KB/s
24573771360 bytes / 13853913 usecs = 1732205 KB/s
24573771360 bytes / 42953388 usecs = 558694 KB/s
24573771360 bytes / 52782761 usecs = 454652 KB/s
24573771360 bytes / 84026989 usecs = 285596 KB/s
24573771360 bytes / 111677310 usecs = 214885 KB/s
24573771360 bytes / 146084698 usecs = 164273 KB/s
24573771360 bytes / 146978329 usecs = 163274 KB/s
24573771360 bytes / 149371224 usecs = 160658 KB/s
24573771360 bytes / 162892070 usecs = 147323 KB/s
24573771360 bytes / 177949001 usecs = 134857 KB/s
24573771360 bytes / 181729992 usecs = 132052 KB/s
24573771360 bytes / 189812698 usecs = 126428 KB/s
24573771360 bytes / 190992698 usecs = 125647 KB/s
24573771360 bytes / 200039238 usecs = 119965 KB/s
24573771360 bytes / 201254461 usecs = 119241 KB/s
24573771360 bytes / 202825077 usecs = 118317 KB/s
24573771360 bytes / 203441285 usecs = 117959 KB/s
24573771360 bytes / 205378150 usecs = 116847 KB/s
24573771360 bytes / 204840555 usecs = 117153 KB/s
24573771360 bytes / 206235458 usecs = 116361 KB/s
24573771360 bytes / 206419877 usecs = 116257 KB/s
24573771360 bytes / 206619347 usecs = 116145 KB/s
24573771360 bytes / 206942267 usecs = 115963 KB/s
24573771360 bytes / 210289229 usecs = 114118 KB/s
24573771360 bytes / 210504531 usecs = 114001 KB/s
24573771360 bytes / 210521351 usecs = 113992 KB/s
24573771360 bytes / 211012852 usecs = 113726 KB/s
24573771360 bytes / 211547509 usecs = 113439 KB/s
24573771360 bytes / 212179521 usecs = 113101 KB/s
24573771360 bytes / 212907825 usecs = 112714 KB/s
24573771360 bytes / 215558786 usecs = 111328 KB/s

For this commit,

$ grep 'usecs =' vm-scalability
24573681072 bytes / 203705073 usecs = 117806 KB/s
24573681072 bytes / 216146130 usecs = 111025 KB/s
24573681072 bytes / 257234408 usecs = 93291 KB/s
24573681072 bytes / 259530715 usecs = 92465 KB/s
24573681072 bytes / 261335046 usecs = 91827 KB/s
24573681072 bytes / 260134706 usecs = 92251 KB/s
2457368
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Tue, Nov 27, 2018 at 12:57 PM Andrea Arcangeli wrote: > > This difference can only happen with defrag=always, and that's not the > current upstream default. Ok, thanks. That makes it a bit less critical. > That MADV_HUGEPAGE causes flights with NUMA balancing is not great > indeed, qemu needs NUMA locality too, but then the badness caused by > __GFP_THISNODE was a larger regression in the worst case for qemu. [...] > So the short term alternative again would be the alternate patch that > does __GFP_THISNODE|GFP_ONLY_COMPACT appended below. Sounds like we should probably do this. Particularly since Vlastimil pointed out that we'd otherwise have issues with the back-port for 4.4 where that "defrag=always" was the default. The patch doesn't look horrible, and it directly addresses this particular issue. Is there some reason we wouldn't want to do it? Linus
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Hi Linus, On Tue, Nov 27, 2018 at 09:08:50AM -0800, Linus Torvalds wrote: > On Mon, Nov 26, 2018 at 10:24 PM kernel test robot > wrote: > > > > FYI, we noticed a -61.3% regression of vm-scalability.throughput due > > to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for > > MADV_HUGEPAGE mappings") > > Well, that's certainly noticeable and not good. Noticed this email too. This difference can only happen with defrag=always, and that's not the current upstream default. thp_enabled: always thp_defrag: always ^^ emulates MADV_HUGEPAGE always set > Andrea, I suspect it might be causing fights with auto numa migration.. That MADV_HUGEPAGE causes fights with NUMA balancing is not great indeed, qemu needs NUMA locality too, but then the badness caused by __GFP_THISNODE was a larger regression in the worst case for qemu. When the kernel wants to allocate a THP from node A, if there are no THP generated on node A but there are in node B, they'll be picked from node B now. __GFP_THISNODE previously prevented any THP from being allocated from any node except A. This introduces a higher chance of initial misplacement which NUMA balancing will correct over time, but it should only apply to long lived allocations under MADV_HUGEPAGE. Perhaps the workload here uses short lived allocations and sets defrag=always which is not optimal to begin with? The motivation of the fix is that the previous code invoked reclaim with __GFP_THISNODE set. That looked insecure and such behavior should only have been possible under a mlock/mempolicy capability. __GFP_THISNODE is like a transient but still privileged mbind for reclaim. Before the fix, __GFP_THISNODE would end up swapping out everything from node A to free 4k pages from node A, despite there perhaps being gigabytes of memory free in node B. That caused severe regression to threaded workloads whose memory spanned more than one NUMA node. So again going back doesn't sound great for NUMA in general.
The vmscalability test is most certainly not including any multithreaded process whose memory doesn't fit in a single NUMA node, or we'd also see the other side of the tradeoff. It'd be nice to add such a test to be sure that the old __GFP_THISNODE behavior won't happen again for apps that don't fit in a single node.

> Lots more system time, but also look at this:
>
> > 1122389 ± 9%   +17.2%  1315380 ± 4%   proc-vmstat.numa_hit
> >  214722 ± 5%   +21.6%   261076 ± 3%   proc-vmstat.numa_huge_pte_updates
> > 1108142 ± 9%   +17.4%  1300857 ± 4%   proc-vmstat.numa_local
> >  145368 ± 48%  +63.1%   237050 ± 17%  proc-vmstat.numa_miss
> >  159615 ± 44%  +57.6%   251573 ± 16%  proc-vmstat.numa_other
> >  185.50 ± 81%  +8278.6%  15542 ± 40%  proc-vmstat.numa_pages_migrated
>
> Should the commit be reverted? Or perhaps at least modified?

I proposed two solutions; the other one required a new minor feature: __GFP_ONLY_COMPACT. That solution wouldn't regress like above, though the THP utilization ratio would decrease (it still had margin for improvement). Kirill preferred __GFP_ONLY_COMPACT; I was mostly neutral because it's a tradeoff. So the short term alternative again would be the alternate patch that does __GFP_THISNODE|__GFP_ONLY_COMPACT appended below. There's no particular problem in restricting compaction to the local node during a THP allocation, as long as reclaim is skipped entirely. David implemented a hardcoded version of GFP_COMPACTONLY too which was runtime equivalent, but it was hardcoded for THP only in the allocator, and it looks less flexible to hardcode it for THP. The current fix you merged is simpler overall and puts us back to a "stable" state without introducing new (minor) features. The below is for further review of the potential alternative (which still has margin for improvement). === From: Andrea Arcangeli Subject: [PATCH 1/2] mm: thp: consolidate policy_nodemask call Just a minor cleanup.
Signed-off-by: Andrea Arcangeli
---
 mm/mempolicy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 01f1a14facc4..d6512ef28cde 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2026,6 +2026,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
+	nmask = policy_nodemask(gfp, pol);
+
 	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
 		int hpage_node = node;
 
@@ -2043,7 +2045,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		    !(pol->flags & MPOL_F_LOCAL))
 			hpage_node = pol->v.preferred_node;
 
-		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On 11/27/18 8:05 PM, Vlastimil Babka wrote: > On 11/27/18 6:08 PM, Linus Torvalds wrote: >> On Mon, Nov 26, 2018 at 10:24 PM kernel test robot >> wrote: >>> >>> FYI, we noticed a -61.3% regression of vm-scalability.throughput due >>> to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for >>> MADV_HUGEPAGE mappings") >> >> Well, that's certainly noticeable and not good. >> >> Andrea, I suspect it might be causing fights with auto numa migration.. >> >> Lots more system time, but also look at this: >> >>> 1122389 ± 9% +17.2% 1315380 ± 4% proc-vmstat.numa_hit >>> 214722 ± 5% +21.6% 261076 ± 3% >>> proc-vmstat.numa_huge_pte_updates >>> 1108142 ± 9% +17.4% 1300857 ± 4% proc-vmstat.numa_local >>> 145368 ± 48% +63.1% 237050 ± 17% proc-vmstat.numa_miss >>> 159615 ± 44% +57.6% 251573 ± 16% proc-vmstat.numa_other >>> 185.50 ± 81% +8278.6% 15542 ± 40% >>> proc-vmstat.numa_pages_migrated >> >> Should the commit be reverted? Or perhaps at least modified? > > This part of the test's config is important: > > thp_defrag: always > > While the commit targets MADV_HUGEPAGE mappings (such as Andrea's > kvm-qemu usecase), with defrag=always, all mappings behave almost as a > MADV_HUGEPAGE mapping. > That's no longer a default for some years now and Specifically, that's 444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a stall-free defrag option") merged in v4.5. So we might actually hit this regression with 4.4 stable backport... > I think nobody recommends it. In the default configuration nothing > changes for non-madvise mappings. > > Vlastimil > >> Linus >> >
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On 11/27/18 6:08 PM, Linus Torvalds wrote: > On Mon, Nov 26, 2018 at 10:24 PM kernel test robot > wrote: >> >> FYI, we noticed a -61.3% regression of vm-scalability.throughput due >> to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for >> MADV_HUGEPAGE mappings") > > Well, that's certainly noticeable and not good. > > Andrea, I suspect it might be causing fights with auto numa migration.. > > Lots more system time, but also look at this:

>> 1122389 ± 9%   +17.2%  1315380 ± 4%   proc-vmstat.numa_hit
>>  214722 ± 5%   +21.6%   261076 ± 3%   proc-vmstat.numa_huge_pte_updates
>> 1108142 ± 9%   +17.4%  1300857 ± 4%   proc-vmstat.numa_local
>>  145368 ± 48%  +63.1%   237050 ± 17%  proc-vmstat.numa_miss
>>  159615 ± 44%  +57.6%   251573 ± 16%  proc-vmstat.numa_other
>>  185.50 ± 81%  +8278.6%  15542 ± 40%  proc-vmstat.numa_pages_migrated

> Should the commit be reverted? Or perhaps at least modified? This part of the test's config is important: thp_defrag: always While the commit targets MADV_HUGEPAGE mappings (such as Andrea's kvm-qemu usecase), with defrag=always, all mappings behave almost as a MADV_HUGEPAGE mapping. That's no longer a default for some years now and I think nobody recommends it. In the default configuration nothing changes for non-madvise mappings. Vlastimil > Linus >
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Tue 27-11-18 19:17:27, Michal Hocko wrote: > On Tue 27-11-18 09:08:50, Linus Torvalds wrote: > > On Mon, Nov 26, 2018 at 10:24 PM kernel test robot > > wrote: > > > > > > FYI, we noticed a -61.3% regression of vm-scalability.throughput due > > > to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for > > > MADV_HUGEPAGE mappings") > > > > Well, that's certainly noticeable and not good. > > > > Andrea, I suspect it might be causing fights with auto numa migration.. > > > > Lots more system time, but also look at this: > > > > >1122389 ± 9% +17.2%1315380 ± 4% proc-vmstat.numa_hit > > > 214722 ± 5% +21.6% 261076 ± 3% > > > proc-vmstat.numa_huge_pte_updates > > >1108142 ± 9% +17.4%1300857 ± 4% proc-vmstat.numa_local > > > 145368 ± 48% +63.1% 237050 ± 17% proc-vmstat.numa_miss > > > 159615 ± 44% +57.6% 251573 ± 16% proc-vmstat.numa_other > > > 185.50 ± 81% +8278.6% 15542 ± 40% > > > proc-vmstat.numa_pages_migrated > > > > Should the commit be reverted? Or perhaps at least modified? > > Well, the commit is trying to revert to the behavior before > 5265047ac301 because there are real usecases that suffered from that > change and bug reports as a result of that. > > will-it-scale is certainly worth considering but it is an artificial > testcase. A higher NUMA miss rate is an expected side effect of the > patch because the fallback to a different NUMA node is more likely. The > __GFP_THISNODE side effect is basically introducing node-reclaim > behavior for THPages. Another thing is that there is no good behavior > for everybody. Reclaim locally vs. THP on a remote node is hard to > tell by default. We have discussed that at length and there were some > conclusions. One of them is that we need a numa policy to tell whether > a expensive localility is preferred over remote allocation. Also we > definitely need a better pro-active defragmentation to allow larger > pages on a local node. This is a work in progress and this patch is a > stop gap fix. Btw. 
the associated discussion is http://lkml.kernel.org/r/20180925120326.24392-1-mho...@kernel.org -- Michal Hocko SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Tue 27-11-18 09:08:50, Linus Torvalds wrote: > On Mon, Nov 26, 2018 at 10:24 PM kernel test robot > wrote: > > > > FYI, we noticed a -61.3% regression of vm-scalability.throughput due > > to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for > > MADV_HUGEPAGE mappings") > > Well, that's certainly noticeable and not good. > > Andrea, I suspect it might be causing fights with auto numa migration.. > > Lots more system time, but also look at this: > > > 1122389 ± 9% +17.2% 1315380 ± 4% proc-vmstat.numa_hit > > 214722 ± 5% +21.6% 261076 ± 3% > > proc-vmstat.numa_huge_pte_updates > > 1108142 ± 9% +17.4% 1300857 ± 4% proc-vmstat.numa_local > > 145368 ± 48% +63.1% 237050 ± 17% proc-vmstat.numa_miss > > 159615 ± 44% +57.6% 251573 ± 16% proc-vmstat.numa_other > > 185.50 ± 81% +8278.6% 15542 ± 40% > > proc-vmstat.numa_pages_migrated > > Should the commit be reverted? Or perhaps at least modified? Well, the commit is trying to revert to the behavior before 5265047ac301 because there are real usecases that suffered from that change and bug reports as a result of that. will-it-scale is certainly worth considering but it is an artificial testcase. A higher NUMA miss rate is an expected side effect of the patch because the fallback to a different NUMA node is more likely. The __GFP_THISNODE side effect is basically introducing node-reclaim behavior for THP pages. Another thing is that there is no good behavior for everybody. Reclaim locally vs. THP on a remote node is hard to tell by default. We have discussed that at length and there were some conclusions. One of them is that we need a numa policy to tell whether an expensive locality is preferred over remote allocation. Also we definitely need a better pro-active defragmentation to allow larger pages on a local node. This is a work in progress and this patch is a stop-gap fix. -- Michal Hocko SUSE Labs
Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
On Mon, Nov 26, 2018 at 10:24 PM kernel test robot wrote: > > FYI, we noticed a -61.3% regression of vm-scalability.throughput due > to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for > MADV_HUGEPAGE mappings") Well, that's certainly noticeable and not good. Andrea, I suspect it might be causing fights with auto numa migration.. Lots more system time, but also look at this:

> 1122389 ± 9%   +17.2%  1315380 ± 4%   proc-vmstat.numa_hit
>  214722 ± 5%   +21.6%   261076 ± 3%   proc-vmstat.numa_huge_pte_updates
> 1108142 ± 9%   +17.4%  1300857 ± 4%   proc-vmstat.numa_local
>  145368 ± 48%  +63.1%   237050 ± 17%  proc-vmstat.numa_miss
>  159615 ± 44%  +57.6%   251573 ± 16%  proc-vmstat.numa_other
>  185.50 ± 81%  +8278.6%  15542 ± 40%  proc-vmstat.numa_pages_migrated

Should the commit be reverted? Or perhaps at least modified? Linus