Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2019-04-15 Thread Michal Hocko
On Thu 06-12-18 15:43:26, David Rientjes wrote:
> On Wed, 5 Dec 2018, Linus Torvalds wrote:
> 
> > > Ok, I've applied David's latest patch.
> > >
> > > I'm not at all objecting to tweaking this further, I just didn't want
> > > to have this regression stand.
> > 
> > Hmm. Can somebody (David?) also perhaps try to state what the
> > different latency impacts end up being? I suspect it's been mentioned
> > several times during the argument, but it would be nice to have a
> > "going forward, this is what I care about" kind of setup for good
> > default behavior.
> > 
> 
> I'm in the process of writing a more complete test case for this, but I 
> benchmarked a few platforms based solely on local hugepages vs local 
> small pages vs remote hugepages.  My previous numbers were based on data 
> from actual workloads.

Has this materialized into anything we can use? We plan to discuss this
particular topic at the LSFMM this year and it would be great to have
something to play with.

I am quite nervous that we have left quite a common case with bad
performance based on a complaint that we cannot really reproduce,
so it is really hard to move on.
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-22 Thread Mel Gorman
On Fri, Dec 21, 2018 at 02:18:45PM -0800, David Rientjes wrote:
> On Fri, 14 Dec 2018, Vlastimil Babka wrote:
> 
> > > It would be interesting to know if anybody has tried using the per-zone 
> > > free_area's to determine migration targets and set a bit if it should be 
> > > considered a migration source or a migration target.  If all pages for a 
> > > pageblock are not on free_areas, they are fully used.
> > 
> > Repurposing/adding a new pageblock bit was in my mind to help multiple
> > compactors not undo each other's work in the scheme where there's no
> > free page scanner, but I didn't implement it yet.
> > 
> 
> It looks like Mel has a series posted that is still implemented with 
> linear scans through memory, so I'm happy to move the discussion there; I 
> think the goal for compaction with regard to this thread is determining 
> whether reclaim in the page allocator would actually be useful, since 
> targeted reclaim to make memory available for isolate_freepages() could be 
> expensive.  I'd hope that we could move in a direction where compaction 
> doesn't care where the pageblock is and does the minimal amount of work 
> possible to make a high-order page available; I'm not sure if that's 
> possible with a linear scan.  I'll take a look at Mel's series though.

That series has evolved significantly because there were a lot of missing
pieces. While it's somewhat ready other than badly written changelogs, I
didn't post it because I'm going offline and wouldn't respond to feedback,
and I imagine others are offline too and unavailable for review. Besides,
the merge window is about to open and I know there are patches in Andrew's
tree for mainline that should be taken into account.

The series is now 25 patches long and covers a lot of pre-requisites that
would be necessary before removing the linear scanner. What is critical
for a purely free-list scanner is that the exit conditions are identified
and the series provides a lot of the pieces. For example, a non-linear
scanner must properly control skip bits and isolate pageblocks from
multiple compaction instances, which this series does.

The main takeaway from the series is that it reduces system CPU usage by
17%, reduces free scan rates by 99.5% and increases THP allocation success
rates by 33%, giving almost 99% allocation success rates. It also:

o Isolates pageblocks for a single compaction instance
o Synchronises async/sync scanners when appropriate to reduce rescanning
o Identifies when a pageblock is being rescanned and is "sticky" and
  makes forward progress instead of looping excessively
o Smarter logic when clearing pageblock skip bits to reduce scanning
o Various different methods for reducing unnecessary scanning
o Better handling of contention
o Avoids compaction of remote nodes in direct compaction context
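The first bullet, isolating a pageblock for a single compaction instance, can be modeled as a toy userspace sketch: the claim set and lock below stand in for what would be a per-pageblock bit in the kernel, and all names (PageblockClaims, try_claim) are invented for this illustration.

```python
# Toy model of "isolates pageblocks for a single compaction instance":
# each compactor atomically claims a pageblock before working on it, so
# two instances never migrate pages within the same block and undo each
# other's work.  A set plus a lock stands in for a pageblock bit.
import threading

class PageblockClaims:
    def __init__(self):
        self._claimed = set()
        self._lock = threading.Lock()

    def try_claim(self, block):
        """Return True iff the caller now owns this pageblock."""
        with self._lock:
            if block in self._claimed:
                return False          # another compactor got here first
            self._claimed.add(block)
            return True

    def release(self, block):
        with self._lock:
            self._claimed.discard(block)
```

A compactor whose try_claim() fails would simply move on to the next candidate block instead of rescanning a block another instance is already emptying.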

If you do not want to wait until the new year, it's at
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git 
mm-fast-compact-v2r15

Preliminary results based on thpscale using MADV_HUGEPAGE to allocate
huge pages on a fragmented system.

thpscale Fault Latencies
                                   4.20.0-rc6     4.20.0-rc6
                               mmotm-20181210 noremote-v2r14
Amean     fault-both-1       864.83 (   0.00%)  1006.88 * -16.43%*
Amean     fault-both-3      3566.05 (   0.00%)  2460.97 *  30.99%*
Amean     fault-both-5      5685.02 (   0.00%)  4052.92 *  28.71%*
Amean     fault-both-7      7289.40 (   0.00%)  5929.65 (  18.65%)
Amean     fault-both-12    10937.46 (   0.00%)  8870.53 (  18.90%)
Amean     fault-both-18    15440.48 (   0.00%) 11464.86 *  25.75%*
Amean     fault-both-24    15345.83 (   0.00%) 13040.01 *  15.03%*
Amean     fault-both-30    20159.73 (   0.00%) 16618.73 *  17.56%*
Amean     fault-both-32    20843.51 (   0.00%) 14401.25 *  30.91%*

Fault latency (either huge or base) is mostly improved even when 32
tasks are trying to allocate huge pages on an 8-CPU single socket
machine where contention is a factor.

thpscale Percentage Faults Huge
                              4.20.0-rc6     4.20.0-rc6
                          mmotm-20181210 noremote-v2r14
Percentage huge-1     96.03 (   0.00%)   96.94 (   0.95%)
Percentage huge-3     71.43 (   0.00%)   95.43 (  33.60%)
Percentage huge-5     70.44 (   0.00%)   96.85 (  37.48%)
Percentage huge-7     70.39 (   0.00%)   94.77 (  34.63%)
Percentage huge-12    71.53 (   0.00%)   98.07 (  37.11%)
Percentage huge-18    70.61 (   0.00%)   98.42 (  39.38%)
Percentage huge-24    71.84 (   0.00%)   97.85 (  36.20%)
Percentage huge-30    69.94 (   0.00%)   98.13 (  40.31%)
Percentage huge-32    66.92 (   0.00%)   97.79 (  46.13%)

96-98% of THP requests get huge pages on request

               4.20.0-rc6     4.20.0-rc6
           mmotm-20181210 noremote-v2r14
User                27.30          27.86
System             192.70         159.42
Elapsed            580.13         571.98


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-21 Thread David Rientjes
On Fri, 14 Dec 2018, Vlastimil Babka wrote:

> > It would be interesting to know if anybody has tried using the per-zone 
> > free_area's to determine migration targets and set a bit if it should be 
> > considered a migration source or a migration target.  If all pages for a 
> > pageblock are not on free_areas, they are fully used.
> 
> Repurposing/adding a new pageblock bit was in my mind to help multiple
> compactors not undo each other's work in the scheme where there's no
> free page scanner, but I didn't implement it yet.
> 

It looks like Mel has a series posted that is still implemented with 
linear scans through memory, so I'm happy to move the discussion there; I 
think the goal for compaction with regard to this thread is determining 
whether reclaim in the page allocator would actually be useful, since 
targeted reclaim to make memory available for isolate_freepages() could be 
expensive.  I'd hope that we could move in a direction where compaction 
doesn't care where the pageblock is and does the minimal amount of work 
possible to make a high-order page available; I'm not sure if that's 
possible with a linear scan.  I'll take a look at Mel's series though.


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-21 Thread David Rientjes
On Fri, 14 Dec 2018, Mel Gorman wrote:

> > In other words, I think there is a lot of potential stranding that occurs 
> > for both scanners that could otherwise result in completely free 
> > pageblocks.  If there is a single movable page present near the end of the 
> > zone in an otherwise fully free pageblock, surely we can do better than 
> > the current implementation that would never consider this very easy to 
> > compact memory.
> > 
> 
> While it's somewhat premature, I posted a series before I had a full set
> of results because it uses free lists to reduce searches and reduces
> interference between multiple scanners. Preliminary results indicated it
> boosted allocation success rates by 20%ish, reduced migration scanning
> by 99% and free scanning by 27%.
> 

Always good to have code to look at, I'll take a closer look.  I've 
unfortunately been distracted with other kernel issues lately :/


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-14 Thread Mel Gorman
On Fri, Dec 14, 2018 at 01:04:11PM -0800, David Rientjes wrote:
> On Wed, 12 Dec 2018, Vlastimil Babka wrote:
> 
> > > Regarding the role of direct reclaim in the allocator, I think we need 
> > > work on the feedback from compaction to determine whether it's 
> > > worthwhile.  
> > > That's difficult because of the point I continue to bring up: 
> > > isolate_freepages() is not necessarily always able to access this freed 
> > > memory.
> > 
> > That's one of the *many* reasons why having free base pages doesn't
> > guarantee compaction success. We can and will improve on that. But I
> > don't think it would be e.g. practical to check the pfns of free pages
> > wrt compaction scanner positions and decide based on that.
> 
> Yeah, agreed.  Rather than proposing that memory only be reclaimed if it's 
> known that it can be made accessible to isolate_freepages(), I'm questioning 
> the implementation of the freeing scanner entirely.
> 
> In other words, I think there is a lot of potential stranding that occurs 
> for both scanners that could otherwise result in completely free 
> pageblocks.  If there is a single movable page present near the end of the 
> zone in an otherwise fully free pageblock, surely we can do better than 
> the current implementation that would never consider this very easy to 
> compact memory.
> 

While it's somewhat premature, I posted a series before I had a full set
of results because it uses free lists to reduce searches and reduces
interference between multiple scanners. Preliminary results indicated it
boosted allocation success rates by 20%ish, reduced migration scanning
by 99% and free scanning by 27%.

> The same problem occurs for the migration scanner where we can iterate 
> over a ton of free memory that is never considered a suitable migration 
> target.  The implementation that attempts to migrate all memory toward the 
> end of the zone penalizes the freeing scanner when it is reset: we just 
> iterate over a ton of used pages.
> 

Yes, partially addressed in series. It can be improved significantly but it
hit a boundary condition near the points where compaction scanners meet. I
dropped the patch in question as it needs more thought on how to deal
with the boundary condition without remigrating the blocks close to it.
Besides, at 14 patches, it would probably be best to get that reviewed
and finalised before building upon it further, so review would be welcome.

> Has anybody tried a migration scanner that isn't linearly based, rather 
> finding the highest-order free page of the same migratetype, iterating the 
> pages of its pageblock, and using this to determine whether the actual 
> migration will be worthwhile or not?  I could imagine pageblock_skip being 
> repurposed for this as the heuristic.
> 

Yes, but it has downsides. Redoing the same work on pageblocks, tracking
state and tracking the exit conditions are tricky. I think it's best to
squeeze the most out of the linear scanning first and the series is the
first step in that.

> It would be interesting to know if anybody has tried using the per-zone 
> free_area's to determine migration targets and set a bit if it should be 
> considered a migration source or a migration target.  If all pages for a 
> pageblock are not on free_areas, they are fully used.
> 

Series has patches which implement something similar to this idea.

-- 
Mel Gorman
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-14 Thread Vlastimil Babka
On 12/14/18 10:04 PM, David Rientjes wrote:
> On Wed, 12 Dec 2018, Vlastimil Babka wrote:

...

> Reclaim likely could be deterministically useful if we consider a redesign 
> of how migration sources and targets are determined in compaction.
> 
> Has anybody tried a migration scanner that isn't linearly based, rather 
> finding the highest-order free page of the same migratetype, iterating the 
> pages of its pageblock, and using this to determine whether the actual 
> migration will be worthwhile or not?

Not exactly that AFAIK, but a year ago in my series [1], patch 6 made
the migration scanner 'prescan' the block of the requested order before
actually trying to isolate anything for migration.

> I could imagine pageblock_skip being 
> repurposed for this as the heuristic.
> 
> Finding migration targets would be more tricky, but if we iterate the 
> pages of the pageblock for low-order free pages and find them to be mostly 
> used, that seems more appropriate than just pushing all memory to the end 
> of the zone?

Agree. That was patch 8/8 of the same series [1].

> It would be interesting to know if anybody has tried using the per-zone 
> free_area's to determine migration targets and set a bit if it should be 
> considered a migration source or a migration target.  If all pages for a 
> pageblock are not on free_areas, they are fully used.

Repurposing/adding a new pageblock bit was in my mind to help multiple
compactors not undo each other's work in the scheme where there's no
free page scanner, but I didn't implement it yet.

>>> otherwise we fail and defer because it wasn't able 
>>> to make a hugepage available.
>>
>> Note that THP fault compaction doesn't actually defer itself, which I
>> think is a weakness of the current implementation and hope that patch 3
>> in my series from yesterday [1] can address that. Because deferring is
>> the general feedback mechanism that we have for suppressing compaction
>> (and thus associated reclaim) in cases it fails for any reason, not just
>> the one you mention. Instead of inspecting failure conditions in detail,
>> which would be costly, it's a simple statistical approach. And when
>> compaction is improved to fail less, deferring automatically also happens
>> less.
>>
> 
> I couldn't get the link to work; unfortunately, I don't think the patch 
> series made it to LKML :/  I do see it archived for linux-mm, though, so 
> I'll take a look, thanks!

Yeah, I forgot to Cc: LKML, but you were also in the direct To: so you
should have received them directly. Also the above-mentioned series, but
that's a year ago. My fault for not returning to it after being done with
the Meltdown fun. I hope to do that soon.

[1] https://marc.info/?l=linux-mm&m=151315560308753

>> [1] https://lkml.kernel.org/r/20181211142941.20500-1-vba...@suse.cz
>>



Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-14 Thread David Rientjes
On Wed, 12 Dec 2018, Vlastimil Babka wrote:

> > Regarding the role of direct reclaim in the allocator, I think we need 
> > work on the feedback from compaction to determine whether it's worthwhile.  
> > That's difficult because of the point I continue to bring up: 
> > isolate_freepages() is not necessarily always able to access this freed 
> > memory.
> 
> That's one of the *many* reasons why having free base pages doesn't
> guarantee compaction success. We can and will improve on that. But I
> don't think it would be e.g. practical to check the pfns of free pages
> wrt compaction scanner positions and decide based on that.

Yeah, agreed.  Rather than proposing that memory only be reclaimed if it's 
known that it can be made accessible to isolate_freepages(), I'm questioning 
the implementation of the freeing scanner entirely.

In other words, I think there is a lot of potential stranding that occurs 
for both scanners that could otherwise result in completely free 
pageblocks.  If there is a single movable page present near the end of the 
zone in an otherwise fully free pageblock, surely we can do better than 
the current implementation that would never consider this very easy to 
compact memory.

For hugepages, we don't care what pageblock we allocate from.  There are 
requirements for MAX_ORDER-1, but I assume we shouldn't optimize for these 
cases (and if CMA has requirements for a migration/freeing scanner 
redesign, I think that can be special cased).

The same problem occurs for the migration scanner where we can iterate 
over a ton of free memory that is never considered a suitable migration 
target.  The implementation that attempts to migrate all memory toward the 
end of the zone penalizes the freeing scanner when it is reset: we just 
iterate over a ton of used pages.

Reclaim likely could be deterministically useful if we consider a redesign 
of how migration sources and targets are determined in compaction.

Has anybody tried a migration scanner that isn't linearly based, rather 
finding the highest-order free page of the same migratetype, iterating the 
pages of its pageblock, and using this to determine whether the actual 
migration will be worthwhile or not?  I could imagine pageblock_skip being 
repurposed for this as the heuristic.

Finding migration targets would be more tricky, but if we iterate the 
pages of the pageblock for low-order free pages and find them to be mostly 
used, that seems more appropriate than just pushing all memory to the end 
of the zone?

It would be interesting to know if anybody has tried using the per-zone 
free_area's to determine migration targets and set a bit if it should be 
considered a migration source or a migration target.  If all pages for a 
pageblock are not on free_areas, they are fully used.
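As a rough, hypothetical model of that free_area-driven idea (all names below are invented for this sketch, and the kernel's per-order struct free_area lists are flattened to a plain list of free PFNs): a pageblock with no pages on any free list is fully used and skipped entirely, a mostly-free block is a cheap migration source to empty completely, and a mostly-used block is a candidate target for filling its remaining holes.

```python
# Derive per-pageblock occupancy purely from the free lists instead of
# linearly scanning PFNs.  PAGES_PER_BLOCK and classify_blocks are
# invented names, not kernel identifiers.
PAGES_PER_BLOCK = 512  # a 2MB pageblock holds 512 4KB base pages

def classify_blocks(nr_blocks, free_pfns, mostly_free=0.75):
    """free_pfns: the PFNs that would appear on the zone's free_areas."""
    free_per_block = [0] * nr_blocks
    for pfn in free_pfns:
        free_per_block[pfn // PAGES_PER_BLOCK] += 1

    sources, targets = [], []
    for block, nr_free in enumerate(free_per_block):
        if nr_free == 0:
            continue  # no page of this block is on a free_area: fully used
        if nr_free >= mostly_free * PAGES_PER_BLOCK:
            sources.append(block)  # nearly empty: cheap to migrate out of
        else:
            targets.append(block)  # mostly used: fill its remaining holes
    return sources, targets
```

The point of the sketch is that the "single movable page in an otherwise free pageblock" case mentioned earlier is found immediately from the free lists, without either scanner walking the zone.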

> > otherwise we fail and defer because it wasn't able 
> > to make a hugepage available.
> 
> Note that THP fault compaction doesn't actually defer itself, which I
> think is a weakness of the current implementation and hope that patch 3
> in my series from yesterday [1] can address that. Because deferring is
> the general feedback mechanism that we have for suppressing compaction
> (and thus associated reclaim) in cases it fails for any reason, not just
> the one you mention. Instead of inspecting failure conditions in detail,
> which would be costly, it's a simple statistical approach. And when
> compaction is improved to fail less, deferring automatically also happens
> less.
> 

I couldn't get the link to work; unfortunately, I don't think the patch 
series made it to LKML :/  I do see it archived for linux-mm, though, so 
I'll take a look, thanks!

> [1] https://lkml.kernel.org/r/20181211142941.20500-1-vba...@suse.cz
> 


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-14 Thread Michal Hocko
On Wed 12-12-18 12:00:16, Andrea Arcangeli wrote:
[...]
> Adding MADV_THISNODE/MADV_NODE_RECLAIM, will guarantee his proprietary
> software binary will run at maximum performance without cache
> interference, and he's happy to accept the risk of massive slowdown in
> case the local node is truly OOM. The fallback, despite being very
> inefficient, will still happen without the OOM killer triggering.

I believe this fits much better into a MPOL_$FOO rather than MADV_$FOO.
But other than that I full agree. There are reasonable usecases for the
node reclaim like behavior. As a bonus you do not get local node only
but all nodes within reclaim distance as well.
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-12 Thread Andrea Arcangeli
On Wed, Dec 12, 2018 at 10:50:51AM +0100, Michal Hocko wrote:
> I can be convinced that larger pages really require a different behavior
> than base pages, but you had better show _real_ numbers on a wider
> variety of workloads to back your claims. I have only heard hand waving and

I agree with your point about node_reclaim, and I think David's complaint
of "I got remote THP instead of local 4k" with our proposed fix is
going to morph into "I got remote 4k instead of local 4k" with his
favorite fix.

Because David stopped calling reclaim with __GFP_THISNODE, the moment
the node is full of pagecache the node_reclaim behavior will go away and
even 4k pages will start to be allocated remote (and because of
__GFP_THISNODE being set in the THP allocation, all readily available or
trivial-to-compact remote THP will be ignored too).

What David needs I think is a way to set __GFP_THISNODE for THP *and
4k* allocations and if both fails in a row with __GFP_THISNODE set, we
need to repeat the whole thing without __GFP_THISNODE set (ideally
with a mask to skip the node that we already scraped down to the
bottom during the initial __GFP_THISNODE pass). This way his
proprietary software binary will work even better than before when the
local node is fragmented and he'll finally be able to get the speedup
from remote THP too in case the local node is truly OOM, but all other
nodes are full of readily available THP.
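The two-pass policy described above can be sketched as a userspace simulation; Node, can_alloc and alloc are toy stand-ins invented here, and a real implementation would of course work in terms of gfp masks and zonelists rather than these objects.

```python
# One local-only (__GFP_THISNODE-like) pass trying THP then base pages,
# then a fallback pass over the remaining nodes.  The local node is not
# retried in pass 2 because pass 1 already scraped it to the bottom.
class Node:
    def __init__(self, nid, free_base, free_huge):
        self.nid, self.free_base, self.free_huge = nid, free_base, free_huge

    def can_alloc(self, order):
        # order > 0 stands for a THP-sized request in this toy model
        return self.free_huge > 0 if order > 0 else self.free_base > 0

def alloc(local, other_nodes, order):
    """Return (nid, order) of a successful allocation, or None."""
    # Pass 1: strictly local, THP first, then base pages.
    for try_order in (order, 0):
        if local.can_alloc(try_order):
            return (local.nid, try_order)
    # Pass 2: drop the local-only restriction, masking out the local node.
    for try_order in (order, 0):
        for node in other_nodes:
            if node.can_alloc(try_order):
                return (node.nid, try_order)
    return None
```

With the local node truly OOM, the second pass still prefers a remote THP over a remote base page, which is the behavior being argued for above.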

To achieve this without a new MADV_THISNODE/MADV_NODE_RECLAIM, we'd
need a way to start with __GFP_THISNODE and then draw the line in
reclaim and decide to drop __GFP_THISNODE when too much pressure
mounts in the local node. But, like you said, that becomes like
node_reclaim, and it would be better if it could be done with an opt-in
like MADV_HUGEPAGE, because not all workloads would benefit from such
extra pagecache reclaim cost (just as not all workloads benefit from
synchronous compaction).

I think some NUMA reclaim mode semantics ended up being embedded and
hidden in THP MADV_HUGEPAGE, but they imposed a massive slowdown on
all workloads that can't cope with the node_reclaim mode behavior
because they don't fit in a node.

Adding MADV_THISNODE/MADV_NODE_RECLAIM, will guarantee his proprietary
software binary will run at maximum performance without cache
interference, and he's happy to accept the risk of massive slowdown in
case the local node is truly OOM. The fallback, despite being very
inefficient, will still happen without the OOM killer triggering.

Thanks,
Andrea


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-12 Thread Andrea Arcangeli
Hello,

I have now found a two-socket EPYC (is this Naples?) to try to confirm
the effect of intra-socket THP.

CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    32
Socket(s):             2
NUMA node(s):          8
NUMA node0 CPU(s): 0-7,64-71
NUMA node1 CPU(s): 8-15,72-79
NUMA node2 CPU(s): 16-23,80-87
NUMA node3 CPU(s): 24-31,88-95
NUMA node4 CPU(s): 32-39,96-103
NUMA node5 CPU(s): 40-47,104-111
NUMA node6 CPU(s): 48-55,112-119
NUMA node7 CPU(s): 56-63,120-127

# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 32658 MB
node 0 free: 31554 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 32767 MB
node 1 free: 31854 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 32767 MB
node 2 free: 31535 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 32767 MB
node 3 free: 31777 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 32767 MB
node 4 free: 31949 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 32767 MB
node 5 free: 31957 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 32767 MB
node 6 free: 31945 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 32767 MB
node 7 free: 31958 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  16  16  16  32  32  32  32 
  1:  16  10  16  16  32  32  32  32 
  2:  16  16  10  16  32  32  32  32 
  3:  16  16  16  10  32  32  32  32 
  4:  32  32  32  32  10  16  16  16 
  5:  32  32  32  32  16  10  16  16 
  6:  32  32  32  32  16  16  10  16 
  7:  32  32  32  32  16  16  16  10 
# for i in 0 8 16 24 32 40 48 56; do numactl -m 0 -C $i /tmp/numa-thp-bench; 
done
random writes MADV_HUGEPAGE 17622885 usec
random writes MADV_NOHUGEPAGE 25316593 usec
random writes MADV_NOHUGEPAGE 25291927 usec
random writes MADV_HUGEPAGE 17672446 usec
random writes MADV_HUGEPAGE 25698555 usec
random writes MADV_NOHUGEPAGE 36413941 usec
random writes MADV_NOHUGEPAGE 36402155 usec
random writes MADV_HUGEPAGE 25689574 usec
random writes MADV_HUGEPAGE 25136558 usec
random writes MADV_NOHUGEPAGE 35562724 usec
random writes MADV_NOHUGEPAGE 35504708 usec
random writes MADV_HUGEPAGE 25123186 usec
random writes MADV_HUGEPAGE 25137002 usec
random writes MADV_NOHUGEPAGE 35577429 usec
random writes MADV_NOHUGEPAGE 35582865 usec
random writes MADV_HUGEPAGE 25116561 usec
random writes MADV_HUGEPAGE 40281721 usec
random writes MADV_NOHUGEPAGE 56891233 usec
random writes MADV_NOHUGEPAGE 56924134 usec
random writes MADV_HUGEPAGE 40286512 usec
random writes MADV_HUGEPAGE 40377662 usec
random writes MADV_NOHUGEPAGE 56731400 usec
random writes MADV_NOHUGEPAGE 56443959 usec
random writes MADV_HUGEPAGE 40379022 usec
random writes MADV_HUGEPAGE 33907588 usec
random writes MADV_NOHUGEPAGE 47609976 usec
random writes MADV_NOHUGEPAGE 47523481 usec
random writes MADV_HUGEPAGE 33881974 usec
random writes MADV_HUGEPAGE 40809719 usec
random writes MADV_NOHUGEPAGE 57148321 usec
random writes MADV_NOHUGEPAGE 57164499 usec
random writes MADV_HUGEPAGE 40802979 usec
# grep EPYC /proc/cpuinfo |head -1
model name  : AMD EPYC 7601 32-Core Processor

I suppose node 0-1-2-3 are socket 0 and node 4-5-6-7 are socket 1.

With the ram kept in nodeid 0, cpuid 0 is NUMA local, cpuid 8,16,24
are NUMA intrasocket remote and cpuid 32 40 48 56 are NUMA
intersocket remote.

local 4k -> local THP: +43.6% improvement

local 4k -> intrasocket remote THP: -1.4%
intrasocket remote 4k -> intrasocket remote THP: +41.6%

local 4k -> intrasocket remote 4k: -30.4%
local THP -> intrasocket remote THP: -31.4%

local 4k -> intersocket remote THP: -37.15% (-25% on node 6?)
intersocket remote 4k -> intersocket remote THP: +41.23%

local 4k -> intersocket remote 4k: -55.5% (-46% on node 6?)
local THP -> intersocket remote THP: -56.25% (-47% on node 6?)

In short intersocket is a whole lot more expensive (4k -55% THP -56%)
than intrasocket (4k -30% THP -31%)... as expected. The benefit of
THP vs 4k remains the same for intersocket (+41.23%), intrasocket
(+41.6%) and local (+43.6%), also as expected.

The above was measured on bare metal; in guests the impact of THP as
usual will be multiplied (I can try to measure that another time).

So while before I couldn't confirm whether THP helped intersocket, I
think I can now confirm on this architecture that it helps just like
intrasocket and local.

Especially intrasocket, the slowdown from remote THP compared to local
4k is a tiny -1.4%, so in theory __GFP_THISNODE would at least need to
switch to a __GFP_THISSOCKET for this architecture... (I'm not suggesting
that, I'm talking in theory.) Intrasocket is even more favorable here
than on a 2 node 1 socket threadripper or a 2 node (2 sockets?) skylake,
in fact even on bare metal.
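The percentage deltas quoted above can be re-derived from the raw numa-thp-bench timings, averaging the two runs per madvise mode; CPUs 0, 8 and 32 are picked as representatives of distance 10 (local), 16 and 32 per the numactl --hardware matrix.

```python
# Re-derive the speedup/slowdown figures from the raw timings (usec,
# lower is better).  Each entry averages the two runs per mode.
timings = {  # cpu: (MADV_HUGEPAGE avg, MADV_NOHUGEPAGE avg)
    0:  ((17622885 + 17672446) / 2, (25316593 + 25291927) / 2),  # local
    8:  ((25698555 + 25689574) / 2, (36413941 + 36402155) / 2),  # distance 16
    32: ((40281721 + 40286512) / 2, (56891233 + 56924134) / 2),  # distance 32
}

def speedup(old, new):
    """Percent change going from a run taking 'old' usec to 'new' usec."""
    return (old / new - 1) * 100

local_thp, local_4k = timings[0]
for cpu, (thp, base) in timings.items():
    # The THP benefit over 4k is ~+41..43% at every distance.
    print(f"cpu {cpu:2}: 4k -> THP {speedup(base, thp):+.1f}%")
# The cost of going remote with 4k pages grows with distance:
print(f"local 4k -> cpu  8 4k: {speedup(local_4k, timings[8][1]):+.1f}%")
print(f"local 4k -> cpu 32 4k: {speedup(local_4k, timings[32][1]):+.1f}%")
```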

Losing the +41% THP benefit

Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-12 Thread Vlastimil Babka
On 12/12/18 1:37 AM, David Rientjes wrote:
> 
> Regarding the role of direct reclaim in the allocator, I think we need 
> work on the feedback from compaction to determine whether it's worthwhile.  
> That's difficult because of the point I continue to bring up: 
> isolate_freepages() is not necessarily always able to access this freed 
> memory.

That's one of the *many* reasons why having free base pages doesn't
guarantee compaction success. We can and will improve on that. But I
don't think it would be e.g. practical to check the pfns of free pages
wrt compaction scanner positions and decide based on that. Also, when you
invoke reclaim, you can't tell those pfns in advance, so I'm not sure
how better feedback from compaction to reclaim for this specific
aspect would be supposed to work.

> But for cases where we get COMPACT_SKIPPED because the order-0 
> watermarks are failing, reclaim *is* likely to have an impact in the 
> success of compaction,

Yes that's the heuristic we rely on.

> otherwise we fail and defer because it wasn't able 
> to make a hugepage available.

Note that THP fault compaction doesn't actually defer itself, which I
think is a weakness of the current implementation and hope that patch 3
in my series from yesterday [1] can address that. Because deferring is
the general feedback mechanism that we have for suppressing compaction
(and thus associated reclaim) in cases it fails for any reason, not just
the one you mention. Instead of inspecting failure conditions in detail,
which would be costly, it's a simple statistical approach. And when
compaction is improved to fail less, deferring automatically also happens
less.

>  [ If we run compaction regardless of the order-0 watermark check and find
>a pageblock where we can likely free a hugepage because it is 
>fragmented movable pages, this is a pretty good indication that reclaim
>is worthwhile iff the reclaimed memory is beyond the migration scanner. ]

I don't think that would be a good direction to pursue, to let scanning
happen even without having the free pages. Also as I've mentioned above,
LRU-based reclaim cannot satisfy your 'iff' condition, unless it
inspected the pfn's it freed, and continued reclaiming until enough of
those beyond migration scanner were freed. Instead IMHO we should look
again into replacing the free scanner with direct allocation from freelists.
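The "direct allocation from freelists" idea might be modeled roughly like this; the dict-of-lists buddy structure is a toy stand-in for the kernel's per-order free_areas, and the migrate_pfn cutoff is only a sketch of the rule that the free scanner's replacement must not hand back pages the migration scanner still has to pass over.

```python
# Pop a migration target straight off a per-order free list instead of
# finding one by a linear PFN scan.
from collections import defaultdict

class ToyZone:
    def __init__(self):
        self.free_area = defaultdict(list)  # order -> [free start PFNs]

    def free_page(self, pfn, order=0):
        self.free_area[order].append(pfn)

    def take_free_target(self, migrate_pfn):
        """Pop a free page beyond the migration scanner to migrate into,
        preferring low orders so large free blocks aren't broken up."""
        for order in sorted(self.free_area):
            for i, pfn in enumerate(self.free_area[order]):
                if pfn > migrate_pfn:
                    del self.free_area[order][i]
                    return (pfn, order)
        return None
```

Finding a target is then a free-list lookup rather than a scan, which is the property the LRU-based-reclaim objection above cannot offer.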

[1] https://lkml.kernel.org/r/20181211142941.20500-1-vba...@suse.cz


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-12 Thread Michal Hocko
On Tue 11-12-18 16:37:22, David Rientjes wrote:
[...]
> Since it depends on the workload, specifically workloads that fit within a 
> single node, I think the reasonable approach would be to have a sane 
> default regardless of the use of MADV_HUGEPAGE or thp defrag settings and 
> then optimize for the minority of cases where the workload does not fit in 
> a single node.  I'm assuming there is no debate about these larger 
> workloads being in the minority, although we have single machines where 
> this encompasses the totality of their workloads.

Your assumption is wrong, I believe. This is the fundamental disagreement
we are discussing here. You are essentially arguing for node_reclaim
(formerly zone_reclaim) behavior for THP pages. All that without any
actual data on a wider variety of workloads. As a matter of _fact_ we
know that node_reclaim behavior is not a suitable default. We made
that mistake in the past and we had to revert that default _exactly_
because a wider variety of workloads suffered from over-reclaim and
performance issues as a result of constant reclaim.

You also haven't explained why you care so much about remote THP
while you do not care about remote base pages (the page allocator
falls back to those as soon as kswapd doesn't keep pace with the
allocation rate; THP, or high-order pages in general, is analogous with
kcompactd doing pro-active compaction). Like the base pages, we do not
want larger pages to fall back to a remote node too easily. There is no
question about that, I believe.

I can be convinced that larger pages really require a different behavior
than base pages, but you had better show _real_ numbers on a wider
variety of workloads to back your claims. I have only heard hand waving and
very vague and quite doubtful numbers for a non-disclosed benchmark
without a clear indication of how it relates to real world workloads. So
color me unconvinced.
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-11 Thread David Rientjes
On Sun, 9 Dec 2018, Andrea Arcangeli wrote:

> You didn't release the proprietary software that depends on
> __GFP_THISNODE behavior and that you're afraid is getting a
> regression.
> 
> Could you at least release with an open source license the benchmark
> software that you must have used to do the above measurement to
> understand why it gives such a weird result on remote THP?
> 

Hi Andrea,

As I said in response to Linus, I'm in the process of writing a more 
complete benchmarking test across all of our platforms for access and 
allocation latency for x86 (both Intel and AMD), POWER8/9, and arm64, and 
doing so on a kernel with minimum overhead (for the allocation latency, I 
want to remove things like mem cgroup overhead from the result).

> On skylake and on the threadripper I can't confirm that there isn't a
> significant benefit from cross socket hugepage over cross socket small
> page.
> 
> Skylake Xeon(R) Gold 5115:
> 
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
> node 0 size: 15602 MB
> node 0 free: 14077 MB
> node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
> node 1 size: 16099 MB
> node 1 free: 15949 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> # numactl -m 0 -C 0 ./numa-thp-bench
> random writes MADV_HUGEPAGE 10109753 usec
> random writes MADV_NOHUGEPAGE 13682041 usec
> random writes MADV_NOHUGEPAGE 13704208 usec
> random writes MADV_HUGEPAGE 10120405 usec
> # numactl -m 0 -C 10 ./numa-thp-bench
> random writes MADV_HUGEPAGE 15393923 usec
> random writes MADV_NOHUGEPAGE 19644793 usec
> random writes MADV_NOHUGEPAGE 19671287 usec
> random writes MADV_HUGEPAGE 15495281 usec
> # grep Xeon /proc/cpuinfo |head -1
> model name  : Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz
> 
> local 4k -> local 2m: +35%
> local 4k -> remote 2m: -11% 
> remote 4k -> remote 2m: +26%
> 
> threadripper 1950x:
> 
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> node 0 size: 15982 MB
> node 0 free: 14422 MB
> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> node 1 size: 16124 MB
> node 1 free: 5357 MB
> node distances:
> node   0   1
>   0:  10  16
>   1:  16  10
> # numactl -m 0 -C 0 /tmp/numa-thp-bench
> random writes MADV_HUGEPAGE 12902667 usec
> random writes MADV_NOHUGEPAGE 17543070 usec
> random writes MADV_NOHUGEPAGE 17568858 usec
> random writes MADV_HUGEPAGE 12896588 usec
> # numactl -m 0 -C 8 /tmp/numa-thp-bench
> random writes MADV_HUGEPAGE 19663515 usec
> random writes MADV_NOHUGEPAGE 27819864 usec
> random writes MADV_NOHUGEPAGE 27844066 usec
> random writes MADV_HUGEPAGE 19662706 usec
> # grep Threadripper /proc/cpuinfo |head -1
> model name  : AMD Ryzen Threadripper 1950X 16-Core Processor
> 
> local 4k -> local 2m: +35%
> local 4k -> remote 2m: -10% 
> remote 4k -> remote 2m: +41%
> 
> Or if you prefer reversed in terms of compute time (negative
> percentage is better in this case):
> 
> local 4k -> local 2m: -26%
> local 4k -> remote 2m: +12%
> remote 4k -> remote 2m: -29%
> 
> It's true that local 4k is generally a win vs remote THP when the
> workload is memory bound for the threadripper too; if anything, the
> threadripper seems even more favorable to remote THP than the skylake Xeon is.
> 

My results are organized slightly differently since they consider local 
hugepages as the baseline, which is what we optimize for: on Broadwell, I've 
obtained more accurate results that show local small pages at +3.8%, 
remote hugepages at +12.8% and remote small pages at +18.8%.  I think we 
both agree that the locality preference for workloads that fit within a 
single node is local hugepage -> local small page -> remote hugepage -> 
remote small page, and that has been unchanged in any of the benchmarking 
results for either of us.

> The above is the host bare metal result. Now let's try guest mode on
> the threadripper. The last two lines seem more reliable (the first
> two lines also need to fault in the guest RAM because the guest
> was fresh booted).
> 
> guest backed by local 2M pages:
> 
> random writes MADV_HUGEPAGE 16025855 usec
> random writes MADV_NOHUGEPAGE 21903002 usec
> random writes MADV_NOHUGEPAGE 19762767 usec
> random writes MADV_HUGEPAGE 15189231 usec
> 
> guest backed by remote 2M pages:
> 
> random writes MADV_HUGEPAGE 25434251 usec
> random writes MADV_NOHUGEPAGE 32404119 usec
> random writes MADV_NOHUGEPAGE 31455592 usec
> random writes MADV_HUGEPAGE 22248304 usec
> 
> guest backed by local 4k pages:
> 
> random writes MADV_HUGEPAGE 28945251 usec
> random writes MADV_NOHUGEPAGE 32217690 usec
> random writes MADV_NOHUGEPAGE 30664731 usec
> random writes MADV_HUGEPAGE 22981082 usec
> 
> guest backed by remote 4k pages:
> 
> random writes MADV_HUGEPAGE 43772939 usec
> random writes MADV_NOHUGEPAGE 52745664 usec
> random writes MADV_NOHUGEPAGE 51632065 usec
> random writes MADV_HUGEPAGE 40263194 usec
> 
> I haven't yet tr

Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-09 Thread Andrea Arcangeli
Hello,

On Sun, Dec 09, 2018 at 04:29:13PM -0800, David Rientjes wrote:
> [..] on this platform, at least, hugepages are 
> preferred on the same socket but there isn't a significant benefit from 
> getting a cross socket hugepage over small page. [..]

You didn't release the proprietary software that depends on
__GFP_THISNODE behavior and that you're afraid is getting a
regression.

Could you at least release with an open source license the benchmark
software that you must have used to do the above measurement to
understand why it gives such a weird result on remote THP?

On skylake and on the threadripper I can't confirm that there isn't a
significant benefit from cross socket hugepage over cross socket small
page.

Skylake Xeon(R) Gold 5115:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 15602 MB
node 0 free: 14077 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 16099 MB
node 1 free: 15949 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
# numactl -m 0 -C 0 ./numa-thp-bench
random writes MADV_HUGEPAGE 10109753 usec
random writes MADV_NOHUGEPAGE 13682041 usec
random writes MADV_NOHUGEPAGE 13704208 usec
random writes MADV_HUGEPAGE 10120405 usec
# numactl -m 0 -C 10 ./numa-thp-bench
random writes MADV_HUGEPAGE 15393923 usec
random writes MADV_NOHUGEPAGE 19644793 usec
random writes MADV_NOHUGEPAGE 19671287 usec
random writes MADV_HUGEPAGE 15495281 usec
# grep Xeon /proc/cpuinfo |head -1
model name  : Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz

local 4k -> local 2m: +35%
local 4k -> remote 2m: -11% 
remote 4k -> remote 2m: +26%

threadripper 1950x:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 15982 MB
node 0 free: 14422 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 16124 MB
node 1 free: 5357 MB
node distances:
node   0   1
  0:  10  16
  1:  16  10
# numactl -m 0 -C 0 /tmp/numa-thp-bench
random writes MADV_HUGEPAGE 12902667 usec
random writes MADV_NOHUGEPAGE 17543070 usec
random writes MADV_NOHUGEPAGE 17568858 usec
random writes MADV_HUGEPAGE 12896588 usec
# numactl -m 0 -C 8 /tmp/numa-thp-bench
random writes MADV_HUGEPAGE 19663515 usec
random writes MADV_NOHUGEPAGE 27819864 usec
random writes MADV_NOHUGEPAGE 27844066 usec
random writes MADV_HUGEPAGE 19662706 usec
# grep Threadripper /proc/cpuinfo |head -1
model name  : AMD Ryzen Threadripper 1950X 16-Core Processor

local 4k -> local 2m: +35%
local 4k -> remote 2m: -10% 
remote 4k -> remote 2m: +41%

Or if you prefer reversed in terms of compute time (negative
percentage is better in this case):

local 4k -> local 2m: -26%
local 4k -> remote 2m: +12%
remote 4k -> remote 2m: -29%

It's true that local 4k is generally a win vs remote THP when the
workload is memory bound for the threadripper too; if anything, the
threadripper seems even more favorable to remote THP than the skylake Xeon is.

The above is the host bare metal result. Now let's try guest mode on
the threadripper. The last two lines seem more reliable (the first
two lines also need to fault in the guest RAM because the guest
was fresh booted).

guest backed by local 2M pages:

random writes MADV_HUGEPAGE 16025855 usec
random writes MADV_NOHUGEPAGE 21903002 usec
random writes MADV_NOHUGEPAGE 19762767 usec
random writes MADV_HUGEPAGE 15189231 usec

guest backed by remote 2M pages:

random writes MADV_HUGEPAGE 25434251 usec
random writes MADV_NOHUGEPAGE 32404119 usec
random writes MADV_NOHUGEPAGE 31455592 usec
random writes MADV_HUGEPAGE 22248304 usec

guest backed by local 4k pages:

random writes MADV_HUGEPAGE 28945251 usec
random writes MADV_NOHUGEPAGE 32217690 usec
random writes MADV_NOHUGEPAGE 30664731 usec
random writes MADV_HUGEPAGE 22981082 usec

guest backed by remote 4k pages:

random writes MADV_HUGEPAGE 43772939 usec
random writes MADV_NOHUGEPAGE 52745664 usec
random writes MADV_NOHUGEPAGE 51632065 usec
random writes MADV_HUGEPAGE 40263194 usec

I haven't yet tried the guest mode on the skylake nor
haswell/broadwell. I can do that too but I don't expect a significant
difference.

On a threadripper guest, remote 2m is practically identical to
local 4k, so shutting down compaction to try to generate local 4k
memory looks like a sure loss.

Even if we ignore the guest mode results completely, and if we don't
assume the workload is able to fit in the node, then with
MADV_HUGEPAGE I think I'd prefer the risk of a -10% slowdown if the
THP page ends up in a remote node over not getting the +41% THP
speedup on remote memory if the pagetable ends up being remote or the
4k page itself ends up being remote over time.

The remaining con of your latest patch is that you eventually also lose
the +35% speedup when compaction is clogged by COMPACT_SKIPPED, which
for a guest mode computation translates into losing the +59% speedup of
having host local THP (when the guest uses 4k pages). khu

Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-09 Thread David Rientjes
On Thu, 6 Dec 2018, Linus Torvalds wrote:

> > On Broadwell, the access latency to local small pages was +5.6%, remote
> > hugepages +16.4%, and remote small pages +19.9%.
> >
> > On Naples, the access latency to local small pages was +4.9%, intrasocket
> > hugepages +10.5%, intrasocket small pages +19.6%, intersocket small pages
> > +26.6%, and intersocket hugepages +29.2%
> 
> Are those two last numbers transposed?
> 
> Or why would small page accesses be *faster* than hugepages for the
> intersocket case?
> 
> Of course, depending on testing, maybe the page itself was remote, but
> the page tables were random, and you happened to get a remote page
> table for the hugepage case?
> 

Yes, it looks like that was the case: if the page tables were from the same 
node as the intersocket remote hugepage, it looks like a ~0.1% increase 
accessing small pages, so basically unchanged.  So this complicates the 
allocation strategy somewhat; on this platform, at least, hugepages are 
preferred on the same socket but there isn't a significant benefit from 
getting a cross socket hugepage over small page.

The typical way this is resolved is based on the SLIT and how the kernel 
defines RECLAIM_DISTANCE.  I'm not sure that we can expect the distances 
between proximity domains to be defined according to this value for a 
one-size-fits-all solution.  I've always thought that RECLAIM_DISTANCE 
should be configurable so that initscripts can actually determine its 
ideal value when using vm.zone_reclaim_mode.

> > So it *appears* from the x86 platforms that NUMA matters much more
> > significantly than hugeness, but remote hugepages are a slight win over
> > remote small pages.  PPC appeared the same wrt the local node but then
> > prefers hugeness over affinity when it comes to remote pages.
> 
> I do think POWER at least historically has much weaker TLB fills, but
> also very costly page table creation/teardown. Constant-time O(1)
> arguments about hash lookups are only worth so much when the constant
> time is pretty big. They've been working on it.
> 
> So at least on POWER, afaik one issue is literally that hugepages made
> the hash setup and teardown situation much better.
> 

I'm still working on the more elaborate test case that will generate these 
results because I think I can use it at boot to determine an ideal 
RECLAIM_DISTANCE.  I can also get numbers for hash vs radix MMU if you're 
interested.

> One thing that might be worth looking at is whether the process itself
> is all that node-local. Maybe we could aim for a policy that says
> "prefer local memory, but if we notice that the accesses to this vma
> aren't all that local, then who cares?".
> 
> IOW, the default could be something more dynamic than just "always use
> __GFP_THISNODE". It could be more along the lines of "start off using
> __GFP_THISNODE, but for longer-lived processes that bounce around
> across nodes, maybe relax it?"
> 

It would allow the use of MPOL_PREFERRED for an exact preference if they 
are known to not be bounced around.  This would be required for processes 
that are bound to the cpus of a single node through cpuset or 
sched_setaffinity() but unconstrained as far as memory is concerned.

The goal of __GFP_THISNODE being the default for thp, however, is that we 
*know* we're going to be accessing it locally at least in the short term, 
perhaps forever.  Any other default would assume the remotely allocated 
hugepage would eventually be accessed locally, otherwise we would have 
been much better off just failing the hugepage allocation and accessing 
small pages.  You could make an assumption that's the case iff the process 
does not fit in its local node, and I think that would be the minority of 
applications.

I guess there could be some heuristic that could determine this based on 
MM_ANONPAGES of Andrea's qemu and zone->zone_pgdat->node_present_pages.  
It feels like something that should be more exactly defined, though, for 
the application to say that it prefers remote hugepages over local 
small pages because it can't access either locally forever anyway.

This was where I suggested a new prctl() mode so that an application can 
prefer remote hugepages because it knows it's larger than the single node 
and that requires no change to the binary itself because it is inherited 
across fork.

The sane default, though, seems to always prefer local allocation, whether 
hugepages or small pages, for the majority of workloads since that's where 
the lowest access latency is.

> Honestly, I think things like vm_policy etc should not be the solution
> - yes, some people may know *exactly* what access patterns they want,
> but for most situations, I think the policy should be that defaults
> "just work".
> 
> In fact, I wish even MADV_HUGEPAGE itself were to approach being a
> no-op with THP.
> 

Besides the NUMA locality of the allocations, we still have the allocation 
latency concern that MADV_HUGEPAGE changes.  The madvise mode 

Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

2018-12-07 Thread David Rientjes
On Fri, 7 Dec 2018, Vlastimil Babka wrote:

> >> But *that* in turn makes for other possible questions:
> >>
> >>  - if the reason we couldn't get a local hugepage is that we're simply
> >> out of local memory (huge *or* small), then maybe a remote hugepage is
> >> better.
> >>
> >>Note that this now implies that the choice can be an issue of "did
> >> the hugepage allocation fail due to fragmentation, or due to the node
> >> being low on memory"
> > How exactly do you tell? Many systems are simply low on memory due to
> > caching. A clean pagecache is quite cheap to reclaim but it can be more
> > expensive to fault in. Do we consider it to be a viable target?
> 
> Compaction can report if it failed (more precisely: was skipped) due to
> low memory, or for other reasons. It doesn't distinguish how easily
> reclaimable is the memory, but I don't think we should reclaim anything
> (see below).
> 

Note that just reclaiming when the order-0 watermark check in 
__compaction_suitable() fails is unfortunately not always sufficient: the 
reclaimed memory also needs to be accessible to isolate_freepages().  For 
order-9 memory, it's possible for isolate_migratepages_block() to skip over 
a ton of free pages that were just reclaimed if there are unmovable pages 
preventing the entire pageblock from being freed.


Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

2018-12-07 Thread Vlastimil Babka
On 12/7/18 8:49 AM, Michal Hocko wrote:
>> But *that* in turn makes for other possible questions:
>>
>>  - if the reason we couldn't get a local hugepage is that we're simply
>> out of local memory (huge *or* small), then maybe a remote hugepage is
>> better.
>>
>>Note that this now implies that the choice can be an issue of "did
>> the hugepage allocation fail due to fragmentation, or due to the node
> >> being low on memory"
> How exactly do you tell? Many systems are simply low on memory due to
> caching. A clean pagecache is quite cheap to reclaim but it can be more
> expensive to fault in. Do we consider it to be a viable target?

Compaction can report if it failed (more precisely: was skipped) due to
low memory, or for other reasons. It doesn't distinguish how easily
reclaimable is the memory, but I don't think we should reclaim anything
(see below).

>> and there is the other question that I asked in the other thread
>> (before subject edit):
>>
>>  - how local is the load to begin with?
>>
>>Relatively shortlived processes - or processes that are explicitly
>> bound to a node - might have different preferences than some
>> long-lived process where the CPU bounces around, and might have
>> different trade-offs for the local vs remote question too.
> Agreed
> 
>> So just based on David's numbers, and some wild handwaving on my part,
>> a slightly more complex, but still very sensible default might be
>> something like
>>
>>  1) try to do a cheap local node hugepage allocation
>>
>> Rationale: everybody agrees this is the best case.
>>
>> But if that fails:
>>
>>  2) look at compacting and the local node, but not very hard.
>>
>> If there's lots of memory on the local node, but synchronous
>> compaction doesn't do anything easily, just fall back to small pages.
> Do we reclaim at this stage or this is mostly GFP_NOWAIT attempt?

I would expect no reclaim, because for non-THP faults we also don't
reclaim the local node before trying to allocate from remote node. If
somebody wants such behavior they can enable the node reclaim mode. THP
faults shouldn't be different in this regard, right?



Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

2018-12-06 Thread Michal Hocko
On Thu 06-12-18 20:31:46, Linus Torvalds wrote:
> [ Oops. different thread for me due to edited subject, so I saw this
> after replying to the earlier email by David ]

Sorry about that, but I really wanted the actual discussion about
semantics clearly distinguished, because the thread had just grown too
large with back and forth that didn't lead anywhere.

> On Thu, Dec 6, 2018 at 1:14 AM Michal Hocko  wrote:
> >
> > MADV_HUGEPAGE changes the picture because the caller expressed a need
> > for THP and is willing to go extra mile to get it.
> 
> Actually, I think MADV_HUGEPAGE should just be
> "TRANSPARENT_HUGEPAGE_ALWAYS but only for this vma".

Yes, that is the case and I didn't want to make the description more
complicated than necessary so I've focused only on the current default.
But historically we have treated defrag=always and MADV_HUGEPAGE the
same.

[...]
> >I believe that something like the below would be sensible
> > 1) THP on a local node with compaction not giving up too early
> > 2) THP on a remote node in NOWAIT mode - so no direct
> >compaction/reclaim (trigger kswapd/kcompactd only for
> >defrag=defer+madvise)
> > 3) fallback to the base page allocation
> 
> That doesn't sound insane to me. That said, the numbers David quoted
> do fairly strongly imply that local small-pages are actually preferred
> to any remote THP pages.

As I and others pointed out elsewhere remote penalty is just a part of
the picture and on its own might be quite misleading. There are other
aspects (TLB pressure, page tables overhead etc) that might amortize the
access latency.

> But *that* in turn makes for other possible questions:
> 
>  - if the reason we couldn't get a local hugepage is that we're simply
> out of local memory (huge *or* small), then maybe a remote hugepage is
> better.
> 
>Note that this now implies that the choice can be an issue of "did
> the hugepage allocation fail due to fragmentation, or due to the node
> being low on memory"

How exactly do you tell? Many systems are simply low on memory due to
caching. A clean pagecache is quite cheap to reclaim but it can be more
expensive to fault in. Do we consider it to be a viable target?

> 
> and there is the other question that I asked in the other thread
> (before subject edit):
> 
>  - how local is the load to begin with?
> 
>Relatively shortlived processes - or processes that are explicitly
> bound to a node - might have different preferences than some
> long-lived process where the CPU bounces around, and might have
> different trade-offs for the local vs remote question too.

Agreed

> So just based on David's numbers, and some wild handwaving on my part,
> a slightly more complex, but still very sensible default might be
> something like
> 
>  1) try to do a cheap local node hugepage allocation
> 
> Rationale: everybody agrees this is the best case.
> 
> But if that fails:
> 
>  2) look at compacting and the local node, but not very hard.
> 
> If there's lots of memory on the local node, but synchronous
> compaction doesn't do anything easily, just fall back to small pages.

Do we reclaim at this stage or this is mostly GFP_NOWAIT attempt?

> Rationale: local memory is generally more important than THP.
> 
> If that fails (ie local node is simply low on memory):
> 
>  3) Try to do remote THP allocation
> 
>  Rationale: Ok, we simply didn't have a lot of local memory, so
> it's not just a question of fragmentation. If it *had* been
> fragmentation, lots of small local pages would have been better than a
> remote THP page.
> 
>  Oops, remote THP allocation failed (possibly after synchronous
> remote compaction, but maybe this is where we do kcompactd).
> 
>  4) Just do any small page, and do reclaim etc. THP isn't happening,
> and it's not a priority when you're starting to feel memory pressure.

If 2) doesn't reclaim heavily (e.g. only try to reclaim clean page
cache) or even do NOWAIT (which would be even better) then I _think_
this sounds sane.

> In general, I really would want to avoid magic kernel command lines
> (or sysfs settings, or whatever) making a huge difference in behavior.
> So I really wish people would see the whole
> 'transparent_hugepage_flags' thing as a way for kernel developers to
> try different settings, not as a way for users to tune their loads.
> 
> Our default should work as sane defaults, we shouldn't have a "ok,
> let's have this sysfs tunable and let people make their own
> decisions". That's a cop-out.

Agreed. I cannot say I am happy with all the ways THP can be tuned. It
is quite confusing to say the least.

-- 
Michal Hocko
SUSE Labs


Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

2018-12-06 Thread Michal Hocko
On Thu 06-12-18 15:49:04, David Rientjes wrote:
> On Thu, 6 Dec 2018, Michal Hocko wrote:
> 
> > MADV_HUGEPAGE changes the picture because the caller expressed a need
> > for THP and is willing to go extra mile to get it. That involves
> > allocation latency and as of now also a potential remote access. We do
> not have complete agreement on the latter but the prevailing argument is
> > that any strong NUMA locality is just reinventing node-reclaim story
> > again or makes THP success rate down the toilet (to quote Mel). I agree
> > that we do not want to fallback to a remote node overeagerly. I believe
> > that something like the below would be sensible
> > 1) THP on a local node with compaction not giving up too early
> > 2) THP on a remote node in NOWAIT mode - so no direct
> >compaction/reclaim (trigger kswapd/kcompactd only for
> >defrag=defer+madvise)
> > 3) fallback to the base page allocation
> > 
> 
> I disagree that MADV_HUGEPAGE should take on any new semantic that 
> overrides the preference of node local memory for a hugepage, which is the 
> nearly four year behavior.  The order of MADV_HUGEPAGE preferences listed 
> above would cause current users to regress who rely on local small page 
> fallback rather than remote hugepages because the access latency is much 
> better.  I think the preference of remote hugepages over local small pages 
> needs to be expressed differently to prevent regression.

Such a model would be broken. It doesn't provide a consistent semantic and
leads to surprising results. MADV_HUGEPAGE with local node binding will
not prevent remote base pages from being used, and you are back to square one.

It has been a huge mistake to merge your __GFP_THISNODE patch back then
in 4.1, especially with an absolute lack of numbers for a variety of
workloads. I still believe we can do better: offer a sane mem policy to
help workloads with higher locality demands, but it is outright wrong
to conflate the demand for THP with the locality semantic.

If this is absolutely a no-go then we need a MADV_HUGEPAGE_SANE...

-- 
Michal Hocko
SUSE Labs


Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

2018-12-06 Thread Linus Torvalds
[ Oops. different thread for me due to edited subject, so I saw this
after replying to the earlier email by David ]

On Thu, Dec 6, 2018 at 1:14 AM Michal Hocko  wrote:
>
> MADV_HUGEPAGE changes the picture because the caller expressed a need
> for THP and is willing to go extra mile to get it.

Actually, I think MADV_HUGEPAGE should just be
"TRANSPARENT_HUGEPAGE_ALWAYS but only for this vma".

So MADV_HUGEPAGE shouldn't change any behavior at all, if the kernel
was built with TRANSPARENT_HUGEPAGE_ALWAYS.

Put another way: even if you decide to run a kernel that does *not*
have that "always THP" (because you presumably think that it's too
blunt an instrument), then MADV_HUGEPAGE says "for _this_ vma, do the
'always THP' behavior".

I think those semantics would be a whole lot easier to explain to
people, and perhaps more importantly, starting off from that kind of
mindset also gives good guidance to what MADV_HUGEPAGE behavior should
be: it should be sane enough that it makes sense as the _default_
behavior for the TRANSPARENT_HUGEPAGE_ALWAYS configuration.

But that also means that no, MADV_HUGEPAGE doesn't really change the
picture. All it does is says "I know that for this vma, THP really
does make sense as a default".

It doesn't say "I _have_ to have THP", exactly like
TRANSPARENT_HUGEPAGE_ALWAYS does not mean that every allocation should
strive to be THP.

>I believe that something like the below would be sensible
> 1) THP on a local node with compaction not giving up too early
> 2) THP on a remote node in NOWAIT mode - so no direct
>compaction/reclaim (trigger kswapd/kcompactd only for
>defrag=defer+madvise)
> 3) fallback to the base page allocation

That doesn't sound insane to me. That said, the numbers David quoted
do fairly strongly imply that local small-pages are actually preferred
to any remote THP pages.

But *that* in turn makes for other possible questions:

 - if the reason we couldn't get a local hugepage is that we're simply
out of local memory (huge *or* small), then maybe a remote hugepage is
better.

   Note that this now implies that the choice can be an issue of "did
the hugepage allocation fail due to fragmentation, or due to the node
being low on memory"

and there is the other question that I asked in the other thread
(before subject edit):

 - how local is the load to begin with?

   Relatively shortlived processes - or processes that are explicitly
bound to a node - might have different preferences than some
long-lived process where the CPU bounces around, and might have
different trade-offs for the local vs remote question too.

So just based on David's numbers, and some wild handwaving on my part,
a slightly more complex, but still very sensible default might be
something like

 1) try to do a cheap local node hugepage allocation

Rationale: everybody agrees this is the best case.

But if that fails:

 2) look at compacting and the local node, but not very hard.

If there's lots of memory on the local node, but synchronous
compaction doesn't do anything easily, just fall back to small pages.

Rationale: local memory is generally more important than THP.

If that fails (ie local node is simply low on memory):

 3) Try to do remote THP allocation

 Rationale: Ok, we simply didn't have a lot of local memory, so
it's not just a question of fragmentation. If it *had* been
fragmentation, lots of small local pages would have been better than a
remote THP page.

 Oops, remote THP allocation failed (possibly after synchronous
remote compaction, but maybe this is where we do kcompactd).

 4) Just do any small page, and do reclaim etc. THP isn't happening,
and it's not a priority when you're starting to feel memory pressure.

In general, I really would want to avoid magic kernel command lines
(or sysfs settings, or whatever) making a huge difference in behavior.
So I really wish people would see the whole
'transparent_hugepage_flags' thing as a way for kernel developers to
try different settings, not as a way for users to tune their loads.

Our default should work as sane defaults, we shouldn't have a "ok,
let's have this sysfs tunable and let people make their own
decisions". That's a cop-out.

Btw, don't get me wrong: I'm not suggesting removing the sysfs knob.
As a debug tool, it's great, where you can ask "ok, do things work
better if you set THP-defrag to defer+madvise".

I'm just saying that we should *not* use that sysfs flag as an excuse
for "ok, if we get the default wrong, people can make their own
defaults". We should strive to do well enough that it really shouldn't
be an issue in normal situations.

 Linus




Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-06 Thread Linus Torvalds
On Thu, Dec 6, 2018 at 3:43 PM David Rientjes  wrote:
>
> On Broadwell, the access latency to local small pages was +5.6%, remote
> hugepages +16.4%, and remote small pages +19.9%.
>
> On Naples, the access latency to local small pages was +4.9%, intrasocket
> hugepages +10.5%, intrasocket small pages +19.6%, intersocket small pages
> +26.6%, and intersocket hugepages +29.2%

Are those two last numbers transposed?

Or why would small page accesses be *faster* than hugepages for the
intersocket case?

Of course, depending on testing, maybe the page itself was remote, but
the page tables were random, and you happened to get a remote page
table for the hugepage case?

> The results on Murano were similar, which is why I suspect Aneesh
> introduced the __GFP_THISNODE requirement for thp in 4.0, which preferred,
> in order, local small pages, remote 1-hop hugepages, remote 2-hop
> hugepages, remote 1-hop small pages, remote 2-hop small pages.

It sounds like on the whole the TLB advantage of hugepages is smaller
than the locality advantage.

Which doesn't surprise me on x86, because TLB costs really are fairly
low. Very good TLB fills, relative to what I've seen elsewhere.

> So it *appears* from the x86 platforms that NUMA matters much more
> significantly than hugeness, but remote hugepages are a slight win over
> remote small pages.  PPC appeared the same wrt the local node but then
> prefers hugeness over affinity when it comes to remote pages.

I do think POWER at least historically has much weaker TLB fills, but
also very costly page table creation/teardown. Constant-time O(1)
arguments about hash lookups are only worth so much when the constant
time is pretty big. They've been working on it.

So at least on POWER, afaik one issue is literally that hugepages made
the hash setup and teardown situation much better.

One thing that might be worth looking at is whether the process itself
is all that node-local. Maybe we could aim for a policy that says
"prefer local memory, but if we notice that the accesses to this vma
aren't all that local, then who cares?".

IOW, the default could be something more dynamic than just "always use
__GFP_THISNODE". It could be more along the lines of "start off using
__GFP_THISNODE, but for longer-lived processes that bounce around
across nodes, maybe relax it?"

I don't think we have that kind of information right now, though, do we?

Honestly, I think things like vm_policy etc should not be the solution
- yes, some people may know *exactly* what access patterns they want,
but for most situations, I think the policy should be that defaults
"just work".

In fact, I wish even MADV_HUGEPAGE itself were to approach being a
no-op with THP.

We already have TRANSPARENT_HUGEPAGE_ALWAYS being the default kconfig
option (but I think it's a bit debatable, because I'm not sure
everybody always agrees about memory use), so on the whole
MADV_HUGEPAGE shouldn't really *do* anything.

 Linus


smime.p7s
Description: S/MIME Cryptographic Signature


Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

2018-12-06 Thread David Rientjes
On Thu, 6 Dec 2018, Michal Hocko wrote:

> MADV_HUGEPAGE changes the picture because the caller expressed a need
> for THP and is willing to go extra mile to get it. That involves
> allocation latency and as of now also a potential remote access. We do
> not have complete agreement on the later but the prevailing argument is
> that any strong NUMA locality is just reinventing node-reclaim story
> again or makes THP success rate down the toilet (to quote Mel). I agree
> that we do not want to fallback to a remote node overeagerly. I believe
> that something like the below would be sensible
>   1) THP on a local node with compaction not giving up too early
>   2) THP on a remote node in NOWAIT mode - so no direct
>  compaction/reclaim (trigger kswapd/kcompactd only for
>  defrag=defer+madvise)
>   3) fallback to the base page allocation
> 

I disagree that MADV_HUGEPAGE should take on any new semantic that 
overrides the preference of node local memory for a hugepage, which is the 
nearly four year behavior.  The order of MADV_HUGEPAGE preferences listed 
above would cause current users to regress who rely on local small page 
fallback rather than remote hugepages because the access latency is much 
better.  I think the preference of remote hugepages over local small pages 
needs to be expressed differently to prevent regression.


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-06 Thread David Rientjes
On Wed, 5 Dec 2018, Linus Torvalds wrote:

> > Ok, I've applied David's latest patch.
> >
> > I'm not at all objecting to tweaking this further, I just didn't want
> > to have this regression stand.
> 
> Hmm. Can somebody (David?) also perhaps try to state what the
> different latency impacts end up being? I suspect it's been mentioned
> several times during the argument, but it would be nice to have a
> "going forward, this is what I care about" kind of setup for good
> default behavior.
> 

I'm in the process of writing a more complete test case for this but I 
benchmarked a few platforms based solely on local small pages vs remote 
hugepages vs remote small pages.  My previous numbers were based on data 
from actual workloads.

For all platforms, local hugepages are the premium, of course.

On Broadwell, the access latency to local small pages was +5.6%, remote 
hugepages +16.4%, and remote small pages +19.9%.

On Naples, the access latency to local small pages was +4.9%, intrasocket 
hugepages +10.5%, intrasocket small pages +19.6%, intersocket small pages 
+26.6%, and intersocket hugepages +29.2%.

The results on Murano were similar, which is why I suspect Aneesh 
introduced the __GFP_THISNODE requirement for thp in 4.0, which preferred, 
in order, local small pages, remote 1-hop hugepages, remote 2-hop 
hugepages, remote 1-hop small pages, remote 2-hop small pages.

So it *appears* from the x86 platforms that NUMA matters much more 
significantly than hugeness, but remote hugepages are a slight win over 
remote small pages.  PPC appeared the same wrt the local node but then 
prefers hugeness over affinity when it comes to remote pages.

Of course this could be much different on platforms I have not tested.  I 
can look at POWER9 but I suspect it will be similar to Murano.

> How much of the problem ends up being about the cost of compaction vs
> the cost of getting a remote node bigpage?
> 
> That would seem to be a fairly major issue, but __GFP_THISNODE affects
> both. It limits compaction to just this node now, in addition to obviously
> limiting the allocation result.
> 
> I realize that we probably do want to just have explicit policies that
> do not exist right now, but what are (a) sane defaults, and (b) sane
> policies?
> 

The common case is that local node allocation, whether huge or small, is 
*always* better.  After that, I assume than some actual measurement of 
access latency at boot would be better than hardcoding a single policy in 
the page allocator for everybody.  On my x86 platforms, it's always a 
simple preference of "try huge, try small, go to the next nearest node, 
repeat".  On my PPC platforms, it's "try local huge, try local small, try 
huge from remaining nodes, try small from remaining nodes."

> For example, if we cannot get a hugepage on this node, but we *do* get
> a node-local small page, is the local memory advantage simply better
> than the possible TLB advantage?
> 
> Because if that's the case (at least commonly), then that in itself is
> a fairly good argument for "hugepage allocations should always be
> THISNODE".
> 
> But David also did mention the actual allocation overhead itself in
> the commit, and maybe the math is more "try to get a local hugepage,
> but if no such thing exists, see if you can get a remote hugepage
> _cheaply_".
> 
> So another model can be "do local-only compaction, but allow non-local
> allocation if the local node doesn't have anything". IOW, if other
> nodes have hugepages available, pick them up, but don't try to compact
> other nodes to do so?
> 

It would be nice if there was a specific policy that was optimal on all 
platforms; since that's not the case, introducing a sane default policy is 
going to require some complexity.

It would likely always make sense to allocate huge over small pages 
remotely when local allocation is not possible both for MADV_HUGEPAGE 
users and non-MADV_HUGEPAGE users.  That would require a restructuring of 
how thp fallback is done which, today, is try to allocate huge locally and 
fail so handle_pte_fault() can take it from there and would obviously 
touch more than just the page allocator.  I *suspect* that's not all that 
common because it's easier to reclaim some pages and fault local small 
pages instead, which always has better access latency.

What's different in this discussion thus far is workloads that do not fit 
into a single node so allocating remote hugepages is actually better than 
constantly reclaiming and compacting locally.  Mempolicies are 
interesting, but I worry about the interaction it would have with small 
page policies because you can only define one mode: we may have a 
combination of default, interleave, bind, and preferred policies for huge 
and small memory and that may become overly complex.

Since these workloads are in the minority and it seems, to me at least, 
that it's a property of the size of the workload rather than a general 
desire for remote hugepages over small pages

Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-06 Thread Vlastimil Babka
On 12/6/18 1:54 AM, Andrea Arcangeli wrote:
> On Wed, Dec 05, 2018 at 04:18:14PM -0800, David Rientjes wrote:
>> On Wed, 5 Dec 2018, Andrea Arcangeli wrote:
>>
>> Note that in addition to COMPACT_SKIPPED that you mention, compaction can 
>> fail with COMPACT_COMPLETE, meaning the full scan has finished without 
>> freeing a hugepage, or COMPACT_DEFERRED, meaning that doing another scan 
>> is unlikely to produce a different result.  COMPACT_SKIPPED makes sense to 
>> do reclaim if it can become accessible to isolate_freepages() and 
>> hopefully another allocator does not allocate from these newly freed pages 
>> before compaction can scan the zone again.  For COMPACT_COMPLETE and 
>> COMPACT_DEFERRED, reclaim is unlikely to ever help.
> 
> The COMPACT_COMPLETE and (COMPACT_PARTIAL_SKIPPED for that matter)
> seems just a mistake in the max() evaluation try_to_compact_pages()
> that let it return COMPACT_COMPLETE and COMPACT_PARTIAL_SKIPPED. I
> think it should just return COMPACT_DEFERRED in those two cases and it
> should be enforced for all prios.
> 
> There are really only 3 cases that matter for the caller:
> 
> 1) succeed -> we got the page
> 2) defer -> we failed (caller won't care about why)
> 3) skipped -> failed because not enough 4k freed -> reclaim must be invoked 
> then
>compaction can be retried
> 
> PARTIAL_SKIPPED/COMPLETE both fall into 2) above so for the caller
> they should be treated the same way. It doesn't seem very concerning
> that it may act as if it succeeded and do a spurious single reclaim
> invocation, but it's good to fix this and take the COMPACT_DEFERRED
> nopage path in the __GFP_NORETRY case.

Yeah good point. I wouldn't change the general logic of
try_to_compact_pages() though, but the condition for __GFP_NORETRY can
simply change to:

if (compact_result != COMPACT_SKIPPED)
 goto nopage;

I can make a patch ASAP together with a few others I think are needed,
that should hopefully avoid the need for __GFP_COMPACT_ONLY or checks
based on order. What's probably unavoidable though is adding back
__GFP_NORETRY for madvised allocations (i.e. partially reverting
2516035499b95), but David was fine with that and your __GFP_ONLY_COMPACT
approach effectively did it too.



MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

2018-12-06 Thread Michal Hocko
On Wed 05-12-18 16:58:02, Linus Torvalds wrote:
[...]
> I realize that we probably do want to just have explicit policies that
> do not exist right now, but what are (a) sane defaults, and (b) sane
> policies?

I would focus on the current default first (which is defrag=madvise).
This means that we only try the cheapest possible THP without
MADV_HUGEPAGE. If there is none we simply fallback. We do restrict to
the local node. I guess there is a general agreement that this is a sane
default.

MADV_HUGEPAGE changes the picture because the caller expressed a need
for THP and is willing to go extra mile to get it. That involves
allocation latency and as of now also a potential remote access. We do
not have complete agreement on the later but the prevailing argument is
that any strong NUMA locality is just reinventing node-reclaim story
again or makes THP success rate down the toilet (to quote Mel). I agree
that we do not want to fallback to a remote node overeagerly. I believe
that something like the below would be sensible
1) THP on a local node with compaction not giving up too early
2) THP on a remote node in NOWAIT mode - so no direct
   compaction/reclaim (trigger kswapd/kcompactd only for
   defrag=defer+madvise)
3) fallback to the base page allocation

This would allow both full memory utilization and try to be as local as
possible. Whoever strongly prefers NUMA locality should be using
MPOL_NODE_RECLAIM (or similar) and that would skip 2) and make 1)
use more aggressive compaction and reclaim.

This will also fit into our existing NUMA api. MPOL_NODE_RECLAIM
wouldn't be restricted to THP obviously. It would act on base pages as
well and it would basically use the same implementation as we have for
the global node_reclaim and make it usable again.

Does this sound at least remotely sane?
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Linus Torvalds
On Wed, Dec 5, 2018 at 3:51 PM Linus Torvalds
 wrote:
>
> Ok, I've applied David's latest patch.
>
> I'm not at all objecting to tweaking this further, I just didn't want
> to have this regression stand.

Hmm. Can somebody (David?) also perhaps try to state what the
different latency impacts end up being? I suspect it's been mentioned
several times during the argument, but it would be nice to have a
"going forward, this is what I care about" kind of setup for good
default behavior.

How much of the problem ends up being about the cost of compaction vs
the cost of getting a remote node bigpage?

That would seem to be a fairly major issue, but __GFP_THISNODE affects
both. It limits compaction to just this node now, in addition to obviously
limiting the allocation result.

I realize that we probably do want to just have explicit policies that
do not exist right now, but what are (a) sane defaults, and (b) sane
policies?

For example, if we cannot get a hugepage on this node, but we *do* get
a node-local small page, is the local memory advantage simply better
than the possible TLB advantage?

Because if that's the case (at least commonly), then that in itself is
a fairly good argument for "hugepage allocations should always be
THISNODE".

But David also did mention the actual allocation overhead itself in
the commit, and maybe the math is more "try to get a local hugepage,
but if no such thing exists, see if you can get a remote hugepage
_cheaply_".

So another model can be "do local-only compaction, but allow non-local
allocation if the local node doesn't have anything". IOW, if other
nodes have hugepages available, pick them up, but don't try to compact
other nodes to do so?

And yet another model might be "do a least-effort thing, give me a
local hugepage if it exists, otherwise fall back to small pages".

So there are different combinations of "try compaction" vs "local-remote".

  Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Andrea Arcangeli
On Wed, Dec 05, 2018 at 04:18:14PM -0800, David Rientjes wrote:
> On Wed, 5 Dec 2018, Andrea Arcangeli wrote:
> 
> > __GFP_COMPACT_ONLY gave an hope it could give some middle ground but
> > it shows awful compaction results, it basically destroys compaction
> > effectiveness and we know why (COMPACT_SKIPPED must call reclaim or
> > compaction can't succeed because there's not enough free memory in the
> > node). If somebody used MADV_HUGEPAGE compaction should still work and
> > not fail like that. Compaction would fail to be effective even in the
> > local node where __GFP_THISNODE didn't fail. Worst of all it'd fail
> > even on non-NUMA systems (that would be easy to fix though by making
> > the HPAGE_PMD_ORDER check conditional to NUMA being enabled at
> > runtime).
> > 
> 
> Note that in addition to COMPACT_SKIPPED that you mention, compaction can 
> fail with COMPACT_COMPLETE, meaning the full scan has finished without 
> freeing a hugepage, or COMPACT_DEFERRED, meaning that doing another scan 
> is unlikely to produce a different result.  COMPACT_SKIPPED makes sense to 
> do reclaim if it can become accessible to isolate_freepages() and 
> hopefully another allocator does not allocate from these newly freed pages 
> before compaction can scan the zone again.  For COMPACT_COMPLETE and 
> COMPACT_DEFERRED, reclaim is unlikely to ever help.

The COMPACT_COMPLETE and (COMPACT_PARTIAL_SKIPPED for that matter)
seems just a mistake in the max() evaluation try_to_compact_pages()
that let it return COMPACT_COMPLETE and COMPACT_PARTIAL_SKIPPED. I
think it should just return COMPACT_DEFERRED in those two cases and it
should be enforced for all prios.

There are really only 3 cases that matter for the caller:

1) succeed -> we got the page
2) defer -> we failed (caller won't care about why)
3) skipped -> failed because not enough 4k freed -> reclaim must be invoked then
   compaction can be retried

PARTIAL_SKIPPED/COMPLETE both fall into 2) above so for the caller
they should be treated the same way. It doesn't seem very concerning
that it may act as if it succeeded and do a spurious single reclaim
invocation, but it's good to fix this and take the COMPACT_DEFERRED
nopage path in the __GFP_NORETRY case.


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread David Rientjes
On Wed, 5 Dec 2018, Andrea Arcangeli wrote:

> __GFP_COMPACT_ONLY gave some hope it could offer a middle ground but
> it shows awful compaction results, it basically destroys compaction
> effectiveness and we know why (COMPACT_SKIPPED must call reclaim or
> compaction can't succeed because there's not enough free memory in the
> node). If somebody used MADV_HUGEPAGE compaction should still work and
> not fail like that. Compaction would fail to be effective even in the
> local node where __GFP_THISNODE didn't fail. Worst of all it'd fail
> even on non-NUMA systems (that would be easy to fix though by making
> the HPAGE_PMD_ORDER check conditional to NUMA being enabled at
> runtime).
> 

Note that in addition to COMPACT_SKIPPED that you mention, compaction can 
fail with COMPACT_COMPLETE, meaning the full scan has finished without 
freeing a hugepage, or COMPACT_DEFERRED, meaning that doing another scan 
is unlikely to produce a different result.  COMPACT_SKIPPED makes sense to 
do reclaim if it can become accessible to isolate_freepages() and 
hopefully another allocator does not allocate from these newly freed pages 
before compaction can scan the zone again.  For COMPACT_COMPLETE and 
COMPACT_DEFERRED, reclaim is unlikely to ever help.


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Andrea Arcangeli
Hello,

On Wed, Dec 05, 2018 at 01:59:32PM -0800, David Rientjes wrote:
> [..] and the kernel test robot has reported, [..]

Just for completeness you may have missed one email:
https://lkml.kernel.org/r/87tvk1yjkp@yhuang-dev.intel.com

'So I think the report should have been a "performance
improvement" instead of "performance regression".'


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Linus Torvalds
On Wed, Dec 5, 2018 at 3:36 PM Andrea Arcangeli  wrote:
>
> Like I said earlier, it's still better to apply __GFP_COMPACT_ONLY or David's
> patch than to return to v4.18 though.

Ok, I've applied David's latest patch.

I'm not at all objecting to tweaking this further, I just didn't want
to have this regression stand.

   Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Andrea Arcangeli
On Wed, Dec 05, 2018 at 02:03:10PM -0800, Linus Torvalds wrote:
> On Wed, Dec 5, 2018 at 12:40 PM Andrea Arcangeli  wrote:
> >
> > So ultimately we decided that the saner behavior that gives the least
> > risk of regression for the short term, until we can do something
> > better, was the one that is already applied upstream.
> 
> You're ignoring the fact that people *did* report things regressed.

I don't ignore regressions. After all, the only reason I touched this
code is that I was asked to fix a regression that made the
upstream kernel unusable in some enterprise workloads with very large
processes. Enterprise releases don't happen every year, so it's normal
that we only noticed a 3-year-old regression last January. The fact that
it's an old regression doesn't make it any less relevant. It took until
August before I had the time to track down this specific
regression, which artificially delayed this by another 8 months.

With regard to David's specific regression I didn't ignore it either,
I just prioritize on which regression has to be fixed with the most
urgency and David's regression is less severe than the one we're
fixing here. I posted below the numbers for the regression that is
more urgent to fix.

Now suppose (like I think is likely) David may be better off setting
__GFP_THISNODE across the board including for 4k pages not just for
THP. I don't think anybody would be ok if we set __GFP_THISNODE on
4k pages too unless it's done under a very specific new MPOL. It'll
probably work even better for him (the cache will be pushed
into remote nodes by 4k allocations too, and even more of the app data
and executable will be in the local NUMA node). But that's unusable
for anything except his specialized workload that tends to fit in a
single node and can accept to pay an incredible slowdown if it ever
spills over (as long as the process is not getting OOM killed he's ok
because it's such an uncommon occurrence for him that he can pay an
extreme cost just to avoid OOM killing). It's totally fine to optimize
such things with an opt-in like a new MPOL that makes those
assumptions about process size, but that's an unacceptable
assumption to impose on all workloads, because it breaks the VM badly
for all workloads that can't fit in a single NUMA node.

> That's the part I find unacceptable. You're saying "we picked
> something that minimized regressions".
> 
> No it didn't. The regression is present and real, and is on a real
> load, not a benchmark.
> 
> So that argument is clearly bogus.

Note that by "this gives the least risk of regression" I never meant the
risk is zero. Obviously we know it's higher than zero. Otherwise David
would have no regression in the first place.

So I stand by my argument that this is what "gives the least risk of
regression" if you're given any workload you know nothing about that
uses MADV_HUGEPAGE and it's benefiting from it and you don't know
beforehand if it can fit or not fit in a single NUMA node.

If you knew for sure it can fit in a single NUMA node, __GFP_THISNODE
would be better, obviously, but the same applies to 4k pages
too... and we're not setting __GFP_THISNODE on 4k allocations under
MPOL_DEFAULT.

So I'm all for fixing David's workload but here we're trying to
generalize an ad-hoc NUMA optimization that isn't necessarily only
applicable to THP order allocations either, like it's a generic good
thing when it isn't.

__GFP_COMPACT_ONLY gave some hope it could offer a middle ground but
it shows awful compaction results, it basically destroys compaction
effectiveness and we know why (COMPACT_SKIPPED must call reclaim or
compaction can't succeed because there's not enough free memory in the
node). If somebody used MADV_HUGEPAGE compaction should still work and
not fail like that. Compaction would fail to be effective even in the
local node where __GFP_THISNODE didn't fail. Worst of all it'd fail
even on non-NUMA systems (that would be easy to fix though by making
the HPAGE_PMD_ORDER check conditional to NUMA being enabled at
runtime).

Like I said earlier, it's still better to apply __GFP_COMPACT_ONLY or David's
patch than to return to v4.18 though.

===
From: Andrea Arcangeli 
To: Andrew Morton 
Cc: linux...@kvack.org,
Alex Williamson ,
David Rientjes ,
Vlastimil Babka 
Subject: [PATCH 1/1] mm: thp: fix transparent_hugepage/defrag = madvise || 
always
Date: Sun, 19 Aug 2018 23:26:40 -0400

qemu uses MADV_HUGEPAGE which allows direct compaction (i.e.
__GFP_DIRECT_RECLAIM is set).

The problem is that direct compaction combined with the NUMA
__GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
hard the local node, instead of failing the allocation if there's no
THP available in the local node.

Such logic was ok until __GFP_THISNODE was added to the THP allocation
path even with MPOL_DEFAULT.

The idea behind the __GFP_THISNODE addition, is that it is better to
provide local memory in PAGE_SIZE units than to use remote NUMA THP
b

Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread David Rientjes
On Wed, 5 Dec 2018, Linus Torvalds wrote:

> > So ultimately we decided that the saner behavior that gives the least
> > risk of regression for the short term, until we can do something
> > better, was the one that is already applied upstream.
> 
> You're ignoring the fact that people *did* report things regressed.
> 
> That's the part I find unacceptable. You're saying "we picked
> something that minimized regressions".
> 
> No it didn't. The regression is present and real, and is on a real
> load, not a benchmark.
> 
> So that argument is clearly bogus.
> 
> I'm going to revert the commit since people apparently seem to be
> ignoring this fundamental issue.
> 
> Real workloads regressed.  The regressions got reported. Ignoring that
> isn't acceptable.
> 

Please allow me to prepare my v2 because it's not a clean revert due to 
the follow-up 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into 
alloc_hugepage_direct_gfpmask") and will incorporate the feedback from 
Michal to not change anything outside of the thp fault path.


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Linus Torvalds
On Wed, Dec 5, 2018 at 12:40 PM Andrea Arcangeli  wrote:
>
> So ultimately we decided that the saner behavior that gives the least
> risk of regression for the short term, until we can do something
> better, was the one that is already applied upstream.

You're ignoring the fact that people *did* report things regressed.

That's the part I find unacceptable. You're saying "we picked
something that minimized regressions".

No it didn't. The regression is present and real, and is on a real
load, not a benchmark.

So that argument is clearly bogus.

I'm going to revert the commit since people apparently seem to be
ignoring this fundamental issue.

Real workloads regressed.  The regressions got reported. Ignoring that
isn't acceptable.

Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread David Rientjes
On Wed, 5 Dec 2018, Andrea Arcangeli wrote:

> > thpscale Percentage Faults Huge
> >                              4.20.0-rc4             4.20.0-rc4
> >                          mmots-20181130       gfpthisnode-v1r1
> > Percentage huge-3        95.14 (   0.00%)        7.94 ( -91.65%)
> > Percentage huge-5        91.28 (   0.00%)        5.00 ( -94.52%)
> > Percentage huge-7        86.87 (   0.00%)        9.36 ( -89.22%)
> > Percentage huge-12       83.36 (   0.00%)       21.03 ( -74.78%)
> > Percentage huge-18       83.04 (   0.00%)       30.73 ( -63.00%)
> > Percentage huge-24       83.74 (   0.00%)       27.47 ( -67.20%)
> > Percentage huge-30       83.66 (   0.00%)       31.85 ( -61.93%)
> > Percentage huge-32       83.89 (   0.00%)       29.09 ( -65.32%)
> > 
> > They're down the toilet. 3 threads are able to get 95% of the requested
> > THP pages with Andrews tree as of Nov 30th. David's patch drops that to
> > 8% success rate.
> 
> This is the downside of David's patch very well exposed above. And
> this will make non-NUMA system regress like above too despite they
> have no issue to begin with (which is probably why nobody noticed the
> trouble with __GFP_THISNODE reclaim until recently, combined with the
> fact most workloads can fit in a single NUMA node).
> 
> So we're effectively crippling down MADV_HUGEPAGE effectiveness on
> non-NUMA (where it cannot help to do so) and on NUMA (as a workaround
> for the false positive swapout storms) because in some workloads and
> systems THP improvements are less significant than NUMA improvements.
> 

For context, you're referring to the patch I posted that is similar to 
__GFP_COMPACT_ONLY and patch 2/2 in my series.  It's not referring to the 
revert of the 4.20-rc commit that relaxes the __GFP_THISNODE restriction 
on thp faults and conflates MADV_HUGEPAGE with NUMA locality.  For 4.20, I 
believe at minimum that patch 1/2 should be merged to restore what we have 
had for three years, stop piling more semantics on top of the intent (or 
perceived intent) of MADV_HUGEPAGE, and address the swap storm issue 
separately.

> The higher fault latency is generally the higher cost you pay to get
> the good initial THP utilization for apps that do long lived
> allocations and in turn can use MADV_HUGEPAGE without downsides. The
> cost of compaction pays off over time.
> 
> Short lived allocations sensitive to the allocation latency should not
> use MADV_HUGEPAGE in the first place. If you don't want high latency
> you shouldn't use MADV_HUGEPAGE and khugepaged already uses
> __GFP_THISNODE but it replaces memory so it has a neutral memory
> footprint, so it's ok with regard to reclaim.
> 

Completely agreed, and is why we want to try synchronous memory compaction 
to try to allocate hugepages locally in our usecases as well.  We aren't 
particularly concerned about the allocation latency, that is secondary to 
the long-lived access latency regression that occurs when you do not set 
__GFP_THISNODE.

> In my view David's workload is the outlier that uses MADV_HUGEPAGE but
> demands low latency and NUMA-local behavior as first priority. If
> your workload fits in the per-socket CPU cache it doesn't matter which
> node it is but it totally matters if you've 2M or 4k tlb. I'm not even
> talking about KVM where THP has a multiplier effect with EPT.
> 

Hm, no, we do not mind the high allocation latency for MADV_HUGEPAGE 
users.  We *do* care about access latency and that is due to NUMA 
locality.  Before commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for 
MADV_HUGEPAGE mappings"), *all* thp faults were done with __GFP_THISNODE 
and had been for at least three years.  That commit conflates 
MADV_HUGEPAGE with a new semantic that it allows remote allocation instead 
of what it has done for three years: try harder synchronously to allocate 
hugepages locally.  We obviously need to address the problem in another 
way and not change long-standing behavior that causes regressions.  Either 
my patch 2/2, __GFP_COMPACT_ONLY, a new mempolicy mode, new madvise mode, 
prctl, etc.

> Even if you make the __GFP_NORETRY change for the HPAGE_PMD_ORDER to
> skip reclaim in David's patch conditional on NUMA being enabled in the
> host (so that it won't cripple THP utilization also on non-NUMA
> systems), imagine that you go in the bios, turn off interleaving to
> enable host NUMA and THP utilization unexpectedly drops significantly
> for your VM.
> 

What's needed is appropriate feedback from memory compaction to determine 
if reclaim is worthwhile: checking only COMPACT_DEFERRED is insufficient.  
We need to determine if compaction has failed due to order-0 low watermark 
checks or whether it simply failed to defragment memory so a hugepage 
could be allocated.  Determining if compaction has failed due to order-0 
low watermark checks is harder than it seems because the reclaimed memory 
may not be accessible by isolate_freepages(); we don't have the ability to 
only reclai

Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Andrea Arcangeli
Hello,

Sorry, it has been challenging to keep up with all fast replies, so
I'll start by answering to the critical result below:

On Tue, Dec 04, 2018 at 10:45:58AM +, Mel Gorman wrote:
> thpscale Percentage Faults Huge
>                              4.20.0-rc4             4.20.0-rc4
>                          mmots-20181130       gfpthisnode-v1r1
> Percentage huge-3        95.14 (   0.00%)        7.94 ( -91.65%)
> Percentage huge-5        91.28 (   0.00%)        5.00 ( -94.52%)
> Percentage huge-7        86.87 (   0.00%)        9.36 ( -89.22%)
> Percentage huge-12       83.36 (   0.00%)       21.03 ( -74.78%)
> Percentage huge-18       83.04 (   0.00%)       30.73 ( -63.00%)
> Percentage huge-24       83.74 (   0.00%)       27.47 ( -67.20%)
> Percentage huge-30       83.66 (   0.00%)       31.85 ( -61.93%)
> Percentage huge-32       83.89 (   0.00%)       29.09 ( -65.32%)
> 
> They're down the toilet. 3 threads are able to get 95% of the requested
> THP pages with Andrew's tree as of Nov 30th. David's patch drops that to
> 8% success rate.

This is the downside of David's patch very well exposed above. And
this will make non-NUMA systems regress like the above too, even though
they have no issue to begin with (which is probably why nobody noticed
the trouble with __GFP_THISNODE reclaim until recently, combined with
the fact that most workloads can fit in a single NUMA node).

So we're effectively crippling MADV_HUGEPAGE's effectiveness on
non-NUMA (where it cannot help to do so) and on NUMA (as a workaround
for the false-positive swapout storms) because in some workloads and
systems THP improvements are less significant than NUMA improvements.

The higher fault latency is generally the cost you pay to get good
initial THP utilization for apps that do long-lived allocations, which
in turn can use MADV_HUGEPAGE without downsides. The cost of compaction
pays off over time.

Short-lived allocations sensitive to allocation latency should not
use MADV_HUGEPAGE in the first place. If you don't want high latency
you shouldn't use MADV_HUGEPAGE. khugepaged already uses
__GFP_THISNODE, but it replaces memory in place so it has a neutral
memory footprint, and is therefore ok with regard to reclaim.

In my view David's workload is the outlier: it uses MADV_HUGEPAGE but
expects low latency and NUMA-local behavior as first priorities. If
your workload fits in the per-socket CPU cache it doesn't matter which
node it is on, but it totally matters whether you have 2M or 4k TLB
entries. I'm not even talking about KVM, where THP has a multiplier
effect with EPT.

Even if you make the __GFP_NORETRY change for HPAGE_PMD_ORDER to
skip reclaim in David's patch conditional on NUMA being enabled in the
host (so that it won't cripple THP utilization also on non-NUMA
systems), imagine that you go into the BIOS, turn off interleaving to
enable host NUMA, and THP utilization unexpectedly drops significantly
for your VM.

The Rome Ryzen architecture has been mentioned several times by David,
but on my Threadripper (not Rome, which AFAIK is only supposed to be
available in 2019) enabling THP made a measurable difference for some
workloads. By contrast, if I turn off NUMA by setting up interleaving
in the DIMMs, I get a barely measurable slowdown. So I'm surprised
there's such a radical difference in behavior on Rome.

Like Mel said, we need to work towards a more complete solution than
setting __GFP_THISNODE from the outside and then turning off reclaim
from the inside. Mel gave examples of things that should happen, that
won't increase allocation latency, and that can't happen with
__GFP_THISNODE.

I'll try to describe again what's going on:

1: The allocator is being asked through __GFP_THISNODE "ignore all
remote nodes for all reclaim and compaction" from the
outside. Compaction then returns COMPACT_SKIPPED and tells the
allocator "I can generate many more huge pages if you reclaim/swapout
2M of anon memory in this node, the only reason I failed to compact
memory is because there aren't enough 4k fragmented pages free in this
zone". The allocator then goes ahead and swaps 2M and invokes
compaction again that succeeds the order 9 allocation fine. Goto 1;

The above keeps running in a loop at every additional page fault of
the app using MADV_HUGEPAGE until all RAM of the node is swapped out
and replaced by THP and all others nodes had 100% free memory,
potentially 100% order 9, but the allocator completely ignored all
other nodes. That is the thing that we're fixing here, because such
swap storms caused massive slowdowns. If the workload can't fit in a
single node, it's like running with only a fraction of the RAM.
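The loop just described can be modeled as a toy simulation; all structures, field names, and numbers below are illustrative stand-ins, not kernel code. Each THP fault either consumes 512 free 4k pages via compaction or first swaps out 2MB of local anon memory, and remote nodes are never consulted:

```c
#include <assert.h>

/* Toy model of one NUMA node; counts are in 4k pages. Illustrative only. */
struct toy_node {
	long free_4k;		/* free base pages compaction can use */
	long anon_4k;		/* resident anonymous memory */
	long swapped_4k;	/* pages pushed to swap by reclaim */
};

/*
 * Simulate THP faults under __GFP_THISNODE as described above: compaction
 * consumes 512 free pages per THP; when they run out, reclaim swaps out
 * 2MB of local anon memory to feed compaction again, and the loop never
 * falls back to other nodes. Returns the number of THPs faulted.
 */
static long fault_thps_thisnode(struct toy_node *node, long want_thp)
{
	long got = 0;

	while (got < want_thp) {
		if (node->free_4k >= 512) {
			node->free_4k -= 512;	/* compaction succeeds */
		} else if (node->anon_4k >= 512) {
			node->anon_4k -= 512;	/* reclaim: swap out 2MB */
			node->swapped_4k += 512;
		} else {
			break;			/* node fully consumed */
		}
		got++;
	}
	return got;
}
```

Running this with a node holding 1024 free pages and 2048 anon pages shows the effect: every last anon page ends up in swap before the loop gives up, while any other node's memory would have gone untouched.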

So David's patch (and __GFP_COMPACT_ONLY) to fix the above swap storm,
inside the allocator skips reclaim entirely when compaction tells "I
can generate one more HPAGE_PMD_ORDER compound page if you
reclaim/swap 2M", if __GFP_NORETRY is set (and makes sure
__GFP_NORETRY is always set for THP). And that however prevents to
generate any more THP globa
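The inside-the-allocator behavior described here can be sketched as a small decision helper. The flag bits, enum values, and function name are illustrative, not the kernel's real gfp.h encoding:

```c
#include <assert.h>

/* Illustrative flag bits and compaction results; not the kernel's real
 * gfp.h encoding. */
#define TOY_GFP_NORETRY		(1u << 0)

enum toy_compact_result {
	TOY_COMPACT_SUCCESS,
	TOY_COMPACT_SKIPPED,	/* "reclaim 2MB and I can compact" */
	TOY_COMPACT_DEFERRED,
};

/*
 * Sketch of the fix described above: when compaction is merely short of
 * order-0 pages (SKIPPED) and the caller passed __GFP_NORETRY, skip
 * reclaim entirely instead of swapping out 2MB per THP fault.
 */
static int should_reclaim_for_thp(unsigned int gfp, enum toy_compact_result r)
{
	if (r != TOY_COMPACT_SKIPPED)
		return 0;	/* reclaim cannot create contiguity here */
	if (gfp & TOY_GFP_NORETRY)
		return 0;	/* the fix: no reclaim, no swap storm */
	return 1;		/* old behavior: reclaim and retry */
}
```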

Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread David Rientjes
On Wed, 5 Dec 2018, Michal Hocko wrote:

> > It isn't specific to MADV_HUGEPAGE, it is the policy for all transparent 
> > hugepage allocations, including defrag=always.  We agree that 
> > MADV_HUGEPAGE is not exactly defined: does it mean try harder to allocate 
> > a hugepage locally, try compaction synchronous to the fault, allow remote 
> > fallback?  It's undefined.
> 
> Yeah, it is certainly underdefined. One thing is clear though. Using
> MADV_HUGEPAGE implies that the specific mapping benefits from THPs and
> is willing to pay associated init cost. This doesn't imply anything
> regarding NUMA locality and as we have NUMA API it shouldn't even
> attempt to do so because it would be conflating two things.

This is exactly why we use MADV_HUGEPAGE when remapping our text segment 
to be backed by transparent hugepages, we want to pay the cost at startup 
to fault thp and that involves synchronous memory compaction rather than 
quickly falling back to remote memory.  This is making the case for me.

> > So to answer "what is so different about THP?", it's the performance data.  
> > The NUMA locality matters more than whether the pages are huge or not.  We 
> > also have the added benefit of khugepaged being able to collapse pages 
> > locally if fragmentation improves rather than being stuck accessing a 
> > remote hugepage forever.
> 
> Please back your claims with a variety of workloads, including the
> mentioned KVM one. You keep hand-waving about access latency while
> completely ignoring all other aspects, and that makes my suspicion that
> you do not really appreciate all the complexity here even stronger.
> 

I discussed the tradeoff of local hugepages vs local pages vs remote 
hugepages in https://marc.info/?l=linux-kernel&m=154077010828940 on 
Broadwell, Haswell, and Rome.  When a single application does not fit on a 
single node, we obviously need to extend the API to allow it to fault 
remotely.  We can do that without changing long-standing behavior that 
prefers to only fault locally and causing real-world users to regress.  
Your suggestions about how we can extend the API are all very logical.

 [ Note that this is not the regression being addressed here, however, which is 
   massive swap storms due to a fragmented local node, which is why the
   __GFP_COMPACT_ONLY patch was also proposed by Andrea.  The ability to
   prefer faulting remotely is a worthwhile extension but it does no
   good whatsoever if we can encounter massive swap storms because we
   didn't set __GFP_NORETRY appropriately (which both of our patches do)
   both locally and now remotely. ]


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Michal Hocko
On Wed 05-12-18 10:43:43, Mel Gorman wrote:
> On Wed, Dec 05, 2018 at 10:08:56AM +0100, Michal Hocko wrote:
> > On Tue 04-12-18 16:47:23, David Rientjes wrote:
> > > On Tue, 4 Dec 2018, Mel Gorman wrote:
> > > 
> > > > What should also be kept in mind is that we should avoid conflating
> > > > locality preferences with THP preferences which is separate from THP
> > > > allocation latencies. The whole __GFP_THISNODE approach is pushing too
> > > > hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag
> > > > are used which is very unfortunate given that MADV_HUGEPAGE in itself says
> > > > nothing about locality -- that is the business of other madvise flags or
> > > > a specific policy.
> > > 
> > > We currently lack those other madvise modes or mempolicies: mbind() is not
> > > a viable alternative because we do not want to oom kill when local memory
> > > is depleted, we want to fallback to remote memory.
> > 
> > Yes, there was a clear agreement that there is no suitable mempolicy
> > right now and there were proposals to introduce MPOL_NODE_RECLAIM to
> > introduce that behavior. This would be an improvement regardless of THP
> > because global node-reclaim policy was simply a disaster we had to turn
> > off by default and the global semantic was a reason people just gave up
> > using it completely.
> > 
> 
> The alternative is to define a clear semantic for THP allocation
> requests that are considered "light" regardless of whether that needs a
> GFP flag or not. A sensible default might be
> 
> o Allocate THP local if the amount of work is light or non-existent.
> o Allocate THP remote if one is freely available with no additional work
>   (maybe kick remote kcompactd)
> o Allocate base page local if the amount of work is light or non-existent
> o Allocate base page remote if the amount of work is light or non-existent
> o Do heavy work in zonelist order until a base page is allocated somewhere

I am not sure about the ordering without deeper consideration, but I
think THP should reflect the approach we have for base pages.

> It's not something that could be clearly expressed with either NORETRY or
> THISNODE but longer-term might be saner than chopping and changing on
> which flags are more important and which workload is most relevant. That
> runs the risk of a revert-loop where each person targeting one workload
> reverts one patch to insert another until someone throws up their hands
> in frustration and just carries patches out-of-tree long-term.

Fully agreed!

> I'm not going to prototype something along these lines for now as
> fundamentally a better compaction could cut out part of the root cause
> of pain.

Yes there is some ground work to be done first.

-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Mel Gorman
On Wed, Dec 05, 2018 at 10:08:56AM +0100, Michal Hocko wrote:
> On Tue 04-12-18 16:47:23, David Rientjes wrote:
> > On Tue, 4 Dec 2018, Mel Gorman wrote:
> > 
> > > What should also be kept in mind is that we should avoid conflating
> > > locality preferences with THP preferences which is separate from THP
> > > allocation latencies. The whole __GFP_THISNODE approach is pushing too
> > > hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag
> > > are used which is very unfortunate given that MADV_HUGEPAGE in itself says
> > > nothing about locality -- that is the business of other madvise flags or
> > > a specific policy.
> > 
> > We currently lack those other madvise modes or mempolicies: mbind() is not 
> > a viable alternative because we do not want to oom kill when local memory 
> > is depleted, we want to fallback to remote memory.
> 
> Yes, there was a clear agreement that there is no suitable mempolicy
> right now and there were proposals to introduce MPOL_NODE_RECLAIM to
> introduce that behavior. This would be an improvement regardless of THP
> because global node-reclaim policy was simply a disaster we had to turn
> off by default and the global semantic was a reason people just gave up
> using it completely.
> 

The alternative is to define a clear semantic for THP allocation
requests that are considered "light" regardless of whether that needs a
GFP flag or not. A sensible default might be

o Allocate THP local if the amount of work is light or non-existent.
o Allocate THP remote if one is freely available with no additional work
  (maybe kick remote kcompactd)
o Allocate base page local if the amount of work is light or non-existent
o Allocate base page remote if the amount of work is light or non-existent
o Do heavy work in zonelist order until a base page is allocated somewhere
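The proposed default ordering above can be written down as a straight-line policy sketch. The struct fields and helper below are hypothetical stand-ins for real allocator feedback; none of these names exist in the kernel, they merely encode the bullet list:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical summary of allocator feedback, encoding the list above. */
struct alloc_feedback {
	int thp_local_light;	/* local THP needs light or no work */
	int thp_remote_free;	/* remote THP free with no extra work */
	int base_local_light;	/* local base page, light or no work */
	int base_remote_light;	/* remote base page, light or no work */
};

/* Walk the proposed default ordering and name the first option that
 * fits; heavy work in zonelist order is the last resort. */
static const char *pick_default(const struct alloc_feedback *f)
{
	if (f->thp_local_light)
		return "thp-local";
	if (f->thp_remote_free)
		return "thp-remote";	/* maybe kick remote kcompactd */
	if (f->base_local_light)
		return "base-local";
	if (f->base_remote_light)
		return "base-remote";
	return "heavy-work-zonelist-order";
}
```

Note the key property: a freely available remote THP outranks a cheap local base page, which is precisely what __GFP_THISNODE forbids.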

It's not something that could be clearly expressed with either NORETRY or
THISNODE but longer-term might be saner than chopping and changing on
which flags are more important and which workload is most relevant. That
runs the risk of a revert-loop where each person targeting one workload
reverts one patch to insert another until someone throws up their hands
in frustration and just carries patches out-of-tree long-term.

I'm not going to prototype something along these lines for now as
fundamentally a better compaction could cut out part of the root cause
of pain.

-- 
Mel Gorman
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Michal Hocko
On Tue 04-12-18 16:07:27, David Rientjes wrote:
> On Tue, 4 Dec 2018, Michal Hocko wrote:
> 
> > The thing I am really up to here is that reintroduction of
> > __GFP_THISNODE, which you are pushing for, will conflate madvise mode
> > resp. defrag=always with a numa placement policy because the allocation
> > doesn't fallback to a remote node.
> > 
> 
> It isn't specific to MADV_HUGEPAGE, it is the policy for all transparent 
> hugepage allocations, including defrag=always.  We agree that 
> MADV_HUGEPAGE is not exactly defined: does it mean try harder to allocate 
> a hugepage locally, try compaction synchronous to the fault, allow remote 
> fallback?  It's undefined.

Yeah, it is certainly underdefined. One thing is clear though. Using
MADV_HUGEPAGE implies that the specific mapping benefits from THPs and
is willing to pay associated init cost. This doesn't imply anything
regarding NUMA locality and as we have NUMA API it shouldn't even
attempt to do so because it would be conflating two things.
[...]

> > And that is a fundamental problem and the antipattern I am talking
> > about. Look at it this way. All normal allocations are utilizing all the
> > available memory even though they might hit a remote latency penalty. If
> > you do care about NUMA placement you have an API to enforce a specific
> > placement.  What is so different about THP to behave differently. Do
> > we really want to later invent an API to actually allow to utilize all
> > the memory? There are certainly usecases (that triggered the discussion
> > previously) that do not mind the remote latency because all other
> > benefits simply outweight it?
> > 
> 
> What is different about THP is that on every platform I have measured, 
> NUMA matters more than hugepages.  Obviously if on Broadwell, Haswell, and 
> Rome, remote hugepages were a performance win over local pages, this 
> discussion would not be happening.  Faulting local pages rather than 
> local hugepages, if possible, is easy and doesn't require reclaim.  
> Faulting remote pages rather than reclaiming local pages is easy in your 
> scenario, it's non-disruptive.

You keep ignoring all other usecases mentioned before and that is not
really helpful. Access cost can be amortized by other savings. Not to
mention NUMA balancing moving around hot THPs with remote accesses.

> So to answer "what is so different about THP?", it's the performance data.  
> The NUMA locality matters more than whether the pages are huge or not.  We 
> also have the added benefit of khugepaged being able to collapse pages 
> locally if fragmentation improves rather than being stuck accessing a 
> remote hugepage forever.

Please back your claims with a variety of workloads, including the
mentioned KVM one. You keep hand-waving about access latency while
completely ignoring all other aspects, and that makes my suspicion that
you do not really appreciate all the complexity here even stronger.

If there was a general consensus that we want to make THP very special
wrt. NUMA locality, I could live with that. It would be an inconsistency
in the API and as such something that will come back to bite us sooner
or later. But it seems that _you_ are the only one pushing in that
direction, and you keep consistently ignoring all other usecases
throughout all the discussions we have had so far. Several people keep
pointing out that this is a wrong direction but that seems to be
completely ignored.

I believe that the only way forward is to back your claims with numbers
covering a larger set of THP users and prove that remote THP is
the wrong default behavior. But you cannot really push that through based
on a single usecase of yours which you refuse to describe beyond a
simple access-latency metric.
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Mel Gorman
On Tue, Dec 04, 2018 at 10:45:58AM +, Mel Gorman wrote:
> I have *one* result of the series on a 1-socket machine running
> "thpscale". It creates a file, punches holes in it to create a
> very light form of fragmentation and then tries THP allocations
> using madvise measuring latency and success rates. It's the
> global-dhp__workload_thpscale-madvhugepage in mmtests using XFS as the
> filesystem.
> 
> thpscale Fault Latencies
> 4.20.0-rc4 4.20.0-rc4
> mmots-20181130   gfpthisnode-v1r1
> Amean fault-base-3  5358.54 (   0.00%) 2408.93 *  55.04%*
> Amean fault-base-5  9742.30 (   0.00%) 3035.25 *  68.84%*
> Amean fault-base-7 13069.18 (   0.00%) 4362.22 *  66.62%*
> Amean fault-base-1214882.53 (   0.00%) 9424.38 *  36.67%*
> Amean fault-base-1815692.75 (   0.00%)16280.03 (  -3.74%)
> Amean fault-base-2428775.11 (   0.00%)18374.84 *  36.14%*
> Amean fault-base-3042056.32 (   0.00%)21984.55 *  47.73%*
> Amean fault-base-3238634.26 (   0.00%)22199.49 *  42.54%*
> Amean fault-huge-1 0.00 (   0.00%)0.00 (   0.00%)
> Amean fault-huge-3  3628.86 (   0.00%)  963.45 *  73.45%*
> Amean fault-huge-5  4926.42 (   0.00%) 2959.85 *  39.92%*
> Amean fault-huge-7  6717.15 (   0.00%) 3828.68 *  43.00%*
> Amean fault-huge-1211393.47 (   0.00%) 5772.92 *  49.33%*
> Amean fault-huge-1816979.38 (   0.00%) 4435.95 *  73.87%*
> Amean fault-huge-2416558.00 (   0.00%) 4416.46 *  73.33%*
> Amean fault-huge-3020351.46 (   0.00%) 5099.73 *  74.94%*
> Amean fault-huge-3223332.54 (   0.00%) 6524.73 *  72.04%*
> 
> So, looks like massive latency improvements but then the THP allocation
> success rates
> 
> thpscale Percentage Faults Huge
>4.20.0-rc4 4.20.0-rc4
>mmots-20181130   gfpthisnode-v1r1
> Percentage huge-395.14 (   0.00%)7.94 ( -91.65%)
> Percentage huge-591.28 (   0.00%)5.00 ( -94.52%)
> Percentage huge-786.87 (   0.00%)9.36 ( -89.22%)
> Percentage huge-12   83.36 (   0.00%)   21.03 ( -74.78%)
> Percentage huge-18   83.04 (   0.00%)   30.73 ( -63.00%)
> Percentage huge-24   83.74 (   0.00%)   27.47 ( -67.20%)
> Percentage huge-30   83.66 (   0.00%)   31.85 ( -61.93%)
> Percentage huge-32   83.89 (   0.00%)   29.09 ( -65.32%)
> 

Other results arrived once the grid caught up and it's a mixed bag of
gains and losses, roughly along the lines predicted by the discussion
already -- namely, locality is better as long as the workload fits,
compaction is reduced, reclaim is reduced, THP allocation success rates
are reduced, but latencies are often better.

Whether this is "good" or "bad" depends on whether you have a workload
that benefits, because it's neither universally good nor bad. It would
still be nice to hear how Andrea fared but I think we'll reach the same
conclusion -- the patches shuffle the problem around with limited effort
to address the root causes, so all we end up changing is the identity of
the person who complains about their workload. One might be tempted to
think that the reduced latencies in some cases are great, but not if the
workload is one that benefits from longer startup costs in exchange for
lower runtime costs in the active phase.

For the much longer answer, I'll focus on the two-socket results because
they are more relevant to the current discussion. The workloads are
not realistic in the slightest, they just happen to trigger some of the
interesting corner cases.

global-dhp__workload_usemem-stress-numa-compact
o Plain anonymous faulting workload
o defrag=always (not representative, simply triggers a bad case)

   4.20.0-rc4 4.20.0-rc4
   mmots-20181130   gfpthisnode-v1r1
Amean Elapsd-1   26.79 (   0.00%)   34.92 * -30.37%*
Amean Elapsd-37.32 (   0.00%)8.10 * -10.61%*
Amean Elapsd-45.53 (   0.00%)5.64 (  -1.94%)

Units are seconds; time to complete is 30.37% worse for the single-threaded
case. There is no direct reclaim activity but other activity is interesting,
so I'll pick out some snippets;

4.20.0-rc4  4.20.0-rc4
  mmots-20181130gfpthisnode-v1r1
Swap Ins 8   0
Swap Outs 1546   0
Allocation stalls0   0
Fragmentation stalls 02022
Direct pages scanned 0   0
Kswapd pages scanned 427191078
Kswapd pages reclaimed   410821049
Page writes by reclaim1546   0
Page writes file 0 

Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Michal Hocko
On Tue 04-12-18 16:47:23, David Rientjes wrote:
> On Tue, 4 Dec 2018, Mel Gorman wrote:
> 
> > What should also be kept in mind is that we should avoid conflating
> > locality preferences with THP preferences which is separate from THP
> > allocation latencies. The whole __GFP_THISNODE approach is pushing too
> > hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag
> > are used which is very unfortunate given that MADV_HUGEPAGE in itself says
> > nothing about locality -- that is the business of other madvise flags or
> > a specific policy.
> 
> We currently lack those other madvise modes or mempolicies: mbind() is not 
> a viable alternative because we do not want to oom kill when local memory 
> is depleted, we want to fallback to remote memory.

Yes, there was a clear agreement that there is no suitable mempolicy
right now and there were proposals to introduce MPOL_NODE_RECLAIM to
introduce that behavior. This would be an improvement regardless of THP
because global node-reclaim policy was simply a disaster we had to turn
off by default and the global semantic was a reason people just gave up
using it completely.

[...]

> Sure, but not at the cost of regressing real-world workloads; what is 
> being asked for here is legitimate and worthy of an extension, but since 
> the long-standing behavior has been to use __GFP_THISNODE and people 
> depend on that for NUMA locality,

Well, your patch has altered the semantic and has introduced a subtle
and _undocumented_ NUMA policy into MADV_HUGEPAGE. All that without any
workload numbers. It would be preferable to have a simulator of those
real-world workloads of course, but even getting some more detailed
metric would help - e.g. without the patch we have X THP utilization and
runtime characteristics Y, but with the patch X1 and Y1.

> can we not fix the swap storm and look 
> to extending the API to include workloads that span multiple nodes?

Yes, we definitely want to address swap storms. No question about that.
But our established approach for NUMA policy has been to fall back to
other nodes, and everybody focused on NUMA latency should use the NUMA
API to achieve that. Not vice versa.

As I've said in other thread, I am OK with restoring __GFP_THISNODE for
now but we should really have a very good plan for further steps. And
that involves an agreed NUMA behavior. I haven't seen any widespread
agreement on that yet though.

[...]
> > I would also re-emphasise that a major problem with addressing this
> > problem is that we do not have a general reproducible test case for
> > David's scenario where as we do have reproduction cases for the others.
> > They're not related to KVM but that doesn't matter because it's enough
> > to have a memory hog try allocating more memory than fits on a single node.
> > 
> 
> It's trivial to reproduce this issue: fragment all local memory that 
> compaction cannot resolve, do posix_memalign() for hugepage aligned 
> memory, and measure the access latency.  To fragment all local memory, you 
> can simply insert a kernel module and allocate high-order memory (just do 
> kmem_cache_alloc_node() or get_page() to pin it so compaction fails or 
> punch holes in the file as you did above).  You can do this for all memory 
> rather than the local node to measure the even more severe allocation 
> latency regression that not setting __GFP_THISNODE introduces.

Sure, but can we get some numbers from a real workload rather than an
artificial worst case? The utilization issue Mel pointed out before and
here again is a real concern IMHO. We definitely need a better
picture to make an educated decision.
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-04 Thread David Rientjes
On Tue, 4 Dec 2018, Mel Gorman wrote:

> What should also be kept in mind is that we should avoid conflating
> locality preferences with THP preferences which is separate from THP
> allocation latencies. The whole __GFP_THISNODE approach is pushing too
> hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag
> are used which is very unfortunate given that MADV_HUGEPAGE in itself says
> nothing about locality -- that is the business of other madvise flags or
> a specific policy.

We currently lack those other madvise modes or mempolicies: mbind() is not 
a viable alternative because we do not want to oom kill when local memory 
is depleted, we want to fallback to remote memory.  In my response to 
Michal, I noted three possible usecases that MADV_HUGEPAGE either 
currently has or has taken before: direct compaction/reclaim, avoid 
increased rss, and allow fallback to remote memory.  It's certainly not 
the business of one madvise mode to define this.  Thus, I'm trying to 
return to the behavior that was in 4.1 and what was restored three years 
ago because suddenly changing the behavior to allow remote allocation 
causes real-world regressions.

> Using remote nodes is bad but reclaiming excessively
> and pushing data out of memory is worse as the latency to fault data back
> from disk is higher than a remote access.
> 

That's discussing two things at the same time: local fragmentation and 
local low-on-memory conditions.  If compaction quickly fails and local 
pages are available as fallback, that requires no reclaim.  If we're truly 
low-on-memory locally then it is obviously better to allocate remotely 
than aggressively reclaim.

> Andrea already pointed it out -- workloads that fit within a node are happy
> to reclaim local memory, particularly in the case where the existing data
> is old which is the ideal for David. Workloads that do not fit within a
> node will often prefer using remote memory -- either THP or base pages
> in the general case and THP for definite in the KVM case. While KVM
> might not like remote memory, using THP at least reduces the page table
> access overhead even if the access is remote and eventually automatic
> NUMA balancing might intervene.
> 

Sure, but not at the cost of regressing real-world workloads; what is 
being asked for here is legitimate and worthy of an extension, but since 
the long-standing behavior has been to use __GFP_THISNODE and people 
depend on that for NUMA locality, can we not fix the swap storm and look 
to extending the API to include workloads that span multiple nodes?

> I have *one* result of the series on a 1-socket machine running
> "thpscale". It creates a file, punches holes in it to create a
> very light form of fragmentation and then tries THP allocations
> using madvise measuring latency and success rates. It's the
> global-dhp__workload_thpscale-madvhugepage in mmtests using XFS as the
> filesystem.
> 
> thpscale Fault Latencies
> 4.20.0-rc4 4.20.0-rc4
> mmots-20181130   gfpthisnode-v1r1
> Amean fault-base-3  5358.54 (   0.00%) 2408.93 *  55.04%*
> Amean fault-base-5  9742.30 (   0.00%) 3035.25 *  68.84%*
> Amean fault-base-7 13069.18 (   0.00%) 4362.22 *  66.62%*
> Amean fault-base-1214882.53 (   0.00%) 9424.38 *  36.67%*
> Amean fault-base-1815692.75 (   0.00%)16280.03 (  -3.74%)
> Amean fault-base-2428775.11 (   0.00%)18374.84 *  36.14%*
> Amean fault-base-3042056.32 (   0.00%)21984.55 *  47.73%*
> Amean fault-base-3238634.26 (   0.00%)22199.49 *  42.54%*
> Amean fault-huge-1 0.00 (   0.00%)0.00 (   0.00%)
> Amean fault-huge-3  3628.86 (   0.00%)  963.45 *  73.45%*
> Amean fault-huge-5  4926.42 (   0.00%) 2959.85 *  39.92%*
> Amean fault-huge-7  6717.15 (   0.00%) 3828.68 *  43.00%*
> Amean fault-huge-1211393.47 (   0.00%) 5772.92 *  49.33%*
> Amean fault-huge-1816979.38 (   0.00%) 4435.95 *  73.87%*
> Amean fault-huge-2416558.00 (   0.00%) 4416.46 *  73.33%*
> Amean fault-huge-3020351.46 (   0.00%) 5099.73 *  74.94%*
> Amean fault-huge-3223332.54 (   0.00%) 6524.73 *  72.04%*
> 
> So, looks like massive latency improvements but then the THP allocation
> success rates
> 
> thpscale Percentage Faults Huge
>4.20.0-rc4 4.20.0-rc4
>mmots-20181130   gfpthisnode-v1r1
> Percentage huge-395.14 (   0.00%)7.94 ( -91.65%)
> Percentage huge-591.28 (   0.00%)5.00 ( -94.52%)
> Percentage huge-786.87 (   0.00%)9.36 ( -89.22%)
> Percentage huge-12   83.36 (   0.00%)   21.03 ( -74.78%)
> Percentage huge-18   83.04 (   0.00%)   30.73 ( -63.00%)
> Percentage huge-24   83.74 (   0.00%)   27.47 ( -67.20%

Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-04 Thread David Rientjes
On Tue, 4 Dec 2018, Michal Hocko wrote:

> The thing I am really up to here is that reintroduction of
> __GFP_THISNODE, which you are pushing for, will conflate madvise mode
> resp. defrag=always with a numa placement policy because the allocation
> doesn't fallback to a remote node.
> 

It isn't specific to MADV_HUGEPAGE, it is the policy for all transparent 
hugepage allocations, including defrag=always.  We agree that 
MADV_HUGEPAGE is not exactly defined: does it mean try harder to allocate 
a hugepage locally, try compaction synchronous to the fault, allow remote 
fallback?  It's undefined.

The original intent was to be used when thp is disabled system wide 
(enabled set to "madvise") because its possible the rss of the process 
increases if backed by thp.  That occurs either if faulting on a hugepage 
aligned area or based on max_ptes_none.  So we have at least three 
possible policies that have evolved over time: preventing increased rss, 
direct compaction, remote fallback.  Certainly not something that fits 
under a single madvise mode.

> And that is a fundamental problem and the antipattern I am talking
> about. Look at it this way. All normal allocations are utilizing all the
> available memory even though they might hit a remote latency penalty. If
> you do care about NUMA placement you have an API to enforce a specific
> placement.  What is so different about THP to behave differently. Do
> we really want to later invent an API to actually allow to utilize all
> the memory? There are certainly usecases (that triggered the discussion
> previously) that do not mind the remote latency because all other
> benefits simply outweight it?
> 

What is different about THP is that on every platform I have measured, 
NUMA matters more than hugepages.  Obviously if on Broadwell, Haswell, and 
Rome, remote hugepages were a performance win over local pages, this 
discussion would not be happening.  Faulting local pages rather than 
local hugepages, if possible, is easy and doesn't require reclaim.  
Faulting remote pages rather than reclaiming local pages is easy in your 
scenario, it's non-disruptive.

So to answer "what is so different about THP?", it's the performance data.  
The NUMA locality matters more than whether the pages are huge or not.  We 
also have the added benefit of khugepaged being able to collapse pages 
locally if fragmentation improves rather than being stuck accessing a 
remote hugepage forever.

> That being said what should users who want to use all the memory do to
> use as many THPs as possible?

If those users want to accept the performance degradation of allocating 
remote hugepages instead of local pages, that should likely be an 
extension, either madvise or prctl.  That's not necessarily the usecase 
Andrea would have, I don't believe: he'd still prefer to compact memory 
locally and avoid the swap storm rather than allocate remotely.  If it is 
impossible to reclaim locally for regular pages, remote hugepages may be 
more beneficial than remote pages.


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-04 Thread Mel Gorman
Much of this thread is a rehash of previous discussions, so I glossed
over parts of it and there will be a degree of error. Very
preliminary results from David's approach are below and the bottom line
is that it might fix some latency issues and locality issues at the cost
of a high degree of THP allocation failure.

On Tue, Dec 04, 2018 at 10:22:26AM +0100, Vlastimil Babka wrote:
> > +   if (order == pageblock_order &&
> > +   !(current->flags & PF_KTHREAD))
> > +   goto nopage;
> > 
> > and just goes "Eww".
> > 
> > But I think the real problem is that it's the "goto nopage" thing that
> > makes _sense_, and the current cases for "let's try compaction" that
> 
> More precisely it's "let's try reclaim + compaction".
> 

The original intent has been muddied and special cased but the general idea
was that compaction needs space to work with to both succeed and avoid
excessive scanning -- particularly in direct context that is visible to
the application. Before compaction, linear-reclaim (aka lumpy reclaim)
was used but this caused both page age inversion issues and excessive
thrashing. In Andrew's tree, there are patches that also do small amounts
of reclaim in response to fragmentation which in some cases alleviates
the need for the reclaim + compaction step as the reclaim has sometimes
already happened.  This has reduced latencies and increased THP allocation
success rates but not by enough which needs further work.

Parts of compaction are in need of a revisit. I'm in the process of doing
that, but it's time consuming because of the level of testing required
at every step. The prototype currently is 12 patches and growing and I'm
not sure what the final series will look like or how far it'll go.

At this point, I believe that even when it's finished, the concept of
"do some reclaim and try compaction" will remain. I'm focused primarily
on the compaction core at the moment rather than the outer part in the
page allocator.

> > are the odd ones, and then David adds one new special case for the
> > sensible behavior.
> > 
> > For example, why would COMPACT_DEFERRED mean "don't bother", but not
> > all the other reasons it didn't really make sense?
> 
> COMPACT_DEFERRED means that compaction was failing recently, even with
> sufficient free pages (e.g. freed by direct reclaim), so it doesn't make
> sense to continue.

Yes, the intent is that recent failures should not incur more useless
scanning and stalls. As it is, the latencies are too high and too often
it's useless work. Historically, this was put into place as the time
spent in compaction was too high and the THP allocation latencies were so
bad that it was preferred to disable THP entirely. This has improved in
recent years with general improvements and changes to defaults but there
is room to improve. Again, it's something I'm looking into but it's slow.

> > 
> > So does it really make sense to fall through AT ALL to that "retry"
> > case, when we explicitly already had (gfp_mask & __GFP_NORETRY)?
> 
> Well if there was no free memory to begin with, and thus compaction
> returned COMPACT_SKIPPED, then we didn't really "try" anything yet, so
> there's nothing to "not retry".
> 

What should also be kept in mind is that we should avoid conflating
locality preferences with THP preferences which is separate from THP
allocation latencies. The whole __GFP_THISNODE approach is pushing too
hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag
are used which is very unfortunate given that MADV_HUGEPAGE in itself says
nothing about locality -- that is the business of other madvise flags or
a specific policy.  Using remote nodes is bad but reclaiming excessively
and pushing data out of memory is worse as the latency to fault data back
from disk is higher than a remote access.

Andrea already pointed it out -- workloads that fit within a node are happy
to reclaim local memory, particularly in the case where the existing data
is old which is the ideal for David. Workloads that do not fit within a
node will often prefer using remote memory -- either THP or base pages
in the general case and THP for definite in the KVM case. While KVM
might not like remote memory, using THP at least reduces the page table
access overhead even if the access is remote and eventually automatic
NUMA balancing might intervene.

> > Maybe the real fix is to instead of adding yet another special case
> > for "goto nopage", it should just be unconditional: simply don't try
> > to compact large-pages if __GFP_NORETRY was set.
> 
> I think that would destroy THP success rates too much, in situations
> where reclaim and compaction would succeed, because there's enough
> easily reclaimable and migratable memory.
> 

Tests are in progress but yes, this is the primary risk of abandoning
the allocation request too early. I've already found during developing
the prototype series

Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-04 Thread Vlastimil Babka
On 12/3/18 11:27 PM, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 2:04 PM Linus Torvalds
>  wrote:
>>
>> so I think all of David's patch is somewhat sensible, even if that
>> specific "order == pageblock_order" test really looks like it might
>> want to be clarified.
> 
> Side note: I think maybe people should just look at that whole
> compaction logic for that block, because it doesn't make much sense to
> me:
> 
> /*
>  * Checks for costly allocations with __GFP_NORETRY, which
>  * includes THP page fault allocations
>  */
> if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> /*
>  * If compaction is deferred for high-order allocations,
>  * it is because sync compaction recently failed. If
>  * this is the case and the caller requested a THP
>  * allocation, we do not want to heavily disrupt the
>  * system, so we fail the allocation instead of entering
>  * direct reclaim.
>  */
> if (compact_result == COMPACT_DEFERRED)
> goto nopage;
> 
> /*
>  * Looks like reclaim/compaction is worth trying, but
>  * sync compaction could be very expensive, so keep
>  * using async compaction.
>  */
> compact_priority = INIT_COMPACT_PRIORITY;
> }
> 
> this is where David wants to add *his* odd test, and I think everybody
> looks at that added case
> 
> +   if (order == pageblock_order &&
> +   !(current->flags & PF_KTHREAD))
> +   goto nopage;
> 
> and just goes "Eww".
> 
> But I think the real problem is that it's the "goto nopage" thing that
> makes _sense_, and the current cases for "let's try compaction" that

More precisely it's "let's try reclaim + compaction".

> are the odd ones, and then David adds one new special case for the
> sensible behavior.
> 
> For example, why would COMPACT_DEFERRED mean "don't bother", but not
> all the other reasons it didn't really make sense?

COMPACT_DEFERRED means that compaction was failing recently, even with
sufficient free pages (e.g. freed by direct reclaim), so it doesn't make
sense to continue.
What are "all the other reasons"? __alloc_pages_direct_compact() could
have also returned COMPACT_SKIPPED, which means compaction actually
didn't happen at all, because there's not enough free pages.

> So does it really make sense to fall through AT ALL to that "retry"
> case, when we explicitly already had (gfp_mask & __GFP_NORETRY)?

Well if there was no free memory to begin with, and thus compaction
returned COMPACT_SKIPPED, then we didn't really "try" anything yet, so
there's nothing to "not retry".

> Maybe the real fix is to instead of adding yet another special case
> for "goto nopage", it should just be unconditional: simply don't try
> to compact large-pages if __GFP_NORETRY was set.

I think that would destroy THP success rates too much, in situations
where reclaim and compaction would succeed, because there's enough
easily reclaimable and migratable memory.

> Hmm? I dunno. Right now - for 4.20, I'd obviously want to keep changes
> smallish, so a hacky added special case might be the right thing to
> do. But the code does look odd, doesn't it?
> 
> I think part of it comes from the fact that we *used* to do the
> compaction first, and then we did the reclaim, and then it was
re-organized to do reclaim first, but it tried to keep semantic
> changes minimal and some of the above comes from that re-org.

IIRC the point of reorg was that in typical case we actually do want to
try the reclaim first (or only), and the exception are those THP-ish
allocations where typically the problem is fragmentation, and not number
of free pages, so we check first if we can defragment the memory or
whether it makes sense to free pages in case the defragmentation is
expected to help afterwards. It seemed better to put this special case
out of the main reclaim/compaction retry-with-increasing-priority loop
for non-costly-order allocations that in general can't fail.
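
The special case described above, for a costly-order __GFP_NORETRY
allocation whose first async compaction attempt has just returned, can
be condensed into a toy decision table (a simplified illustration of
the quoted logic, not the kernel's actual code):

```c
#include <assert.h>

enum compact_result { COMPACT_SKIPPED, COMPACT_DEFERRED,
		      COMPACT_FAILED, COMPACT_SUCCESS };
enum next_step { NS_NOPAGE, NS_RECLAIM_THEN_COMPACT, NS_DONE };

/*
 * What to do after the first (async) compaction attempt of a costly
 * __GFP_NORETRY allocation such as a THP page fault.
 */
static enum next_step after_first_compact(enum compact_result r)
{
	if (r == COMPACT_SUCCESS)
		return NS_DONE;
	/* Compaction failed recently despite free pages: more effort is useless. */
	if (r == COMPACT_DEFERRED)
		return NS_NOPAGE;
	/*
	 * COMPACT_SKIPPED (too few free pages to even try) or a plain
	 * failure: reclaim may create the headroom compaction needs,
	 * so fall into the reclaim + compaction path.
	 */
	return NS_RECLAIM_THEN_COMPACT;
}
```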

Vlastimil

> I think.
> 
> Linus
> 



Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-04 Thread Michal Hocko
On Mon 03-12-18 13:53:21, David Rientjes wrote:
> On Mon, 3 Dec 2018, Michal Hocko wrote:
> 
> > > I think extending functionality so thp can be allocated remotely if truly 
> > > desired is worthwhile
> > 
> > This is a complete NUMA policy antipattern that we have for all other
> > user memory allocations. So far you have to be explicit for your numa
> > requirements. You are trying to conflate NUMA api with MADV and that is
> > just conflating two orthogonal things and that is just wrong.
> > 
> 
> No, the page allocator change for both my patch and __GFP_COMPACT_ONLY has 
> nothing to do with any madvise() mode.  It has to do with where thp 
> allocations are preferred.  Yes, this is different than other memory 
> allocations where it doesn't cause a 13.9% access latency regression for 
> the lifetime of a binary for users who back their text with hugepages.  
> MADV_HUGEPAGE still has its purpose to try synchronous memory compaction 
> at fault time under all thp defrag modes other than "never".  The specific 
> problem being reported here, and that both my patch and __GFP_COMPACT_ONLY 
> address, is the pointless reclaim activity that does not assist in making 
> compaction more successful.

You do not address my concern though. Sure there are reclaim related
issues. Nobody is questioning that. But that is only half of the
problem.

The thing I am really getting at here is that the reintroduction of
__GFP_THISNODE, which you are pushing for, will conflate the madvise mode
(resp. defrag=always) with a NUMA placement policy because the allocation
doesn't fall back to a remote node.

And that is a fundamental problem and the antipattern I am talking
about. Look at it this way. All normal allocations are utilizing all the
available memory even though they might hit a remote latency penalty. If
you do care about NUMA placement you have an API to enforce a specific
placement.  What is so different about THP that it should behave
differently? Do we really want to later invent an API to actually allow
utilizing all the memory? There are certainly usecases (that triggered
the discussion previously) that do not mind the remote latency because
all other benefits simply outweigh it?

That being said, what should users who want to use all the memory do to
use as many THPs as possible?
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread David Rientjes
On Mon, 3 Dec 2018, Linus Torvalds wrote:

> Side note: I think maybe people should just look at that whole
> compaction logic for that block, because it doesn't make much sense to
> me:
> 
> /*
>  * Checks for costly allocations with __GFP_NORETRY, which
>  * includes THP page fault allocations
>  */
> if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> /*
>  * If compaction is deferred for high-order allocations,
>  * it is because sync compaction recently failed. If
>  * this is the case and the caller requested a THP
>  * allocation, we do not want to heavily disrupt the
>  * system, so we fail the allocation instead of entering
>  * direct reclaim.
>  */
> if (compact_result == COMPACT_DEFERRED)
> goto nopage;
> 
> /*
>  * Looks like reclaim/compaction is worth trying, but
>  * sync compaction could be very expensive, so keep
>  * using async compaction.
>  */
> compact_priority = INIT_COMPACT_PRIORITY;
> }
> 
> this is where David wants to add *his* odd test, and I think everybody
> looks at that added case
> 
> +   if (order == pageblock_order &&
> +   !(current->flags & PF_KTHREAD))
> +   goto nopage;
> 
> and just goes "Eww".
> 
> But I think the real problem is that it's the "goto nopage" thing that
> makes _sense_, and the current cases for "let's try compaction" that
> are the odd ones, and then David adds one new special case for the
> sensible behavior.
> 
> For example, why would COMPACT_DEFERRED mean "don't bother", but not
> all the other reasons it didn't really make sense?
> 
> So does it really make sense to fall through AT ALL to that "retry"
> case, when we explicitly already had (gfp_mask & __GFP_NORETRY)?
> 
> Maybe the real fix is to instead of adding yet another special case
> for "goto nopage", it should just be unconditional: simply don't try
> to compact large-pages if __GFP_NORETRY was set.
> 

I think what is intended, which may not be represented by the code, is 
that if compaction is not suitable (__compaction_suitable() returns 
COMPACT_SKIPPED because of failing watermarks), then for non-hugepage 
allocations reclaim may be useful.  We just want to reclaim memory so that 
memory compaction has pages available for migration targets.

Note the same caveat I keep bringing up still applies, though: if reclaim 
frees memory that is iterated over by the compaction migration scanner, it 
was pointless.  That is a memory compaction implementation detail and can 
lead to a lot of unnecessary reclaim (or even thrashing) if unmovable page 
fragmentation causes compaction to fail even after it has migrated 
everything it could.  I think the likelihood of that happening increases 
by the allocation order.
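
That caveat can be made concrete with a toy model of compaction's two
scanners: the migration scanner walks a zone up from the low pfn end,
the free scanner walks down from the high end, and they stop where they
meet. A page freed by reclaim helps compaction only if the free scanner
can still reach it (the positions and predicate below are illustrative
only, not the kernel's implementation):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * migrate_pos: highest pfn the migration scanner has already passed.
 * free_pos:    lowest pfn the free scanner has already passed.
 * A pfn freed by reclaim is usable as a migration target only if it
 * lies in the window neither scanner has covered yet.
 */
static bool freed_page_is_usable(long freed_pfn, long migrate_pos, long free_pos)
{
	return freed_pfn > migrate_pos && freed_pfn <= free_pos;
}
```

In this model, reclaim that frees pfns the scanners have already passed
is exactly the pointless reclaim described above.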


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Linus Torvalds
On Mon, Dec 3, 2018 at 2:04 PM Linus Torvalds
 wrote:
>
> so I think all of David's patch is somewhat sensible, even if that
> specific "order == pageblock_order" test really looks like it might
> want to be clarified.

Side note: I think maybe people should just look at that whole
compaction logic for that block, because it doesn't make much sense to
me:

/*
 * Checks for costly allocations with __GFP_NORETRY, which
 * includes THP page fault allocations
 */
if (costly_order && (gfp_mask & __GFP_NORETRY)) {
/*
 * If compaction is deferred for high-order allocations,
 * it is because sync compaction recently failed. If
 * this is the case and the caller requested a THP
 * allocation, we do not want to heavily disrupt the
 * system, so we fail the allocation instead of entering
 * direct reclaim.
 */
if (compact_result == COMPACT_DEFERRED)
goto nopage;

/*
 * Looks like reclaim/compaction is worth trying, but
 * sync compaction could be very expensive, so keep
 * using async compaction.
 */
compact_priority = INIT_COMPACT_PRIORITY;
}

this is where David wants to add *his* odd test, and I think everybody
looks at that added case

+   if (order == pageblock_order &&
+   !(current->flags & PF_KTHREAD))
+   goto nopage;

and just goes "Eww".

But I think the real problem is that it's the "goto nopage" thing that
makes _sense_, and the current cases for "let's try compaction" that
are the odd ones, and then David adds one new special case for the
sensible behavior.

For example, why would COMPACT_DEFERRED mean "don't bother", but not
all the other reasons it didn't really make sense?

So does it really make sense to fall through AT ALL to that "retry"
case, when we explicitly already had (gfp_mask & __GFP_NORETRY)?

Maybe the real fix is to instead of adding yet another special case
for "goto nopage", it should just be unconditional: simply don't try
to compact large-pages if __GFP_NORETRY was set.

Hmm? I dunno. Right now - for 4.20, I'd obviously want to keep changes
smallish, so a hacky added special case might be the right thing to
do. But the code does look odd, doesn't it?

I think part of it comes from the fact that we *used* to do the
compaction first, and then we did the reclaim, and then it was
re-organized to do reclaim first, but it tried to keep semantic
changes minimal and some of the above comes from that re-org.

I think.

Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Linus Torvalds
On Mon, Dec 3, 2018 at 12:12 PM Andrea Arcangeli  wrote:
>
> On Mon, Dec 03, 2018 at 11:28:07AM -0800, Linus Torvalds wrote:
> >
> > One is the patch posted by Andrea earlier in this thread, which seems
> > to target just this known regression.
>
> For the short term the important thing is to fix the VM regression one
> way or another, I don't personally mind which way.
>
> > The other seems to be to revert commit ac5b2c1891  and instead apply
> >
> >   https://lore.kernel.org/lkml/alpine.deb.2.21.1810081303060.221...@chino.kir.corp.google.com/
> >
> > which also seems to be sensible.
>
> In my earlier review of David's patch, it looked runtime equivalent to
> the __GFP_COMPACT_ONLY solution. It has the only advantage of adding a

I think there's a missing "not" in the above.

> new gfpflag until we're sure we need it but it's the worst solution
> available for the long term in my view. It'd be ok to apply it as
> stop-gap measure though.

So I have no really strong opinions either way.

Looking at the two options, I think I'd personally have a slight
preference for that patch by David, not so much because it doesn't add
a new GFP flag, but because it seems to make it a lot more explicit
that GFP_TRANSHUGE_LIGHT automatically implies __GFP_NORETRY.

I think that makes a whole lot of conceptual sense with the whole
meaning of GFP_TRANSHUGE_LIGHT. It's all about "no
reclaim/compaction", but honestly, doesn't __GFP_NORETRY make sense?

So I look at David's patch, and I go "that makes sense", and then I
compare it with ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
MADV_HUGEPAGE mappings") and that makes me go "ok, that's a hack".

So *if* reverting ac5b2c18911f and applying David's patch instead
fixes the KVM latency issues (which I assume it really should do,
simply thanks to __GFP_NORETRY), then I think that makes more sense.

That said, I do agree that the

if (order == pageblock_order ...)

test in __alloc_pages_slowpath() in David's patch then argues for
"that looks hacky".  But that code *is* inside the test for

if (costly_order && (gfp_mask & __GFP_NORETRY)) {

so within the context of that (not visible in the patch itself), it
looks like a sensible model. The whole point of that block is, as the
comment above it says

/*
 * Checks for costly allocations with __GFP_NORETRY, which
 * includes THP page fault allocations
 */

so I think all of David's patch is somewhat sensible, even if that
specific "order == pageblock_order" test really looks like it might
want to be clarified.

BUT.

With all that said, I really don't mind that __GFP_COMPACT_ONLY
approach either. I think David's patch makes sense in a bigger
context, while the __GFP_COMPACT_ONLY patch makes sense in the context
of "let's just fix this _particular_ special case".

As long as both work (and apparently they do), either is perfectly fine by me.

Some kind of "Thunderdome for patches" is needed, with an epic soundtrack.

   "Two patches enter, one patch leaves!"

I don't so much care which one.

 Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread David Rientjes
On Mon, 3 Dec 2018, Michal Hocko wrote:

> > I think extending functionality so thp can be allocated remotely if truly 
> > desired is worthwhile
> 
> This is a complete NUMA policy antipattern that we have for all other
> user memory allocations. So far you have to be explicit for your numa
> requirements. You are trying to conflate NUMA api with MADV and that is
> just conflating two orthogonal things and that is just wrong.
> 

No, the page allocator change for both my patch and __GFP_COMPACT_ONLY has 
nothing to do with any madvise() mode.  It has to do with where thp 
allocations are preferred.  Yes, this is different than other memory 
allocations where it doesn't cause a 13.9% access latency regression for 
the lifetime of a binary for users who back their text with hugepages.  
MADV_HUGEPAGE still has its purpose to try synchronous memory compaction 
at fault time under all thp defrag modes other than "never".  The specific 
problem being reported here, and that both my patch and __GFP_COMPACT_ONLY 
address, is the pointless reclaim activity that does not assist in making 
compaction more successful.

> Let's put the __GFP_THISNODE issue aside. I do not remember you
> confirming that __GFP_COMPACT_ONLY patch is OK for you (sorry it might
> got lost in the emails storm from back then) but if that is the only
> agreeable solution for now then I can live with that.

The discussion between my patch and Andrea's patch seemed to only be about 
whether this should be a gfp bit or not.

> __GFP_NORETRY hack
> was shown to not work properly by Mel AFAIR. Again if I misremember then
> I am sorry and I can live with that.

Andrea's patch as posted in this thread sets __GFP_NORETRY for 
__GFP_ONLY_COMPACT, so both my patch and his patch require it.  His patch 
gets this behavior for page faults by way of alloc_pages_vma(), mine gets 
it from modifying GFP_TRANSHUGE.

> But conflating MADV_TRANSHUGE with
> an implicit numa placement policy and/or adding an opt-in for remote
> NUMA placing is completely backwards and a broken API which will likely
> bite us later. I sincerely hope we are not going to repeat mistakes
> from the past.

Assuming s/MADV_TRANSHUGE/MADV_HUGEPAGE/.  Again, this is *not* about the 
madvise(); it's specifically about the role of direct reclaim in the 
allocation of a transparent hugepage at fault time regardless of any 
madvise() because you can get the same behavior with defrag=always (and 
the inconsistent use of __GFP_NORETRY there that is fixed by both of our 
patches).


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 12:39:34, David Rientjes wrote:
> On Mon, 3 Dec 2018, Michal Hocko wrote:
> 
> > I have merely said that a better THP locality needs more work and during
> > the review discussion I have even volunteered to work on that. There
> > are other reclaim related fixes under work right now. All I am saying
> > is that MADV_TRANSHUGE having numa locality implications cannot satisfy
> > all the usecases and it is particularly KVM that suffers from it.
> 
> I think extending functionality so thp can be allocated remotely if truly 
> desired is worthwhile

This is a complete NUMA policy antipattern that we have for all other
user memory allocations. So far you have to be explicit for your numa
requirements. You are trying to conflate NUMA api with MADV and that is
just conflating two orthogonal things and that is just wrong.

Let's put the __GFP_THISNODE issue aside. I do not remember you
confirming that __GFP_COMPACT_ONLY patch is OK for you (sorry it might
got lost in the emails storm from back then) but if that is the only
agreeable solution for now then I can live with that. __GFP_NORETRY hack
was shown to not work properly by Mel AFAIR. Again if I misremember then
I am sorry and I can live with that. But conflating MADV_TRANSHUGE with
an implicit numa placement policy and/or adding an opt-in for remote
NUMA placing is completely backwards and a broken API which will likely
bite us later. I sincerely hope we are not going to repeat mistakes
from the past.
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread David Rientjes
On Mon, 3 Dec 2018, Michal Hocko wrote:

> I have merely said that a better THP locality needs more work and during
> the review discussion I have even volunteered to work on that. There
> are other reclaim related fixes under work right now. All I am saying
> is that MADV_TRANSHUGE having numa locality implications cannot satisfy
> all the usecases and it is particularly KVM that suffers from it.

I think extending functionality so thp can be allocated remotely if truly 
desired is worthwhile just so long as it does not cause regressions for 
other users.  I think that is separate from the swap storm regression that 
Andrea is reporting, however, since that would also exist even if we 
allowed remote thp allocations when the host is fully fragmented rather 
than only locally fragmented.


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread David Rientjes
On Mon, 3 Dec 2018, Andrea Arcangeli wrote:

> In my earlier review of David's patch, it looked runtime equivalent to
> the __GFP_COMPACT_ONLY solution. It has the only advantage of adding a
> new gfpflag until we're sure we need it but it's the worst solution
> available for the long term in my view. It'd be ok to apply it as
> stop-gap measure though.
> 
> The "order == pageblock_order" hardcoding inside the allocator to
> workaround the __GFP_THISNODE flag passed from outside the allocator
> in the THP MADV_HUGEPAGE case, didn't look very attractive because
> it's not just THP allocating order >0 pages.
> 

We have two different things to consider: NUMA locality and the order of 
the allocation.  THP is preferred locally and we know the order.  For the 
other high-order pages you're referring to, I don't know if they are using 
__GFP_THISNODE or not (likely not).  I see them as two separate issues.

For thp on all platforms I have measured it on specifically for this patch 
(Broadwell, Haswell, Rome) there is a clear advantage to faulting local 
pages of the native page size over remote hugepages.  It also has the 
added effect of allowing khugepaged to collapse it into a hugepage later 
if fragmentation allows (the reason why khugepaged cares about NUMA 
locality, the same reason I do).  This is the rationale for __GFP_THISNODE 
for thp allocations.

For order == pageblock_order (or more correctly order >= pageblock_order), 
this is not based on NUMA whatsoever but is rather based on the 
implementation of memory compaction.  If it has already failed (or was 
deferred for order-HPAGE_PMD_ORDER), reclaim cannot be shown to help if 
memory compaction cannot utilize the freed memory in isolate_freepages(), 
so that reclaim has been pointless.  If compaction fails for other reasons 
(any unmovable page preventing a pageblock from becoming free), *all* 
reclaim activity has been pointless.  

> It'd be nicer if whatever compaction latency optimization that applies
> to THP could also apply to all other allocation orders too and the
> hardcoding of the THP order prevents that.
> 
> On the same lines if __GFP_THISNODE is so badly needed by
> MADV_HUGEPAGE, all other larger order allocations should also be able
> to take advantage of __GFP_THISNODE without ending in the same VM
> corner cases that required the "order == pageblock_order" hardcoding
> inside the allocator.
> 
> If you prefer David's patch I would suggest pageblock_order to be
> replaced with HPAGE_PMD_ORDER so it's more likely to match the THP
> order in all archs.
> 

That sounds fine and I will do that in my v2.
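
For reference, the distinction being negotiated here is that
HPAGE_PMD_ORDER is fixed by page-table geometry (how many base pages one
PMD entry maps), while pageblock_order is a grouping heuristic that
merely happens to equal it on common configs. A quick sanity check of
the PMD order, with the page sizes hardcoded as assumptions:

```c
#include <assert.h>

/* order such that (1 << order) base pages make up one PMD-sized huge page */
static int huge_page_order(long base_size, long huge_size)
{
	int order = 0;

	while ((base_size << order) < huge_size)
		order++;
	return order;
}
```

On x86-64 (4 KiB pages, 2 MiB PMD huge pages) this gives 9; on arm64
built with 64 KiB pages (512 MiB PMD huge pages) it gives 13, which is
why hardcoding one order in the allocator is fragile across
architectures.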


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread David Rientjes
On Mon, 3 Dec 2018, Andrea Arcangeli wrote:

> It's trivial to reproduce the badness by running a memhog process that
> allocates more than the RAM of 1 NUMA node, under defrag=always
> setting (or by changing memhog to use MADV_HUGEPAGE) and it'll create
> swap storms despite 75% of the RAM being completely free in a 4 node NUMA
> (or 50% of RAM free in a 2 node NUMA) etc..
> 
> How can it be ok to push the system into gigabytes of swap by default
> without any special capability despite 50% - 75% or more of the RAM
> being free? That's the downside of the __GFP_THISNODE optimization.
> 

The swap storm is the issue that is being addressed.  If your remote 
memory is as low as local memory, the patch to clear __GFP_THISNODE has 
done nothing to fix it: you still get swap storms and memory compaction 
can still fail if the per-zone freeing scanner cannot utilize the 
reclaimed memory.  Recall that this patch to clear __GFP_THISNODE was 
measured by me to have a 40% increase in allocation latency for fragmented 
remote memory on Haswell.  It makes the problem much, much worse.

> __GFP_THISNODE helps increase NUMA locality if your app can fit in a
> single node, which is David's common workload. But if his workload
> more often than not did not fit in a single node, he would also run
> into an unacceptable slowdown because of the __GFP_THISNODE.
> 

Which is why I have suggested that we do not do direct reclaim, as the 
page allocator implementation expects all thp page fault allocations to 
have __GFP_NORETRY set, because no amount of reclaim can be shown to be 
useful to the memory compaction freeing scanner if it is iterated over by 
the migration scanner.

> I think there's lots of room for improvement for the future, but in my
> view that __GFP_THISNODE as it was implemented was an incomplete hack,
> that opened the door for bad VM corner cases that should not happen.
> 

__GFP_THISNODE is intended specifically because of the remote access 
latency increase that is encountered if you fault remote hugepages over 
local pages of the native page size.


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Andrea Arcangeli
On Mon, Dec 03, 2018 at 11:28:07AM -0800, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 10:59 AM Michal Hocko  wrote:
> >
> > You are misinterpreting my words. I haven't dismissed anything. I do
> > recognize both usecases under discussion.
> >
> > I have merely said that a better THP locality needs more work and during
> > the review discussion I have even volunteered to work on that.
> 
> We have two known patches that seem to have no real downsides.
> 
> One is the patch posted by Andrea earlier in this thread, which seems
> to target just this known regression.

For the short term the important thing is to fix the VM regression one
way or another, I don't personally mind which way.

> The other seems to be to revert commit ac5b2c1891  and instead apply
> 
>   https://lore.kernel.org/lkml/alpine.deb.2.21.1810081303060.221...@chino.kir.corp.google.com/
> 
> which also seems to be sensible.

In my earlier review of David's patch, it looked runtime equivalent to
the __GFP_COMPACT_ONLY solution. It has the only advantage of adding a
new gfpflag until we're sure we need it but it's the worst solution
available for the long term in my view. It'd be ok to apply it as
stop-gap measure though.

The "order == pageblock_order" hardcoding inside the allocator to
workaround the __GFP_THISNODE flag passed from outside the allocator
in the THP MADV_HUGEPAGE case, didn't look very attractive because
it's not just THP allocating order >0 pages.

It'd be nicer if whatever compaction latency optimization that applies
to THP could also apply to all other allocation orders too and the
hardcoding of the THP order prevents that.

On the same lines if __GFP_THISNODE is so badly needed by
MADV_HUGEPAGE, all other larger order allocations should also be able
to take advantage of __GFP_THISNODE without ending in the same VM
corner cases that required the "order == pageblock_order" hardcoding
inside the allocator.

If you prefer David's patch I would suggest pageblock_order to be
replaced with HPAGE_PMD_ORDER so it's more likely to match the THP
order in all archs.

Thanks,
Andrea


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Linus Torvalds
On Mon, Dec 3, 2018 at 10:59 AM Michal Hocko  wrote:
>
> You are misinterpreting my words. I haven't dismissed anything. I do
> recognize both usecases under discussion.
>
> I have merely said that a better THP locality needs more work and during
> the review discussion I have even volunteered to work on that.

We have two known patches that seem to have no real downsides.

One is the patch posted by Andrea earlier in this thread, which seems
to target just this known regression.

The other seems to be to revert commit ac5b2c1891  and instead apply

  
https://lore.kernel.org/lkml/alpine.deb.2.21.1810081303060.221...@chino.kir.corp.google.com/

which also seems to be sensible.

I'm not seeing why the KVM load would react badly to either of those
models, and they are known to fix the google local-node issue.

  Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Andrea Arcangeli
On Mon, Dec 03, 2018 at 07:59:54PM +0100, Michal Hocko wrote:
> I have merely said that a better THP locality needs more work and during
> the review discussion I have even volunteered to work on that. There
> are other reclaim related fixes under work right now. All I am saying
> is that MADV_HUGEPAGE having numa locality implications cannot satisfy
> all the usecases and it is particularly KVM that suffers from it.

I'd like to clarify it's not just KVM; we found it with KVM because for
KVM it's fairly common to create a VM that won't possibly fit in a
single node, while most other apps don't tend to allocate that much
memory.

It's trivial to reproduce the badness by running a memhog process that
allocates more than the RAM of 1 NUMA node, under the defrag=always
setting (or by changing memhog to use MADV_HUGEPAGE), and it'll create
swap storms despite 75% of the RAM being completely free in a 4 node
NUMA (or 50% of RAM free in a 2 node NUMA) etc..

How can it be ok to push the system into gigabytes of swap by default
without any special capability despite 50% - 75% or more of the RAM
being free? That's the downside of the __GFP_THISNODE optimization.
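A minimal sketch of that reproducer (the function name is illustrative and the size is shrunk so the sketch runs anywhere; to see the actual swap storm, len must exceed one node's free RAM):

```c
#define _DEFAULT_SOURCE /* for MADV_HUGEPAGE in <sys/mman.h> */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* memhog-style loop: map anonymous memory, mark it MADV_HUGEPAGE so it
 * behaves as under defrag=always, then fault every page in. */
static int eat_memory_thp(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Best effort: may fail if THP is compiled out. */
	if (madvise(p, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	memset(p, 1, len);	/* fault the whole range in */
	munmap(p, len);
	return 0;
}
```

Called with a len larger than one node's free RAM on the affected kernels, the memset above is what triggers local reclaim and swap instead of falling back to the other nodes.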

__GFP_THISNODE helps increasing NUMA locality if your app can fit in a
single node, which is David's common workload. But if his workload more
often than not didn't fit in a single node, he would also run into an
unacceptable slowdown because of __GFP_THISNODE.

I think there's lots of room for improvement for the future, but in my
view __GFP_THISNODE as it was implemented was an incomplete hack that
opened the door to bad VM corner cases that should not happen.

It also would be nice to have a reproducer for David's workload; the
software to run the binary on THP is not released either. We have lots
of reproducers for the corner case introduced by the __GFP_THISNODE
trick.

So this is basically a revert of the commit that made MADV_HUGEPAGE
with __GFP_THISNODE behave like a privileged (although not as static)
mbind.

I provided an alternative but we weren't sure if that was the best
long term solution that could satisfy everyone because it does have
some drawback too.

Thanks,
Andrea


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 10:45:35, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 10:30 AM Michal Hocko  wrote:
> >
> > I do not get it. 5265047ac301 which this patch effectively reverts has
> > regressed kvm workloads. People started to notice only later because
> > they were not running on kernels with that commit until later. We have
> > 4.4 based kernels reports. What do you propose to do for those people?
> 
> We have at least two patches that others claim to fix things.
> 
> You dismissed them and said "can't be done".

You are misinterpreting my words. I haven't dismissed anything. I do
recognize both usecases under discussion.

I have merely said that a better THP locality needs more work and during
the review discussion I have even volunteered to work on that. There
are other reclaim related fixes under work right now. All I am saying
is that MADV_HUGEPAGE having numa locality implications cannot satisfy
all the usecases and it is particularly KVM that suffers from it.
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Linus Torvalds
On Mon, Dec 3, 2018 at 10:30 AM Michal Hocko  wrote:
>
> I do not get it. 5265047ac301 which this patch effectively reverts has
> regressed kvm workloads. People started to notice only later because
> they were not running on kernels with that commit until later. We have
> 4.4 based kernels reports. What do you propose to do for those people?

We have at least two patches that others claim to fix things.

You dismissed them and said "can't be done".

As a result, I'm not really interested in this discussion.

Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 10:19:55, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 10:15 AM Michal Hocko  wrote:
> >
> > The thing is that there is no universal win here. There are two
> > different types of workloads and we cannot satisfy both.
> 
> Ok, if that's the case, then I'll just revert the commit.
> 
> Michal, our rules are very simple: we don't generate regressions. It's
> better to have old reliable behavior than to start creating *new*
> problems.

I do not get it. 5265047ac301 which this patch effectively reverts has
regressed kvm workloads. People started to notice only later because
they were not running on kernels with that commit until later. We have
4.4 based kernels reports. What do you propose to do for those people?
Let me remind that it was David who introduced 5265047ac301, presumably
because his workload benefits from it.
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Linus Torvalds
On Mon, Dec 3, 2018 at 10:15 AM Michal Hocko  wrote:
>
> The thing is that there is no universal win here. There are two
> different types of workloads and we cannot satisfy both.

Ok, if that's the case, then I'll just revert the commit.

Michal, our rules are very simple: we don't generate regressions. It's
better to have old reliable behavior than to start creating *new*
problems.

 Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 10:01:18, Linus Torvalds wrote:
> On Wed, Nov 28, 2018 at 8:48 AM Linus Torvalds
>  wrote:
> >
> > On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying  wrote:
> > >
> > > In general, memory allocation fairness among processes should be a good
> > > thing.  So I think the report should have been a "performance
> > > improvement" instead of "performance regression".
> >
> > Hey, when you put it that way...
> >
> > Let's ignore this issue for now, and see if it shows up in some real
> > workload and people complain.
> 
> Well, David Rientjes points out that it *does* cause real problems in
> real workloads, so it's not just this benchmark run that shows the
> issue.

The thing is that there is no universal win here. There are two
different types of workloads and we cannot satisfy both. This has been
discussed at length during the review process. David's workload makes
some assumptions about the MADV_HUGEPAGE numa placement. There are
other workloads, like KVM setups, which do not really require that, and
those are the ones which regressed.

The prevalent consensus was that a NUMA placement is not really implied
by MADV_HUGEPAGE because a) this has never been documented or intended
behavior and b) it is not a universal win (basically the same as
node/zone_reclaim which used to be on by default on some NUMA setups).

Reverting the patch would regress another class of workloads. As we
cannot satisfy both I believe we should make the API clear and in favor
of a more relaxed workloads. Those with special requirements should have
a proper API to reflect that (this is our general NUMA policy pattern
already).
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Linus Torvalds
On Wed, Nov 28, 2018 at 8:48 AM Linus Torvalds
 wrote:
>
> On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying  wrote:
> >
> > In general, memory allocation fairness among processes should be a good
> > thing.  So I think the report should have been a "performance
> > improvement" instead of "performance regression".
>
> Hey, when you put it that way...
>
> Let's ignore this issue for now, and see if it shows up in some real
> workload and people complain.

Well, David Rientjes points out that it *does* cause real problems in
real workloads, so it's not just this benchmark run that shows the
issue.

So I guess we should revert, or at least fix. David, please post your
numbers again in public along with your suggested solution...

   Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-28 Thread David Rientjes
On Wed, 28 Nov 2018, Linus Torvalds wrote:

> On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying  wrote:
> >
> > From the above data, for the parent commit 3 processes exited within
> > 14s, another 3 exited within 100s.  For this commit, the first process
> > exited at 203s.  That is, this commit makes memory allocation more fair
> > among processes, so that processes proceeded at more similar speed.  But
> > this raises system memory footprint too, so triggered much more swap,
> > thus lower benchmark score.
> >
> > In general, memory allocation fairness among processes should be a good
> > thing.  So I think the report should have been a "performance
> > improvement" instead of "performance regression".
> 
> Hey, when you put it that way...
> 
> Let's ignore this issue for now, and see if it shows up in some real
> workload and people complain.
> 

Well, I originally complained[*] when the change was first proposed and 
when the stable backports were proposed[**].  On a fragmented host, the 
change itself showed a 13.9% access latency regression on Haswell and up 
to 40% allocation latency regression.  This is more substantial on Naples 
and Rome.  I also measured similar numbers to this for Haswell.

We are particularly hit hard by this because we have libraries that remap 
the text segment of binaries to hugepages; hugetlbfs is not widely used so 
this normally falls back to transparent hugepages.  We mmap(), 
madvise(MADV_HUGEPAGE), memcpy(), mremap().  We fully accept the latency 
to do this when the binary starts because the access latency at runtime is 
so much better.
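The startup sequence above can be sketched roughly as follows (hedged: the helper name is made up, an anonymous buffer stands in for the text segment, and a real remap of executable text would also need PROT_EXEC and care with concurrently running code):

```c
#define _GNU_SOURCE /* for mremap() MREMAP_FIXED */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* mmap() a staging area, madvise(MADV_HUGEPAGE), memcpy(), then
 * mremap() the staged copy over the original range. Returns the
 * original address on success. */
static void *remap_to_hugepages(void *old, size_t len)
{
	void *tmp = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (tmp == MAP_FAILED)
		return NULL;

	/* Ask for THP before touching the pages so the copy faults in
	 * huge pages; the allocation latency is paid here, once. */
	madvise(tmp, len, MADV_HUGEPAGE);

	memcpy(tmp, old, len);

	/* MREMAP_FIXED atomically replaces the mapping at 'old'. */
	return mremap(tmp, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, old);
}
```

The one-time cost at startup is the madvise plus the copy; afterwards the region is backed by huge pages when the kernel could provide them.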

With this change, however, we have no userspace workaround other than 
mbind() to prefer the local node.  On all of our platforms, native sized 
pages are always a win over remote hugepages and it leaves open the 
opportunity that we collapse memory into hugepages later by khugepaged if 
fragmentation is the issue.  mbind() is not viable, though: when the
local node is saturated, we are ok with falling back to remote pages of
the native page size, and using mbind() to retain the old behavior
would result in an oom kill instead.
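For illustration, the mbind() workaround being ruled out looks roughly like this (a raw syscall is used to avoid a libnuma dependency, node 0 stands in for "the local node", and failure is tolerated so the sketch also runs on non-NUMA kernels):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/mempolicy.h> /* MPOL_BIND */

/* Hard-bind a fresh mapping to node 0, then fault it in. MPOL_BIND
 * forbids fallback, so a saturated node 0 means an oom kill on the
 * memset rather than remote native-size pages, which is the exact
 * problem described above. */
static int bind_local_and_touch(size_t len)
{
	unsigned long nodemask = 1UL; /* node 0 only */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	if (syscall(SYS_mbind, p, len, MPOL_BIND, &nodemask,
		    sizeof(nodemask) * 8, 0))
		perror("mbind"); /* tolerated: continue unbound */

	memset(p, 0, len);
	munmap(p, len);
	return 0;
}
```

An MPOL_PREFERRED policy would fall back instead of oom killing, but then it no longer retains the old __GFP_THISNODE-like behavior, which is the bind the text describes.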

Given this severe access and allocation latency regression, we must revert 
this patch in our own kernel, there is simply no path forward without 
doing so.

[*] https://marc.info/?l=linux-kernel&m=153868420126775
[**] https://marc.info/?l=linux-kernel&m=154269994800842


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-28 Thread Andrea Arcangeli
On Wed, Nov 28, 2018 at 08:48:46AM -0800, Linus Torvalds wrote:
> On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying  wrote:
> >
> > From the above data, for the parent commit 3 processes exited within
> > 14s, another 3 exited within 100s.  For this commit, the first process
> > exited at 203s.  That is, this commit makes memory allocation more fair
> > among processes, so that processes proceeded at more similar speed.  But
> > this raises system memory footprint too, so triggered much more swap,
> > thus lower benchmark score.

Ok so it was the previous more unfair behavior that increased overall
performance. It was also unclear to me that this was a full swap storm
test.

> > In general, memory allocation fairness among processes should be a good
> > thing.  So I think the report should have been a "performance
> > improvement" instead of "performance regression".
> 
> Hey, when you put it that way...
> 
> Let's ignore this issue for now, and see if it shows up in some real
> workload and people complain.

Agreed.

With regard to the other question about 4.4 backports, 4.0 didn't have
__GFP_THISNODE, so this will still revert to the previous behavior and
it won't risk landing in uncharted territory. So there should be no
major concern for the backports.

We should still work on improving this area, for now the idea was to
apply a strict hotfix that would just revert to the previous behavior
without introducing new features and new APIs, that would also have
the side effect of diminishing THP utilization under MADV_HUGEPAGE.

Thanks!
Andrea


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-28 Thread Linus Torvalds
On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying  wrote:
>
> From the above data, for the parent commit 3 processes exited within
> 14s, another 3 exited within 100s.  For this commit, the first process
> exited at 203s.  That is, this commit makes memory allocation more fair
> among processes, so that processes proceeded at more similar speed.  But
> this raises system memory footprint too, so triggered much more swap,
> thus lower benchmark score.
>
> In general, memory allocation fairness among processes should be a good
> thing.  So I think the report should have been a "performance
> improvement" instead of "performance regression".

Hey, when you put it that way...

Let's ignore this issue for now, and see if it shows up in some real
workload and people complain.

 Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-27 Thread Michal Hocko
On Tue 27-11-18 14:50:05, Linus Torvalds wrote:
> On Tue, Nov 27, 2018 at 12:57 PM Andrea Arcangeli  wrote:
> >
> > This difference can only happen with defrag=always, and that's not the
> > current upstream default.
> 
> Ok, thanks. That makes it a bit less critical.
> 
> > That MADV_HUGEPAGE causes fights with NUMA balancing is not great
> > indeed, qemu needs NUMA locality too, but then the badness caused by
> > __GFP_THISNODE was a larger regression in the worst case for qemu.
> [...]
> > So the short term alternative again would be the alternate patch that
> > does __GFP_THISNODE|GFP_ONLY_COMPACT appended below.
> 
> Sounds like we should probably do this. Particularly since Vlastimil
> pointed out that we'd otherwise have issues with the back-port for 4.4
> where that "defrag=always" was the default.
> 
> The patch doesn't look horrible, and it directly addresses this
> particular issue.
> 
> Is there some reason we wouldn't want to do it?

We have discussed it previously and the biggest concern was that it
introduces a new GFP flag with a very weird and one-off semantic.
Anytime we have done that in the past it has kicked back at us, because
people started to use such a flag and any further changes were really
hard to do. So I would really prefer a more systematic solution. And I
believe we can do that here. MADV_HUGEPAGE (resp. THP
always enabled) has gained a local memory policy with the patch which
got effectively reverted. I do believe that conflating "I want THP" with
"I want them local" is just wrong from the API point of view. There are
different classes of usecases which obviously disagree on the latter.

So I believe that a long term solution should introduce a
MPOL_NODE_RECLAIM kind of policy. It would effectively reclaim local
nodes (within NODE_RECLAIM distance) before falling back to other nodes.

Apart from that we need a less disruptive reclaim driven by compaction
and Mel is already working on that AFAIK.
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-27 Thread Huang, Ying
Andrea Arcangeli  writes:

> Hi Linus,
>
> On Tue, Nov 27, 2018 at 09:08:50AM -0800, Linus Torvalds wrote:
>> On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
>>  wrote:
>> >
>> > FYI, we noticed a -61.3% regression of vm-scalability.throughput due
>> > to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
>> > MADV_HUGEPAGE mappings")
>> 
>> Well, that's certainly noticeable and not good.
>
> Noticed this email too.
>
> This difference can only happen with defrag=always, and that's not the
> current upstream default.
>
> thp_enabled: always
> thp_defrag: always
> ^^ emulates MADV_HUGEPAGE always set
>
>> Andrea, I suspect it might be causing fights with auto numa migration..
>
> That MADV_HUGEPAGE causes fights with NUMA balancing is not great
> indeed, qemu needs NUMA locality too, but then the badness caused by
> __GFP_THISNODE was a larger regression in the worst case for qemu.
>
> When the kernel wants to allocate a THP from node A, if there are no
> THP generated on node A but there are in node B, they'll be picked from
> node B now.
>
> __GFP_THISNODE previously prevented any THP to be allocated from any
> node except A. This introduces a higher chance of initial misplacement
> which NUMA balancing will correct over time, but it should only apply
> to long lived allocations under MADV_HUGEPAGE. Perhaps the workload
> here uses short lived allocations and sets defrag=always which is not
> optimal to begin with?
>
> The motivation of the fix, is that the previous code invoked reclaim
> with __GFP_THISNODE set. That looked insecure and such behavior should
> only have been possible under a mlock/mempolicy
> capability. __GFP_THISNODE is like a transient but still privileged
> mbind for reclaim.
>
> Before the fix, __GFP_THISNODE would end up swapping out everything
> from node A to free 4k pages from node A, despite perhaps there were
> gigabytes of memory free in node B. That caused severe regression to
> threaded workloads whose memory spanned more than one NUMA node. So
> again going back doesn't sound great for NUMA in general.
>
> The vmscalability test is most certainly not including any
> multithreaded process whose memory doesn't fit in a single NUMA node
> or we'd see also the other side of the tradeoff. It'd be nice to add
> such a test to be sure that the old __GFP_THISNODE behavior won't
> happen again for apps that don't fit in a single node.

The test case is to test the swap subsystem, where tens of processes
(32 in the test job) are run to eat memory and trigger swapping to an
NVMe disk.  The memory size to eat is almost the same for this commit
and its parent.  But I found that much more swap is triggered for this
commit.

  70934968 ± 10% +51.7%  1.076e+08 ±  3%  proc-vmstat.pswpout

One possibility is that for the parent commit, some processes exit much
earlier than other processes, so the total memory requirement of the
whole system is much lower.  So I dug more into the test log and found:


For the parent commit,

$ grep 'usecs =' vm-scalability
24573771360 bytes / 13189705 usecs = 1819435 KB/s
24573771360 bytes / 13853913 usecs = 1732205 KB/s
24573771360 bytes / 42953388 usecs = 558694 KB/s
24573771360 bytes / 52782761 usecs = 454652 KB/s
24573771360 bytes / 84026989 usecs = 285596 KB/s
24573771360 bytes / 111677310 usecs = 214885 KB/s
24573771360 bytes / 146084698 usecs = 164273 KB/s
24573771360 bytes / 146978329 usecs = 163274 KB/s
24573771360 bytes / 149371224 usecs = 160658 KB/s
24573771360 bytes / 162892070 usecs = 147323 KB/s
24573771360 bytes / 177949001 usecs = 134857 KB/s
24573771360 bytes / 181729992 usecs = 132052 KB/s
24573771360 bytes / 189812698 usecs = 126428 KB/s
24573771360 bytes / 190992698 usecs = 125647 KB/s
24573771360 bytes / 200039238 usecs = 119965 KB/s
24573771360 bytes / 201254461 usecs = 119241 KB/s
24573771360 bytes / 202825077 usecs = 118317 KB/s
24573771360 bytes / 203441285 usecs = 117959 KB/s
24573771360 bytes / 205378150 usecs = 116847 KB/s
24573771360 bytes / 204840555 usecs = 117153 KB/s
24573771360 bytes / 206235458 usecs = 116361 KB/s
24573771360 bytes / 206419877 usecs = 116257 KB/s
24573771360 bytes / 206619347 usecs = 116145 KB/s
24573771360 bytes / 206942267 usecs = 115963 KB/s
24573771360 bytes / 210289229 usecs = 114118 KB/s
24573771360 bytes / 210504531 usecs = 114001 KB/s
24573771360 bytes / 210521351 usecs = 113992 KB/s
24573771360 bytes / 211012852 usecs = 113726 KB/s
24573771360 bytes / 211547509 usecs = 113439 KB/s
24573771360 bytes / 212179521 usecs = 113101 KB/s
24573771360 bytes / 212907825 usecs = 112714 KB/s
24573771360 bytes / 215558786 usecs = 111328 KB/s

For this commit,

$ grep 'usecs =' vm-scalability
24573681072 bytes / 203705073 usecs = 117806 KB/s
24573681072 bytes / 216146130 usecs = 111025 KB/s
24573681072 bytes / 257234408 usecs = 93291 KB/s
24573681072 bytes / 259530715 usecs = 92465 KB/s
24573681072 bytes / 261335046 usecs = 91827 KB/s
24573681072 bytes / 260134706 usecs = 92251 KB/s
2457368
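Each line above is the test's fixed byte count divided by the elapsed microseconds, reported in KiB/s; assuming that interpretation, the conversion is:

```c
#include <stdint.h>

/* Convert a "bytes / usecs" log line into the KB/s figure printed after
 * it: bytes per microsecond -> bytes per second -> KiB per second. */
static uint64_t kib_per_sec(uint64_t bytes, uint64_t usecs)
{
	return bytes * 1000000u / usecs / 1024;
}
```

For example, the first parent-commit line works out as kib_per_sec(24573771360, 13189705) == 1819435, matching the log.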

Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-27 Thread Linus Torvalds
On Tue, Nov 27, 2018 at 12:57 PM Andrea Arcangeli  wrote:
>
> This difference can only happen with defrag=always, and that's not the
> current upstream default.

Ok, thanks. That makes it a bit less critical.

> That MADV_HUGEPAGE causes fights with NUMA balancing is not great
> indeed, qemu needs NUMA locality too, but then the badness caused by
> __GFP_THISNODE was a larger regression in the worst case for qemu.
[...]
> So the short term alternative again would be the alternate patch that
> does __GFP_THISNODE|GFP_ONLY_COMPACT appended below.

Sounds like we should probably do this. Particularly since Vlastimil
pointed out that we'd otherwise have issues with the back-port for 4.4
where that "defrag=always" was the default.

The patch doesn't look horrible, and it directly addresses this
particular issue.

Is there some reason we wouldn't want to do it?

   Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-27 Thread Andrea Arcangeli
Hi Linus,

On Tue, Nov 27, 2018 at 09:08:50AM -0800, Linus Torvalds wrote:
> On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
>  wrote:
> >
> > FYI, we noticed a -61.3% regression of vm-scalability.throughput due
> > to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> > MADV_HUGEPAGE mappings")
> 
> Well, that's certainly noticeable and not good.

Noticed this email too.

This difference can only happen with defrag=always, and that's not the
current upstream default.

thp_enabled: always
thp_defrag: always
^^ emulates MADV_HUGEPAGE always set

> Andrea, I suspect it might be causing fights with auto numa migration..

That MADV_HUGEPAGE causes fights with NUMA balancing is not great
indeed, qemu needs NUMA locality too, but then the badness caused by
__GFP_THISNODE was a larger regression in the worst case for qemu.

When the kernel wants to allocate a THP from node A, if there are no
THP generated on node A but there are in node B, they'll be picked from
node B now.

__GFP_THISNODE previously prevented any THP to be allocated from any
node except A. This introduces a higher chance of initial misplacement
which NUMA balancing will correct over time, but it should only apply
to long lived allocations under MADV_HUGEPAGE. Perhaps the workload
here uses short lived allocations and sets defrag=always which is not
optimal to begin with?

The motivation of the fix, is that the previous code invoked reclaim
with __GFP_THISNODE set. That looked insecure and such behavior should
only have been possible under a mlock/mempolicy
capability. __GFP_THISNODE is like a transient but still privileged
mbind for reclaim.

Before the fix, __GFP_THISNODE would end up swapping out everything
from node A to free 4k pages from node A, despite perhaps there were
gigabytes of memory free in node B. That caused severe regression to
threaded workloads whose memory spanned more than one NUMA node. So
again going back doesn't sound great for NUMA in general.

The vmscalability test is most certainly not including any
multithreaded process whose memory doesn't fit in a single NUMA node
or we'd see also the other side of the tradeoff. It'd be nice to add
such a test to be sure that the old __GFP_THISNODE behavior won't
happen again for apps that don't fit in a single node.

> Lots more system time, but also look at this:
> 
> >1122389 ±  9% +17.2%1315380 ±  4%  proc-vmstat.numa_hit
> > 214722 ±  5% +21.6% 261076 ±  3%  
> > proc-vmstat.numa_huge_pte_updates
> >1108142 ±  9% +17.4%1300857 ±  4%  proc-vmstat.numa_local
> > 145368 ± 48% +63.1% 237050 ± 17%  proc-vmstat.numa_miss
> > 159615 ± 44% +57.6% 251573 ± 16%  proc-vmstat.numa_other
> > 185.50 ± 81%   +8278.6%  15542 ± 40%  
> > proc-vmstat.numa_pages_migrated
> 
> Should the commit be reverted? Or perhaps at least modified?

I proposed two solutions; the other one required a new minor feature:
__GFP_ONLY_COMPACT. That solution wouldn't regress like the above. The
THP utilization ratio would decrease though (it had margin for
improvement).

Kirill preferred the __GFP_ONLY_COMPACT, I was mostly neutral because
it's a tradeoff.

So the short term alternative again would be the alternate patch that
does __GFP_THISNODE|GFP_ONLY_COMPACT appended below.

There's no particular problem in restricting compaction to the local
node during a THP allocation, as long as reclaim is skipped entirely.

David implemented a hardcoded version of GFP_COMPACTONLY too, which was
runtime equivalent, but it was hardcoded for THP only inside the
allocator, and that looks less flexible.

The current fix you merged is simpler overall and puts us back to a
"stable" state without introducing new (minor) features.

The below is for further review of the potential alternative (which
still has margin for improvement).

===
From: Andrea Arcangeli 
Subject: [PATCH 1/2] mm: thp: consolidate policy_nodemask call

Just a minor cleanup.

Signed-off-by: Andrea Arcangeli 
---
 mm/mempolicy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 01f1a14facc4..d6512ef28cde 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2026,6 +2026,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
+	nmask = policy_nodemask(gfp, pol);
+
 	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
 		int hpage_node = node;
 
@@ -2043,7 +2045,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 			!(pol->flags & MPOL_F_LOCAL))
 			hpage_node = pol->v.preferred_node;
 
-		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);

Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-27 Thread Vlastimil Babka
On 11/27/18 8:05 PM, Vlastimil Babka wrote:
> On 11/27/18 6:08 PM, Linus Torvalds wrote:
>> On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
>>  wrote:
>>>
>>> FYI, we noticed a -61.3% regression of vm-scalability.throughput due
>>> to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
>>> MADV_HUGEPAGE mappings")
>>
>> Well, that's certainly noticeable and not good.
>>
>> Andrea, I suspect it might be causing fights with auto numa migration..
>>
>> Lots more system time, but also look at this:
>>
>>>1122389 ±  9% +17.2%1315380 ±  4%  proc-vmstat.numa_hit
>>> 214722 ±  5% +21.6% 261076 ±  3%  
>>> proc-vmstat.numa_huge_pte_updates
>>>1108142 ±  9% +17.4%1300857 ±  4%  proc-vmstat.numa_local
>>> 145368 ± 48% +63.1% 237050 ± 17%  proc-vmstat.numa_miss
>>> 159615 ± 44% +57.6% 251573 ± 16%  proc-vmstat.numa_other
>>> 185.50 ± 81%   +8278.6%  15542 ± 40%  
>>> proc-vmstat.numa_pages_migrated
>>
>> Should the commit be reverted? Or perhaps at least modified?
> 
> This part of the test's config is important:
> 
> thp_defrag: always
> 
> While the commit targets MADV_HUGEPAGE mappings (such as Andrea's
> kvm-qemu usecase), with defrag=always, all mappings behave almost as a
> MADV_HUGEPAGE mapping. That's no longer a default for some years now and

Specifically, that's 444eb2a449ef ("mm: thp: set THP defrag by default
to madvise and add a stall-free defrag option") merged in v4.5. So we
might actually hit this regression with a 4.4 stable backport...

> I think nobody recommends it. In the default configuration nothing
> changes for non-madvise mappings.
> 
> Vlastimil
> 
>>  Linus
>>
> 



Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-27 Thread Vlastimil Babka
On 11/27/18 6:08 PM, Linus Torvalds wrote:
> On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
>  wrote:
>>
>> FYI, we noticed a -61.3% regression of vm-scalability.throughput due
>> to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
>> MADV_HUGEPAGE mappings")
> 
> Well, that's certainly noticeable and not good.
> 
> Andrea, I suspect it might be causing fights with auto numa migration..
> 
> Lots more system time, but also look at this:
> 
>>1122389 ±  9% +17.2%1315380 ±  4%  proc-vmstat.numa_hit
>> 214722 ±  5% +21.6% 261076 ±  3%  
>> proc-vmstat.numa_huge_pte_updates
>>1108142 ±  9% +17.4%1300857 ±  4%  proc-vmstat.numa_local
>> 145368 ± 48% +63.1% 237050 ± 17%  proc-vmstat.numa_miss
>> 159615 ± 44% +57.6% 251573 ± 16%  proc-vmstat.numa_other
>> 185.50 ± 81%   +8278.6%  15542 ± 40%  proc-vmstat.numa_pages_migrated
> 
> Should the commit be reverted? Or perhaps at least modified?

This part of the test's config is important:

thp_defrag: always

While the commit targets MADV_HUGEPAGE mappings (such as Andrea's
kvm-qemu usecase), with defrag=always, all mappings behave almost as a
MADV_HUGEPAGE mapping. That hasn't been the default for some years now,
and I think nobody recommends it. In the default configuration nothing
changes for non-madvise mappings.

Vlastimil

>  Linus
> 



Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-27 Thread Michal Hocko
On Tue 27-11-18 19:17:27, Michal Hocko wrote:
> On Tue 27-11-18 09:08:50, Linus Torvalds wrote:
> > On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
> >  wrote:
> > >
> > > FYI, we noticed a -61.3% regression of vm-scalability.throughput due
> > > to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> > > MADV_HUGEPAGE mappings")
> > 
> > Well, that's certainly noticeable and not good.
> > 
> > Andrea, I suspect it might be causing fights with auto numa migration..
> > 
> > Lots more system time, but also look at this:
> > 
> > >1122389 ±  9% +17.2%1315380 ±  4%  proc-vmstat.numa_hit
> > > 214722 ±  5% +21.6% 261076 ±  3%  
> > > proc-vmstat.numa_huge_pte_updates
> > >1108142 ±  9% +17.4%1300857 ±  4%  proc-vmstat.numa_local
> > > 145368 ± 48% +63.1% 237050 ± 17%  proc-vmstat.numa_miss
> > > 159615 ± 44% +57.6% 251573 ± 16%  proc-vmstat.numa_other
> > > 185.50 ± 81%   +8278.6%  15542 ± 40%  
> > > proc-vmstat.numa_pages_migrated
> > 
> > Should the commit be reverted? Or perhaps at least modified?
> 
> Well, the commit is trying to revert to the behavior before
> 5265047ac301 because there are real use cases that suffered from that
> change, and bug reports as a result of it.
> 
> will-it-scale is certainly worth considering, but it is an artificial
> testcase. A higher NUMA miss rate is an expected side effect of the
> patch because the fallback to a different NUMA node is more likely. The
> __GFP_THISNODE side effect is basically introducing node-reclaim
> behavior for THP pages. Another thing is that there is no behavior that
> is good for everybody. Whether to reclaim locally or allocate THP on a
> remote node is hard to decide by default. We have discussed that at
> length and there were some conclusions. One of them is that we need a
> NUMA policy to tell whether an expensive local allocation is preferred
> over a remote one. Also, we definitely need better proactive
> defragmentation to allow larger pages on a local node. This is a work
> in progress and this patch is a stopgap fix.

Btw. the associated discussion is at
http://lkml.kernel.org/r/20180925120326.24392-1-mho...@kernel.org

-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-27 Thread Michal Hocko
On Tue 27-11-18 09:08:50, Linus Torvalds wrote:
> On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
>  wrote:
> >
> > FYI, we noticed a -61.3% regression of vm-scalability.throughput due
> > to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> > MADV_HUGEPAGE mappings")
> 
> Well, that's certainly noticeable and not good.
> 
> Andrea, I suspect it might be causing fights with auto numa migration..
> 
> Lots more system time, but also look at this:
> 
> >    1122389 ±  9%     +17.2%    1315380 ±  4%  proc-vmstat.numa_hit
> >     214722 ±  5%     +21.6%     261076 ±  3%  proc-vmstat.numa_huge_pte_updates
> >    1108142 ±  9%     +17.4%    1300857 ±  4%  proc-vmstat.numa_local
> >     145368 ± 48%     +63.1%     237050 ± 17%  proc-vmstat.numa_miss
> >     159615 ± 44%     +57.6%     251573 ± 16%  proc-vmstat.numa_other
> >     185.50 ± 81%   +8278.6%      15542 ± 40%  proc-vmstat.numa_pages_migrated
> 
> Should the commit be reverted? Or perhaps at least modified?

Well, the commit is trying to revert to the behavior before
5265047ac301 because there are real use cases that suffered from that
change, and bug reports as a result of it.

will-it-scale is certainly worth considering, but it is an artificial
testcase. A higher NUMA miss rate is an expected side effect of the
patch because the fallback to a different NUMA node is more likely. The
__GFP_THISNODE side effect is basically introducing node-reclaim
behavior for THP pages. Another thing is that there is no behavior that
is good for everybody. Whether to reclaim locally or allocate THP on a
remote node is hard to decide by default. We have discussed that at
length and there were some conclusions. One of them is that we need a
NUMA policy to tell whether an expensive local allocation is preferred
over a remote one. Also, we definitely need better proactive
defragmentation to allow larger pages on a local node. This is a work
in progress and this patch is a stopgap fix.
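
[Editor's note: for readers unfamiliar with the mapping type under
discussion, a minimal sketch follows. This is not code from the thread;
it only illustrates the userspace side: an anonymous mapping flagged
with MADV_HUGEPAGE, whose huge-page faults are what commit ac5b2c18911f
may now satisfy from a remote NUMA node instead of reclaiming locally.]

```python
import mmap

LENGTH = 4 * 1024 * 1024  # 4 MiB: room for a couple of 2 MiB THPs

# Anonymous private mapping -- the kind MADV_HUGEPAGE applies to.
m = mmap.mmap(-1, LENGTH)
try:
    # Ask the kernel to back this range with transparent huge pages.
    # With __GFP_THISNODE the kernel reclaims/compacts on the local node
    # to satisfy this; the relaxed behavior may instead fall back to a
    # remote node's huge pages.
    m.madvise(mmap.MADV_HUGEPAGE)
except (AttributeError, OSError):
    pass  # madvise()/MADV_HUGEPAGE need Linux and Python >= 3.8
m[:] = b"\0" * LENGTH  # touch the range so pages are actually faulted in
m.close()
```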

-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-27 Thread Linus Torvalds
On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
 wrote:
>
> FYI, we noticed a -61.3% regression of vm-scalability.throughput due
> to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> MADV_HUGEPAGE mappings")

Well, that's certainly noticeable and not good.

Andrea, I suspect it might be causing fights with auto numa migration..

Lots more system time, but also look at this:

>    1122389 ±  9%     +17.2%    1315380 ±  4%  proc-vmstat.numa_hit
>     214722 ±  5%     +21.6%     261076 ±  3%  proc-vmstat.numa_huge_pte_updates
>    1108142 ±  9%     +17.4%    1300857 ±  4%  proc-vmstat.numa_local
>     145368 ± 48%     +63.1%     237050 ± 17%  proc-vmstat.numa_miss
>     159615 ± 44%     +57.6%     251573 ± 16%  proc-vmstat.numa_other
>     185.50 ± 81%   +8278.6%      15542 ± 40%  proc-vmstat.numa_pages_migrated
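
[Editor's note: the quoted percentages follow directly from the counter
means; a quick sketch with the values copied from the robot's report
reproduces them. numa_pages_migrated is left out because its reported
mean (185.50) is itself rounded, so recomputing gives +8278.4% rather
than the quoted +8278.6%.]

```python
def pct_change(base: float, patched: float) -> float:
    """Relative change from base to patched, in percent."""
    return (patched - base) / base * 100.0

# (counter name, mean before ac5b2c18911f, mean after) from the report
counters = [
    ("numa_hit",              1122389, 1315380),
    ("numa_huge_pte_updates",  214722,  261076),
    ("numa_local",            1108142, 1300857),
    ("numa_miss",              145368,  237050),
    ("numa_other",             159615,  251573),
]

for name, base, patched in counters:
    print(f"{name}: {pct_change(base, patched):+.1f}%")
```

Each printed row matches the delta column quoted above, e.g. numa_hit
comes out at +17.2% and numa_miss at +63.1%.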

Should the commit be reverted? Or perhaps at least modified?

 Linus