Re: [PATCH 4.4 131/160] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-11-21 Thread Michal Hocko
On Tue 20-11-18 15:53:10, David Rientjes wrote:
> On Tue, 20 Nov 2018, Michal Hocko wrote:
> 
> > On Mon 19-11-18 14:16:24, David Rientjes wrote:
> > > On Mon, 19 Nov 2018, Greg Kroah-Hartman wrote:
> > > 
> > > > 4.4-stable review patch.  If anyone has any objections, please let me 
> > > > know.
> > > > 
> > > 
> > > As I noted when this patch was originally proposed and when I nacked 
> > > it[*] 
> > > because it causes a 13.9% increase in remote memory access latency and up 
> > > to 40% increase in remote memory allocation latency on much of our 
> > > software stack that uses MADV_HUGEPAGE after mremapping the text segment 
> > > to memory backed by hugepages, I don't think this is stable material.
> > 
> > There was a wider consensus that this is the most minimal fix for users
> > who see a regression introduced by 5265047ac301 ("mm, thp: really
> > limit transparent hugepage allocation to local node"). As has been
> > discussed extensively, there is no universal win here, but we should
> > always opt for the safer side, which is what this patch accomplishes.
> > The changelog explains the trade-offs at length, along with numbers.
> > I am not happy that your particular workload is suffering, but this
> > area certainly requires many more changes to satisfy a wider range of users.
> > 
> > > The 4.4 kernel is almost three years old and this changes the NUMA 
> > > locality of any user of MADV_HUGEPAGE.
> > 
> > Yes, and we have seen bug reports because we adopted this older kernel
> > only recently.
> 
> I think the responsible thing to do would be to allow users to remain on
> the stable kernel they know works, whether that's 4.4 or any of the other
> trees this is proposed for, and to downgrade from any kernel release that
> causes such severe regressions in their workloads once they try a kernel
> with this commit.

But we do know that there are people affected on the 4.4 kernel. Besides,
we can revert in the stable tree as soon as we see bug reports against
new stable tree releases.

Really, there is no single proper behavior. It was a mistake to merge
5265047ac301. Since then we have been in the unfortunate situation that
some workloads might have started to depend on the new behavior.

But rather than repeating the previous long discussion, I would call for
a new one which actually deals with the fallout. AFAIR there is a patch
series by Mel to reduce the fragmentation issues, with zero feedback so
far. I also think we should start discussing a new memory policy to
establish the semantics you are after.
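
As a rough illustration of the gap (a sketch only; the policy modes shown
are the existing ones, and the node number is an assumption for the
example), a process today can pick a hard bind or a soft preference, but
neither mode distinguishes THP from base pages, which is presumably what
such a new policy would need to express:

  #define _GNU_SOURCE
  #include <numaif.h>   /* set_mempolicy(), MPOL_*; link with -lnuma */

  /*
   * Existing per-process choices, neither of which says
   * "THP only if local, otherwise local base pages":
   *   MPOL_BIND      - never leave the node, at the price of OOM kills
   *   MPOL_PREFERRED - prefer the node for all allocations, THP or not
   */
  static long prefer_local_node(void)
  {
          unsigned long nodemask = 1UL << 0;   /* node 0, assumed local */

          return set_mempolicy(MPOL_PREFERRED, &nodemask,
                               8 * sizeof(nodemask));
  }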

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4.4 131/160] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-11-20 Thread David Rientjes
On Tue, 20 Nov 2018, Michal Hocko wrote:

> On Mon 19-11-18 14:16:24, David Rientjes wrote:
> > On Mon, 19 Nov 2018, Greg Kroah-Hartman wrote:
> > 
> > > 4.4-stable review patch.  If anyone has any objections, please let me 
> > > know.
> > > 
> > 
> > As I noted when this patch was originally proposed and when I nacked it[*] 
> > because it causes a 13.9% increase in remote memory access latency and up 
> > to 40% increase in remote memory allocation latency on much of our 
> > software stack that uses MADV_HUGEPAGE after mremapping the text segment 
> > to memory backed by hugepages, I don't think this is stable material.
> 
> There was a wider consensus that this is the most minimal fix for users
> who see a regression introduced by 5265047ac301 ("mm, thp: really
> limit transparent hugepage allocation to local node"). As has been
> discussed extensively, there is no universal win here, but we should
> always opt for the safer side, which is what this patch accomplishes.
> The changelog explains the trade-offs at length, along with numbers.
> I am not happy that your particular workload is suffering, but this
> area certainly requires many more changes to satisfy a wider range of users.
> 
> > The 4.4 kernel is almost three years old and this changes the NUMA 
> > locality of any user of MADV_HUGEPAGE.
> 
> Yes, and we have seen bug reports because we adopted this older kernel
> only recently.

I think the responsible thing to do would be to allow users to remain on
the stable kernel they know works, whether that's 4.4 or any of the other
trees this is proposed for, and to downgrade from any kernel release that
causes such severe regressions in their workloads once they try a kernel
with this commit.


Re: [PATCH 4.4 131/160] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-11-19 Thread Michal Hocko
On Mon 19-11-18 14:16:24, David Rientjes wrote:
> On Mon, 19 Nov 2018, Greg Kroah-Hartman wrote:
> 
> > 4.4-stable review patch.  If anyone has any objections, please let me know.
> > 
> 
> As I noted when this patch was originally proposed and when I nacked it[*] 
> because it causes a 13.9% increase in remote memory access latency and up 
> to 40% increase in remote memory allocation latency on much of our 
> software stack that uses MADV_HUGEPAGE after mremapping the text segment 
> to memory backed by hugepages, I don't think this is stable material.

There was a wider consensus that this is the most minimal fix for users
who see a regression introduced by 5265047ac301 ("mm, thp: really
limit transparent hugepage allocation to local node"). As has been
discussed extensively, there is no universal win here, but we should
always opt for the safer side, which is what this patch accomplishes.
The changelog explains the trade-offs at length, along with numbers.
I am not happy that your particular workload is suffering, but this
area certainly requires many more changes to satisfy a wider range of users.

> The 4.4 kernel is almost three years old and this changes the NUMA 
> locality of any user of MADV_HUGEPAGE.

Yes, and we have seen bug reports because we adopted this older kernel
only recently.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4.4 131/160] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-11-19 Thread David Rientjes
On Mon, 19 Nov 2018, Greg Kroah-Hartman wrote:

> 4.4-stable review patch.  If anyone has any objections, please let me know.
> 

As I noted when this patch was originally proposed and when I nacked it[*] 
because it causes a 13.9% increase in remote memory access latency and up 
to 40% increase in remote memory allocation latency on much of our 
software stack that uses MADV_HUGEPAGE after mremapping the text segment 
to memory backed by hugepages, I don't think this is stable material.

The 4.4 kernel is almost three years old and this changes the NUMA 
locality of any user of MADV_HUGEPAGE.

Although the patch was merged even after my objection, we must revert it
in our own kernel because there is no userspace workaround to restore the
behavior prior to this patch, short of using an MPOL_BIND mempolicy. That
would have the unwanted side effect of oom killing if the node is out of
memory for pages of the native size, which makes it a non-starter.
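
For reference, the workaround being ruled out would look roughly like the
sketch below (the node number, length and MPOL_MF_MOVE flag are assumptions
for illustration; mbind() is declared in numaif.h and needs -lnuma).
Binding hard to one node is precisely what makes an oom kill possible once
that node cannot satisfy even native-size allocations:

  #define _GNU_SOURCE
  #include <numaif.h>
  #include <sys/mman.h>

  /*
   * Rejected workaround: force every page of the range onto node 0 so
   * that THP stays local.  If node 0 later has no free memory even for
   * 4kB pages, the allocation cannot fall back to another node and the
   * task risks being oom killed.
   */
  static void bind_text_to_local_node(void *addr, size_t len)
  {
          unsigned long nodemask = 1UL << 0;   /* node 0, assumed local */

          madvise(addr, len, MADV_HUGEPAGE);
          mbind(addr, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
                MPOL_MF_MOVE);
  }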

 [*] https://marc.info/?l=linux-kernel&m=153868420126775


[PATCH 4.4 131/160] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-11-19 Thread Greg Kroah-Hartman
4.4-stable review patch.  If anyone has any objections, please let me know.

--

From: Andrea Arcangeli 

commit ac5b2c18911ffe95c08d69273917f90212cf5659 upstream.

THP allocation might be really disruptive when allocated on a NUMA system
with the local node full or hard to reclaim.  Stefan has posted an
allocation stall report on a 4.12-based SLES kernel which suggests the
same issue:

  kvm: page allocation stalls for 194572ms, order:9, 
mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM),
 nodemask=(null)
  kvm cpuset=/ mems_allowed=0-1
  CPU: 10 PID: 84752 Comm: kvm Tainted: GW 4.12.0+98-ph 001 SLE15 (unreleased)
  Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
  Call Trace:
   dump_stack+0x5c/0x84
   warn_alloc+0xe0/0x180
   __alloc_pages_slowpath+0x820/0xc90
   __alloc_pages_nodemask+0x1cc/0x210
   alloc_pages_vma+0x1e5/0x280
   do_huge_pmd_wp_page+0x83f/0xf00
   __handle_mm_fault+0x93d/0x1060
   handle_mm_fault+0xc6/0x1b0
   __do_page_fault+0x230/0x430
   do_page_fault+0x2a/0x70
   page_fault+0x7b/0x80
   [...]
  Mem-Info:
  active_anon:126315487 inactive_anon:1612476 isolated_anon:5
   active_file:60183 inactive_file:245285 isolated_file:0
   unevictable:15657 dirty:286 writeback:1 unstable:0
   slab_reclaimable:75543 slab_unreclaimable:2509111
   mapped:81814 shmem:31764 pagetables:370616 bounce:0
   free:32294031 free_pcp:6233 free_cma:0
  Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB 
inactive_file:981168kB unevictable:13368kB isolated(anon):0kB 
isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB 
shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB 
unstable:0kB all_unreclaimable? no
  Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB 
inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB 
mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB 
shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB 
all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for a MADV_HUGEPAGE vma.

Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard, and
: unsuccessfully, to swap out get_user_pages-pinned pages as a result of
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).

Fix this by removing __GFP_THISNODE for THP requests which are
requesting direct reclaim.  This effectively reverts 5265047ac301 on the
grounds that the zone/node reclaim was known to be disruptive due to
premature reclaim when there was memory free.  While it made sense at
the time for HPC workloads without NUMA awareness on rare machines, it
was ultimately harmful in the majority of cases.  The existing behaviour
is similar, if not as widespread since it only applies to a corner case,
but crucially, it cannot be tuned around like zone_reclaim_mode can.
The default behaviour should always be to cause the least harm for the
common case.
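
In code terms, the change boils down to making __GFP_THISNODE conditional
in the THP path of alloc_pages_vma().  A condensed sketch of that logic
(paraphrased for illustration, not the literal 4.4 backport diff):

  /* mm/mempolicy.c, alloc_pages_vma(), THP path (condensed sketch) */
  if (!(gfp & __GFP_DIRECT_RECLAIM))
          /*
           * A cheap attempt (no direct reclaim allowed) may still insist
           * on the local node: it fails fast and the fault falls back to
           * base pages.
           */
          gfp |= __GFP_THISNODE;
  /*
   * When direct reclaim is allowed (e.g. defrag=madvise on a
   * MADV_HUGEPAGE vma), do not pin the allocation to the local node,
   * so it can fall back to other nodes instead of swapping the local
   * node to death.
   */
  page = __alloc_pages_node(hpage_node, gfp, order);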

If there are specialised use cases out there that want zone_reclaim_mode
in specific cases, then it can be built on top.  Longterm we should
consider a memory policy which allows for the node reclaim like behavior
for the specific memory ranges which would allow a

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarca...@redhat.com

Mel said:

: Both patches look correct to me but I'm responding to this one because
: it's the fix.  The change makes sense and moves further away from the
: severe stalling behaviour we used to see with both THP and zone reclaim
: mode.
:
: I put together a basic experiment with usemem configured to reference a
: buffer multiple times that is 80% the size of main memory on a 2-socket
: box with symmetric node sizes and defrag set to "always".  The defrag
: setting is not the default 
