Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-29 Thread Andrea Arcangeli
Hello,

On Mon, Oct 29, 2018 at 11:08:34AM +0100, Michal Hocko wrote:
> This seems like a separate issue which should better be debugged. Please
> open a new thread describing the problem and the state of the node.

Yes, in my view it should be evaluated separately too, because it's
overall less concerning: __GFP_THISNODE there can only be set by the
root user, so it has a chance to be legitimate behavior. Let's focus
on solving the __GFP_THISNODE that any user in the system (not only
root) can set, causing severe and unexpected swap storms or slowdowns
for all other processes run by other users.

Could you also post the output of:

ls -l /sys/kernel/mm/hugepages/*/nr_hugepages

(and the boot command line)

Thanks,
Andrea


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-29 Thread Michal Hocko
On Mon 29-10-18 20:42:53, Balbir Singh wrote:
> On Mon, Oct 29, 2018 at 10:00:35AM +0100, Michal Hocko wrote:
[...]
> > These hugetlb allocations might be disruptive and that is an expected
> > behavior because this is an explicit requirement from an admin to
> > pre-allocate large pages for the future use. __GFP_RETRY_MAYFAIL just
> > underlines that requirement.
> 
> Yes, but in the absence of a particular node, for example via sysctl
> (as the compaction does), I don't think it is a hard requirement to get
> a page from a particular node.

Again, this seems like a deliberate decision. You want the pool
distributed across nodes as evenly as possible, otherwise NUMA
placement will be much less deterministic. At least that was the case
for a long time. If you have different per-node preferences, just use
NUMA-aware pre-allocation.

> I agree we need __GFP_RETRY_MAYFAIL, in any
> case the real root cause for me is should_reclaim_continue() which keeps
> the task looping without making forward progress.

This seems like a separate issue which should better be debugged. Please
open a new thread describing the problem and the state of the node.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-29 Thread Balbir Singh
On Mon, Oct 29, 2018 at 10:00:35AM +0100, Michal Hocko wrote:
> On Mon 29-10-18 16:17:52, Balbir Singh wrote:
> [...]
> > I wonder if alloc_pool_huge_page() should also trim out it's logic
> > of __GFP_THISNODE for the same reasons as mentioned here. I like
> > that we round robin to alloc the pool pages, but __GFP_THISNODE
> > might be an overkill for that case as well.
> 
> alloc_pool_huge_page uses __GFP_THISNODE for a different reason than
> THP. We really do want to allocate for a per-node pool. THP can
> fall back or use a different node.
> 

Not really:

static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
{
	...
	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
	...
	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
		if (page)
			break;
	}
	...
}

The code just tries to distribute the pool pages across the allowed nodes.

> These hugetlb allocations might be disruptive and that is an expected
> behavior because this is an explicit requirement from an admin to
> pre-allocate large pages for the future use. __GFP_RETRY_MAYFAIL just
> underlines that requirement.

Yes, but in the absence of a particular node, for example via sysctl
(as compaction does), I don't think it is a hard requirement to get
a page from a particular node. I agree we need __GFP_RETRY_MAYFAIL;
in any case the real root cause for me is should_reclaim_continue(),
which keeps the task looping without making forward progress.

__GFP_THISNODE was again an example of misleading the allocator
in this case, IMHO.

> 
> Maybe the compaction logic could be improved and that might be a shared
> goal with future changes though.

I'll also send my RFC once my testing is done, assuming I get it to reproduce
with a desired frequency.

Balbir Singh.


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-29 Thread Michal Hocko
On Mon 29-10-18 16:17:52, Balbir Singh wrote:
[...]
> I wonder if alloc_pool_huge_page() should also trim out it's logic
> of __GFP_THISNODE for the same reasons as mentioned here. I like
> that we round robin to alloc the pool pages, but __GFP_THISNODE
> might be an overkill for that case as well.

alloc_pool_huge_page uses __GFP_THISNODE for a different reason than
THP. We really do want to allocate for a per-node pool. THP can
fall back or use a different node.

These hugetlb allocations might be disruptive and that is an expected
behavior because this is an explicit requirement from an admin to
pre-allocate large pages for the future use. __GFP_RETRY_MAYFAIL just
underlines that requirement.

Maybe the compaction logic could be improved and that might be a shared
goal with future changes though.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-28 Thread Balbir Singh
On Tue, Sep 25, 2018 at 02:03:25PM +0200, Michal Hocko wrote:
> From: Andrea Arcangeli 
> 
> THP allocation might be really disruptive when allocated on NUMA system
> with the local node full or hard to reclaim. Stefan has posted an
> allocation stall report on 4.12 based SLES kernel which suggests the
> same issue:
> 
> [245513.362669] kvm: page allocation stalls for 194572ms, order:9, 
> mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM),
>  nodemask=(null)
> [245513.363983] kvm cpuset=/ mems_allowed=0-1
> [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph 001 SLE15 (unreleased)
> [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 
> 12/05/2017
> [245513.365905] Call Trace:
> [245513.366535]  dump_stack+0x5c/0x84
> [245513.367148]  warn_alloc+0xe0/0x180
> [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
> [245513.368406]  ? __slab_free+0xa9/0x2f0
> [245513.369048]  ? __slab_free+0xa9/0x2f0
> [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
> [245513.370300]  alloc_pages_vma+0x1e5/0x280
> [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
> [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
> [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
> [245513.372812]  __handle_mm_fault+0x93d/0x1060
> [245513.373439]  handle_mm_fault+0xc6/0x1b0
> [245513.374042]  __do_page_fault+0x230/0x430
> [245513.374679]  ? get_vtime_delta+0x13/0xb0
> [245513.375411]  do_page_fault+0x2a/0x70
> [245513.376145]  ? page_fault+0x65/0x80
> [245513.376882]  page_fault+0x7b/0x80
> [...]
> [245513.382056] Mem-Info:
> [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
>  active_file:60183 inactive_file:245285 isolated_file:0
>  unevictable:15657 dirty:286 writeback:1 unstable:0
>  slab_reclaimable:75543 slab_unreclaimable:2509111
>  mapped:81814 shmem:31764 pagetables:370616 bounce:0
>  free:32294031 free_pcp:6233 free_cma:0
> [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB 
> active_file:240648kB inactive_file:981168kB unevictable:13368kB 
> isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB 
> writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 
> 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB 
> active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB 
> isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB 
> shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB 
> unstable:0kB all_unreclaimable? no
> 
> The defrag mode is "madvise" and from the above report it is clear that
> the THP has been allocated for a MADV_HUGEPAGE vma.
> 
> Andrea has identified that the main source of the problem is
> __GFP_THISNODE usage:
> 
> : The problem is that direct compaction combined with the NUMA
> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> : hard the local node, instead of failing the allocation if there's no
> : THP available in the local node.
> :
> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> : path even with MPOL_DEFAULT.
> :
> : The idea behind the __GFP_THISNODE addition, is that it is better to
> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> : backed memory. That largely depends on the remote latency though, on
> : threadrippers for example the overhead is relatively low in my
> : experience.
> :
> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> : extremely slow qemu startup with vfio, if the VM is larger than the
> : size of one host NUMA node. This is because it will try very hard to
> : unsuccessfully swapout get_user_pages pinned pages as result of the
> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> : allocations and instead of trying to allocate THP on other nodes (it
> : would be even worse without vfio type1 GUP pins of course, except it'd
> : be swapping heavily instead).
> 
> Fix this by removing __GFP_THISNODE for THP requests which are
> requesting the direct reclaim. This effectively reverts 5265047ac301 on
> the grounds that the zone/node reclaim was known to be disruptive due
> to premature reclaim when there was memory free. While it made sense at
> the time for HPC workloads without NUMA awareness on rare machines, it
> was ultimately harmful in the majority of cases. The existing behaviour
> is similar, if not as widespread, as it applies to a corner case but,
> crucially, it cannot be tuned around like zone_reclaim_mode can. The
> default behaviour should always be to cause the least harm for the
> common case.
> 
> If there are specialised use cases out there that want zone_reclaim_mode
> in 

Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-28 Thread David Rientjes
On Mon, 22 Oct 2018, Zi Yan wrote:

> Hi David,
> 

Hi!

> On 22 Oct 2018, at 17:04, David Rientjes wrote:
> 
> > On Tue, 16 Oct 2018, Mel Gorman wrote:
> > 
> > > I consider this to be an unfortunate outcome. On the one hand, we have a
> > > problem that three people can trivially reproduce with known test cases
> > > and a patch shown to resolve the problem. Two of those three people work
> > > on distributions that are exposed to a large number of users. On the
> > > other, we have a problem that requires the system to be in a specific
> > > state and an unknown workload that suffers badly from the remote access
> > > penalties with a patch that has review concerns and has not been proven
> > > to resolve the trivial cases.
> > 
> > The specific state is that remote memory is fragmented as well, this is
> > not atypical.  Removing __GFP_THISNODE to avoid thrashing a zone will only
> > be beneficial when you can allocate remotely instead.  When you cannot
> > allocate remotely instead, you've made the problem much worse for
> > something that should be __GFP_NORETRY in the first place (and was for
> > years) and should never thrash.
> > 
> > I'm not interested in patches that require remote nodes to have an
> > abundance of free or unfragmented memory to avoid regressing.
> 
> I just wonder what is the page allocation priority list in your environment,
> assuming all memory nodes are so fragmented that no huge pages can be
> obtained without compaction or reclaim.
> 
> Here is my version of that list, please let me know if it makes sense to you:
> 
> 1. local huge pages: with compaction and/or page reclaim, you are willing
> to pay the penalty of getting huge pages;
> 
> 2. local base pages: since, in your system, remote data accesses have much
> higher penalty than the extra TLB misses incurred by the base page size;
> 
> 3. remote huge pages: at least it is better than remote base pages;
> 
> 4. remote base pages: it performs worst in terms of locality and TLBs.
> 

I have a ton of different platforms available.  Consider a very basic 
access latency evaluation on Broadwell on a running production system: 
remote hugepage vs remote PAGE_SIZE pages had about 5% better access 
latency.  Remote PAGE_SIZE pages vs local pages is a 12% degradation.  On 
Naples, remote hugepage vs remote PAGE_SIZE had 2% better access latency 
intrasocket, no better access latency intersocket.  Remote PAGE_SIZE pages 
vs local is a 16% degradation intrasocket and 38% degradation intersocket.

My list removes (3) from your list but is otherwise unchanged. I
remove (3) because 2-5% better access latency is nice, but we'd much
rather fault local base pages and then let khugepaged collapse them
into a local hugepage when fragmentation has improved or memory has
been freed. That is where we can get a much better result: 41% better
access latency on Broadwell and 52% better access latency on Naples.
I wouldn't trade that for 2-5% better immediate remote hugepages.

It just so happens that prior to this patch, the implementation of the 
page allocator matches this preference.

> In addition, to prioritize local base pages over remote pages,
> the original huge page allocation has to fail, then kernel can
> fall back to base page allocations. And you will never get remote
> huge pages any more if the local base page allocation fails,
> because there is no way back to huge page allocation after the fallback.
> 

That is exactly what we want: we want khugepaged to collapse memory
into local hugepages for the big improvement, rather than persistently
access a hugepage remotely. The win of the remote hugepage just isn't
substantial enough, and the win of the local hugepage is too great.

> > I'd like to know, specifically:
> > 
> >  - what measurable effect my patch has that is better solved with removing
> >__GFP_THISNODE on systems where remote memory is also fragmented?
> > 
> >  - what platforms benefit from remote access to hugepages vs accessing
> >local small pages (I've asked this maybe 4 or 5 times now)?
> > 
> >  - how is reclaiming (and possibly thrashing) memory helpful if compaction
> >fails to free an entire pageblock due to slab fragmentation due to low
> >on memory conditions and the page allocator preference to return node-
> >local memory?
> > 
> >  - how is reclaiming (and possibly thrashing) memory helpful if compaction
> >cannot access the memory reclaimed because the freeing scanner has
> >already passed by it, or the migration scanner has passed by it, since
> >this reclaim is not targeted to pages it can find?
> > 
> >  - what metrics can be introduced to the page allocator so that we can
> >determine that reclaiming (and possibly thrashing) memory will result
> >in a hugepage being allocated?
> 
> The slab fragmentation and whether reclaim/compaction can help form
> huge pages seem to be orthogonal to this patch, which tries to decide
> the priority between locality and 

Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-23 Thread Mel Gorman
On Tue, Oct 23, 2018 at 08:57:45AM +0100, Mel Gorman wrote:
> Note that I accept it's trivial to fragment memory in a harmful way.
> I've prototyped a test case yesterday that uses fio in the following way
> to fragment memory
> 
> o fio of many small files (64K)
> o create initial pages using writes that disable fallocate and create
>   inodes on first open. This is massively inefficient from an IO
>   perspective but it mixes slab and page cache allocations so all
>   NUMA nodes get fragmented.
> o Size the page cache so that it's 150% the size of memory so it forces
>   reclaim activity and new fio activity to further mix slab and page
>   cache allocations
> o After initial write, run parallel readers to keep slab active and run
>   this for the same length of time the initial writes took so fio has
>   called stat() on the existing files and begun the read phase. This
>   forces the slab and page cache pages to remain "live" and difficult
>   to reclaim/compact.
> o Finally, start a workload that allocates THP after the warmup phase
>   but while fio is still running, to measure allocation success rate
>   and latencies
> 
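The recipe above might look roughly like the following fio job file; the option names come from fio's documented job format, but the path, file counts, and job counts are my assumptions rather than the actual jobs used:

```ini
# Assumed layout: /mnt/test is the dedicated test partition mentioned
# in the description; all sizes and counts are illustrative.
[global]
directory=/mnt/test
filesize=64k
nrfiles=10000
fallocate=none
ioengine=psync
bs=4k

# Phase 1: create many small files with fallocate disabled so inode
# (slab) and page cache allocations interleave on all nodes.
[fragment-write]
rw=write
numjobs=4

# Phase 2 (after the writes): parallel readers keep slab and page
# cache "live" while a THP-allocating workload runs alongside.
[parallel-read]
stonewall
rw=randread
numjobs=4
time_based=1
runtime=600
```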

The tests completed shortly after I wrote this mail so I can put some
figures to the intuitions expressed in this mail. I'm truncating the
reports for clarity but can upload the full data if necessary.

The target system is a 2-socket machine using E5-2670 v3 (Haswell).
The base kernel is 4.19. The baseline is an unpatched kernel;
relaxthisnode-v1r1 is patch 1 of Michal's series and does not include
the second cleanup; noretry-v1r1 is David's alternative.

global-dhp__workload_usemem-stress-numa-compact
(no filesystem as this is the trivial case of allocating anonymous
 memory on a freshly booted system. Figures are elapsed time)

                               4.19.0             4.19.0             4.19.0
                              vanilla relaxthisnode-v1r1       noretry-v1r1
Amean System-1   14.16 (   0.00%)   12.35 *  12.75%*   15.96 * -12.70%*
Amean System-3   15.14 (   0.00%)    9.83 *  35.08%*   11.00 *  27.34%*
Amean System-4    9.88 (   0.00%)    9.85 (   0.25%)    9.80 (   0.75%)
Amean Elapsd-1   29.23 (   0.00%)   26.16 *  10.50%*   33.81 * -15.70%*
Amean Elapsd-3   25.67 (   0.00%)    7.28 *  71.63%*    8.49 *  66.93%*
Amean Elapsd-4    5.49 (   0.00%)    5.53 (  -0.76%)    5.46 (   0.49%)

The figures in () are the percentage gain/loss. If a figure is surrounded
by *'s then the automation has estimated that the results are outside the noise.

System CPU usage is reduced by both as reported, but Michal's gives a
10.5% gain and David's is a 15.7% loss. Both appear to be outside the
noise. While not included here, the vanilla kernel swaps heavily with a 56%
reclaim efficiency (pages scanned vs pages reclaimed) and neither of the
proposed patches swaps and it's all from direct reclaim activity. Michal's
patch does not enter reclaim, David's enters reclaim but it's very light.

global-dhp__workload_thpfioscale-xfs
(Uses fio to fragment memory and keep slab and page cache active while
 there is an attempt to allocate THP in parallel. No special madvise
 flags or tuning is applied. A dedicated test partition is used for
 fio and XFS was the target filesystem that is recreated on every test)
thpfioscale Fault Latencies
                                 4.19.0             4.19.0             4.19.0
                                vanilla relaxthisnode-v1r1       noretry-v1r1
Amean fault-base-5 1471.95 (   0.00%) 1515.64 (  -2.97%) 1491.05 (  -1.30%)
Amean fault-huge-5    0.00 (   0.00%)  534.51 * -99.00%*    0.00 (   0.00%)

thpfioscale Percentage Faults Huge
                                4.19.0             4.19.0             4.19.0
                               vanilla relaxthisnode-v1r1       noretry-v1r1
Percentage huge-5    0.00 (   0.00%)    1.18 ( 100.00%)    0.00 (   0.00%)

Both patches incur a slight hit to fault latency (measured in microseconds)
but it's well within the noise. While not included here, the variance is
massive (min 1052 microseconds, max 282348 microseconds in the vanilla
kernel). Both patches reduce the worst-case scenarios. All kernels show
terrible allocation success rates. Michal's had a 1.18% success rate but
that's probably luck.

global-dhp__workload_thpfioscale-madvhugepage-xfs
(Same as the last test but the THP allocation program uses
 MADV_HUGEPAGE)

thpfioscale Fault Latencies
                                 4.19.0             4.19.0             4.19.0
                                vanilla relaxthisnode-v1r1       noretry-v1r1
Amean fault-base-5 6772.84 (   0.00%) 10256.30 * -51.43%*  1574.45 *  76.75%*
Amean fault-huge-5 2644.19 (   0.00%)  5314.17 *-100.98%*  3517.89 ( -33.04%)


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-23 Thread Mel Gorman
On Mon, Oct 22, 2018 at 02:04:32PM -0700, David Rientjes wrote:
> On Tue, 16 Oct 2018, Mel Gorman wrote:
> 
> > I consider this to be an unfortunate outcome. On the one hand, we have a
> > problem that three people can trivially reproduce with known test cases
> > and a patch shown to resolve the problem. Two of those three people work
> > on distributions that are exposed to a large number of users. On the
> > other, we have a problem that requires the system to be in a specific
> > state and an unknown workload that suffers badly from the remote access
> > penalties with a patch that has review concerns and has not been proven
> > to resolve the trivial cases.
> 
> The specific state is that remote memory is fragmented as well, this is 
> not atypical. 

While not necessarily atypical, *how* it gets fragmented is important.
The final state of fragmentation depends on both the input allocation
stream *and* the liveness and mobility of the unmovable pages. This is
why a target workload is crucial. While it's trivial to fragment memory,
it can be in a state that may or may not be trivially compacted.

For example, a fragmenting workload that interleaves slab allocations
with page cache may fragment memory but if the files are not being used
at the time of a THP allocation then they are trivially
reclaimed/compacted. However, if the files are in active use, the pages
remain live and the fragmentation persists.

You make the following claim in another response

The test case is rather trivial: fragment all memory with order-4
memory to replicate a fragmented local zone, use sched_setaffinity()
to bind to that node, and fault a reasonable number of hugepages
(128MB, 256, whatever).

You do not describe why it's order-4 allocations that are fragmenting or why
they remain live. order-4 might imply large jumbo frame allocations but they
typically do not remain live for very long unless you are using a driver
that recycles large numbers of order-4 pages. It's a critical detail. If
your test case is trivial then post it so there is a common base to work
from. A test case has been requested from you multiple times already.

Note that I accept it's trivial to fragment memory in a harmful way.
I've prototyped a test case yesterday that uses fio in the following way
to fragment memory

o fio of many small files (64K)
o create initial pages using writes that disable fallocate and create
  inodes on first open. This is massively inefficient from an IO
  perspective but it mixes slab and page cache allocations so all
  NUMA nodes get fragmented.
o Size the page cache so that it's 150% the size of memory so it forces
  reclaim activity and new fio activity to further mix slab and page
  cache allocations
o After initial write, run parallel readers to keep slab active and run
  this for the same length of time the initial writes took so fio has
  called stat() on the existing files and begun the read phase. This
  forces the slab and page cache pages to remain "live" and difficult
  to reclaim/compact.
o Finally, start a workload that allocates THP after the warmup phase
  but while fio is still running to measure allocation success rate
  and latencies

There are three configurations that use default advice, MADV_HUGEPAGE and
"always" defrag that are still running to cover the different configurations
of interest. However, this is completely useless to you as even if these test
cases are fixed, there is no guarantee at all that it helps yours. They
are still being evaluated as they were recently prototyped but the mmtests
configurations in case someone wishes to independently evaluate are;

config-global-dhp__workload_thpfioscale
config-global-dhp__workload_thpfioscale-defrag
config-global-dhp__workload_thpfioscale-madvhugepage

It's best to run them on a dedicated test partition if possible. Locally
I've configured them to use a freshly created XFS filesystem.

If you're going to block fixes for problems that multiple people
experience then at least do the courtesy of providing a test case or
prototype patches for the complex alternatives you are proposing so they
can be independently evaluated. 

> Removing __GFP_THISNODE to avoid thrashing a zone will only 
> be beneficial when you can allocate remotely instead.  When you cannot 
> allocate remotely instead, you've made the problem much worse for 
> something that should be __GFP_NORETRY in the first place (and was for 
> years) and should never thrash.
> 
> I'm not interested in patches that require remote nodes to have an 
> abundance of free or unfragmented memory to avoid regressing.
> 

If only there was a test case that could reliably demonstrate this
independently so it can be analysed and fixed *hint hint*. Better yet,
kindly prototype a fix.

> > In the case of distributions, the first
> > patch addresses concerns with a common workload where on the other hand
> > we have an internal workload of a single company that is affected --
> > which indirectly 

Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-22 Thread Zi Yan

Hi David,

On 22 Oct 2018, at 17:04, David Rientjes wrote:


> On Tue, 16 Oct 2018, Mel Gorman wrote:
>
>> I consider this to be an unfortunate outcome. On the one hand, we have a
>> problem that three people can trivially reproduce with known test cases
>> and a patch shown to resolve the problem. Two of those three people work
>> on distributions that are exposed to a large number of users. On the
>> other, we have a problem that requires the system to be in a specific
>> state and an unknown workload that suffers badly from the remote access
>> penalties with a patch that has review concerns and has not been proven
>> to resolve the trivial cases.
>
> The specific state is that remote memory is fragmented as well, this is
> not atypical.  Removing __GFP_THISNODE to avoid thrashing a zone will only
> be beneficial when you can allocate remotely instead.  When you cannot
> allocate remotely instead, you've made the problem much worse for
> something that should be __GFP_NORETRY in the first place (and was for
> years) and should never thrash.
>
> I'm not interested in patches that require remote nodes to have an
> abundance of free or unfragmented memory to avoid regressing.

I just wonder what is the page allocation priority list in your environment,
assuming all memory nodes are so fragmented that no huge pages can be
obtained without compaction or reclaim.

Here is my version of that list, please let me know if it makes sense to you:


1. local huge pages: with compaction and/or page reclaim, you are willing
   to pay the penalty of getting huge pages;

2. local base pages: since, in your system, remote data accesses have much
   higher penalty than the extra TLB misses incurred by the base page size;

3. remote huge pages: at least it is better than remote base pages;

4. remote base pages: it performs worst in terms of locality and TLBs.
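As a purely illustrative toy model (not kernel code), the strict ordering of
that list can be written as a single fallback chain:

```c
/* Toy model of the four-step priority list above; illustrative only,
 * the real decision is made inside the kernel's page allocator. */
#include <stdbool.h>

enum alloc_class {
        LOCAL_HUGE,     /* 1. local huge pages  */
        LOCAL_BASE,     /* 2. local base pages  */
        REMOTE_HUGE,    /* 3. remote huge pages */
        REMOTE_BASE,    /* 4. remote base pages */
        ALLOC_FAIL,
};

/* Each flag says whether that class of allocation could succeed right
 * now; the chain returns the first class the list would hand out. */
static enum alloc_class choose(bool local_huge, bool local_base,
                               bool remote_huge, bool remote_base)
{
        if (local_huge)
                return LOCAL_HUGE;
        if (local_base)
                return LOCAL_BASE;
        if (remote_huge)
                return REMOTE_HUGE;
        if (remote_base)
                return REMOTE_BASE;
        return ALLOC_FAIL;
}
```

Note that REMOTE_HUGE is only reachable once both local attempts have
failed, which is exactly the fallback problem raised in this email: after
the kernel has fallen back to base pages there is no path back to a huge
page allocation.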

This might not be easy to implement in the current kernel, because
the zones from remote nodes will always be candidates when the
kernel calls get_page_from_freelist(). Only __GFP_THISNODE
and MPOL_BIND can eliminate these remote node zones, where __GFP_THISNODE
is a kernel-internal version of MPOL_BIND that overrides any user space
memory policy other than MPOL_BIND, which is troublesome.
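A toy sketch of the zone filtering being described (the two-node topology
and integer node IDs are illustrative; the real walk happens over kernel
zonelists):

```c
/* Toy model: __GFP_THISNODE (like a strict MPOL_BIND to the local node)
 * filters remote nodes out of the candidate list before the freelist
 * walk.  Two nodes and integer node IDs are illustrative only. */
#include <stdbool.h>

#define NR_NODES 2

/* Fill 'out' with the node IDs the allocator may try; returns count. */
static int build_candidates(int local_node, bool thisnode, int *out)
{
        int n = 0;

        for (int node = 0; node < NR_NODES; node++) {
                if (thisnode && node != local_node)
                        continue;       /* remote zones are never tried */
                out[n++] = node;
        }
        return n;
}
```

With the filter in place the allocator can only succeed or fail on the
local node, which is why __GFP_THISNODE forecloses the remote-huge-page
step of the priority list above.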

In addition, to prioritize local base pages over remote pages,
the original huge page allocation has to fail, then the kernel can
fall back to base page allocations. And you will never get remote
huge pages any more if the local base page allocation fails,
because there is no way back to huge page allocation after the fallback.

Do you expect both behaviors?



> In the case of distributions, the first
> patch addresses concerns with a common workload where on the other hand
> we have an internal workload of a single company that is affected --
> which indirectly affects many users admittedly but only one entity directly.
>
> The alternative, which is my patch, hasn't been tested or shown why it
> cannot work.  We continue to talk about order >= pageblock_order vs
> __GFP_COMPACTONLY.
>
> I'd like to know, specifically:
>
>  - what measurable effect my patch has that is better solved with removing
>    __GFP_THISNODE on systems where remote memory is also fragmented?
>
>  - what platforms benefit from remote access to hugepages vs accessing
>    local small pages (I've asked this maybe 4 or 5 times now)?
>
>  - how is reclaiming (and possibly thrashing) memory helpful if compaction
>    fails to free an entire pageblock due to slab fragmentation due to low
>    on memory conditions and the page allocator preference to return node-
>    local memory?
>
>  - how is reclaiming (and possibly thrashing) memory helpful if compaction
>    cannot access the memory reclaimed because the freeing scanner has
>    already passed by it, or the migration scanner has passed by it, since
>    this reclaim is not targeted to pages it can find?
>
>  - what metrics can be introduced to the page allocator so that we can
>    determine that reclaiming (and possibly thrashing) memory will result
>    in a hugepage being allocated?


The slab fragmentation and whether reclaim/compaction can help form
huge pages seem to be orthogonal to this patch, which tries to decide
the priority between locality and huge pages.

For slab fragmentation, you might find the paper “Making Huge Pages
Actually Useful” (https://dl.acm.org/citation.cfm?id=3173203) helpful.
The paper is trying to minimize the number of page blocks that have
both moveable and non-moveable pages.


--
Best Regards
Yan Zi


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-22 Thread David Rientjes
On Tue, 16 Oct 2018, Mel Gorman wrote:

> I consider this to be an unfortunate outcome. On the one hand, we have a
> problem that three people can trivially reproduce with known test cases
> and a patch shown to resolve the problem. Two of those three people work
> on distributions that are exposed to a large number of users. On the
> other, we have a problem that requires the system to be in a specific
> state and an unknown workload that suffers badly from the remote access
> penalties with a patch that has review concerns and has not been proven
> to resolve the trivial cases.

The specific state is that remote memory is fragmented as well, this is 
not atypical.  Removing __GFP_THISNODE to avoid thrashing a zone will only 
be beneficial when you can allocate remotely instead.  When you cannot 
allocate remotely instead, you've made the problem much worse for 
something that should be __GFP_NORETRY in the first place (and was for 
years) and should never thrash.

I'm not interested in patches that require remote nodes to have an 
abundance of free or unfragmented memory to avoid regressing.

> In the case of distributions, the first
> patch addresses concerns with a common workload where on the other hand
> we have an internal workload of a single company that is affected --
> which indirectly affects many users admittedly but only one entity directly.
> 

The alternative, which is my patch, hasn't been tested or shown why it 
cannot work.  We continue to talk about order >= pageblock_order vs
__GFP_COMPACTONLY.

I'd like to know, specifically:

 - what measurable effect my patch has that is better solved with removing
   __GFP_THISNODE on systems where remote memory is also fragmented?

 - what platforms benefit from remote access to hugepages vs accessing
   local small pages (I've asked this maybe 4 or 5 times now)?

 - how is reclaiming (and possibly thrashing) memory helpful if compaction
   fails to free an entire pageblock due to slab fragmentation due to low
   on memory conditions and the page allocator preference to return node-
   local memory?

 - how is reclaiming (and possibly thrashing) memory helpful if compaction
   cannot access the memory reclaimed because the freeing scanner has 
   already passed by it, or the migration scanner has passed by it, since
   this reclaim is not targeted to pages it can find?

 - what metrics can be introduced to the page allocator so that we can
   determine that reclaiming (and possibly thrashing) memory will result 
   in a hugepage being allocated?

Until we have answers, especially for the last, there is no reason why thp 
allocations should not be __GFP_NORETRY including for MADV_HUGEPAGE 
regions.  The implementation of memory compaction simply cannot guarantee 
that the cost is worthwhile.


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-22 Thread David Rientjes
On Mon, 15 Oct 2018, Andrea Arcangeli wrote:

> > On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes 
> >  wrote:
> > >  Would it be possible to test with my 
> > > patch[*] that does not try reclaim to address the thrashing issue?
> > 
> > Yes please.
> 
> It'd also be great if a testcase reproducing the 40% higher access
> latency (with the one liner original fix) was available.
> 

I never said 40% higher access latency, I said 40% higher fault latency.  

The higher access latency is 13.9% as measured on Haswell.

The test case is rather trivial: fragment all memory with order-4 memory 
to replicate a fragmented local zone, use sched_setaffinity() to bind to 
that node, and fault a reasonable number of hugepages (128MB, 256, 
whatever).  The cost of faulting remotely in this case was measured to be 
40% higher than falling back to local small pages.  This occurs quite 
obviously because you are thrashing the remote node trying to allocate 
thp.

> We don't have a testcase for David's 40% latency increase problem, but
> that's likely to only happen when the system is somewhat low on memory
> globally.

Well, yes, but that's most of our systems.  We can't keep around gigabytes 
of memory free just to work around this patch.  Removing __GFP_THISNODE to 
avoid thrashing the local node obviously will incur a substantial 
performance degradation if you thrash the remote node as well.  This 
should be rather straight forward.

> When there's 75% or more of the RAM free (not even allocated as easily
> reclaimable pagecache) globally, you don't expect to hit heavy
> swapping.
> 

I agree there is no regression introduced by your patch when 75% of memory 
is free.

> The 40% THP allocation latency increase if you use MADV_HUGEPAGE in
> such window where all remote zones are fully fragmented is somehow
> lesser of a concern in my view (plus there's the compact deferred
> logic that should mitigate that scenario). Furthermore it is only a
> concern for page faults in MADV_HUGEPAGE ranges. If MADV_HUGEPAGE is
> set the userland allocation is long lived, so such higher allocation
> latency won't risk to hit short lived allocations that don't set
> MADV_HUGEPAGE (unless madvise=always, but that's not the default
> precisely because not all allocations are long lived).
> 
> If the MADV_HUGEPAGE using library was freely available it'd also be
> nice.
> 

You scan your mappings for .text segments, map a hugepage-aligned region 
sufficient in size, mremap() to that region, and do MADV_HUGEPAGE.
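The scanning step of such a library can be sketched as below (hypothetical: only the /proc/self/maps walk is shown, and `find_exec_mappings` is an invented name; the remap itself — mmap an aligned region, copy, mremap() with MREMAP_FIXED, then MADV_HUGEPAGE — must run from code that is not itself inside the segment being moved):

```c
/* Hypothetical sketch: locate the executable (r-xp) mappings of the
 * current process by parsing /proc/self/maps.  The actual remap of
 * .text onto a hugepage-aligned region is omitted. */
#include <stdio.h>
#include <string.h>

/* Return the number of executable mappings found, storing the first
 * one's start/end addresses in *start/*end (left untouched if none). */
static int find_exec_mappings(unsigned long *start, unsigned long *end)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    int n = 0;

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        unsigned long lo, hi;
        char perms[8];

        /* each line starts "start-end perms ..." */
        if (sscanf(line, "%lx-%lx %7s", &lo, &hi, perms) == 3 &&
            strchr(perms, 'x')) {
            if (n++ == 0) {
                *start = lo;
                *end = hi;
            }
        }
    }
    fclose(f);
    return n;
}
```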


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-22 Thread David Rientjes
On Mon, 15 Oct 2018, Andrea Arcangeli wrote:

> > At the risk of beating a dead horse that has already been beaten, what are 
> > the plans for this patch when the merge window opens?  It would be rather 
> > unfortunate for us to start incurring a 14% increase in access latency and 
> > 40% increase in fault latency.  Would it be possible to test with my 
> > patch[*] that does not try reclaim to address the thrashing issue?  If 
> > that is satisfactory, I don't have a strong preference if it is done with 
> > a hardcoded pageblock_order and __GFP_NORETRY check or a new 
> > __GFP_COMPACT_ONLY flag.
> 
> I don't like the pageblock size hardcoding inside the page
> allocator. __GFP_COMPACT_ONLY is fully runtime equivalent, but it at
> least lets the caller choose the behavior, so it looks more flexible.
> 

I'm not sure that I understand why the user would ever want to thrash 
their zone(s) for allocations of this order.  The problem here is 
specifically related to an entire pageblock becoming freeable and the 
unlikeliness that reclaiming/swapping/thrashing will assist memory 
compaction in making that happen.  For this reason, I think the 
order >= pageblock_order check is reasonable because it depends on the 
implementation of memory compaction.

Why do we need another gfp flag for thp allocations when they are made to 
be __GFP_NORETRY by default and it is very unlikely that reclaiming once 
and then retrying compaction is going to make an entire pageblock free?

I'd like to know (1) how continuous reclaim activity can make entire 
pageblocks freeable without thrashing and (2) the metrics that we can use 
to determine when it is worthwhile vs harmful.  I don't believe (1) is 
ever helpful based on the implementation of memory compaction and we lack 
(2) since reclaim is not targeted to memory that compaction can use.

> As long as compaction returns COMPACT_SKIPPED it's ok to keep doing
> reclaim and keep doing compaction, as long as compaction succeeds.
> 

Compaction will operate on 32 pages at a time and declare success each 
time and then pick up where it left off the next time it is called in the 
hope that it "succeeds" 512/32=16 times in a row while constantly 
reclaiming memory.  Even a single slab page in that pageblock will make 
all of this work useless.  Reclaimed memory not accessible by the freeing 
scanner will make its work useless.  We lack the ability to determine when 
compaction is successful in freeing a full pageblock.
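For reference, the arithmetic above with x86-64 defaults (4KB base pages, 2MB pageblocks, and compaction's 32-page migration increment; the macro names below are illustrative, not the kernel's):

```c
/* The arithmetic above, with x86-64 defaults: a 2MB pageblock is 512
 * base pages and compaction migrates 32 pages per increment, so
 * freeing one pageblock needs 16 back-to-back "successful" increments
 * with no unmovable (e.g. slab) page anywhere in the block. */
#define BASE_PAGE        4096UL
#define PAGEBLOCK_BYTES  (2UL << 20)   /* 2MB pageblock / THP size */
#define COMPACT_CLUSTER  32UL          /* pages migrated per increment */

static unsigned long increments_per_pageblock(void)
{
    unsigned long pages = PAGEBLOCK_BYTES / BASE_PAGE;  /* 512 */
    return pages / COMPACT_CLUSTER;                     /* 16 */
}
```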


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-17 Thread Mel Gorman
On Tue, Oct 16, 2018 at 03:37:15PM -0700, Andrew Morton wrote:
> On Tue, 16 Oct 2018 08:46:06 +0100 Mel Gorman  wrote:
> > I consider this to be an unfortunate outcome. On the one hand, we have a
> > problem that three people can trivially reproduce with known test cases
> > and a patch shown to resolve the problem. Two of those three people work
> > on distributions that are exposed to a large number of users. On the
> > other, we have a problem that requires the system to be in a specific
> > state and an unknown workload that suffers badly from the remote access
> > penalties with a patch that has review concerns and has not been proven
> > to resolve the trivial cases. In the case of distributions, the first
> > patch addresses concerns with a common workload where on the other hand
> > we have an internal workload of a single company that is affected --
> > which indirectly affects many users admittedly but only one entity directly.
> > 
> > At the absolute minimum, a test case for the "system fragmentation incurs
> > access penalties for a workload" scenario that could both replicate the
> > fragmentation and demonstrate the problem should have been available before
> > the patch was rejected.  With the test case, there would be a chance that
> > others could analyse the problem and prototype some fixes. The test case
> > was requested in the thread and never produced so even if someone were to
> prototype fixes, it would be dependent on a third party to test and produce
> > data which is a time-consuming loop. Instead, we are more or less in limbo.
> > 
> 
> OK, thanks.
> 
> But we're OK holding off for a few weeks, yes?  If we do that
> we'll still make it into 4.19.1.  Am reluctant to merge this while
> discussion, testing and possibly more development are ongoing.
> 

Without a test case that reproduces the Google case, we are a bit stuck.
Previous experience indicates that just fragmenting memory is not enough
to give a reliable case as unless the unmovable/reclaimable pages are
"sticky", the normal reclaim can handle it. Similarly, the access
pattern of the target workload is important as it would need to be
something larger than L3 cache to constantly hit the access penalty. We
do not know what the exact characteristics of the Google workload are
but we know that a fix for three cases is not equivalent for the Google
case.

The discussion has circled around wish-list items such as better
fragmentation control, node-aware compaction, improved compact deferred
logic and lower latencies with little in the way of actual specifics
of implementation or patches. Improving fragmentation control would
benefit from a workload that actually fragments so the extfrag events
can be monitored as well as maybe a dump of pageblocks with mixed pages.

On node-aware compaction, that was not implemented initially simply
because HighMem was common and that needs to be treated as a corner case
-- we cannot safely migrate pages from zone normal to highmem. That one
is relatively trivial to measure as it's a functional issue.

However, backing off compaction properly to maximise allocation success
rates while minimising allocation latency and access latency needs a live
workload that is representative. Trivial cases like the java workloads,
nas or usemem won't do as they either exhibit special locality or are
streaming readers/writers. Memcache might work but the driver in that
case is critical to ensure the access penalties are incurred. Again,
a modern example is missing.

As for why this took so long to discover, it is highly likely that it's
due to VMs being sized such that they typically fit in a NUMA node, so
the worst case scenarios would have been avoided. Furthermore, a machine
dedicated to VMs has fewer concerns with respect to slab allocations
and unmovable allocations fragmenting memory long-term. Finally, the
worst case scenarios are encountered when there is a mix of different
workloads of variable duration which may be common in a Google-like setup
with different jobs being dispatched across a large network but less so
in other setups where a service tends to be persistent. We already know
that some of the worst performance problems take years to discover.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-17 Thread Michal Hocko
On Tue 16-10-18 16:16:43, Andrew Morton wrote:
> On Tue, 16 Oct 2018 19:11:49 -0400 Andrea Arcangeli  
> wrote:
> 
> > This was a severe regression
> > compared to previous kernels that made important workloads unusable
> > and it starts when __GFP_THISNODE was added to THP allocations under
> > MADV_HUGEPAGE. It is not a significant risk to go to the previous
> > behavior before __GFP_THISNODE was added, it worked like that for
> years.
> 
> 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to
> local node") was April 2015.  That's a long time for a "severe
> regression" to go unnoticed?

Well, it takes some time for enterprise users to adopt changes, and we
are starting to see people report this issue. That is why I believe we should
start with something really simple and stable tree backportable first
and then build something more complex on top.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-16 Thread Andrew Morton
On Tue, 16 Oct 2018 19:11:49 -0400 Andrea Arcangeli  wrote:

> This was a severe regression
> compared to previous kernels that made important workloads unusable
> and it starts when __GFP_THISNODE was added to THP allocations under
> MADV_HUGEPAGE. It is not a significant risk to go to the previous
> behavior before __GFP_THISNODE was added, it worked like that for
> years.

5265047ac301 ("mm, thp: really limit transparent hugepage allocation to
local node") was April 2015.  That's a long time for a "severe
regression" to go unnoticed?



Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-16 Thread Andrea Arcangeli
Hello,

On Tue, Oct 16, 2018 at 03:37:15PM -0700, Andrew Morton wrote:
> we'll still make it into 4.19.1.  Am reluctant to merge this while
> discussion, testing and possibly more development are ongoing.

I think there can be definitely more developments primarily to make
the compact deferred logic NUMA aware. Instead of a global deferred
logic, we should split it per zone per node so that it backs off
exponentially, with a higher cap on remote nodes. The current global
"backoff" limit will still apply to the "local" zone compaction. Who
would like to work on that?
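Such a per-node backoff might look roughly like this (purely illustrative: the struct, field names, and caps below are invented, not the kernel's actual defer_compaction()/compaction_deferred() state, which lives in struct zone):

```c
/* Illustrative per-node exponential compaction backoff: each node
 * tracks its own defer shift; every failure doubles the number of
 * attempts skipped, with a higher cap for remote nodes so they back
 * off harder.  All names and constants are hypothetical. */
#define LOCAL_DEFER_SHIFT_MAX   6   /* skip up to 2^6 local attempts  */
#define REMOTE_DEFER_SHIFT_MAX  10  /* back off harder on remote nodes */

struct node_compact_state {
    unsigned int defer_shift;   /* current backoff exponent        */
    unsigned long considered;   /* attempts skipped since last try */
};

/* Record a compaction failure on this node: widen the backoff window. */
static void compact_defer(struct node_compact_state *s, int is_remote)
{
    unsigned int max = is_remote ? REMOTE_DEFER_SHIFT_MAX
                                 : LOCAL_DEFER_SHIFT_MAX;

    if (s->defer_shift < max)
        s->defer_shift++;       /* window: 2, 4, 8, ... attempts */
    s->considered = 0;
}

/* Return 1 if this allocation attempt should skip compaction here. */
static int compact_deferred(struct node_compact_state *s)
{
    if (++s->considered <= (1UL << s->defer_shift))
        return 1;
    s->considered = 0;          /* window elapsed: try compaction */
    return 0;
}
```

With separate state per node, a remote zone that keeps failing gets skipped for exponentially longer stretches without penalizing compaction on the local node.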

However, I don't think it's worth waiting for that, because it's not a
trivial change.

Certainly we can't ship upstream in production with this bug, so if it
doesn't get fixed upstream we'll fix it downstream first until the
more developments are production ready. This was a severe regression
compared to previous kernels that made important workloads unusable
and it starts when __GFP_THISNODE was added to THP allocations under
MADV_HUGEPAGE. It is not a significant risk to go to the previous
behavior before __GFP_THISNODE was added, it worked like that for
years.

This was simply an optimization to some lucky workloads that can fit
in a single node, but it ended up breaking the VM for others that
can't possibly fit in a single node, so going back is safe.

Thanks,
Andrea


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-16 Thread Andrew Morton
On Tue, 16 Oct 2018 08:46:06 +0100 Mel Gorman  wrote:

> On Mon, Oct 15, 2018 at 03:44:59PM -0700, Andrew Morton wrote:
> > On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes 
> >  wrote:
> > 
> > > At the risk of beating a dead horse that has already been beaten, what 
> > > are 
> > > the plans for this patch when the merge window opens?
> > 
> > I'll hold onto it until we've settled on something.  Worst case,
> > Andrea's original is easily backportable.
> > 
> 
> I consider this to be an unfortunate outcome. On the one hand, we have a
> problem that three people can trivially reproduce with known test cases
> and a patch shown to resolve the problem. Two of those three people work
> on distributions that are exposed to a large number of users. On the
> other, we have a problem that requires the system to be in a specific
> state and an unknown workload that suffers badly from the remote access
> penalties with a patch that has review concerns and has not been proven
> to resolve the trivial cases. In the case of distributions, the first
> patch addresses concerns with a common workload where on the other hand
> we have an internal workload of a single company that is affected --
> which indirectly affects many users admittedly but only one entity directly.
> 
> At the absolute minimum, a test case for the "system fragmentation incurs
> access penalties for a workload" scenario that could both replicate the
> fragmentation and demonstrate the problem should have been available before
> the patch was rejected.  With the test case, there would be a chance that
> others could analyse the problem and prototype some fixes. The test case
> was requested in the thread and never produced so even if someone were to
> prototype fixes, it would be dependent on a third party to test and produce
> data which is a time-consuming loop. Instead, we are more or less in limbo.
> 

OK, thanks.

But we're OK holding off for a few weeks, yes?  If we do that
we'll still make it into 4.19.1.  Am reluctant to merge this while
discussion, testing and possibly more development are ongoing.



Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-16 Thread Mel Gorman
On Mon, Oct 15, 2018 at 03:44:59PM -0700, Andrew Morton wrote:
> On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes  
> wrote:
> 
> > At the risk of beating a dead horse that has already been beaten, what are 
> > the plans for this patch when the merge window opens?
> 
> I'll hold onto it until we've settled on something.  Worst case,
> Andrea's original is easily backportable.
> 

I consider this to be an unfortunate outcome. On the one hand, we have a
problem that three people can trivially reproduce with known test cases
and a patch shown to resolve the problem. Two of those three people work
on distributions that are exposed to a large number of users. On the
other, we have a problem that requires the system to be in a specific
state and an unknown workload that suffers badly from the remote access
penalties with a patch that has review concerns and has not been proven
to resolve the trivial cases. In the case of distributions, the first
patch addresses concerns with a common workload where on the other hand
we have an internal workload of a single company that is affected --
which indirectly affects many users admittedly but only one entity directly.

At the absolute minimum, a test case for the "system fragmentation incurs
access penalties for a workload" scenario that could both replicate the
fragmentation and demonstrate the problem should have been available before
the patch was rejected.  With the test case, there would be a chance that
others could analyse the problem and prototype some fixes. The test case
was requested in the thread and never produced so even if someone were to
prototype fixes, it would be dependent on a third party to test and produce
data which is a time-consuming loop. Instead, we are more or less in limbo.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-15 Thread Andrea Arcangeli
Hello Andrew,

On Mon, Oct 15, 2018 at 03:44:59PM -0700, Andrew Morton wrote:
> On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes  
> wrote:
> >  Would it be possible to test with my 
> > patch[*] that does not try reclaim to address the thrashing issue?
> 
> Yes please.

It'd also be great if a testcase reproducing the 40% higher access
latency (with the one-liner original fix) were available.

We don't have a testcase for David's 40% latency increase problem, but
that's likely to only happen when the system is somewhat low on memory
globally. So the measurement must be done when compaction starts
failing globally on all zones, but before the system starts
swapping. The more global fragmentation there is, the larger the window
between "compaction fails because all zones are too fragmented" and
"there is still free PAGE_SIZEd memory available to reclaim without
swapping it out". If I understood correctly, that is precisely the
window where the 40% higher latency should materialize.

The workload that shows the badness in the upstream code is fairly
trivial. Mel and Zi reproduced it too and I have two testcases that
can reproduce it, one with device assignment and the other is just
memhog. That's a massively larger window than the one where the 40%
higher latency materializes.

When there's 75% or more of the RAM free (not even allocated as easily
reclaimable pagecache) globally, you don't expect to hit heavy
swapping.

The 40% THP allocation latency increase if you use MADV_HUGEPAGE in
such a window where all remote zones are fully fragmented is somewhat
less of a concern in my view (plus there's the compact deferred
logic that should mitigate that scenario). Furthermore it is only a
concern for page faults in MADV_HUGEPAGE ranges. If MADV_HUGEPAGE is
set the userland allocation is long lived, so such higher allocation
latency won't risk to hit short lived allocations that don't set
MADV_HUGEPAGE (unless madvise=always, but that's not the default
precisely because not all allocations are long lived).

It would also be nice if the MADV_HUGEPAGE-using library were freely
available.


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-15 Thread Andrea Arcangeli
On Mon, Oct 15, 2018 at 03:30:17PM -0700, David Rientjes wrote:
> At the risk of beating a dead horse that has already been beaten, what are 
> the plans for this patch when the merge window opens?  It would be rather 
> unfortunate for us to start incurring a 14% increase in access latency and 
> 40% increase in fault latency.  Would it be possible to test with my 
> patch[*] that does not try reclaim to address the thrashing issue?  If 
> that is satisfactory, I don't have a strong preference if it is done with 
> a hardcoded pageblock_order and __GFP_NORETRY check or a new 
> __GFP_COMPACT_ONLY flag.

I don't like the pageblock size hardcoding inside the page
allocator. __GFP_COMPACT_ONLY is fully runtime equivalent, but it at
least lets the caller choose the behavior, so it looks more flexible.

To fix your 40% fault latency concern in the short term we could use
__GFP_COMPACT_ONLY, but I think we should get rid of
__GFP_COMPACT_ONLY later: we need more statistical data in the zone
structure to track remote compaction failures triggering because the
zone is fully fragmented.

Once the zone is fully fragmented we need to do a special exponential
backoff on that zone when the zone is from a remote node.

Furthermore at the first remote NUMA node zone failure caused by full
fragmentation we need to interrupt compaction and stop trying with all
remote nodes.

As long as compaction returns COMPACT_SKIPPED it's ok to keep doing
reclaim and keep doing compaction, as long as compaction succeeds.

What is causing the higher latency is the fact we try all zones from
all remote nodes even if there's a failure caused by full
fragmentation of all remote zone, and we're unable to skip (skip with
exponential backoff) only those zones where compaction cannot succeed
because of fragmentation.

Once we achieve the above deleting __GFP_COMPACT_ONLY will be a
trivial patch.

> I think the second issue of faulting remote thp by removing __GFP_THISNODE 
> needs supporting evidence that shows some platforms benefit from this (and 
> not with numa=fake on the command line :).
> 
>  [*] https://marc.info/?l=linux-kernel=153903127717471

That is needed to compare the current one-liner fix with
__GFP_COMPACT_ONLY, but I don't think it's needed to compare v4.18
with the current fix. The badness of v4.18 was too bad to keep;
getting local PAGE_SIZEd memory or remote THPs is a secondary issue.

In fact the main reason for __GFP_COMPACT_ONLY is no longer that
tradeoff, but rather to avoid spending too much CPU in compaction when
all nodes are fragmented, so that the allocation latency doesn't
increase too much.

Thanks,
Andrea


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-15 Thread Andrew Morton
On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes  
wrote:

> At the risk of beating a dead horse that has already been beaten, what are 
> the plans for this patch when the merge window opens?

I'll hold onto it until we've settled on something.  Worst case,
Andrea's original is easily backportable.

>  It would be rather 
> unfortunate for us to start incurring a 14% increase in access latency and 
> 40% increase in fault latency.

Yes.

>  Would it be possible to test with my 
> patch[*] that does not try reclaim to address the thrashing issue?

Yes please.

And have you been able to test it with the sort of workloads which
Andrea is attempting to address?


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-15 Thread David Rientjes
On Wed, 10 Oct 2018, David Rientjes wrote:

> > I think "madvise vs mbind" is more an issue of "no-permission vs
> > permission" required. And if the processes ends up swapping out all
> > other process with their memory already allocated in the node, I think
> > some permission is correct to be required, in which case an mbind
> > looks a better fit. MPOL_PREFERRED also looks a first candidate for
> > investigation as it's already not black and white and allows spillover
> > and may already do the right thing in fact if set on top of
> > MADV_HUGEPAGE.
> > 
> 
> We would never want to thrash the local node for hugepages because there 
> is no guarantee that any swapping is useful.  On COMPACT_SKIPPED due to 
> low memory, we have very clear evidence that pageblocks are already 
> sufficiently fragmented by unmovable pages such that compaction itself, 
> even with abundant free memory, fails to free an entire pageblock due to 
> the allocator's preference to fragment pageblocks of fallback migratetypes 
> over returning remote free memory.
> 
> As I've stated, we do not want to reclaim pointlessly when compaction is 
> unable to access the freed memory or there is no guarantee it can free an 
> entire pageblock.  Doing so allows thrashing of the local node, or remote 
> nodes if __GFP_THISNODE is removed, and the hugepage still cannot be 
> allocated.  If this proposed mbind() that requires permissions is geared 
> to me as the user, I'm afraid the details of what leads to the thrashing 
> are not well understood because I certainly would never use this.
> 

At the risk of beating a dead horse that has already been beaten, what are 
the plans for this patch when the merge window opens?  It would be rather 
unfortunate for us to start incurring a 14% increase in access latency and 
40% increase in fault latency.  Would it be possible to test with my 
patch[*] that does not try reclaim to address the thrashing issue?  If 
that is satisfactory, I don't have a strong preference if it is done with 
a hardcoded pageblock_order and __GFP_NORETRY check or a new 
__GFP_COMPACT_ONLY flag.

I think the second issue of faulting remote thp by removing __GFP_THISNODE 
needs supporting evidence that shows some platforms benefit from this (and 
not with numa=fake on the command line :).

 [*] https://marc.info/?l=linux-kernel=153903127717471


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-10 Thread David Rientjes
On Tue, 9 Oct 2018, Andrea Arcangeli wrote:

> I think "madvise vs mbind" is more an issue of "no-permission vs
> permission" required. And if the processes ends up swapping out all
> other process with their memory already allocated in the node, I think
> some permission is correct to be required, in which case an mbind
> looks a better fit. MPOL_PREFERRED also looks a first candidate for
> investigation as it's already not black and white and allows spillover
> and may already do the right thing in fact if set on top of
> MADV_HUGEPAGE.
> 

We would never want to thrash the local node for hugepages because there 
is no guarantee that any swapping is useful.  On COMPACT_SKIPPED due to 
low memory, we have very clear evidence that pageblocks are already 
sufficiently fragmented by unmovable pages such that compaction itself, 
even with abundant free memory, fails to free an entire pageblock due to 
the allocator's preference to fragment pageblocks of fallback migratetypes 
over returning remote free memory.

As I've stated, we do not want to reclaim pointlessly when compaction is 
unable to access the freed memory or there is no guarantee it can free an 
entire pageblock.  Doing so allows thrashing of the local node, or remote 
nodes if __GFP_THISNODE is removed, and the hugepage still cannot be 
allocated.  If this proposed mbind() that requires permissions is geared 
to me as the user, I'm afraid the details of what leads to the thrashing 
are not well understood because I certainly would never use this.


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-10 Thread David Rientjes
On Tue, 9 Oct 2018, Andrea Arcangeli wrote:

> On Tue, Oct 09, 2018 at 03:17:30PM -0700, David Rientjes wrote:
> > causes workloads to severely regress both in fault and access latency when 
> > we know that direct reclaim is unlikely to make direct compaction free an 
> > entire pageblock.  It's more likely than not that the reclaim was 
> > pointless and the allocation will still fail.
> 
> How do you know that? If all RAM is full of filesystem cache, but it's
> not heavily fragmented by slab or other unmovable objects, compaction
> will succeed every single time after reclaim frees 2M of cache like
> it's asked to do.
> 
> reclaim succeeds every time, compaction then succeeds every time.
> 
> Not doing reclaim after COMPACT_SKIPPED is returned simply makes
> compaction unable to compact memory once all nodes are filled by
> filesystem cache.
> 

For reclaim to assist memory compaction based on compaction's current 
implementation, it would require that the freeing scanner starting at the 
end of the zone can find these reclaimed pages as migration targets and 
that compaction will be able to utilize these migration targets to make an 
entire pageblock free.

In such low-on-memory conditions, when a node is fully saturated, it is
much less likely that memory compaction can free an entire pageblock even 
if the freeing scanner can find these now-free pages.  More likely is that 
we have unmovable pages in MIGRATE_MOVABLE pageblocks because the 
allocator allows fallback to pageblocks of other migratetypes to return 
node local memory (because affinity matters for kernel memory as it 
matters for thp) rather than fallback to remote memory.

This has caused us significant pain where we have 1.5GB of slab, for 
example, spread over 100GB of pageblocks once the node has become 
saturated.

So reclaim is not always "pointless" as you point out, but it should at 
least only be attempted if memory compaction could free an entire 
pageblock if it had free memory to migrate to.  It's much harder to make 
sure that these freed pages can be utilized by the freeing scanner.  Based 
on how memory compaction is implemented, I do not think any guarantee can 
be made that reclaim will ever be successful in allowing it to make 
order-9 memory available, unfortunately.

> > If memory compaction were patched such that it can report that it could 
> > successfully free a page of the specified order if there were free pages 
> > at the end of the zone it could migrate to, reclaim might be helpful.  But 
> > with the current implementation, I don't think that is reliably possible.  
> > These free pages could easily be skipped over by the migration scanner 
> > because of the presence of slab pages, for example, and unavailable to the 
> > freeing scanner.
> 
> Yes there's one case where reclaim is "pointless", but it happens once
> and then COMPACT_DEFERRED is returned and __GFP_NORETRY will skip
> reclaim then.
> 
> So you're right when we hit fragmentation there's one and only one
> "pointless" reclaim invocation. And immediately after we also
> exponentially backoff on the compaction invocations with the
> compaction deferred logic.
> 

This assumes that every time we get COMPACT_SKIPPED that if we can simply 
free memory that it'll succeed and that's definitely not the case based on 
the implementation of memory compaction: compaction_alloc() needs to find 
the memory and the migration scanner needs to free an entire pageblock.  
The migration scanner doesn't even look ahead to see if that's possible 
before starting to migrate pages, it's limited to COMPACT_CLUSTER_MAX.

The scenario we have: compaction returns COMPACT_SKIPPED; reclaim 
expensively tries to reclaim memory by thrashing the local node; the 
compaction migration scanner has already passed over the now-freed pages 
so it's inaccessible; if accessible, the migration scanner migrates memory 
to the newly freed pages but fails to make a pageblock free; loop.

My contention is that the second step is only justified if we can 
guarantee the freed memory can be useful for compaction and that it can 
free an entire pageblock for the hugepage if it can migrate.  Both of 
those are not possible to determine based on the current implementation.

> > I'd appreciate if Andrea can test this patch, have a rebuttal that we 
> > should still remove __GFP_THISNODE because we don't care about locality as 
> > much as forming a hugepage, we can make that change, and then merge this 
> > instead of causing such massive fault and access latencies.
> 
> I can certainly test, but from source review I'm already convinced
> it'll solve fine the "pathological THP allocation behavior", no
> argument about that. It's certainly better and more correct your patch
> than the current upstream (no security issues with lack of permissions
> for __GFP_THISNODE anymore either).
> 
> I expect your patch will run 100% equivalent to __GFP_COMPACT_ONLY
> alternative I posted, for our testcase 

Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-10 Thread Vlastimil Babka
On 10/10/18 12:51 AM, Andrea Arcangeli wrote:
> Yes there's one case where reclaim is "pointless", but it happens once
> and then COMPACT_DEFERRED is returned and __GFP_NORETRY will skip
> reclaim then.
> 
> So you're right when we hit fragmentation there's one and only one
> "pointless" reclaim invocation. And immediately after we also
> exponentially backoff on the compaction invocations with the
> compaction deferred logic.
> 
> We could try to optimize away such a "pointless" reclaim event for sure,
> but it's probably an optimization that may just get lost in the noise
> and may not be measurable, because it only happens once when the first
> full fragmentation is encountered.

Note there's a small catch in the above. defer_compaction() has always
only been called after a failure on higher priority than
COMPACT_PRIO_ASYNC, where it's assumed that async compaction can
terminate prematurely due to a number of reasons, so it doesn't mean
that the zone itself cannot be compacted.
And, for __GFP_NORETRY, if the initial compaction fails, we keep using
async compaction also for the second, after-reclaim attempt (which would
otherwise use SYNC_LIGHT):

/*
 * Looks like reclaim/compaction is worth trying, but
 * sync compaction could be very expensive, so keep
 * using async compaction.
 */
compact_priority = INIT_COMPACT_PRIORITY;

This doesn't affect current madvised THP allocation which doesn't use
__GFP_NORETRY, but could explain why you saw no benefit from changing it
to __GFP_NORETRY.
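That priority selection can be condensed into a sketch (a simplified model of the __alloc_pages_slowpath() logic quoted above, not the exact kernel code, which carries more state):

```c
#include <assert.h>
#include <stdbool.h>

/* Compaction priorities as in the kernel: lower value = higher priority. */
enum compact_priority {
	COMPACT_PRIO_SYNC_FULL,
	COMPACT_PRIO_SYNC_LIGHT,
	COMPACT_PRIO_ASYNC,
};
#define INIT_COMPACT_PRIORITY	COMPACT_PRIO_ASYNC
#define DEF_COMPACT_PRIORITY	COMPACT_PRIO_SYNC_LIGHT

/* After the first (async) compaction attempt fails and reclaim runs,
 * pick the priority for the second attempt.  With __GFP_NORETRY the
 * allocator keeps using async compaction rather than escalating to
 * SYNC_LIGHT, which is the point being made above. */
static enum compact_priority second_attempt_priority(bool gfp_noretry)
{
	if (gfp_noretry) {
		/*
		 * Looks like reclaim/compaction is worth trying, but
		 * sync compaction could be very expensive, so keep
		 * using async compaction.
		 */
		return INIT_COMPACT_PRIORITY;
	}
	return DEF_COMPACT_PRIORITY;	/* escalate to SYNC_LIGHT */
}
```

This is why a madvised THP allocation switched to __GFP_NORETRY can show no measurable change: both of its compaction attempts stay async.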


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-09 Thread Andrea Arcangeli
On Tue, Oct 09, 2018 at 04:25:10PM +0200, Michal Hocko wrote:
> On Tue 09-10-18 14:00:34, Mel Gorman wrote:
> > On Tue, Oct 09, 2018 at 02:27:45PM +0200, Michal Hocko wrote:
> > > [Sorry for being slow in responding but I was mostly offline last few
> > >  days]
> > > 
> > > On Tue 09-10-18 10:48:25, Mel Gorman wrote:
> > > [...]
> > > > This goes back to my point that the MADV_HUGEPAGE hint should not make
> > > > promises about locality and that introducing MADV_LOCAL for specialised
> > > > libraries may be more appropriate with the initial semantic being how it
> > > > treats MADV_HUGEPAGE regions.
> > > 
> > > I agree with your other points and not going to repeat them. I am not
> sure madvise is the best API for the purpose though. We are talking about
> > > memory policy here and there is an existing api for that so I would
> > > _prefer_ to reuse it for this purpose.
> > > 
> > 
> > I flip-flopped on that one in my head multiple times on the basis of
> > how strict it should be. Memory policies tend to be black or white --
> > bind here, interleave there, etc. It wasn't clear to me what the best
> > policy would be to describe "allocate local as best as you can but allow
> > fallbacks if necessary".

MPOL_PREFERRED is not black and white. In fact I asked David earlier
if MPOL_PREFERRED could check if it would already be a good fit for
this. Still the point is it requires privilege (and for a good
reason).

> I was thinking about MPOL_NODE_PROXIMITY with the following semantic:
> - try hard to allocate from a local or very close numa node(s) even when
> that requires expensive operations like the memory reclaim/compaction
> before falling back to other more distant numa nodes.

If MPOL_PREFERRED can't work something like this could be added.

I think "madvise vs mbind" is more an issue of "no-permission vs
permission" required. And if the processes ends up swapping out all
other process with their memory already allocated in the node, I think
some permission is correct to be required, in which case an mbind
looks a better fit. MPOL_PREFERRED also looks a first candidate for
investigation as it's already not black and white and allows spillover
and may already do the right thing in fact if set on top of
MADV_HUGEPAGE.

Thanks,
Andrea


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-09 Thread Andrea Arcangeli
On Tue, Oct 09, 2018 at 03:17:30PM -0700, David Rientjes wrote:
> causes workloads to severely regress both in fault and access latency when 
> we know that direct reclaim is unlikely to make direct compaction free an 
> entire pageblock.  It's more likely than not that the reclaim was 
> pointless and the allocation will still fail.

How do you know that? If all RAM is full of filesystem cache, but it's
not heavily fragmented by slab or other unmovable objects, compaction
will succeed every single time after reclaim frees 2M of cache like
it's asked to do.

reclaim succeeds every time, compaction then succeeds every time.

Not doing reclaim after COMPACT_SKIPPED is returned simply makes
compaction unable to compact memory once all nodes are filled by
filesystem cache.

Certainly it's better not to invoke reclaim at all when __GFP_THISNODE
is set than to swap out heavily over the local node. Doing so however
has the drawback of reducing the direct compaction effectiveness. I
don't think it's true that reclaim is generally "pointless", it's just
that invoking any reclaim backfired so badly when __GFP_THISNODE was set
that anything else (including weakening compaction effectiveness) was
better.

> If memory compaction were patched such that it can report that it could 
> successfully free a page of the specified order if there were free pages 
> at the end of the zone it could migrate to, reclaim might be helpful.  But 
> with the current implementation, I don't think that is reliably possible.  
> These free pages could easily be skipped over by the migration scanner 
> because of the presence of slab pages, for example, and unavailable to the 
> freeing scanner.

Yes there's one case where reclaim is "pointless", but it happens once
and then COMPACT_DEFERRED is returned and __GFP_NORETRY will skip
reclaim then.

So you're right when we hit fragmentation there's one and only one
"pointless" reclaim invocation. And immediately after we also
exponentially backoff on the compaction invocations with the
compaction deferred logic.

We could try to optimize away such a "pointless" reclaim event for sure,
but it's probably an optimization that may just get lost in the noise
and may not be measurable, because it only happens once when the first
full fragmentation is encountered.
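The exponential backoff referred to above can be modeled on the defer logic in mm/compaction.c (simplified here: a single zone and no order tracking):

```c
#include <assert.h>
#include <stdbool.h>

#define COMPACT_MAX_DEFER_SHIFT 6

/* Simplified per-zone defer state, modeled on mm/compaction.c. */
struct zone_defer {
	unsigned int considered;	/* attempts skipped since last failure */
	unsigned int defer_shift;	/* backoff exponent */
};

/* Called when compaction fails: reset the counter and double the number
 * of future attempts that will be skipped, capped at 1 << 6 = 64. */
static void defer_compaction(struct zone_defer *z)
{
	z->considered = 0;
	if (++z->defer_shift > COMPACT_MAX_DEFER_SHIFT)
		z->defer_shift = COMPACT_MAX_DEFER_SHIFT;
}

/* Returns true if this compaction attempt should be skipped. */
static bool compaction_deferred(struct zone_defer *z)
{
	unsigned int defer_limit = 1U << z->defer_shift;

	if (++z->considered >= defer_limit) {
		z->considered = defer_limit;
		return false;	/* limit reached: try compaction again */
	}
	return true;
}
```

So after each failure the number of skipped attempts roughly doubles, which is why the one "pointless" reclaim per failure tends to get lost in the noise.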

> I really think that for this patch to be merged over my proposed change 
> that it needs to be clearly demonstrated that reclaim was successful in 
> that it freed memory that was subsequently available to the compaction 
> freeing scanner and that enabled entire pageblocks to become free.  That, 
> in my experience, will very very seldom be successful because of internal 
> slab fragmentation, compaction_alloc() cannot soak up pages from the 
> reclaimed memory, and potentially thrash the zone completely pointlessly.  
> The last point is the problem being reported here, but the other two are 
> as legitimate.

I think the demonstration can already be inferred, because if we hit full
memory fragmentation after every reclaim, __GFP_NORETRY would have
solved the "pathological THP allocation behavior" without requiring
your change to __GFP_NORETRY that makes it behave like
__GFP_COMPACT_ONLY for order == HPAGE_PMD_ORDER.

Anyway you can add a few statistic counters and verify in a more
accurate way how often a COMPACT_SKIPPED + reclaim cycle is followed
by a COMPACT_DEFERRED.

> I'd appreciate if Andrea can test this patch, have a rebuttal that we 
> should still remove __GFP_THISNODE because we don't care about locality as 
> much as forming a hugepage, we can make that change, and then merge this 
> instead of causing such massive fault and access latencies.

I can certainly test, but from source review I'm already convinced
it'll solve fine the "pathological THP allocation behavior", no
argument about that. It's certainly better and more correct your patch
than the current upstream (no security issues with lack of permissions
for __GFP_THISNODE anymore either).

I expect your patch will run 100% equivalent to __GFP_COMPACT_ONLY
alternative I posted, for our testcase that hit into the "pathological
THP allocation behavior".

Your patch encodes __GFP_COMPACT_ONLY into the __GFP_NORETRY semantics
and hardcodes the __GFP_COMPACT_ONLY for all orders >= HPAGE_PMD_SIZE
no matter who the caller is.

As opposed I let the caller choose and left __GFP_NORETRY semantics
alone and orthogonal to the __GFP_COMPACT_ONLY semantics. I think
letting the caller decide instead of hardcoding it for order 9 is
better, because __GFP_COMPACT_ONLY made sense to be set only if
__GFP_THISNODE was also set by the caller.

If a driver does an order 9 allocation with __GFP_THISNODE not set,
your patch will prevent it from allocating remote THP if all remote nodes
are full of cache (which is a reasonable common assumption as more THP
are allocated over time eating in all free memory). My patch didn't
alter that so I tend to prefer the __GFP_COMPACT_ONLY approach to the
hardcoding 


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-09 Thread Andrea Arcangeli
On Mon, Oct 08, 2018 at 01:41:09PM -0700, David Rientjes wrote:
> The page allocator is expecting __GFP_NORETRY for thp allocations per its 
> comment:
> 
>   /*
>* Checks for costly allocations with __GFP_NORETRY, which
>* includes THP page fault allocations
>*/
>   if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> 
> And that enables us to check compact_result to determine whether thp 
> allocation should fail or continue to reclaim.  I don't think it helps 
> that some thp fault allocations use __GFP_NORETRY and others do not.  I 
> think that deserves a fix to alloc_hugepage_direct_gfpmask() or 
> GFP_TRANSHUGE_LIGHT.

With your patch that changes how __GFP_NORETRY works for order >=
pageblock_order then this can help. However it solves nothing for
large skbs and all other "costly_order" allocations. So if you think
it's so bad to have a 40% increased allocation latency if all nodes
are heavily fragmented for THP under MADV_HUGEPAGE (which are
guaranteed to be long lived allocations or they wouldn't set
MADV_HUGEPAGE in the first place), I'm not sure why you ignore the
same exact overhead for all other "costly_order" allocations that
would never set __GFP_THISNODE and will not even use
GFP_TRANSHUGE_LIGHT so they may not set __GFP_NORETRY at all.

I think it would be better to view the problem from a generic
"costly_order" allocation perspective without so much "hardcoded"
focus on THP MADV_HUGEPAGE only (which in fact are the least concern
of all in terms of that 40% latency increase if all nodes are totally
fragmented and compaction fails on all nodes, because they're long
lived so the impact of the increased latency is more likely to be lost
in the noise).

> Our library that uses MADV_HUGEPAGE only invokes direct compaction because 
> we're on an older kernel: it does not attempt to do reclaim to make 
> compaction happy so that it finds memory that it can migrate memory to.  
> For reference, we use defrag setting of "defer+madvise".  Everybody who 
> does not use MADV_HUGEPAGE kicks off kswapd/kcompactd and fails, 
> MADV_HUGEPAGE users do the same but also try direct compaction.  That 
> direct compaction uses a mode of MIGRATE_ASYNC so it normally fails 
> because of need_resched() or spinlock contention.
> These allocations always fail, MADV_HUGEPAGE or otherwise, without 
> invoking direct reclaim.

Even older kernels (before "defer+madvise" option was available)
always invoked reclaim if COMPACT_SKIPPED was returned. The
COMPACT_SKIPPED behavior I referred that explains why __GFP_NORETRY
doesn't prevent reclaim to be invoked, is nothing new, it always
worked that way from the day compaction was introduced.

So it's fairly strange that your kernel doesn't call reclaim at all if
COMPACT_SKIPPED is returned.

> I am agreeing with both you and Mel that it makes no sense to thrash the 
> local node to make compaction happy and then hugepage-order memory 
> available.  I'm only differing with you on the mechanism to fail early: we 
> never want to attempt reclaim on thp allocations specifically because 
> it leads to the behavior you are addressing.

Not sure I can agree with the above.

If all nodes (not only the local one) are already below the watermarks
and compaction returns COMPACT_SKIPPED, there is zero global free
memory available, so we would need to swap anyway to succeed the 2M
allocation. So it's better to reclaim 2M from one node and then retry
compaction again on the same node if compaction is failing on the node
because of COMPACT_SKIPPED and the global free memory is zero. If it
swaps out, it would have swapped out anyway, but this way a THP will be
returned.

It's just __GFP_THISNODE that broke the above logic. __GFP_THISNODE
may just be safe to use only if the global amount of free memory (in
all nodes) is zero (or below HPAGE_PMD_SIZE generally speaking).
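That safety condition could be written as a tiny predicate. This is purely illustrative, there is no such check in the kernel today, and the name is mine:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical check: __GFP_THISNODE on a THP allocation only avoids
 * penalizing other workloads when falling back to a remote node could
 * not have succeeded anyway, i.e. when the global free memory across
 * all nodes is already below one huge page. */
static bool thisnode_thp_is_safe(unsigned long global_free_bytes,
				 unsigned long hpage_bytes)
{
	return global_free_bytes < hpage_bytes;
}
```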

> My contention is that removing __GFP_THISNODE papers over the problem, 
> especially in cases where remote memory is also fragmented. It incurs a 
> much higher (40%) fault latency and then incurs 13.9% greater access 
> latency.  It is not a good result, at least for Haswell, Naples, and Rome.
> 
> To make a case that we should fault hugepages remotely as fallback, either 
> for non-MADV_HUGEPAGE users who do not use direct compaction, or 
> MADV_HUGEPAGE users who use direct compaction, we need numbers that 
> suggest there is a performance benefit in terms of access latency to 
> suggest that it is better; this is especially the case when the fault 
> latency is 40% higher.  On Haswell, Naples, and Rome, it is quite obvious 
> that this patch works much harder to fault memory remotely that incurs a 
> substantial performance penalty when it fails and in the cases where it 
> succeeds we have much worse access latency.

What we're fixing is effectively a breakage and insecure behavior
too and that must be fixed first. A heavily multithreaded processes


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-09 Thread David Rientjes
On Tue, 9 Oct 2018, Mel Gorman wrote:

> > The page allocator is expecting __GFP_NORETRY for thp allocations per its 
> > comment:
> > 
> > /*
> >  * Checks for costly allocations with __GFP_NORETRY, which
> >  * includes THP page fault allocations
> >  */
> > if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> > 
> > And that enables us to check compact_result to determine whether thp 
> > allocation should fail or continue to reclaim.  I don't think it helps 
> > that some thp fault allocations use __GFP_NORETRY and others do not.  I 
> > think that deserves a fix to alloc_hugepage_direct_gfpmask() or 
> > GFP_TRANSHUGE_LIGHT.
> > 
> 
> I am concerned that we may be trying to deal with this in terms of right
> and wrong when the range of workloads and their preferred semantics
> force us into a grey area.
> 
> The THP user without __GFP_NORETRY is "always"
> 
> always
> means that an application requesting THP will stall on
> allocation failure and directly reclaim pages and compact
> memory in an effort to allocate a THP immediately. This may be
> desirable for virtual machines that benefit heavily from THP
> use and are willing to delay the VM start to utilise them.
> 

I'm not sure what you mean when you say the thp user without __GFP_NORETRY 
is "always".  If you're referring to the defrag mode "always", 
__GFP_NORETRY is only possible for non-madvised memory regions.  All other 
defrag modes exclude it, and I argue they incorrectly exclude it.

I believe the userspace expectation, whether it is described well or not, 
is that defrag mode "always" simply gets the same behavior as defrag mode 
"madvise" but for everybody, not just MADV_HUGEPAGE regions.  Then, 
"defer+madvise" triggers kswapd/kcompactd but compacts for MADV_HUGEPAGE, 
"defer" triggers kswapd/kcompactd and fails, and "never" fails.  This, to 
me, seems like the most sane heuristic.

Whether reclaim is helpful or not is what I'm questioning, I don't think 
that it benefits any user to have strict adherence to a documentation that 
causes workloads to severely regress both in fault and access latency when 
we know that direct reclaim is unlikely to make direct compaction free an 
entire pageblock.  It's more likely than not that the reclaim was 
pointless and the allocation will still fail.

If memory compaction were patched such that it can report that it could 
successfully free a page of the specified order if there were free pages 
at the end of the zone it could migrate to, reclaim might be helpful.  But 
with the current implementation, I don't think that is reliably possible.  
These free pages could easily be skipped over by the migration scanner 
because of the presence of slab pages, for example, and unavailable to the 
freeing scanner.

> Removing __GFP_NORETRY in this instance is due to the tuning hinting that
> direct reclaim is perfectly acceptable. __GFP_NORETRY means that the kernel
> will not compact memory after a recent failure but if the caller is willing
> to reclaim memory, then it follows that retrying compaction is justified.
> 
> Is this the correct thing to do in 100% of cases? Probably not, there
> will be corner cases but it also does not necessarily follow that
> __GFP_NORETRY should be universal.
> 

For reclaim to be useful, I believe we need a major rewrite of memory 
compaction such that the freeing scanner is eliminated and it can benefit 
from memory passed over by the migration scanner.  In this case, it would 
require the freeing scanner to do the actual reclaim itself.  I don't 
believe it is beneficial to any user that we reclaim memory from the zone 
when we don't know (1) that the migration scanner will eventually be able 
to free an entire pageblock, (2) the reclaimed memory can be accessed by 
the freeing scanner as a migration target, and (3) all this synchronous 
work was not done on behalf of a user only to lose the hugepage to a 
concurrent allocator that is not even madvised.

Since we lack all of this in the allocator/compaction/reclaim, it seems 
that the painful fault latency here can be corrected by doing what is 
actually useful, memory compaction, and relying on its heuristics of when 
to give up and when to proceed rather than thrashing the zone.  The insane fault 
latency being reported here is certainly not what the user is asking for 
when it does MADV_HUGEPAGE, and removing __GFP_THISNODE doesn't help in 
any way if remote memory is fragmented or low on memory as well.

> What is missing here is an agreed upon set of reference test cases
> that can be used for the basis of evaluating patches like this. They
> should be somewhat representative of the target applications of virtual
> memory initialisation (forget about runtime at the moment as that is
> stacking problems), a simulator of the google workload and library and
> my test case of simply referencing an amount of memory larger 

Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-09 Thread Mel Gorman
On Tue, Oct 09, 2018 at 04:25:10PM +0200, Michal Hocko wrote:
> On Tue 09-10-18 14:00:34, Mel Gorman wrote:
> > On Tue, Oct 09, 2018 at 02:27:45PM +0200, Michal Hocko wrote:
> > > [Sorry for being slow in responding but I was mostly offline last few
> > >  days]
> > > 
> > > On Tue 09-10-18 10:48:25, Mel Gorman wrote:
> > > [...]
> > > > This goes back to my point that the MADV_HUGEPAGE hint should not make
> > > > promises about locality and that introducing MADV_LOCAL for specialised
> > > > libraries may be more appropriate with the initial semantic being how it
> > > > treats MADV_HUGEPAGE regions.
> > > 
> > > I agree with your other points and am not going to repeat them. I am not
> > sure madvise is the best API for the purpose though. We are talking about
> > > memory policy here and there is an existing api for that so I would
> > > _prefer_ to reuse it for this purpose.
> > > 
> > 
> > I flip-flopped on that one in my head multiple times on the basis of
> > how strict it should be. Memory policies tend to be black or white --
> > bind here, interleave there, etc. It wasn't clear to me what the best
> > policy would be to describe "allocate local as best as you can but allow
> > fallbacks if necessary".
> 
> I was thinking about MPOL_NODE_PROXIMITY with the following semantic:
> - try hard to allocate from a local or very close numa node(s) even when
> that requires expensive operations like the memory reclaim/compaction
> before falling back to other more distant numa nodes.
> 

Seems reasonable. It's not far from the general semantics I thought
MADV_LOCAL would have.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-09 Thread Michal Hocko
On Tue 09-10-18 14:00:34, Mel Gorman wrote:
> On Tue, Oct 09, 2018 at 02:27:45PM +0200, Michal Hocko wrote:
> > [Sorry for being slow in responding but I was mostly offline last few
> >  days]
> > 
> > On Tue 09-10-18 10:48:25, Mel Gorman wrote:
> > [...]
> > > This goes back to my point that the MADV_HUGEPAGE hint should not make
> > > promises about locality and that introducing MADV_LOCAL for specialised
> > > libraries may be more appropriate with the initial semantic being how it
> > > treats MADV_HUGEPAGE regions.
> > 
> > I agree with your other points and am not going to repeat them. I am not
> > sure madvise is the best API for the purpose though. We are talking about
> > memory policy here and there is an existing api for that so I would
> > _prefer_ to reuse it for this purpose.
> > 
> 
> I flip-flopped on that one in my head multiple times on the basis of
> how strict it should be. Memory policies tend to be black or white --
> bind here, interleave there, etc. It wasn't clear to me what the best
> policy would be to describe "allocate local as best as you can but allow
> fallbacks if necessary".

I was thinking about MPOL_NODE_PROXIMITY with the following semantic:
- try hard to allocate from a local or very close numa node(s) even when
that requires expensive operations like the memory reclaim/compaction
before falling back to other more distant numa nodes.


> Hence, I started leaning towards advise as it is
> really about advice that the kernel can ignore if necessary. That said,
> I don't feel as strongly about the "how" as I do about the fact that
> applications and libraries should not depend on side-effects of the
> MADV_HUGEPAGE implementation that relate to locality.

completely agreed on this.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-09 Thread Vlastimil Babka
On 10/8/18 10:41 PM, David Rientjes wrote:
> + /*
> +  * If faulting a hugepage, it is very unlikely that
> +  * thrashing the zonelist is going to assist compaction
> +  * in freeing an entire pageblock.  There are no
> +  * guarantees memory compaction can free an entire
> +  * pageblock under such memory pressure that it is
> +  * better to simply fail and fallback to native pages.
> +  */
> + if (order == pageblock_order &&
> + !(current->flags & PF_KTHREAD))
> + goto nopage;

After we got rid of similar hardcoded heuristics, I would be very
unhappy to start adding them back. A new gfp flag is also unfortunate,
but more acceptable to me.

> +
>   /*
>* Looks like reclaim/compaction is worth trying, but
>* sync compaction could be very expensive, so keep
> 



Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-09 Thread Mel Gorman
On Tue, Oct 09, 2018 at 02:27:45PM +0200, Michal Hocko wrote:
> [Sorry for being slow in responding but I was mostly offline last few
>  days]
> 
> On Tue 09-10-18 10:48:25, Mel Gorman wrote:
> [...]
> > This goes back to my point that the MADV_HUGEPAGE hint should not make
> > promises about locality and that introducing MADV_LOCAL for specialised
> > libraries may be more appropriate with the initial semantic being how it
> > treats MADV_HUGEPAGE regions.
> 
> I agree with your other points and am not going to repeat them. I am not
> sure madvise is the best API for the purpose though. We are talking about
> memory policy here and there is an existing api for that so I would
> _prefer_ to reuse it for this purpose.
> 

I flip-flopped on that one in my head multiple times on the basis of
how strict it should be. Memory policies tend to be black or white --
bind here, interleave there, etc. It wasn't clear to me what the best
policy would be to describe "allocate local as best as you can but allow
fallbacks if necessary". Hence, I started leaning towards advise as it is
really about advice that the kernel can ignore if necessary. That said,
I don't feel as strongly about the "how" as I do about the fact that
applications and libraries should not depend on side-effects of the
MADV_HUGEPAGE implementation that relate to locality.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-09 Thread Michal Hocko
[Sorry for being slow in responding but I was mostly offline last few
 days]

On Tue 09-10-18 10:48:25, Mel Gorman wrote:
[...]
> This goes back to my point that the MADV_HUGEPAGE hint should not make
> promises about locality and that introducing MADV_LOCAL for specialised
> libraries may be more appropriate with the initial semantic being how it
> treats MADV_HUGEPAGE regions.

I agree with your other points and am not going to repeat them. I am not
sure madvise is the best API for the purpose though. We are talking about
memory policy here and there is an existing api for that so I would
_prefer_ to reuse it for this purpose.

Sure we will likely need something in compaction as well but we
should start simple and go forward in smaller steps.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-09 Thread Mel Gorman
On Mon, Oct 08, 2018 at 01:41:09PM -0700, David Rientjes wrote:
> On Fri, 5 Oct 2018, Andrea Arcangeli wrote:
> 
> > I tried to add just __GFP_NORETRY but it changes nothing. Try it
> > yourself if you think that can resolve the swap storm and excessive
> > reclaim CPU overhead... and see if it works. I didn't intend to
> > reinvent the wheel with __GFP_COMPACT_ONLY, if __GFP_NORETRY would
> > have worked. I tried adding __GFP_NORETRY first of course.
> > 
> > Reason why it doesn't help is: compaction fails because not enough
> > free RAM, reclaim is invoked, compaction succeeds, THP is allocated to
> > your lib user, compaction fails because not enough free RAM, reclaim
> > is invoked etc.. compact_result is not COMPACT_DEFERRED, but
> > COMPACT_SKIPPED.
> > 
> > See the part "reclaim is invoked" (with __GFP_THISNODE), is enough to
> > still create the same heavy swap storm and unfairly penalize all apps
> > with memory allocated in the local node like if your library had
> > actually the kernel privilege to run mbind or mlock, which is not ok.
> > 
> > Only __GFP_COMPACT_ONLY truly can avoid reclaim, the moment reclaim
> > can run with __GFP_THISNODE set, all bets are off and we're back to
> > square one, no difference (at best marginal difference) with
> > __GFP_NORETRY being set.
> > 
> 
> The page allocator is expecting __GFP_NORETRY for thp allocations per its 
> comment:
> 
>   /*
>* Checks for costly allocations with __GFP_NORETRY, which
>* includes THP page fault allocations
>*/
>   if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> 
> And that enables us to check compact_result to determine whether thp 
> allocation should fail or continue to reclaim.  I don't think it helps 
> that some thp fault allocations use __GFP_NORETRY and others do not.  I 
> think that deserves a fix to alloc_hugepage_direct_gfpmask() or 
> GFP_TRANSHUGE_LIGHT.
> 

I am concerned that we may be trying to deal with this in terms of right
and wrong when the range of workloads and their preferred semantics
force us into a grey area.

The THP user without __GFP_NORETRY is "always":

	always
		means that an application requesting THP will stall on
		allocation failure and directly reclaim pages and compact
		memory in an effort to allocate a THP immediately. This may be
		desirable for virtual machines that benefit heavily from THP
		use and are willing to delay the VM start to utilise them.
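
For reference, the mode being quoted is selected through the THP defrag
sysfs knob (path as on mainline kernels of this era; writing it requires
root):

```shell
# Show the available modes; the bracketed one is active.
cat /sys/kernel/mm/transparent_hugepage/defrag

# Select the "always" behaviour described above (requires root):
echo always > /sys/kernel/mm/transparent_hugepage/defrag
```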

Removing __GFP_NORETRY in this instance is due to the tuning hinting that
direct reclaim is perfectly acceptable. __GFP_NORETRY means that the kernel
will not compact memory after a recent failure but if the caller is willing
to reclaim memory, then it follows that retrying compaction is justified.

Is this the correct thing to do in 100% of cases? Probably not, there
will be corner cases but it also does not necessarily follow that
__GFP_NORETRY should be universal.

What is missing here is an agreed upon set of reference test cases
that can be used for the basis of evaluating patches like this. They
should be somewhat representative of the target applications of virtual
memory initialisation (forget about runtime at the moment as that is
stacking problems), a simulator of the google workload and library and
my test case of simply referencing an amount of memory larger than one
node. That would cover the current discussion at least but more would be
needed later. Otherwise we're going to endlessly whack-a-mole fixing one
workload and hurting another. It might be overkill but otherwise this
discussion risks going in circles.

The previous reference cases were ones that focused on either THP
allocation success rates or the benefit of THP itself, neither of which
are particularly useful in the current context.

> Our library that uses MADV_HUGEPAGE only invokes direct compaction because 
> we're on an older kernel: it does not attempt to do reclaim to make 
> compaction happy so that it finds memory that it can migrate to.  
> For reference, we use defrag setting of "defer+madvise".  Everybody who 
> does not use MADV_HUGEPAGE kicks off kswapd/kcompactd and fails, 
> MADV_HUGEPAGE users do the same but also try direct compaction.  That 
> direct compaction uses a mode of MIGRATE_ASYNC so it normally fails 
> because of need_resched() or spinlock contention.
> 
> These allocations always fail, MADV_HUGEPAGE or otherwise, without 
> invoking direct reclaim.
> 
> I am agreeing with both you and Mel that it makes no sense to thrash the 
> local node to make compaction happy and then hugepage-order memory 
> available.  I'm only differing with you on the mechanism to fail early: we 
> never want to attempt reclaim on thp allocations specifically because 
> it leads to the behavior you are addressing.
> 
> My contention is that removing __GFP_THISNODE papers over the problem, 
> especially in cases where remote memory is also 

Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-08 Thread David Rientjes
On Fri, 5 Oct 2018, Andrea Arcangeli wrote:

> I tried to add just __GFP_NORETRY but it changes nothing. Try it
> yourself if you think that can resolve the swap storm and excessive
> reclaim CPU overhead... and see if it works. I didn't intend to
> reinvent the wheel with __GFP_COMPACT_ONLY, if __GFP_NORETRY would
> have worked. I tried adding __GFP_NORETRY first of course.
> 
> Reason why it doesn't help is: compaction fails because not enough
> free RAM, reclaim is invoked, compaction succeeds, THP is allocated to
> your lib user, compaction fails because not enough free RAM, reclaim
> is invoked etc.. compact_result is not COMPACT_DEFERRED, but
> COMPACT_SKIPPED.
> 
> See the part "reclaim is invoked" (with __GFP_THISNODE), is enough to
> still create the same heavy swap storm and unfairly penalize all apps
> with memory allocated in the local node like if your library had
> actually the kernel privilege to run mbind or mlock, which is not ok.
> 
> Only __GFP_COMPACT_ONLY truly can avoid reclaim, the moment reclaim
> can run with __GFP_THISNODE set, all bets are off and we're back to
> square one, no difference (at best marginal difference) with
> __GFP_NORETRY being set.
> 

The page allocator is expecting __GFP_NORETRY for thp allocations per its 
comment:

/*
 * Checks for costly allocations with __GFP_NORETRY, which
 * includes THP page fault allocations
 */
if (costly_order && (gfp_mask & __GFP_NORETRY)) {

And that enables us to check compact_result to determine whether thp 
allocation should fail or continue to reclaim.  I don't think it helps 
that some thp fault allocations use __GFP_NORETRY and others do not.  I 
think that deserves a fix to alloc_hugepage_direct_gfpmask() or 
GFP_TRANSHUGE_LIGHT.

Our library that uses MADV_HUGEPAGE only invokes direct compaction because 
we're on an older kernel: it does not attempt to do reclaim to make 
compaction happy so that it finds memory that it can migrate to.  
For reference, we use defrag setting of "defer+madvise".  Everybody who 
does not use MADV_HUGEPAGE kicks off kswapd/kcompactd and fails, 
MADV_HUGEPAGE users do the same but also try direct compaction.  That 
direct compaction uses a mode of MIGRATE_ASYNC so it normally fails 
because of need_resched() or spinlock contention.

These allocations always fail, MADV_HUGEPAGE or otherwise, without 
invoking direct reclaim.

I am agreeing with both you and Mel that it makes no sense to thrash the 
local node to make compaction happy and then hugepage-order memory 
available.  I'm only differing with you on the mechanism to fail early: we 
never want to attempt reclaim on thp allocations, specifically because 
it leads to the behavior you are addressing.

My contention is that removing __GFP_THISNODE papers over the problem, 
especially in cases where remote memory is also fragmented.  It incurs a 
much higher (40%) fault latency and a 13.9% greater access latency.  It 
is not a good result, at least for Haswell, Naples, and Rome.

To make a case that we should fault hugepages remotely as fallback, either 
for non-MADV_HUGEPAGE users who do not use direct compaction, or 
MADV_HUGEPAGE users who use direct compaction, we need numbers that 
suggest there is a performance benefit in terms of access latency to 
suggest that it is better; this is especially the case when the fault 
latency is 40% higher.  On Haswell, Naples, and Rome, it is quite obvious 
that this patch works much harder to fault memory remotely, which incurs a 
substantial performance penalty when it fails; in the cases where it 
succeeds we have much worse access latency.

For users who bind their applications to a subset of cores on a NUMA node, 
fallback is egregious: we are guaranteed to never have local access 
latency, and in the case where the local node is fragmented and remote 
memory is not, our fault latency goes through the roof when local native 
pages would have been much better.

I've brought numbers on how poor this patch performs, so I'm asking for a 
rebuttal that suggests it is better on some platforms.  (1) On what 
platforms is it better to fault remote hugepages over local native pages?  
(2) What is the fault latency increase when remote nodes are fragmented as 
well?

> Like Mel said, your app just happens to fit in a local node, if the
> user of the lib is slightly different and allocates 16G on a system
> where each node is 4G, the post-fix MADV_HUGEPAGE will perform
> extremely better also for the lib user.
> 

It won't if the remote memory is fragmented; the patch is premised on the 
argument that remote memory is never fragmented or under memory pressure, 
otherwise it multiplies the fault latency by the number of nodes.  
Sure, creating test cases where the local node is under heavy memory 
pressure yet remote nodes are mostly free will minimize the impact this 
has on fault latency.  It still is a 


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-05 Thread Andrea Arcangeli
Hello,

On Thu, Oct 04, 2018 at 04:05:26PM -0700, David Rientjes wrote:
> The source of the problem needs to be addressed: memory compaction.  We 
> regress because we lose __GFP_NORETRY and pointlessly try reclaim, but 

I commented in detail about the __GFP_NORETRY topic in the other email
so I will skip the discussion about __GFP_NORETRY in the context of
this answer except for the comment at the end of the email to the
actual code that implements __GFP_NORETRY.

> But that's a memory compaction issue, not a thp gfp mask issue; the 
> reclaim issue is responded to below.

Actually memory compaction has no issues whatsoever with
__GFP_THISNODE regardless of __GFP_NORETRY.

> This patch causes an even worse regression if all system memory is 
> fragmented such that thp cannot be allocated because it tries to stress 
> compaction on remote nodes, which ends up being unsuccessful, not just the 
> local node.
> 
> On Haswell, when all memory is fragmented (not just the local node as I 
> obtained by 13.9% regression result), the patch results in a fault latency 
> regression of 40.9% for MADV_HUGEPAGE region of 8GB.  This is because it 
> is thrashing both nodes pointlessly instead of just failing for 
> __GFP_THISNODE.

There's no I/O involved, at the very least on compaction, nor do we drop
any cache or shrink any slab by mistake just by invoking compaction.
Even when you hit the worst case "all nodes are 100% fragmented"
scenario that generates the 40% increased allocation latency, all
other tasks running in the local node will keep running fine, and they
won't be pushed away forcefully into swap with all their kernel cache
depleted, which is a mlock/mbind privileged behavior that the app
using the MADV_HUGEPAGE lib should never have been able to inflict on
other processes running in the node from different users (users as in
uid).

Furthermore, once you incur the worst case latency, the compact-deferred
logic skips compaction the next time around if all nodes were so
fragmented as to guarantee failure. By contrast, there's nothing stopping
reclaim from running every time COMPACT_SKIPPED is returned, just because
compaction keeps succeeding as reclaim keeps pushing more 2M amounts into
swap from the local node.

I don't doubt that with 1024 nodes things can get pretty bad when they're
all 100% fragmented, and __GFP_THISNODE would win in such a case, but
then what you're asking for is the __GFP_COMPACT_ONLY behavior. That
will solve it.

What we'd probably need, regardless of how we solve this bug (because
not all compaction invocations are THP invocations... and we can't
keep making special cases and optimizations tailored to THP or we end
up with that same 40% higher latency for large skbs and other stuff), is
a more sophisticated COMPACT_DEFERRED logic where you can track when
remote compaction failed. Then you wait many more times before trying
a global compaction. It could be achieved with just a compact_deferred
counter in the zone/pgdat (wherever it fits best).

Overall I don't think the bug we're dealing with and the slowdown of
compaction on the remote nodes are comparable, also considering the
latter will still happen regardless if you have large skbs or other
drivers allocating large amounts of memory as an optimization.

> So the end result is that the patch regresses access latency forever by 
> 13.9% when the local node is fragmented because it is accessing remote thp 
> vs local pages of the native page size, and regresses fault latency of 
> 40.9% when the system is fully fragmented.  The only time that fault 
> latency is improved is when remote memory is not fully fragmented, but 
> then you must incur the remote access latency.

You get THP, however, which will reduce the TLB miss cost and maximize
TLB usage, so it depends on the app whether that 13.9% cost is actually
offset by the THP benefit or not.

It entirely depends on whether a large part of the workload fits in the
in-socket CPU cache. The more the in-socket/node CPU cache pays off,
the more remote-THP also pays off. There would definitely be workloads
that would run faster, not slower, with the remote THP instead of
local PAGE_SIZEd memory. The benefit of THP is also larger for the
guest loads than for host loads, so it depends on that too.

We agree about the latency issue with a ton of RAM and thousands of
nodes, but again that can be mitigated with a NUMA-friendly
COMPACT_DEFERRED logic. Even without such a NUMA-aware
compact_deferred logic improvement, the worst case of the
remote compaction behavior still doesn't look nearly as bad as this
bug on reflection. And it is only a concern for extremely large
NUMA systems (which may run the risk of hitting other scalability
issues elsewhere if random workloads are applied to them and all
nodes are low on memory and fully fragmented, which is far from a
common scenario on those large systems), while the bug we fixed was
hurting badly all very common 2-node installs with workloads that 


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-05 Thread Andrea Arcangeli
Hi,

On Fri, Oct 05, 2018 at 01:35:15PM -0700, David Rientjes wrote:
> Why is it ever appropriate to do heavy reclaim and swap activity to 
> allocate a transparent hugepage?  This is exactly what the __GFP_NORETRY 
> check for high-order allocations is attempting to avoid, and it explicitly 
> states that it is for thp faults.  The fact that we lost __GFP_NORETRY for 
> thp allocations for all settings, including the default setting, other 
> than yours (setting of "always") is what I'm focusing on.  There is no 
> guarantee that this activity will free an entire pageblock or that it is 
> even worthwhile.

I tried to add just __GFP_NORETRY but it changes nothing. Try it
yourself if you think that can resolve the swap storm and excessive
reclaim CPU overhead... and see if it works. I didn't intend to
reinvent the wheel with __GFP_COMPACT_ONLY, if __GFP_NORETRY would
have worked. I tried adding __GFP_NORETRY first of course.

The reason why it doesn't help is: compaction fails because there's not
enough free RAM, reclaim is invoked, compaction succeeds, a THP is
allocated to your lib user, compaction fails again because there's not
enough free RAM, reclaim is invoked, etc. compact_result is not
COMPACT_DEFERRED, but COMPACT_SKIPPED.

See the part "reclaim is invoked": with __GFP_THISNODE it is enough to
still create the same heavy swap storm and unfairly penalize all apps
with memory allocated in the local node, as if your library actually
had the kernel privilege to run mbind or mlock, which is not ok.

Only __GFP_COMPACT_ONLY truly can avoid reclaim, the moment reclaim
can run with __GFP_THISNODE set, all bets are off and we're back to
square one, no difference (at best marginal difference) with
__GFP_NORETRY being set.

> That aside, removing __GFP_THISNODE can make the fault latency much worse 
> if remote nodes are fragmented and/or reclaim has the inability to free 
> contiguous memory, which it likely cannot.  This is where I measured over 
> 40% fault latency regression from Linus's tree with this patch on a 
> fragmented system where order-9 memory is available from neither node 0 
> nor node 1 on Haswell.

Discussing the drawbacks of removing __GFP_THISNODE is an orthogonal
topic. __GFP_COMPACT_ONLY approach didn't have any of those drawbacks
about the remote latency because __GFP_THISNODE was still set at all
times, just as you like it. You seem to think __GFP_NORETRY will work
as well as __GFP_COMPACT_ONLY but it doesn't.

Calling compaction (and only compaction!) with __GFP_THISNODE set
doesn't break anything and that was what __GFP_COMPACT_ONLY was about.

> The behavior that MADV_HUGEPAGE specifies is certainly not clearly 
> defined, unfortunately.  The way that an application writer may read it, 
> as we have, is that it will make a stronger attempt at allocating a 
> hugepage at fault.  This actually works quite well when the allocation 
> correctly has __GFP_NORETRY, as it's supposed to, and compaction is 
> MIGRATE_ASYNC.

Like Mel said, your app just happens to fit in a local node, if the
user of the lib is slightly different and allocates 16G on a system
where each node is 4G, the post-fix MADV_HUGEPAGE will perform
extremely better also for the lib user.

And you know, if the lib user fits in one node, it can use mbind and
it won't hit OOM... and you'd need some capability granting the app
privilege anyway to keep MADV_HUGEPAGE this aggressive and unfair to
the rest of the processes running in the local node (like mbind and
mlock require too).

Could you just run a test with the special lib and allocate 4 times
the size of a node, and see how the lib performs with upstream and
upstream+fix? Feel free to add __GFP_NORETRY anywhere you like in the
test of the upstream without fix.

The only constraint I would ask for the test (if the app using the lib
is not a massively multithreaded app, like qemu is, and you just
intend to run malloc(SIZEOFNODE*4); memset) is to run the app-lib
under "taskset -c 0". Otherwise NUMA balancing could move the task to the
CPU next to the last memory touched, which couldn't be done if each thread
accesses all ram at random from all 4 nodes at the same time (which is
a totally legitimate workload too and must not hit the "pathological
THP allocation performance").

> removed in a thp allocation.  I don't think anybody in this thread wants 
> 14% remote access latency regression if we allocate remotely or 40% fault 
> latency regression when remote nodes are fragmented as well.

Did you try the __GFP_COMPACT_ONLY patch? That won't have the 40%
fault latency already.

Also, you're underestimating a bit the benefit of THP given from remote
nodes for virt: the 40% fault latency is not an issue when the
allocation is long lived, which is what MADV_HUGEPAGE is telling the
kernel, and the benefit of THP for guests is multiplied. It's more a
feature than a bug to pay that 40% fault latency with MADV_HUGEPAGE set,
at least for all long lived allocations (but if the allocations aren't
long lived, why should MADV_HUGEPAGE 


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-05 Thread David Rientjes
On Fri, 5 Oct 2018, Mel Gorman wrote:

> > This causes, on average, a 13.9% access latency regression on Haswell, and 
> > the regression would likely be more severe on Naples and Rome.
> > 
> 
> That assumes that fragmentation prevents easy allocation which may very
> well be the case. While it would be great that compaction or the page
> allocator could be further improved to deal with fragmentation, it's
> outside the scope of this patch.
> 

Hi Mel,

The regression that Andrea is working on, correct me if I'm wrong, is 
heavy reclaim and swapping activity that is trying to desperately allocate 
local hugepages when the local node is fragmented based on advice provided 
by MADV_HUGEPAGE.

Why is it ever appropriate to do heavy reclaim and swap activity to 
allocate a transparent hugepage?  This is exactly what the __GFP_NORETRY 
check for high-order allocations is attempting to avoid, and it explicitly 
states that it is for thp faults.  The fact that we lost __GFP_NORETRY for 
thp allocations for all settings, including the default setting, other 
than yours (setting of "always") is what I'm focusing on.  There is no 
guarantee that this activity will free an entire pageblock or that it is 
even worthwhile.

Why is thp memory ever being allocated without __GFP_NORETRY as the page 
allocator expects?

That aside, removing __GFP_THISNODE can make the fault latency much worse 
if remote nodes are fragmented and/or reclaim has the inability to free 
contiguous memory, which it likely cannot.  This is where I measured over 
40% fault latency regression from Linus's tree with this patch on a 
fragmented system where order-9 memory is available from neither node 0 
nor node 1 on Haswell.

> > There exist libraries that allow the .text segment of processes to be 
> > remapped to memory backed by transparent hugepages and use MADV_HUGEPAGE 
> > to stress local compaction to defragment node local memory for hugepages 
> > at startup. 
> 
> That is taking advantage of a coincidence of the implementation.
> MADV_HUGEPAGE is *advice* that huge pages be used, not what the locality
> is. A hint for strong locality preferences should be separate advice
> (madvise) or a separate memory policy. Doing that is outside the context
> of this patch but nothing stops you introducing such a policy or madvise,
> whichever you think would be best for the libraries to consume (I'm only
> aware of libhugetlbfs but there might be others).
> 

The behavior that MADV_HUGEPAGE specifies is certainly not clearly 
defined, unfortunately.  The way that an application writer may read it, 
as we have, is that it will make a stronger attempt at allocating a 
hugepage at fault.  This actually works quite well when the allocation 
correctly has __GFP_NORETRY, as it's supposed to, and compaction is 
MIGRATE_ASYNC.

So rather than focusing on what MADV_HUGEPAGE has meant over the past 2+ 
years of kernels that we have implemented based on, or what it meant prior 
to that, the fundamental question is the purpose of direct reclaim and 
swap activity that had always been precluded before __GFP_NORETRY was 
removed in a thp allocation.  I don't think anybody in this thread wants 
14% remote access latency regression if we allocate remotely or 40% fault 
latency regression when remote nodes are fragmented as well.

Removing __GFP_THISNODE only helps when remote memory is not fragmented, 
otherwise it multiplies the problem as I've shown.

The numbers that you provide while using the non-default option to mimic 
MADV_HUGEPAGE mappings but also use __GFP_NORETRY makes the actual source 
of the problem quite easy to identify: there is an inconsistency in the 
thp gfp mask and the page allocator implementation.

> > The cost, including the statistics Mel gathered, is 
> > acceptable for these processes: they are not concerned with startup cost, 
> > they are concerned only with optimal access latency while they are 
> > running.
> > 
> 
> Then such applications at startup have the option of setting
> zone_reclaim_mode during initialisation assuming a privileged helper
> can be created. That would be somewhat heavy handed and a longer-term
> solution would still be to create a proper memory policy or madvise flag
> for those libraries.
> 

We *never* want to use zone_reclaim_mode for these allocations; that would 
be even worse.  We do not want to reclaim, because we have a very unlikely 
chance of freeing pageblocks without the involvement of compaction.  
We want to trigger memory compaction with the well-bounded cost that 
MIGRATE_ASYNC provides, and then fail.


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-05 Thread Mel Gorman
On Thu, Oct 04, 2018 at 01:16:32PM -0700, David Rientjes wrote:
> On Tue, 25 Sep 2018, Michal Hocko wrote:
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index da858f794eb6..149b6f4cf023 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2046,8 +2046,36 @@ alloc_pages_vma(gfp_t gfp, int order, struct 
> > vm_area_struct *vma,
> > nmask = policy_nodemask(gfp, pol);
> > if (!nmask || node_isset(hpage_node, *nmask)) {
> > mpol_cond_put(pol);
> > -   page = __alloc_pages_node(hpage_node,
> > -   gfp | __GFP_THISNODE, order);
> > +   /*
> > +* We cannot invoke reclaim if __GFP_THISNODE
> > +* is set. Invoking reclaim with
> > +* __GFP_THISNODE set, would cause THP
> > +* allocations to trigger heavy swapping
> > +* despite there may be tons of free memory
> > +* (including potentially plenty of THP
> > +* already available in the buddy) on all the
> > +* other NUMA nodes.
> > +*
> > +* At most we could invoke compaction when
> > +* __GFP_THISNODE is set (but we would need to
> > +* refrain from invoking reclaim even if
> > +* compaction returned COMPACT_SKIPPED because
> > +* there wasn't not enough memory to succeed
> > +* compaction). For now just avoid
> > +* __GFP_THISNODE instead of limiting the
> > +* allocation path to a strict and single
> > +* compaction invocation.
> > +*
> > +* Supposedly if direct reclaim was enabled by
> > +* the caller, the app prefers THP regardless
> > +* of the node it comes from so this would be
> > +* more desiderable behavior than only
> > +* providing THP originated from the local
> > +* node in such case.
> > +*/
> > +   if (!(gfp & __GFP_DIRECT_RECLAIM))
> > +   gfp |= __GFP_THISNODE;
> > +   page = __alloc_pages_node(hpage_node, gfp, order);
> > goto out;
> > }
> > }
> 
> This causes, on average, a 13.9% access latency regression on Haswell, and 
> the regression would likely be more severe on Naples and Rome.
> 

That assumes that fragmentation prevents easy allocation, which may very
well be the case. While it would be great if compaction or the page
allocator could be further improved to deal with fragmentation, that's
outside the scope of this patch.

> There exist libraries that allow the .text segment of processes to be 
> remapped to memory backed by transparent hugepages and use MADV_HUGEPAGE 
> to stress local compaction to defragment node local memory for hugepages 
> at startup. 

That is taking advantage of a co-incidence of the implementation.
MADV_HUGEPAGE is *advice* that huge pages be used, not what the locality
is. A hint for strong locality preferences should be separate advice
(madvise) or a separate memory policy. Doing that is outside the context
of this patch but nothing stops you introducing such a policy or madvise,
whichever you think would be best for the libraries to consume (I'm only
aware of libhugetlbfs but there might be others).

> The cost, including the statistics Mel gathered, is 
> acceptable for these processes: they are not concerned with startup cost, 
> they are concerned only with optimal access latency while they are 
> running.
> 

Then such applications at startup have the option of setting
zone_reclaim_mode during initialisation, assuming a privileged helper
can be created. That would be somewhat heavy handed, and a longer-term
solution would still be to create a proper memory policy or madvise flag
for those libraries.

> So while it may take longer to start the process because memory compaction 
> is attempting to allocate hugepages with __GFP_DIRECT_RECLAIM, in the 
> cases where compaction is successful, this is a very significant long-term 
> win.  In cases where compaction fails, falling back to local pages of the 
> native page size instead of remote thp is a win for the remaining time 
> this process wins: as stated, 13.9% faster for all memory accesses to the 
> process's text while it runs on Haswell.
> 

Again, I remind you that it only benefits applications that perfectly 
fit into NUMA nodes.  Not all applications are created with that level of 
awareness, and they easily get thrashed if they use MADV_HUGEPAGE and do 
not fit into a NUMA node.

While it is unfortunate that there are specialised applications that
benefit from the current configuration, I bet there is heavier usage of
qemu 

Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-04 Thread David Rientjes
On Thu, 4 Oct 2018, Andrea Arcangeli wrote:

> Hello David,
> 

Hi Andrea,

> On Thu, Oct 04, 2018 at 01:16:32PM -0700, David Rientjes wrote:
> > There are ways to address this without introducing regressions for 
> > existing users of MADV_HUGEPAGE: introduce an madvise() mode to accept 
> > remote thp allocations, which users of this library would never set, or 
> > fix memory compaction so that it does not incur substantial allocation 
> > latency when it will likely fail.
> 
> These libraries need to call a new MADV_ and the current
> MADV_HUGEPAGE should not be affected because the new MADV_ will
> require some capability (i.e. root privilege).
> 
> qemu was the first user of MADV_HUGEPAGE and I don't think it's fair
> to break it and require change to it to run at higher privilege to
> retain the direct compaction behavior of MADV_HUGEPAGE.
> 
> The new behavior you ask to retain in MADV_HUGEPAGE generated the
> same misbehavior in the VM that mlock could have, so it can't just
> be granted by default without any privilege whatsoever.
> 
> Ok you could mitigate the breakage that MADV_HUGEPAGE could have
> generated (before the recent fix) by isolating malicious or
> inefficient programs with memcg, but by default in a multiuser system
> without cgroups the global disruption provided before the fix
> (i.e. the pathological THP behavior) is not warranted. memcg shouldn't
> be mandatory to prevent a process from affecting the VM in such a strong
> way (i.e. all other processes that happened to be allocated in the node
> where the THP allocation triggered being thrashed in swap, as if the
> memory of all the other nodes was not completely free).
> 

The source of the problem needs to be addressed: memory compaction.  We 
regress because we lose __GFP_NORETRY and pointlessly try reclaim, but 
deferred compaction is supposedly going to prevent repeated (and 
unnecessary) calls to memory compaction that end up thrashing your local 
node.

This is likely because your workload has a size greater than 2MB * the 
deferred compaction threshold, normally set at 64.  This ends up 
repeatedly calling memory compaction, which becomes expensive when it 
should fail once and not be called again in the near term.

But that's a memory compaction issue, not a thp gfp mask issue; the 
reclaim issue is responded to below.

> Not only that: it's not only about malicious processes, it's also
> excessively inefficient for processes that just don't fit in a local
> node and use MADV_HUGEPAGE. Your processes all surely fit in the local
> node if they're happy about it. This was reported as a
> "pathological THP regression" after all in a workload that couldn't
> swap at all because of the iommu gup persistent refcount pins.
> 

This patch causes an even worse regression if all system memory is 
fragmented such that thp cannot be allocated, because it then tries to 
stress compaction on remote nodes, not just the local node, and that ends 
up unsuccessful.

On Haswell, when all memory is fragmented (not just the local node, as in 
the 13.9% regression result I obtained), the patch results in a fault 
latency regression of 40.9% for an 8GB MADV_HUGEPAGE region.  This is 
because it is thrashing both nodes pointlessly instead of just failing for 
__GFP_THISNODE.

So the end result is that the patch regresses access latency forever by 
13.9% when the local node is fragmented because it is accessing remote thp 
vs local pages of the native page size, and regresses fault latency of 
40.9% when the system is fully fragmented.  The only time that fault 
latency is improved is when remote memory is not fully fragmented, but 
then you must incur the remote access latency.

> Overall I think the call about the default behavior of MADV_HUGEPAGE
> is still between removing __GFP_THISNODE if gfp_flags can reclaim (the
> fix in -mm), or by changing direct compaction to only call compaction
> and not reclaim (i.e. __GFP_COMPACT_ONLY) when __GFP_THISNODE is set.
> 

There are two issues: the expensiveness of the page allocator involving 
compaction for MADV_HUGEPAGE mappings and the desire for userspace to 
fault thp remotely and incur the 13.9% performance regression forever.

If reclaim is avoided, as it should be with __GFP_NORETRY even for 
MADV_HUGEPAGE regions, you should only experience latency introduced by 
node local memory compaction.  The __GFP_NORETRY was removed by commit 
2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised 
allocations").  The current implementation of the page allocator does not 
match the expected behavior of the thp gfp flags.

Memory compaction has deferred compaction to avoid costly scanning when it 
has recently failed, and that likely needs to be addressed directly rather 
than relying on a count of how many times it has failed; if you fault more 
than 128MB at the same time, does it make sense to immediately compact 
again?  Likely not.

> To go beyond that some privilege is needed and a new 

Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-04 Thread Andrea Arcangeli
Hello David,

On Thu, Oct 04, 2018 at 01:16:32PM -0700, David Rientjes wrote:
> There are ways to address this without introducing regressions for 
> existing users of MADV_HUGEPAGE: introduce an madvise() mode to accept 
> remote thp allocations, which users of this library would never set, or 
> fix memory compaction so that it does not incur substantial allocation 
> latency when it will likely fail.

These libraries need to call a new MADV_ and the current
MADV_HUGEPAGE should not be affected because the new MADV_ will
require some capability (i.e. root privilege).

qemu was the first user of MADV_HUGEPAGE and I don't think it's fair
to break it and require change to it to run at higher privilege to
retain the direct compaction behavior of MADV_HUGEPAGE.

The new behavior you ask to retain in MADV_HUGEPAGE generated the
same misbehavior in the VM that mlock could have, so it can't just
be granted by default without any privilege whatsoever.

Ok you could mitigate the breakage that MADV_HUGEPAGE could have
generated (before the recent fix) by isolating malicious or
inefficient programs with memcg, but by default in a multiuser system
without cgroups the global disruption provided before the fix
(i.e. the pathological THP behavior) is not warranted. memcg shouldn't
be mandatory to prevent a process from affecting the VM in such a strong
way (i.e. all other processes that happened to be allocated in the node
where the THP allocation triggered being thrashed in swap, as if the
memory of all the other nodes was not completely free).

Not only that: it's not only about malicious processes, it's also
excessively inefficient for processes that just don't fit in a local
node and use MADV_HUGEPAGE. Your processes all surely fit in the local
node if they're happy about it. This was reported as a
"pathological THP regression" after all in a workload that couldn't
swap at all because of the iommu gup persistent refcount pins.

The alternative patch I posted, which still invoked direct reclaim
locally and fell back to NUMA-local PAGE_SIZEd allocations instead
of falling back to NUMA-remote THP, could have been given without
privilege, but it still won't swap out as this patch would have done,
so it still won't go as far as the MADV_HUGEPAGE behavior went before
the fix (for the lib users that strongly needed local memory).

Overall I think the call about the default behavior of MADV_HUGEPAGE
is still between removing __GFP_THISNODE if gfp_flags can reclaim (the
fix in -mm), or by changing direct compaction to only call compaction
and not reclaim (i.e. __GFP_COMPACT_ONLY) when __GFP_THISNODE is set.

To go beyond that some privilege is needed and a new MADV_ flag can
require privilege or return error if there's not enough privilege. So
the lib with 100s of users can try to use that new flag first, show an
error on stderr (maybe under debug), and fall back to MADV_HUGEPAGE if
the app doesn't have enough privilege. The alternative is to add a new mem
policy less strict than MPOL_BIND to achieve what you need on top of
MADV_HUGEPAGE (which also would require some privilege of course as
all mbinds). I assume you already evaluated the preferred and local
mbinds and it's not a perfect fit?

If we keep this as a new MADV_HUGEPAGE_FORCE_LOCAL flag, you could
still add a THP sysfs/sysctl control to lift the privilege requirement,
marking it as an insecure setting in the docs
(mm/transparent_hugepage/madv_hugepage_force_local=0|1, forced to 0 by
default). This would be along the same lines as other sysctls that
increase the max number of open files and such (perhaps a
sysctl would in fact be better, for tuning via /etc/sysctl.conf).

Note there was still some improvement left possible in my
__GFP_COMPACT_ONLY patch alternative. Notably, if the watermarks for
the local node showed that the local node didn't have enough truly "free"
PAGE_SIZEd pages for the local PAGE_SIZEd allocation to succeed after
compaction failed, we should have relaxed __GFP_THISNODE and tried to
allocate THP from the NUMA-remote nodes before falling back to
PAGE_SIZEd allocations. That also wouldn't require any new privilege.

To get the same behavior as before, though, you would need to drop all
caches with echo 3 >drop_caches before the app starts (and it still
won't swap anon memory which previous MADV_HUGEPAGE behavior would
have, but the whole point is that the previous MADV_HUGEPAGE behavior
would have backfired for the lib too if the app was allocating so much
RAM from the local node that it required swapouts of local anon
memory).

Thanks,
Andrea


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-04 Thread Andrea Arcangeli
Hello David,

On Thu, Oct 04, 2018 at 01:16:32PM -0700, David Rientjes wrote:
> There are ways to address this without introducing regressions for 
> existing users of MADV_HUGEPAGE: introduce an madvise() mode to accept 
> remote thp allocations, which users of this library would never set, or 
> fix memory compaction so that it does not incur substantial allocation 
> latency when it will likely fail.

These librarians needs to call a new MADV_ and the current
MADV_HUGEPAGE should not be affected because the new MADV_ will
require some capbility (i.e. root privilege).

qemu was the first user of MADV_HUGEPAGE and I don't think it's fair
to break it and require change to it to run at higher privilege to
retain the direct compaction behavior of MADV_HUGEPAGE.

The new behavior you ask to retain in MADV_HUGEPAGE, generated the
same misbehavior to VM as mlock could have done too, so it can't just
be given by default without any privilege whatsoever.

Ok you could mitigate the breakage that MADV_HUGEPAGE could have
generated (before the recent fix) by isolating malicious or
inefficient programs with memcg, but by default in a multiuser system
without cgroups the global disruption provided before the fix
(i.e. the pathological THP behavior) is not warranted. memcg shouldn't
be mandatory to avoid a process to affect the VM in such a strong way
(i.e. all other processes who happened to be allocated in the node
where the THP allocation triggered, being trashed in swap like if all
memory of all other nodes was not completely free).

Not only that, it's not only about malicious processes it's also
excessively inefficient for processes that just don't fit in a local
node and use MADV_HUGEPAGE. Your processes all fit in the local node
for sure if they're happy about it. This was reported as a
"pathological THP regression" after all in a workload that couldn't
swap at all because of the iommu gup persistent refcount pins.

The alternative patch I posted which still invoked direct reclaim
locally, and falled back to NUMA-local PAGE_SIZEd allocations instead
of falling back to NUMA-remote THP, could have been given without
privilege, but it still won't swapout as this patch would have done,
so it still won't go as far as the MADV_HUGEPAGE behavior had before
the fix (for the lib users that strongly needed local memory).

Overall I think the call about the default behavior of MADV_HUGEPAGE
is still between removing __GFP_THISNODE if gfp_flags can reclaim (the
fix in -mm), or by changing direct compaction to only call compaction
and not reclaim (i.e. __GFP_COMPACT_ONLY) when __GFP_THISNODE is set.

To go beyond that some privilege is needed, and a new MADV_ flag can
require privilege or return an error if there's not enough privilege.
So the lib with 100's of users can try the new flag first, print an
error to stderr (maybe only under debug), and fall back to
MADV_HUGEPAGE if the app hasn't enough privilege. The alternative is
to add a new mempolicy less strict than MPOL_BIND to achieve what you
need on top of MADV_HUGEPAGE (which would also require some privilege
of course, as all mbinds do). I assume you already evaluated the
preferred and local mbinds and they're not a perfect fit?

If we keep this as a new MADV_HUGEPAGE_FORCE_LOCAL flag, you could
still add a THP sysfs/sysctl control to lift the privilege
requirement, marking it as an insecure setting in the docs
(mm/transparent_hugepage/madv_hugepage_force_local=0|1, forced to 0 by
default). This would be along the same lines as other sysctls that
increase the max number of open files and such (perhaps a sysctl would
in fact be better, for tuning in /etc/sysctl.conf).

Note there was still some improvement possible in my
__GFP_COMPACT_ONLY patch alternative. Notably, if the local node's
watermarks showed it didn't have enough real "free" PAGE_SIZEd pages
for the local PAGE_SIZEd fallback allocation to succeed after
compaction failed, we should have relaxed __GFP_THISNODE and tried to
allocate THP from the NUMA-remote nodes before falling back to
PAGE_SIZEd allocations. That also wouldn't require any new privilege.

To get the same behavior as before though, you would need to drop all
caches with echo 3 >/proc/sys/vm/drop_caches before the app starts
(and it still won't swap anon memory as the previous MADV_HUGEPAGE
behavior would have, but the whole point is that the previous
MADV_HUGEPAGE behavior would have backfired for the lib too if the app
was allocating so much RAM from the local node that it required
swapouts of local anon memory).

Thanks,
Andrea


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-04 Thread David Rientjes
On Tue, 25 Sep 2018, Michal Hocko wrote:

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..149b6f4cf023 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2046,8 +2046,36 @@ alloc_pages_vma(gfp_t gfp, int order, struct 
> vm_area_struct *vma,
>   nmask = policy_nodemask(gfp, pol);
>   if (!nmask || node_isset(hpage_node, *nmask)) {
>   mpol_cond_put(pol);
> - page = __alloc_pages_node(hpage_node,
> - gfp | __GFP_THISNODE, order);
> + /*
> +  * We cannot invoke reclaim if __GFP_THISNODE
> +  * is set. Invoking reclaim with
> +  * __GFP_THISNODE set, would cause THP
> +  * allocations to trigger heavy swapping
> +  * despite there may be tons of free memory
> +  * (including potentially plenty of THP
> +  * already available in the buddy) on all the
> +  * other NUMA nodes.
> +  *
> +  * At most we could invoke compaction when
> +  * __GFP_THISNODE is set (but we would need to
> +  * refrain from invoking reclaim even if
> +  * compaction returned COMPACT_SKIPPED because
> +  * there wasn't enough memory to succeed
> +  * compaction). For now just avoid
> +  * __GFP_THISNODE instead of limiting the
> +  * allocation path to a strict and single
> +  * compaction invocation.
> +  *
> +  * Supposedly if direct reclaim was enabled by
> +  * the caller, the app prefers THP regardless
> +  * of the node it comes from so this would be
> +  * more desirable behavior than only
> +  * providing THP originated from the local
> +  * node in such case.
> +  */
> + if (!(gfp & __GFP_DIRECT_RECLAIM))
> + gfp |= __GFP_THISNODE;
> + page = __alloc_pages_node(hpage_node, gfp, order);
>   goto out;
>   }
>   }

This causes, on average, a 13.9% access latency regression on Haswell, and 
the regression would likely be more severe on Naples and Rome.

There exist libraries that allow the .text segment of processes to be 
remapped to memory backed by transparent hugepages and use MADV_HUGEPAGE 
to stress local compaction to defragment node local memory for hugepages 
at startup.  The cost, including the statistics Mel gathered, is 
acceptable for these processes: they are not concerned with startup cost, 
they are concerned only with optimal access latency while they are 
running.

So while it may take longer to start the process because memory compaction 
is attempting to allocate hugepages with __GFP_DIRECT_RECLAIM, in the 
cases where compaction is successful this is a very significant long-term 
win.  In cases where compaction fails, falling back to local pages of the 
native page size instead of remote thp is a win for the remaining lifetime 
of the process: as stated, 13.9% faster for all memory accesses to the 
process's text while it runs on Haswell.

So these processes, and there are 100's of users of these libraries, 
explicitly want to incur the cost at startup to attempt to get local thp 
and fall back only when necessary to local pages of the native page size.  
They want the behavior this patch is changing and will regress if the 
patch is merged.

This patch does not provide an alternative beyond MPOL_BIND to preserve 
the existing behavior.  In that case, if local memory is actually oom, we 
unnecessarily oom kill processes, which would be completely unacceptable 
to these users since they are fine accepting 10-20% of their memory being 
faulted remotely if necessary to prevent processes from being oom killed.

These libraries were implemented with the existing behavior of 
MADV_HUGEPAGE in mind, and I cannot ask for 100s of binaries to be rebuilt 
to enforce new constraints specifically for hugepages with no alternative 
that avoids unnecessary oom killing.

There are ways to address this without introducing regressions for 
existing users of MADV_HUGEPAGE: introduce an madvise() mode to accept 
remote thp allocations, which users of this library would never set, or 
fix memory compaction so that it does not incur substantial allocation 
latency when it will likely fail.

We don't introduce 13.9% regressions for binaries that are correctly using 
MADV_HUGEPAGE as it is implemented.

Nacked-by: David Rientjes 


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-09-25 Thread Michal Hocko
On Tue 25-09-18 13:20:08, Mel Gorman wrote:
> On Tue, Sep 25, 2018 at 02:03:25PM +0200, Michal Hocko wrote:
> > From: Andrea Arcangeli 
> > 
> > THP allocation might be really disruptive when allocated on NUMA system
> > with the local node full or hard to reclaim. Stefan has posted an
> > allocation stall report on 4.12 based SLES kernel which suggests the
> > same issue:
> > 
> > [245513.362669] kvm: page allocation stalls for 194572ms, order:9, 
> > mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM),
> >  nodemask=(null)
> > [245513.363983] kvm cpuset=/ mems_allowed=0-1
> > [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: GW 
> > 4.12.0+98-ph 001 SLE15 (unreleased)
> > [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 
> > 2.0 12/05/2017
> > [245513.365905] Call Trace:
> > [245513.366535]  dump_stack+0x5c/0x84
> > [245513.367148]  warn_alloc+0xe0/0x180
> > [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
> > [245513.368406]  ? __slab_free+0xa9/0x2f0
> > [245513.369048]  ? __slab_free+0xa9/0x2f0
> > [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
> > [245513.370300]  alloc_pages_vma+0x1e5/0x280
> > [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
> > [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
> > [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
> > [245513.372812]  __handle_mm_fault+0x93d/0x1060
> > [245513.373439]  handle_mm_fault+0xc6/0x1b0
> > [245513.374042]  __do_page_fault+0x230/0x430
> > [245513.374679]  ? get_vtime_delta+0x13/0xb0
> > [245513.375411]  do_page_fault+0x2a/0x70
> > [245513.376145]  ? page_fault+0x65/0x80
> > [245513.376882]  page_fault+0x7b/0x80
> > [...]
> > [245513.382056] Mem-Info:
> > [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
> >  active_file:60183 inactive_file:245285 isolated_file:0
> >  unevictable:15657 dirty:286 writeback:1 unstable:0
> >  slab_reclaimable:75543 slab_unreclaimable:2509111
> >  mapped:81814 shmem:31764 pagetables:370616 bounce:0
> >  free:32294031 free_pcp:6233 free_cma:0
> > [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB 
> > active_file:240648kB inactive_file:981168kB unevictable:13368kB 
> > isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB 
> > writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 
> > 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> > [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB 
> > active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB 
> > isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB 
> > shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB 
> > unstable:0kB all_unreclaimable? no
> > 
> > The defrag mode is "madvise" and from the above report it is clear that
> > the THP has been allocated for a MADV_HUGEPAGE vma.
> > 
> > Andrea has identified that the main source of the problem is
> > __GFP_THISNODE usage:
> > 
> > : The problem is that direct compaction combined with the NUMA
> > : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> > : hard the local node, instead of failing the allocation if there's no
> > : THP available in the local node.
> > :
> > : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> > : path even with MPOL_DEFAULT.
> > :
> > : The idea behind the __GFP_THISNODE addition, is that it is better to
> > : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> > : backed memory. That largely depends on the remote latency though, on
> > : threadrippers for example the overhead is relatively low in my
> > : experience.
> > :
> > : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> > : extremely slow qemu startup with vfio, if the VM is larger than the
> > : size of one host NUMA node. This is because it will try very hard to
> > : unsuccessfully swapout get_user_pages pinned pages as result of the
> > : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> > : allocations and instead of trying to allocate THP on other nodes (it
> > : would be even worse without vfio type1 GUP pins of course, except it'd
> > : be swapping heavily instead).
> > 
> > Fix this by removing __GFP_THISNODE for THP requests which are
> > requesting the direct reclaim. This effectively reverts 5265047ac301 on
> > the grounds that the zone/node reclaim was known to be disruptive due
> > to premature reclaim when there was memory free. While it made sense at
> > the time for HPC workloads without NUMA awareness on rare machines, it
> > was ultimately harmful in the majority of cases. The existing behaviour
> > is similar, if not as widespread, as it applies to a corner case but
> > crucially, it cannot be tuned 

Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-09-25 Thread Mel Gorman
On Tue, Sep 25, 2018 at 02:03:25PM +0200, Michal Hocko wrote:
> From: Andrea Arcangeli 
> 
> THP allocation might be really disruptive when allocated on NUMA system
> with the local node full or hard to reclaim. Stefan has posted an
> allocation stall report on 4.12 based SLES kernel which suggests the
> same issue:
> 
> [245513.362669] kvm: page allocation stalls for 194572ms, order:9, 
> mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM),
>  nodemask=(null)
> [245513.363983] kvm cpuset=/ mems_allowed=0-1
> [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: GW 4.12.0+98-ph 
> 001 SLE15 (unreleased)
> [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 
> 12/05/2017
> [245513.365905] Call Trace:
> [245513.366535]  dump_stack+0x5c/0x84
> [245513.367148]  warn_alloc+0xe0/0x180
> [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
> [245513.368406]  ? __slab_free+0xa9/0x2f0
> [245513.369048]  ? __slab_free+0xa9/0x2f0
> [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
> [245513.370300]  alloc_pages_vma+0x1e5/0x280
> [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
> [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
> [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
> [245513.372812]  __handle_mm_fault+0x93d/0x1060
> [245513.373439]  handle_mm_fault+0xc6/0x1b0
> [245513.374042]  __do_page_fault+0x230/0x430
> [245513.374679]  ? get_vtime_delta+0x13/0xb0
> [245513.375411]  do_page_fault+0x2a/0x70
> [245513.376145]  ? page_fault+0x65/0x80
> [245513.376882]  page_fault+0x7b/0x80
> [...]
> [245513.382056] Mem-Info:
> [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
>  active_file:60183 inactive_file:245285 isolated_file:0
>  unevictable:15657 dirty:286 writeback:1 unstable:0
>  slab_reclaimable:75543 slab_unreclaimable:2509111
>  mapped:81814 shmem:31764 pagetables:370616 bounce:0
>  free:32294031 free_pcp:6233 free_cma:0
> [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB 
> active_file:240648kB inactive_file:981168kB unevictable:13368kB 
> isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB 
> writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 
> 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB 
> active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB 
> isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB 
> shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB 
> unstable:0kB all_unreclaimable? no
> 
> The defrag mode is "madvise" and from the above report it is clear that
> the THP has been allocated for a MADV_HUGEPAGE vma.
> 
> Andrea has identified that the main source of the problem is
> __GFP_THISNODE usage:
> 
> : The problem is that direct compaction combined with the NUMA
> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> : hard the local node, instead of failing the allocation if there's no
> : THP available in the local node.
> :
> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> : path even with MPOL_DEFAULT.
> :
> : The idea behind the __GFP_THISNODE addition, is that it is better to
> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> : backed memory. That largely depends on the remote latency though, on
> : threadrippers for example the overhead is relatively low in my
> : experience.
> :
> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> : extremely slow qemu startup with vfio, if the VM is larger than the
> : size of one host NUMA node. This is because it will try very hard to
> : unsuccessfully swapout get_user_pages pinned pages as result of the
> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> : allocations and instead of trying to allocate THP on other nodes (it
> : would be even worse without vfio type1 GUP pins of course, except it'd
> : be swapping heavily instead).
> 
> Fix this by removing __GFP_THISNODE for THP requests which are
> requesting the direct reclaim. This effectively reverts 5265047ac301 on
> the grounds that the zone/node reclaim was known to be disruptive due
> to premature reclaim when there was memory free. While it made sense at
> the time for HPC workloads without NUMA awareness on rare machines, it
> was ultimately harmful in the majority of cases. The existing behaviour
> is similar, if not as widespread, as it applies to a corner case but
> crucially, it cannot be tuned around like zone_reclaim_mode can. The
> default behaviour should always be to cause the least harm for the
> common case.
> 
> If there are specialised use cases out there that want zone_reclaim_mode
> in 
