Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"

2018-12-10 Thread Kirill A. Shutemov
On Fri, Dec 07, 2018 at 03:05:28PM -0800, David Rientjes wrote:
> > > Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat 
> > > different policy for hugepage allocations, which were allocated through 
> > > alloc_hugepage_vma().  For hugepage allocations, if the allocating 
> > > process's node is in the set of allowed nodes, allocate with 
> > > __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with 
> > > __GFP_THISNODE instead).
> > 
> > Why is it wrong to fallback to an explicitly configured mbind mask?
> > 
> 
> The new_page() case is similar to the shmem_alloc_hugepage() case.  Prior 
> to 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into 
> alloc_hugepage_direct_gfpmask"), shmem_alloc_hugepage() did 
> alloc_pages_vma() with hugepage == true, which effected a different 
> allocation policy: if the node current is running on is allowed by the 
> policy, use __GFP_THISNODE (considering ac5b2c18911ff is reverted, which 
> it is in Linus's tree).
> 
> After 89c83fb539f9, we lose that and can fallback to remote memory.  Since 
> the discussion is on-going wrt the NUMA aspects of hugepage allocations, 
> it's better to have a stable 4.20 tree while that is being worked out and 
> likely deserves separate patches for both new_page() and 
> > shmem_alloc_hugepage().  For the latter specifically, it would be nice 
> > to get an Acked-by from Kirill, who implemented shmem_alloc_hugepage() 
> > with hugepage == true back in 4.8 (which also had the __GFP_THISNODE 
> > behavior), before the allocation policy is suddenly changed.

I do not have much experience with page_alloc/compaction/reclaim paths and
I don't feel that my opinion should have much weight here. Do not gate it
on me.

(I do follow the discussion, but I don't have anything meaningful to
contribute so far.)

-- 
 Kirill A. Shutemov


Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"

2018-12-10 Thread Michal Hocko
On Fri 07-12-18 15:05:28, David Rientjes wrote:
> On Fri, 7 Dec 2018, Michal Hocko wrote:
[...]
> > > Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat 
> > > different policy for hugepage allocations, which were allocated through 
> > > alloc_hugepage_vma().  For hugepage allocations, if the allocating 
> > > process's node is in the set of allowed nodes, allocate with 
> > > __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with 
> > > __GFP_THISNODE instead).
> > 
> > Why is it wrong to fallback to an explicitly configured mbind mask?
> > 
> 
> The new_page() case is similar to the shmem_alloc_hugepage() case.  Prior 
> to 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into 
> alloc_hugepage_direct_gfpmask"), shmem_alloc_hugepage() did 
> alloc_pages_vma() with hugepage == true, which effected a different 
> allocation policy: if the node current is running on is allowed by the 
> policy, use __GFP_THISNODE (considering ac5b2c18911ff is reverted, which 
> it is in Linus's tree).
> 
> After 89c83fb539f9, we lose that and can fallback to remote memory.  Since 
> the discussion is on-going wrt the NUMA aspects of hugepage allocations, 
> it's better to have a stable 4.20 tree while that is being worked out and 
> likely deserves separate patches for both new_page() and 
> shmem_alloc_hugepage().  For the latter specifically, it would be nice 
> to get an Acked-by from Kirill, who implemented shmem_alloc_hugepage() 
> with hugepage == true back in 4.8 (which also had the __GFP_THISNODE 
> behavior), before the allocation policy is suddenly changed.

This should be a part of the changelog.

-- 
Michal Hocko
SUSE Labs


Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"

2018-12-07 Thread David Rientjes
On Fri, 7 Dec 2018, Michal Hocko wrote:

> > This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.
> > 
> > There are a couple of issues with 89c83fb539f9 independent of its partial 
> > revert in 2f0799a0ffc0 ("mm, thp: restore node-local hugepage 
> > allocations"):
> > 
> > Firstly, the interaction between alloc_hugepage_direct_gfpmask() and 
> > alloc_pages_vma() is racy wrt __GFP_THISNODE and MPOL_BIND.  
> > alloc_hugepage_direct_gfpmask() makes sure not to set __GFP_THISNODE for 
> > an MPOL_BIND policy but the policy used in alloc_pages_vma() may not be 
> > the same for shared vma policies, triggering the WARN_ON_ONCE() in 
> > policy_node().
> 
> Could you share a test case?
> 

Sorry, as Vlastimil pointed out, this race no longer exists since commit 
2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations") removed 
the restructuring of alloc_hugepage_direct_gfpmask().  It existed prior to 
that commit for shared vma policies, which could switch to MPOL_BIND in 
the window between alloc_hugepage_direct_gfpmask() and alloc_pages_vma().

> > Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat 
> > different policy for hugepage allocations, which were allocated through 
> > alloc_hugepage_vma().  For hugepage allocations, if the allocating 
> > process's node is in the set of allowed nodes, allocate with 
> > __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with 
> > __GFP_THISNODE instead).
> 
> Why is it wrong to fallback to an explicitly configured mbind mask?
> 

The new_page() case is similar to the shmem_alloc_hugepage() case.  Prior 
to 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into 
alloc_hugepage_direct_gfpmask"), shmem_alloc_hugepage() did 
alloc_pages_vma() with hugepage == true, which effected a different 
allocation policy: if the node current is running on is allowed by the 
policy, use __GFP_THISNODE (considering ac5b2c18911ff is reverted, which 
it is in Linus's tree).

After 89c83fb539f9, we lose that and can fallback to remote memory.  Since 
the discussion is on-going wrt the NUMA aspects of hugepage allocations, 
it's better to have a stable 4.20 tree while that is being worked out and 
likely deserves separate patches for both new_page() and 
shmem_alloc_hugepage().  For the latter specifically, it would be nice to 
get an Acked-by from Kirill, who implemented shmem_alloc_hugepage() with 
hugepage == true back in 4.8 (which also had the __GFP_THISNODE behavior), 
before the allocation policy is suddenly changed.
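[Editorial aside: the node-selection behavior described above, which the revert restores, can be sketched as a small user-space model.  The helper name and the bitmask representation of the nodemask are illustrative only, not the kernel's actual data structures:]

```c
/* Hypothetical user-space model of the pre-89c83fb539f9 policy for
 * hugepage == true allocations: MPOL_PREFERRED pins the allocation to
 * the preferred node with __GFP_THISNODE; otherwise, if the node the
 * task is running on is in the allowed nodemask, pin to that node.
 * Returns the node to pin with __GFP_THISNODE, or -1 when no pinning
 * applies and remote fallback is permitted. */
enum mpol_mode { MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND };

int thp_target_node(enum mpol_mode mode, unsigned long allowed_nodes,
                    int preferred_node, int current_node)
{
    if (mode == MPOL_PREFERRED)
        return preferred_node;          /* __GFP_THISNODE on it */
    if (allowed_nodes & (1UL << current_node))
        return current_node;            /* __GFP_THISNODE on it */
    return -1;                          /* no pinning, fallback allowed */
}
```

Under this model, an MPOL_BIND allocation only falls back within the bound nodemask when the local node is not allowed; otherwise it is pinned locally with __GFP_THISNODE, which is exactly the behavior 89c83fb539f9 dropped.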



Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"

2018-12-07 Thread Linus Torvalds
On Fri, Dec 7, 2018 at 2:27 PM David Rientjes  wrote:
>
> I noticed the race in 89c83fb539f9 ("mm, thp: consolidate THP gfp handling
> into alloc_hugepage_direct_gfpmask") that is fixed by the revert, but as
> you noted it didn't cleanup the second part which is the balancing act for
> gfp flags between alloc_hugepage_direct_gfpmask() and alloc_pages_vma().
> Syzbot found this to trigger the WARN_ON_ONCE() you mention.

Can you rewrite the commit message to explain this better rather than
the misleading "race" description?

 Linus





Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"

2018-12-07 Thread David Rientjes
On Fri, 7 Dec 2018, Vlastimil Babka wrote:

> > This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.
> > 
> > There are a couple of issues with 89c83fb539f9 independent of its partial 
> > revert in 2f0799a0ffc0 ("mm, thp: restore node-local hugepage 
> > allocations"):
> > 
> > Firstly, the interaction between alloc_hugepage_direct_gfpmask() and 
> > alloc_pages_vma() is racy wrt __GFP_THISNODE and MPOL_BIND.  
> > alloc_hugepage_direct_gfpmask() makes sure not to set __GFP_THISNODE for 
> > an MPOL_BIND policy but the policy used in alloc_pages_vma() may not be 
> > the same for shared vma policies, triggering the WARN_ON_ONCE() in 
> > policy_node().
> 
> AFAICS 2f0799a0ffc0 removed the policy check in
> alloc_hugepage_direct_gfpmask() completely, so it's not racy and the
> warning will always trigger for a MPOL_BIND policy right now?
> 

Yup, looks like you hit it on the head.  This revert should have been done 
alongside 2f0799a0ffc0 ("mm, thp: restore node-local hugepage 
allocations"); I didn't appreciate how invasive the consolidation patch 
was.

I noticed the race in 89c83fb539f9 ("mm, thp: consolidate THP gfp handling 
into alloc_hugepage_direct_gfpmask") that is fixed by the revert, but as 
you noted it didn't cleanup the second part which is the balancing act for 
gfp flags between alloc_hugepage_direct_gfpmask() and alloc_pages_vma().  
Syzbot found this to trigger the WARN_ON_ONCE() you mention.

So we certainly need this patch for 4.20 as a follow-up to 2f0799a0ffc0.  
It's likely better to regroup and discuss NUMA aspects of all thp 
allocations separately with a stable 4.20.



Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"

2018-12-07 Thread Vlastimil Babka
On 12/6/18 11:00 PM, David Rientjes wrote:
> This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.
> 
> There are a couple of issues with 89c83fb539f9 independent of its partial 
> revert in 2f0799a0ffc0 ("mm, thp: restore node-local hugepage 
> allocations"):
> 
> Firstly, the interaction between alloc_hugepage_direct_gfpmask() and 
> alloc_pages_vma() is racy wrt __GFP_THISNODE and MPOL_BIND.  
> alloc_hugepage_direct_gfpmask() makes sure not to set __GFP_THISNODE for 
> an MPOL_BIND policy but the policy used in alloc_pages_vma() may not be 
> the same for shared vma policies, triggering the WARN_ON_ONCE() in 
> policy_node().

AFAICS 2f0799a0ffc0 removed the policy check in
alloc_hugepage_direct_gfpmask() completely, so it's not racy and the
warning will always trigger for a MPOL_BIND policy right now?



Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"

2018-12-07 Thread Michal Hocko
On Thu 06-12-18 14:00:20, David Rientjes wrote:
> This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.
> 
> There are a couple of issues with 89c83fb539f9 independent of its partial 
> revert in 2f0799a0ffc0 ("mm, thp: restore node-local hugepage 
> allocations"):
> 
> Firstly, the interaction between alloc_hugepage_direct_gfpmask() and 
> alloc_pages_vma() is racy wrt __GFP_THISNODE and MPOL_BIND.  
> alloc_hugepage_direct_gfpmask() makes sure not to set __GFP_THISNODE for 
> an MPOL_BIND policy but the policy used in alloc_pages_vma() may not be 
> the same for shared vma policies, triggering the WARN_ON_ONCE() in 
> policy_node().

Could you share a test case?

> Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat 
> different policy for hugepage allocations, which were allocated through 
> alloc_hugepage_vma().  For hugepage allocations, if the allocating 
> process's node is in the set of allowed nodes, allocate with 
> __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with 
> __GFP_THISNODE instead).

Why is it wrong to fallback to an explicitly configured mbind mask?

> This was changed for shmem_alloc_hugepage() to 
> allow fallback to other nodes in 89c83fb539f9 as it did for new_page() in 
> mm/mempolicy.c which is functionally different behavior and removes the 
> requirement to only allocate hugepages locally.

Also why should new_page behave any differently for THP from any other
pages? There is no fallback allocation for those and so the mbind could
easily fail. So what is the actual rationale?

> The latter should have been reverted as part of 2f0799a0ffc0 as well.
> 
> Fully revert 89c83fb539f9 so that hugepage allocation policy is fully 
> restored and there is no race between alloc_hugepage_direct_gfpmask() and 
> alloc_pages_vma().
> 
> Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into 
> alloc_hugepage_direct_gfpmask")
> Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
> Signed-off-by: David Rientjes 
> ---
>  include/linux/gfp.h | 12 
>  mm/huge_memory.c| 27 +--
>  mm/mempolicy.c  | 32 +---
>  mm/shmem.c  |  2 +-
>  4 files changed, 51 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -510,18 +510,22 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
>  }
>  extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
>   struct vm_area_struct *vma, unsigned long addr,
> - int node);
> + int node, bool hugepage);
> +#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
> + alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
>  #else
>  #define alloc_pages(gfp_mask, order) \
>   alloc_pages_node(numa_node_id(), gfp_mask, order)
> -#define alloc_pages_vma(gfp_mask, order, vma, addr, node)\
> +#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
> + alloc_pages(gfp_mask, order)
> +#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
>   alloc_pages(gfp_mask, order)
>  #endif
>  #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
>  #define alloc_page_vma(gfp_mask, vma, addr)  \
> - alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
> + alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
>  #define alloc_page_vma_node(gfp_mask, vma, addr, node)   \
> - alloc_pages_vma(gfp_mask, 0, vma, addr, node)
> + alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
>  
>  extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
>  extern unsigned long get_zeroed_page(gfp_t gfp_mask);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -629,30 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct 
> vm_fault *vmf,
>   *   available
>   * never: never stall for any thp allocation
>   */
> > -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>  {
>   const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> - const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
>  
>   /* Always do synchronous compaction */
> >   if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> - return GFP_TRANSHUGE | __GFP_THISNODE |
> -(vma_madvised ? 0 : __GFP_NORETRY);
> + return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
>  
>   /* Kick kcompactd and fail quickly */
> >   if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> - return gfp_mask | __GFP_KSWAPD_RECLAIM;
> + return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
>  

[patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"

2018-12-06 Thread David Rientjes
This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.

There are a couple of issues with 89c83fb539f9 independent of its partial 
revert in 2f0799a0ffc0 ("mm, thp: restore node-local hugepage 
allocations"):

Firstly, the interaction between alloc_hugepage_direct_gfpmask() and 
alloc_pages_vma() is racy wrt __GFP_THISNODE and MPOL_BIND.  
alloc_hugepage_direct_gfpmask() makes sure not to set __GFP_THISNODE for 
an MPOL_BIND policy but the policy used in alloc_pages_vma() may not be 
the same for shared vma policies, triggering the WARN_ON_ONCE() in 
policy_node().

Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat 
different policy for hugepage allocations, which were allocated through 
alloc_hugepage_vma().  For hugepage allocations, if the allocating 
process's node is in the set of allowed nodes, allocate with 
__GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with 
__GFP_THISNODE instead).  This was changed for shmem_alloc_hugepage() to 
allow fallback to other nodes in 89c83fb539f9 as it did for new_page() in 
mm/mempolicy.c which is functionally different behavior and removes the 
requirement to only allocate hugepages locally.

The latter should have been reverted as part of 2f0799a0ffc0 as well.

Fully revert 89c83fb539f9 so that hugepage allocation policy is fully 
restored and there is no race between alloc_hugepage_direct_gfpmask() and 
alloc_pages_vma().

Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into 
alloc_hugepage_direct_gfpmask")
Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
Signed-off-by: David Rientjes 
---
 include/linux/gfp.h | 12 
 mm/huge_memory.c| 27 +--
 mm/mempolicy.c  | 32 +---
 mm/shmem.c  |  2 +-
 4 files changed, 51 insertions(+), 22 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -510,18 +510,22 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
struct vm_area_struct *vma, unsigned long addr,
-   int node);
+   int node, bool hugepage);
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
+   alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
 #else
 #define alloc_pages(gfp_mask, order) \
alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node)\
+#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
+   alloc_pages(gfp_mask, order)
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
alloc_pages(gfp_mask, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr)\
-   alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
+   alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
 #define alloc_page_vma_node(gfp_mask, vma, addr, node) \
-   alloc_pages_vma(gfp_mask, 0, vma, addr, node)
+   alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -629,30 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  * available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-   const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
 
/* Always do synchronous compaction */
	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-   return GFP_TRANSHUGE | __GFP_THISNODE |
-  (vma_madvised ? 0 : __GFP_NORETRY);
+   return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
 
/* Kick kcompactd and fail quickly */
	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-   return gfp_mask | __GFP_KSWAPD_RECLAIM;
+   return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
 
/* Synchronous compaction if madvised, otherwise kick kcompactd */
	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-   return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
- __GFP_KSWAPD_RECLAIM);
+   return GFP_TRANSHUGE_LIGHT |
+   (vma_madvised ? __GFP_DIRECT_RECLAIM :
+   __GFP_KSWAPD_RECLAIM);
 
 
