Re: [External] Re: [PATCH v20 6/9] mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page

2021-04-20 Thread Mike Kravetz
On 4/20/21 1:46 AM, Muchun Song wrote:
> On Tue, Apr 20, 2021 at 7:20 AM Mike Kravetz  wrote:
>>
>> On 4/15/21 1:40 AM, Muchun Song wrote:
>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>>> index 0abed7e766b8..6e970a7d3480 100644
>>> --- a/include/linux/hugetlb.h
>>> +++ b/include/linux/hugetlb.h
>>> @@ -525,6 +525,7 @@ unsigned long hugetlb_get_unmapped_area(struct file 
>>> *file, unsigned long addr,
>>>   *   code knows it has only reference.  All other examinations and
>>>   *   modifications require hugetlb_lock.
>>>   * HPG_freed - Set when page is on the free lists.
>>> + * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are 
>>> freed.
>>>   *   Synchronization: hugetlb_lock held for examination and modification.
>>
>> I like the per-page flag.  In previous versions of the series, you just
>> checked free_vmemmap_pages_per_hpage() to determine if vmemmap
>> should be allocated.  Is there any change in functionality that makes it
>> necessary to set the flag in each page, or is it mostly for flexibility
>> going forward?
> 
> Actually, only the routine that dissolves the page cares whether
> the page is on the buddy free list when update_and_free_page
> returns. But we cannot change the return type of
> update_and_free_page (e.g. change the return type from 'void' to 'int').
> Why? If the hugepage is freed through a kworker, we cannot
> know the return value when update_and_free_page returns.
> So adding a return value seems odd.
> 
> In the dissolving routine, we can allocate the vmemmap pages first;
> if that succeeds, then we can make sure that
> update_and_free_page can successfully free the page. So I need
> something to mark a page which does not need the vmemmap
> pages to be allocated.
> 
> On the surface, we seem to have a straightforward method
> to do this.
> 
> Add a new parameter 'alloc_vmemmap' to update_and_free_page() to
> indicate whether it still needs to allocate the vmemmap pages (the
> caller may have already allocated them). Just like below.
> 
>    void update_and_free_page(struct hstate *h, struct page *page,
>                              bool atomic, bool alloc_vmemmap)
>    {
>            if (alloc_vmemmap)
>                    /* allocate vmemmap pages */
>    }
> 
> But if the page is freed through a kworker, how do we pass
> 'alloc_vmemmap' to the kworker? We can embed this
> information in the per-page flag instead. So if we introduce
> HPG_vmemmap_optimized, the alloc_vmemmap parameter is
> not necessary.
> 
> So it seems that introducing HPG_vmemmap_optimized is
> a good choice.

Thanks for the explanation!

Agree that the flag is a good choice.  How about adding a comment like
this above the alloc_huge_page_vmemmap call in dissolve_free_huge_page?

/*
 * Normally update_and_free_page will allocate required vmemmap before
 * freeing the page.  update_and_free_page will fail to free the page
 * if it cannot allocate required vmemmap.  We need to adjust
 * max_huge_pages if the page is not freed.  Attempt to allocate
 * vmemmap here so that we can take appropriate action on failure.
 */
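
In other words, the dissolve path ends up looking roughly like this
(just a sketch of the flow; the exact hunk is further down in this
patch):

	rc = alloc_huge_page_vmemmap(h, page);
	if (!rc) {
		/*
		 * vmemmap is populated, so update_and_free_page()
		 * can not fail to free the page.
		 */
		update_and_free_page(h, head, false);
	} else {
		/*
		 * Could not allocate vmemmap, the page is not freed,
		 * so max_huge_pages must be adjusted accordingly.
		 */
	}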

...
>>> +static void add_hugetlb_page(struct hstate *h, struct page *page,
>>> +  bool adjust_surplus)
>>> +{
>>
>> We need to be a bit careful with hugepage specific flags that may be
>> set.  The routine remove_hugetlb_page which is called for 'page' before
>> this routine will not clear any of the hugepage specific flags.  If the
>> calling path goes through free_huge_page, most but not all flags are
>> cleared.
>>
>> We had a discussion about clearing the page->private field in Oscar's
>> series.  In the case of 'new' pages we can assume page->private is
>> cleared, but perhaps we should not make that assumption here.  Since we
>> hope to rarely call this routine, it might be safer to do something
>> like:
>>
>> set_page_private(page, 0);
>> SetHPageVmemmapOptimized(page);
> 
> Agree. Thanks for your reminder. I will fix this.
> 
>>
>>> + int nid = page_to_nid(page);
>>> +
>>> + lockdep_assert_held(&hugetlb_lock);
>>> +
>>> + INIT_LIST_HEAD(&page->lru);
>>> + h->nr_huge_pages++;
>>> + h->nr_huge_pages_node[nid]++;
>>> +
>>> + if (adjust_surplus) {
>>> + h->surplus_huge_pages++;
>>> + h->surplus_huge_pages_node[nid]++;
>>> + }
>>> +
>>> + set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
>>> +
>>> + /*
>>> +  * The r

Re: [PATCH v20 7/9] mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap

2021-04-19 Thread Mike Kravetz
On 4/15/21 1:40 AM, Muchun Song wrote:
> Add a kernel parameter hugetlb_free_vmemmap to enable the feature of
> freeing unused vmemmap pages associated with each hugetlb page on boot.
> 
> We disable PMD mapping of vmemmap pages for the x86-64 arch when this
> feature is enabled, because vmemmap_remap_free() depends on vmemmap
> being base page mapped.
> 
> Signed-off-by: Muchun Song 
> Reviewed-by: Oscar Salvador 
> Reviewed-by: Barry Song 
> Reviewed-by: Miaohe Lin 
> Tested-by: Chen Huang 
> Tested-by: Bodeddula Balasubramaniam 
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 17 +
>  Documentation/admin-guide/mm/hugetlbpage.rst|  3 +++
>  arch/x86/mm/init_64.c   |  8 ++--
>  include/linux/hugetlb.h | 19 +++
>  mm/hugetlb_vmemmap.c| 24 
>  5 files changed, 69 insertions(+), 2 deletions(-)

Thanks,

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz


Re: [PATCH v20 6/9] mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page

2021-04-19 Thread Mike Kravetz
ode?  Or, should this test
be in the existing code?

Sorry, I am not seeing why this is needed.

> + arch_clear_hugepage_flags(page);
> + enqueue_huge_page(h, page);
> + }
> +}
> +
>  static void __update_and_free_page(struct hstate *h, struct page *page)
>  {
>   int i;
> @@ -1384,6 +1412,18 @@ static void __update_and_free_page(struct hstate *h, 
> struct page *page)
>   if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>   return;
>  
> + if (alloc_huge_page_vmemmap(h, page)) {
> + spin_lock_irq(&hugetlb_lock);
> + /*
> +  * If we cannot allocate vmemmap pages, just refuse to free the
> +  * page and put the page back on the hugetlb free list and treat
> +  * as a surplus page.
> +  */
> + add_hugetlb_page(h, page, true);
> + spin_unlock_irq(&hugetlb_lock);
> + return;
> + }
> +
>   for (i = 0; i < pages_per_huge_page(h);
>i++, subpage = mem_map_next(subpage, page, i)) {
>   subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
> @@ -1444,7 +1484,7 @@ static inline void flush_free_hpage_work(struct hstate 
> *h)
>  static void update_and_free_page(struct hstate *h, struct page *page,
>bool atomic)
>  {
> - if (!free_vmemmap_pages_per_hpage(h) || !atomic) {
> + if (!HPageVmemmapOptimized(page) || !atomic) {
>   __update_and_free_page(h, page);
>   return;
>   }

When update_and_free_pages_bulk was added it was done to avoid
lock/unlock cycles with each page.  At the time, I thought about the
addition of code to allocate vmmemmap, and the possibility that those
allocations could fail.  I thought it might make sense to perhaps
process the pages one at a time so that we could quit at the first
allocation failure.  After more thought, I think it is best to leave the
code to do bulk operations as you have done above.  Why?
- Just because one allocation fails does not mean the next will fail.
  It is possible the allocations could be from different nodes/zones.
- We will still need to put the requested number of pages into surplus
  state.

I am not suggesting you change anything.  Just wanted to share my
thoughts in case someone thought otherwise.
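
For reference, the failure handling in the bulk case ends up being
roughly this call flow (pieced together from the hunks above, not new
code):

	update_and_free_pages_bulk()
	  update_and_free_page(h, page, false)
	    __update_and_free_page(h, page)
	      if (alloc_huge_page_vmemmap(h, page)) {
	              /* allocation failed: keep the page as a surplus hugetlb page */
	              add_hugetlb_page(h, page, true);
	              return;
	      }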

> @@ -1790,10 +1830,14 @@ static struct page *remove_pool_huge_page(struct 
> hstate *h,
>   * nothing for in-use hugepages and non-hugepages.
>   * This function returns values like below:
>   *
> - *  -EBUSY: failed to dissolved free hugepages or the hugepage is in-use
> - *  (allocated or reserved.)
> - *   0: successfully dissolved free hugepages or the page is not a
> - *  hugepage (considered as already dissolved)
> + *  -ENOMEM: failed to allocate vmemmap pages to free the freed hugepages
> + *   when the system is under memory pressure and the feature of
> + *   freeing unused vmemmap pages associated with each hugetlb page
> + *   is enabled.
> + *  -EBUSY:  failed to dissolved free hugepages or the hugepage is in-use
> + *   (allocated or reserved.)
> + *   0:  successfully dissolved free hugepages or the page is not a
> + *   hugepage (considered as already dissolved)
>   */
>  int dissolve_free_huge_page(struct page *page)
>  {
> @@ -1835,19 +1879,30 @@ int dissolve_free_huge_page(struct page *page)
>   goto retry;
>   }
>  
> - /*
> -  * Move PageHWPoison flag from head page to the raw error page,
> -  * which makes any subpages rather than the error page reusable.
> -  */
> - if (PageHWPoison(head) && page != head) {
> - SetPageHWPoison(page);
> - ClearPageHWPoison(head);
> - }
>   remove_hugetlb_page(h, page, false);
>   h->max_huge_pages--;
>   spin_unlock_irq(&hugetlb_lock);
> - update_and_free_page(h, head, false);
> - return 0;
> +
> + rc = alloc_huge_page_vmemmap(h, page);
> + if (!rc) {
> + /*
> +  * Move PageHWPoison flag from head page to the raw
> +  * error page, which makes any subpages rather than
> +  * the error page reusable.
> +  */
> + if (PageHWPoison(head) && page != head) {
> + SetPageHWPoison(page);
> + ClearPageHWPoison(head);
> + }
> + update_and_free_page(h, head, false);
>

Re: [PATCH v20 5/9] mm: hugetlb: defer freeing of HugeTLB pages

2021-04-16 Thread Mike Kravetz
On 4/15/21 1:40 AM, Muchun Song wrote:
> In the subsequent patch, we should allocate the vmemmap pages when
> freeing a HugeTLB page. But update_and_free_page() can be called
> under any context, so we cannot use GFP_KERNEL to allocate vmemmap
> pages. However, we can defer the actual freeing in a kworker to
> prevent from using GFP_ATOMIC to allocate the vmemmap pages.

Thanks!  I knew we would need to introduce a kworker for this when I
removed the kworker previously used in free_huge_page.

> The __update_and_free_page() is where the call to allocate vmemmmap
> pages will be inserted.

This patch adds the functionality required for __update_and_free_page
to potentially sleep and fail.  More questions will come up in the
subsequent patch when code must deal with the failures.

> 
> Signed-off-by: Muchun Song 
> ---
>  mm/hugetlb.c | 73 
> 
>  mm/hugetlb_vmemmap.c | 12 -
>  mm/hugetlb_vmemmap.h | 17 
>  3 files changed, 85 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 923d05e2806b..eeb8f5480170 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1376,7 +1376,7 @@ static void remove_hugetlb_page(struct hstate *h, 
> struct page *page,
>   h->nr_huge_pages_node[nid]--;
>  }
>  
> -static void update_and_free_page(struct hstate *h, struct page *page)
> +static void __update_and_free_page(struct hstate *h, struct page *page)
>  {
>   int i;
>   struct page *subpage = page;
> @@ -1399,12 +1399,73 @@ static void update_and_free_page(struct hstate *h, 
> struct page *page)
>   }
>  }
>  
> +/*
> + * As update_and_free_page() can be called under any context, so we cannot
> + * use GFP_KERNEL to allocate vmemmap pages. However, we can defer the
> + * actual freeing in a workqueue to prevent from using GFP_ATOMIC to allocate
> + * the vmemmap pages.
> + *
> + * free_hpage_workfn() locklessly retrieves the linked list of pages to be
> + * freed and frees them one-by-one. As the page->mapping pointer is going
> + * to be cleared in free_hpage_workfn() anyway, it is reused as the 
> llist_node
> + * structure of a lockless linked list of huge pages to be freed.
> + */
> +static LLIST_HEAD(hpage_freelist);
> +
> +static void free_hpage_workfn(struct work_struct *work)
> +{
> + struct llist_node *node;
> +
> + node = llist_del_all(&hpage_freelist);
> +
> + while (node) {
> + struct page *page;
> + struct hstate *h;
> +
> + page = container_of((struct address_space **)node,
> +  struct page, mapping);
> + node = node->next;
> + page->mapping = NULL;
> + h = page_hstate(page);

The VM_BUG_ON_PAGE(!PageHuge(page), page) in page_hstate is going to
trigger because a previous call to remove_hugetlb_page() will
set_compound_page_dtor(page, NULL_COMPOUND_DTOR)

Note how h(hstate) is grabbed before calling update_and_free_page in
existing code.

We could potentially drop the !PageHuge(page) in page_hstate.  Or,
perhaps just use 'size_to_hstate(page_size(page))' in free_hpage_workfn.
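
i.e. something like the following in free_hpage_workfn() (untested
sketch):

	/*
	 * remove_hugetlb_page() already cleared the compound page
	 * destructor, so derive the hstate from the page size instead
	 * of calling page_hstate().
	 */
	h = size_to_hstate(page_size(page));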
-- 
Mike Kravetz

> +
> + __update_and_free_page(h, page);
> +
> + cond_resched();
> + }
> +}
> +static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> +
> +static inline void flush_free_hpage_work(struct hstate *h)
> +{
> + if (free_vmemmap_pages_per_hpage(h))
> + flush_work(&free_hpage_work);
> +}
> +
> +static void update_and_free_page(struct hstate *h, struct page *page,
> +  bool atomic)
> +{
> + if (!free_vmemmap_pages_per_hpage(h) || !atomic) {
> + __update_and_free_page(h, page);
> + return;
> + }
> +
> + /*
> +  * Defer freeing to avoid using GFP_ATOMIC to allocate vmemmap pages.
> +  *
> +  * Only call schedule_work() if hpage_freelist is previously
> +  * empty. Otherwise, schedule_work() had been called but the workfn
> +  * hasn't retrieved the list yet.
> +  */
> + if (llist_add((struct llist_node *)&page->mapping, &hpage_freelist))
> + schedule_work(&free_hpage_work);
> +}
> +
>  static void update_and_free_pages_bulk(struct hstate *h, struct list_head 
> *list)
>  {
>   struct page *page, *t_page;
>  
>   list_for_each_entry_safe(page, t_page, list, lru) {
> - update_and_free_page(h, page);
> + update_and_free_page(h, page, false);
>   cond_resched();
>   }
>  }
> @@ -1471,12 +1532,12 @@ void free_huge_page(struct page *page)
>   if (HPageTemporary(page)) {
>  

Re: [PATCH v20 4/9] mm: hugetlb: free the vmemmap pages associated with each HugeTLB page

2021-04-16 Thread Mike Kravetz
On 4/15/21 1:40 AM, Muchun Song wrote:
> Every HugeTLB has more than one struct page structure. We __know__ that
> we only use the first 4 (__NR_USED_SUBPAGE) struct page structures
> to store metadata associated with each HugeTLB.
> 
> There are a lot of struct page structures associated with each HugeTLB
> page. For tail pages, the value of compound_head is the same. So we can
> reuse first page of tail page structures. We map the virtual addresses
> of the remaining pages of tail page structures to the first tail page
> struct, and then free these page frames. Therefore, we need to reserve
> two pages as vmemmap areas.
> 
> When we allocate a HugeTLB page from the buddy, we can free some vmemmap
> pages associated with each HugeTLB page. It is more appropriate to do it
> in the prep_new_huge_page().
> 
> The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap
> pages associated with a HugeTLB page can be freed, returns zero for
> now, which means the feature is disabled. We will enable it once all
> the infrastructure is there.
> 
> Signed-off-by: Muchun Song 
> Reviewed-by: Oscar Salvador 
> Tested-by: Chen Huang 
> Tested-by: Bodeddula Balasubramaniam 
> Acked-by: Michal Hocko 

There may need to be some trivial rebasing due to Oscar's changes
when they go in.

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz


Re: [PATCH v20 3/9] mm: hugetlb: gather discrete indexes of tail page

2021-04-16 Thread Mike Kravetz
On 4/15/21 1:39 AM, Muchun Song wrote:
> For HugeTLB page, there are more metadata to save in the struct page.
> But the head struct page cannot meet our needs, so we have to abuse
> other tail struct page to store the metadata. In order to avoid
> conflicts caused by subsequent use of more tail struct pages, we can
> gather these discrete indexes of tail struct page. In this case, it
> will be easier to add a new tail page index later.
> 
> Signed-off-by: Muchun Song 
> Reviewed-by: Oscar Salvador 
> Reviewed-by: Miaohe Lin 
> Tested-by: Chen Huang 
> Tested-by: Bodeddula Balasubramaniam 
> Acked-by: Michal Hocko 

Thanks,

Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz


Re: [PATCH v8 5/7] mm: Make alloc_contig_range handle free hugetlb pages

2021-04-15 Thread Mike Kravetz
On 4/15/21 3:35 AM, Oscar Salvador wrote:
> alloc_contig_range will fail if it ever sees a HugeTLB page within the
> range we are trying to allocate, even when that page is free and can be
> easily reallocated.
> This has proved to be problematic for some users of alloc_contig_range,
> e.g: CMA and virtio-mem, where those would fail the call even when those
> pages lay in ZONE_MOVABLE and are free.
> 
> We can do better by trying to replace such page.
> 
> Free hugepages are tricky to handle: so that no userspace application
> notices disruption, we need to replace the current free hugepage with
> a new one.
> 
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugepage, and if it
> succeeds, it will replace the old one in the free hugepage pool.
> 
> The free page replacement is done under hugetlb_lock, so no external
> users of hugetlb will notice the change.
> To allocate the new huge page, we use alloc_buddy_huge_page(), so we
> do not have to deal with any counters, and prep_new_huge_page() is not
> called. This is valuable because in case we need to free the new page,
> we only need to call __free_pages().
> 
> Once we know that the page to be replaced is a genuine 0-refcounted
> huge page, we remove the old page from the freelist by remove_hugetlb_page().
> Then, we can call __prep_new_huge_page() and __prep_account_new_huge_page()
> for the new huge page to properly initialize it and increment the
> hstate->nr_huge_pages counter (previously decremented by
> remove_hugetlb_page()).
> Once done, the page is enqueued by enqueue_huge_page() and it is ready
> to be used.
> 
> There is one tricky case when
> page's refcount is 0 because it is in the process of being released.
> A missing PageHugeFreed bit will tell us that freeing is in flight so
> we retry after dropping the hugetlb_lock. The race window should be
> small and the next retry should make a forward progress.
> 
> E.g:
> 
> CPU0  CPU1
> free_huge_page()  isolate_or_dissolve_huge_page
> PageHuge() == T
> alloc_and_dissolve_huge_page
>   alloc_buddy_huge_page()
>   spin_lock_irq(hugetlb_lock)
>   // PageHuge() && !PageHugeFreed &&
>   // !PageCount()
>   spin_unlock_irq(hugetlb_lock)
>   spin_lock_irq(hugetlb_lock)
>   1) update_and_free_page
>PageHuge() == F
>__free_pages()
>   2) enqueue_huge_page
>SetPageHugeFreed()
>   spin_unlock_irq(hugetlb_lock)
> spin_lock_irq(hugetlb_lock)
>1) PageHuge() == F (freed by case#1 from 
> CPU0)
>  2) PageHuge() == T
>PageHugeFreed() == T
>- proceed with replacing the page
> 
> In the case above we retry as the race window is quite small and we have high
> chances to succeed next time.
> 
> With regard to the allocation, we restrict it to the node the page belongs
> to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
> 
> Note that gigantic hugetlb pages are fenced off since there is a cyclic
> dependency between them and alloc_contig_range.
> 
> Signed-off-by: Oscar Salvador 
> Acked-by: Michal Hocko 

Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz


Re: [PATCH v8 3/7] mm,hugetlb: Drop clearing of flag from prep_new_huge_page

2021-04-15 Thread Mike Kravetz
On 4/15/21 3:35 AM, Oscar Salvador wrote:
> Pages allocated after boot get their private field cleared by means
> of post_alloc_hook().
> Pages allocated during boot, that is directly from the memblock allocator,
> get cleared by paging_init()->..->memmap_init_zone->..->__init_single_page()
> before any memblock allocation.
> 
> Based on this ground, let us remove the clearing of the flag from
> prep_new_huge_page() as it is not needed.
> 
> Signed-off-by: Oscar Salvador 

The comment "allocated after boot" made sense to me, but I can see where
Michal's suggestion was coming from (list the allocators that do the
clearing).

Also, listing this as a left over would be a good idea.

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz


Re: [PATCH v7 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages

2021-04-14 Thread Mike Kravetz
On 4/13/21 9:52 PM, Oscar Salvador wrote:
> On Tue, Apr 13, 2021 at 03:48:53PM -0700, Mike Kravetz wrote:
>> The label free_new is:
>>
>> free_new:
>> spin_unlock_irq(&hugetlb_lock);
>> __free_pages(new_page, huge_page_order(h));
>>
>> return ret;
>>
>> So, we are locking and immediately unlocking without any code in
>> between.  Usually, I don't like multiple labels before return.
>> However, perhaps we should add another to avoid this unnecessary
>> cycle.  On the other hand, this is an uncommon race condition so the
>> simple code may be acceptable.
> 
> I guess we could have something like:
> 
>  free_new:
>  spin_unlock_irq(&hugetlb_lock);
>  free_new_nolock:
>  __free_pages(new_page, huge_page_order(h));
>  
>  return ret;
> 
> And let the retry go to there without locking. But as you said, the
> race condition is rare enough, so I am not sure if this buys us much.
> But I can certainly add it if you feel strong about it.

No strong feelings.  I am fine with it as is.

-- 
Mike Kravetz


Re: [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality

2021-04-14 Thread Mike Kravetz
On 4/13/21 9:59 PM, Oscar Salvador wrote:
> On Tue, Apr 13, 2021 at 02:33:41PM -0700, Mike Kravetz wrote:
>>> -static void prep_new_huge_page(struct hstate *h, struct page *page, int 
>>> nid)
>>> +/*
>>> + * Must be called with the hugetlb lock held
>>> + */
>>> +static void __prep_account_new_huge_page(struct hstate *h, int nid)
>>> +{
>>> +   h->nr_huge_pages++;
>>> +   h->nr_huge_pages_node[nid]++;
>>
>> I would prefer if we also move setting the destructor to this routine.
>>  set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
> 
> Uhm, but that is the routine that does the accounting, it feels wrong
> here, plus...
> 
>>
>> That way, PageHuge() will be false until it 'really' is a huge page.
>> If not, we could potentially go into that retry loop in
>> dissolve_free_huge_page or alloc_and_dissolve_huge_page in patch 5.
> 
> ...I do not follow here, could you please elaborate some more?
> Unless I am missing something, behaviour should not be any different with this
> patch.
> 

I was thinking of the time between the call to __prep_new_huge_page and
__prep_account_new_huge_page.  In that time, PageHuge() will be true but
the page is not yet fully being managed as a hugetlb page.  My thought
was that isolation, migration, offline or any code that does pfn
scanning might see the page as PageHuge() (after taking the lock) and start to
process it.

Now I see that in patch 5 you call both __prep_new_huge_page and
__prep_account_new_huge_page with the lock held.  So, no issue.  Sorry.

I 'think' there may still be a potential race with the prep_new_huge_page
routine, but that existed before any of your changes.  It may also be
that 'synchronization' at the pageblock level which prevents some of
these pfn scanning operations to operate on the same pageblocks may
prevent this from ever happening.

Mostly thinking out loud.  Upon further thought, I have no objections to
this change.  Sorry for the noise.

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz


Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock

2021-04-14 Thread Mike Kravetz
On 4/14/21 3:49 AM, Oscar Salvador wrote:
> On Wed, Apr 14, 2021 at 12:32:58PM +0200, Michal Hocko wrote:
>> Well, to be precise it does the very same thing with memmap struct pages
>> but that is before the initialization code you have pointed out above.
>> In this context it just poisons the allocated content which is the GB
>> page storage.
> 
> Right.
> 
>>> I checked, and when we get there in __alloc_bootmem_huge_page, 
>>> page->private is
>>> still zeroed, so I guess it should be safe to assume that we do not really 
>>> need
>>> to clear the flag in __prep_new_huge_page() routine?
>>
>> It would be quite nasty if the struct pages content would be undefined.
>> Maybe that is possible but then I would rather stick the initialization
>> into __alloc_bootmem_huge_page.
> 
> Yes, but I do not think that is really possible unless I missed something.
> Let us see what Mike thinks of it, if there are no objections, we can
> get rid of the clearing flag right there.
>  

Thanks for crawling through that code Oscar!

I do not think you missed anything.  Let's just get rid of the flag
clearing.
-- 
Mike Kravetz


Re: [PATCH v7 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig

2021-04-13 Thread Mike Kravetz
On 4/13/21 3:47 AM, Oscar Salvador wrote:
> pfn_range_valid_contig() bails out when it finds an in-use page or a
> hugetlb page, among other things.
> We can drop the in-use page check since __alloc_contig_pages can migrate
> away those pages, and the hugetlb page check can go too since
> isolate_migratepages_range is now capable of dealing with hugetlb pages.
> Either way, those checks are racy so let the end function handle it
> when the time comes.
> 
> Signed-off-by: Oscar Salvador 
> Suggested-by: David Hildenbrand 
> Reviewed-by: David Hildenbrand 
> ---
>  mm/page_alloc.c | 6 --
>  1 file changed, 6 deletions(-)

Acked-by: Mike Kravetz 
-- 
Mike Kravetz


Re: [PATCH v7 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages

2021-04-13 Thread Mike Kravetz
On 4/13/21 3:47 AM, Oscar Salvador wrote:
> alloc_contig_range() will fail if it finds a HugeTLB page within the range,
> without a chance to handle them. Since HugeTLB pages can be migrated as any
> LRU or Movable page, it does not make sense to bail out without trying.
> Enable the interface to recognize in-use HugeTLB pages so we can migrate
> them, and have much better chances to succeed the call.
> 
> Signed-off-by: Oscar Salvador 
> Reviewed-by: Mike Kravetz 
> Acked-by: Michal Hocko 

One small issue/question/request below.

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4a664d6e82c1..24a453ff47f2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2270,10 +2270,12 @@ static void restore_reserve_on_error(struct hstate *h,
>   * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old 
> one
>   * @h: struct hstate old page belongs to
>   * @old_page: Old page to dissolve
> + * @list: List to isolate the page in case we need to
>   * Returns 0 on success, otherwise negated error.
>   */
>  
> -static int alloc_and_dissolve_huge_page(struct hstate *h, struct page 
> *old_page)
> +static int alloc_and_dissolve_huge_page(struct hstate *h, struct page 
> *old_page,
> + struct list_head *list)
>  {
>   gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>   int nid = page_to_nid(old_page);
> @@ -2300,9 +2302,13 @@ static int alloc_and_dissolve_huge_page(struct hstate 
> *h, struct page *old_page)
>   goto free_new;
>   } else if (page_count(old_page)) {
>   /*
> -  * Someone has grabbed the page, fail for now.
> +  * Someone has grabbed the page, try to isolate it here.
> +  * Fail with -EBUSY if not possible.
>*/
> - ret = -EBUSY;
> + spin_unlock_irq(&hugetlb_lock);
> + if (!isolate_huge_page(old_page, list))
> + ret = -EBUSY;
> + spin_lock_irq(&hugetlb_lock);
>   goto free_new;

The label free_new is:

free_new:
spin_unlock_irq(&hugetlb_lock);
__free_pages(new_page, huge_page_order(h));

return ret;

So, we are locking and immediately unlocking without any code in
between.  Usually, I don't like multiple labels before return.
However, perhaps we should add another to avoid this unnecessary
cycle.  On the other hand, this is an uncommon race condition so the
simple code may be acceptable.
-- 
Mike Kravetz


Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages

2021-04-13 Thread Mike Kravetz
loc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
> + if (!new_page)
> + return -ENOMEM;
> +
> +retry:
> + spin_lock_irq(&hugetlb_lock);
> + if (!PageHuge(old_page)) {
> + /*
> +  * Freed from under us. Drop new_page too.
> +  */
> + goto free_new;
> + } else if (page_count(old_page)) {
> + /*
> +  * Someone has grabbed the page, fail for now.
> +  */
> + ret = -EBUSY;
> + goto free_new;
> + } else if (!HPageFreed(old_page)) {
> + /*
> +  * Page's refcount is 0 but it has not been enqueued in the
> +  * freelist yet. Race window is small, so we can succeed here if
> +  * we retry.
> +  */
> + spin_unlock_irq(&hugetlb_lock);
> + cond_resched();
> + goto retry;
> + } else {
> + /*
> +  * Ok, old_page is still a genuine free hugepage. Remove it from
> +  * the freelist and decrease the counters. These will be
> +  * incremented again when calling __prep_account_new_huge_page()
> +  * and enqueue_huge_page() for new_page. The counters will 
> remain
> +  * stable since this happens under the lock.
> +  */
> + remove_hugetlb_page(h, old_page, false);
> +
> + /*
> +  * Call __prep_new_huge_page() to construct the hugetlb page, 
> and
> +  * enqueue it then to place it in the freelists. After this,
> +  * counters are back on track. Free hugepages have a refcount 
> of 0,
> +  * so we need to decrease new_page's count as well.
> +  */
> + __prep_new_huge_page(new_page);
> + __prep_account_new_huge_page(h, nid);
> + page_ref_dec(new_page);
> + enqueue_huge_page(h, new_page);
> +
> + /*
> +  * Pages have been replaced, we can safely free the old one.
> +  */
> + spin_unlock_irq(&hugetlb_lock);
> + update_and_free_page(h, old_page);
> + }
> +
> + return ret;
> +
> +free_new:
> + spin_unlock_irq(&hugetlb_lock);
> + __free_pages(new_page, huge_page_order(h));
> +
> + return ret;
> +}
> +
> +int isolate_or_dissolve_huge_page(struct page *page)
> +{
> + struct hstate *h;
> + struct page *head;
> +
> + /*
> +  * The page might have been dissolved from under our feet, so make sure
> +  * to carefully check the state under the lock.
> +  * Return success when racing as if we dissolved the page ourselves.
> +  */
> + spin_lock_irq(&hugetlb_lock);
> + if (PageHuge(page)) {
> + head = compound_head(page);
> + h = page_hstate(head);
> + } else {
> + spin_unlock(&hugetlb_lock);

Should be spin_unlock_irq(&hugetlb_lock);

Other than that, it looks good.
-- 
Mike Kravetz


Re: [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality

2021-04-13 Thread Mike Kravetz
On 4/13/21 3:47 AM, Oscar Salvador wrote:
> Currently, prep_new_huge_page() performs two functions.
> It sets the right state for a new hugetlb, and increases the hstate's
> counters to account for the new page.
> 
> Let us split its functionality into two separate functions, decoupling
> the handling of the counters from initializing a hugepage.
> The outcome is having __prep_new_huge_page(), which only
> initializes the page, and __prep_account_new_huge_page(), which adds
> the new page to the hstate's counters.
> 
> This allows us to be able to set a hugetlb without having to worry
> about the counter/locking. It will prove useful in the next patch.
> prep_new_huge_page() still calls both functions.
> 
> Signed-off-by: Oscar Salvador 
> ---
>  mm/hugetlb.c | 19 ---
>  1 file changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e40d5fe5c63c..0607b2b71ac6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1483,7 +1483,16 @@ void free_huge_page(struct page *page)
>   }
>  }
>  
> -static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> +/*
> + * Must be called with the hugetlb lock held
> + */
> +static void __prep_account_new_huge_page(struct hstate *h, int nid)
> +{
> + h->nr_huge_pages++;
> + h->nr_huge_pages_node[nid]++;

I would prefer if we also move setting the destructor to this routine.
set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);

That way, PageHuge() will be false until it 'really' is a huge page.
If not, we could potentially go into that retry loop in
dissolve_free_huge_page or alloc_and_dissolve_huge_page in patch 5.
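
i.e. the routine would end up looking roughly like this (sketch only;
it would also mean passing the page into the accounting helper):

	/*
	 * Must be called with the hugetlb lock held
	 */
	static void __prep_account_new_huge_page(struct hstate *h,
						 struct page *page, int nid)
	{
		set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
		h->nr_huge_pages++;
		h->nr_huge_pages_node[nid]++;
	}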
-- 
Mike Kravetz

> +}
> +
> +static void __prep_new_huge_page(struct page *page)
>  {
>   INIT_LIST_HEAD(&page->lru);
>   set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
> @@ -1491,9 +1500,13 @@ static void prep_new_huge_page(struct hstate *h, 
> struct page *page, int nid)
>   set_hugetlb_cgroup(page, NULL);
>   set_hugetlb_cgroup_rsvd(page, NULL);
>   ClearHPageFreed(page);
> +}
> +
> +static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> +{
> + __prep_new_huge_page(page);
> + spin_lock_irq(&hugetlb_lock);
> - h->nr_huge_pages++;
> - h->nr_huge_pages_node[nid]++;
> + __prep_account_new_huge_page(h, nid);
> + spin_unlock_irq(&hugetlb_lock);
>  }
>  
> 


Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock

2021-04-13 Thread Mike Kravetz
On 4/13/21 6:23 AM, Michal Hocko wrote:
> On Tue 13-04-21 12:47:43, Oscar Salvador wrote:
>> Currently, the clearing of the flag is done under the lock, but this
>> is unnecessary as we just allocated the page and we did not give it
>> away yet, so no one should be messing with it.
>>
>> Also, this helps making clear that here the lock is only protecting the
>> counter.
> 
> While moving the flag clearing is ok I am wondering why do we need that
> in the first place. I think it is just a leftover from 6c0371490140
> ("hugetlb: convert PageHugeFreed to HPageFreed flag"). Prior to that a tail
> page has been used to keep track of the state but now all happens in the
> head page and the flag uses page->private which is always initialized
> when allocated by the allocator (post_alloc_hook).

Yes, this was just left over from 6c0371490140.  And, as you mention
post_alloc_hook will clear page->private for all non-gigantic pages
allocated via buddy.

> Or do we need it for giga pages which are not allocated by the page
> allocator? If yes then moving it to prep_compound_gigantic_page would be
> better.

I am pretty sure dynamically allocated giga pages have page->private
cleared as well.  It is not obvious, but the alloc_contig_range code
used to put together giga pages will end up calling isolate_freepages_range
that will call split_map_pages and then post_alloc_hook for each MAX_ORDER
block.  As mentioned, this is not obvious and I would hate to rely on this
behavior not changing.

> 
> So should we just drop it here?

The only place where page->private may not be initialized is when we do
allocations at boot time from memblock.  In this case, we will add the
pages to the free list via put_page/free_huge_page so the appropriate
flags will be cleared before anyone notices.

I'm wondering if we should just do a set_page_private(page, 0) here in
prep_new_huge_page since we now use that field for flags.  Or, is that
overkill?
-- 
Mike Kravetz


Re: [PATCH v7 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes

2021-04-13 Thread Mike Kravetz
On 4/13/21 3:47 AM, Oscar Salvador wrote:
> Currently, isolate_migratepages_{range,block} and their callers use
> a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was
> any error during isolation.
> This does not work as soon as we need to start reporting different error
> codes and make sure we pass them down the chain, so they are properly
> interpreted by functions like e.g: alloc_contig_range.
> 
> Let us rework isolate_migratepages_{range,block} so we can report error
> codes.
> Since isolate_migratepages_block will stop returning the next pfn to be
> scanned, we reuse the cc->migrate_pfn field to keep track of that.
> 
> Signed-off-by: Oscar Salvador 
> Acked-by: Vlastimil Babka 

Acked-by: Mike Kravetz 

-- 
Mike Kravetz


Re: [PATCH v7 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range

2021-04-13 Thread Mike Kravetz
On 4/13/21 3:47 AM, Oscar Salvador wrote:
> Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or 
> -EBUSY,
> and report them down the chain.
> The problem is that when migrate_pages() reports -ENOMEM, we keep going till 
> we
> exhaust all the try-attempts (5 at the moment) instead of bailing out.
> 
> migrate_pages() bails out right away on -ENOMEM because it is considered a 
> fatal
> error. Do the same here instead of continuing and retrying.
> Note that this is not fixing a real issue, just a cosmetic change, although we
> can save some cycles by backing off earlier.
> 
> Signed-off-by: Oscar Salvador 
> Acked-by: Vlastimil Babka 
> Reviewed-by: David Hildenbrand 
> Acked-by: Michal Hocko 

Acked-by: Mike Kravetz 

-- 
Mike Kravetz


Re: [PATCH v2 5/5] mm/hugetlb: remove unused variable pseudo_vma in remove_inode_hugepages()

2021-04-12 Thread Mike Kravetz
On 4/10/21 12:23 AM, Miaohe Lin wrote:
> The local variable pseudo_vma is not used anymore.
> 
> Signed-off-by: Miaohe Lin 

Thanks,

That should have been removed with 1b426bac66e6 ("hugetlb: use same fault
hash key for shared and private mappings").

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz


Re: [PATCH v2 4/5] mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts()

2021-04-12 Thread Mike Kravetz
On 4/10/21 12:23 AM, Miaohe Lin wrote:
> A rare out of memory error would prevent removal of the reserve map region
> for a page. hugetlb_fix_reserve_counts() handles this rare case to avoid
> dangling with incorrect counts. Unfortunately, hugepage_subpool_get_pages
> and hugetlb_acct_memory could possibly fail too. We should correctly handle
> these cases.
> 
> Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
> Signed-off-by: Miaohe Lin 

Thanks,

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz


Re: [PATCH v2 3/5] mm/hugeltb: clarify (chg - freed) won't go negative in hugetlb_unreserve_pages()

2021-04-12 Thread Mike Kravetz
On 4/10/21 12:23 AM, Miaohe Lin wrote:
> The resv_map could be NULL since this routine can be called in the evict
> inode path for all hugetlbfs inodes and we will have chg = 0 in this case.
> But (chg - freed) won't go negative as Mike pointed out:
> 
>  "If resv_map is NULL, then no hugetlb pages can be allocated/associated
>   with the file.  As a result, remove_inode_hugepages will never find any
>   huge pages associated with the inode and the passed value 'freed' will
>   always be zero."
> 
> Add a comment clarifying this to make it clear and also avoid confusion.
> 
> Signed-off-by: Miaohe Lin 

Thanks,

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz


Re: [PATCH v2 2/5] mm/hugeltb: simplify the return code of __vma_reservation_common()

2021-04-12 Thread Mike Kravetz
On 4/10/21 12:23 AM, Miaohe Lin wrote:
> It's guaranteed that the vma is associated with a resv_map, i.e. either
> VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would
> have returned via !resv check above. So it's unneeded to check whether
> HPAGE_RESV_OWNER is set here. Simplify the return code to make it more
> clear.
> 
> Signed-off-by: Miaohe Lin 

Thanks,

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz


Re: [PATCH v5 4/8] hugetlb: create remove_hugetlb_page() to separate functionality

2021-04-12 Thread Mike Kravetz
On 4/12/21 12:33 AM, Oscar Salvador wrote:
> On Fri, Apr 09, 2021 at 01:52:50PM -0700, Mike Kravetz wrote:
>> The new remove_hugetlb_page() routine is designed to remove a hugetlb
>> page from hugetlbfs processing.  It will remove the page from the active
>> or free list, update global counters and set the compound page
>> destructor to NULL so that PageHuge() will return false for the 'page'.
>> After this call, the 'page' can be treated as a normal compound page or
>> a collection of base size pages.
>>
>> update_and_free_page no longer decrements h->nr_huge_pages{_node} as
>> this is performed in remove_hugetlb_page.  The only functionality
>> performed by update_and_free_page is to free the base pages to the lower
>> level allocators.
>>
>> update_and_free_page is typically called after remove_hugetlb_page.
>>
>> remove_hugetlb_page is to be called with the hugetlb_lock held.
>>
>> Creating this routine and separating functionality is in preparation for
>> restructuring code to reduce lock hold times.  This commit should not
>> introduce any changes to functionality.
>>
>> Signed-off-by: Mike Kravetz 
>> Acked-by: Michal Hocko 
>> Reviewed-by: Miaohe Lin 
>> Reviewed-by: Muchun Song 
> 
> Reviewed-by: Oscar Salvador 
> 
> A "nit" below:
> 
>>  static void update_and_free_page(struct hstate *h, struct page *page)
>>  {
>>  int i;
>> @@ -1334,8 +1369,6 @@ static void update_and_free_page(struct hstate *h, 
>> struct page *page)
> 
> After this, update_and_free_page()'s job is to reset subpage's flags and free
> the page.
> Maybe we want to rename that function at some point, or maybe not as "update" 
> might
> already imply that. Just speaking out loud.

Thanks Oscar,

I did not think about a name change as the routine is still "updating"
subpages before freeing.  We can certainly keep this in mind in the future,
especially if there are more functionality changes.
-- 
Mike Kravetz


Re: [PATCH 0/9] userfaultfd: add minor fault handling for shmem

2021-04-09 Thread Mike Kravetz
On 4/9/21 2:18 PM, Peter Xu wrote:
> On Fri, Apr 09, 2021 at 10:03:53AM -0700, Axel Rasmussen wrote:
>> On Thu, Apr 8, 2021 at 10:04 PM Andrew Morton  
>> wrote:
>>>
>>> On Thu,  8 Apr 2021 16:43:18 -0700 Axel Rasmussen 
>>>  wrote:
>>>
>>>> The idea is that it will apply cleanly to akpm's tree, *replacing* the 
>>>> following
>>>> patches (i.e., drop these first, and then apply this series):
>>>>
>>>> userfaultfd-support-minor-fault-handling-for-shmem.patch
>>>> userfaultfd-support-minor-fault-handling-for-shmem-fix.patch
>>>> userfaultfd-support-minor-fault-handling-for-shmem-fix-2.patch
>>>> userfaultfd-support-minor-fault-handling-for-shmem-fix-3.patch
>>>> userfaultfd-support-minor-fault-handling-for-shmem-fix-4.patch
>>>> userfaultfd-selftests-use-memfd_create-for-shmem-test-type.patch
>>>> userfaultfd-selftests-create-alias-mappings-in-the-shmem-test.patch
>>>> userfaultfd-selftests-reinitialize-test-context-in-each-test.patch
>>>> userfaultfd-selftests-exercise-minor-fault-handling-shmem-support.patch
>>>
>>> Well.  the problem is,
>>>
>>>> + if (area_alias == MAP_FAILED)
>>>> + err("mmap of memfd alias failed");
>>>
>>> `err' doesn't exist until eleventy patches later, in Peter's
>>> "userfaultfd/selftests: unify error handling".  I got tired of (and
>>> lost confidence in) replacing "err(...)" with "fprintf(stderr, ...);
>>> exit(1)" everywhere then fixing up the fallout when Peter's patch came
>>> along.  Shudder.
>>
>> Oof - sorry about that!
>>
>>>
>>> Sorry, all this material pretty clearly isn't going to make 5.12
>>> (potentially nine days hence), so I shall drop all the userfaultfd
>>> patches.  Let's take a fresh run at all of this after -rc1.
>>
>> That's okay, my understanding was already that it certainly wouldn't
>> be in the 5.12 release, but that we might be ready in time for 5.13.
>>
>>>
>>>
>>> I have tentatively retained the first series:
>>>
>>> userfaultfd-add-minor-fault-registration-mode.patch
>>> userfaultfd-add-minor-fault-registration-mode-fix.patch
>>> userfaultfd-disable-huge-pmd-sharing-for-minor-registered-vmas.patch
>>> userfaultfd-hugetlbfs-only-compile-uffd-helpers-if-config-enabled.patch
>>> userfaultfd-add-uffdio_continue-ioctl.patch
>>> userfaultfd-update-documentation-to-describe-minor-fault-handling.patch
>>> userfaultfd-selftests-add-test-exercising-minor-fault-handling.patch
>>>
>>> but I don't believe they have had much testing standalone, without the
>>> other userfaultfd patches present.  So I don't think it's smart to
>>> upstream these in this cycle.  Or I could drop them so you and Peter
>>> can have a clean shot at redoing the whole thing.  Please let me know.
>>
>> From my perspective, both Peter's error handling and the hugetlbfs
>> minor faulting patches are ready to go. (Peter's most importantly; we
>> should establish that as a base, and put all the burden on resolving
>> conflicts with it on us instead of you :).)
>>
>> My memory was that Peter's patch was applied before my shmem series,
>> but it seems I was mistaken. So, maybe the best thing to do is to have
>> Peter send a version of it based on your tree, without the shmem
>> series? And then I'll resolve any conflicts in my tree?
>>
>> It's true that we haven't tested the hugetlbfs minor faults patch
>> extensively *with the shmem one also applied*, but it has had more
>> thorough review than the shmem one at this point (e.g. by Mike
>> Kravetz), and they're rather separate code paths (I'd be surprised if
>> one breaks the other).
> 
> Yes I think the hugetlb part should have got more review done.  IMHO it's a
> matter of whether Mike would still like to do a more thorough review, or seems
> okay to keep them.

I looked pretty closely at the hugetlb specific parts of the minor fault
handling series.  I only took a high level look at the code modifying and
dealing with the userfaultfd API.  The hugetlb specific parts looked fine
to me.  I can take a closer look at the userfaultfd API modifications,
but it would take more time for me to come up to speed on the APIs.
-- 
Mike Kravetz


[PATCH v5 0/8] make hugetlb put_page safe for all calling contexts

2021-04-09 Thread Mike Kravetz
IMPORTANT NOTE FOR REVIEWERS:  Andrew has removed Oscar Salvador's series
"Make alloc_contig_range handle Hugetlb pages" so that this series can
go in first.  Most issues discussed in v4 of this series do not apply
until Oscar's series is added and will be addressed then.  This could be
more accurately described as v3.2.  Changes from v3 only include:
- Trivial context changes
- Oscar's suggestions to move some VM_BUG_ON_PAGE calls and remove
  unnecessary HPage flag clearing in remove_hugetlb_page.
- Add a missing spin_lock to spin_lock_irq conversion in
  set_max_huge_pages.
- Acked-by: and Reviewed-by: tags from v3 remain with those from v4 that
  also apply.

Original cover letter follows:
This effort is the result of a recent bug report [1].  Syzbot found a
potential deadlock in the hugetlb put_page/free_huge_page_path.
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
Since the free_huge_page_path already has code to 'hand off' page
free requests to a workqueue, a suggestion was proposed to make
the in_irq() detection accurate by always enabling PREEMPT_COUNT [2].
The outcome of that discussion was that the hugetlb put_page path
(free_huge_page) path should be properly fixed and safe for all calling
contexts.

This series is based on v5.12-rc3-mmotm-2021-03-17-22-24.  At a high
level, the series provides:
- Patches 1 & 2 change CMA bitmap mutex to an irq safe spinlock
- Patch 3 adds a mutex for proc/sysfs interfaces changing hugetlb counts
- Patches 4, 5 & 6 are aimed at reducing lock hold times.  To be clear
  the goal is to eliminate single lock hold times of a long duration.
  Overall lock hold time is not addressed.
- Patch 7 makes hugetlb_lock and subpool lock IRQ safe.  It also reverts
  the code which defers calls to a workqueue if !in_task.
- Patch 8 adds some lockdep_assert_held() calls

[1] https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/
[2] http://lkml.kernel.org/r/20210311021321.127500-1-mike.krav...@oracle.com

v4 -> v5
- Do not take the series "Make alloc_contig_range handle Hugetlb pages"
  into account.  It will be added after this series.
- In remove_hugetlb_page, move VM_BUG_ON_PAGE calls and remove
  unnecessary HPage flag clearing as suggested by Oscar.
- Add all collected Acked-by: and Reviewed-by:

v3 -> v4
- Add changes needed for the series "Make alloc_contig_range handle
  Hugetlb pages"

v2 -> v3
- Update commit message in patch 1 as suggested by Michal
- Do not use spin_lock_irqsave/spin_unlock_irqrestore when we know we
  are in task context as suggested by Michal
- Remove unnecessary INIT_LIST_HEAD() as suggested by Muchun

v1 -> v2
- Drop Roman's cma_release_nowait() patches and just change CMA mutex
  to an IRQ safe spinlock.
- Cleanups to variable names, comments and commit messages as suggested
  by Michal, Oscar, Miaohe and Muchun.
- Dropped unnecessary INIT_LIST_HEAD as suggested by Michal and list_del
  as suggested by Muchun.
- Created update_and_free_pages_bulk helper as suggested by Michal.
- Rebased on v5.12-rc4-mmotm-2021-03-28-16-37
- Added Acked-by: and Reviewed-by: from v1

RFC -> v1
- Add Roman's cma_release_nowait() patches.  This eliminated the need
  to do a workqueue handoff in hugetlb code.
- Use Michal's suggestion to batch pages for freeing.  This eliminated
  the need to recalculate loop control variables when dropping the lock.
- Added lockdep_assert_held() calls
- Rebased to v5.12-rc3-mmotm-2021-03-17-22-24

Mike Kravetz (8):
  mm/cma: change cma mutex to irq safe spinlock
  hugetlb: no need to drop hugetlb_lock to call cma_release
  hugetlb: add per-hstate mutex to synchronize user adjustments
  hugetlb: create remove_hugetlb_page() to separate functionality
  hugetlb: call update_and_free_page without hugetlb_lock
  hugetlb: change free_pool_huge_page to remove_pool_huge_page
  hugetlb: make free_huge_page irq safe
  hugetlb: add lockdep_assert_held() calls for hugetlb_lock

 include/linux/hugetlb.h |   1 +
 mm/cma.c|  18 +--
 mm/cma.h|   2 +-
 mm/cma_debug.c  |   8 +-
 mm/hugetlb.c| 337 +---
 mm/hugetlb_cgroup.c |   8 +-
 6 files changed, 194 insertions(+), 180 deletions(-)

-- 
2.30.2



[PATCH v5 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page

2021-04-09 Thread Mike Kravetz
free_pool_huge_page was called with hugetlb_lock held.  It would remove
a hugetlb page, and then free the corresponding pages to the lower level
allocators such as buddy.  free_pool_huge_page was called in a loop to
remove hugetlb pages and these loops could hold the hugetlb_lock for a
considerable time.

Create new routine remove_pool_huge_page to replace free_pool_huge_page.
remove_pool_huge_page will remove the hugetlb page, and it must be
called with the hugetlb_lock held.  It will return the removed page and
it is the responsibility of the caller to free the page to the lower
level allocators.  The hugetlb_lock is dropped before freeing to these
allocators which results in shorter lock hold times.

Add new helper routine to call update_and_free_page for a list of pages.

Note: Some changes to the routine return_unused_surplus_pages are in
need of explanation.  Commit e5bbc8a6c992 ("mm/hugetlb.c: fix reservation
race when freeing surplus pages") modified this routine to address a
race which could occur when dropping the hugetlb_lock in the loop that
removes pool pages.  Accounting changes introduced in that commit were
subtle and took some thought to understand.  This commit removes the
cond_resched_lock() and the potential race.  Therefore, remove the
subtle code and restore the more straight forward accounting effectively
reverting the commit.

Signed-off-by: Mike Kravetz 
Reviewed-by: Muchun Song 
Acked-by: Michal Hocko 
Reviewed-by: Oscar Salvador 
---
 mm/hugetlb.c | 93 
 1 file changed, 51 insertions(+), 42 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d3e5e49bf687..d4872303a76a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1204,7 +1204,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
 }
 
 /*
- * helper for free_pool_huge_page() - return the previously saved
+ * helper for remove_pool_huge_page() - return the previously saved
  * node ["this node"] from which to free a huge page.  Advance the
  * next node id whether or not we find a free huge page to free so
  * that the next attempt to free addresses the next node.
@@ -1384,6 +1384,16 @@ static void update_and_free_page(struct hstate *h, 
struct page *page)
}
 }
 
+static void update_and_free_pages_bulk(struct hstate *h, struct list_head 
*list)
+{
+   struct page *page, *t_page;
+
+   list_for_each_entry_safe(page, t_page, list, lru) {
+   update_and_free_page(h, page);
+   cond_resched();
+   }
+}
+
 struct hstate *size_to_hstate(unsigned long size)
 {
struct hstate *h;
@@ -1714,16 +1724,18 @@ static int alloc_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
 }
 
 /*
- * Free huge page from pool from next node to free.
- * Attempt to keep persistent huge pages more or less
- * balanced over allowed nodes.
+ * Remove huge page from pool from next node to free.  Attempt to keep
+ * persistent huge pages more or less balanced over allowed nodes.
+ * This routine only 'removes' the hugetlb page.  The caller must make
+ * an additional call to free the page to low level allocators.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
-bool acct_surplus)
+static struct page *remove_pool_huge_page(struct hstate *h,
+   nodemask_t *nodes_allowed,
+bool acct_surplus)
 {
int nr_nodes, node;
-   int ret = 0;
+   struct page *page = NULL;
 
for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
/*
@@ -1732,23 +1744,14 @@ static int free_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
 */
if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
!list_empty(&h->hugepage_freelists[node])) {
-   struct page *page =
-   list_entry(h->hugepage_freelists[node].next,
+   page = list_entry(h->hugepage_freelists[node].next,
  struct page, lru);
remove_hugetlb_page(h, page, acct_surplus);
-   /*
-* unlock/lock around update_and_free_page is temporary
-* and will be removed with subsequent patch.
-*/
-   spin_unlock(&hugetlb_lock);
-   update_and_free_page(h, page);
-   spin_lock(&hugetlb_lock);
-   ret = 1;
break;
}
}
 
-   return ret;
+   return page;
 }
 
 /*
@@ -2068,17 +2071,16 @@ static int gather_surplus_pages(struct hstate *h, long 
delta)
  *to the associated reservation map.
  * 2) Free any unused surplus pages

[PATCH v5 7/8] hugetlb: make free_huge_page irq safe

2021-04-09 Thread Mike Kravetz
Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
non-task context") was added to address the issue of free_huge_page
being called from irq context.  That commit hands off free_huge_page
processing to a workqueue if !in_task.  However, this doesn't cover
all the cases as pointed out by 0day bot lockdep report [1].

:  Possible interrupt unsafe locking scenario:
:
:        CPU0                    CPU1
:
:   lock(hugetlb_lock);
:                                local_irq_disable();
:                                lock(slock-AF_INET);
:                                lock(hugetlb_lock);
:   <Interrupt>
:     lock(slock-AF_INET);

Shakeel has later explained that this is very likely TCP TX zerocopy
from hugetlb pages scenario when the networking code drops a last
reference to hugetlb page while having IRQ disabled. Hugetlb freeing
path doesn't disable IRQ while holding hugetlb_lock so a lock dependency
chain can lead to a deadlock.

This commit addresses the issue by doing the following:
- Make hugetlb_lock irq safe.  This is mostly a simple process of
  changing spin_*lock calls to spin_*lock_irq* calls.
- Make subpool lock irq safe in a similar manner.
- Revert the !in_task check and workqueue handoff.

[1] https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Muchun Song 
Reviewed-by: Oscar Salvador 
---
 mm/hugetlb.c| 169 +---
 mm/hugetlb_cgroup.c |   8 +--
 2 files changed, 67 insertions(+), 110 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d4872303a76a..049ca0bccfcc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -93,9 +93,10 @@ static inline bool subpool_is_free(struct hugepage_subpool 
*spool)
return true;
 }
 
-static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
+static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
+   unsigned long irq_flags)
 {
-   spin_unlock(&spool->lock);
+   spin_unlock_irqrestore(&spool->lock, irq_flags);
 
/* If no pages are used, and no other handles to the subpool
 * remain, give up any reservations based on minimum size and
@@ -134,10 +135,12 @@ struct hugepage_subpool *hugepage_new_subpool(struct 
hstate *h, long max_hpages,
 
 void hugepage_put_subpool(struct hugepage_subpool *spool)
 {
-   spin_lock(&spool->lock);
+   unsigned long flags;
+
+   spin_lock_irqsave(&spool->lock, flags);
BUG_ON(!spool->count);
spool->count--;
-   unlock_or_release_subpool(spool);
+   unlock_or_release_subpool(spool, flags);
 }
 
 /*
@@ -156,7 +159,7 @@ static long hugepage_subpool_get_pages(struct 
hugepage_subpool *spool,
if (!spool)
return ret;
 
-   spin_lock(&spool->lock);
+   spin_lock_irq(&spool->lock);
 
if (spool->max_hpages != -1) {  /* maximum size accounting */
if ((spool->used_hpages + delta) <= spool->max_hpages)
@@ -183,7 +186,7 @@ static long hugepage_subpool_get_pages(struct 
hugepage_subpool *spool,
}
 
 unlock_ret:
-   spin_unlock(&spool->lock);
+   spin_unlock_irq(&spool->lock);
return ret;
 }
 
@@ -197,11 +200,12 @@ static long hugepage_subpool_put_pages(struct 
hugepage_subpool *spool,
   long delta)
 {
long ret = delta;
+   unsigned long flags;
 
if (!spool)
return delta;
 
-   spin_lock(&spool->lock);
+   spin_lock_irqsave(&spool->lock, flags);
 
if (spool->max_hpages != -1)/* maximum size accounting */
spool->used_hpages -= delta;
@@ -222,7 +226,7 @@ static long hugepage_subpool_put_pages(struct 
hugepage_subpool *spool,
 * If hugetlbfs_put_super couldn't free spool due to an outstanding
 * quota reference, free it now.
 */
-   unlock_or_release_subpool(spool);
+   unlock_or_release_subpool(spool, flags);
 
return ret;
 }
@@ -1405,7 +1409,7 @@ struct hstate *size_to_hstate(unsigned long size)
return NULL;
 }
 
-static void __free_huge_page(struct page *page)
+void free_huge_page(struct page *page)
 {
/*
 * Can't pass hstate in here because it is called from the
@@ -1415,6 +1419,7 @@ static void __free_huge_page(struct page *page)
int nid = page_to_nid(page);
struct hugepage_subpool *spool = hugetlb_page_subpool(page);
bool restore_reserve;
+   unsigned long flags;
 
VM_BUG_ON_PAGE(page_count(page), page);
VM_BUG_ON_PAGE(page_mapcount(page), page);
@@ -1443,7 +1448,7 @@ static void __free_huge_page(struct page *page)
restore_reserve = true;
}
 
-   spin_lock(&hugetlb_lock);
+   spin_lock_irqsave(&hugetlb_lock, flags);
ClearHPageMigratable(page);

[PATCH v5 5/8] hugetlb: call update_and_free_page without hugetlb_lock

2021-04-09 Thread Mike Kravetz
With the introduction of remove_hugetlb_page(), there is no need for
update_and_free_page to hold the hugetlb lock.  Change all callers to
drop the lock before calling.

With additional code modifications, this will allow loops which decrease
the huge page pool to drop the hugetlb_lock with each page to reduce
long hold times.

The ugly unlock/lock cycle in free_pool_huge_page will be removed in
a subsequent patch which restructures free_pool_huge_page.

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Muchun Song 
Reviewed-by: Miaohe Lin 
Reviewed-by: Oscar Salvador 
---
 mm/hugetlb.c | 31 ++-
 1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 773081709916..d3e5e49bf687 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1444,16 +1444,18 @@ static void __free_huge_page(struct page *page)
 
if (HPageTemporary(page)) {
remove_hugetlb_page(h, page, false);
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
remove_hugetlb_page(h, page, true);
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
} else {
arch_clear_hugepage_flags(page);
enqueue_huge_page(h, page);
+   spin_unlock(&hugetlb_lock);
}
-   spin_unlock(&hugetlb_lock);
 }
 
 /*
@@ -1734,7 +1736,13 @@ static int free_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
list_entry(h->hugepage_freelists[node].next,
  struct page, lru);
remove_hugetlb_page(h, page, acct_surplus);
+   /*
+* unlock/lock around update_and_free_page is temporary
+* and will be removed with subsequent patch.
+*/
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
+   spin_lock(&hugetlb_lock);
ret = 1;
break;
}
@@ -1803,8 +1811,9 @@ int dissolve_free_huge_page(struct page *page)
}
remove_hugetlb_page(h, page, false);
h->max_huge_pages--;
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, head);
-   rc = 0;
+   return 0;
}
 out:
spin_unlock(&hugetlb_lock);
@@ -2557,22 +2566,34 @@ static void try_to_free_low(struct hstate *h, unsigned 
long count,
nodemask_t *nodes_allowed)
 {
int i;
+   struct page *page, *next;
+   LIST_HEAD(page_list);
 
if (hstate_is_gigantic(h))
return;
 
+   /*
+* Collect pages to be freed on a list, and free after dropping lock
+*/
for_each_node_mask(i, *nodes_allowed) {
-   struct page *page, *next;
struct list_head *freel = &h->hugepage_freelists[i];
list_for_each_entry_safe(page, next, freel, lru) {
if (count >= h->nr_huge_pages)
-   return;
+   goto out;
if (PageHighMem(page))
continue;
remove_hugetlb_page(h, page, false);
-   update_and_free_page(h, page);
+   list_add(&page->lru, &page_list);
}
}
+
+out:
+   spin_unlock(&hugetlb_lock);
+   list_for_each_entry_safe(page, next, &page_list, lru) {
+   update_and_free_page(h, page);
+   cond_resched();
+   }
+   spin_lock(&hugetlb_lock);
 }
 #else
 static inline void try_to_free_low(struct hstate *h, unsigned long count,
-- 
2.30.2



[PATCH v5 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock

2021-04-09 Thread Mike Kravetz
After making hugetlb lock irq safe and separating some functionality
done under the lock, add some lockdep_assert_held to help verify
locking.

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Miaohe Lin 
Reviewed-by: Muchun Song 
Reviewed-by: Oscar Salvador 
---
 mm/hugetlb.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 049ca0bccfcc..5cf2b7e5ca50 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1062,6 +1062,8 @@ static bool vma_has_reserves(struct vm_area_struct *vma, 
long chg)
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
int nid = page_to_nid(page);
+
+   lockdep_assert_held(&hugetlb_lock);
list_move(>lru, >hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
@@ -1073,6 +1075,7 @@ static struct page *dequeue_huge_page_node_exact(struct 
hstate *h, int nid)
struct page *page;
bool pin = !!(current->flags & PF_MEMALLOC_PIN);
 
+   lockdep_assert_held(&hugetlb_lock);
list_for_each_entry(page, >hugepage_freelists[nid], lru) {
if (pin && !is_pinnable_page(page))
continue;
@@ -1344,6 +1347,7 @@ static void remove_hugetlb_page(struct hstate *h, struct 
page *page,
VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
 
+   lockdep_assert_held(&hugetlb_lock);
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
 
@@ -1694,6 +1698,7 @@ static struct page *remove_pool_huge_page(struct hstate 
*h,
int nr_nodes, node;
struct page *page = NULL;
 
+   lockdep_assert_held(&hugetlb_lock);
for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
/*
 * If we're returning unused surplus pages, only examine
@@ -1943,6 +1948,7 @@ static int gather_surplus_pages(struct hstate *h, long 
delta)
long needed, allocated;
bool alloc_ok = true;
 
+   lockdep_assert_held(&hugetlb_lock);
needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
if (needed <= 0) {
h->resv_huge_pages += delta;
@@ -2036,6 +2042,7 @@ static void return_unused_surplus_pages(struct hstate *h,
struct page *page;
LIST_HEAD(page_list);
 
+   lockdep_assert_held(&hugetlb_lock);
/* Uncommit the reservation */
h->resv_huge_pages -= unused_resv_pages;
 
@@ -2524,6 +2531,7 @@ static void try_to_free_low(struct hstate *h, unsigned 
long count,
int i;
LIST_HEAD(page_list);
 
+   lockdep_assert_held(&hugetlb_lock);
if (hstate_is_gigantic(h))
return;
 
@@ -2565,6 +2573,7 @@ static int adjust_pool_surplus(struct hstate *h, 
nodemask_t *nodes_allowed,
 {
int nr_nodes, node;
 
+   lockdep_assert_held(&hugetlb_lock);
VM_BUG_ON(delta != -1 && delta != 1);
 
if (delta < 0) {
-- 
2.30.2



[PATCH v5 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release

2021-04-09 Thread Mike Kravetz
Now that cma_release is non-blocking and irq safe, there is no need to
drop hugetlb_lock before calling.

Signed-off-by: Mike Kravetz 
Acked-by: Roman Gushchin 
Acked-by: Michal Hocko 
Reviewed-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
---
 mm/hugetlb.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 07a95b8623ee..3a10b96a2124 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1348,14 +1348,8 @@ static void update_and_free_page(struct hstate *h, 
struct page *page)
set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
set_page_refcounted(page);
if (hstate_is_gigantic(h)) {
-   /*
-* Temporarily drop the hugetlb_lock, because
-* we might block in free_gigantic_page().
-*/
-   spin_unlock(&hugetlb_lock);
destroy_compound_gigantic_page(page, huge_page_order(h));
free_gigantic_page(page, huge_page_order(h));
-   spin_lock(&hugetlb_lock);
} else {
__free_pages(page, huge_page_order(h));
}
-- 
2.30.2



[PATCH v5 4/8] hugetlb: create remove_hugetlb_page() to separate functionality

2021-04-09 Thread Mike Kravetz
The new remove_hugetlb_page() routine is designed to remove a hugetlb
page from hugetlbfs processing.  It will remove the page from the active
or free list, update global counters and set the compound page
destructor to NULL so that PageHuge() will return false for the 'page'.
After this call, the 'page' can be treated as a normal compound page or
a collection of base size pages.

update_and_free_page no longer decrements h->nr_huge_pages{_node} as
this is performed in remove_hugetlb_page.  The only functionality
performed by update_and_free_page is to free the base pages to the lower
level allocators.

update_and_free_page is typically called after remove_hugetlb_page.

remove_hugetlb_page is to be called with the hugetlb_lock held.

Creating this routine and separating functionality is in preparation for
restructuring code to reduce lock hold times.  This commit should not
introduce any changes to functionality.
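
For illustration, the calling pattern this separation is meant to enable
(and which later patches in the series switch to) is roughly:

    spin_lock(&hugetlb_lock);
    remove_hugetlb_page(h, page, false);    /* page no longer a hugetlb page */
    spin_unlock(&hugetlb_lock);
    update_and_free_page(h, page);          /* free base pages without the lock */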

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Miaohe Lin 
Reviewed-by: Muchun Song 
---
 mm/hugetlb.c | 65 
 1 file changed, 40 insertions(+), 25 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c8799a480784..773081709916 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1326,6 +1326,41 @@ static inline void destroy_compound_gigantic_page(struct 
page *page,
unsigned int order) { }
 #endif
 
+/*
+ * Remove hugetlb page from lists, and update dtor so that page appears
+ * as just a compound page.  A reference is held on the page.
+ *
+ * Must be called with hugetlb lock held.
+ */
+static void remove_hugetlb_page(struct hstate *h, struct page *page,
+   bool adjust_surplus)
+{
+   int nid = page_to_nid(page);
+
+   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
+   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
+
+   if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
+   return;
+
+   list_del(&page->lru);
+
+   if (HPageFreed(page)) {
+   h->free_huge_pages--;
+   h->free_huge_pages_node[nid]--;
+   }
+   if (adjust_surplus) {
+   h->surplus_huge_pages--;
+   h->surplus_huge_pages_node[nid]--;
+   }
+
+   set_page_refcounted(page);
+   set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
+
+   h->nr_huge_pages--;
+   h->nr_huge_pages_node[nid]--;
+}
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
int i;
@@ -1334,8 +1369,6 @@ static void update_and_free_page(struct hstate *h, struct 
page *page)
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
 
-   h->nr_huge_pages--;
-   h->nr_huge_pages_node[page_to_nid(page)]--;
for (i = 0; i < pages_per_huge_page(h);
 i++, subpage = mem_map_next(subpage, page, i)) {
subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
@@ -1343,10 +1376,6 @@ static void update_and_free_page(struct hstate *h, 
struct page *page)
1 << PG_active | 1 << PG_private |
1 << PG_writeback);
}
-   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
-   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
-   set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
-   set_page_refcounted(page);
if (hstate_is_gigantic(h)) {
destroy_compound_gigantic_page(page, huge_page_order(h));
free_gigantic_page(page, huge_page_order(h));
@@ -1414,15 +1443,12 @@ static void __free_huge_page(struct page *page)
h->resv_huge_pages++;
 
if (HPageTemporary(page)) {
-   list_del(&page->lru);
-   ClearHPageTemporary(page);
+   remove_hugetlb_page(h, page, false);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
-   list_del(&page->lru);
+   remove_hugetlb_page(h, page, true);
update_and_free_page(h, page);
-   h->surplus_huge_pages--;
-   h->surplus_huge_pages_node[nid]--;
} else {
arch_clear_hugepage_flags(page);
enqueue_huge_page(h, page);
@@ -1707,13 +1733,7 @@ static int free_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
struct page *page =
list_entry(h->hugepage_freelists[node].next,
  struct page, lru);
-   list_del(&page->lru);
-   h->free_huge_pages--;
-   h->free_huge_pages_node[node]--;

[PATCH v5 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments

2021-04-09 Thread Mike Kravetz
The helper routine hstate_next_node_to_alloc accesses and modifies the
hstate variable next_nid_to_alloc.  The helper is used by the routines
alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
called with hugetlb_lock held.  However, alloc_pool_huge_page can not
be called with the hugetlb lock held as it will call the page allocator.
Two instances of alloc_pool_huge_page could be run in parallel or
alloc_pool_huge_page could run in parallel with adjust_pool_surplus
which may result in the variable next_nid_to_alloc becoming invalid
for the caller and pages being allocated on the wrong node.

Both alloc_pool_huge_page and adjust_pool_surplus are only called from
the routine set_max_huge_pages after boot.  set_max_huge_pages is only
called as the result of a user writing to the proc/sysfs nr_hugepages,
or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.

It makes little sense to allow multiple adjustments to the number of
hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
allow one hugetlb page adjustment at a time.  This will synchronize
modifications to the next_nid_to_alloc variable.

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Oscar Salvador 
Reviewed-by: Miaohe Lin 
Reviewed-by: Muchun Song 
Reviewed-by: David Hildenbrand 
---
 include/linux/hugetlb.h | 1 +
 mm/hugetlb.c| 8 
 2 files changed, 9 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a7f7d5f328dc..09f1fd12a6fa 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -566,6 +566,7 @@ HPAGEFLAG(Freed, freed)
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
+   struct mutex resize_lock;
int next_nid_to_alloc;
int next_nid_to_free;
unsigned int order;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3a10b96a2124..c8799a480784 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2615,6 +2615,11 @@ static int set_max_huge_pages(struct hstate *h, unsigned 
long count, int nid,
else
return -ENOMEM;
 
+   /*
+* resize_lock mutex prevents concurrent adjustments to number of
+* pages in hstate via the proc/sysfs interfaces.
+*/
+   mutex_lock(&h->resize_lock);
spin_lock(&hugetlb_lock);
 
/*
@@ -2647,6 +2652,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned 
long count, int nid,
if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
if (count > persistent_huge_pages(h)) {
spin_unlock(&hugetlb_lock);
+   mutex_unlock(&h->resize_lock);
NODEMASK_FREE(node_alloc_noretry);
return -EINVAL;
}
@@ -2721,6 +2727,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned 
long count, int nid,
 out:
h->max_huge_pages = persistent_huge_pages(h);
spin_unlock(&hugetlb_lock);
+   mutex_unlock(&h->resize_lock);
 
NODEMASK_FREE(node_alloc_noretry);
 
@@ -3208,6 +3215,7 @@ void __init hugetlb_add_hstate(unsigned int order)
BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
BUG_ON(order == 0);
h = &hstates[hugetlb_max_hstate++];
+   mutex_init(&h->resize_lock);
h->order = order;
h->mask = ~(huge_page_size(h) - 1);
for (i = 0; i < MAX_NUMNODES; ++i)
-- 
2.30.2



[PATCH v5 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-04-09 Thread Mike Kravetz
cma_release is currently a sleepable operation because the bitmap
manipulation is protected by cma->lock mutex. Hugetlb code which relies
on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
irq safe.

The lock doesn't protect any sleepable operation so it can be changed to
an (irq aware) spin lock. The bitmap processing should be quite fast in the
typical case, but if cma sizes grow to TB then we will likely need to
replace the lock by a more optimized bitmap implementation.

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: David Hildenbrand 
Acked-by: Roman Gushchin 
---
 mm/cma.c   | 18 +-
 mm/cma.h   |  2 +-
 mm/cma_debug.c |  8 
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index f3bca4178c7f..995e15480937 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -24,7 +24,6 @@
 #include <linux/memblock.h>
 #include <linux/err.h>
 #include <linux/mm.h>
-#include <linux/mutex.h>
 #include <linux/sizes.h>
 #include <linux/slab.h>
 #include <linux/log2.h>
@@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long 
pfn,
 unsigned long count)
 {
unsigned long bitmap_no, bitmap_count;
+   unsigned long flags;
 
bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
bitmap_count = cma_bitmap_pages_to_bits(cma, count);
 
-   mutex_lock(&cma->lock);
+   spin_lock_irqsave(&cma->lock, flags);
bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
-   mutex_unlock(&cma->lock);
+   spin_unlock_irqrestore(&cma->lock, flags);
 }
 
 static void __init cma_activate_area(struct cma *cma)
@@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
 pfn += pageblock_nr_pages)
init_cma_reserved_pageblock(pfn_to_page(pfn));
 
-   mutex_init(&cma->lock);
+   spin_lock_init(&cma->lock);
 
 #ifdef CONFIG_CMA_DEBUGFS
INIT_HLIST_HEAD(&cma->mem_head);
@@ -392,7 +392,7 @@ static void cma_debug_show_areas(struct cma *cma)
unsigned long nr_part, nr_total = 0;
unsigned long nbits = cma_bitmap_maxno(cma);
 
-   mutex_lock(&cma->lock);
+   spin_lock_irq(&cma->lock);
pr_info("number of available pages: ");
for (;;) {
next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
@@ -407,7 +407,7 @@ static void cma_debug_show_areas(struct cma *cma)
start = next_zero_bit + nr_zero;
}
pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
-   mutex_unlock(&cma->lock);
+   spin_unlock_irq(&cma->lock);
 }
 #else
 static inline void cma_debug_show_areas(struct cma *cma) { }
@@ -454,12 +454,12 @@ struct page *cma_alloc(struct cma *cma, unsigned long 
count,
goto out;
 
for (;;) {
-   mutex_lock(&cma->lock);
+   spin_lock_irq(&cma->lock);
bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
bitmap_maxno, start, bitmap_count, mask,
offset);
if (bitmap_no >= bitmap_maxno) {
-   mutex_unlock(&cma->lock);
+   spin_unlock_irq(&cma->lock);
break;
}
bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
@@ -468,7 +468,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
 * our exclusive use. If the migration fails we will take the
 * lock again and unmark it.
 */
-   mutex_unlock(&cma->lock);
+   spin_unlock_irq(&cma->lock);
 
pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
diff --git a/mm/cma.h b/mm/cma.h
index 68ffad4e430d..2c775877eae2 100644
--- a/mm/cma.h
+++ b/mm/cma.h
@@ -15,7 +15,7 @@ struct cma {
unsigned long   count;
unsigned long   *bitmap;
unsigned int order_per_bit; /* Order of pages represented by one bit */
-   struct mutexlock;
+   spinlock_t  lock;
 #ifdef CONFIG_CMA_DEBUGFS
struct hlist_head mem_head;
spinlock_t mem_head_lock;
diff --git a/mm/cma_debug.c b/mm/cma_debug.c
index d5bf8aa34fdc..2e7704955f4f 100644
--- a/mm/cma_debug.c
+++ b/mm/cma_debug.c
@@ -36,10 +36,10 @@ static int cma_used_get(void *data, u64 *val)
struct cma *cma = data;
unsigned long used;
 
-   mutex_lock(&cma->lock);
+   spin_lock_irq(&cma->lock);
/* pages counter is smaller than sizeof(int) */
used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
-   mutex_unlock(&cma->lock);
+   spin_unlock_irq(&cma->lock);
*val = (u64)used << cma->order_per_bit;
 
return 0;
@@ -53,7 +53,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
unsigned long start, end = 0;
unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
 
- 

Re: [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts

2021-04-09 Thread Mike Kravetz
On 4/8/21 10:05 PM, Andrew Morton wrote:
> On Thu, 8 Apr 2021 09:11:30 +0200 Oscar Salvador  wrote:
> 
>> But if It is going to be easier for Andrew, just pull them all out and I
>> will resend the whole series once this work goes in.
> 
> I think so.
> 
> I shall drop these:
> 
> mmpage_alloc-bail-out-earlier-on-enomem-in-alloc_contig_migrate_range.patch
> mmcompaction-let-isolate_migratepages_rangeblock-return-error-codes.patch
> mmcompaction-let-isolate_migratepages_rangeblock-return-error-codes-fix.patch
> mm-make-alloc_contig_range-handle-free-hugetlb-pages.patch
> mm-make-alloc_contig_range-handle-in-use-hugetlb-pages.patch
> mmpage_alloc-drop-unnecessary-checks-from-pfn_range_valid_contig.patch
> 
> and these:
> 
> mm-cma-change-cma-mutex-to-irq-safe-spinlock.patch
> hugetlb-no-need-to-drop-hugetlb_lock-to-call-cma_release.patch
> hugetlb-add-per-hstate-mutex-to-synchronize-user-adjustments.patch
> hugetlb-create-remove_hugetlb_page-to-separate-functionality.patch
> hugetlb-call-update_and_free_page-without-hugetlb_lock.patch
> hugetlb-change-free_pool_huge_page-to-remove_pool_huge_page.patch
> hugetlb-make-free_huge_page-irq-safe.patch
> hugetlb-make-free_huge_page-irq-safe-fix.patch
> hugetlb-add-lockdep_assert_held-calls-for-hugetlb_lock.patch
> 
> Along with notes-to-self that this:
> 
>   https://lkml.kernel.org/r/YGwnPCPaq1xKh/8...@hirez.programming.kicks-ass.net
> 
> might need attention and that this:
> 
>   hugetlb-make-free_huge_page-irq-safe.patch
> 
> might need updating.
> 

Thank you Andrew!

I will send a v5 shortly based on dropping the above patch.

-- 
Mike Kravetz


Re: [PATCH 3/4] mm/hugeltb: fix potential wrong gbl_reserve value for hugetlb_acct_memory()

2021-04-08 Thread Mike Kravetz
On 4/8/21 8:01 PM, Miaohe Lin wrote:
> On 2021/4/9 6:53, Mike Kravetz wrote:
>>
>> Yes, add a comment to hugetlb_unreserve_pages saying that !resv_map
>> implies freed == 0.
>>
> 
> Sounds good!
> 
>> It would also be helpful to check for (chg - freed) == 0 and skip the
>> calls to hugepage_subpool_put_pages() and hugetlb_acct_memory().  Both
>> of those routines may perform an unnecessary lock/unlock cycle in this
>> case.
>>
>> A simple
>>  if (chg == freed)
>>  return 0;
>> before the call to hugepage_subpool_put_pages would work.
> 
> This may not be really helpful because hugepage_subpool_put_pages() and 
> hugetlb_acct_memory()
> both would handle delta == 0 case without unnecessary lock/unlock cycle.
> Does this make sense for you? If so, I will prepare v2 with the changes to 
> add a comment
> to hugetlb_unreserve_pages() __without__ the check for (chg - freed) == 0.

Sorry, I forgot about the recent changes to check for delta == 0.
No need for the check here, just the comment.
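
For reference, a comment along these lines (wording is only a suggestion)
would capture it:

    /*
     * If the inode has no resv_map (only regular and link inodes get one),
     * then no huge pages can ever have been allocated for it, so
     * remove_inode_hugepages() found nothing and 'freed' must be 0 here.
     */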
-- 
Mike Kravetz


Re: [PATCH 4/4] mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts()

2021-04-08 Thread Mike Kravetz
On 4/2/21 2:32 AM, Miaohe Lin wrote:
> A rare out of memory error would prevent removal of the reserve map region
> for a page. hugetlb_fix_reserve_counts() handles this rare case to avoid
> dangling with incorrect counts. Unfortunately, hugepage_subpool_get_pages
> and hugetlb_acct_memory could possibly fail too. We should correctly handle
> these cases.

Yes, this is a potential issue.

The 'good news' is that hugetlb_fix_reserve_counts() is unlikely to ever
be called.  To do so would imply we could not allocate a region entry
which is only 6 words in size.  We also keep a 'cache' of entries so we
may not even need to allocate.

But, as mentioned it is a potential issue.

> Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")

This is likely going to make this get picked up by stable releases.
That is unfortunate because, as mentioned above, this is mostly theoretical.

> Signed-off-by: Miaohe Lin 
> ---
>  mm/hugetlb.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bdff8d23803f..ca5464ed04b7 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -745,13 +745,20 @@ void hugetlb_fix_reserve_counts(struct inode *inode)
>  {
>   struct hugepage_subpool *spool = subpool_inode(inode);
>   long rsv_adjust;
> + bool reserved = false;
>  
>   rsv_adjust = hugepage_subpool_get_pages(spool, 1);
> - if (rsv_adjust) {
> + if (rsv_adjust > 0) {
>   struct hstate *h = hstate_inode(inode);
>  
> - hugetlb_acct_memory(h, 1);
> + if (!hugetlb_acct_memory(h, 1))
> + reserved = true;
> + } else if (!rsv_adjust) {
> + reserved = true;
>   }
> +
> + if (!reserved)
> + pr_warn("hugetlb: fix reserve count failed\n");

We should expand this warning message a bit to indicate what this may
mean to the user.  Add something like:
"Huge Page Reserved count may go negative".
-- 
Mike Kravetz


Re: [PATCH 3/4] mm/hugeltb: fix potential wrong gbl_reserve value for hugetlb_acct_memory()

2021-04-08 Thread Mike Kravetz
On 4/7/21 8:26 PM, Miaohe Lin wrote:
> On 2021/4/8 11:24, Miaohe Lin wrote:
>> On 2021/4/8 4:53, Mike Kravetz wrote:
>>> On 4/7/21 12:24 AM, Miaohe Lin wrote:
>>>> Hi:
>>>> On 2021/4/7 10:49, Mike Kravetz wrote:
>>>>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>>>>> The resv_map could be NULL since this routine can be called in the evict
>>>>>> inode path for all hugetlbfs inodes. So we could have chg = 0 and this
>>>>>> would result in a negative value when chg - freed. This is unexpected for
>>>>>> hugepage_subpool_put_pages() and hugetlb_acct_memory().
>>>>>
>>>>> I am not sure if this is possible.
>>>>>
>>>>> It is true that resv_map could be NULL.  However, I believe resv map
>>>>> can only be NULL for inodes that are not regular or link inodes.  This
>>>>> is the inode creation code in hugetlbfs_get_inode().
>>>>>
>>>>>/*
>>>>>  * Reserve maps are only needed for inodes that can have 
>>>>> associated
>>>>>  * page allocations.
>>>>>  */
>>>>> if (S_ISREG(mode) || S_ISLNK(mode)) {
>>>>> resv_map = resv_map_alloc();
>>>>> if (!resv_map)
>>>>> return NULL;
>>>>> }
>>>>>
>>>>
>>>> Agree.
>>>>
>>>>> If resv_map is NULL, then no hugetlb pages can be allocated/associated
>>>>> with the file.  As a result, remove_inode_hugepages will never find any
>>>>> huge pages associated with the inode and the passed value 'freed' will
>>>>> always be zero.
>>>>>
>>>>
>>>> But I am confused now. AFAICS, remove_inode_hugepages() searches the 
>>>> address_space of
>>>> the inode to remove the hugepages while does not care if inode has 
>>>> associated resv_map.
>>>> How does it prevent hugetlb pages from being allocated/associated with the 
>>>> file if
>>>> resv_map is NULL? Could you please explain this more?
>>>>
>>>
>>> Recall that there are only two ways to get huge pages associated with
>>> a hugetlbfs file: fallocate and mmap/write fault.  Directly writing to
>>> hugetlbfs files is not supported.
>>>
>>> If you take a closer look at hugetlbfs_get_inode, it has that code to
>>> allocate the resv map mentioned above as well as the following:
>>>
>>> switch (mode & S_IFMT) {
>>> default:
>>> init_special_inode(inode, mode, dev);
>>> break;
>>> case S_IFREG:
>>> inode->i_op = &hugetlbfs_inode_operations;
>>> inode->i_fop = &hugetlbfs_file_operations;
>>> break;
>>> case S_IFDIR:
>>> inode->i_op = &hugetlbfs_dir_inode_operations;
>>> inode->i_fop = &simple_dir_operations;
>>>
>>> /* directory inodes start off with i_nlink == 2 (for "." entry) */
>>> inc_nlink(inode);
>>> break;
>>> case S_IFLNK:
>>> inode->i_op = &hugetlbfs_symlink_inode_operations;
>>> inode_nohighmem(inode);
>>> break;
>>> }
>>>
>>> Notice that only S_IFREG inodes will have i_fop ==
>>> &hugetlbfs_file_operations.
>>> hugetlbfs_file_operations contain the hugetlbfs specific mmap and fallocate
>>> routines.  Hence, only files with S_IFREG inodes can potentially have
>>> associated huge pages.  S_IFLNK inodes can as well via file linking.
>>>
>>> If an inode is not S_ISREG(mode) || S_ISLNK(mode), then it will not have
>>> a resv_map.  In addition, it will not have hugetlbfs_file_operations and
>>> can not have associated huge pages.
>>>
>>
>> Many many thanks for detailed and patient explanation! :) I think I have got 
>> the idea!
>>
>>> I looked at this closely when adding commits
>>> 58b6e5e8f1ad hugetlbfs: fix memory leak for resv_map
>>> f27a5136f70a hugetlbfs: always use address space in inode for resv_map 
>>> pointer
>>>
>>> I may not be remembering all of the details correctly.  Commit f27a5136f70a
>>> added the comment that resv_map could be NULL to hugetlb_unreserve_pages.
>>>
>>
>> Since we must have freed == 0 while chg == 0. Should we make this assumption 
>> explict
>> by something like below?
>>
>> WARN_ON(chg < freed);
>>
> 
> Or just a comment to avoid confusion ?
> 

Yes, add a comment to hugetlb_unreserve_pages saying that !resv_map
implies freed == 0.

It would also be helpful to check for (chg - freed) == 0 and skip the
calls to hugepage_subpool_put_pages() and hugetlb_acct_memory().  Both
of those routines may perform an unnecessary lock/unlock cycle in this
case.

A simple
if (chg == freed)
return 0;
before the call to hugepage_subpool_put_pages would work.
-- 
Mike Kravetz


Re: [PATCH 2/4] mm/hugeltb: simplify the return code of __vma_reservation_common()

2021-04-08 Thread Mike Kravetz
On 4/7/21 7:44 PM, Miaohe Lin wrote:
> On 2021/4/8 5:23, Mike Kravetz wrote:
>> On 4/6/21 8:09 PM, Miaohe Lin wrote:
>>> On 2021/4/7 10:37, Mike Kravetz wrote:
>>>> On 4/6/21 7:05 PM, Miaohe Lin wrote:
>>>>> Hi:
>>>>> On 2021/4/7 8:53, Mike Kravetz wrote:
>>>>>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>>>>>> It's guaranteed that the vma is associated with a resv_map, i.e. either
>>>>>>> VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would
>>>>>>> have returned via !resv check above. So ret must be less than 0 in the
>>>>>>> 'else' case. Simplify the return code to make this clear.
>>>>>>
>>>>>> I believe we still neeed that ternary operator in the return statement.
>>>>>> Why?
>>>>>>
>>>>>> There are two basic types of mappings to be concerned with:
>>>>>> shared and private.
>>>>>> For private mappings, a task can 'own' the mapping as indicated by
>>>>>> HPAGE_RESV_OWNER.  Or, it may not own the mapping.  The most common way
>>>>>> to create a non-owner private mapping is to have a task with a private
>>>>>> mapping fork.  The parent process will have HPAGE_RESV_OWNER set, the
>>>>>> child process will not.  The idea is that since the child has a COW copy
>>>>>> of the mapping it should not consume reservations made by the parent.
>>>>>
>>>>> The child process will not have HPAGE_RESV_OWNER set because at fork 
>>>>> time, we do:
>>>>>   /*
>>>>>* Clear hugetlb-related page reserves for children. This only
>>>>>* affects MAP_PRIVATE mappings. Faults generated by the child
>>>>>* are not guaranteed to succeed, even if read-only
>>>>>*/
>>>>>   if (is_vm_hugetlb_page(tmp))
>>>>>   reset_vma_resv_huge_pages(tmp);
>>>>> i.e. we have vma->vm_private_data = (void *)0; for child process and 
>>>>> vma_resv_map() will
>>>>> return NULL in this case.
>>>>> Or am I missed something?
>>>>>
>>>>>> Only the parent (HPAGE_RESV_OWNER) is allowed to consume the
>>>>>> reservations.
>>>>>> Hope that makens sense?
>>>>>>
>>>>>>>
>>>>>>> Signed-off-by: Miaohe Lin 
>>>>>>> ---
>>>>>>>  mm/hugetlb.c | 2 +-
>>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>>> index a03a50b7c410..b7864abded3d 100644
>>>>>>> --- a/mm/hugetlb.c
>>>>>>> +++ b/mm/hugetlb.c
>>>>>>> @@ -2183,7 +2183,7 @@ static long __vma_reservation_common(struct 
>>>>>>> hstate *h,
>>>>>>> return 1;
>>>>>>> }
>>>>>>> else
>>>>>>
>>>>>> This else also handles the case !HPAGE_RESV_OWNER.  In this case, we
>>>>>
>>>>> IMO, for the case !HPAGE_RESV_OWNER, we won't reach here. What do you 
>>>>> think?
>>>>>
>>>>
>>>> I think you are correct.
>>>>
>>>> However, if this is true we should be able to simply the code even
>>>> further.  There is no need to check for HPAGE_RESV_OWNER because we know
>>>> it must be set.  Correct?  If so, the code could look something like:
>>>>
>>>>if (vma->vm_flags & VM_MAYSHARE)
>>>>return ret;
>>>>
>>>>/* We know private mapping with HPAGE_RESV_OWNER */
>>>> * ...   *
>>>> * Add that existing comment */
>>>>
>>>>if (ret > 0)
>>>>return 0;
>>>>if (ret == 0)
>>>>return 1;
>>>>return ret;
>>>>
>>>
>>> Many thanks for good suggestion! What do you mean is this ?
>>
>> I think the below changes would work fine.
>>
>> However, this patch/discussion has made me ask the question.  Do we need
>> the HPAGE_RESV_OWNER flag?  Is the followng true?
>> !(vm_flags & VM_MAYSHARE) && vma_resv_map()  ===> HPAGE_RESV_OWNER
>> !(vm_flags & VM_MAYSHARE) && !vma_resv_map() ===> !HPAGE_RESV_OWNER
>>
> 
> I agree with you.
> 
> HPAGE_RESV_OWNER is set in hugetlb_reserve_pages() and there's no way to 
> clear it
> in the owner process. The child process can not inherit both HPAGE_RESV_OWNER 
> and
> resv_map. So for !HPAGE_RESV_OWNER vma, it knows nothing about resv_map.
> 
> IMO, in !(vm_flags & VM_MAYSHARE) case, we must have:
>   !!vma_resv_map() == !!HPAGE_RESV_OWNER
> 
>> I am not suggesting we eliminate the flag and make corresponding
>> changes.  Just curious if you believe we 'could' remove the flag and
>> depend on the above conditions.
>>
>> One reason for NOT removing the flag is that that flag itself and
>> supporting code and commnets help explain what happens with hugetlb
>> reserves for COW mappings.  That code is hard to understand and the
>> existing code and coments around HPAGE_RESV_OWNER help with
>> understanding.
> 
> Agree. These codes took me several days to understand...
> 

Please prepare v2 with the changes to remove the HPAGE_RESV_OWNER check
and move the large comment.


I would prefer to leave other places that mention HPAGE_RESV_OWNER
unchanged.

Thanks,
-- 
Mike Kravetz


Re: [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts

2021-04-07 Thread Mike Kravetz
Hello Andrew,

It has been suggested that this series be included before Oscar Salvador's
series "Make alloc_contig_range handle Hugetlb pages".  At a logical
level, here is what I think needs to happen.  However, I am not sure how
you do tree management and I am open to anything you suggest.  Please do
not start until we get an Ack from Oscar as he will need to participate.

Remove patches for this series in your tree from Mike Kravetz:
- hugetlb: add lockdep_assert_held() calls for hugetlb_lock
- hugetlb: fix irq locking omissions
- hugetlb: make free_huge_page irq safe
- hugetlb: change free_pool_huge_page to remove_pool_huge_page
- hugetlb: call update_and_free_page without hugetlb_lock
- hugetlb: create remove_hugetlb_page() to separate functionality
  /*
   * Technically, the following patches do not need to be removed as
   * they do not interact with Oscar's changes.  However, they do
   * contain 'cover letter comments' in the commit messages which may
   * not make sense out of context.
   */
- hugetlb: add per-hstate mutex to synchronize user adjustment
- hugetlb: no need to drop hugetlb_lock to call cma_release
- mm/cma: change cma mutex to irq safe spinlock

Remove patches for the series "Make alloc_contig_range handle Hugetlb pages"
from Oscar Salvador.
- mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig
- mm: make alloc_contig_range handle in-use hugetlb pages
- mm: make alloc_contig_range handle free hugetlb pages
  /*
   * Technically, the following patches do not need to be removed as
   * they do not interact with Mike's changes.  Again, they do
   * contain 'cover letter comments' in the commit messages which may
   * not make sense out of context.
   */
- mmcompaction-let-isolate_migratepages_rangeblock-return-error-codes-fix
- mm,compaction: let isolate_migratepages_{range,block} return error codes
- mm,page_alloc: bail out earlier on -ENOMEM in alloc_contig_migrate_range

After removing patches above, Mike will provide updated versions of:
/* If removed above */
- mm/cma: change cma mutex to irq safe spinlock
- hugetlb: no need to drop hugetlb_lock to call cma_release
- hugetlb: add per-hstate mutex to synchronize user adjustment
/* end of If removed above */
- hugetlb: create remove_hugetlb_page() to separate functionality
- hugetlb: call update_and_free_page without hugetlb_lock
- hugetlb: change free_pool_huge_page to remove_pool_huge_page
- hugetlb: make free_huge_page irq safe
- hugetlb: add lockdep_assert_held() calls for hugetlb_lock

With these patches in place, Oscar will provide updated versions of:
/* If removed above */
- mm,page_alloc: bail out earlier on -ENOMEM in alloc_contig_migrate_range
- mm,compaction: let isolate_migratepages_{range,block} return error codes
/* end of If removed above */
- mm: make alloc_contig_range handle free hugetlb pages
- mm: make alloc_contig_range handle in-use hugetlb pages
- mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig

Sorry that things ended up in their current state as it will cause more
work for you.
-- 
Mike Kravetz


Re: [PATCH 2/4] mm/hugeltb: simplify the return code of __vma_reservation_common()

2021-04-07 Thread Mike Kravetz
On 4/6/21 8:09 PM, Miaohe Lin wrote:
> On 2021/4/7 10:37, Mike Kravetz wrote:
>> On 4/6/21 7:05 PM, Miaohe Lin wrote:
>>> Hi:
>>> On 2021/4/7 8:53, Mike Kravetz wrote:
>>>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>>>> It's guaranteed that the vma is associated with a resv_map, i.e. either
>>>>> VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would
>>>>> have returned via !resv check above. So ret must be less than 0 in the
>>>>> 'else' case. Simplify the return code to make this clear.
>>>>
>>>> I believe we still neeed that ternary operator in the return statement.
>>>> Why?
>>>>
>>>> There are two basic types of mappings to be concerned with:
>>>> shared and private.
>>>> For private mappings, a task can 'own' the mapping as indicated by
>>>> HPAGE_RESV_OWNER.  Or, it may not own the mapping.  The most common way
>>>> to create a non-owner private mapping is to have a task with a private
>>>> mapping fork.  The parent process will have HPAGE_RESV_OWNER set, the
>>>> child process will not.  The idea is that since the child has a COW copy
>>>> of the mapping it should not consume reservations made by the parent.
>>>
>>> The child process will not have HPAGE_RESV_OWNER set because at fork time, 
>>> we do:
>>> /*
>>>  * Clear hugetlb-related page reserves for children. This only
>>>  * affects MAP_PRIVATE mappings. Faults generated by the child
>>>  * are not guaranteed to succeed, even if read-only
>>>  */
>>> if (is_vm_hugetlb_page(tmp))
>>> reset_vma_resv_huge_pages(tmp);
>>> i.e. we have vma->vm_private_data = (void *)0; for child process and 
>>> vma_resv_map() will
>>> return NULL in this case.
>>> Or am I missed something?
>>>
>>>> Only the parent (HPAGE_RESV_OWNER) is allowed to consume the
>>>> reservations.
>>>> Hope that makens sense?
>>>>
>>>>>
>>>>> Signed-off-by: Miaohe Lin 
>>>>> ---
>>>>>  mm/hugetlb.c | 2 +-
>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>> index a03a50b7c410..b7864abded3d 100644
>>>>> --- a/mm/hugetlb.c
>>>>> +++ b/mm/hugetlb.c
>>>>> @@ -2183,7 +2183,7 @@ static long __vma_reservation_common(struct hstate 
>>>>> *h,
>>>>>   return 1;
>>>>>   }
>>>>>   else
>>>>
>>>> This else also handles the case !HPAGE_RESV_OWNER.  In this case, we
>>>
>>> IMO, for the case !HPAGE_RESV_OWNER, we won't reach here. What do you think?
>>>
>>
>> I think you are correct.
>>
>> However, if this is true we should be able to simply the code even
>> further.  There is no need to check for HPAGE_RESV_OWNER because we know
>> it must be set.  Correct?  If so, the code could look something like:
>>
>>  if (vma->vm_flags & VM_MAYSHARE)
>>  return ret;
>>
>>  /* We know private mapping with HPAGE_RESV_OWNER */
>>   * ...   *
>>   * Add that existing comment */
>>
>>  if (ret > 0)
>>  return 0;
>>  if (ret == 0)
>>  return 1;
>>  return ret;
>>
> 
> Many thanks for good suggestion! What do you mean is this ?

I think the below changes would work fine.

However, this patch/discussion has made me ask the question.  Do we need
the HPAGE_RESV_OWNER flag?  Is the following true?
!(vm_flags & VM_MAYSHARE) && vma_resv_map()  ===> HPAGE_RESV_OWNER
!(vm_flags & VM_MAYSHARE) && !vma_resv_map() ===> !HPAGE_RESV_OWNER
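
Spelled out as a hypothetical helper, the equivalence in question would look
something like:

    static inline bool vma_is_resv_owner(struct vm_area_struct *vma)
    {
        /* hypothetical: private mapping that owns a reserve map */
        return !(vma->vm_flags & VM_MAYSHARE) && vma_resv_map(vma) != NULL;
    }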

I am not suggesting we eliminate the flag and make corresponding
changes.  Just curious if you believe we 'could' remove the flag and
depend on the above conditions.

One reason for NOT removing the flag is that the flag itself and
supporting code and comments help explain what happens with hugetlb
reserves for COW mappings.  That code is hard to understand and the
existing code and comments around HPAGE_RESV_OWNER help with
understanding.

-- 
Mike Kravetz

> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a03a50b7c410..9b4c05699a90 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2163,27 +2163,26 @@ sta

Re: [PATCH 3/4] mm/hugeltb: fix potential wrong gbl_reserve value for hugetlb_acct_memory()

2021-04-07 Thread Mike Kravetz
On 4/7/21 12:24 AM, Miaohe Lin wrote:
> Hi:
> On 2021/4/7 10:49, Mike Kravetz wrote:
>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>> The resv_map could be NULL since this routine can be called in the evict
>>> inode path for all hugetlbfs inodes. So we could have chg = 0 and this
>>> would result in a negative value when chg - freed. This is unexpected for
>>> hugepage_subpool_put_pages() and hugetlb_acct_memory().
>>
>> I am not sure if this is possible.
>>
>> It is true that resv_map could be NULL.  However, I believe resv map
>> can only be NULL for inodes that are not regular or link inodes.  This
>> is the inode creation code in hugetlbfs_get_inode().
>>
>>/*
>>  * Reserve maps are only needed for inodes that can have associated
>>  * page allocations.
>>  */
>> if (S_ISREG(mode) || S_ISLNK(mode)) {
>> resv_map = resv_map_alloc();
>> if (!resv_map)
>> return NULL;
>> }
>>
> 
> Agree.
> 
>> If resv_map is NULL, then no hugetlb pages can be allocated/associated
>> with the file.  As a result, remove_inode_hugepages will never find any
>> huge pages associated with the inode and the passed value 'freed' will
>> always be zero.
>>
> 
> But I am confused now. AFAICS, remove_inode_hugepages() searches the 
> address_space of
> the inode to remove the hugepages while does not care if inode has associated 
> resv_map.
> How does it prevent hugetlb pages from being allocated/associated with the 
> file if
> resv_map is NULL? Could you please explain this more?
> 

Recall that there are only two ways to get huge pages associated with
a hugetlbfs file: fallocate and mmap/write fault.  Directly writing to
hugetlbfs files is not supported.

If you take a closer look at hugetlbfs_get_inode, it has that code to
allocate the resv map mentioned above as well as the following:

switch (mode & S_IFMT) {
default:
init_special_inode(inode, mode, dev);
break;
case S_IFREG:
inode->i_op = &hugetlbfs_inode_operations;
inode->i_fop = &hugetlbfs_file_operations;
break;
case S_IFDIR:
inode->i_op = &hugetlbfs_dir_inode_operations;
inode->i_fop = &simple_dir_operations;

/* directory inodes start off with i_nlink == 2 (for "." entry) */
inc_nlink(inode);
break;
case S_IFLNK:
inode->i_op = &hugetlbfs_symlink_inode_operations;
inode_nohighmem(inode);
break;
}

Notice that only S_IFREG inodes will have i_fop == &hugetlbfs_file_operations.
hugetlbfs_file_operations contain the hugetlbfs specific mmap and fallocate
routines.  Hence, only files with S_IFREG inodes can potentially have
associated huge pages.  S_IFLNK inodes can as well via file linking.

If an inode is not S_ISREG(mode) || S_ISLNK(mode), then it will not have
a resv_map.  In addition, it will not have hugetlbfs_file_operations and
can not have associated huge pages.

I looked at this closely when adding commits
58b6e5e8f1ad hugetlbfs: fix memory leak for resv_map
f27a5136f70a hugetlbfs: always use address space in inode for resv_map pointer

I may not be remembering all of the details correctly.  Commit f27a5136f70a
added the comment that resv_map could be NULL to hugetlb_unreserve_pages.
-- 
Mike Kravetz


Re: [PATCH 3/4] mm/hugeltb: fix potential wrong gbl_reserve value for hugetlb_acct_memory()

2021-04-06 Thread Mike Kravetz
On 4/2/21 2:32 AM, Miaohe Lin wrote:
> The resv_map could be NULL since this routine can be called in the evict
> inode path for all hugetlbfs inodes. So we could have chg = 0 and this
> would result in a negative value when chg - freed. This is unexpected for
> hugepage_subpool_put_pages() and hugetlb_acct_memory().

I am not sure if this is possible.

It is true that resv_map could be NULL.  However, I believe resv map
can only be NULL for inodes that are not regular or link inodes.  This
is the inode creation code in hugetlbfs_get_inode().

   /*
 * Reserve maps are only needed for inodes that can have associated
 * page allocations.
 */
if (S_ISREG(mode) || S_ISLNK(mode)) {
resv_map = resv_map_alloc();
if (!resv_map)
return NULL;
}

If resv_map is NULL, then no hugetlb pages can be allocated/associated
with the file.  As a result, remove_inode_hugepages will never find any
huge pages associated with the inode and the passed value 'freed' will
always be zero.

Does that sound correct?

-- 
Mike Kravetz

> 
> Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
> Signed-off-by: Miaohe Lin 
> ---
>  mm/hugetlb.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b7864abded3d..bdff8d23803f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5413,6 +5413,7 @@ long hugetlb_unreserve_pages(struct inode *inode, long 
> start, long end,
>   long chg = 0;
>   struct hugepage_subpool *spool = subpool_inode(inode);
>   long gbl_reserve;
> + long delta;
>  
>   /*
>* Since this routine can be called in the evict inode path for all
> @@ -5437,7 +5438,8 @@ long hugetlb_unreserve_pages(struct inode *inode, long 
> start, long end,
>* If the subpool has a minimum size, the number of global
>* reservations to be released may be adjusted.
>*/
> - gbl_reserve = hugepage_subpool_put_pages(spool, (chg - freed));
> + delta = chg > 0 ? chg - freed : freed;
> + gbl_reserve = hugepage_subpool_put_pages(spool, delta);
>   hugetlb_acct_memory(h, -gbl_reserve);
>  
>   return 0;
> 


Re: [PATCH 2/4] mm/hugeltb: simplify the return code of __vma_reservation_common()

2021-04-06 Thread Mike Kravetz
On 4/6/21 7:05 PM, Miaohe Lin wrote:
> Hi:
> On 2021/4/7 8:53, Mike Kravetz wrote:
>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>> It's guaranteed that the vma is associated with a resv_map, i.e. either
>>> VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would
>>> have returned via !resv check above. So ret must be less than 0 in the
>>> 'else' case. Simplify the return code to make this clear.
>>
>> I believe we still neeed that ternary operator in the return statement.
>> Why?
>>
>> There are two basic types of mappings to be concerned with:
>> shared and private.
>> For private mappings, a task can 'own' the mapping as indicated by
>> HPAGE_RESV_OWNER.  Or, it may not own the mapping.  The most common way
>> to create a non-owner private mapping is to have a task with a private
>> mapping fork.  The parent process will have HPAGE_RESV_OWNER set, the
>> child process will not.  The idea is that since the child has a COW copy
>> of the mapping it should not consume reservations made by the parent.
> 
> The child process will not have HPAGE_RESV_OWNER set because at fork time, we 
> do:
>   /*
>* Clear hugetlb-related page reserves for children. This only
>* affects MAP_PRIVATE mappings. Faults generated by the child
>* are not guaranteed to succeed, even if read-only
>*/
>   if (is_vm_hugetlb_page(tmp))
>   reset_vma_resv_huge_pages(tmp);
> i.e. we have vma->vm_private_data = (void *)0; for child process and 
> vma_resv_map() will
> return NULL in this case.
> Or am I missed something?
> 
>> Only the parent (HPAGE_RESV_OWNER) is allowed to consume the
>> reservations.
>> Hope that makens sense?
>>
>>>
>>> Signed-off-by: Miaohe Lin 
>>> ---
>>>  mm/hugetlb.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index a03a50b7c410..b7864abded3d 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -2183,7 +2183,7 @@ static long __vma_reservation_common(struct hstate *h,
>>> return 1;
>>> }
>>> else
>>
>> This else also handles the case !HPAGE_RESV_OWNER.  In this case, we
> 
> IMO, for the case !HPAGE_RESV_OWNER, we won't reach here. What do you think?
> 

I think you are correct.

However, if this is true we should be able to simplify the code even
further.  There is no need to check for HPAGE_RESV_OWNER because we know
it must be set.  Correct?  If so, the code could look something like:

if (vma->vm_flags & VM_MAYSHARE)
return ret;

/* We know private mapping with HPAGE_RESV_OWNER */
 * ...   *
 * Add that existing comment */

if (ret > 0)
return 0;
if (ret == 0)
return 1;
return ret;

-- 
Mike Kravetz


Re: [PATCH 2/4] mm/hugeltb: simplify the return code of __vma_reservation_common()

2021-04-06 Thread Mike Kravetz
On 4/2/21 2:32 AM, Miaohe Lin wrote:
> It's guaranteed that the vma is associated with a resv_map, i.e. either
> VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would
> have returned via !resv check above. So ret must be less than 0 in the
> 'else' case. Simplify the return code to make this clear.

I believe we still need that ternary operator in the return statement.
Why?

There are two basic types of mappings to be concerned with:
shared and private.
For private mappings, a task can 'own' the mapping as indicated by
HPAGE_RESV_OWNER.  Or, it may not own the mapping.  The most common way
to create a non-owner private mapping is to have a task with a private
mapping fork.  The parent process will have HPAGE_RESV_OWNER set, the
child process will not.  The idea is that since the child has a COW copy
of the mapping it should not consume reservations made by the parent.
Only the parent (HPAGE_RESV_OWNER) is allowed to consume the
reservations.
Hope that makes sense?

> 
> Signed-off-by: Miaohe Lin 
> ---
>  mm/hugetlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a03a50b7c410..b7864abded3d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2183,7 +2183,7 @@ static long __vma_reservation_common(struct hstate *h,
>   return 1;
>   }
>   else

This else also handles the case !HPAGE_RESV_OWNER.  In this case, we
never want to indicate reservations are available.  The ternary makes
sure a positive value is never returned.

-- 
Mike Kravetz

> - return ret < 0 ? ret : 0;
> + return ret;
>  }
>  
>  static long vma_needs_reservation(struct hstate *h,
> 


Re: [PATCH 1/4] mm/hugeltb: remove redundant VM_BUG_ON() in region_add()

2021-04-06 Thread Mike Kravetz
On 4/2/21 2:32 AM, Miaohe Lin wrote:
> The same VM_BUG_ON() check is already done in the callee. Remove this extra
> one to simplify the code slightly.
> 
> Signed-off-by: Miaohe Lin 

Thanks,
Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz

> ---
>  mm/hugetlb.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c22111f3da20..a03a50b7c410 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -556,7 +556,6 @@ static long region_add(struct resv_map *resv, long f, 
> long t,
>   resv->adds_in_progress -= in_regions_needed;
>  
>   spin_unlock(&resv->lock);
> - VM_BUG_ON(add < 0);
>   return add;
>  }
>  
> 


Re: [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality

2021-04-06 Thread Mike Kravetz
On 4/6/21 6:41 AM, Oscar Salvador wrote:
> On Mon, Apr 05, 2021 at 04:00:39PM -0700, Mike Kravetz wrote:
>> +static void remove_hugetlb_page(struct hstate *h, struct page *page,
>> +bool adjust_surplus)
>> +{
>> +int nid = page_to_nid(page);
>> +
>> +if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>> +return;
>> +
>> +list_del(&page->lru);
>> +
>> +if (HPageFreed(page)) {
>> +h->free_huge_pages--;
>> +h->free_huge_pages_node[nid]--;
>> +ClearHPageFreed(page);
>> +}
>> +if (adjust_surplus) {
>> +h->surplus_huge_pages--;
>> +h->surplus_huge_pages_node[nid]--;
>> +}
>> +
>> +VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
>> +VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
> 
> These checks feel a bit odd here.
> I would move them above, before we start messing with the counters?
> 

This routine is comprised of code that was previously in update_and_free_page
and __free_huge_page.   In those routines, the VM_BUG_ON_PAGE came after the
counter adjustments.  That is the only reason they are positioned as they are.

I agree that it makes more sense to add them to the beginning of the routine.

>> +
>> +ClearHPageTemporary(page);
> 
> Why clearing it unconditionally? I guess we do not really care, but
> someone might wonder why when reading the code.
> So I would either do as we used to do and only clear it in case of
> HPageTemporary(), or drop a comment.
> 

Technically, the HPage* flags are meaningless after calling this
routine.  So, there really is no need to modify them at all.  The
flag clearing code is left over from the routines in which they
previously existed.

Any clearing of HPage* flags in this routine is unnecessary and should
be removed to avoid any questions.
-- 
Mike Kravetz


Re: [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality

2021-04-06 Thread Mike Kravetz
On 4/6/21 2:56 AM, Michal Hocko wrote:
> On Mon 05-04-21 16:00:39, Mike Kravetz wrote:
>> The new remove_hugetlb_page() routine is designed to remove a hugetlb
>> page from hugetlbfs processing.  It will remove the page from the active
>> or free list, update global counters and set the compound page
>> destructor to NULL so that PageHuge() will return false for the 'page'.
>> After this call, the 'page' can be treated as a normal compound page or
>> a collection of base size pages.
>>
>> update_and_free_page no longer decrements h->nr_huge_pages{_node} as
>> this is performed in remove_hugetlb_page.  The only functionality
>> performed by update_and_free_page is to free the base pages to the lower
>> level allocators.
>>
>> update_and_free_page is typically called after remove_hugetlb_page.
>>
>> remove_hugetlb_page is to be called with the hugetlb_lock held.
>>
>> Creating this routine and separating functionality is in preparation for
>> restructuring code to reduce lock hold times.  This commit should not
>> introduce any changes to functionality.
>>
>> Signed-off-by: Mike Kravetz 
> 
> Btw. I would prefer to reverse the ordering of this and Oscar's
> patchset. This one is a bug fix which might be interesting for stable
> backports while Oscar's work can be looked as a general functionality
> improvement.

Ok, that makes sense.

Andrew, can we make this happen?  It would require removing Oscar's
series until it can be modified to work on top of this.
There is only one small issue with this series as it originally went
into mmotm.  There is a missing conversion of spin_lock to spin_lock_irq
in patch 7.  In addition, there are some suggested changes from Oscar to
this patch.  I do not think they are necessary, but I could make those
as well.  Let me know what I can do to help make this happen.

>> @@ -2298,6 +2312,7 @@ static int alloc_and_dissolve_huge_page(struct hstate 
>> *h, struct page *old_page,
>>  /*
>>   * Freed from under us. Drop new_page too.
>>   */
>> +remove_hugetlb_page(h, new_page, false);
>>  update_and_free_page(h, new_page);
>>  goto unlock;
>>  } else if (page_count(old_page)) {
>> @@ -2305,6 +2320,7 @@ static int alloc_and_dissolve_huge_page(struct hstate 
>> *h, struct page *old_page,
>>   * Someone has grabbed the page, try to isolate it here.
>>   * Fail with -EBUSY if not possible.
>>   */
>> +remove_hugetlb_page(h, new_page, false);
>>  update_and_free_page(h, new_page);
>>  spin_unlock(&hugetlb_lock);
>>  if (!isolate_huge_page(old_page, list))
> 
> the page is not enqued anywhere here so remove_hugetlb_page would blow
> when linked list debugging is enabled.

I also thought this would be an issue.  However, INIT_LIST_HEAD would
have been called for the page so,

static inline void INIT_LIST_HEAD(struct list_head *list)
{
WRITE_ONCE(list->next, list);
list->prev = list;
}

The debug checks of concern in __list_del_entry_valid are:

CHECK_DATA_CORRUPTION(prev->next != entry,
"list_del corruption. prev->next should be %px, but was %px\n",
entry, prev->next) ||
CHECK_DATA_CORRUPTION(next->prev != entry,
"list_del corruption. next->prev should be %px, but was %px\n",
entry, next->prev))

Since, all pointers point back to the list(head) the check passes.  My
track record with the list routines is not so good, so I actually
forced list_del after INIT_LIST_HEAD with list debugging enabled and did
not enounter any issues.

Going forward, I agree it would be better to perhaps add a list_empty
check so that things do not blow up if the debugging code is changed.
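
Just as a sketch of that idea, the removal could be guarded with:

    if (!list_empty(&page->lru))
        list_del(&page->lru);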

At one time I also thought of splitting the functionality in
alloc_fresh_huge_page and prep_new_huge_page so that it would only
allocate/prep the page but not increment nr_huge_pages.  A new routine
would be used to increment the counter when it was actually put into use.
I thought this could be used when doing bulk adjustments in set_max_huge_pages
but the benefit would be minimal.  This seems like something that would
be useful in Oscar's alloc_and_dissolve_huge_page routine.
-- 
Mike Kravetz


[PATCH v4 0/8] make hugetlb put_page safe for all calling contexts

2021-04-05 Thread Mike Kravetz
IMPORTANT NOTE FOR REVIEWERS:  v3 of this series is in Andrew's mmotm
tree v5.12-rc5-mmotm-2021-03-31-21-27.  Muchun Song noticed that Oscar
Salvador's series "Make alloc_contig_range handle Hugetlb pages" was also
added to that mmotm version.  v3 of this series did not take Oscar's
changes into account.  v4 addresses those omissions.

v4 is based on the following:
- Start with v5.12-rc5-mmotm-2021-03-31-21-27
- Revert v3 of this series

Patch changes from v3:
1   - Trivial context fixup due to cma changes.  No functional changes
2, 3- No change
4, 5- Changes required due to "Make alloc_contig_range handle
  Hugetlb pages".  Specifically, alloc_and_dissolve_huge_page
  changes are in need of review.
6   - No change
7   - Fairly straightforward conversion of spin_*lock calls to spin_*lock_irq*
  calls in code from the "Make alloc_contig_range handle Hugetlb pages"
  series.
8   - Trivial change due to context changes.  No functional change.

If easier to review, I could also provide delta patches on top of
patches 4, 5, and 7 of v3.

Original cover letter follows:
This effort is the result of a recent bug report [1].  Syzbot found a
potential deadlock in the hugetlb put_page/free_huge_page_path.
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
Since the free_huge_page_path already has code to 'hand off' page
free requests to a workqueue, a suggestion was proposed to make
the in_irq() detection accurate by always enabling PREEMPT_COUNT [2].
The outcome of that discussion was that the hugetlb put_page path
(free_huge_page) should be properly fixed and made safe for all calling
contexts.

This series is based on v5.12-rc3-mmotm-2021-03-17-22-24.  At a high
level, the series provides:
- Patches 1 & 2 change CMA bitmap mutex to an irq safe spinlock
- Patch 3 adds a mutex for proc/sysfs interfaces changing hugetlb counts
- Patches 4, 5 & 6 are aimed at reducing lock hold times.  To be clear
  the goal is to eliminate single lock hold times of a long duration.
  Overall lock hold time is not addressed.
- Patch 7 makes hugetlb_lock and subpool lock IRQ safe.  It also reverts
  the code which defers calls to a workqueue if !in_task.
- Patch 8 adds some lockdep_assert_held() calls

[1] https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/
[2] http://lkml.kernel.org/r/20210311021321.127500-1-mike.krav...@oracle.com

v3 -> v4
- Add changes needed for the series "Make alloc_contig_range handle
  Hugetlb pages"

v2 -> v3
- Update commit message in patch 1 as suggested by Michal
- Do not use spin_lock_irqsave/spin_unlock_irqrestore when we know we
  are in task context as suggested by Michal
- Remove unnecessary INIT_LIST_HEAD() as suggested by Muchun

v1 -> v2
- Drop Roman's cma_release_nowait() patches and just change CMA mutex
  to an IRQ safe spinlock.
- Cleanups to variable names, comments and commit messages as suggested
  by Michal, Oscar, Miaohe and Muchun.
- Dropped unnecessary INIT_LIST_HEAD as suggested by Michal and list_del
  as suggested by Muchun.
- Created update_and_free_pages_bulk helper as suggested by Michal.
- Rebased on v5.12-rc4-mmotm-2021-03-28-16-37
- Added Acked-by: and Reviewed-by: from v1

RFC -> v1
- Add Roman's cma_release_nowait() patches.  This eliminated the need
  to do a workqueue handoff in hugetlb code.
- Use Michal's suggestion to batch pages for freeing.  This eliminated
  the need to recalculate loop control variables when dropping the lock.
- Added lockdep_assert_held() calls
- Rebased to v5.12-rc3-mmotm-2021-03-17-22-24


Mike Kravetz (8):
  mm/cma: change cma mutex to irq safe spinlock
  hugetlb: no need to drop hugetlb_lock to call cma_release
  hugetlb: add per-hstate mutex to synchronize user adjustments
  hugetlb: create remove_hugetlb_page() to separate functionality
  hugetlb: call update_and_free_page without hugetlb_lock
  hugetlb: change free_pool_huge_page to remove_pool_huge_page
  hugetlb: make free_huge_page irq safe
  hugetlb: add lockdep_assert_held() calls for hugetlb_lock

 include/linux/hugetlb.h |   1 +
 mm/cma.c|  18 +-
 mm/cma.h|   2 +-
 mm/cma_debug.c  |   8 +-
 mm/hugetlb.c| 384 +---
 mm/hugetlb_cgroup.c |   8 +-
 6 files changed, 218 insertions(+), 203 deletions(-)

-- 
2.30.2



[PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality

2021-04-05 Thread Mike Kravetz
The new remove_hugetlb_page() routine is designed to remove a hugetlb
page from hugetlbfs processing.  It will remove the page from the active
or free list, update global counters and set the compound page
destructor to NULL so that PageHuge() will return false for the 'page'.
After this call, the 'page' can be treated as a normal compound page or
a collection of base size pages.

update_and_free_page no longer decrements h->nr_huge_pages{_node} as
this is performed in remove_hugetlb_page.  The only functionality
performed by update_and_free_page is to free the base pages to the lower
level allocators.

update_and_free_page is typically called after remove_hugetlb_page.

remove_hugetlb_page is to be called with the hugetlb_lock held.

Creating this routine and separating functionality is in preparation for
restructuring code to reduce lock hold times.  This commit should not
introduce any changes to functionality.

Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c | 88 ++--
 1 file changed, 51 insertions(+), 37 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8497a3598c86..df2a3d1f632b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1055,18 +1055,13 @@ static bool vma_has_reserves(struct vm_area_struct 
*vma, long chg)
return false;
 }
 
-static void __enqueue_huge_page(struct list_head *list, struct page *page)
-{
-   list_move(&page->lru, list);
-   SetHPageFreed(page);
-}
-
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
int nid = page_to_nid(page);
-   __enqueue_huge_page(&h->hugepage_freelists[nid], page);
+   list_move(&page->lru, &h->hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
+   SetHPageFreed(page);
 }
 
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
@@ -1331,6 +1326,43 @@ static inline void destroy_compound_gigantic_page(struct 
page *page,
unsigned int order) { }
 #endif
 
+/*
+ * Remove hugetlb page from lists, and update dtor so that page appears
+ * as just a compound page.  A reference is held on the page.
+ *
+ * Must be called with hugetlb lock held.
+ */
+static void remove_hugetlb_page(struct hstate *h, struct page *page,
+   bool adjust_surplus)
+{
+   int nid = page_to_nid(page);
+
+   if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
+   return;
+
+   list_del(&page->lru);
+
+   if (HPageFreed(page)) {
+   h->free_huge_pages--;
+   h->free_huge_pages_node[nid]--;
+   ClearHPageFreed(page);
+   }
+   if (adjust_surplus) {
+   h->surplus_huge_pages--;
+   h->surplus_huge_pages_node[nid]--;
+   }
+
+   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
+   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
+
+   ClearHPageTemporary(page);
+   set_page_refcounted(page);
+   set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
+
+   h->nr_huge_pages--;
+   h->nr_huge_pages_node[nid]--;
+}
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
int i;
@@ -1339,8 +1371,6 @@ static void update_and_free_page(struct hstate *h, struct 
page *page)
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
 
-   h->nr_huge_pages--;
-   h->nr_huge_pages_node[page_to_nid(page)]--;
for (i = 0; i < pages_per_huge_page(h);
 i++, subpage = mem_map_next(subpage, page, i)) {
subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
@@ -1348,10 +1378,6 @@ static void update_and_free_page(struct hstate *h, 
struct page *page)
1 << PG_active | 1 << PG_private |
1 << PG_writeback);
}
-   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
-   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
-   set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
-   set_page_refcounted(page);
if (hstate_is_gigantic(h)) {
destroy_compound_gigantic_page(page, huge_page_order(h));
free_gigantic_page(page, huge_page_order(h));
@@ -1419,15 +1445,12 @@ static void __free_huge_page(struct page *page)
h->resv_huge_pages++;
 
if (HPageTemporary(page)) {
-   list_del(&page->lru);
-   ClearHPageTemporary(page);
+   remove_hugetlb_page(h, page, false);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
-   list_del(&page->lru);
+   remove_hugetlb_page(h, page, true);
up

[PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-04-05 Thread Mike Kravetz
cma_release is currently a sleepable operation because the bitmap
manipulation is protected by cma->lock mutex. Hugetlb code which relies
on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
irq safe.

The lock doesn't protect any sleepable operation so it can be changed to
an (irq aware) spin lock.  The bitmap processing should be quite fast in
the typical case, but if cma sizes grow to TB then we will likely need to
replace the lock with a more optimized bitmap implementation.
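As a rough illustration of what the lock protects (userspace mock, not the
cma code): the critical section is just a bounded bit-clear over the
allocation bitmap, which is why holding a spinlock with IRQs disabled is
acceptable for typical CMA sizes.

#include <limits.h>
#include <stdio.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Clear 'count' bits starting at 'start', like the kernel's bitmap_clear(). */
static void bitmap_clear_range(unsigned long *bitmap, unsigned long start,
			       unsigned long count)
{
	for (unsigned long bit = start; bit < start + count; bit++)
		bitmap[bit / BITS_PER_LONG] &= ~(1UL << (bit % BITS_PER_LONG));
}

int main(void)
{
	unsigned long bitmap[2] = { ~0UL, ~0UL };

	/* spin_lock_irqsave(&cma->lock, flags); would bracket this ... */
	bitmap_clear_range(bitmap, 4, 8);
	/* ... spin_unlock_irqrestore(&cma->lock, flags); */
	printf("word0=%lx word1=%lx\n", bitmap[0], bitmap[1]);
	return 0;
}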

Signed-off-by: Mike Kravetz 
---
 mm/cma.c   | 18 +-
 mm/cma.h   |  2 +-
 mm/cma_debug.c |  8 
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index f3bca4178c7f..995e15480937 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -24,7 +24,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long 
pfn,
 unsigned long count)
 {
unsigned long bitmap_no, bitmap_count;
+   unsigned long flags;
 
bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
bitmap_count = cma_bitmap_pages_to_bits(cma, count);
 
-   mutex_lock(&cma->lock);
+   spin_lock_irqsave(&cma->lock, flags);
    bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
-   mutex_unlock(&cma->lock);
+   spin_unlock_irqrestore(&cma->lock, flags);
 }
 
 static void __init cma_activate_area(struct cma *cma)
@@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
 pfn += pageblock_nr_pages)
init_cma_reserved_pageblock(pfn_to_page(pfn));
 
-   mutex_init(&cma->lock);
+   spin_lock_init(&cma->lock);
 
 #ifdef CONFIG_CMA_DEBUGFS
    INIT_HLIST_HEAD(&cma->mem_head);
@@ -392,7 +392,7 @@ static void cma_debug_show_areas(struct cma *cma)
unsigned long nr_part, nr_total = 0;
unsigned long nbits = cma_bitmap_maxno(cma);
 
-   mutex_lock(&cma->lock);
+   spin_lock_irq(&cma->lock);
pr_info("number of available pages: ");
for (;;) {
next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
@@ -407,7 +407,7 @@ static void cma_debug_show_areas(struct cma *cma)
start = next_zero_bit + nr_zero;
}
pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
-   mutex_unlock(&cma->lock);
+   spin_unlock_irq(&cma->lock);
 }
 #else
 static inline void cma_debug_show_areas(struct cma *cma) { }
@@ -454,12 +454,12 @@ struct page *cma_alloc(struct cma *cma, unsigned long 
count,
goto out;
 
for (;;) {
-   mutex_lock(&cma->lock);
+   spin_lock_irq(&cma->lock);
bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
bitmap_maxno, start, bitmap_count, mask,
offset);
if (bitmap_no >= bitmap_maxno) {
-   mutex_unlock(&cma->lock);
+   spin_unlock_irq(&cma->lock);
break;
}
bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
@@ -468,7 +468,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
 * our exclusive use. If the migration fails we will take the
 * lock again and unmark it.
 */
-   mutex_unlock(&cma->lock);
+   spin_unlock_irq(&cma->lock);
 
pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
diff --git a/mm/cma.h b/mm/cma.h
index 68ffad4e430d..2c775877eae2 100644
--- a/mm/cma.h
+++ b/mm/cma.h
@@ -15,7 +15,7 @@ struct cma {
unsigned long   count;
unsigned long   *bitmap;
unsigned int order_per_bit; /* Order of pages represented by one bit */
-   struct mutex    lock;
+   spinlock_t  lock;
 #ifdef CONFIG_CMA_DEBUGFS
struct hlist_head mem_head;
spinlock_t mem_head_lock;
diff --git a/mm/cma_debug.c b/mm/cma_debug.c
index d5bf8aa34fdc..2e7704955f4f 100644
--- a/mm/cma_debug.c
+++ b/mm/cma_debug.c
@@ -36,10 +36,10 @@ static int cma_used_get(void *data, u64 *val)
struct cma *cma = data;
unsigned long used;
 
-   mutex_lock(&cma->lock);
+   spin_lock_irq(&cma->lock);
/* pages counter is smaller than sizeof(int) */
used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
-   mutex_unlock(&cma->lock);
+   spin_unlock_irq(&cma->lock);
*val = (u64)used << cma->order_per_bit;
 
return 0;
@@ -53,7 +53,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
unsigned long start, end = 0;
unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
 
-   mutex_lock(&cma->lock);
+   spin_lock_irq(&cma->lock);
for (;;) 

[PATCH v4 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments

2021-04-05 Thread Mike Kravetz
The helper routine hstate_next_node_to_alloc accesses and modifies the
hstate variable next_nid_to_alloc.  The helper is used by the routines
alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
called with hugetlb_lock held.  However, alloc_pool_huge_page can not
be called with the hugetlb lock held as it will call the page allocator.
Two instances of alloc_pool_huge_page could be run in parallel or
alloc_pool_huge_page could run in parallel with adjust_pool_surplus
which may result in the variable next_nid_to_alloc becoming invalid
for the caller and pages being allocated on the wrong node.

Both alloc_pool_huge_page and adjust_pool_surplus are only called from
the routine set_max_huge_pages after boot.  set_max_huge_pages is only
called as the result of a user writing to the proc/sysfs nr_hugepages,
or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.

It makes little sense to allow multiple adjustments to the number of
hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
allow one hugetlb page adjustment at a time.  This will synchronize
modifications to the next_nid_to_alloc variable.
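To make the race concrete, here is a small userspace illustration (mock
types, not the kernel helper) of the round-robin cursor that resize_lock
now serializes; two interleaved, unsynchronized callers can both read the
same next_nid_to_alloc and end up allocating on the same node.

#include <stdio.h>

struct pool { int next_nid_to_alloc; int nr_nodes; };

/* Return the node to allocate from and advance the round-robin cursor. */
static int next_node_to_alloc(struct pool *p)
{
	int nid = p->next_nid_to_alloc;			/* read   */
	p->next_nid_to_alloc = (nid + 1) % p->nr_nodes;	/* update */
	return nid;
}

int main(void)
{
	struct pool p = { .next_nid_to_alloc = 0, .nr_nodes = 2 };

	/* Two racing callers, both reading before either writes back: */
	int a = p.next_nid_to_alloc;			/* caller A sees 0 */
	int b = p.next_nid_to_alloc;			/* caller B sees 0 too */
	p.next_nid_to_alloc = (a + 1) % p.nr_nodes;
	p.next_nid_to_alloc = (b + 1) % p.nr_nodes;
	printf("racing callers picked nodes %d and %d\n", a, b);

	/* With the resize_lock held, the calls are serialized instead: */
	int first = next_node_to_alloc(&p);
	int second = next_node_to_alloc(&p);
	printf("serialized callers picked nodes %d and %d\n", first, second);
	return 0;
}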

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Oscar Salvador 
Reviewed-by: Miaohe Lin 
Reviewed-by: Muchun Song 
---
 include/linux/hugetlb.h | 1 +
 mm/hugetlb.c| 8 
 2 files changed, 9 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d9b78e82652f..b92f25ccef58 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -566,6 +566,7 @@ HPAGEFLAG(Freed, freed)
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
+   struct mutex resize_lock;
int next_nid_to_alloc;
int next_nid_to_free;
unsigned int order;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1d62f0492e7b..8497a3598c86 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2730,6 +2730,11 @@ static int set_max_huge_pages(struct hstate *h, unsigned 
long count, int nid,
else
return -ENOMEM;
 
+   /*
+* resize_lock mutex prevents concurrent adjustments to number of
+* pages in hstate via the proc/sysfs interfaces.
+*/
+   mutex_lock(&h->resize_lock);
    spin_lock(&hugetlb_lock);
 
/*
@@ -2762,6 +2767,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned 
long count, int nid,
if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
if (count > persistent_huge_pages(h)) {
    spin_unlock(&hugetlb_lock);
+   mutex_unlock(&h->resize_lock);
NODEMASK_FREE(node_alloc_noretry);
return -EINVAL;
}
@@ -2836,6 +2842,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned 
long count, int nid,
 out:
h->max_huge_pages = persistent_huge_pages(h);
    spin_unlock(&hugetlb_lock);
+   mutex_unlock(&h->resize_lock);
 
NODEMASK_FREE(node_alloc_noretry);
 
@@ -3323,6 +3330,7 @@ void __init hugetlb_add_hstate(unsigned int order)
BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
BUG_ON(order == 0);
    h = &hstates[hugetlb_max_hstate++];
+   mutex_init(&h->resize_lock);
h->order = order;
h->mask = ~(huge_page_size(h) - 1);
for (i = 0; i < MAX_NUMNODES; ++i)
-- 
2.30.2



[PATCH v4 7/8] hugetlb: make free_huge_page irq safe

2021-04-05 Thread Mike Kravetz
Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
non-task context") was added to address the issue of free_huge_page
being called from irq context.  That commit hands off free_huge_page
processing to a workqueue if !in_task.  However, this doesn't cover
all the cases as pointed out by 0day bot lockdep report [1].

:  Possible interrupt unsafe locking scenario:
:
:        CPU0                    CPU1
:
:   lock(hugetlb_lock);
:                                local_irq_disable();
:                                lock(slock-AF_INET);
:                                lock(hugetlb_lock);
:   <Interrupt>
:     lock(slock-AF_INET);

Shakeel has later explained that this is very likely TCP TX zerocopy
from hugetlb pages scenario when the networking code drops a last
reference to hugetlb page while having IRQ disabled. Hugetlb freeing
path doesn't disable IRQ while holding hugetlb_lock so a lock dependency
chain can lead to a deadlock.

This commit addresses the issue by doing the following:
- Make hugetlb_lock irq safe.  This is mostly a simple process of
  changing spin_*lock calls to spin_*lock_irq* calls.
- Make subpool lock irq safe in a similar manner.
- Revert the !in_task check and workqueue handoff.
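As a userspace analogy of the problem and the fix (signals stand in for
hardware interrupts; the lock name is a mock), the issue is a plain
spinlock taken in a path that can also run from interrupt context; the
kernel fix above is the equivalent of blocking the "interrupt" around the
critical section via spin_lock_irqsave().

#include <pthread.h>
#include <signal.h>
#include <stdio.h>

static pthread_spinlock_t fake_hugetlb_lock;
static volatile sig_atomic_t would_deadlock;

/* The "interrupt" tries to take the lock the interrupted code holds. */
static void fake_irq(int sig)
{
	(void)sig;
	if (pthread_spin_trylock(&fake_hugetlb_lock) != 0)
		would_deadlock = 1;	/* a real IRQ would spin here forever */
	else
		pthread_spin_unlock(&fake_hugetlb_lock);
}

int main(void)
{
	pthread_spin_init(&fake_hugetlb_lock, PTHREAD_PROCESS_PRIVATE);
	signal(SIGALRM, fake_irq);

	pthread_spin_lock(&fake_hugetlb_lock);	/* task context holds the lock ... */
	raise(SIGALRM);				/* ... and an "interrupt" arrives   */
	pthread_spin_unlock(&fake_hugetlb_lock);

	printf("would deadlock: %s\n", would_deadlock ? "yes" : "no");
	return 0;
}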

[1] https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Muchun Song 
---
 mm/hugetlb.c| 183 +---
 mm/hugetlb_cgroup.c |   8 +-
 2 files changed, 74 insertions(+), 117 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 93a2a11b9376..15b6e7aadb52 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -93,9 +93,10 @@ static inline bool subpool_is_free(struct hugepage_subpool 
*spool)
return true;
 }
 
-static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
+static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
+   unsigned long irq_flags)
 {
-   spin_unlock(&spool->lock);
+   spin_unlock_irqrestore(&spool->lock, irq_flags);
 
/* If no pages are used, and no other handles to the subpool
 * remain, give up any reservations based on minimum size and
@@ -134,10 +135,12 @@ struct hugepage_subpool *hugepage_new_subpool(struct 
hstate *h, long max_hpages,
 
 void hugepage_put_subpool(struct hugepage_subpool *spool)
 {
-   spin_lock(&spool->lock);
+   unsigned long flags;
+
+   spin_lock_irqsave(&spool->lock, flags);
BUG_ON(!spool->count);
spool->count--;
-   unlock_or_release_subpool(spool);
+   unlock_or_release_subpool(spool, flags);
 }
 
 /*
@@ -156,7 +159,7 @@ static long hugepage_subpool_get_pages(struct 
hugepage_subpool *spool,
if (!spool)
return ret;
 
-   spin_lock(&spool->lock);
+   spin_lock_irq(&spool->lock);
 
if (spool->max_hpages != -1) {  /* maximum size accounting */
if ((spool->used_hpages + delta) <= spool->max_hpages)
@@ -183,7 +186,7 @@ static long hugepage_subpool_get_pages(struct 
hugepage_subpool *spool,
}
 
 unlock_ret:
-   spin_unlock(&spool->lock);
+   spin_unlock_irq(&spool->lock);
return ret;
 }
 
@@ -197,11 +200,12 @@ static long hugepage_subpool_put_pages(struct 
hugepage_subpool *spool,
   long delta)
 {
long ret = delta;
+   unsigned long flags;
 
if (!spool)
return delta;
 
-   spin_lock(&spool->lock);
+   spin_lock_irqsave(&spool->lock, flags);
 
if (spool->max_hpages != -1)/* maximum size accounting */
spool->used_hpages -= delta;
@@ -222,7 +226,7 @@ static long hugepage_subpool_put_pages(struct 
hugepage_subpool *spool,
 * If hugetlbfs_put_super couldn't free spool due to an outstanding
 * quota reference, free it now.
 */
-   unlock_or_release_subpool(spool);
+   unlock_or_release_subpool(spool, flags);
 
return ret;
 }
@@ -1407,7 +1411,7 @@ struct hstate *size_to_hstate(unsigned long size)
return NULL;
 }
 
-static void __free_huge_page(struct page *page)
+void free_huge_page(struct page *page)
 {
/*
 * Can't pass hstate in here because it is called from the
@@ -1417,6 +1421,7 @@ static void __free_huge_page(struct page *page)
int nid = page_to_nid(page);
struct hugepage_subpool *spool = hugetlb_page_subpool(page);
bool restore_reserve;
+   unsigned long flags;
 
VM_BUG_ON_PAGE(page_count(page), page);
VM_BUG_ON_PAGE(page_mapcount(page), page);
@@ -1445,7 +1450,7 @@ static void __free_huge_page(struct page *page)
restore_reserve = true;
}
 
-   spin_lock(&hugetlb_lock);
+   spin_lock_irqsave(&hugetlb_lock, flags);
ClearHPageMigratable(page);
hugetlb_cgroup_uncharge_page(hstate_index(h),
   

[PATCH v4 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock

2021-04-05 Thread Mike Kravetz
After making hugetlb lock irq safe and separating some functionality
done under the lock, add some lockdep_assert_held to help verify
locking.

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Miaohe Lin 
Reviewed-by: Muchun Song 
---
 mm/hugetlb.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 15b6e7aadb52..5d9f74e2b963 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1062,6 +1062,8 @@ static bool vma_has_reserves(struct vm_area_struct *vma, 
long chg)
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
int nid = page_to_nid(page);
+
+   lockdep_assert_held(&hugetlb_lock);
    list_move(&page->lru, &h->hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
@@ -1073,6 +1075,7 @@ static struct page *dequeue_huge_page_node_exact(struct 
hstate *h, int nid)
struct page *page;
bool pin = !!(current->flags & PF_MEMALLOC_PIN);
 
+   lockdep_assert_held(&hugetlb_lock);
    list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
if (pin && !is_pinnable_page(page))
continue;
@@ -1341,6 +1344,7 @@ static void remove_hugetlb_page(struct hstate *h, struct 
page *page,
 {
int nid = page_to_nid(page);
 
+   lockdep_assert_held(&hugetlb_lock);
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
 
@@ -1696,6 +1700,7 @@ static struct page *remove_pool_huge_page(struct hstate 
*h,
int nr_nodes, node;
struct page *page = NULL;
 
+   lockdep_assert_held(&hugetlb_lock);
for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
/*
 * If we're returning unused surplus pages, only examine
@@ -1945,6 +1950,7 @@ static int gather_surplus_pages(struct hstate *h, long 
delta)
long needed, allocated;
bool alloc_ok = true;
 
+   lockdep_assert_held(&hugetlb_lock);
needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
if (needed <= 0) {
h->resv_huge_pages += delta;
@@ -2038,6 +2044,7 @@ static void return_unused_surplus_pages(struct hstate *h,
struct page *page;
LIST_HEAD(page_list);
 
+   lockdep_assert_held(&hugetlb_lock);
/* Uncommit the reservation */
h->resv_huge_pages -= unused_resv_pages;
 
@@ -2640,6 +2647,7 @@ static void try_to_free_low(struct hstate *h, unsigned 
long count,
int i;
LIST_HEAD(page_list);
 
+   lockdep_assert_held(&hugetlb_lock);
if (hstate_is_gigantic(h))
return;
 
@@ -2681,6 +2689,7 @@ static int adjust_pool_surplus(struct hstate *h, 
nodemask_t *nodes_allowed,
 {
int nr_nodes, node;
 
+   lockdep_assert_held(&hugetlb_lock);
VM_BUG_ON(delta != -1 && delta != 1);
 
if (delta < 0) {
-- 
2.30.2



[PATCH v4 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release

2021-04-05 Thread Mike Kravetz
Now that cma_release is non-blocking and irq safe, there is no need to
drop hugetlb_lock before calling.

Signed-off-by: Mike Kravetz 
Acked-by: Roman Gushchin 
Acked-by: Michal Hocko 
---
 mm/hugetlb.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c3e4baa4156..1d62f0492e7b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1353,14 +1353,8 @@ static void update_and_free_page(struct hstate *h, 
struct page *page)
set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
set_page_refcounted(page);
if (hstate_is_gigantic(h)) {
-   /*
-* Temporarily drop the hugetlb_lock, because
-* we might block in free_gigantic_page().
-*/
-   spin_unlock(&hugetlb_lock);
    destroy_compound_gigantic_page(page, huge_page_order(h));
    free_gigantic_page(page, huge_page_order(h));
-   spin_lock(&hugetlb_lock);
} else {
__free_pages(page, huge_page_order(h));
}
-- 
2.30.2



[PATCH v4 5/8] hugetlb: call update_and_free_page without hugetlb_lock

2021-04-05 Thread Mike Kravetz
With the introduction of remove_hugetlb_page(), there is no need for
update_and_free_page to hold the hugetlb lock.  Change all callers to
drop the lock before calling.

With additional code modifications, this will allow loops which decrease
the huge page pool to drop the hugetlb_lock with each page to reduce
long hold times.

The ugly unlock/lock cycle in free_pool_huge_page will be removed in
a subsequent patch which restructures free_pool_huge_page.
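The calling pattern this moves to, mocked up in userspace (fake types and
lock, not the kernel code): the bookkeeping stays under the lock and the
potentially slow free of the base pages happens only after dropping it.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t fake_hugetlb_lock = PTHREAD_MUTEX_INITIALIZER;
static long nr_huge_pages = 1;

struct fake_page { char *mem; };

/* remove_hugetlb_page() analogue: counters and list work, lock held. */
static void remove_page(struct fake_page *page)
{
	(void)page;
	nr_huge_pages--;
}

/* update_and_free_page() analogue: return memory, lock already dropped. */
static void free_page_mem(struct fake_page *page)
{
	free(page->mem);
	page->mem = NULL;
}

int main(void)
{
	struct fake_page page = { .mem = malloc(4096) };

	pthread_mutex_lock(&fake_hugetlb_lock);
	remove_page(&page);			/* fast part, under the lock */
	pthread_mutex_unlock(&fake_hugetlb_lock);
	free_page_mem(&page);			/* slow part, lock dropped   */

	printf("nr_huge_pages=%ld\n", nr_huge_pages);
	return 0;
}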

Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c | 43 +--
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index df2a3d1f632b..be6031a8e2a9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1446,16 +1446,18 @@ static void __free_huge_page(struct page *page)
 
if (HPageTemporary(page)) {
remove_hugetlb_page(h, page, false);
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
remove_hugetlb_page(h, page, true);
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
} else {
arch_clear_hugepage_flags(page);
enqueue_huge_page(h, page);
+   spin_unlock(&hugetlb_lock);
    }
-   spin_unlock(&hugetlb_lock);
 }
 
 /*
@@ -1736,7 +1738,13 @@ static int free_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
list_entry(h->hugepage_freelists[node].next,
  struct page, lru);
remove_hugetlb_page(h, page, acct_surplus);
+   /*
+* unlock/lock around update_and_free_page is temporary
+* and will be removed with subsequent patch.
+*/
+   spin_unlock(&hugetlb_lock);
    update_and_free_page(h, page);
+   spin_lock(&hugetlb_lock);
ret = 1;
break;
}
@@ -1805,8 +1813,9 @@ int dissolve_free_huge_page(struct page *page)
}
remove_hugetlb_page(h, page, false);
h->max_huge_pages--;
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, head);
-   rc = 0;
+   return 0;
}
 out:
    spin_unlock(&hugetlb_lock);
@@ -2291,6 +2300,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, 
struct page *old_page,
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
int nid = page_to_nid(old_page);
struct page *new_page;
+   struct page *page_to_free;
int ret = 0;
 
/*
@@ -2313,16 +2323,16 @@ static int alloc_and_dissolve_huge_page(struct hstate 
*h, struct page *old_page,
 * Freed from under us. Drop new_page too.
 */
remove_hugetlb_page(h, new_page, false);
-   update_and_free_page(h, new_page);
-   goto unlock;
+   page_to_free = new_page;
+   goto unlock_free;
} else if (page_count(old_page)) {
/*
 * Someone has grabbed the page, try to isolate it here.
 * Fail with -EBUSY if not possible.
 */
remove_hugetlb_page(h, new_page, false);
-   update_and_free_page(h, new_page);
    spin_unlock(&hugetlb_lock);
+   update_and_free_page(h, new_page);
if (!isolate_huge_page(old_page, list))
ret = -EBUSY;
return ret;
@@ -2344,11 +2354,12 @@ static int alloc_and_dissolve_huge_page(struct hstate 
*h, struct page *old_page,
 * enqueue_huge_page for new page.  Net result is no change.
 */
remove_hugetlb_page(h, old_page, false);
-   update_and_free_page(h, old_page);
enqueue_huge_page(h, new_page);
+   page_to_free = old_page;
}
-unlock:
+unlock_free:
    spin_unlock(&hugetlb_lock);
+   update_and_free_page(h, page_to_free);
 
return ret;
 }
@@ -2671,22 +2682,34 @@ static void try_to_free_low(struct hstate *h, unsigned 
long count,
nodemask_t *nodes_allowed)
 {
int i;
+   struct page *page, *next;
+   LIST_HEAD(page_list);
 
if (hstate_is_gigantic(h))
return;
 
+   /*
+* Collect pages to be freed on a list, and free after dropping lock
+*/
for_each_node_mask(i, *nodes_allowed) {
-   struct page *page, *next;
    struct list_head *freel = &h->hugepage_freelists[i];
list_for_each_entry_safe(page, next, freel, lru) {
if (count >= h->nr_huge_pages)
-   

[PATCH v4 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page

2021-04-05 Thread Mike Kravetz
free_pool_huge_page was called with hugetlb_lock held.  It would remove
a hugetlb page, and then free the corresponding pages to the lower level
allocators such as buddy.  free_pool_huge_page was called in a loop to
remove hugetlb pages and these loops could hold the hugetlb_lock for a
considerable time.

Create new routine remove_pool_huge_page to replace free_pool_huge_page.
remove_pool_huge_page will remove the hugetlb page, and it must be
called with the hugetlb_lock held.  It will return the removed page and
it is the responsibility of the caller to free the page to the lower
level allocators.  The hugetlb_lock is dropped before freeing to these
allocators which results in shorter lock hold times.

Add new helper routine to call update_and_free_page for a list of pages.
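A userspace sketch of that bulk pattern (mock list and lock, not the kernel
code): pages are collected on a private list while the lock is held and
only freed, one by one, after it is dropped; the kernel version also
cond_resched()s between pages.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_page { struct fake_page *next; };

static pthread_mutex_t fake_hugetlb_lock = PTHREAD_MUTEX_INITIALIZER;
static struct fake_page *pool;		/* stand-in for the hstate free lists */

/* remove_pool_huge_page() analogue: unlink one page, lock held. */
static struct fake_page *remove_pool_page(void)
{
	struct fake_page *page = pool;

	if (page)
		pool = page->next;
	return page;
}

int main(void)
{
	struct fake_page *page, *next, *to_free = NULL;
	int freed = 0;

	for (int i = 0; i < 3; i++) {		/* build a small fake pool */
		page = malloc(sizeof(*page));
		if (!page)
			return 1;
		page->next = pool;
		pool = page;
	}

	pthread_mutex_lock(&fake_hugetlb_lock);
	while ((page = remove_pool_page()) != NULL) {
		page->next = to_free;		/* collect on a private list */
		to_free = page;
	}
	pthread_mutex_unlock(&fake_hugetlb_lock);

	/* update_and_free_pages_bulk() analogue, runs without the lock. */
	for (page = to_free; page; page = next) {
		next = page->next;
		free(page);
		freed++;
	}
	printf("freed %d pages outside the lock\n", freed);
	return 0;
}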

Note: Some changes to the routine return_unused_surplus_pages are in
need of explanation.  Commit e5bbc8a6c992 ("mm/hugetlb.c: fix reservation
race when freeing surplus pages") modified this routine to address a
race which could occur when dropping the hugetlb_lock in the loop that
removes pool pages.  Accounting changes introduced in that commit were
subtle and took some thought to understand.  This commit removes the
cond_resched_lock() and the potential race.  Therefore, remove the
subtle code and restore the more straightforward accounting, effectively
reverting the commit.

Signed-off-by: Mike Kravetz 
Reviewed-by: Muchun Song 
Acked-by: Michal Hocko 
---
 mm/hugetlb.c | 93 
 1 file changed, 51 insertions(+), 42 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index be6031a8e2a9..93a2a11b9376 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1204,7 +1204,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
 }
 
 /*
- * helper for free_pool_huge_page() - return the previously saved
+ * helper for remove_pool_huge_page() - return the previously saved
  * node ["this node"] from which to free a huge page.  Advance the
  * next node id whether or not we find a free huge page to free so
  * that the next attempt to free addresses the next node.
@@ -1386,6 +1386,16 @@ static void update_and_free_page(struct hstate *h, 
struct page *page)
}
 }
 
+static void update_and_free_pages_bulk(struct hstate *h, struct list_head 
*list)
+{
+   struct page *page, *t_page;
+
+   list_for_each_entry_safe(page, t_page, list, lru) {
+   update_and_free_page(h, page);
+   cond_resched();
+   }
+}
+
 struct hstate *size_to_hstate(unsigned long size)
 {
struct hstate *h;
@@ -1716,16 +1726,18 @@ static int alloc_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
 }
 
 /*
- * Free huge page from pool from next node to free.
- * Attempt to keep persistent huge pages more or less
- * balanced over allowed nodes.
+ * Remove huge page from pool from next node to free.  Attempt to keep
+ * persistent huge pages more or less balanced over allowed nodes.
+ * This routine only 'removes' the hugetlb page.  The caller must make
+ * an additional call to free the page to low level allocators.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
-bool acct_surplus)
+static struct page *remove_pool_huge_page(struct hstate *h,
+   nodemask_t *nodes_allowed,
+bool acct_surplus)
 {
int nr_nodes, node;
-   int ret = 0;
+   struct page *page = NULL;
 
for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
/*
@@ -1734,23 +1746,14 @@ static int free_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
 */
if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
!list_empty(>hugepage_freelists[node])) {
-   struct page *page =
-   list_entry(h->hugepage_freelists[node].next,
+   page = list_entry(h->hugepage_freelists[node].next,
  struct page, lru);
remove_hugetlb_page(h, page, acct_surplus);
-   /*
-* unlock/lock around update_and_free_page is temporary
-* and will be removed with subsequent patch.
-*/
-   spin_unlock(&hugetlb_lock);
-   update_and_free_page(h, page);
-   spin_lock(&hugetlb_lock);
-   ret = 1;
break;
}
}
 
-   return ret;
+   return page;
 }
 
 /*
@@ -2070,17 +2073,16 @@ static int gather_surplus_pages(struct hstate *h, long 
delta)
  *to the associated reservation map.
  * 2) Free any unused surplus pages that may have been all

Re: [External] [PATCH v3 7/8] hugetlb: make free_huge_page irq safe

2021-04-03 Thread Mike Kravetz
On 4/2/21 10:59 PM, Muchun Song wrote:
> On Sat, Apr 3, 2021 at 4:56 AM Mike Kravetz  wrote:
>>
>> On 4/2/21 5:47 AM, Muchun Song wrote:
>>> On Wed, Mar 31, 2021 at 11:42 AM Mike Kravetz  
>>> wrote:
>>>>
>>>> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
>>>> non-task context") was added to address the issue of free_huge_page
>>>> being called from irq context.  That commit hands off free_huge_page
>>>> processing to a workqueue if !in_task.  However, this doesn't cover
>>>> all the cases as pointed out by 0day bot lockdep report [1].
>>>>
>>>> :  Possible interrupt unsafe locking scenario:
>>>> :
>>>> :        CPU0                    CPU1
>>>> :
>>>> :   lock(hugetlb_lock);
>>>> :                                local_irq_disable();
>>>> :                                lock(slock-AF_INET);
>>>> :                                lock(hugetlb_lock);
>>>> :   <Interrupt>
>>>> :     lock(slock-AF_INET);
>>>>
>>>> Shakeel has later explained that this is very likely TCP TX zerocopy
>>>> from hugetlb pages scenario when the networking code drops a last
>>>> reference to hugetlb page while having IRQ disabled. Hugetlb freeing
>>>> path doesn't disable IRQ while holding hugetlb_lock so a lock dependency
>>>> chain can lead to a deadlock.
>>>>
>>>> This commit addresses the issue by doing the following:
>>>> - Make hugetlb_lock irq safe.  This is mostly a simple process of
>>>>   changing spin_*lock calls to spin_*lock_irq* calls.
>>>> - Make subpool lock irq safe in a similar manner.
>>>> - Revert the !in_task check and workqueue handoff.
>>>>
>>>> [1] 
>>>> https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/
>>>>
>>>> Signed-off-by: Mike Kravetz 
>>>> Acked-by: Michal Hocko 
>>>> Reviewed-by: Muchun Song 
>>>
>>> Hi Mike,
>>>
>>> Today I pulled the newest code (next-20210401). I found that
>>> alloc_and_dissolve_huge_page is not updated. In this function,
>>> hugetlb_lock is still non-irq safe. Maybe you or Oscar need
>>> to fix.
>>>
>>> Thanks.
>>
>> Thank you Muchun,
>>
>> Oscar's changes were not in Andrew's tree when I started on this series
>> and I failed to notice their inclusion.  In addition,
>> isolate_or_dissolve_huge_page also needs updating as well as a change in
>> set_max_huge_pages that was omitted while rebasing.
>>
>> Andrew, the following patch addresses those missing changes.  Ideally,
>> the changes should be combined/included in this patch.  If you want me
>> to sent another version of this patch or another series, let me know.
>>
>> From 450593eb3cea895f499ddc343c22424c552ea502 Mon Sep 17 00:00:00 2001
>> From: Mike Kravetz 
>> Date: Fri, 2 Apr 2021 13:18:13 -0700
>> Subject: [PATCH] hugetlb: fix irq locking omissions
>>
>> The patch "hugetlb: make free_huge_page irq safe" changed spin_*lock
>> calls to spin_*lock_irq* calls.  However, it missed several places
>> in the file hugetlb.c.  Add the overlooked changes.
>>
>> Signed-off-by: Mike Kravetz 
> 
> Thanks MIke. But there are still some places that need
> improvement. See below.
> 

Correct.  My apologies again for not fully taking into account the new
code from Oscar's series when working on this.

>> ---
>>  mm/hugetlb.c | 16 
>>  1 file changed, 8 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index c22111f3da20..a6bfc6bcbc81 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -2284,7 +2284,7 @@ static int alloc_and_dissolve_huge_page(struct hstate 
>> *h, struct page *old_page,
>>  */
>> page_ref_dec(new_page);
>>  retry:
>> -   spin_lock(&hugetlb_lock);
>> +   spin_lock_irq(&hugetlb_lock);
>> if (!PageHuge(old_page)) {
>> /*
>>  * Freed from under us. Drop new_page too.
>> @@ -2297,7 +2297,7 @@ static int alloc_and_dissolve_huge_page(struct hstate 
>> *h, struct page *old_page,
>>  * Fail with -EBUSY if not possible.
>>  */
>> update_and_free_page(h, new_page);
> 
> Now update_and_free_page can be called without holding
> hugetlb_lock. We can move it out of hugetlb_lock. In this
> function, there are 3 places which call update_and_free_page().
> We can move all of them out of hugetlb_lock. Right?

We will need to do more than that.
The call to update_and_free_page in alloc_and_dissolve_huge_page
assumes the old functionality, i.e. it assumes h->nr_huge_pages will be
decremented in update_and_free_page.  This is no longer the case.

This will need to be fixed in patch 4 of my series which changes the
functionality of update_and_free_page.  I'm afraid a change there will
end up requiring changes in subsequent patches due to context.

I will have an update on Monday.
-- 
Mike Kravetz


Re: [External] [PATCH v3 7/8] hugetlb: make free_huge_page irq safe

2021-04-02 Thread Mike Kravetz
On 4/2/21 5:47 AM, Muchun Song wrote:
> On Wed, Mar 31, 2021 at 11:42 AM Mike Kravetz  wrote:
>>
>> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
>> non-task context") was added to address the issue of free_huge_page
>> being called from irq context.  That commit hands off free_huge_page
>> processing to a workqueue if !in_task.  However, this doesn't cover
>> all the cases as pointed out by 0day bot lockdep report [1].
>>
>> :  Possible interrupt unsafe locking scenario:
>> :
>> :        CPU0                    CPU1
>> :
>> :   lock(hugetlb_lock);
>> :                                local_irq_disable();
>> :                                lock(slock-AF_INET);
>> :                                lock(hugetlb_lock);
>> :   <Interrupt>
>> :     lock(slock-AF_INET);
>>
>> Shakeel has later explained that this is very likely TCP TX zerocopy
>> from hugetlb pages scenario when the networking code drops a last
>> reference to hugetlb page while having IRQ disabled. Hugetlb freeing
>> path doesn't disable IRQ while holding hugetlb_lock so a lock dependency
>> chain can lead to a deadlock.
>>
>> This commit addresses the issue by doing the following:
>> - Make hugetlb_lock irq safe.  This is mostly a simple process of
>>   changing spin_*lock calls to spin_*lock_irq* calls.
>> - Make subpool lock irq safe in a similar manner.
>> - Revert the !in_task check and workqueue handoff.
>>
>> [1] https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/
>>
>> Signed-off-by: Mike Kravetz 
>> Acked-by: Michal Hocko 
>> Reviewed-by: Muchun Song 
> 
> Hi Mike,
> 
> Today I pulled the newest code (next-20210401). I found that
> alloc_and_dissolve_huge_page is not updated. In this function,
> hugetlb_lock is still non-irq safe. Maybe you or Oscar need
> to fix.
> 
> Thanks.

Thank you Muchun,

Oscar's changes were not in Andrew's tree when I started on this series
and I failed to notice their inclusion.  In addition,
isolate_or_dissolve_huge_page also needs updating as well as a change in
set_max_huge_pages that was omitted while rebasing.

Andrew, the following patch addresses those missing changes.  Ideally,
the changes should be combined/included in this patch.  If you want me
to send another version of this patch or another series, let me know.

>From 450593eb3cea895f499ddc343c22424c552ea502 Mon Sep 17 00:00:00 2001
From: Mike Kravetz 
Date: Fri, 2 Apr 2021 13:18:13 -0700
Subject: [PATCH] hugetlb: fix irq locking omissions

The patch "hugetlb: make free_huge_page irq safe" changed spin_*lock
calls to spin_*lock_irq* calls.  However, it missed several places
in the file hugetlb.c.  Add the overlooked changes.

Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c22111f3da20..a6bfc6bcbc81 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2284,7 +2284,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, 
struct page *old_page,
 */
page_ref_dec(new_page);
 retry:
-   spin_lock(&hugetlb_lock);
+   spin_lock_irq(&hugetlb_lock);
if (!PageHuge(old_page)) {
/*
 * Freed from under us. Drop new_page too.
@@ -2297,7 +2297,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, 
struct page *old_page,
 * Fail with -EBUSY if not possible.
 */
update_and_free_page(h, new_page);
-   spin_unlock(&hugetlb_lock);
+   spin_unlock_irq(&hugetlb_lock);
if (!isolate_huge_page(old_page, list))
ret = -EBUSY;
return ret;
@@ -2307,7 +2307,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, 
struct page *old_page,
 * freelist yet. Race window is small, so we can succed here if
 * we retry.
 */
-   spin_unlock(&hugetlb_lock);
+   spin_unlock_irq(&hugetlb_lock);
cond_resched();
goto retry;
} else {
@@ -2323,7 +2323,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, 
struct page *old_page,
    __enqueue_huge_page(&h->hugepage_freelists[nid], new_page);
}
 unlock:
-   spin_unlock(&hugetlb_lock);
+   spin_unlock_irq(&hugetlb_lock);
 
return ret;
 }
@@ -2339,15 +2339,15 @@ int isolate_or_dissolve_huge_page(struct page *page, 
struct list_head *list)
 * to carefully check the state under the lock.
 * Return success when racing as if we dissolved the page ourselves.
 */
-   spin_lock(&hugetlb_lock);
+   spin_lock_irq(&hugetlb_lock);
if (PageHuge(page)) {
head = compound_head(pa

[PATCH v3 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments

2021-03-30 Thread Mike Kravetz
The helper routine hstate_next_node_to_alloc accesses and modifies the
hstate variable next_nid_to_alloc.  The helper is used by the routines
alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
called with hugetlb_lock held.  However, alloc_pool_huge_page can not
be called with the hugetlb lock held as it will call the page allocator.
Two instances of alloc_pool_huge_page could be run in parallel or
alloc_pool_huge_page could run in parallel with adjust_pool_surplus
which may result in the variable next_nid_to_alloc becoming invalid
for the caller and pages being allocated on the wrong node.

Both alloc_pool_huge_page and adjust_pool_surplus are only called from
the routine set_max_huge_pages after boot.  set_max_huge_pages is only
called as the result of a user writing to the proc/sysfs nr_hugepages,
or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.

It makes little sense to allow multiple adjustments to the number of
hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
allow one hugetlb page adjustment at a time.  This will synchronize
modifications to the next_nid_to_alloc variable.

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Oscar Salvador 
Reviewed-by: Miaohe Lin 
Reviewed-by: Muchun Song 
---
 include/linux/hugetlb.h | 1 +
 mm/hugetlb.c| 8 
 2 files changed, 9 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d9b78e82652f..b92f25ccef58 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -566,6 +566,7 @@ HPAGEFLAG(Freed, freed)
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
+   struct mutex resize_lock;
int next_nid_to_alloc;
int next_nid_to_free;
unsigned int order;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1d62f0492e7b..8497a3598c86 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2730,6 +2730,11 @@ static int set_max_huge_pages(struct hstate *h, unsigned 
long count, int nid,
else
return -ENOMEM;
 
+   /*
+* resize_lock mutex prevents concurrent adjustments to number of
+* pages in hstate via the proc/sysfs interfaces.
+*/
+   mutex_lock(&h->resize_lock);
    spin_lock(&hugetlb_lock);
 
/*
@@ -2762,6 +2767,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned 
long count, int nid,
if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
if (count > persistent_huge_pages(h)) {
    spin_unlock(&hugetlb_lock);
+   mutex_unlock(&h->resize_lock);
NODEMASK_FREE(node_alloc_noretry);
return -EINVAL;
}
@@ -2836,6 +2842,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned 
long count, int nid,
 out:
h->max_huge_pages = persistent_huge_pages(h);
    spin_unlock(&hugetlb_lock);
+   mutex_unlock(&h->resize_lock);
 
NODEMASK_FREE(node_alloc_noretry);
 
@@ -3323,6 +3330,7 @@ void __init hugetlb_add_hstate(unsigned int order)
BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
BUG_ON(order == 0);
    h = &hstates[hugetlb_max_hstate++];
+   mutex_init(&h->resize_lock);
h->order = order;
h->mask = ~(huge_page_size(h) - 1);
for (i = 0; i < MAX_NUMNODES; ++i)
-- 
2.30.2



[PATCH v3 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page

2021-03-30 Thread Mike Kravetz
free_pool_huge_page was called with hugetlb_lock held.  It would remove
a hugetlb page, and then free the corresponding pages to the lower level
allocators such as buddy.  free_pool_huge_page was called in a loop to
remove hugetlb pages and these loops could hold the hugetlb_lock for a
considerable time.

Create new routine remove_pool_huge_page to replace free_pool_huge_page.
remove_pool_huge_page will remove the hugetlb page, and it must be
called with the hugetlb_lock held.  It will return the removed page and
it is the responsibility of the caller to free the page to the lower
level allocators.  The hugetlb_lock is dropped before freeing to these
allocators which results in shorter lock hold times.

Add new helper routine to call update_and_free_page for a list of pages.

Note: Some changes to the routine return_unused_surplus_pages are in
need of explanation.  Commit e5bbc8a6c992 ("mm/hugetlb.c: fix reservation
race when freeing surplus pages") modified this routine to address a
race which could occur when dropping the hugetlb_lock in the loop that
removes pool pages.  Accounting changes introduced in that commit were
subtle and took some thought to understand.  This commit removes the
cond_resched_lock() and the potential race.  Therefore, remove the
subtle code and restore the more straightforward accounting, effectively
reverting the commit.

Signed-off-by: Mike Kravetz 
Reviewed-by: Muchun Song 
Acked-by: Michal Hocko 
---
 mm/hugetlb.c | 93 
 1 file changed, 51 insertions(+), 42 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac4be941a3e5..5b2ca4663d7f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1209,7 +1209,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
 }
 
 /*
- * helper for free_pool_huge_page() - return the previously saved
+ * helper for remove_pool_huge_page() - return the previously saved
  * node ["this node"] from which to free a huge page.  Advance the
  * next node id whether or not we find a free huge page to free so
  * that the next attempt to free addresses the next node.
@@ -1391,6 +1391,16 @@ static void update_and_free_page(struct hstate *h, 
struct page *page)
}
 }
 
+static void update_and_free_pages_bulk(struct hstate *h, struct list_head 
*list)
+{
+   struct page *page, *t_page;
+
+   list_for_each_entry_safe(page, t_page, list, lru) {
+   update_and_free_page(h, page);
+   cond_resched();
+   }
+}
+
 struct hstate *size_to_hstate(unsigned long size)
 {
struct hstate *h;
@@ -1721,16 +1731,18 @@ static int alloc_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
 }
 
 /*
- * Free huge page from pool from next node to free.
- * Attempt to keep persistent huge pages more or less
- * balanced over allowed nodes.
+ * Remove huge page from pool from next node to free.  Attempt to keep
+ * persistent huge pages more or less balanced over allowed nodes.
+ * This routine only 'removes' the hugetlb page.  The caller must make
+ * an additional call to free the page to low level allocators.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
-bool acct_surplus)
+static struct page *remove_pool_huge_page(struct hstate *h,
+   nodemask_t *nodes_allowed,
+bool acct_surplus)
 {
int nr_nodes, node;
-   int ret = 0;
+   struct page *page = NULL;
 
for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
/*
@@ -1739,23 +1751,14 @@ static int free_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
 */
if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
!list_empty(>hugepage_freelists[node])) {
-   struct page *page =
-   list_entry(h->hugepage_freelists[node].next,
+   page = list_entry(h->hugepage_freelists[node].next,
  struct page, lru);
remove_hugetlb_page(h, page, acct_surplus);
-   /*
-* unlock/lock around update_and_free_page is temporary
-* and will be removed with subsequent patch.
-*/
-   spin_unlock(&hugetlb_lock);
-   update_and_free_page(h, page);
-   spin_lock(&hugetlb_lock);
-   ret = 1;
break;
}
}
 
-   return ret;
+   return page;
 }
 
 /*
@@ -2075,17 +2078,16 @@ static int gather_surplus_pages(struct hstate *h, long 
delta)
  *to the associated reservation map.
  * 2) Free any unused surplus pages that may have been all

[PATCH v3 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release

2021-03-30 Thread Mike Kravetz
Now that cma_release is non-blocking and irq safe, there is no need to
drop hugetlb_lock before calling.

Signed-off-by: Mike Kravetz 
Acked-by: Roman Gushchin 
Acked-by: Michal Hocko 
---
 mm/hugetlb.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c3e4baa4156..1d62f0492e7b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1353,14 +1353,8 @@ static void update_and_free_page(struct hstate *h, 
struct page *page)
set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
set_page_refcounted(page);
if (hstate_is_gigantic(h)) {
-   /*
-* Temporarily drop the hugetlb_lock, because
-* we might block in free_gigantic_page().
-*/
-   spin_unlock(&hugetlb_lock);
    destroy_compound_gigantic_page(page, huge_page_order(h));
    free_gigantic_page(page, huge_page_order(h));
-   spin_lock(&hugetlb_lock);
} else {
__free_pages(page, huge_page_order(h));
}
-- 
2.30.2



[PATCH v3 7/8] hugetlb: make free_huge_page irq safe

2021-03-30 Thread Mike Kravetz
Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
non-task context") was added to address the issue of free_huge_page
being called from irq context.  That commit hands off free_huge_page
processing to a workqueue if !in_task.  However, this doesn't cover
all the cases as pointed out by 0day bot lockdep report [1].

:  Possible interrupt unsafe locking scenario:
:
:        CPU0                    CPU1
:
:   lock(hugetlb_lock);
:                                local_irq_disable();
:                                lock(slock-AF_INET);
:                                lock(hugetlb_lock);
:   <Interrupt>
:     lock(slock-AF_INET);

Shakeel has later explained that this is very likely TCP TX zerocopy
from hugetlb pages scenario when the networking code drops a last
reference to hugetlb page while having IRQ disabled. Hugetlb freeing
path doesn't disable IRQ while holding hugetlb_lock so a lock dependency
chain can lead to a deadlock.

This commit addresses the issue by doing the following:
- Make hugetlb_lock irq safe.  This is mostly a simple process of
  changing spin_*lock calls to spin_*lock_irq* calls.
- Make subpool lock irq safe in a similar manner.
- Revert the !in_task check and workqueue handoff.

[1] https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Muchun Song 
---
 mm/hugetlb.c| 167 
 mm/hugetlb_cgroup.c |   8 +--
 2 files changed, 66 insertions(+), 109 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5b2ca4663d7f..0bd4dc04df0f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -93,9 +93,10 @@ static inline bool subpool_is_free(struct hugepage_subpool 
*spool)
return true;
 }
 
-static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
+static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
+   unsigned long irq_flags)
 {
-   spin_unlock(&spool->lock);
+   spin_unlock_irqrestore(&spool->lock, irq_flags);
 
/* If no pages are used, and no other handles to the subpool
 * remain, give up any reservations based on minimum size and
@@ -134,10 +135,12 @@ struct hugepage_subpool *hugepage_new_subpool(struct 
hstate *h, long max_hpages,
 
 void hugepage_put_subpool(struct hugepage_subpool *spool)
 {
-   spin_lock(&spool->lock);
+   unsigned long flags;
+
+   spin_lock_irqsave(&spool->lock, flags);
BUG_ON(!spool->count);
spool->count--;
-   unlock_or_release_subpool(spool);
+   unlock_or_release_subpool(spool, flags);
 }
 
 /*
@@ -156,7 +159,7 @@ static long hugepage_subpool_get_pages(struct 
hugepage_subpool *spool,
if (!spool)
return ret;
 
-   spin_lock(&spool->lock);
+   spin_lock_irq(&spool->lock);
 
if (spool->max_hpages != -1) {  /* maximum size accounting */
if ((spool->used_hpages + delta) <= spool->max_hpages)
@@ -183,7 +186,7 @@ static long hugepage_subpool_get_pages(struct 
hugepage_subpool *spool,
}
 
 unlock_ret:
-   spin_unlock(&spool->lock);
+   spin_unlock_irq(&spool->lock);
return ret;
 }
 
@@ -197,11 +200,12 @@ static long hugepage_subpool_put_pages(struct 
hugepage_subpool *spool,
   long delta)
 {
long ret = delta;
+   unsigned long flags;
 
if (!spool)
return delta;
 
-   spin_lock(&spool->lock);
+   spin_lock_irqsave(&spool->lock, flags);
 
if (spool->max_hpages != -1)/* maximum size accounting */
spool->used_hpages -= delta;
@@ -222,7 +226,7 @@ static long hugepage_subpool_put_pages(struct 
hugepage_subpool *spool,
 * If hugetlbfs_put_super couldn't free spool due to an outstanding
 * quota reference, free it now.
 */
-   unlock_or_release_subpool(spool);
+   unlock_or_release_subpool(spool, flags);
 
return ret;
 }
@@ -1412,7 +1416,7 @@ struct hstate *size_to_hstate(unsigned long size)
return NULL;
 }
 
-static void __free_huge_page(struct page *page)
+void free_huge_page(struct page *page)
 {
/*
 * Can't pass hstate in here because it is called from the
@@ -1422,6 +1426,7 @@ static void __free_huge_page(struct page *page)
int nid = page_to_nid(page);
struct hugepage_subpool *spool = hugetlb_page_subpool(page);
bool restore_reserve;
+   unsigned long flags;
 
VM_BUG_ON_PAGE(page_count(page), page);
VM_BUG_ON_PAGE(page_mapcount(page), page);
@@ -1450,7 +1455,7 @@ static void __free_huge_page(struct page *page)
restore_reserve = true;
}
 
-   spin_lock(&hugetlb_lock);
+   spin_lock_irqsave(&hugetlb_lock, flags);
ClearHPageMigratable(page);
hugetlb_cgroup_uncharge_page(hstate_index(h),
   

[PATCH v3 4/8] hugetlb: create remove_hugetlb_page() to separate functionality

2021-03-30 Thread Mike Kravetz
The new remove_hugetlb_page() routine is designed to remove a hugetlb
page from hugetlbfs processing.  It will remove the page from the active
or free list, update global counters and set the compound page
destructor to NULL so that PageHuge() will return false for the 'page'.
After this call, the 'page' can be treated as a normal compound page or
a collection of base size pages.

remove_hugetlb_page is to be called with the hugetlb_lock held.

Creating this routine and separating functionality is in preparation for
restructuring code to reduce lock hold times.  This commit should not
introduce any changes to functionality.

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Miaohe Lin 
Reviewed-by: Muchun Song 
---
 mm/hugetlb.c | 67 
 1 file changed, 42 insertions(+), 25 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8497a3598c86..16beae49 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1331,6 +1331,43 @@ static inline void destroy_compound_gigantic_page(struct 
page *page,
unsigned int order) { }
 #endif
 
+/*
+ * Remove hugetlb page from lists, and update dtor so that page appears
+ * as just a compound page.  A reference is held on the page.
+ *
+ * Must be called with hugetlb lock held.
+ */
+static void remove_hugetlb_page(struct hstate *h, struct page *page,
+   bool adjust_surplus)
+{
+   int nid = page_to_nid(page);
+
+   if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
+   return;
+
+   list_del(&page->lru);
+
+   if (HPageFreed(page)) {
+   h->free_huge_pages--;
+   h->free_huge_pages_node[nid]--;
+   ClearHPageFreed(page);
+   }
+   if (adjust_surplus) {
+   h->surplus_huge_pages--;
+   h->surplus_huge_pages_node[nid]--;
+   }
+
+   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
+   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
+
+   ClearHPageTemporary(page);
+   set_page_refcounted(page);
+   set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
+
+   h->nr_huge_pages--;
+   h->nr_huge_pages_node[nid]--;
+}
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
int i;
@@ -1339,8 +1376,6 @@ static void update_and_free_page(struct hstate *h, struct 
page *page)
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
 
-   h->nr_huge_pages--;
-   h->nr_huge_pages_node[page_to_nid(page)]--;
for (i = 0; i < pages_per_huge_page(h);
 i++, subpage = mem_map_next(subpage, page, i)) {
subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
@@ -1348,10 +1383,6 @@ static void update_and_free_page(struct hstate *h, 
struct page *page)
1 << PG_active | 1 << PG_private |
1 << PG_writeback);
}
-   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
-   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
-   set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
-   set_page_refcounted(page);
if (hstate_is_gigantic(h)) {
destroy_compound_gigantic_page(page, huge_page_order(h));
free_gigantic_page(page, huge_page_order(h));
@@ -1419,15 +1450,12 @@ static void __free_huge_page(struct page *page)
h->resv_huge_pages++;
 
if (HPageTemporary(page)) {
-   list_del(&page->lru);
-   ClearHPageTemporary(page);
+   remove_hugetlb_page(h, page, false);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
-   list_del(&page->lru);
+   remove_hugetlb_page(h, page, true);
update_and_free_page(h, page);
-   h->surplus_huge_pages--;
-   h->surplus_huge_pages_node[nid]--;
} else {
arch_clear_hugepage_flags(page);
enqueue_huge_page(h, page);
@@ -1712,13 +1740,7 @@ static int free_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
struct page *page =
list_entry(h->hugepage_freelists[node].next,
  struct page, lru);
-   list_del(&page->lru);
-   h->free_huge_pages--;
-   h->free_huge_pages_node[node]--;
-   if (acct_surplus) {
-   h->surplus_huge_pages--;
-   h->surplus_huge_pages_node[node]--;
-   }
+   remove_huge

[PATCH v3 0/8] make hugetlb put_page safe for all calling contexts

2021-03-30 Thread Mike Kravetz
This effort is the result of a recent bug report [1].  Syzbot found a
potential deadlock in the hugetlb put_page/free_huge_page_path.
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
Since the free_huge_page_path already has code to 'hand off' page
free requests to a workqueue, a suggestion was proposed to make
the in_irq() detection accurate by always enabling PREEMPT_COUNT [2].
The outcome of that discussion was that the hugetlb put_page path
(free_huge_page) should be properly fixed and made safe for all calling
contexts.

This series is based on v5.12-rc3-mmotm-2021-03-17-22-24.  At a high
level, the series provides:
- Patches 1 & 2 change CMA bitmap mutex to an irq safe spinlock
- Patch 3 adds a mutex for proc/sysfs interfaces changing hugetlb counts
- Patches 4, 5 & 6 are aimed at reducing lock hold times.  To be clear
  the goal is to eliminate single lock hold times of a long duration.
  Overall lock hold time is not addressed.
- Patch 7 makes hugetlb_lock and subpool lock IRQ safe.  It also reverts
  the code which defers calls to a workqueue if !in_task.
- Patch 8 adds some lockdep_assert_held() calls

[1] https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/
[2] http://lkml.kernel.org/r/20210311021321.127500-1-mike.krav...@oracle.com

v2 -> v3
- Update commit message in patch 1 as suggested by Michal
- Do not use spin_lock_irqsave/spin_unlock_irqrestore when we know we
  are in task context as suggested by Michal
- Remove unnecessary INIT_LIST_HEAD() as suggested by Muchun
  
v1 -> v2
- Drop Roman's cma_release_nowait() patches and just change CMA mutex
  to an IRQ safe spinlock.
- Cleanups to variable names, comments and commit messages as suggested
  by Michal, Oscar, Miaohe and Muchun.
- Dropped unnecessary INIT_LIST_HEAD as suggested by Michal and list_del
  as suggested by Muchun.
- Created update_and_free_pages_bulk helper as suggested by Michal.
- Rebased on v5.12-rc4-mmotm-2021-03-28-16-37
- Added Acked-by: and Reviewed-by: from v1

RFC -> v1
- Add Roman's cma_release_nowait() patches.  This eliminated the need
  to do a workqueue handoff in hugetlb code.
- Use Michal's suggestion to batch pages for freeing.  This eliminated
  the need to recalculate loop control variables when dropping the lock.
- Added lockdep_assert_held() calls
- Rebased to v5.12-rc3-mmotm-2021-03-17-22-24

Mike Kravetz (8):
  mm/cma: change cma mutex to irq safe spinlock
  hugetlb: no need to drop hugetlb_lock to call cma_release
  hugetlb: add per-hstate mutex to synchronize user adjustments
  hugetlb: create remove_hugetlb_page() to separate functionality
  hugetlb: call update_and_free_page without hugetlb_lock
  hugetlb: change free_pool_huge_page to remove_pool_huge_page
  hugetlb: make free_huge_page irq safe
  hugetlb: add lockdep_assert_held() calls for hugetlb_lock

 include/linux/hugetlb.h |   1 +
 mm/cma.c|  18 +--
 mm/cma.h|   2 +-
 mm/cma_debug.c  |   8 +-
 mm/hugetlb.c| 337 +---
 mm/hugetlb_cgroup.c |   8 +-
 6 files changed, 195 insertions(+), 179 deletions(-)

-- 
2.30.2



[PATCH v3 5/8] hugetlb: call update_and_free_page without hugetlb_lock

2021-03-30 Thread Mike Kravetz
With the introduction of remove_hugetlb_page(), there is no need for
update_and_free_page to hold the hugetlb lock.  Change all callers to
drop the lock before calling.

With additional code modifications, this will allow loops which decrease
the huge page pool to drop the hugetlb_lock with each page to reduce
long hold times.

The ugly unlock/lock cycle in free_pool_huge_page will be removed in
a subsequent patch which restructures free_pool_huge_page.

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Muchun Song 
Reviewed-by: Miaohe Lin 
---
 mm/hugetlb.c | 31 ++-
 1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 16beae49..ac4be941a3e5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1451,16 +1451,18 @@ static void __free_huge_page(struct page *page)
 
if (HPageTemporary(page)) {
remove_hugetlb_page(h, page, false);
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
remove_hugetlb_page(h, page, true);
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
} else {
arch_clear_hugepage_flags(page);
enqueue_huge_page(h, page);
+   spin_unlock(&hugetlb_lock);
}
-   spin_unlock(&hugetlb_lock);
 }
 
 /*
@@ -1741,7 +1743,13 @@ static int free_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
list_entry(h->hugepage_freelists[node].next,
  struct page, lru);
remove_hugetlb_page(h, page, acct_surplus);
+   /*
+* unlock/lock around update_and_free_page is temporary
+* and will be removed with subsequent patch.
+*/
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
+   spin_lock(&hugetlb_lock);
ret = 1;
break;
}
@@ -1810,8 +1818,9 @@ int dissolve_free_huge_page(struct page *page)
}
remove_hugetlb_page(h, page, false);
h->max_huge_pages--;
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, head);
-   rc = 0;
+   return 0;
}
 out:
spin_unlock(&hugetlb_lock);
@@ -2674,22 +2683,34 @@ static void try_to_free_low(struct hstate *h, unsigned 
long count,
nodemask_t *nodes_allowed)
 {
int i;
+   struct page *page, *next;
+   LIST_HEAD(page_list);
 
if (hstate_is_gigantic(h))
return;
 
+   /*
+* Collect pages to be freed on a list, and free after dropping lock
+*/
for_each_node_mask(i, *nodes_allowed) {
-   struct page *page, *next;
struct list_head *freel = &h->hugepage_freelists[i];
list_for_each_entry_safe(page, next, freel, lru) {
if (count >= h->nr_huge_pages)
-   return;
+   goto out;
if (PageHighMem(page))
continue;
remove_hugetlb_page(h, page, false);
-   update_and_free_page(h, page);
+   list_add(&page->lru, &page_list);
}
}
+
+out:
+   spin_unlock(&hugetlb_lock);
+   list_for_each_entry_safe(page, next, &page_list, lru) {
+   update_and_free_page(h, page);
+   cond_resched();
+   }
+   spin_lock(&hugetlb_lock);
 }
 #else
 static inline void try_to_free_low(struct hstate *h, unsigned long count,
-- 
2.30.2
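
Editorial note: taken together, patches 4 and 5 above establish the pattern
sketched below.  This is a simplified illustration, not the literal kernel
code; teardown_hugetlb_page() is a hypothetical wrapper shown only to make
the shape of the callers explicit.  At this point in the series hugetlb_lock
is still a plain (non-IRQ) spinlock; patch 7 converts it later.

	/*
	 * Phase 1 runs under hugetlb_lock and only updates hugetlb
	 * bookkeeping; phase 2 runs without the lock and may take a long
	 * time (and, for gigantic/CMA pages, may block).
	 */
	static void teardown_hugetlb_page(struct hstate *h, struct page *page,
					  bool adjust_surplus)
	{
		spin_lock(&hugetlb_lock);
		/* unlink from free/active lists, fix counters, clear dtor */
		remove_hugetlb_page(h, page, adjust_surplus);
		spin_unlock(&hugetlb_lock);

		/* hand the memory back to buddy/CMA; lock no longer held */
		update_and_free_page(h, page);
	}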



[PATCH v3 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-03-30 Thread Mike Kravetz
cma_release is currently a sleepable operation because the bitmap
manipulation is protected by cma->lock mutex. Hugetlb code which relies
on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
irq safe.

The lock doesn't protect any sleepable operation so it can be changed to
a (irq aware) spin lock. The bitmap processing should be quite fast in
typical case but if cma sizes grow to TB then we will likely need to
replace the lock by a more optimized bitmap implementation.

Signed-off-by: Mike Kravetz 
---
 mm/cma.c   | 18 +-
 mm/cma.h   |  2 +-
 mm/cma_debug.c |  8 
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index b2393b892d3b..2380f2571eb5 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -24,7 +24,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long 
pfn,
 unsigned int count)
 {
unsigned long bitmap_no, bitmap_count;
+   unsigned long flags;
 
bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
bitmap_count = cma_bitmap_pages_to_bits(cma, count);
 
-   mutex_lock(&cma->lock);
+   spin_lock_irqsave(&cma->lock, flags);
bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
-   mutex_unlock(&cma->lock);
+   spin_unlock_irqrestore(&cma->lock, flags);
 }
 
 static void __init cma_activate_area(struct cma *cma)
@@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
 pfn += pageblock_nr_pages)
init_cma_reserved_pageblock(pfn_to_page(pfn));
 
-   mutex_init(&cma->lock);
+   spin_lock_init(&cma->lock);
 
 #ifdef CONFIG_CMA_DEBUGFS
INIT_HLIST_HEAD(&cma->mem_head);
@@ -392,7 +392,7 @@ static void cma_debug_show_areas(struct cma *cma)
unsigned long nr_part, nr_total = 0;
unsigned long nbits = cma_bitmap_maxno(cma);
 
-   mutex_lock(&cma->lock);
+   spin_lock_irq(&cma->lock);
pr_info("number of available pages: ");
for (;;) {
next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
@@ -407,7 +407,7 @@ static void cma_debug_show_areas(struct cma *cma)
start = next_zero_bit + nr_zero;
}
pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
-   mutex_unlock(&cma->lock);
+   spin_unlock_irq(&cma->lock);
 }
 #else
 static inline void cma_debug_show_areas(struct cma *cma) { }
@@ -454,12 +454,12 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
goto out;
 
for (;;) {
-   mutex_lock(&cma->lock);
+   spin_lock_irq(&cma->lock);
bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
bitmap_maxno, start, bitmap_count, mask,
offset);
if (bitmap_no >= bitmap_maxno) {
-   mutex_unlock(&cma->lock);
+   spin_unlock_irq(&cma->lock);
break;
}
bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
@@ -468,7 +468,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
 * our exclusive use. If the migration fails we will take the
 * lock again and unmark it.
 */
-   mutex_unlock(&cma->lock);
+   spin_unlock_irq(&cma->lock);
 
pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
diff --git a/mm/cma.h b/mm/cma.h
index 68ffad4e430d..2c775877eae2 100644
--- a/mm/cma.h
+++ b/mm/cma.h
@@ -15,7 +15,7 @@ struct cma {
unsigned long   count;
unsigned long   *bitmap;
unsigned int order_per_bit; /* Order of pages represented by one bit */
-   struct mutexlock;
+   spinlock_t  lock;
 #ifdef CONFIG_CMA_DEBUGFS
struct hlist_head mem_head;
spinlock_t mem_head_lock;
diff --git a/mm/cma_debug.c b/mm/cma_debug.c
index d5bf8aa34fdc..2e7704955f4f 100644
--- a/mm/cma_debug.c
+++ b/mm/cma_debug.c
@@ -36,10 +36,10 @@ static int cma_used_get(void *data, u64 *val)
struct cma *cma = data;
unsigned long used;
 
-   mutex_lock(&cma->lock);
+   spin_lock_irq(&cma->lock);
/* pages counter is smaller than sizeof(int) */
used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
-   mutex_unlock(&cma->lock);
+   spin_unlock_irq(&cma->lock);
*val = (u64)used << cma->order_per_bit;
 
return 0;
@@ -53,7 +53,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
unsigned long start, end = 0;
unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
 
-   mutex_lock(&cma->lock);
+   spin_lock_irq(&cma->lock);
   

[PATCH v3 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock

2021-03-30 Thread Mike Kravetz
After making hugetlb lock irq safe and separating some functionality
done under the lock, add some lockdep_assert_held to help verify
locking.

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Miaohe Lin 
Reviewed-by: Muchun Song 
---
 mm/hugetlb.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0bd4dc04df0f..c22111f3da20 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1068,6 +1068,8 @@ static void __enqueue_huge_page(struct list_head *list, 
struct page *page)
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
int nid = page_to_nid(page);
+
+   lockdep_assert_held(&hugetlb_lock);
__enqueue_huge_page(&h->hugepage_freelists[nid], page);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
@@ -1078,6 +1080,7 @@ static struct page *dequeue_huge_page_node_exact(struct 
hstate *h, int nid)
struct page *page;
bool pin = !!(current->flags & PF_MEMALLOC_PIN);
 
+   lockdep_assert_held(&hugetlb_lock);
list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
if (pin && !is_pinnable_page(page))
continue;
@@ -1346,6 +1349,7 @@ static void remove_hugetlb_page(struct hstate *h, struct 
page *page,
 {
int nid = page_to_nid(page);
 
+   lockdep_assert_held(&hugetlb_lock);
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
 
@@ -1701,6 +1705,7 @@ static struct page *remove_pool_huge_page(struct hstate 
*h,
int nr_nodes, node;
struct page *page = NULL;
 
+   lockdep_assert_held(&hugetlb_lock);
for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
/*
 * If we're returning unused surplus pages, only examine
@@ -1950,6 +1955,7 @@ static int gather_surplus_pages(struct hstate *h, long 
delta)
long needed, allocated;
bool alloc_ok = true;
 
+   lockdep_assert_held(&hugetlb_lock);
needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
if (needed <= 0) {
h->resv_huge_pages += delta;
@@ -2043,6 +2049,7 @@ static void return_unused_surplus_pages(struct hstate *h,
struct page *page;
LIST_HEAD(page_list);
 
+   lockdep_assert_held(&hugetlb_lock);
/* Uncommit the reservation */
h->resv_huge_pages -= unused_resv_pages;
 
@@ -2641,6 +2648,7 @@ static void try_to_free_low(struct hstate *h, unsigned 
long count,
int i;
LIST_HEAD(page_list);
 
+   lockdep_assert_held(&hugetlb_lock);
if (hstate_is_gigantic(h))
return;
 
@@ -2682,6 +2690,7 @@ static int adjust_pool_surplus(struct hstate *h, 
nodemask_t *nodes_allowed,
 {
int nr_nodes, node;
 
+   lockdep_assert_held(&hugetlb_lock);
VM_BUG_ON(delta != -1 && delta != 1);
 
if (delta < 0) {
-- 
2.30.2
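
Editorial note: a minimal sketch of what the added assertions buy, assuming
a kernel built with CONFIG_PROVE_LOCKING (without lockdep the calls compile
away).  Both functions below are hypothetical and exist only for
illustration.

	/* Hypothetical buggy caller. */
	static void buggy_return_page(struct hstate *h, struct page *page)
	{
		/*
		 * BUG: enqueue_huge_page() requires hugetlb_lock.  With the
		 * lockdep_assert_held() added by this patch, lockdep warns
		 * right here; without it, the free list and free_huge_pages
		 * counters could be corrupted silently.
		 */
		enqueue_huge_page(h, page);
	}

	/* Correct usage; hugetlb_lock is IRQ safe as of patch 7. */
	static void return_page(struct hstate *h, struct page *page)
	{
		spin_lock_irq(&hugetlb_lock);
		enqueue_huge_page(h, page);
		spin_unlock_irq(&hugetlb_lock);
	}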



Re: [External] [PATCH v2 5/8] hugetlb: call update_and_free_page without hugetlb_lock

2021-03-30 Thread Mike Kravetz
On 3/29/21 7:21 PM, Muchun Song wrote:
> On Tue, Mar 30, 2021 at 7:24 AM Mike Kravetz  wrote:
>>
>> With the introduction of remove_hugetlb_page(), there is no need for
>> update_and_free_page to hold the hugetlb lock.  Change all callers to
>> drop the lock before calling.
>>
>> With additional code modifications, this will allow loops which decrease
>> the huge page pool to drop the hugetlb_lock with each page to reduce
>> long hold times.
>>
>> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
>> a subsequent patch which restructures free_pool_huge_page.
>>
>> Signed-off-by: Mike Kravetz 
>> Acked-by: Michal Hocko 
>> Reviewed-by: Muchun Song 
>> ---
>>  mm/hugetlb.c | 32 +++-
>>  1 file changed, 27 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 16beae49..dec7bd0dc63d 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1451,16 +1451,18 @@ static void __free_huge_page(struct page *page)
>>
>> if (HPageTemporary(page)) {
>> remove_hugetlb_page(h, page, false);
>> +   spin_unlock(_lock);
>> update_and_free_page(h, page);
>> } else if (h->surplus_huge_pages_node[nid]) {
>> /* remove the page from active list */
>> remove_hugetlb_page(h, page, true);
>> +   spin_unlock(_lock);
>> update_and_free_page(h, page);
>> } else {
>> arch_clear_hugepage_flags(page);
>> enqueue_huge_page(h, page);
>> +   spin_unlock(_lock);
>> }
>> -   spin_unlock(_lock);
>>  }
>>
>>  /*
>> @@ -1741,7 +1743,13 @@ static int free_pool_huge_page(struct hstate *h, 
>> nodemask_t *nodes_allowed,
>> list_entry(h->hugepage_freelists[node].next,
>>   struct page, lru);
>> remove_hugetlb_page(h, page, acct_surplus);
>> +   /*
>> +* unlock/lock around update_and_free_page is 
>> temporary
>> +* and will be removed with subsequent patch.
>> +*/
>> +   spin_unlock(_lock);
>> update_and_free_page(h, page);
>> +   spin_lock(_lock);
>> ret = 1;
>> break;
>> }
>> @@ -1810,8 +1818,9 @@ int dissolve_free_huge_page(struct page *page)
>> }
>> remove_hugetlb_page(h, page, false);
>> h->max_huge_pages--;
>> +   spin_unlock(_lock);
>> update_and_free_page(h, head);
>> -   rc = 0;
>> +   return 0;
>> }
>>  out:
>> spin_unlock(_lock);
>> @@ -2674,22 +2683,35 @@ static void try_to_free_low(struct hstate *h, 
>> unsigned long count,
>> nodemask_t *nodes_allowed)
>>  {
>> int i;
>> +   struct page *page, *next;
>> +   LIST_HEAD(page_list);
>>
>> if (hstate_is_gigantic(h))
>> return;
>>
>> +   /*
>> +* Collect pages to be freed on a list, and free after dropping lock
>> +*/
>> +   INIT_LIST_HEAD(_list);
> 
> INIT_LIST_HEAD is unnecessary. Because the macro of
> LIST_HEAD already initializes the list_head structure.
> 

Thanks.
I will fix here and the same issue in patch 6.
-- 
Mike Kravetz


Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-03-30 Thread Mike Kravetz
On 3/30/21 1:01 AM, Michal Hocko wrote:
> On Mon 29-03-21 16:23:55, Mike Kravetz wrote:
>> Ideally, cma_release could be called from any context.  However, that is
>> not possible because a mutex is used to protect the per-area bitmap.
>> Change the bitmap to an irq safe spinlock.
> 
> I would phrase the changelog slightly differerent
> "
> cma_release is currently a sleepable operatation because the bitmap
> manipulation is protected by cma->lock mutex. Hugetlb code which relies
> on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
> irq safe.
> 
> The lock doesn't protect any sleepable operation so it can be changed to
> a (irq aware) spin lock. The bitmap processing should be quite fast in
> typical case but if cma sizes grow to TB then we will likely need to
> replace the lock by a more optimized bitmap implementation.
> "

That is better.  Thank you.

> 
> it seems that you are overusing irqsave variants even from context which
> are never called from the IRQ context so they do not need storing flags.
> 
> [...]

Yes.

>> @@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
>>  unsigned long start = 0;
>>  unsigned long nr_part, nr_total = 0;
>>  unsigned long nbits = cma_bitmap_maxno(cma);
>> +unsigned long flags;
>>  
>> -mutex_lock(>lock);
>> +spin_lock_irqsave(>lock, flags);
> 
> spin_lock_irq should be sufficient. This is only called from the
> allocation context and that is never called from IRQ context.
> 

I will change this and those below.

Thanks for your continued reviews and patience.
-- 
Mike Kravetz


Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-03-29 Thread Mike Kravetz
On 3/29/21 6:20 PM, Song Bao Hua (Barry Song) wrote:
> 
> 
>> -Original Message-----
>> From: Mike Kravetz [mailto:mike.krav...@oracle.com]
>> Sent: Tuesday, March 30, 2021 12:24 PM
>> To: linux...@kvack.org; linux-kernel@vger.kernel.org
>> Cc: Roman Gushchin ; Michal Hocko ; Shakeel 
>> Butt
>> ; Oscar Salvador ; David Hildenbrand
>> ; Muchun Song ; David Rientjes
>> ; linmiaohe ; Peter Zijlstra
>> ; Matthew Wilcox ; HORIGUCHI NAOYA
>> ; Aneesh Kumar K . V ;
>> Waiman Long ; Peter Xu ; Mina Almasry
>> ; Hillf Danton ; Joonsoo Kim
>> ; Song Bao Hua (Barry Song)
>> ; Will Deacon ; Andrew Morton
>> ; Mike Kravetz 
>> Subject: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock
>>
>> Ideally, cma_release could be called from any context.  However, that is
>> not possible because a mutex is used to protect the per-area bitmap.
>> Change the bitmap to an irq safe spinlock.
>>
>> Signed-off-by: Mike Kravetz 
> 
> It seems mutex_lock is locking some areas with bitmap operations which
> should be safe to atomic context.
> 
> Reviewed-by: Barry Song 

Thanks Barry,

Not sure if you saw questions from Michal in previous series?
There was some concern from Joonsoo in the past about lock hold time due
to bitmap scans.  You may have some insight into the typical size of CMA
areas on arm64.  I believe the calls to set up the areas specify one bit
per page.
-- 
Mike Kravetz


[PATCH v2 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock

2021-03-29 Thread Mike Kravetz
After making hugetlb lock irq safe and separating some functionality
done under the lock, add some lockdep_assert_held to help verify
locking.

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Miaohe Lin 
Reviewed-by: Muchun Song 
---
 mm/hugetlb.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bf36abc2305a..06282f340f40 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1068,6 +1068,8 @@ static void __enqueue_huge_page(struct list_head *list, 
struct page *page)
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
int nid = page_to_nid(page);
+
+   lockdep_assert_held(&hugetlb_lock);
__enqueue_huge_page(&h->hugepage_freelists[nid], page);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
@@ -1078,6 +1080,7 @@ static struct page *dequeue_huge_page_node_exact(struct 
hstate *h, int nid)
struct page *page;
bool pin = !!(current->flags & PF_MEMALLOC_PIN);
 
+   lockdep_assert_held(&hugetlb_lock);
list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
if (pin && !is_pinnable_page(page))
continue;
@@ -1346,6 +1349,7 @@ static void remove_hugetlb_page(struct hstate *h, struct 
page *page,
 {
int nid = page_to_nid(page);
 
+   lockdep_assert_held(&hugetlb_lock);
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
 
@@ -1701,6 +1705,7 @@ static struct page *remove_pool_huge_page(struct hstate 
*h,
int nr_nodes, node;
struct page *page = NULL;
 
+   lockdep_assert_held(&hugetlb_lock);
for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
/*
 * If we're returning unused surplus pages, only examine
@@ -1950,6 +1955,7 @@ static int gather_surplus_pages(struct hstate *h, long 
delta)
long needed, allocated;
bool alloc_ok = true;
 
+   lockdep_assert_held(&hugetlb_lock);
needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
if (needed <= 0) {
h->resv_huge_pages += delta;
@@ -2043,6 +2049,7 @@ static void return_unused_surplus_pages(struct hstate *h,
struct page *page;
LIST_HEAD(page_list);
 
+   lockdep_assert_held(&hugetlb_lock);
/* Uncommit the reservation */
h->resv_huge_pages -= unused_resv_pages;
 
@@ -2642,6 +2649,7 @@ static void try_to_free_low(struct hstate *h, unsigned 
long count,
int i;
LIST_HEAD(page_list);
 
+   lockdep_assert_held(&hugetlb_lock);
if (hstate_is_gigantic(h))
return;
 
@@ -2684,6 +2692,7 @@ static int adjust_pool_surplus(struct hstate *h, 
nodemask_t *nodes_allowed,
 {
int nr_nodes, node;
 
+   lockdep_assert_held(&hugetlb_lock);
VM_BUG_ON(delta != -1 && delta != 1);
 
if (delta < 0) {
-- 
2.30.2



[PATCH v2 7/8] hugetlb: make free_huge_page irq safe

2021-03-29 Thread Mike Kravetz
Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
non-task context") was added to address the issue of free_huge_page
being called from irq context.  That commit hands off free_huge_page
processing to a workqueue if !in_task.  However, this doesn't cover
all the cases as pointed out by 0day bot lockdep report [1].

:  Possible interrupt unsafe locking scenario:
:
:        CPU0                    CPU1
:
:   lock(hugetlb_lock);
:                                local_irq_disable();
:                                lock(slock-AF_INET);
:                                lock(hugetlb_lock);
:   <Interrupt>
:     lock(slock-AF_INET);

Shakeel has later explained that this is very likely TCP TX zerocopy
from hugetlb pages scenario when the networking code drops a last
reference to hugetlb page while having IRQ disabled. Hugetlb freeing
path doesn't disable IRQ while holding hugetlb_lock so a lock dependency
chain can lead to a deadlock.

This commit addresses the issue by doing the following:
- Make hugetlb_lock irq safe.  This is mostly a simple process of
  changing spin_*lock calls to spin_*lock_irq* calls.
- Make subpool lock irq safe in a similar manner.
- Revert the !in_task check and workqueue handoff.

[1] https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Muchun Song 
---
 mm/hugetlb.c| 167 
 mm/hugetlb_cgroup.c |   8 +--
 2 files changed, 66 insertions(+), 109 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d3f3cb8766b8..bf36abc2305a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -93,9 +93,10 @@ static inline bool subpool_is_free(struct hugepage_subpool 
*spool)
return true;
 }
 
-static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
+static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
+   unsigned long irq_flags)
 {
-   spin_unlock(&spool->lock);
+   spin_unlock_irqrestore(&spool->lock, irq_flags);
 
/* If no pages are used, and no other handles to the subpool
 * remain, give up any reservations based on minimum size and
@@ -134,10 +135,12 @@ struct hugepage_subpool *hugepage_new_subpool(struct 
hstate *h, long max_hpages,
 
 void hugepage_put_subpool(struct hugepage_subpool *spool)
 {
-   spin_lock(&spool->lock);
+   unsigned long flags;
+
+   spin_lock_irqsave(&spool->lock, flags);
BUG_ON(!spool->count);
spool->count--;
-   unlock_or_release_subpool(spool);
+   unlock_or_release_subpool(spool, flags);
 }
 
 /*
@@ -156,7 +159,7 @@ static long hugepage_subpool_get_pages(struct 
hugepage_subpool *spool,
if (!spool)
return ret;
 
-   spin_lock(&spool->lock);
+   spin_lock_irq(&spool->lock);
 
if (spool->max_hpages != -1) {  /* maximum size accounting */
if ((spool->used_hpages + delta) <= spool->max_hpages)
@@ -183,7 +186,7 @@ static long hugepage_subpool_get_pages(struct 
hugepage_subpool *spool,
}
 
 unlock_ret:
-   spin_unlock(&spool->lock);
+   spin_unlock_irq(&spool->lock);
return ret;
 }
 
@@ -197,11 +200,12 @@ static long hugepage_subpool_put_pages(struct 
hugepage_subpool *spool,
   long delta)
 {
long ret = delta;
+   unsigned long flags;
 
if (!spool)
return delta;
 
-   spin_lock(&spool->lock);
+   spin_lock_irqsave(&spool->lock, flags);
 
if (spool->max_hpages != -1)/* maximum size accounting */
spool->used_hpages -= delta;
@@ -222,7 +226,7 @@ static long hugepage_subpool_put_pages(struct 
hugepage_subpool *spool,
 * If hugetlbfs_put_super couldn't free spool due to an outstanding
 * quota reference, free it now.
 */
-   unlock_or_release_subpool(spool);
+   unlock_or_release_subpool(spool, flags);
 
return ret;
 }
@@ -1412,7 +1416,7 @@ struct hstate *size_to_hstate(unsigned long size)
return NULL;
 }
 
-static void __free_huge_page(struct page *page)
+void free_huge_page(struct page *page)
 {
/*
 * Can't pass hstate in here because it is called from the
@@ -1422,6 +1426,7 @@ static void __free_huge_page(struct page *page)
int nid = page_to_nid(page);
struct hugepage_subpool *spool = hugetlb_page_subpool(page);
bool restore_reserve;
+   unsigned long flags;
 
VM_BUG_ON_PAGE(page_count(page), page);
VM_BUG_ON_PAGE(page_mapcount(page), page);
@@ -1450,7 +1455,7 @@ static void __free_huge_page(struct page *page)
restore_reserve = true;
}
 
-   spin_lock(&hugetlb_lock);
+   spin_lock_irqsave(&hugetlb_lock, flags);
ClearHPageMigratable(page);
hugetlb_cgroup_uncharge_page(hstate_index(h),
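
Editorial note: the hunk above is truncated by the archive.  A condensed
sketch of how free_huge_page() ends after this patch (not the exact hunk
text; see the final upstream commit for the complete version):

	spin_lock_irqsave(&hugetlb_lock, flags);
	ClearHPageMigratable(page);
	/* ... hugetlb_cgroup uncharge and reservation accounting ... */

	if (HPageTemporary(page)) {
		remove_hugetlb_page(h, page, false);
		spin_unlock_irqrestore(&hugetlb_lock, flags);
		update_and_free_page(h, page);
	} else if (h->surplus_huge_pages_node[nid]) {
		/* remove the page from active list */
		remove_hugetlb_page(h, page, true);
		spin_unlock_irqrestore(&hugetlb_lock, flags);
		update_and_free_page(h, page);
	} else {
		arch_clear_hugepage_flags(page);
		enqueue_huge_page(h, page);
		spin_unlock_irqrestore(&hugetlb_lock, flags);
	}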
   

[PATCH v2 5/8] hugetlb: call update_and_free_page without hugetlb_lock

2021-03-29 Thread Mike Kravetz
With the introduction of remove_hugetlb_page(), there is no need for
update_and_free_page to hold the hugetlb lock.  Change all callers to
drop the lock before calling.

With additional code modifications, this will allow loops which decrease
the huge page pool to drop the hugetlb_lock with each page to reduce
long hold times.

The ugly unlock/lock cycle in free_pool_huge_page will be removed in
a subsequent patch which restructures free_pool_huge_page.

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Muchun Song 
---
 mm/hugetlb.c | 32 +++-
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 16beae49..dec7bd0dc63d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1451,16 +1451,18 @@ static void __free_huge_page(struct page *page)
 
if (HPageTemporary(page)) {
remove_hugetlb_page(h, page, false);
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
remove_hugetlb_page(h, page, true);
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
} else {
arch_clear_hugepage_flags(page);
enqueue_huge_page(h, page);
+   spin_unlock(&hugetlb_lock);
}
-   spin_unlock(&hugetlb_lock);
 }
 
 /*
@@ -1741,7 +1743,13 @@ static int free_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
list_entry(h->hugepage_freelists[node].next,
  struct page, lru);
remove_hugetlb_page(h, page, acct_surplus);
+   /*
+* unlock/lock around update_and_free_page is temporary
+* and will be removed with subsequent patch.
+*/
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
+   spin_lock(&hugetlb_lock);
ret = 1;
break;
}
@@ -1810,8 +1818,9 @@ int dissolve_free_huge_page(struct page *page)
}
remove_hugetlb_page(h, page, false);
h->max_huge_pages--;
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, head);
-   rc = 0;
+   return 0;
}
 out:
spin_unlock(&hugetlb_lock);
@@ -2674,22 +2683,35 @@ static void try_to_free_low(struct hstate *h, unsigned 
long count,
nodemask_t *nodes_allowed)
 {
int i;
+   struct page *page, *next;
+   LIST_HEAD(page_list);
 
if (hstate_is_gigantic(h))
return;
 
+   /*
+* Collect pages to be freed on a list, and free after dropping lock
+*/
+   INIT_LIST_HEAD(&page_list);
for_each_node_mask(i, *nodes_allowed) {
-   struct page *page, *next;
struct list_head *freel = &h->hugepage_freelists[i];
list_for_each_entry_safe(page, next, freel, lru) {
if (count >= h->nr_huge_pages)
-   return;
+   goto out;
if (PageHighMem(page))
continue;
remove_hugetlb_page(h, page, false);
-   update_and_free_page(h, page);
+   list_add(&page->lru, &page_list);
}
}
+
+out:
+   spin_unlock(&hugetlb_lock);
+   list_for_each_entry_safe(page, next, &page_list, lru) {
+   update_and_free_page(h, page);
+   cond_resched();
+   }
+   spin_lock(&hugetlb_lock);
 }
 #else
 static inline void try_to_free_low(struct hstate *h, unsigned long count,
-- 
2.30.2



[PATCH v2 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page

2021-03-29 Thread Mike Kravetz
free_pool_huge_page was called with hugetlb_lock held.  It would remove
a hugetlb page, and then free the corresponding pages to the lower level
allocators such as buddy.  free_pool_huge_page was called in a loop to
remove hugetlb pages and these loops could hold the hugetlb_lock for a
considerable time.

Create new routine remove_pool_huge_page to replace free_pool_huge_page.
remove_pool_huge_page will remove the hugetlb page, and it must be
called with the hugetlb_lock held.  It will return the removed page and
it is the responsibility of the caller to free the page to the lower
level allocators.  The hugetlb_lock is dropped before freeing to these
allocators which results in shorter lock hold times.

Add new helper routine to call update_and_free_page for a list of pages.

Note: Some changes to the routine return_unused_surplus_pages are in
need of explanation.  Commit e5bbc8a6c992 ("mm/hugetlb.c: fix reservation
race when freeing surplus pages") modified this routine to address a
race which could occur when dropping the hugetlb_lock in the loop that
removes pool pages.  Accounting changes introduced in that commit were
subtle and took some thought to understand.  This commit removes the
cond_resched_lock() and the potential race.  Therefore, remove the
subtle code and restore the more straightforward accounting, effectively
reverting the commit.

Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c | 95 +---
 1 file changed, 53 insertions(+), 42 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dec7bd0dc63d..d3f3cb8766b8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1209,7 +1209,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
 }
 
 /*
- * helper for free_pool_huge_page() - return the previously saved
+ * helper for remove_pool_huge_page() - return the previously saved
  * node ["this node"] from which to free a huge page.  Advance the
  * next node id whether or not we find a free huge page to free so
  * that the next attempt to free addresses the next node.
@@ -1391,6 +1391,16 @@ static void update_and_free_page(struct hstate *h, 
struct page *page)
}
 }
 
+static void update_and_free_pages_bulk(struct hstate *h, struct list_head 
*list)
+{
+   struct page *page, *t_page;
+
+   list_for_each_entry_safe(page, t_page, list, lru) {
+   update_and_free_page(h, page);
+   cond_resched();
+   }
+}
+
 struct hstate *size_to_hstate(unsigned long size)
 {
struct hstate *h;
@@ -1721,16 +1731,18 @@ static int alloc_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
 }
 
 /*
- * Free huge page from pool from next node to free.
- * Attempt to keep persistent huge pages more or less
- * balanced over allowed nodes.
+ * Remove huge page from pool from next node to free.  Attempt to keep
+ * persistent huge pages more or less balanced over allowed nodes.
+ * This routine only 'removes' the hugetlb page.  The caller must make
+ * an additional call to free the page to low level allocators.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
-bool acct_surplus)
+static struct page *remove_pool_huge_page(struct hstate *h,
+   nodemask_t *nodes_allowed,
+bool acct_surplus)
 {
int nr_nodes, node;
-   int ret = 0;
+   struct page *page = NULL;
 
for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
/*
@@ -1739,23 +1751,14 @@ static int free_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
 */
if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
!list_empty(>hugepage_freelists[node])) {
-   struct page *page =
-   list_entry(h->hugepage_freelists[node].next,
+   page = list_entry(h->hugepage_freelists[node].next,
  struct page, lru);
remove_hugetlb_page(h, page, acct_surplus);
-   /*
-* unlock/lock around update_and_free_page is temporary
-* and will be removed with subsequent patch.
-*/
-   spin_unlock(&hugetlb_lock);
-   update_and_free_page(h, page);
-   spin_lock(&hugetlb_lock);
-   ret = 1;
break;
}
}
 
-   return ret;
+   return page;
 }
 
 /*
@@ -2075,17 +2078,16 @@ static int gather_surplus_pages(struct hstate *h, long 
delta)
  *to the associated reservation map.
  * 2) Free any unused surplus pages that may have been allocated to satisfy
  *the reservation.  A
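
Editorial note: the remaining hunks of this patch are truncated by the
archive.  The caller-side pattern it establishes is sketched below;
shrink_pool_sketch() is a hypothetical function used only to show how
remove_pool_huge_page() and update_and_free_pages_bulk() are meant to be
combined (the real callers are return_unused_surplus_pages() and
set_max_huge_pages()).  hugetlb_lock is still non-IRQ at this point in the
series.

	static void shrink_pool_sketch(struct hstate *h,
				       nodemask_t *nodes_allowed,
				       unsigned long nr_pages)
	{
		LIST_HEAD(page_list);
		struct page *page;

		/* gather pages under the lock; no freeing here */
		spin_lock(&hugetlb_lock);
		while (nr_pages--) {
			page = remove_pool_huge_page(h, nodes_allowed, false);
			if (!page)
				break;
			list_add(&page->lru, &page_list);
		}
		spin_unlock(&hugetlb_lock);

		/* free to the low level allocators with the lock dropped */
		update_and_free_pages_bulk(h, &page_list);
	}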

[PATCH v2 4/8] hugetlb: create remove_hugetlb_page() to separate functionality

2021-03-29 Thread Mike Kravetz
The new remove_hugetlb_page() routine is designed to remove a hugetlb
page from hugetlbfs processing.  It will remove the page from the active
or free list, update global counters and set the compound page
destructor to NULL so that PageHuge() will return false for the 'page'.
After this call, the 'page' can be treated as a normal compound page or
a collection of base size pages.

remove_hugetlb_page is to be called with the hugetlb_lock held.

Creating this routine and separating functionality is in preparation for
restructuring code to reduce lock hold times.  This commit should not
introduce any changes to functionality.

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Miaohe Lin 
Reviewed-by: Muchun Song 
---
 mm/hugetlb.c | 67 
 1 file changed, 42 insertions(+), 25 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8497a3598c86..16beae49 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1331,6 +1331,43 @@ static inline void destroy_compound_gigantic_page(struct 
page *page,
unsigned int order) { }
 #endif
 
+/*
+ * Remove hugetlb page from lists, and update dtor so that page appears
+ * as just a compound page.  A reference is held on the page.
+ *
+ * Must be called with hugetlb lock held.
+ */
+static void remove_hugetlb_page(struct hstate *h, struct page *page,
+   bool adjust_surplus)
+{
+   int nid = page_to_nid(page);
+
+   if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
+   return;
+
+   list_del(&page->lru);
+
+   if (HPageFreed(page)) {
+   h->free_huge_pages--;
+   h->free_huge_pages_node[nid]--;
+   ClearHPageFreed(page);
+   }
+   if (adjust_surplus) {
+   h->surplus_huge_pages--;
+   h->surplus_huge_pages_node[nid]--;
+   }
+
+   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
+   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
+
+   ClearHPageTemporary(page);
+   set_page_refcounted(page);
+   set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
+
+   h->nr_huge_pages--;
+   h->nr_huge_pages_node[nid]--;
+}
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
int i;
@@ -1339,8 +1376,6 @@ static void update_and_free_page(struct hstate *h, struct 
page *page)
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
 
-   h->nr_huge_pages--;
-   h->nr_huge_pages_node[page_to_nid(page)]--;
for (i = 0; i < pages_per_huge_page(h);
 i++, subpage = mem_map_next(subpage, page, i)) {
subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
@@ -1348,10 +1383,6 @@ static void update_and_free_page(struct hstate *h, 
struct page *page)
1 << PG_active | 1 << PG_private |
1 << PG_writeback);
}
-   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
-   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
-   set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
-   set_page_refcounted(page);
if (hstate_is_gigantic(h)) {
destroy_compound_gigantic_page(page, huge_page_order(h));
free_gigantic_page(page, huge_page_order(h));
@@ -1419,15 +1450,12 @@ static void __free_huge_page(struct page *page)
h->resv_huge_pages++;
 
if (HPageTemporary(page)) {
-   list_del(&page->lru);
-   ClearHPageTemporary(page);
+   remove_hugetlb_page(h, page, false);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
-   list_del(&page->lru);
+   remove_hugetlb_page(h, page, true);
update_and_free_page(h, page);
-   h->surplus_huge_pages--;
-   h->surplus_huge_pages_node[nid]--;
} else {
arch_clear_hugepage_flags(page);
enqueue_huge_page(h, page);
@@ -1712,13 +1740,7 @@ static int free_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
struct page *page =
list_entry(h->hugepage_freelists[node].next,
  struct page, lru);
-   list_del(&page->lru);
-   h->free_huge_pages--;
-   h->free_huge_pages_node[node]--;
-   if (acct_surplus) {
-   h->surplus_huge_pages--;
-   h->surplus_huge_pages_node[node]--;
-   }
+   remove_huge

[PATCH v2 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release

2021-03-29 Thread Mike Kravetz
Now that cma_release is non-blocking and irq safe, there is no need to
drop hugetlb_lock before calling.

Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c3e4baa4156..1d62f0492e7b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1353,14 +1353,8 @@ static void update_and_free_page(struct hstate *h, 
struct page *page)
set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
set_page_refcounted(page);
if (hstate_is_gigantic(h)) {
-   /*
-* Temporarily drop the hugetlb_lock, because
-* we might block in free_gigantic_page().
-*/
-   spin_unlock(&hugetlb_lock);
destroy_compound_gigantic_page(page, huge_page_order(h));
free_gigantic_page(page, huge_page_order(h));
-   spin_lock(&hugetlb_lock);
} else {
__free_pages(page, huge_page_order(h));
}
-- 
2.30.2



[PATCH v2 0/8] make hugetlb put_page safe for all calling contexts

2021-03-29 Thread Mike Kravetz
This effort is the result of a recent bug report [1].  Syzbot found a
potential deadlock in the hugetlb put_page/free_huge_page_path.
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
Since the free_huge_page_path already has code to 'hand off' page
free requests to a workqueue, a suggestion was proposed to make
the in_irq() detection accurate by always enabling PREEMPT_COUNT [2].
The outcome of that discussion was that the hugetlb put_page path
(free_huge_page) should be properly fixed and made safe for all calling
contexts.

This series is based on v5.12-rc3-mmotm-2021-03-17-22-24.  At a high
level, the series provides:
- Patches 1 & 2 change CMA bitmap mutex to an irq safe spinlock
- Patch 3 adds a mutex for proc/sysfs interfaces changing hugetlb counts
- Patches 4, 5 & 6 are aimed at reducing lock hold times.  To be clear
  the goal is to eliminate single lock hold times of a long duration.
  Overall lock hold time is not addressed.
- Patch 7 makes hugetlb_lock and subpool lock IRQ safe.  It also reverts
  the code which defers calls to a workqueue if !in_task.
- Patch 8 adds some lockdep_assert_held() calls

[1] https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/
[2] http://lkml.kernel.org/r/20210311021321.127500-1-mike.krav...@oracle.com

v1 -> v2
- Drop Roman's cma_release_nowait() patches and just change CMA mutex
  to an IRQ safe spinlock.
- Cleanups to variable names, comments and commit messages as suggested
  by Michal, Oscar, Miaohe and Muchun.
- Dropped unnecessary INIT_LIST_HEAD as suggested by Michal and list_del
  as suggested by Muchun.
- Created update_and_free_pages_bulk helper as suggested by Michal.
- Rebased on v5.12-rc4-mmotm-2021-03-28-16-37
- Added Acked-by: and Reviewed-by: from v1

RFC -> v1
- Add Roman's cma_release_nowait() patches.  This eliminated the need
  to do a workqueue handoff in hugetlb code.
- Use Michal's suggestion to batch pages for freeing.  This eliminated
  the need to recalculate loop control variables when dropping the lock.
- Added lockdep_assert_held() calls
- Rebased to v5.12-rc3-mmotm-2021-03-17-22-24

Mike Kravetz (8):
  mm/cma: change cma mutex to irq safe spinlock
  hugetlb: no need to drop hugetlb_lock to call cma_release
  hugetlb: add per-hstate mutex to synchronize user adjustments
  hugetlb: create remove_hugetlb_page() to separate functionality
  hugetlb: call update_and_free_page without hugetlb_lock
  hugetlb: change free_pool_huge_page to remove_pool_huge_page
  hugetlb: make free_huge_page irq safe
  hugetlb: add lockdep_assert_held() calls for hugetlb_lock

 include/linux/hugetlb.h |   1 +
 mm/cma.c|  20 +--
 mm/cma.h|   2 +-
 mm/cma_debug.c  |  10 +-
 mm/hugetlb.c| 340 +---
 mm/hugetlb_cgroup.c |   8 +-
 6 files changed, 202 insertions(+), 179 deletions(-)

-- 
2.30.2



[PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-03-29 Thread Mike Kravetz
Ideally, cma_release could be called from any context.  However, that is
not possible because a mutex is used to protect the per-area bitmap.
Change the bitmap to an irq safe spinlock.

Signed-off-by: Mike Kravetz 
---
 mm/cma.c   | 20 +++-
 mm/cma.h   |  2 +-
 mm/cma_debug.c | 10 ++
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index b2393b892d3b..80875fd4487b 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -24,7 +24,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long 
pfn,
 unsigned int count)
 {
unsigned long bitmap_no, bitmap_count;
+   unsigned long flags;
 
bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
bitmap_count = cma_bitmap_pages_to_bits(cma, count);
 
-   mutex_lock(&cma->lock);
+   spin_lock_irqsave(&cma->lock, flags);
bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
-   mutex_unlock(&cma->lock);
+   spin_unlock_irqrestore(&cma->lock, flags);
 }
 
 static void __init cma_activate_area(struct cma *cma)
@@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
 pfn += pageblock_nr_pages)
init_cma_reserved_pageblock(pfn_to_page(pfn));
 
-   mutex_init(&cma->lock);
+   spin_lock_init(&cma->lock);
 
 #ifdef CONFIG_CMA_DEBUGFS
INIT_HLIST_HEAD(&cma->mem_head);
@@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
unsigned long start = 0;
unsigned long nr_part, nr_total = 0;
unsigned long nbits = cma_bitmap_maxno(cma);
+   unsigned long flags;
 
-   mutex_lock(&cma->lock);
+   spin_lock_irqsave(&cma->lock, flags);
pr_info("number of available pages: ");
for (;;) {
next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
@@ -407,7 +408,7 @@ static void cma_debug_show_areas(struct cma *cma)
start = next_zero_bit + nr_zero;
}
pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
-   mutex_unlock(&cma->lock);
+   spin_unlock_irqrestore(&cma->lock, flags);
 }
 #else
 static inline void cma_debug_show_areas(struct cma *cma) { }
@@ -430,6 +431,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
unsigned long pfn = -1;
unsigned long start = 0;
unsigned long bitmap_maxno, bitmap_no, bitmap_count;
+   unsigned long flags;
size_t i;
struct page *page = NULL;
int ret = -ENOMEM;
@@ -454,12 +456,12 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
goto out;
 
for (;;) {
-   mutex_lock(&cma->lock);
+   spin_lock_irqsave(&cma->lock, flags);
bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
bitmap_maxno, start, bitmap_count, mask,
offset);
if (bitmap_no >= bitmap_maxno) {
-   mutex_unlock(&cma->lock);
+   spin_unlock_irqrestore(&cma->lock, flags);
break;
}
bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
@@ -468,7 +470,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
 * our exclusive use. If the migration fails we will take the
 * lock again and unmark it.
 */
-   mutex_unlock(&cma->lock);
+   spin_unlock_irqrestore(&cma->lock, flags);
 
pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
diff --git a/mm/cma.h b/mm/cma.h
index 68ffad4e430d..2c775877eae2 100644
--- a/mm/cma.h
+++ b/mm/cma.h
@@ -15,7 +15,7 @@ struct cma {
unsigned long   count;
unsigned long   *bitmap;
unsigned int order_per_bit; /* Order of pages represented by one bit */
-   struct mutexlock;
+   spinlock_t  lock;
 #ifdef CONFIG_CMA_DEBUGFS
struct hlist_head mem_head;
spinlock_t mem_head_lock;
diff --git a/mm/cma_debug.c b/mm/cma_debug.c
index d5bf8aa34fdc..6379cfbfd568 100644
--- a/mm/cma_debug.c
+++ b/mm/cma_debug.c
@@ -35,11 +35,12 @@ static int cma_used_get(void *data, u64 *val)
 {
struct cma *cma = data;
unsigned long used;
+   unsigned long flags;
 
-   mutex_lock(&cma->lock);
+   spin_lock_irqsave(&cma->lock, flags);
/* pages counter is smaller than sizeof(int) */
used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
-   mutex_unlock(&cma->lock);
+   spin_unlock_irqrestore(&cma->lock, flags);
*val = (u64)used << cma->order_per_bit;
 
return 0;
@@ -52,8 +53,9 @@ stati
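
Editorial note: as Michal's review of this patch (archived above) points
out, v2 uses the irqsave variants even on paths that always run with IRQs
enabled; v3 switches those paths to plain spin_lock_irq().  A generic
sketch of the distinction, not tied to the cma code:

	static DEFINE_SPINLOCK(example_lock);

	/* Only ever entered from process context with IRQs enabled. */
	static void process_context_only(void)
	{
		spin_lock_irq(&example_lock);
		/* short, non-sleeping critical section */
		spin_unlock_irq(&example_lock);
	}

	/* May be entered with IRQs already disabled; preserve caller state. */
	static void any_context(void)
	{
		unsigned long flags;

		spin_lock_irqsave(&example_lock, flags);
		/* short, non-sleeping critical section */
		spin_unlock_irqrestore(&example_lock, flags);
	}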

[PATCH v2 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments

2021-03-29 Thread Mike Kravetz
The helper routine hstate_next_node_to_alloc accesses and modifies the
hstate variable next_nid_to_alloc.  The helper is used by the routines
alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
called with hugetlb_lock held.  However, alloc_pool_huge_page can not
be called with the hugetlb lock held as it will call the page allocator.
Two instances of alloc_pool_huge_page could be run in parallel or
alloc_pool_huge_page could run in parallel with adjust_pool_surplus
which may result in the variable next_nid_to_alloc becoming invalid
for the caller and pages being allocated on the wrong node.

Both alloc_pool_huge_page and adjust_pool_surplus are only called from
the routine set_max_huge_pages after boot.  set_max_huge_pages is only
called as the result of a user writing to the proc/sysfs nr_hugepages,
or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.

It makes little sense to allow multiple adjustments to the number of
hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
allow one hugetlb page adjustment at a time.  This will synchronize
modifications to the next_nid_to_alloc variable.

Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Oscar Salvador 
Reviewed-by: Miaohe Lin 
---
 include/linux/hugetlb.h | 1 +
 mm/hugetlb.c| 8 
 2 files changed, 9 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d9b78e82652f..b92f25ccef58 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -566,6 +566,7 @@ HPAGEFLAG(Freed, freed)
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
+   struct mutex resize_lock;
int next_nid_to_alloc;
int next_nid_to_free;
unsigned int order;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1d62f0492e7b..8497a3598c86 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2730,6 +2730,11 @@ static int set_max_huge_pages(struct hstate *h, unsigned 
long count, int nid,
else
return -ENOMEM;
 
+   /*
+* resize_lock mutex prevents concurrent adjustments to number of
+* pages in hstate via the proc/sysfs interfaces.
+*/
+   mutex_lock(&h->resize_lock);
spin_lock(&hugetlb_lock);
 
/*
@@ -2762,6 +2767,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned 
long count, int nid,
if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
if (count > persistent_huge_pages(h)) {
spin_unlock(&hugetlb_lock);
+   mutex_unlock(&h->resize_lock);
NODEMASK_FREE(node_alloc_noretry);
return -EINVAL;
}
@@ -2836,6 +2842,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned 
long count, int nid,
 out:
h->max_huge_pages = persistent_huge_pages(h);
spin_unlock(&hugetlb_lock);
+   mutex_unlock(&h->resize_lock);
 
NODEMASK_FREE(node_alloc_noretry);
 
@@ -3323,6 +3330,7 @@ void __init hugetlb_add_hstate(unsigned int order)
BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
BUG_ON(order == 0);
h = [hugetlb_max_hstate++];
+   mutex_init(>resize_lock);
h->order = order;
h->mask = ~(huge_page_size(h) - 1);
for (i = 0; i < MAX_NUMNODES; ++i)
-- 
2.30.2



Re: [External] [PATCH 7/8] hugetlb: make free_huge_page irq safe

2021-03-29 Thread Mike Kravetz
On 3/29/21 12:49 AM, Michal Hocko wrote:
> On Sat 27-03-21 15:06:36, Muchun Song wrote:
>> On Thu, Mar 25, 2021 at 8:29 AM Mike Kravetz  wrote:
>>>
>>> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
>>> non-task context") was added to address the issue of free_huge_page
>>> being called from irq context.  That commit hands off free_huge_page
>>> processing to a workqueue if !in_task.  However, as seen in [1] this
>>> does not cover all cases.  Instead, make the locks taken in the
>>> free_huge_page irq safe.
>>>
>>> This patch does the following:
>>> - Make hugetlb_lock irq safe.  This is mostly a simple process of
>>>   changing spin_*lock calls to spin_*lock_irq* calls.
>>> - Make subpool lock irq safe in a similar manner.
>>> - Revert the !in_task check and workqueue handoff.
>>>
>>> [1] 
>>> https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/
>>>
>>> Signed-off-by: Mike Kravetz 
>>
>> The changes are straightforward.
>>
>> Reviewed-by: Muchun Song 
>>
>> Since this patchset aims to fix a real word issue. Should we add a Fixes
>> tag?
> 
> Do we know since when it is possible to use hugetlb in the networking
> context? Maybe this is possible since ever but I am wondering why the
> lockdep started complaining only now. Maybe just fuzzing finally started
> using this setup which nobody does normally.
> 

From my memory and email search, this first came up with powerpc iommu here:
https://lore.kernel.org/lkml/20180905112341.21355-1-aneesh.ku...@linux.ibm.com/

Aneesh proposed a solution similar to this, but 'fixed' the issue by changing
the powerpc code.

AFAICT, the put_page/free_huge_page code path has only been 'safe' to
call from task context since it was originally written.  The real
question is when was it first possible for some code to do (the last)
put_page for a hugetlbfs page from irq context?  My 'guess' is that this
may have been possible for quite a while.  I can imagine a dma reference
to a hugetlb page held after the user space reference goes away.
-- 
Mike Kravetz


Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()

2021-03-29 Thread Mike Kravetz
On 3/29/21 12:46 AM, Michal Hocko wrote:
> On Fri 26-03-21 14:32:01, Mike Kravetz wrote:
> [...]
>> - Just change the mutex to an irq safe spinlock.
> 
> Yes please.
> 
>>   AFAICT, the potential
>>   downsides could be:
>>   - Interrupts disabled during long bitmap scans
> 
> How large those bitmaps are in practice?
> 
>>   - Wasted cpu cycles (spinning) if there is much contention on lock
>>   Both of these would be more of an issue on small/embedded systems. I
>>   took a quick look at the callers of cma_alloc/cma_release and nothing
>>   stood out that could lead to high degrees of contention.  However, I
>>   could have missed something.
> 
> If this is really a practical concern then we can try a more complex
> solution based on some data.
> 

Ok, I will send v2 with this approach.  Adding Barry and Will on Cc: as
they were involved in adding more cma use cases for dma on arm.

-- 
Mike Kravetz


Re: [External] [PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock

2021-03-28 Thread Mike Kravetz
On 3/26/21 11:54 PM, Muchun Song wrote:
> On Thu, Mar 25, 2021 at 8:29 AM Mike Kravetz  wrote:
>>
>> With the introduction of remove_hugetlb_page(), there is no need for
>> update_and_free_page to hold the hugetlb lock.  Change all callers to
>> drop the lock before calling.
>>
>> With additional code modifications, this will allow loops which decrease
>> the huge page pool to drop the hugetlb_lock with each page to reduce
>> long hold times.
>>
>> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
>> a subsequent patch which restructures free_pool_huge_page.
>>
>> Signed-off-by: Mike Kravetz 
> 
> Reviewed-by: Muchun Song 
> 
> Some nits below.

Thanks Muchun,

I agree with all your suggestions below, and will include modifications
in the next version.
-- 
Mike Kravetz


Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()

2021-03-26 Thread Mike Kravetz
On 3/25/21 4:49 PM, Mike Kravetz wrote:
> On 3/25/21 4:19 PM, Roman Gushchin wrote:
>> On Thu, Mar 25, 2021 at 01:12:51PM -0700, Minchan Kim wrote:
>>> On Thu, Mar 25, 2021 at 06:15:11PM +0100, David Hildenbrand wrote:
>>>> On 25.03.21 17:56, Mike Kravetz wrote:
>>>>> On 3/25/21 3:22 AM, Michal Hocko wrote:
>>>>>> On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
>>>>>>> On 25.03.21 01:28, Mike Kravetz wrote:
>>>>>>>> From: Roman Gushchin 
>>>>>>>>
>>>>>>>> cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
>>>>>>>> It makes it a blocking function, which complicates its usage from
>>>>>>>> non-blocking contexts. For instance, hugetlbfs code is temporarily
>>>>>>>> dropping the hugetlb_lock spinlock to call cma_release().
>>>>>>>>
>>>>>>>> This patch introduces a non-blocking cma_release_nowait(), which
>>>>>>>> postpones the cma bitmap clearance. It's done later from a work
>>>>>>>> context. The first page in the cma allocation is used to store
>>>>>>>> the work struct. Because CMA allocations and de-allocations are
>>>>>>>> usually not that frequent, a single global workqueue is used.
>>>>>>>>
>>>>>>>> To make sure that subsequent cma_alloc() call will pass, cma_alloc()
>>>>>>>> flushes the cma_release_wq workqueue. To avoid a performance
>>>>>>>> regression in the case when only cma_release() is used, gate it
>>>>>>>> by a per-cma area flag, which is set by the first call
>>>>>>>> of cma_release_nowait().
>>>>>>>>
>>>>>>>> Signed-off-by: Roman Gushchin 
>>>>>>>> [mike.krav...@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
>>>>>>>> Signed-off-by: Mike Kravetz 
>>>>>>>> ---
>>>>>>>
>>>>>>>
>>>>>>> 1. Is there a real reason this is a mutex and not a spin lock? It seems 
>>>>>>> to
>>>>>>> only protect the bitmap. Are bitmaps that huge that we spend a 
>>>>>>> significant
>>>>>>> amount of time in there?
>>>>>>
>>>>>> Good question. Looking at the code it doesn't seem that there is any
>>>>>> blockable operation or any heavy lifting done under the lock.
>>>>>> 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
>>>>>> the lock and there was a simple bitmat protection back then. I suspect
>>>>>> the patch just followed the cma_mutex lead and used the same type of the
>>>>>> lock. cma_mutex used to protect alloc_contig_range which is sleepable.
>>>>>>
>>>>>> This all suggests that there is no real reason to use a sleepable lock
>>>>>> at all and we do not need all this heavy lifting.
>>>>>>
>>>>>
>>>>> When Roman first proposed these patches, I brought up the same issue:
>>>>>
>>>>> https://lore.kernel.org/linux-mm/20201022023352.gc300...@carbon.dhcp.thefacebook.com/
>>>>>
>>>>> Previously, Roman proposed replacing the mutex with a spinlock but
>>>>> Joonsoo was opposed.
>>>>>
>>>>> Adding Joonsoo on Cc:
>>>>>
>>>>
>>>> There has to be a good reason not to. And if there is a good reason,
>>>> lockless clearing might be one feasible alternative.
>>>
>>> I also don't think nowait variant is good idea. If the scanning of
>>> bitmap is *really* significant, it might be signal that we need to
>>> introduce different technique or data structure not bitmap rather
>>> than a new API variant.
>>
>> I'd also prefer to just replace the mutex with a spinlock rather than 
>> fiddling
>> with a delayed release.
>>
> 
> I hope Joonsoo or someone else brings up specific concerns.  I do not
> know enough about all CMA use cases.  Certainly, in this specific use
> case converting to a spinlock would not be an issue.  Do note that we
> would want to convert to an irq safe spinlock and disable irqs if that
> makes any difference in the discussion.
> 

Suggestions on how to move forward would be appreciated.  I can think of
the following options.

- Use the the cma_release_nowait() routine as defined in this patch.

- Just change the mutex to an irq safe spinlock (a minimal sketch follows
  this list).  AFAICT, the potential downsides could be:
  - Interrupts disabled during long bitmap scans
  - Wasted cpu cycles (spinning) if there is much contention on lock
  Both of these would be more of an issue on small/embedded systems. I
  took a quick look at the callers of cma_alloc/cma_release and nothing
  stood out that could lead to high degrees of contention.  However, I
  could have missed something.

- Another idea I had was to allow the user to specify the locking type
  when creating a cma area.  In this way, cma areas which may have
  release calls from atomic context would be set up with an irq safe
  spinlock.  Others, would use the mutex.  I admit this is a hackish
  way to address the issue, but perhaps not much worse than the separate
  cma_release_nowait interface?

- Change the CMA bitmap to some other data structure and algorithm.
  This would obviously take more work.
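
For illustration, a minimal sketch of the second option.  This is only a
sketch, assuming cma->lock protects nothing but the bitmap, and reusing the
field and helper names from mm/cma.c as quoted in patch 1 of this series:

    /* in struct cma (mm/cma.h): replace 'struct mutex lock;' with */
    spinlock_t lock;

    /* in cma_activate_area(): */
    spin_lock_init(&cma->lock);

    /* bitmap clearing, now callable from atomic context */
    static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
                                 unsigned int count)
    {
            unsigned long bitmap_no, bitmap_count;
            unsigned long flags;

            bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
            bitmap_count = cma_bitmap_pages_to_bits(cma, count);

            spin_lock_irqsave(&cma->lock, flags);
            bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
            spin_unlock_irqrestore(&cma->lock, flags);
    }

The bitmap scan in cma_alloc() would need the same conversion, which is
where the "interrupts disabled during long bitmap scans" concern above
comes from.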

Thanks,
-- 
Mike Kravetz


Re: [PATCH 0/8] make hugetlb put_page safe for all calling contexts

2021-03-26 Thread Mike Kravetz
On 3/25/21 6:42 PM, Miaohe Lin wrote:
> Hi:
> On 2021/3/25 8:28, Mike Kravetz wrote:
>> This effort is the result of a recent bug report [1].  In subsequent
>> discussions [2], it was deemed necessary to properly fix the hugetlb
> 
> Many thanks for the effort. I have read the discussions and it is pretty long.
> Maybe it would be helpful if you give a brief summary here?
> 
>> put_page path (free_huge_page).  This RFC provides a possible way to
> 
> trivial: Not RFC here.
> 
>> address the issue.  Comments are welcome/encouraged as several attempts
>> at this have been made in the past.
>>> This series is based on v5.12-rc3-mmotm-2021-03-17-22-24.  At a high
>> level, the series provides:
>> - Patches 1 & 2 from Roman Gushchin provide cma_release_nowait()
> 
> trivial: missing description of Patch 3?
> 

Thanks,
I will clean this up in the next version.
-- 
Mike Kravetz


Re: [PATCH 4/8] hugetlb: create remove_hugetlb_page() to separate functionality

2021-03-26 Thread Mike Kravetz
On 3/25/21 7:10 PM, Miaohe Lin wrote:
> On 2021/3/25 8:28, Mike Kravetz wrote:
>> The new remove_hugetlb_page() routine is designed to remove a hugetlb
>> page from hugetlbfs processing.  It will remove the page from the active
>> or free list, update global counters and set the compound page
>> destructor to NULL so that PageHuge() will return false for the 'page'.
>> After this call, the 'page' can be treated as a normal compound page or
>> a collection of base size pages.
>>
>> remove_hugetlb_page is to be called with the hugetlb_lock held.
>>
>> Creating this routine and separating functionality is in preparation for
>> restructuring code to reduce lock hold times.
>>
>> Signed-off-by: Mike Kravetz 
>> ---
>>  mm/hugetlb.c | 70 +---
>>  1 file changed, 45 insertions(+), 25 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 404b0b1c5258..3938ec086b5c 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1327,6 +1327,46 @@ static inline void 
>> destroy_compound_gigantic_page(struct page *page,
>>  unsigned int order) { }
>>  #endif
>>  
>> +/*
>> + * Remove hugetlb page from lists, and update dtor so that page appears
>> + * as just a compound page.  A reference is held on the page.
>> + * NOTE: hugetlb specific page flags stored in page->private are not
>> + *   automatically cleared.  These flags may be used in routines
>> + *   which operate on the resulting compound page.
> 
> It seems HPageFreed and HPageTemporary are cleared. Which hugetlb specific
> page flags
> are reserved here and why? Could you please give a simple example to clarify
> this in the comment to help understand this NOTE?
> 

I will remove that NOTE: in the comment to avoid any confusion.

The NOTE was added in the RFC that contained a separate patch to add a flag
that tracked huge pages allocated from CMA.  That flag needed to remain
for subsequent freeing of such pages.  This is no longer needed.

> The code looks good to me. Many thanks!
> Reviewed-by: Miaohe Lin 

Thanks,
-- 
Mike Kravetz


Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()

2021-03-25 Thread Mike Kravetz
On 3/25/21 4:19 PM, Roman Gushchin wrote:
> On Thu, Mar 25, 2021 at 01:12:51PM -0700, Minchan Kim wrote:
>> On Thu, Mar 25, 2021 at 06:15:11PM +0100, David Hildenbrand wrote:
>>> On 25.03.21 17:56, Mike Kravetz wrote:
>>>> On 3/25/21 3:22 AM, Michal Hocko wrote:
>>>>> On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
>>>>>> On 25.03.21 01:28, Mike Kravetz wrote:
>>>>>>> From: Roman Gushchin 
>>>>>>>
>>>>>>> cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
>>>>>>> It makes it a blocking function, which complicates its usage from
>>>>>>> non-blocking contexts. For instance, hugetlbfs code is temporarily
>>>>>>> dropping the hugetlb_lock spinlock to call cma_release().
>>>>>>>
>>>>>>> This patch introduces a non-blocking cma_release_nowait(), which
>>>>>>> postpones the cma bitmap clearance. It's done later from a work
>>>>>>> context. The first page in the cma allocation is used to store
>>>>>>> the work struct. Because CMA allocations and de-allocations are
>>>>>>> usually not that frequent, a single global workqueue is used.
>>>>>>>
>>>>>>> To make sure that subsequent cma_alloc() call will pass, cma_alloc()
>>>>>>> flushes the cma_release_wq workqueue. To avoid a performance
>>>>>>> regression in the case when only cma_release() is used, gate it
>>>>>>> by a per-cma area flag, which is set by the first call
>>>>>>> of cma_release_nowait().
>>>>>>>
>>>>>>> Signed-off-by: Roman Gushchin 
>>>>>>> [mike.krav...@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
>>>>>>> Signed-off-by: Mike Kravetz 
>>>>>>> ---
>>>>>>
>>>>>>
>>>>>> 1. Is there a real reason this is a mutex and not a spin lock? It seems 
>>>>>> to
>>>>>> only protect the bitmap. Are bitmaps that huge that we spend a 
>>>>>> significant
>>>>>> amount of time in there?
>>>>>
>>>>> Good question. Looking at the code it doesn't seem that there is any
>>>>> blockable operation or any heavy lifting done under the lock.
>>>>> 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
>>>>> the lock and there was a simple bitmap protection back then. I suspect
>>>>> the patch just followed the cma_mutex lead and used the same type of the
>>>>> lock. cma_mutex used to protect alloc_contig_range which is sleepable.
>>>>>
>>>>> This all suggests that there is no real reason to use a sleepable lock
>>>>> at all and we do not need all this heavy lifting.
>>>>>
>>>>
>>>> When Roman first proposed these patches, I brought up the same issue:
>>>>
>>>> https://lore.kernel.org/linux-mm/20201022023352.gc300...@carbon.dhcp.thefacebook.com/
>>>>
>>>> Previously, Roman proposed replacing the mutex with a spinlock but
>>>> Joonsoo was opposed.
>>>>
>>>> Adding Joonsoo on Cc:
>>>>
>>>
>>> There has to be a good reason not to. And if there is a good reason,
>>> lockless clearing might be one feasible alternative.
>>
>> I also don't think the nowait variant is a good idea. If the scanning of
>> the bitmap is *really* significant, it might be a signal that we need to
>> introduce a different technique or data structure (not a bitmap) rather
>> than a new API variant.
> 
> I'd also prefer to just replace the mutex with a spinlock rather than fiddling
> with a delayed release.
> 

I hope Joonsoo or someone else brings up specific concerns.  I do not
know enough about all CMA use cases.  Certainly, in this specific use
case converting to a spinlock would not be an issue.  Do note that we
would want to convert to an irq safe spinlock and disable irqs if that
makes any difference in the discussion.
-- 
Mike Kravetz


Re: [PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock

2021-03-25 Thread Mike Kravetz
On 3/25/21 12:39 PM, Michal Hocko wrote:
> On Thu 25-03-21 10:12:05, Mike Kravetz wrote:
>> On 3/25/21 3:55 AM, Michal Hocko wrote:
>>> On Wed 24-03-21 17:28:32, Mike Kravetz wrote:
>>>> With the introduction of remove_hugetlb_page(), there is no need for
>>>> update_and_free_page to hold the hugetlb lock.  Change all callers to
>>>> drop the lock before calling.
>>>>
>>>> With additional code modifications, this will allow loops which decrease
>>>> the huge page pool to drop the hugetlb_lock with each page to reduce
>>>> long hold times.
>>>>
>>>> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
>>>> a subsequent patch which restructures free_pool_huge_page.
>>>>
>>>> Signed-off-by: Mike Kravetz 
>>>
>>> Acked-by: Michal Hocko 
>>>
>>> One minor thing below
>>>
>>> [...]
>>>> @@ -2563,22 +2572,37 @@ static void try_to_free_low(struct hstate *h, 
>>>> unsigned long count,
>>>>nodemask_t *nodes_allowed)
>>>>  {
>>>>int i;
>>>> +  struct list_head page_list;
>>>> +  struct page *page, *next;
>>>>  
>>>>if (hstate_is_gigantic(h))
>>>>return;
>>>>  
>>>> +  /*
>>>> +   * Collect pages to be freed on a list, and free after dropping lock
>>>> +   */
>>>> +  INIT_LIST_HEAD(&page_list);
>>>>for_each_node_mask(i, *nodes_allowed) {
>>>> -  struct page *page, *next;
>>>>struct list_head *freel = &h->hugepage_freelists[i];
>>>>list_for_each_entry_safe(page, next, freel, lru) {
>>>>if (count >= h->nr_huge_pages)
>>>> -  return;
>>>> +  goto out;
>>>>if (PageHighMem(page))
>>>>continue;
>>>>remove_hugetlb_page(h, page, false);
>>>> -  update_and_free_page(h, page);
>>>> +  INIT_LIST_HEAD(&page->lru);
>>>
>>> What is the point of this INIT_LIST_HEAD? Page has been removed from the
>>> list by remove_hugetlb_page so it can be added to a new one without any
>>> reinitialization.
>>
>> remove_hugetlb_page just does a list_del.  list_del will poison the
>> pointers in page->lru.  The following list_add will then complain about
>> list corruption.
> 
> Are you sure? list_del followed by list_add is a normal API usage
> pattern AFAIK. INIT_LIST_HEAD is to do the first initialization before
> first use.

Sorry for the noise.  The INIT_LIST_HEAD is indeed unnecessary.

I must have got confused while looking at a corrupt list splat in
earlier code development.
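
For reference, the two helpers differ only in what they leave behind in the
removed entry (simplified here from include/linux/list.h, with the
CONFIG_DEBUG_LIST checks omitted); list_add() rewrites both pointers of the
entry it adds, which is why the poisoning left by list_del() does not get
in the way here:

    static inline void list_del(struct list_head *entry)
    {
            __list_del(entry->prev, entry->next);
            entry->next = LIST_POISON1;     /* poisoned on purpose */
            entry->prev = LIST_POISON2;
    }

    static inline void list_del_init(struct list_head *entry)
    {
            __list_del(entry->prev, entry->next);
            INIT_LIST_HEAD(entry);          /* entry is immediately reusable */
    }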
-- 
Mike Kravetz


Re: [PATCH 7/8] hugetlb: make free_huge_page irq safe

2021-03-25 Thread Mike Kravetz
On 3/25/21 4:21 AM, Michal Hocko wrote:
> On Wed 24-03-21 17:28:34, Mike Kravetz wrote:
>> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
>> non-task context") was added to address the issue of free_huge_page
>> being called from irq context.  That commit hands off free_huge_page
>> processing to a workqueue if !in_task.  However, as seen in [1] this
>> does not cover all cases.  Instead, make the locks taken in the
>> free_huge_page irq safe.
> 
> I would just call out the deadlock scenario here in the changelog
> rather than torture people by forcing them to follow up on the 0day
> report. Something like the below?
> "
> "
> However this doesn't cover all the cases as pointed out by 0day bot
> lockdep report [1]
> :  Possible interrupt unsafe locking scenario:
> : 
> :CPU0CPU1
> :
> :   lock(hugetlb_lock);
> :local_irq_disable();
> :lock(slock-AF_INET);
> :lock(hugetlb_lock);
> :   <Interrupt>
> : lock(slock-AF_INET);
> 
> Shakeel has later explained that this is very likely TCP TX
> zerocopy from hugetlb pages scenario when the networking code drops a
> last reference to hugetlb page while having IRQ disabled. Hugetlb
> freeing path doesn't disable IRQ while holding hugetlb_lock so a lock
> dependency chain can lead to a deadlock.
>  

Thanks.  I will update changelog.

> 
>> This patch does the following:
>> - Make hugetlb_lock irq safe.  This is mostly a simple process of
>>   changing spin_*lock calls to spin_*lock_irq* calls.
>> - Make subpool lock irq safe in a similar manner.
>> - Revert the !in_task check and workqueue handoff.
>>
>> [1] https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/
>>
>> Signed-off-by: Mike Kravetz 
> 
> Acked-by: Michal Hocko 

And, thanks for looking at the series!
-- 
Mike Kravetz


Re: [PATCH 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page

2021-03-25 Thread Mike Kravetz
On 3/25/21 4:06 AM, Michal Hocko wrote:
> On Wed 24-03-21 17:28:33, Mike Kravetz wrote:
> [...]
>> @@ -2074,17 +2067,16 @@ static int gather_surplus_pages(struct hstate *h, 
>> long delta)
>>   *to the associated reservation map.
>>   * 2) Free any unused surplus pages that may have been allocated to satisfy
>>   *the reservation.  As many as unused_resv_pages may be freed.
>> - *
>> - * Called with hugetlb_lock held.  However, the lock could be dropped (and
>> - * reacquired) during calls to cond_resched_lock.  Whenever dropping the 
>> lock,
>> - * we must make sure nobody else can claim pages we are in the process of
>> - * freeing.  Do this by ensuring resv_huge_page always is greater than the
>> - * number of huge pages we plan to free when dropping the lock.
>>   */
>>  static void return_unused_surplus_pages(struct hstate *h,
>>  unsigned long unused_resv_pages)
>>  {
>>  unsigned long nr_pages;
>> +struct page *page, *t_page;
>> +struct list_head page_list;
>> +
>> +/* Uncommit the reservation */
>> +h->resv_huge_pages -= unused_resv_pages;
> 
> Is this ok for cases where remove_pool_huge_page fails early? I have to
> say I am kinda lost in the resv_huge_pages accounting here. The original
> code was already quite suspicious to me, TBH.

Yes, it is safe.  The existing code will do the same but perhaps in a
different way.

Some history is in the changelog for commit e5bbc8a6c992 ("mm/hugetlb.c:
fix reservation race when freeing surplus pages").  The race fixed by
that commit was introduced by the cond_resched_lock() which we are
removing in this patch.  Therefore, we can remove the tricky code that
was added to deal with dropping the lock.

I should add an explanation to the commit message.

Additionally, I suspect we may end up once again dropping the lock in
the below loop when adding vmemmap support.  Then, we would need to add
back the code in commit e5bbc8a6c992.  Sigh.

>>  
>>  /* Cannot return gigantic pages currently */
>>  if (hstate_is_gigantic(h))
>> @@ -2101,24 +2093,27 @@ static void return_unused_surplus_pages(struct 
>> hstate *h,
>>   * evenly across all nodes with memory. Iterate across these nodes
>>   * until we can no longer free unreserved surplus pages. This occurs
>>   * when the nodes with surplus pages have no free pages.
>> - * free_pool_huge_page() will balance the freed pages across the
>> + * remove_pool_huge_page() will balance the freed pages across the
>>   * on-line nodes with memory and will handle the hstate accounting.
>> - *
>> - * Note that we decrement resv_huge_pages as we free the pages.  If
>> - * we drop the lock, resv_huge_pages will still be sufficiently large
>> - * to cover subsequent pages we may free.
>>   */
>> +INIT_LIST_HEAD(&page_list);
>>  while (nr_pages--) {
>> -h->resv_huge_pages--;
>> -unused_resv_pages--;
>> -if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
>> +page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
>> +if (!page)
>>  goto out;
>> -cond_resched_lock(&hugetlb_lock);
>> +
>> +INIT_LIST_HEAD(&page->lru);
> 
> again unnecessary INIT_LIST_HEAD
> 
>> +list_add(&page->lru, &page_list);
>>  }
>>  
>>  out:
>> -/* Fully uncommit the reservation */
>> -h->resv_huge_pages -= unused_resv_pages;
>> +spin_unlock(&hugetlb_lock);
>> +list_for_each_entry_safe(page, t_page, &page_list, lru) {
>> +list_del(&page->lru);
>> +update_and_free_page(h, page);
>> +cond_resched();
>> +}
> 
> You have the same construct at 3 different places; maybe it deserves a
> little helper, update_and_free_page_batch.

Sure.  I will add it.
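
For illustration only, such a helper might look roughly like the below; it
simply factors out the construct repeated in this series and would be
called after dropping hugetlb_lock:

    static void update_and_free_page_batch(struct hstate *h,
                                           struct list_head *page_list)
    {
            struct page *page, *t_page;

            list_for_each_entry_safe(page, t_page, page_list, lru) {
                    list_del(&page->lru);
                    update_and_free_page(h, page);
                    cond_resched();
            }
    }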

-- 
Mike Kravetz


Re: [PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock

2021-03-25 Thread Mike Kravetz
On 3/25/21 3:55 AM, Michal Hocko wrote:
> On Wed 24-03-21 17:28:32, Mike Kravetz wrote:
>> With the introduction of remove_hugetlb_page(), there is no need for
>> update_and_free_page to hold the hugetlb lock.  Change all callers to
>> drop the lock before calling.
>>
>> With additional code modifications, this will allow loops which decrease
>> the huge page pool to drop the hugetlb_lock with each page to reduce
>> long hold times.
>>
>> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
>> a subsequent patch which restructures free_pool_huge_page.
>>
>> Signed-off-by: Mike Kravetz 
> 
> Acked-by: Michal Hocko 
> 
> One minor thing below
> 
> [...]
>> @@ -2563,22 +2572,37 @@ static void try_to_free_low(struct hstate *h, 
>> unsigned long count,
>>  nodemask_t *nodes_allowed)
>>  {
>>  int i;
>> +struct list_head page_list;
>> +struct page *page, *next;
>>  
>>  if (hstate_is_gigantic(h))
>>  return;
>>  
>> +/*
>> + * Collect pages to be freed on a list, and free after dropping lock
>> + */
>> +INIT_LIST_HEAD(&page_list);
>>  for_each_node_mask(i, *nodes_allowed) {
>> -struct page *page, *next;
>>  struct list_head *freel = &h->hugepage_freelists[i];
>>  list_for_each_entry_safe(page, next, freel, lru) {
>>  if (count >= h->nr_huge_pages)
>> -return;
>> +goto out;
>>  if (PageHighMem(page))
>>  continue;
>>  remove_hugetlb_page(h, page, false);
>> -update_and_free_page(h, page);
>> +INIT_LIST_HEAD(&page->lru);
> 
> What is the point of this INIT_LIST_HEAD? Page has been removed from the
> list by remove_hugetlb_page so it can be added to a new one without any
> reinitialization.

remove_hugetlb_page just does a list_del.  list_del will poison the
pointers in page->lru.  The following list_add will then complain about
list corruption.

I could replace the list_del in remove_hugetlb_page with list_del_init.
However, not all callers of remove_hugetlb_page will be adding the page
to a list.  If we just call update_and_free_page, there is no need to
reinitialize the list pointers.

Might be better to just use list_del_init in remove_hugetlb_page to
avoid any questions like this.
-- 
Mike Kravetz

> 
>> +list_add(&page->lru, &page_list);
>>  }
>>  }
>> +
>> +out:
>> +spin_unlock(&hugetlb_lock);
>> +list_for_each_entry_safe(page, next, &page_list, lru) {
>> +list_del(&page->lru);
>> +update_and_free_page(h, page);
>> +cond_resched();
>> +}
>> +spin_lock(&hugetlb_lock);
>>  }
>>  #else
>>  static inline void try_to_free_low(struct hstate *h, unsigned long count,
>> -- 
>> 2.30.2
>>
> 


Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()

2021-03-25 Thread Mike Kravetz
On 3/25/21 3:22 AM, Michal Hocko wrote:
> On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
>> On 25.03.21 01:28, Mike Kravetz wrote:
>>> From: Roman Gushchin 
>>>
>>> cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
>>> It makes it a blocking function, which complicates its usage from
>>> non-blocking contexts. For instance, hugetlbfs code is temporarily
>>> dropping the hugetlb_lock spinlock to call cma_release().
>>>
>>> This patch introduces a non-blocking cma_release_nowait(), which
>>> postpones the cma bitmap clearance. It's done later from a work
>>> context. The first page in the cma allocation is used to store
>>> the work struct. Because CMA allocations and de-allocations are
>>> usually not that frequent, a single global workqueue is used.
>>>
>>> To make sure that subsequent cma_alloc() call will pass, cma_alloc()
>>> flushes the cma_release_wq workqueue. To avoid a performance
>>> regression in the case when only cma_release() is used, gate it
>>> by a per-cma area flag, which is set by the first call
>>> of cma_release_nowait().
>>>
>>> Signed-off-by: Roman Gushchin 
>>> [mike.krav...@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
>>> Signed-off-by: Mike Kravetz 
>>> ---
>>
>>
>> 1. Is there a real reason this is a mutex and not a spin lock? It seems to
>> only protect the bitmap. Are bitmaps that huge that we spend a significant
>> amount of time in there?
> 
> Good question. Looking at the code it doesn't seem that there is any
> blockable operation or any heavy lifting done under the lock.
> 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
> the lock and there was a simple bitmap protection back then. I suspect
> the patch just followed the cma_mutex lead and used the same type of the
> lock. cma_mutex used to protect alloc_contig_range which is sleepable.
> 
> This all suggests that there is no real reason to use a sleepable lock
> at all and we do not need all this heavy lifting.
> 

When Roman first proposed these patches, I brought up the same issue:

https://lore.kernel.org/linux-mm/20201022023352.gc300...@carbon.dhcp.thefacebook.com/

Previously, Roman proposed replacing the mutex with a spinlock but
Joonsoo was opposed.

Adding Joonsoo on Cc:
-- 
Mike Kravetz


[PATCH 5/8] hugetlb: call update_and_free_page without hugetlb_lock

2021-03-24 Thread Mike Kravetz
With the introduction of remove_hugetlb_page(), there is no need for
update_and_free_page to hold the hugetlb lock.  Change all callers to
drop the lock before calling.

With additional code modifications, this will allow loops which decrease
the huge page pool to drop the hugetlb_lock with each page to reduce
long hold times.

The ugly unlock/lock cycle in free_pool_huge_page will be removed in
a subsequent patch which restructures free_pool_huge_page.

Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c | 34 +-
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3938ec086b5c..fae7f034d1eb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1450,16 +1450,18 @@ static void __free_huge_page(struct page *page)
 
if (HPageTemporary(page)) {
remove_hugetlb_page(h, page, false);
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
remove_hugetlb_page(h, page, true);
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
} else {
arch_clear_hugepage_flags(page);
enqueue_huge_page(h, page);
+   spin_unlock(&hugetlb_lock);
}
-   spin_unlock(&hugetlb_lock);
 }
 
 /*
@@ -1740,7 +1742,13 @@ static int free_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
list_entry(h->hugepage_freelists[node].next,
  struct page, lru);
remove_hugetlb_page(h, page, acct_surplus);
+   /*
+* unlock/lock around update_and_free_page is temporary
+* and will be removed with subsequent patch.
+*/
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
+   spin_lock(&hugetlb_lock);
ret = 1;
break;
}
@@ -1809,8 +1817,9 @@ int dissolve_free_huge_page(struct page *page)
}
remove_hugetlb_page(h, page, false);
h->max_huge_pages--;
+   spin_unlock(&hugetlb_lock);
update_and_free_page(h, head);
-   rc = 0;
+   return 0;
}
 out:
spin_unlock(&hugetlb_lock);
@@ -2563,22 +2572,37 @@ static void try_to_free_low(struct hstate *h, unsigned 
long count,
nodemask_t *nodes_allowed)
 {
int i;
+   struct list_head page_list;
+   struct page *page, *next;
 
if (hstate_is_gigantic(h))
return;
 
+   /*
+* Collect pages to be freed on a list, and free after dropping lock
+*/
+   INIT_LIST_HEAD(&page_list);
for_each_node_mask(i, *nodes_allowed) {
-   struct page *page, *next;
struct list_head *freel = &h->hugepage_freelists[i];
list_for_each_entry_safe(page, next, freel, lru) {
if (count >= h->nr_huge_pages)
-   return;
+   goto out;
if (PageHighMem(page))
continue;
remove_hugetlb_page(h, page, false);
-   update_and_free_page(h, page);
+   INIT_LIST_HEAD(&page->lru);
+   list_add(&page->lru, &page_list);
}
}
+
+out:
+   spin_unlock(&hugetlb_lock);
+   list_for_each_entry_safe(page, next, &page_list, lru) {
+   list_del(&page->lru);
+   update_and_free_page(h, page);
+   cond_resched();
+   }
+   spin_lock(&hugetlb_lock);
 }
 #else
 static inline void try_to_free_low(struct hstate *h, unsigned long count,
-- 
2.30.2



[PATCH 7/8] hugetlb: make free_huge_page irq safe

2021-03-24 Thread Mike Kravetz
Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
non-task context") was added to address the issue of free_huge_page
being called from irq context.  That commit hands off free_huge_page
processing to a workqueue if !in_task.  However, as seen in [1] this
does not cover all cases.  Instead, make the locks taken in the
free_huge_page irq safe.

This patch does the following:
- Make hugetlb_lock irq safe.  This is mostly a simple process of
  changing spin_*lock calls to spin_*lock_irq* calls.
- Make subpool lock irq safe in a similar manner.
- Revert the !in_task check and workqueue handoff.

[1] https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/

Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c| 169 +---
 mm/hugetlb_cgroup.c |   8 +--
 2 files changed, 67 insertions(+), 110 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a9785e73379f..e4c441b878f2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -93,9 +93,10 @@ static inline bool subpool_is_free(struct hugepage_subpool 
*spool)
return true;
 }
 
-static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
+static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
+   unsigned long irq_flags)
 {
-   spin_unlock(&spool->lock);
+   spin_unlock_irqrestore(&spool->lock, irq_flags);
 
/* If no pages are used, and no other handles to the subpool
 * remain, give up any reservations based on minimum size and
@@ -134,10 +135,12 @@ struct hugepage_subpool *hugepage_new_subpool(struct 
hstate *h, long max_hpages,
 
 void hugepage_put_subpool(struct hugepage_subpool *spool)
 {
-   spin_lock(&spool->lock);
+   unsigned long flags;
+
+   spin_lock_irqsave(&spool->lock, flags);
BUG_ON(!spool->count);
spool->count--;
-   unlock_or_release_subpool(spool);
+   unlock_or_release_subpool(spool, flags);
 }
 
 /*
@@ -156,7 +159,7 @@ static long hugepage_subpool_get_pages(struct 
hugepage_subpool *spool,
if (!spool)
return ret;
 
-   spin_lock(&spool->lock);
+   spin_lock_irq(&spool->lock);
 
if (spool->max_hpages != -1) {  /* maximum size accounting */
if ((spool->used_hpages + delta) <= spool->max_hpages)
@@ -183,7 +186,7 @@ static long hugepage_subpool_get_pages(struct 
hugepage_subpool *spool,
}
 
 unlock_ret:
-   spin_unlock(&spool->lock);
+   spin_unlock_irq(&spool->lock);
return ret;
 }
 
@@ -197,11 +200,12 @@ static long hugepage_subpool_put_pages(struct 
hugepage_subpool *spool,
   long delta)
 {
long ret = delta;
+   unsigned long flags;
 
if (!spool)
return delta;
 
-   spin_lock(&spool->lock);
+   spin_lock_irqsave(&spool->lock, flags);
 
if (spool->max_hpages != -1)/* maximum size accounting */
spool->used_hpages -= delta;
@@ -222,7 +226,7 @@ static long hugepage_subpool_put_pages(struct 
hugepage_subpool *spool,
 * If hugetlbfs_put_super couldn't free spool due to an outstanding
 * quota reference, free it now.
 */
-   unlock_or_release_subpool(spool);
+   unlock_or_release_subpool(spool, flags);
 
return ret;
 }
@@ -1401,7 +1405,7 @@ struct hstate *size_to_hstate(unsigned long size)
return NULL;
 }
 
-static void __free_huge_page(struct page *page)
+void free_huge_page(struct page *page)
 {
/*
 * Can't pass hstate in here because it is called from the
@@ -1411,6 +1415,7 @@ static void __free_huge_page(struct page *page)
int nid = page_to_nid(page);
struct hugepage_subpool *spool = hugetlb_page_subpool(page);
bool restore_reserve;
+   unsigned long flags;
 
VM_BUG_ON_PAGE(page_count(page), page);
VM_BUG_ON_PAGE(page_mapcount(page), page);
@@ -1439,7 +1444,7 @@ static void __free_huge_page(struct page *page)
restore_reserve = true;
}
 
-   spin_lock(&hugetlb_lock);
+   spin_lock_irqsave(&hugetlb_lock, flags);
ClearHPageMigratable(page);
hugetlb_cgroup_uncharge_page(hstate_index(h),
 pages_per_huge_page(h), page);
@@ -1450,66 +1455,18 @@ static void __free_huge_page(struct page *page)
 
if (HPageTemporary(page)) {
remove_hugetlb_page(h, page, false);
-   spin_unlock(&hugetlb_lock);
+   spin_unlock_irqrestore(&hugetlb_lock, flags);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
remove_hugetlb_page(h, page, true);
-   spin_unlock(&hugetlb_lock);
+   spin_unlock_irqrestore(&hugetlb_lock, flags);
update_and_free_page(h, page);
} else {
arch_cl

[PATCH 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page

2021-03-24 Thread Mike Kravetz
free_pool_huge_page was called with hugetlb_lock held.  It would remove
a hugetlb page, and then free the corresponding pages to the lower level
allocators such as buddy.  free_pool_huge_page was called in a loop to
remove hugetlb pages and these loops could hold the hugetlb_lock for a
considerable time.

Create new routine remove_pool_huge_page to replace free_pool_huge_page.
remove_pool_huge_page will remove the hugetlb page, and it must be
called with the hugetlb_lock held.  It will return the removed page and
it is the responsibility of the caller to free the page to the lower
level allocators.  The hugetlb_lock is dropped before freeing to these
allocators which results in shorter lock hold times.

Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c | 88 ++--
 1 file changed, 51 insertions(+), 37 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fae7f034d1eb..a9785e73379f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1204,7 +1204,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
 }
 
 /*
- * helper for free_pool_huge_page() - return the previously saved
+ * helper for remove_pool_huge_page() - return the previously saved
  * node ["this node"] from which to free a huge page.  Advance the
  * next node id whether or not we find a free huge page to free so
  * that the next attempt to free addresses the next node.
@@ -1720,16 +1720,18 @@ static int alloc_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
 }
 
 /*
- * Free huge page from pool from next node to free.
- * Attempt to keep persistent huge pages more or less
- * balanced over allowed nodes.
+ * Remove huge page from pool from next node to free.  Attempt to keep
+ * persistent huge pages more or less balanced over allowed nodes.
+ * This routine only 'removes' the hugetlb page.  The caller must make
+ * an additional call to free the page to low level allocators.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
-bool acct_surplus)
+static struct page *remove_pool_huge_page(struct hstate *h,
+   nodemask_t *nodes_allowed,
+bool acct_surplus)
 {
int nr_nodes, node;
-   int ret = 0;
+   struct page *page = NULL;
 
for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
/*
@@ -1738,23 +1740,14 @@ static int free_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
 */
if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
!list_empty(>hugepage_freelists[node])) {
-   struct page *page =
-   list_entry(h->hugepage_freelists[node].next,
+   page = list_entry(h->hugepage_freelists[node].next,
  struct page, lru);
remove_hugetlb_page(h, page, acct_surplus);
-   /*
-* unlock/lock around update_and_free_page is temporary
-* and will be removed with subsequent patch.
-*/
-   spin_unlock(&hugetlb_lock);
-   update_and_free_page(h, page);
-   spin_lock(&hugetlb_lock);
-   ret = 1;
break;
}
}
 
-   return ret;
+   return page;
 }
 
 /*
@@ -2074,17 +2067,16 @@ static int gather_surplus_pages(struct hstate *h, long 
delta)
  *to the associated reservation map.
  * 2) Free any unused surplus pages that may have been allocated to satisfy
  *the reservation.  As many as unused_resv_pages may be freed.
- *
- * Called with hugetlb_lock held.  However, the lock could be dropped (and
- * reacquired) during calls to cond_resched_lock.  Whenever dropping the lock,
- * we must make sure nobody else can claim pages we are in the process of
- * freeing.  Do this by ensuring resv_huge_page always is greater than the
- * number of huge pages we plan to free when dropping the lock.
  */
 static void return_unused_surplus_pages(struct hstate *h,
unsigned long unused_resv_pages)
 {
unsigned long nr_pages;
+   struct page *page, *t_page;
+   struct list_head page_list;
+
+   /* Uncommit the reservation */
+   h->resv_huge_pages -= unused_resv_pages;
 
/* Cannot return gigantic pages currently */
if (hstate_is_gigantic(h))
@@ -2101,24 +2093,27 @@ static void return_unused_surplus_pages(struct hstate 
*h,
 * evenly across all nodes with memory. Iterate across these nodes
 * until we can no longer free unreserved surplus pages. This occurs
 * when the nodes with surplus pages have no free page

[PATCH 4/8] hugetlb: create remove_hugetlb_page() to separate functionality

2021-03-24 Thread Mike Kravetz
The new remove_hugetlb_page() routine is designed to remove a hugetlb
page from hugetlbfs processing.  It will remove the page from the active
or free list, update global counters and set the compound page
destructor to NULL so that PageHuge() will return false for the 'page'.
After this call, the 'page' can be treated as a normal compound page or
a collection of base size pages.

remove_hugetlb_page is to be called with the hugetlb_lock held.

Creating this routine and separating functionality is in preparation for
restructuring code to reduce lock hold times.

Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c | 70 +---
 1 file changed, 45 insertions(+), 25 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 404b0b1c5258..3938ec086b5c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1327,6 +1327,46 @@ static inline void destroy_compound_gigantic_page(struct 
page *page,
unsigned int order) { }
 #endif
 
+/*
+ * Remove hugetlb page from lists, and update dtor so that page appears
+ * as just a compound page.  A reference is held on the page.
+ * NOTE: hugetlb specific page flags stored in page->private are not
+ *  automatically cleared.  These flags may be used in routines
+ *  which operate on the resulting compound page.
+ *
+ * Must be called with hugetlb lock held.
+ */
+static void remove_hugetlb_page(struct hstate *h, struct page *page,
+   bool adjust_surplus)
+{
+   int nid = page_to_nid(page);
+
+   if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
+   return;
+
+   list_del(&page->lru);
+
+   if (HPageFreed(page)) {
+   h->free_huge_pages--;
+   h->free_huge_pages_node[nid]--;
+   ClearHPageFreed(page);
+   }
+   if (adjust_surplus) {
+   h->surplus_huge_pages--;
+   h->surplus_huge_pages_node[nid]--;
+   }
+
+   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
+   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
+
+   ClearHPageTemporary(page);
+   set_page_refcounted(page);
+   set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
+
+   h->nr_huge_pages--;
+   h->nr_huge_pages_node[nid]--;
+}
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
int i;
@@ -1335,8 +1375,6 @@ static void update_and_free_page(struct hstate *h, struct 
page *page)
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
 
-   h->nr_huge_pages--;
-   h->nr_huge_pages_node[page_to_nid(page)]--;
for (i = 0; i < pages_per_huge_page(h);
 i++, subpage = mem_map_next(subpage, page, i)) {
subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
@@ -1344,10 +1382,6 @@ static void update_and_free_page(struct hstate *h, 
struct page *page)
1 << PG_active | 1 << PG_private |
1 << PG_writeback);
}
-   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
-   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
-   set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
-   set_page_refcounted(page);
if (hstate_is_gigantic(h)) {
destroy_compound_gigantic_page(page, huge_page_order(h));
free_gigantic_page(page, huge_page_order(h));
@@ -1415,15 +1449,12 @@ static void __free_huge_page(struct page *page)
h->resv_huge_pages++;
 
if (HPageTemporary(page)) {
-   list_del(>lru);
-   ClearHPageTemporary(page);
+   remove_hugetlb_page(h, page, false);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
-   list_del(>lru);
+   remove_hugetlb_page(h, page, true);
update_and_free_page(h, page);
-   h->surplus_huge_pages--;
-   h->surplus_huge_pages_node[nid]--;
} else {
arch_clear_hugepage_flags(page);
enqueue_huge_page(h, page);
@@ -1708,13 +1739,7 @@ static int free_pool_huge_page(struct hstate *h, 
nodemask_t *nodes_allowed,
struct page *page =
list_entry(h->hugepage_freelists[node].next,
  struct page, lru);
-   list_del(>lru);
-   h->free_huge_pages--;
-   h->free_huge_pages_node[node]--;
-   if (acct_surplus) {
-   h->surplus_huge_pages--;
-   h->su

[PATCH 2/8] mm: hugetlb: don't drop hugetlb_lock around cma_release() call

2021-03-24 Thread Mike Kravetz
From: Roman Gushchin 

Replace blocking cma_release() with a non-blocking cma_release_nowait()
call, so there is no more need to temporarily drop hugetlb_lock.

Signed-off-by: Roman Gushchin 
Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 408dbc08298a..f9ba63fc1747 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1258,10 +1258,11 @@ static void free_gigantic_page(struct page *page, 
unsigned int order)
 {
/*
 * If the page isn't allocated using the cma allocator,
-* cma_release() returns false.
+* cma_release_nowait() returns false.
 */
 #ifdef CONFIG_CMA
-   if (cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order))
+   if (cma_release_nowait(hugetlb_cma[page_to_nid(page)], page,
+  1 << order))
return;
 #endif
 
@@ -1348,14 +1349,8 @@ static void update_and_free_page(struct hstate *h, 
struct page *page)
set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
set_page_refcounted(page);
if (hstate_is_gigantic(h)) {
-   /*
-* Temporarily drop the hugetlb_lock, because
-* we might block in free_gigantic_page().
-*/
-   spin_unlock(&hugetlb_lock);
destroy_compound_gigantic_page(page, huge_page_order(h));
free_gigantic_page(page, huge_page_order(h));
-   spin_lock(&hugetlb_lock);
} else {
__free_pages(page, huge_page_order(h));
}
-- 
2.30.2



[PATCH 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock

2021-03-24 Thread Mike Kravetz
After making hugetlb lock irq safe and separating some functionality
done under the lock, add some lockdep_assert_held to help verify
locking.

Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e4c441b878f2..de5b3cf4a155 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1062,6 +1062,8 @@ static bool vma_has_reserves(struct vm_area_struct *vma, 
long chg)
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
int nid = page_to_nid(page);
+
+   lockdep_assert_held(&hugetlb_lock);
list_move(&page->lru, &h->hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
@@ -1073,6 +1075,7 @@ static struct page *dequeue_huge_page_node_exact(struct 
hstate *h, int nid)
struct page *page;
bool pin = !!(current->flags & PF_MEMALLOC_PIN);
 
+   lockdep_assert_held(&hugetlb_lock);
list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
if (pin && !is_pinnable_page(page))
continue;
@@ -1345,6 +1348,7 @@ static void remove_hugetlb_page(struct hstate *h, struct 
page *page,
 {
int nid = page_to_nid(page);
 
+   lockdep_assert_held(&hugetlb_lock);
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
 
@@ -1690,6 +1694,7 @@ static struct page *remove_pool_huge_page(struct hstate 
*h,
int nr_nodes, node;
struct page *page = NULL;
 
+   lockdep_assert_held(&hugetlb_lock);
for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
/*
 * If we're returning unused surplus pages, only examine
@@ -1939,6 +1944,7 @@ static int gather_surplus_pages(struct hstate *h, long 
delta)
long needed, allocated;
bool alloc_ok = true;
 
+   lockdep_assert_held(&hugetlb_lock);
needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
if (needed <= 0) {
h->resv_huge_pages += delta;
@@ -2032,6 +2038,7 @@ static void return_unused_surplus_pages(struct hstate *h,
struct page *page, *t_page;
struct list_head page_list;
 
+   lockdep_assert_held(&hugetlb_lock);
/* Uncommit the reservation */
h->resv_huge_pages -= unused_resv_pages;
 
@@ -2527,6 +2534,7 @@ static void try_to_free_low(struct hstate *h, unsigned 
long count,
struct list_head page_list;
struct page *page, *next;
 
+   lockdep_assert_held(&hugetlb_lock);
if (hstate_is_gigantic(h))
return;
 
@@ -2573,6 +2581,7 @@ static int adjust_pool_surplus(struct hstate *h, 
nodemask_t *nodes_allowed,
 {
int nr_nodes, node;
 
+   lockdep_assert_held(&hugetlb_lock);
VM_BUG_ON(delta != -1 && delta != 1);
 
if (delta < 0) {
-- 
2.30.2



[PATCH 0/8] make hugetlb put_page safe for all calling contexts

2021-03-24 Thread Mike Kravetz
This effort is the result of a recent bug report [1].  In subsequent
discussions [2], it was deemed necessary to properly fix the hugetlb
put_page path (free_huge_page).  This RFC provides a possible way to
address the issue.  Comments are welcome/encouraged as several attempts
at this have been made in the past.

This series is based on v5.12-rc3-mmotm-2021-03-17-22-24.  At a high
level, the series provides:
- Patches 1 & 2 from Roman Gushchin provide cma_release_nowait()
- Patches 4, 5 & 6 are aimed at reducing lock hold times.  To be clear
  the goal is to eliminate single lock hold times of a long duration.
  Overall lock hold time is not addressed.
- Patch 7 makes hugetlb_lock and subpool lock IRQ safe.  It also reverts
  the code which defers calls to a workqueue if !in_task.
- Patch 8 adds some lockdep_assert_held() calls

[1] https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/
[2] http://lkml.kernel.org/r/20210311021321.127500-1-mike.krav...@oracle.com

RFC -> v1
- Add Roman's cma_release_nowait() patches.  This eliminated the need
  to do a workqueue handoff in hugetlb code.
- Use Michal's suggestion to batch pages for freeing.  This eliminated
  the need to recalculate loop control variables when dropping the lock.
- Added lockdep_assert_held() calls
- Rebased to v5.12-rc3-mmotm-2021-03-17-22-24

Mike Kravetz (6):
  hugetlb: add per-hstate mutex to synchronize user adjustments
  hugetlb: create remove_hugetlb_page() to separate functionality
  hugetlb: call update_and_free_page without hugetlb_lock
  hugetlb: change free_pool_huge_page to remove_pool_huge_page
  hugetlb: make free_huge_page irq safe
  hugetlb: add lockdep_assert_held() calls for hugetlb_lock

Roman Gushchin (2):
  mm: cma: introduce cma_release_nowait()
  mm: hugetlb: don't drop hugetlb_lock around cma_release() call

 include/linux/cma.h |   2 +
 include/linux/hugetlb.h |   1 +
 mm/cma.c|  93 +++
 mm/cma.h|   5 +
 mm/hugetlb.c| 354 +---
 mm/hugetlb_cgroup.c |   8 +-
 6 files changed, 294 insertions(+), 169 deletions(-)

-- 
2.30.2



[PATCH 1/8] mm: cma: introduce cma_release_nowait()

2021-03-24 Thread Mike Kravetz
From: Roman Gushchin 

cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
It makes it a blocking function, which complicates its usage from
non-blocking contexts. For instance, hugetlbfs code is temporarily
dropping the hugetlb_lock spinlock to call cma_release().

This patch introduces a non-blocking cma_release_nowait(), which
postpones the cma bitmap clearance. It's done later from a work
context. The first page in the cma allocation is used to store
the work struct. Because CMA allocations and de-allocations are
usually not that frequent, a single global workqueue is used.

To make sure that subsequent cma_alloc() call will pass, cma_alloc()
flushes the cma_release_wq workqueue. To avoid a performance
regression in the case when only cma_release() is used, gate it
by a per-cma area flag, which is set by the first call
of cma_release_nowait().

Signed-off-by: Roman Gushchin 
[mike.krav...@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
Signed-off-by: Mike Kravetz 
---
 include/linux/cma.h |  2 +
 mm/cma.c| 93 +
 mm/cma.h|  5 +++
 3 files changed, 100 insertions(+)

diff --git a/include/linux/cma.h b/include/linux/cma.h
index 217999c8a762..497eca478c2f 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -47,6 +47,8 @@ extern int cma_init_reserved_mem(phys_addr_t base, 
phys_addr_t size,
 extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int 
align,
  bool no_warn);
 extern bool cma_release(struct cma *cma, const struct page *pages, unsigned 
int count);
+extern bool cma_release_nowait(struct cma *cma, const struct page *pages,
+  unsigned int count);
 
 extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void 
*data);
 #endif
diff --git a/mm/cma.c b/mm/cma.c
index 90e27458ddb7..14cc8e901703 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -36,9 +36,18 @@
 
 #include "cma.h"
 
+struct cma_clear_bitmap_work {
+   struct work_struct work;
+   struct cma *cma;
+   unsigned long pfn;
+   unsigned int count;
+};
+
 struct cma cma_areas[MAX_CMA_AREAS];
 unsigned cma_area_count;
 
+struct workqueue_struct *cma_release_wq;
+
 phys_addr_t cma_get_base(const struct cma *cma)
 {
return PFN_PHYS(cma->base_pfn);
@@ -146,6 +155,10 @@ static int __init cma_init_reserved_areas(void)
for (i = 0; i < cma_area_count; i++)
cma_activate_area(&cma_areas[i]);
 
+   cma_release_wq = create_workqueue("cma_release");
+   if (!cma_release_wq)
+   return -ENOMEM;
+
return 0;
 }
 core_initcall(cma_init_reserved_areas);
@@ -203,6 +216,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, 
phys_addr_t size,
 
cma->base_pfn = PFN_DOWN(base);
cma->count = size >> PAGE_SHIFT;
+   cma->flags = 0;
cma->order_per_bit = order_per_bit;
*res_cma = cma;
cma_area_count++;
@@ -452,6 +466,14 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
goto out;
 
for (;;) {
+   /*
+* If the CMA bitmap is cleared asynchronously after
+* cma_release_nowait(), cma release workqueue has to be
+* flushed here in order to make the allocation succeed.
+*/
+   if (test_bit(CMA_DELAYED_RELEASE, &cma->flags))
+   flush_workqueue(cma_release_wq);
+
mutex_lock(&cma->lock);
bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
bitmap_maxno, start, bitmap_count, mask,
@@ -552,6 +574,77 @@ bool cma_release(struct cma *cma, const struct page 
*pages, unsigned int count)
return true;
 }
 
+static void cma_clear_bitmap_fn(struct work_struct *work)
+{
+   struct cma_clear_bitmap_work *w;
+
+   w = container_of(work, struct cma_clear_bitmap_work, work);
+
+   cma_clear_bitmap(w->cma, w->pfn, w->count);
+
+   __free_page(pfn_to_page(w->pfn));
+}
+
+/**
+ * cma_release_nowait() - release allocated pages without blocking
+ * @cma:   Contiguous memory region for which the allocation is performed.
+ * @pages: Allocated pages.
+ * @count: Number of allocated pages.
+ *
+ * Similar to cma_release(), this function releases memory allocated
+ * by cma_alloc(), but unlike cma_release() is non-blocking and can be
+ * called from an atomic context.
+ * It returns false when provided pages do not belong to contiguous area
+ * and true otherwise.
+ */
+bool cma_release_nowait(struct cma *cma, const struct page *pages,
+   unsigned int count)
+{
+   struct cma_clear_bitmap_work *work;
+   unsigned long pfn;
+
+   if (!cma || !pages)
+   return false;
+
+   pr_debug("%s(page %p)\n", __func__, (void *)pages);
+
+   pfn = page_to_

[PATCH 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments

2021-03-24 Thread Mike Kravetz
The helper routine hstate_next_node_to_alloc accesses and modifies the
hstate variable next_nid_to_alloc.  The helper is used by the routines
alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
called with hugetlb_lock held.  However, alloc_pool_huge_page can not
be called with the hugetlb lock held as it will call the page allocator.
Two instances of alloc_pool_huge_page could be run in parallel or
alloc_pool_huge_page could run in parallel with adjust_pool_surplus
which may result in the variable next_nid_to_alloc becoming invalid
for the caller and pages being allocated on the wrong node.

Both alloc_pool_huge_page and adjust_pool_surplus are only called from
the routine set_max_huge_pages after boot.  set_max_huge_pages is only
called as the result of a user writing to the proc/sysfs nr_hugepages,
or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.

It makes little sense to allow multiple adjustments to the number of
hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
allow one hugetlb page adjustment at a time.  This will synchronize
modifications to the next_nid_to_alloc variable.

Signed-off-by: Mike Kravetz 
---
 include/linux/hugetlb.h | 1 +
 mm/hugetlb.c| 5 +
 2 files changed, 6 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a7f7d5f328dc..8817ec987d68 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -566,6 +566,7 @@ HPAGEFLAG(Freed, freed)
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
+   struct mutex mutex;
int next_nid_to_alloc;
int next_nid_to_free;
unsigned int order;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f9ba63fc1747..404b0b1c5258 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2616,6 +2616,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned 
long count, int nid,
else
return -ENOMEM;
 
+   /* mutex prevents concurrent adjustments for the same hstate */
+   mutex_lock(&h->mutex);
spin_lock(&hugetlb_lock);
 
/*
@@ -2648,6 +2650,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned 
long count, int nid,
if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
if (count > persistent_huge_pages(h)) {
spin_unlock(&hugetlb_lock);
+   mutex_unlock(&h->mutex);
NODEMASK_FREE(node_alloc_noretry);
return -EINVAL;
}
@@ -2722,6 +2725,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned 
long count, int nid,
 out:
h->max_huge_pages = persistent_huge_pages(h);
spin_unlock(&hugetlb_lock);
+   mutex_unlock(&h->mutex);
 
NODEMASK_FREE(node_alloc_noretry);
 
@@ -3209,6 +3213,7 @@ void __init hugetlb_add_hstate(unsigned int order)
BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
BUG_ON(order == 0);
h = &hstates[hugetlb_max_hstate++];
+   mutex_init(&h->mutex);
h->order = order;
h->mask = ~(huge_page_size(h) - 1);
for (i = 0; i < MAX_NUMNODES; ++i)
-- 
2.30.2


