Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-08 Thread Muchun Song
On Fri, Jan 8, 2021 at 8:04 PM Michal Hocko  wrote:
>
> On Fri 08-01-21 19:52:54, Muchun Song wrote:
> > On Fri, Jan 8, 2021 at 7:44 PM Michal Hocko  wrote:
> > >
> > > On Fri 08-01-21 18:08:57, Muchun Song wrote:
> > > > On Fri, Jan 8, 2021 at 5:31 PM Michal Hocko  wrote:
> > > > >
> > > > > On Fri 08-01-21 17:01:03, Muchun Song wrote:
> > > > > > On Fri, Jan 8, 2021 at 4:43 PM Michal Hocko  wrote:
> > > > > > >
> > > > > > > On Thu 07-01-21 23:11:22, Muchun Song wrote:
> > > > > [..]
> > > > > > > > But I find a tricky problem to solve. See free_huge_page().
> > > > > > > > If we are in non-task context, we should schedule a work
> > > > > > > > to free the page. We reuse the page->mapping. If the page
> > > > > > > > is already freed by the dissolve path. We should not touch
> > > > > > > > the page->mapping. So we need to check PageHuge().
> > > > > > > > The check and llist_add() should be protected by
> > > > > > > > hugetlb_lock. But we cannot do that. Right? If dissolve
> > > > > > > > happens after it is linked to the list. We also should
> > > > > > > > remove it from the list (hpage_freelist). It seems to make
> > > > > > > > the thing more complex.
> > > > > > >
> > > > > > > I am not sure I follow you here but yes PageHuge under 
> > > > > > > hugetlb_lock
> > > > > > > should be the reliable way to check for the race. I am not sure 
> > > > > > > why we
> > > > > > > really need to care about mapping or other state.
> > > > > >
> > > > > > CPU0:   CPU1:
> > > > > > free_huge_page(page)
> > > > > >   if (PageHuge(page))
> > > > > > dissolve_free_huge_page(page)
> > > > > >   spin_lock(_lock)
> > > > > >   update_and_free_page(page)
> > > > > >   spin_unlock(_lock)
> > > > > > llist_add(page->mapping)
> > > > > > // the mapping is corrupted
> > > > > >
> > > > > > The PageHuge(page) and llist_add() should be protected by
> > > > > > hugetlb_lock. Right? If so, we cannot hold hugetlb_lock
> > > > > > in free_huge_page() path.
> > > > >
> > > > > OK, I see. I completely forgot about this snowflake. I thought that
> > > > > free_huge_page was a typo missing initial __. Anyway you are right 
> > > > > that
> > > > > this path needs a check as well. But I don't see why we couldn't use 
> > > > > the
> > > > > lock here. The lock can be held only inside the !in_task branch.
> > > >
> > > > Because we hold the hugetlb_lock without disable irq. So if an interrupt
> > > > occurs after we hold the lock. And we also free a HugeTLB page. Then
> > > > it leads to deadlock.
> > >
> > > There is nothing really to prevent making hugetlb_lock irq safe, isn't
> > > it?
> >
> > Yeah. We can make the hugetlb_lock irq safe. But why have we not
> > done this? Maybe the commit changelog can provide more information.
> >
> > See 
> > https://github.com/torvalds/linux/commit/c77c0a8ac4c522638a8242fcb9de9496e3cdbb2d
>
> Dang! Maybe it is the time to finally stack one workaround on top of the
> other and put this code into the shape. The amount of hackery and subtle
> details has just grown beyond healthy!

>From readability point of view, maybe making the hugetlb_lock irq safe
is an improvement (in the feature).

The details are always messy. :)

> --
> Michal Hocko
> SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-08 Thread Michal Hocko
On Fri 08-01-21 19:52:54, Muchun Song wrote:
> On Fri, Jan 8, 2021 at 7:44 PM Michal Hocko  wrote:
> >
> > On Fri 08-01-21 18:08:57, Muchun Song wrote:
> > > On Fri, Jan 8, 2021 at 5:31 PM Michal Hocko  wrote:
> > > >
> > > > On Fri 08-01-21 17:01:03, Muchun Song wrote:
> > > > > On Fri, Jan 8, 2021 at 4:43 PM Michal Hocko  wrote:
> > > > > >
> > > > > > On Thu 07-01-21 23:11:22, Muchun Song wrote:
> > > > [..]
> > > > > > > But I find a tricky problem to solve. See free_huge_page().
> > > > > > > If we are in non-task context, we should schedule a work
> > > > > > > to free the page. We reuse the page->mapping. If the page
> > > > > > > is already freed by the dissolve path. We should not touch
> > > > > > > the page->mapping. So we need to check PageHuge().
> > > > > > > The check and llist_add() should be protected by
> > > > > > > hugetlb_lock. But we cannot do that. Right? If dissolve
> > > > > > > happens after it is linked to the list. We also should
> > > > > > > remove it from the list (hpage_freelist). It seems to make
> > > > > > > the thing more complex.
> > > > > >
> > > > > > I am not sure I follow you here but yes PageHuge under hugetlb_lock
> > > > > > should be the reliable way to check for the race. I am not sure why 
> > > > > > we
> > > > > > really need to care about mapping or other state.
> > > > >
> > > > > CPU0:   CPU1:
> > > > > free_huge_page(page)
> > > > >   if (PageHuge(page))
> > > > > dissolve_free_huge_page(page)
> > > > >   spin_lock(_lock)
> > > > >   update_and_free_page(page)
> > > > >   spin_unlock(_lock)
> > > > > llist_add(page->mapping)
> > > > > // the mapping is corrupted
> > > > >
> > > > > The PageHuge(page) and llist_add() should be protected by
> > > > > hugetlb_lock. Right? If so, we cannot hold hugetlb_lock
> > > > > in free_huge_page() path.
> > > >
> > > > OK, I see. I completely forgot about this snowflake. I thought that
> > > > free_huge_page was a typo missing initial __. Anyway you are right that
> > > > this path needs a check as well. But I don't see why we couldn't use the
> > > > lock here. The lock can be held only inside the !in_task branch.
> > >
> > > Because we hold the hugetlb_lock without disable irq. So if an interrupt
> > > occurs after we hold the lock. And we also free a HugeTLB page. Then
> > > it leads to deadlock.
> >
> > There is nothing really to prevent making hugetlb_lock irq safe, isn't
> > it?
> 
> Yeah. We can make the hugetlb_lock irq safe. But why have we not
> done this? Maybe the commit changelog can provide more information.
> 
> See 
> https://github.com/torvalds/linux/commit/c77c0a8ac4c522638a8242fcb9de9496e3cdbb2d

Dang! Maybe it is the time to finally stack one workaround on top of the
other and put this code into the shape. The amount of hackery and subtle
details has just grown beyond healthy!
-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-08 Thread Muchun Song
On Fri, Jan 8, 2021 at 7:44 PM Michal Hocko  wrote:
>
> On Fri 08-01-21 18:08:57, Muchun Song wrote:
> > On Fri, Jan 8, 2021 at 5:31 PM Michal Hocko  wrote:
> > >
> > > On Fri 08-01-21 17:01:03, Muchun Song wrote:
> > > > On Fri, Jan 8, 2021 at 4:43 PM Michal Hocko  wrote:
> > > > >
> > > > > On Thu 07-01-21 23:11:22, Muchun Song wrote:
> > > [..]
> > > > > > But I find a tricky problem to solve. See free_huge_page().
> > > > > > If we are in non-task context, we should schedule a work
> > > > > > to free the page. We reuse the page->mapping. If the page
> > > > > > is already freed by the dissolve path. We should not touch
> > > > > > the page->mapping. So we need to check PageHuge().
> > > > > > The check and llist_add() should be protected by
> > > > > > hugetlb_lock. But we cannot do that. Right? If dissolve
> > > > > > happens after it is linked to the list. We also should
> > > > > > remove it from the list (hpage_freelist). It seems to make
> > > > > > the thing more complex.
> > > > >
> > > > > I am not sure I follow you here but yes PageHuge under hugetlb_lock
> > > > > should be the reliable way to check for the race. I am not sure why we
> > > > > really need to care about mapping or other state.
> > > >
> > > > CPU0:   CPU1:
> > > > free_huge_page(page)
> > > >   if (PageHuge(page))
> > > > dissolve_free_huge_page(page)
> > > >   spin_lock(_lock)
> > > >   update_and_free_page(page)
> > > >   spin_unlock(_lock)
> > > > llist_add(page->mapping)
> > > > // the mapping is corrupted
> > > >
> > > > The PageHuge(page) and llist_add() should be protected by
> > > > hugetlb_lock. Right? If so, we cannot hold hugetlb_lock
> > > > in free_huge_page() path.
> > >
> > > OK, I see. I completely forgot about this snowflake. I thought that
> > > free_huge_page was a typo missing initial __. Anyway you are right that
> > > this path needs a check as well. But I don't see why we couldn't use the
> > > lock here. The lock can be held only inside the !in_task branch.
> >
> > Because we hold the hugetlb_lock without disable irq. So if an interrupt
> > occurs after we hold the lock. And we also free a HugeTLB page. Then
> > it leads to deadlock.
>
> There is nothing really to prevent making hugetlb_lock irq safe, isn't
> it?

Yeah. We can make the hugetlb_lock irq safe. But why have we not
done this? Maybe the commit changelog can provide more information.

See 
https://github.com/torvalds/linux/commit/c77c0a8ac4c522638a8242fcb9de9496e3cdbb2d

> --
> Michal Hocko
> SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-08 Thread Michal Hocko
On Fri 08-01-21 18:08:57, Muchun Song wrote:
> On Fri, Jan 8, 2021 at 5:31 PM Michal Hocko  wrote:
> >
> > On Fri 08-01-21 17:01:03, Muchun Song wrote:
> > > On Fri, Jan 8, 2021 at 4:43 PM Michal Hocko  wrote:
> > > >
> > > > On Thu 07-01-21 23:11:22, Muchun Song wrote:
> > [..]
> > > > > But I find a tricky problem to solve. See free_huge_page().
> > > > > If we are in non-task context, we should schedule a work
> > > > > to free the page. We reuse the page->mapping. If the page
> > > > > is already freed by the dissolve path. We should not touch
> > > > > the page->mapping. So we need to check PageHuge().
> > > > > The check and llist_add() should be protected by
> > > > > hugetlb_lock. But we cannot do that. Right? If dissolve
> > > > > happens after it is linked to the list. We also should
> > > > > remove it from the list (hpage_freelist). It seems to make
> > > > > the thing more complex.
> > > >
> > > > I am not sure I follow you here but yes PageHuge under hugetlb_lock
> > > > should be the reliable way to check for the race. I am not sure why we
> > > > really need to care about mapping or other state.
> > >
> > > CPU0:   CPU1:
> > > free_huge_page(page)
> > >   if (PageHuge(page))
> > > dissolve_free_huge_page(page)
> > >   spin_lock(_lock)
> > >   update_and_free_page(page)
> > >   spin_unlock(_lock)
> > > llist_add(page->mapping)
> > > // the mapping is corrupted
> > >
> > > The PageHuge(page) and llist_add() should be protected by
> > > hugetlb_lock. Right? If so, we cannot hold hugetlb_lock
> > > in free_huge_page() path.
> >
> > OK, I see. I completely forgot about this snowflake. I thought that
> > free_huge_page was a typo missing initial __. Anyway you are right that
> > this path needs a check as well. But I don't see why we couldn't use the
> > lock here. The lock can be held only inside the !in_task branch.
> 
> Because we hold the hugetlb_lock without disable irq. So if an interrupt
> occurs after we hold the lock. And we also free a HugeTLB page. Then
> it leads to deadlock.

There is nothing really to prevent making hugetlb_lock irq safe, isn't
it?
-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-08 Thread Muchun Song
On Fri, Jan 8, 2021 at 5:31 PM Michal Hocko  wrote:
>
> On Fri 08-01-21 17:01:03, Muchun Song wrote:
> > On Fri, Jan 8, 2021 at 4:43 PM Michal Hocko  wrote:
> > >
> > > On Thu 07-01-21 23:11:22, Muchun Song wrote:
> [..]
> > > > But I find a tricky problem to solve. See free_huge_page().
> > > > If we are in non-task context, we should schedule a work
> > > > to free the page. We reuse the page->mapping. If the page
> > > > is already freed by the dissolve path. We should not touch
> > > > the page->mapping. So we need to check PageHuge().
> > > > The check and llist_add() should be protected by
> > > > hugetlb_lock. But we cannot do that. Right? If dissolve
> > > > happens after it is linked to the list. We also should
> > > > remove it from the list (hpage_freelist). It seems to make
> > > > the thing more complex.
> > >
> > > I am not sure I follow you here but yes PageHuge under hugetlb_lock
> > > should be the reliable way to check for the race. I am not sure why we
> > > really need to care about mapping or other state.
> >
> > CPU0:   CPU1:
> > free_huge_page(page)
> >   if (PageHuge(page))
> > dissolve_free_huge_page(page)
> >   spin_lock(_lock)
> >   update_and_free_page(page)
> >   spin_unlock(_lock)
> > llist_add(page->mapping)
> > // the mapping is corrupted
> >
> > The PageHuge(page) and llist_add() should be protected by
> > hugetlb_lock. Right? If so, we cannot hold hugetlb_lock
> > in free_huge_page() path.
>
> OK, I see. I completely forgot about this snowflake. I thought that
> free_huge_page was a typo missing initial __. Anyway you are right that
> this path needs a check as well. But I don't see why we couldn't use the
> lock here. The lock can be held only inside the !in_task branch.

Because we hold the hugetlb_lock without disable irq. So if an interrupt
occurs after we hold the lock. And we also free a HugeTLB page. Then
it leads to deadlock.

task context: interrupt context:

put_page(page)
  spin_lock(_lock)

put_page(page)
  spin_lock(_lock)
  // deadlock

  spin_unlock(_lock)

> Although it would be much more nicer if the lock was held at this layer
> rather than both free_huge_page and __free_huge_page. But that clean up
> can be done on top.
>
> --
> Michal Hocko
> SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-08 Thread Michal Hocko
On Fri 08-01-21 17:01:03, Muchun Song wrote:
> On Fri, Jan 8, 2021 at 4:43 PM Michal Hocko  wrote:
> >
> > On Thu 07-01-21 23:11:22, Muchun Song wrote:
[..]
> > > But I find a tricky problem to solve. See free_huge_page().
> > > If we are in non-task context, we should schedule a work
> > > to free the page. We reuse the page->mapping. If the page
> > > is already freed by the dissolve path. We should not touch
> > > the page->mapping. So we need to check PageHuge().
> > > The check and llist_add() should be protected by
> > > hugetlb_lock. But we cannot do that. Right? If dissolve
> > > happens after it is linked to the list. We also should
> > > remove it from the list (hpage_freelist). It seems to make
> > > the thing more complex.
> >
> > I am not sure I follow you here but yes PageHuge under hugetlb_lock
> > should be the reliable way to check for the race. I am not sure why we
> > really need to care about mapping or other state.
> 
> CPU0:   CPU1:
> free_huge_page(page)
>   if (PageHuge(page))
> dissolve_free_huge_page(page)
>   spin_lock(_lock)
>   update_and_free_page(page)
>   spin_unlock(_lock)
> llist_add(page->mapping)
> // the mapping is corrupted
> 
> The PageHuge(page) and llist_add() should be protected by
> hugetlb_lock. Right? If so, we cannot hold hugetlb_lock
> in free_huge_page() path.

OK, I see. I completely forgot about this snowflake. I thought that
free_huge_page was a typo missing initial __. Anyway you are right that
this path needs a check as well. But I don't see why we couldn't use the
lock here. The lock can be held only inside the !in_task branch.
Although it would be much more nicer if the lock was held at this layer
rather than both free_huge_page and __free_huge_page. But that clean up
can be done on top.

-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-08 Thread Muchun Song
On Fri, Jan 8, 2021 at 4:43 PM Michal Hocko  wrote:
>
> On Thu 07-01-21 23:11:22, Muchun Song wrote:
> > On Thu, Jan 7, 2021 at 10:11 PM Michal Hocko  wrote:
> > >
> > > On Thu 07-01-21 20:59:33, Muchun Song wrote:
> > > > On Thu, Jan 7, 2021 at 8:38 PM Michal Hocko  wrote:
> > > [...]
> > > > > Right. Can we simply back off in the dissolving path when ref count is
> > > > > 0 && PageHuge() if list_empty(page->lru)? Is there any other scenario
> > > > > when the all above is true and the page is not being freed?
> > > >
> > > > The list_empty(>lru) may always return false.
> > > > The page before freeing is on the active list
> > > > (hstate->hugepage_activelist).Then it is on the free list
> > > > after freeing. So list_empty(>lru) is always false.
> > >
> > > The point I was trying to make is that the page has to be enqueued when
> > > it is dissolved and freed. If the page is not enqueued then something
> > > racing. But then I have realized that this is not a great check to
> > > detect the race because pages are going to be released to buddy
> > > allocator and that will reuse page->lru again. So scratch that and sorry
> > > for the detour.
> > >
> > > But that made me think some more and one way to reliably detect the race
> > > should be PageHuge() check in the freeing path. This is what dissolve
> > > path does already. PageHuge becomes false during update_and_free_page()
> > > while holding the hugetlb_lock. So can we use that?
> >
> > It may make the thing complex. Apart from freeing it to the
> > buddy allocator, free_huge_page also does something else for
> > us. If we detect the race in the freeing path, if it is not a HugeTLB
> > page, the freeing path just returns. We also should move those
> > things to the dissolve path. Right?
>
> Not sure what you mean. Dissolving is a subset of the freeing path. It
> effectivelly only implements the update_and_free_page branch (aka free
> to buddy). It skips some of the existing steps because it believes it
> sees a freed page. But as you have pointed out this is racy and I
> strongly suspect it is simply wrong in those assumptions. E.g. hugetlb
> cgroup accounting can get wrong right?

OK. I know what you mean. The update_and_free_page should
do the freeing which is similar to __free_huge_page().

>
> The more I think about it the more I think that dissolving path should
> simply have a common helper with  __free_huge_page that would release
> the huge page to the allocator. The only thing that should be specific
> to the dissolving path is HWpoison handling. It shouldn't be playing
> with accounting and what not. Btw the HWpoison handling is suspicious as
> well, a lost race would mean this doesn't happen. But maybe there is
> some fixup handled later on...
>
> > But I find a tricky problem to solve. See free_huge_page().
> > If we are in non-task context, we should schedule a work
> > to free the page. We reuse the page->mapping. If the page
> > is already freed by the dissolve path. We should not touch
> > the page->mapping. So we need to check PageHuge().
> > The check and llist_add() should be protected by
> > hugetlb_lock. But we cannot do that. Right? If dissolve
> > happens after it is linked to the list. We also should
> > remove it from the list (hpage_freelist). It seems to make
> > the thing more complex.
>
> I am not sure I follow you here but yes PageHuge under hugetlb_lock
> should be the reliable way to check for the race. I am not sure why we
> really need to care about mapping or other state.

CPU0:   CPU1:
free_huge_page(page)
  if (PageHuge(page))
dissolve_free_huge_page(page)
  spin_lock(_lock)
  update_and_free_page(page)
  spin_unlock(_lock)
llist_add(page->mapping)
// the mapping is corrupted

The PageHuge(page) and llist_add() should be protected by
hugetlb_lock. Right? If so, we cannot hold hugetlb_lock
in free_huge_page() path.

> --
> Michal Hocko
> SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-08 Thread Michal Hocko
On Thu 07-01-21 23:11:22, Muchun Song wrote:
> On Thu, Jan 7, 2021 at 10:11 PM Michal Hocko  wrote:
> >
> > On Thu 07-01-21 20:59:33, Muchun Song wrote:
> > > On Thu, Jan 7, 2021 at 8:38 PM Michal Hocko  wrote:
> > [...]
> > > > Right. Can we simply back off in the dissolving path when ref count is
> > > > 0 && PageHuge() if list_empty(page->lru)? Is there any other scenario
> > > > when the all above is true and the page is not being freed?
> > >
> > > The list_empty(>lru) may always return false.
> > > The page before freeing is on the active list
> > > (hstate->hugepage_activelist).Then it is on the free list
> > > after freeing. So list_empty(>lru) is always false.
> >
> > The point I was trying to make is that the page has to be enqueued when
> > it is dissolved and freed. If the page is not enqueued then something
> > racing. But then I have realized that this is not a great check to
> > detect the race because pages are going to be released to buddy
> > allocator and that will reuse page->lru again. So scratch that and sorry
> > for the detour.
> >
> > But that made me think some more and one way to reliably detect the race
> > should be PageHuge() check in the freeing path. This is what dissolve
> > path does already. PageHuge becomes false during update_and_free_page()
> > while holding the hugetlb_lock. So can we use that?
> 
> It may make the thing complex. Apart from freeing it to the
> buddy allocator, free_huge_page also does something else for
> us. If we detect the race in the freeing path, if it is not a HugeTLB
> page, the freeing path just returns. We also should move those
> things to the dissolve path. Right?

Not sure what you mean. Dissolving is a subset of the freeing path. It
effectivelly only implements the update_and_free_page branch (aka free
to buddy). It skips some of the existing steps because it believes it
sees a freed page. But as you have pointed out this is racy and I
strongly suspect it is simply wrong in those assumptions. E.g. hugetlb
cgroup accounting can get wrong right?

The more I think about it the more I think that dissolving path should
simply have a common helper with  __free_huge_page that would release
the huge page to the allocator. The only thing that should be specific
to the dissolving path is HWpoison handling. It shouldn't be playing
with accounting and what not. Btw the HWpoison handling is suspicious as
well, a lost race would mean this doesn't happen. But maybe there is
some fixup handled later on...
 
> But I find a tricky problem to solve. See free_huge_page().
> If we are in non-task context, we should schedule a work
> to free the page. We reuse the page->mapping. If the page
> is already freed by the dissolve path. We should not touch
> the page->mapping. So we need to check PageHuge().
> The check and llist_add() should be protected by
> hugetlb_lock. But we cannot do that. Right? If dissolve
> happens after it is linked to the list. We also should
> remove it from the list (hpage_freelist). It seems to make
> the thing more complex.

I am not sure I follow you here but yes PageHuge under hugetlb_lock
should be the reliable way to check for the race. I am not sure why we
really need to care about mapping or other state.
-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-07 Thread Muchun Song
On Fri, Jan 8, 2021 at 9:08 AM Mike Kravetz  wrote:
>
> On 1/7/21 7:11 AM, Muchun Song wrote:
> > On Thu, Jan 7, 2021 at 10:11 PM Michal Hocko  wrote:
> >>
> >> On Thu 07-01-21 20:59:33, Muchun Song wrote:
> >>> On Thu, Jan 7, 2021 at 8:38 PM Michal Hocko  wrote:
> >> [...]
>  Right. Can we simply back off in the dissolving path when ref count is
>  0 && PageHuge() if list_empty(page->lru)? Is there any other scenario
>  when the all above is true and the page is not being freed?
> >>>
> >>> The list_empty(>lru) may always return false.
> >>> The page before freeing is on the active list
> >>> (hstate->hugepage_activelist).Then it is on the free list
> >>> after freeing. So list_empty(>lru) is always false.
> >>
> >> The point I was trying to make is that the page has to be enqueued when
> >> it is dissolved and freed. If the page is not enqueued then something
> >> racing. But then I have realized that this is not a great check to
> >> detect the race because pages are going to be released to buddy
> >> allocator and that will reuse page->lru again. So scratch that and sorry
> >> for the detour.
> >>
> >> But that made me think some more and one way to reliably detect the race
> >> should be PageHuge() check in the freeing path. This is what dissolve
> >> path does already. PageHuge becomes false during update_and_free_page()
> >> while holding the hugetlb_lock. So can we use that?
> >
> > It may make the thing complex. Apart from freeing it to the
> > buddy allocator, free_huge_page also does something else for
> > us. If we detect the race in the freeing path, if it is not a HugeTLB
> > page, the freeing path just returns. We also should move those
> > things to the dissolve path. Right?
> >
> > But I find a tricky problem to solve. See free_huge_page().
> > If we are in non-task context, we should schedule a work
> > to free the page. We reuse the page->mapping. If the page
> > is already freed by the dissolve path. We should not touch
> > the page->mapping. So we need to check PageHuge().
> > The check and llist_add() should be protected by
> > hugetlb_lock. But we cannot do that. Right? If dissolve
> > happens after it is linked to the list. We also should
> > remove it from the list (hpage_freelist). It seems to make
> > the thing more complex.
>
> You are correct.  This is also an issue/potential problem with this
> race.  It seems that adding the state information might be the least
> complex way to address issue.

Yeah, I agree with you. Adding a state is a simple solution.


>
> --
> Mike Kravetz


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-07 Thread Mike Kravetz
On 1/7/21 7:11 AM, Muchun Song wrote:
> On Thu, Jan 7, 2021 at 10:11 PM Michal Hocko  wrote:
>>
>> On Thu 07-01-21 20:59:33, Muchun Song wrote:
>>> On Thu, Jan 7, 2021 at 8:38 PM Michal Hocko  wrote:
>> [...]
 Right. Can we simply back off in the dissolving path when ref count is
 0 && PageHuge() if list_empty(page->lru)? Is there any other scenario
 when the all above is true and the page is not being freed?
>>>
>>> The list_empty(>lru) may always return false.
>>> The page before freeing is on the active list
>>> (hstate->hugepage_activelist).Then it is on the free list
>>> after freeing. So list_empty(>lru) is always false.
>>
>> The point I was trying to make is that the page has to be enqueued when
>> it is dissolved and freed. If the page is not enqueued then something
>> racing. But then I have realized that this is not a great check to
>> detect the race because pages are going to be released to buddy
>> allocator and that will reuse page->lru again. So scratch that and sorry
>> for the detour.
>>
>> But that made me think some more and one way to reliably detect the race
>> should be PageHuge() check in the freeing path. This is what dissolve
>> path does already. PageHuge becomes false during update_and_free_page()
>> while holding the hugetlb_lock. So can we use that?
> 
> It may make the thing complex. Apart from freeing it to the
> buddy allocator, free_huge_page also does something else for
> us. If we detect the race in the freeing path, if it is not a HugeTLB
> page, the freeing path just returns. We also should move those
> things to the dissolve path. Right?
> 
> But I find a tricky problem to solve. See free_huge_page().
> If we are in non-task context, we should schedule a work
> to free the page. We reuse the page->mapping. If the page
> is already freed by the dissolve path. We should not touch
> the page->mapping. So we need to check PageHuge().
> The check and llist_add() should be protected by
> hugetlb_lock. But we cannot do that. Right? If dissolve
> happens after it is linked to the list. We also should
> remove it from the list (hpage_freelist). It seems to make
> the thing more complex.

You are correct.  This is also an issue/potential problem with this
race.  It seems that adding the state information might be the least
complex way to address issue.

-- 
Mike Kravetz


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-07 Thread Muchun Song
On Thu, Jan 7, 2021 at 10:11 PM Michal Hocko  wrote:
>
> On Thu 07-01-21 20:59:33, Muchun Song wrote:
> > On Thu, Jan 7, 2021 at 8:38 PM Michal Hocko  wrote:
> [...]
> > > Right. Can we simply back off in the dissolving path when ref count is
> > > 0 && PageHuge() if list_empty(page->lru)? Is there any other scenario
> > > when the all above is true and the page is not being freed?
> >
> > The list_empty(>lru) may always return false.
> > The page before freeing is on the active list
> > (hstate->hugepage_activelist).Then it is on the free list
> > after freeing. So list_empty(>lru) is always false.
>
> The point I was trying to make is that the page has to be enqueued when
> it is dissolved and freed. If the page is not enqueued then something
> racing. But then I have realized that this is not a great check to
> detect the race because pages are going to be released to buddy
> allocator and that will reuse page->lru again. So scratch that and sorry
> for the detour.
>
> But that made me think some more and one way to reliably detect the race
> should be PageHuge() check in the freeing path. This is what dissolve
> path does already. PageHuge becomes false during update_and_free_page()
> while holding the hugetlb_lock. So can we use that?

It may make the thing complex. Apart from freeing it to the
buddy allocator, free_huge_page also does something else for
us. If we detect the race in the freeing path, if it is not a HugeTLB
page, the freeing path just returns. We also should move those
things to the dissolve path. Right?

But I find a tricky problem to solve. See free_huge_page().
If we are in non-task context, we should schedule a work
to free the page. We reuse the page->mapping. If the page
is already freed by the dissolve path. We should not touch
the page->mapping. So we need to check PageHuge().
The check and llist_add() should be protected by
hugetlb_lock. But we cannot do that. Right? If dissolve
happens after it is linked to the list. We also should
remove it from the list (hpage_freelist). It seems to make
the thing more complex.

> --
> Michal Hocko
> SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-07 Thread Michal Hocko
On Thu 07-01-21 20:59:33, Muchun Song wrote:
> On Thu, Jan 7, 2021 at 8:38 PM Michal Hocko  wrote:
[...]
> > Right. Can we simply back off in the dissolving path when ref count is
> > 0 && PageHuge() if list_empty(page->lru)? Is there any other scenario
> > when the all above is true and the page is not being freed?
> 
> The list_empty(>lru) may always return false.
> The page before freeing is on the active list
> (hstate->hugepage_activelist).Then it is on the free list
> after freeing. So list_empty(>lru) is always false.

The point I was trying to make is that the page has to be enqueued when
it is dissolved and freed. If the page is not enqueued then something
racing. But then I have realized that this is not a great check to
detect the race because pages are going to be released to buddy
allocator and that will reuse page->lru again. So scratch that and sorry
for the detour.

But that made me think some more and one way to reliably detect the race
should be PageHuge() check in the freeing path. This is what dissolve
path does already. PageHuge becomes false during update_and_free_page()
while holding the hugetlb_lock. So can we use that?
-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-07 Thread Muchun Song
On Thu, Jan 7, 2021 at 8:38 PM Michal Hocko  wrote:
>
> On Thu 07-01-21 19:38:00, Muchun Song wrote:
> > On Thu, Jan 7, 2021 at 7:18 PM Michal Hocko  wrote:
> > >
> > > On Thu 07-01-21 16:53:13, Muchun Song wrote:
> > > > On Thu, Jan 7, 2021 at 4:41 PM Michal Hocko  wrote:
> > > > >
> > > > > On Thu 07-01-21 13:39:38, Muchun Song wrote:
> > > > > > On Thu, Jan 7, 2021 at 12:56 AM Michal Hocko  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Wed 06-01-21 16:47:36, Muchun Song wrote:
> > > > > > > > There is a race condition between __free_huge_page()
> > > > > > > > and dissolve_free_huge_page().
> > > > > > > >
> > > > > > > > CPU0: CPU1:
> > > > > > > >
> > > > > > > > // page_count(page) == 1
> > > > > > > > put_page(page)
> > > > > > > >   __free_huge_page(page)
> > > > > > > >   dissolve_free_huge_page(page)
> > > > > > > > spin_lock(_lock)
> > > > > > > > // PageHuge(page) && 
> > > > > > > > !page_count(page)
> > > > > > > > update_and_free_page(page)
> > > > > > > > // page is freed to the buddy
> > > > > > > > spin_unlock(_lock)
> > > > > > > > spin_lock(_lock)
> > > > > > > > clear_page_huge_active(page)
> > > > > > > > enqueue_huge_page(page)
> > > > > > > > // It is wrong, the page is already freed
> > > > > > > > spin_unlock(_lock)
> > > > > > > >
> > > > > > > > The race windows is between put_page() and spin_lock() which
> > > > > > > > is in the __free_huge_page().
> > > > > > >
> > > > > > > The race window reall is between put_page and 
> > > > > > > dissolve_free_huge_page.
> > > > > > > And the result is that the put_page path would clobber an 
> > > > > > > unrelated page
> > > > > > > (either free or already reused page) which is quite serious.
> > > > > > > Fortunatelly pages are dissolved very rarely. I believe that user 
> > > > > > > would
> > > > > > > require to be privileged to hit this by intention.
> > > > > > >
> > > > > > > > We should make sure that the page is already on the free list
> > > > > > > > when it is dissolved.
> > > > > > >
> > > > > > > Another option would be to check for PageHuge in 
> > > > > > > __free_huge_page. Have
> > > > > > > you considered that rather than add yet another state? The scope 
> > > > > > > of the
> > > > > > > spinlock would have to be extended. If that sounds more tricky 
> > > > > > > then can
> > > > > > > we check the page->lru in the dissolve path? If the page is still
> > > > > > > PageHuge and reference count 0 then there shouldn't be many 
> > > > > > > options
> > > > > > > where it can be queued, right?
> > > > > >
> > > > > > Did you mean that we iterate over the free list to check whether
> > > > > > the page is on the free list?
> > > > >
> > > > > No I meant to check that the page is enqueued which along with ref 
> > > > > count
> > > > > = 0 should mean it has been released to the pool unless I am missing
> > > > > something.
> > > >
> > > > The page can be on the free list or active list or empty when it
> > > > is freed to the pool. How to check whether it is on the free list?
> > >
> > > As I've said, I might be missing something here. But if the page is
> > > freed why does it matter whether it is on a active list or free list
> > > from the dissolve operation POV?
> >
> > As you said "check the page->lru". I have a question.
> > How to check the page->lru in the dissolve path?
>
> list_empty?

No.

>
> > BTW, dissolve_free_huge_page aims to free the page
> > to buddy allocator. put_page (for HugeTLB page) aims
> > to free the page to the hugepage pool.
>
> Right. Can we simply back off in the dissolving path when ref count is
> 0 && PageHuge() if list_empty(page->lru)? Is there any other scenario
> when the all above is true and the page is not being freed?

The list_empty(>lru) may always return false.
The page before freeing is on the active list
(hstate->hugepage_activelist).Then it is on the free list
after freeing. So list_empty(>lru) is always false.

> --
> Michal Hocko
> SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-07 Thread Michal Hocko
On Thu 07-01-21 19:38:00, Muchun Song wrote:
> On Thu, Jan 7, 2021 at 7:18 PM Michal Hocko  wrote:
> >
> > On Thu 07-01-21 16:53:13, Muchun Song wrote:
> > > On Thu, Jan 7, 2021 at 4:41 PM Michal Hocko  wrote:
> > > >
> > > > On Thu 07-01-21 13:39:38, Muchun Song wrote:
> > > > > On Thu, Jan 7, 2021 at 12:56 AM Michal Hocko  wrote:
> > > > > >
> > > > > > On Wed 06-01-21 16:47:36, Muchun Song wrote:
> > > > > > > There is a race condition between __free_huge_page()
> > > > > > > and dissolve_free_huge_page().
> > > > > > >
> > > > > > > CPU0: CPU1:
> > > > > > >
> > > > > > > // page_count(page) == 1
> > > > > > > put_page(page)
> > > > > > >   __free_huge_page(page)
> > > > > > >   dissolve_free_huge_page(page)
> > > > > > > spin_lock(_lock)
> > > > > > > // PageHuge(page) && 
> > > > > > > !page_count(page)
> > > > > > > update_and_free_page(page)
> > > > > > > // page is freed to the buddy
> > > > > > > spin_unlock(_lock)
> > > > > > > spin_lock(_lock)
> > > > > > > clear_page_huge_active(page)
> > > > > > > enqueue_huge_page(page)
> > > > > > > // It is wrong, the page is already freed
> > > > > > > spin_unlock(_lock)
> > > > > > >
> > > > > > > The race windows is between put_page() and spin_lock() which
> > > > > > > is in the __free_huge_page().
> > > > > >
> > > > > > The race window reall is between put_page and 
> > > > > > dissolve_free_huge_page.
> > > > > > And the result is that the put_page path would clobber an unrelated 
> > > > > > page
> > > > > > (either free or already reused page) which is quite serious.
> > > > > > Fortunatelly pages are dissolved very rarely. I believe that user 
> > > > > > would
> > > > > > require to be privileged to hit this by intention.
> > > > > >
> > > > > > > We should make sure that the page is already on the free list
> > > > > > > when it is dissolved.
> > > > > >
> > > > > > Another option would be to check for PageHuge in __free_huge_page. 
> > > > > > Have
> > > > > > you considered that rather than add yet another state? The scope of 
> > > > > > the
> > > > > > spinlock would have to be extended. If that sounds more tricky then 
> > > > > > can
> > > > > > we check the page->lru in the dissolve path? If the page is still
> > > > > > PageHuge and reference count 0 then there shouldn't be many options
> > > > > > where it can be queued, right?
> > > > >
> > > > > Did you mean that we iterate over the free list to check whether
> > > > > the page is on the free list?
> > > >
> > > > No I meant to check that the page is enqueued which along with ref count
> > > > = 0 should mean it has been released to the pool unless I am missing
> > > > something.
> > >
> > > The page can be on the free list or active list or empty when it
> > > is freed to the pool. How to check whether it is on the free list?
> >
> > As I've said, I might be missing something here. But if the page is
> > freed why does it matter whether it is on a active list or free list
> > from the dissolve operation POV?
> 
> As you said "check the page->lru". I have a question.
> How to check the page->lru in the dissolve path?

list_empty?
 
> BTW, dissolve_free_huge_page aims to free the page
> to buddy allocator. put_page (for HugeTLB page) aims
> to free the page to the hugepage pool.

Right. Can we simply back off in the dissolving path when ref count is
0 && PageHuge() if list_empty(page->lru)? Is there any other scenario
when the all above is true and the page is not being freed?
-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-07 Thread Muchun Song
On Thu, Jan 7, 2021 at 7:18 PM Michal Hocko  wrote:
>
> On Thu 07-01-21 16:53:13, Muchun Song wrote:
> > On Thu, Jan 7, 2021 at 4:41 PM Michal Hocko  wrote:
> > >
> > > On Thu 07-01-21 13:39:38, Muchun Song wrote:
> > > > On Thu, Jan 7, 2021 at 12:56 AM Michal Hocko  wrote:
> > > > >
> > > > > On Wed 06-01-21 16:47:36, Muchun Song wrote:
> > > > > > There is a race condition between __free_huge_page()
> > > > > > and dissolve_free_huge_page().
> > > > > >
> > > > > > CPU0: CPU1:
> > > > > >
> > > > > > // page_count(page) == 1
> > > > > > put_page(page)
> > > > > >   __free_huge_page(page)
> > > > > >   dissolve_free_huge_page(page)
> > > > > > spin_lock(_lock)
> > > > > > // PageHuge(page) && 
> > > > > > !page_count(page)
> > > > > > update_and_free_page(page)
> > > > > > // page is freed to the buddy
> > > > > > spin_unlock(_lock)
> > > > > > spin_lock(_lock)
> > > > > > clear_page_huge_active(page)
> > > > > > enqueue_huge_page(page)
> > > > > > // It is wrong, the page is already freed
> > > > > > spin_unlock(_lock)
> > > > > >
> > > > > > The race windows is between put_page() and spin_lock() which
> > > > > > is in the __free_huge_page().
> > > > >
> > > > > The race window reall is between put_page and dissolve_free_huge_page.
> > > > > And the result is that the put_page path would clobber an unrelated 
> > > > > page
> > > > > (either free or already reused page) which is quite serious.
> > > > > Fortunatelly pages are dissolved very rarely. I believe that user 
> > > > > would
> > > > > require to be privileged to hit this by intention.
> > > > >
> > > > > > We should make sure that the page is already on the free list
> > > > > > when it is dissolved.
> > > > >
> > > > > Another option would be to check for PageHuge in __free_huge_page. 
> > > > > Have
> > > > > you considered that rather than add yet another state? The scope of 
> > > > > the
> > > > > spinlock would have to be extended. If that sounds more tricky then 
> > > > > can
> > > > > we check the page->lru in the dissolve path? If the page is still
> > > > > PageHuge and reference count 0 then there shouldn't be many options
> > > > > where it can be queued, right?
> > > >
> > > > Did you mean that we iterate over the free list to check whether
> > > > the page is on the free list?
> > >
> > > No I meant to check that the page is enqueued which along with ref count
> > > = 0 should mean it has been released to the pool unless I am missing
> > > something.
> >
> > The page can be on the free list or active list or empty when it
> > is freed to the pool. How to check whether it is on the free list?
>
> As I've said, I might be missing something here. But if the page is
> freed why does it matter whether it is on a active list or free list
> from the dissolve operation POV?

As you said "check the page->lru". I have a question.
How to check the page->lru in the dissolve path?

BTW, dissolve_free_huge_page aims to free the page
to buddy allocator. put_page (for HugeTLB page) aims
to free the page to the hugepage pool.

> --
> Michal Hocko
> SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-07 Thread Michal Hocko
On Thu 07-01-21 16:53:13, Muchun Song wrote:
> On Thu, Jan 7, 2021 at 4:41 PM Michal Hocko  wrote:
> >
> > On Thu 07-01-21 13:39:38, Muchun Song wrote:
> > > On Thu, Jan 7, 2021 at 12:56 AM Michal Hocko  wrote:
> > > >
> > > > On Wed 06-01-21 16:47:36, Muchun Song wrote:
> > > > > There is a race condition between __free_huge_page()
> > > > > and dissolve_free_huge_page().
> > > > >
> > > > > CPU0: CPU1:
> > > > >
> > > > > // page_count(page) == 1
> > > > > put_page(page)
> > > > >   __free_huge_page(page)
> > > > >   dissolve_free_huge_page(page)
> > > > > spin_lock(_lock)
> > > > > // PageHuge(page) && !page_count(page)
> > > > > update_and_free_page(page)
> > > > > // page is freed to the buddy
> > > > > spin_unlock(_lock)
> > > > > spin_lock(_lock)
> > > > > clear_page_huge_active(page)
> > > > > enqueue_huge_page(page)
> > > > > // It is wrong, the page is already freed
> > > > > spin_unlock(_lock)
> > > > >
> > > > > The race windows is between put_page() and spin_lock() which
> > > > > is in the __free_huge_page().
> > > >
> > > > The race window reall is between put_page and dissolve_free_huge_page.
> > > > And the result is that the put_page path would clobber an unrelated page
> > > > (either free or already reused page) which is quite serious.
> > > > Fortunatelly pages are dissolved very rarely. I believe that user would
> > > > require to be privileged to hit this by intention.
> > > >
> > > > > We should make sure that the page is already on the free list
> > > > > when it is dissolved.
> > > >
> > > > Another option would be to check for PageHuge in __free_huge_page. Have
> > > > you considered that rather than add yet another state? The scope of the
> > > > spinlock would have to be extended. If that sounds more tricky then can
> > > > we check the page->lru in the dissolve path? If the page is still
> > > > PageHuge and reference count 0 then there shouldn't be many options
> > > > where it can be queued, right?
> > >
> > > Did you mean that we iterate over the free list to check whether
> > > the page is on the free list?
> >
> > No I meant to check that the page is enqueued which along with ref count
> > = 0 should mean it has been released to the pool unless I am missing
> > something.
> 
> The page can be on the free list or active list or empty when it
> is freed to the pool. How to check whether it is on the free list?

As I've said, I might be missing something here. But if the page is
freed why does it matter whether it is on a active list or free list
from the dissolve operation POV?
-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-07 Thread Muchun Song
On Thu, Jan 7, 2021 at 4:41 PM Michal Hocko  wrote:
>
> On Thu 07-01-21 13:39:38, Muchun Song wrote:
> > On Thu, Jan 7, 2021 at 12:56 AM Michal Hocko  wrote:
> > >
> > > On Wed 06-01-21 16:47:36, Muchun Song wrote:
> > > > There is a race condition between __free_huge_page()
> > > > and dissolve_free_huge_page().
> > > >
> > > > CPU0: CPU1:
> > > >
> > > > // page_count(page) == 1
> > > > put_page(page)
> > > >   __free_huge_page(page)
> > > >   dissolve_free_huge_page(page)
> > > > spin_lock(_lock)
> > > > // PageHuge(page) && !page_count(page)
> > > > update_and_free_page(page)
> > > > // page is freed to the buddy
> > > > spin_unlock(_lock)
> > > > spin_lock(_lock)
> > > > clear_page_huge_active(page)
> > > > enqueue_huge_page(page)
> > > > // It is wrong, the page is already freed
> > > > spin_unlock(_lock)
> > > >
> > > > The race windows is between put_page() and spin_lock() which
> > > > is in the __free_huge_page().
> > >
> > > The race window reall is between put_page and dissolve_free_huge_page.
> > > And the result is that the put_page path would clobber an unrelated page
> > > (either free or already reused page) which is quite serious.
> > > Fortunatelly pages are dissolved very rarely. I believe that user would
> > > require to be privileged to hit this by intention.
> > >
> > > > We should make sure that the page is already on the free list
> > > > when it is dissolved.
> > >
> > > Another option would be to check for PageHuge in __free_huge_page. Have
> > > you considered that rather than add yet another state? The scope of the
> > > spinlock would have to be extended. If that sounds more tricky then can
> > > we check the page->lru in the dissolve path? If the page is still
> > > PageHuge and reference count 0 then there shouldn't be many options
> > > where it can be queued, right?
> >
> > Did you mean that we iterate over the free list to check whether
> > the page is on the free list?
>
> No I meant to check that the page is enqueued which along with ref count
> = 0 should mean it has been released to the pool unless I am missing
> something.

The page can be on the free list or active list or empty when it
is freed to the pool. How to check whether it is on the free list?

> --
> Michal Hocko
> SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-07 Thread Michal Hocko
On Thu 07-01-21 13:39:38, Muchun Song wrote:
> On Thu, Jan 7, 2021 at 12:56 AM Michal Hocko  wrote:
> >
> > On Wed 06-01-21 16:47:36, Muchun Song wrote:
> > > There is a race condition between __free_huge_page()
> > > and dissolve_free_huge_page().
> > >
> > > CPU0: CPU1:
> > >
> > > // page_count(page) == 1
> > > put_page(page)
> > >   __free_huge_page(page)
> > >   dissolve_free_huge_page(page)
> > > spin_lock(_lock)
> > > // PageHuge(page) && !page_count(page)
> > > update_and_free_page(page)
> > > // page is freed to the buddy
> > > spin_unlock(_lock)
> > > spin_lock(_lock)
> > > clear_page_huge_active(page)
> > > enqueue_huge_page(page)
> > > // It is wrong, the page is already freed
> > > spin_unlock(_lock)
> > >
> > > The race windows is between put_page() and spin_lock() which
> > > is in the __free_huge_page().
> >
> > The race window reall is between put_page and dissolve_free_huge_page.
> > And the result is that the put_page path would clobber an unrelated page
> > (either free or already reused page) which is quite serious.
> > Fortunatelly pages are dissolved very rarely. I believe that user would
> > require to be privileged to hit this by intention.
> >
> > > We should make sure that the page is already on the free list
> > > when it is dissolved.
> >
> > Another option would be to check for PageHuge in __free_huge_page. Have
> > you considered that rather than add yet another state? The scope of the
> > spinlock would have to be extended. If that sounds more tricky then can
> > we check the page->lru in the dissolve path? If the page is still
> > PageHuge and reference count 0 then there shouldn't be many options
> > where it can be queued, right?
> 
> Did you mean that we iterate over the free list to check whether
> the page is on the free list?

No I meant to check that the page is enqueued which along with ref count
= 0 should mean it has been released to the pool unless I am missing
something.
-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-06 Thread Muchun Song
On Thu, Jan 7, 2021 at 12:56 AM Michal Hocko  wrote:
>
> On Wed 06-01-21 16:47:36, Muchun Song wrote:
> > There is a race condition between __free_huge_page()
> > and dissolve_free_huge_page().
> >
> > CPU0: CPU1:
> >
> > // page_count(page) == 1
> > put_page(page)
> >   __free_huge_page(page)
> >   dissolve_free_huge_page(page)
> > spin_lock(_lock)
> > // PageHuge(page) && !page_count(page)
> > update_and_free_page(page)
> > // page is freed to the buddy
> > spin_unlock(_lock)
> > spin_lock(_lock)
> > clear_page_huge_active(page)
> > enqueue_huge_page(page)
> > // It is wrong, the page is already freed
> > spin_unlock(_lock)
> >
> > The race windows is between put_page() and spin_lock() which
> > is in the __free_huge_page().
>
> The race window reall is between put_page and dissolve_free_huge_page.
> And the result is that the put_page path would clobber an unrelated page
> (either free or already reused page) which is quite serious.
> Fortunatelly pages are dissolved very rarely. I believe that user would
> require to be privileged to hit this by intention.
>
> > We should make sure that the page is already on the free list
> > when it is dissolved.
>
> Another option would be to check for PageHuge in __free_huge_page. Have
> you considered that rather than add yet another state? The scope of the
> spinlock would have to be extended. If that sounds more tricky then can
> we check the page->lru in the dissolve path? If the page is still
> PageHuge and reference count 0 then there shouldn't be many options
> where it can be queued, right?

Did you mean that we iterate over the free list to check whether
the page is on the free list? If so, I do not think it is a good solution
than introducing another state. Because if there are a lot of pages
on the free list, it may take some time to do it with holding
hugetlb_lock. Right? Actually, we have some tail page structs
to store the state. At least it's not in short supply right now.

Thanks.

>
> > Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle 
> > hugepage")
> > Signed-off-by: Muchun Song 
> > ---
> >  mm/hugetlb.c | 38 ++
> >  1 file changed, 38 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 4741d60f8955..8ff138c17129 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -79,6 +79,21 @@ DEFINE_SPINLOCK(hugetlb_lock);
> >  static int num_fault_mutexes;
> >  struct mutex *hugetlb_fault_mutex_table cacheline_aligned_in_smp;
> >
> > +static inline bool PageHugeFreed(struct page *head)
> > +{
> > + return (unsigned long)head[3].mapping == -1U;
> > +}
> > +
> > +static inline void SetPageHugeFreed(struct page *head)
> > +{
> > + head[3].mapping = (void *)-1U;
> > +}
> > +
> > +static inline void ClearPageHugeFreed(struct page *head)
> > +{
> > + head[3].mapping = NULL;
> > +}
> > +
> >  /* Forward declaration */
> >  static int hugetlb_acct_memory(struct hstate *h, long delta);
> >
> > @@ -1028,6 +1043,7 @@ static void enqueue_huge_page(struct hstate *h, 
> > struct page *page)
> >   list_move(>lru, >hugepage_freelists[nid]);
> >   h->free_huge_pages++;
> >   h->free_huge_pages_node[nid]++;
> > + SetPageHugeFreed(page);
> >  }
> >
> >  static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> > @@ -1044,6 +1060,7 @@ static struct page 
> > *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> >
> >   list_move(>lru, >hugepage_activelist);
> >   set_page_refcounted(page);
> > + ClearPageHugeFreed(page);
> >   h->free_huge_pages--;
> >   h->free_huge_pages_node[nid]--;
> >   return page;
> > @@ -1291,6 +1308,17 @@ static inline void 
> > destroy_compound_gigantic_page(struct page *page,
> >   unsigned int order) { }
> >  #endif
> >
> > +/*
> > + * Because we reuse the mapping field of some tail page structs, we should
> > + * reset those mapping to initial value before @head is freed to the buddy
> > + * allocator. The invalid value will be checked in the 
> > free_tail_pages_check().
> > + */
> > +static inline void reset_tail_page_mapping(struct hstate *h, struct page 
> > *head)
> > +{
> > + if (!hstate_is_gigantic(h))
> > + head[3].mapping = TAIL_MAPPING;
> > +}
> > +
> >  static void update_and_free_page(struct hstate *h, struct page *page)
> >  {
> >   int i;
> > @@ -1298,6 +1326,7 @@ static void update_and_free_page(struct hstate *h, 
> > struct page *page)
> >   if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> >   return;
> >
> > + reset_tail_page_mapping(h, page);
> >   h->nr_huge_pages--;
> >   

Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-06 Thread Muchun Song
On Thu, Jan 7, 2021 at 5:00 AM Mike Kravetz  wrote:
>
> On 1/6/21 8:56 AM, Michal Hocko wrote:
> > On Wed 06-01-21 16:47:36, Muchun Song wrote:
> >> There is a race condition between __free_huge_page()
> >> and dissolve_free_huge_page().
> >>
> >> CPU0: CPU1:
> >>
> >> // page_count(page) == 1
> >> put_page(page)
> >>   __free_huge_page(page)
> >>   dissolve_free_huge_page(page)
> >> spin_lock(_lock)
> >> // PageHuge(page) && !page_count(page)
> >> update_and_free_page(page)
> >> // page is freed to the buddy
> >> spin_unlock(_lock)
> >> spin_lock(_lock)
> >> clear_page_huge_active(page)
> >> enqueue_huge_page(page)
> >> // It is wrong, the page is already freed
> >> spin_unlock(_lock)
> >>
> >> The race windows is between put_page() and spin_lock() which
> >> is in the __free_huge_page().
> >
> > The race window reall is between put_page and dissolve_free_huge_page.
> > And the result is that the put_page path would clobber an unrelated page
> > (either free or already reused page) which is quite serious.
> > Fortunatelly pages are dissolved very rarely. I believe that user would
> > require to be privileged to hit this by intention.
> >
> >> We should make sure that the page is already on the free list
> >> when it is dissolved.
> >
> > Another option would be to check for PageHuge in __free_huge_page. Have
> > you considered that rather than add yet another state? The scope of the
> > spinlock would have to be extended. If that sounds more tricky then can
> > we check the page->lru in the dissolve path? If the page is still
> > PageHuge and reference count 0 then there shouldn't be many options
> > where it can be queued, right?
>
> The tricky part with expanding lock scope will be the potential call to
> hugepage_subpool_put_pages as it may also try to acquire the hugetlb_lock.
>
> I am not sure what you mean by 'check the page->lru'?  If we knew the page
> was on the free list, then we could dissolve.  But, I do not think there
> is an easy way to determine that from page->lru.  A hugetlb page is either
> going to be on the active list or free list.
>
> >
> >> Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle 
> >> hugepage")
> >> Signed-off-by: Muchun Song 
> >> ---
> >>  mm/hugetlb.c | 38 ++
> >>  1 file changed, 38 insertions(+)
> >>
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index 4741d60f8955..8ff138c17129 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -79,6 +79,21 @@ DEFINE_SPINLOCK(hugetlb_lock);
> >>  static int num_fault_mutexes;
> >>  struct mutex *hugetlb_fault_mutex_table cacheline_aligned_in_smp;
> >>
> >> +static inline bool PageHugeFreed(struct page *head)
> >> +{
> >> +return (unsigned long)head[3].mapping == -1U;
> >> +}
> >> +
> >> +static inline void SetPageHugeFreed(struct page *head)
> >> +{
> >> +head[3].mapping = (void *)-1U;
> >> +}
> >> +
> >> +static inline void ClearPageHugeFreed(struct page *head)
> >> +{
> >> +head[3].mapping = NULL;
> >> +}
> >> +
> >>  /* Forward declaration */
> >>  static int hugetlb_acct_memory(struct hstate *h, long delta);
> >>
> >> @@ -1028,6 +1043,7 @@ static void enqueue_huge_page(struct hstate *h, 
> >> struct page *page)
> >>  list_move(>lru, >hugepage_freelists[nid]);
> >>  h->free_huge_pages++;
> >>  h->free_huge_pages_node[nid]++;
> >> +SetPageHugeFreed(page);
> >>  }
> >>
> >>  static struct page *dequeue_huge_page_node_exact(struct hstate *h, int 
> >> nid)
> >> @@ -1044,6 +1060,7 @@ static struct page 
> >> *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> >>
> >>  list_move(>lru, >hugepage_activelist);
> >>  set_page_refcounted(page);
> >> +ClearPageHugeFreed(page);
> >>  h->free_huge_pages--;
> >>  h->free_huge_pages_node[nid]--;
> >>  return page;
> >> @@ -1291,6 +1308,17 @@ static inline void 
> >> destroy_compound_gigantic_page(struct page *page,
> >>  unsigned int order) { }
> >>  #endif
> >>
> >> +/*
> >> + * Because we reuse the mapping field of some tail page structs, we should
> >> + * reset those mapping to initial value before @head is freed to the buddy
> >> + * allocator. The invalid value will be checked in the 
> >> free_tail_pages_check().
> >> + */
>
> When I suggested using head[3].mapping for this state, I was not aware of
> this requirement.  My suggestion was only following the convention used in
> PageHugeTemporary.  I would not have made the suggestion if I had realized
> this was required.  Sorry.

Yeah, PageHugeTemporary is lucky. free_tail_pages_check() not check
head[2].mapping.

I will revert to the previous version (using head[3].private).