Re: [PATCH v3 7/7] mm/gup: Allow VM_FAULT_RETRY for multiple times

2019-09-11 Thread Peter Xu
On Wed, Sep 11, 2019 at 10:47:59AM +0100, Linus Torvalds wrote:
> On Wed, Sep 11, 2019 at 8:11 AM Peter Xu  wrote:
> >
> > This is the gup counterpart of the change that allows the
> > VM_FAULT_RETRY to happen for more than once.  One thing to mention is
> > that we must check the fatal signal here before retry because the GUP
> > can be interrupted by that, otherwise we can loop forever.
> 
> I still get nervous about the signal handling here.
> 
> I'm not entirely sure we get it right even before your patch series.
> 
> Right now, __get_user_pages() can return -ERESTARTSYS when it's killed:
> 
> /*
>  * If we have a pending SIGKILL, don't keep faulting pages and
>  * potentially allocating memory.
>  */
> if (fatal_signal_pending(current)) {
> ret = -ERESTARTSYS;
> goto out;
> }
> 
> and I don't think your series changes that.  And note how this is true
> _regardless_ of any FOLL_xyz flags (and we don't pass the
> FAULT_FLAG_xyz flags at all, they get generated deeper down if we
> actually end up faulting things in).
> 
> So this part of the patch:
> 
> +   if (fatal_signal_pending(current))
> +   goto out;
> +
> *locked = 1;
> -   lock_dropped = true;
> down_read(>mmap_sem);
> ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
> -  pages, NULL, NULL);
> +  pages, NULL, locked);
> +   if (!*locked) {
> +   /* Continue to retry until we succeeded */
> +   BUG_ON(ret != 0);
> +   goto retry;
> 
> just makes me go "that can't be right". The fatal_signal_pending() is
> pointless and would probably better be something like
> 
> if (down_read_killable(>mmap_sem) < 0)
> goto out;

Thanks for noticing all these!  I'd admit when I was working on the
series I didn't think & test very carefully with fatal signals but
mostly I'm making sure the normal signals should work especially for
processes like userfaultfd tracees so they won't hang death (I'm
always testing with GDB, and if without proper signal handle they do
hang death..).

I agree that we should probably replace the down_read() with the
killable version as you suggested.  Though we might still want the
explicit check of fatal_signal_pending() to make sure we react even
faster because IMHO down_read_killable() does not really check signals
all the time but it just put us into killable state if we need to wait
for the mmap_sem.  In other words, if we are always lucky that we get
the lock without even waiting anything then down_read_killable() will
still ignore the fatal signals forever.

> 
> and then _after_ calling __get_user_pages(), the whole "negative error
> handling" should be more obvious.
> 
> The BUG_ON(ret != 0) makes me nervous, but it might be fine (I guess
> the fatal signal handling has always been done before the lock is
> released?).

Yes it indeed looks nervous, though it's probably should be true in
all cases.  Actually we already have checks like this, for example, in
current __get_user_pages_locked():

/* VM_FAULT_RETRY cannot return errors */
if (!*locked) {
BUG_ON(ret < 0);
BUG_ON(ret >= nr_pages);
}

And in the new retry path since we always pass in npages==1 so it must
be zero when VM_FAULT_RETRY.

While... When I'm looking into this more carefully I seem to have
found another bug that we might want to fix with hugetlbfs path:

diff --git a/mm/gup.c b/mm/gup.c
index 7230f60a68d6..29ee3de65fad 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -836,6 +836,16 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
i = follow_hugetlb_page(mm, vma, pages, vmas,
, _pages, i,
gup_flags, locked);
+   if (locked && *locked == 0) {
+   /*
+* We've got a VM_FAULT_RETRY
+* and we've lost mmap_sem.
+* We must stop here.
+*/
+   BUG_ON(gup_flags & FOLL_NOWAIT);
+   BUG_ON(ret != 0);
+   goto out;
+   }
continue;
}
}

The problem is that if *locked==0 then we've lost the mmap_sem
already!  Then we probably can't go any further before taking it back
again.  With that, we should be able to keep the previous assumption
valid.

> 
> But exactly *because* __get_user_pages() can already 

Re: [PATCH v3 7/7] mm/gup: Allow VM_FAULT_RETRY for multiple times

2019-09-11 Thread Linus Torvalds
On Wed, Sep 11, 2019 at 8:11 AM Peter Xu  wrote:
>
> This is the gup counterpart of the change that allows the
> VM_FAULT_RETRY to happen for more than once.  One thing to mention is
> that we must check the fatal signal here before retry because the GUP
> can be interrupted by that, otherwise we can loop forever.

I still get nervous about the signal handling here.

I'm not entirely sure we get it right even before your patch series.

Right now, __get_user_pages() can return -ERESTARTSYS when it's killed:

/*
 * If we have a pending SIGKILL, don't keep faulting pages and
 * potentially allocating memory.
 */
if (fatal_signal_pending(current)) {
ret = -ERESTARTSYS;
goto out;
}

and I don't think your series changes that.  And note how this is true
_regardless_ of any FOLL_xyz flags (and we don't pass the
FAULT_FLAG_xyz flags at all, they get generated deeper down if we
actually end up faulting things in).

So this part of the patch:

+   if (fatal_signal_pending(current))
+   goto out;
+
*locked = 1;
-   lock_dropped = true;
down_read(>mmap_sem);
ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
-  pages, NULL, NULL);
+  pages, NULL, locked);
+   if (!*locked) {
+   /* Continue to retry until we succeeded */
+   BUG_ON(ret != 0);
+   goto retry;

just makes me go "that can't be right". The fatal_signal_pending() is
pointless and would probably better be something like

if (down_read_killable(>mmap_sem) < 0)
goto out;

and then _after_ calling __get_user_pages(), the whole "negative error
handling" should be more obvious.

The BUG_ON(ret != 0) makes me nervous, but it might be fine (I guess
the fatal signal handling has always been done before the lock is
released?).

But exactly *because* __get_user_pages() can already return on fatal
signals, I think it should also set FAULT_FLAG_KILLABLE when faulting
things in. I don't think it does right now, so it doesn't actually
necessarily check fatal signals in a timely manner (not _during_ the
fault, only _before_ it).

See what I'm reacting to?

And maybe I'm wrong. Maybe I misread the code, and your changes. And
at least some of these logic error predate your changes, I just was
hoping that since this whole "react to signals" is very much what your
patch series is working on, you'd look at this.

Ok?

Linus


[PATCH v3 7/7] mm/gup: Allow VM_FAULT_RETRY for multiple times

2019-09-11 Thread Peter Xu
This is the gup counterpart of the change that allows the
VM_FAULT_RETRY to happen for more than once.  One thing to mention is
that we must check the fatal signal here before retry because the GUP
can be interrupted by that, otherwise we can loop forever.

Signed-off-by: Peter Xu 
---
 mm/gup.c | 25 -
 mm/hugetlb.c |  6 --
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index eddbb95dcb8f..4b9413ee7b23 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -644,7 +644,10 @@ static int faultin_page(struct task_struct *tsk, struct 
vm_area_struct *vma,
if (*flags & FOLL_NOWAIT)
fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
if (*flags & FOLL_TRIED) {
-   VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+   /*
+* Note: FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED
+* can co-exist
+*/
fault_flags |= FAULT_FLAG_TRIED;
}
 
@@ -1059,17 +1062,29 @@ static __always_inline long 
__get_user_pages_locked(struct task_struct *tsk,
if (likely(pages))
pages += ret;
start += ret << PAGE_SHIFT;
+   lock_dropped = true;
 
+retry:
/*
 * Repeat on the address that fired VM_FAULT_RETRY
-* without FAULT_FLAG_ALLOW_RETRY but with
-* FAULT_FLAG_TRIED.
+* with both FAULT_FLAG_ALLOW_RETRY and
+* FAULT_FLAG_TRIED.  Note that GUP can be interrupted
+* by fatal signals, so we need to check it before we
+* start trying again otherwise it can loop forever.
 */
+
+   if (fatal_signal_pending(current))
+   goto out;
+
*locked = 1;
-   lock_dropped = true;
down_read(>mmap_sem);
ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
-  pages, NULL, NULL);
+  pages, NULL, locked);
+   if (!*locked) {
+   /* Continue to retry until we succeeded */
+   BUG_ON(ret != 0);
+   goto retry;
+   }
if (ret != 1) {
BUG_ON(ret > 1);
if (!pages_done)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5f816ee42206..6b9d27925e7a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4328,8 +4328,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
fault_flags |= FAULT_FLAG_ALLOW_RETRY |
FAULT_FLAG_RETRY_NOWAIT;
if (flags & FOLL_TRIED) {
-   VM_WARN_ON_ONCE(fault_flags &
-   FAULT_FLAG_ALLOW_RETRY);
+   /*
+* Note: FAULT_FLAG_ALLOW_RETRY and
+* FAULT_FLAG_TRIED can co-exist
+*/
fault_flags |= FAULT_FLAG_TRIED;
}
ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
-- 
2.21.0