RE: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer

2017-08-04 Thread 陶文苇


> -----Original Message-----
> From: Michal Hocko [mailto:mho...@kernel.org]
> Sent: Friday, August 04, 2017 10:57 PM
> To: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: linux...@kvack.org; a...@linux-foundation.org; 陶文苇
> <wenwei@alibaba-inc.com>; o...@redhat.com; rient...@google.com;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] mm, oom: fix potential data corruption when
> oom_reaper races with writer
> 
> On Fri 04-08-17 13:00:47, Michal Hocko wrote:
> > On Fri 04-08-17 19:41:42, Tetsuo Handa wrote:
> [...]
> > > Yes. Data corruption still happens.
> >
> > I guess I managed to reproduce finally. Will investigate further.
> 
> One limitation of the current MMF_UNSTABLE implementation is that it still
> keeps the new page mapped and only sends EFAULT/kill to the consumer. If
> somebody tries to re-read the same content nothing will really happen. I
> went this way because it was much simpler and memory consumers usually
> do not retry on EFAULT. Maybe this is not the case here.
> 
> I've been staring into iov_iter_copy_from_user_atomic which I believe
> should be the common write path which reads the user buffer where the
> corruption caused by the oom_reaper would come from.
> iov_iter_fault_in_readable should be called before this function. If this
> happened after MMF_UNSTABLE was set then we should get EFAULT and bail
> out early. Let's assume this wasn't the case. Then we should get down to
> iov_iter_copy_from_user_atomic and that one shouldn't copy any data
> because __copy_from_user_inatomic says
> 
>  * If copying succeeds, the return value must be 0.  If some data cannot be
>  * fetched, it is permitted to copy less than had been fetched; the only
>  * hard requirement is that not storing anything at all (i.e. returning size)
>  * should happen only when nothing could be copied.  In other words, you don't
>  * have to squeeze as much as possible - it is allowed, but not necessary.
> 
> which should be our case.
> 
> I was testing with xfs (but generic_perform_write seems to be doing the
> same thing) and that one does
>   if (unlikely(copied == 0)) {
>   /*
>* If we were unable to copy any data at all, we must
>* fall back to a single segment length write.
>*
>* If we didn't fallback here, we could livelock
>* because not all segments in the iov can be copied at
>* once without a pagefault.
>*/
>   bytes = min_t(unsigned long, PAGE_SIZE - offset,
>   iov_iter_single_seg_count(i));
>   goto again;
>   }
> 
> and that again will go through iov_iter_fault_in_readable again and that
> will succeed now.
> 
Agreed, I didn't notice this case before.

> And that's why we still see the corruption. That, however, means that the
> MMF_UNSTABLE implementation has to be more complex and we have to
> hook into all anonymous memory fault paths which I hoped I could avoid
> previously.
> 
> This is a rough first draft that passes the test case from Tetsuo on my
> system. It will need much more eyes on it and I will return to it with a
> fresh brain next week. I would appreciate as much testing as possible.
> 
> Note that this is on top of the previous attempt for the fix but I will
> squash the result into one patch because the previous one is not
> sufficient.
> ---
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 86975dec0ba1..1fbc78d423d7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -550,6 +550,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
>   struct mem_cgroup *memcg;
>   pgtable_t pgtable;
>   unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> + int ret;
> 
>   VM_BUG_ON_PAGE(!PageCompound(page), page);
> 
> @@ -561,9 +562,8 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
> 
>   pgtable = pte_alloc_one(vma->vm_mm, haddr);
>   if (unlikely(!pgtable)) {
> - mem_cgroup_cancel_charge(page, memcg, true);
> - put_page(page);
> - return VM_FAULT_OOM;
> + ret = VM_FAULT_OOM;
> + goto release;
>   }
> 
>   clear_huge_page(page, haddr, HPAGE_PMD_NR);
> @@ -583,6 +583,15 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
>   } else {
>   pmd_t entry;
> 
> + /*
> +  * range could have been already torn down by
> +  * the oom reaper
> +  */

Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer

2017-08-04 Thread Tetsuo Handa
Michal Hocko wrote:
> And that's why we still see the corruption. That, however, means that
> the MMF_UNSTABLE implementation has to be more complex and we have to
> hook into all anonymous memory fault paths which I hoped I could avoid
> previously.

I don't understand mm internals including pte/ptl etc., but I guess that
the direction is correct. Since the OOM reaper basically does

  Set MMF_UNSTABLE flag on mm_struct.
  For each reapable page in mm_struct {
Take ptl lock.
Remove pte.
Release ptl lock.
  }

the page fault handler will need to check MMF_UNSTABLE with lock held.

  For each faulted page in mm_struct {
Take ptl lock.
Add pte only if MMF_UNSTABLE flag is not set.
Release ptl lock.
  }


Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer

2017-08-04 Thread Michal Hocko
On Fri 04-08-17 13:00:47, Michal Hocko wrote:
> On Fri 04-08-17 19:41:42, Tetsuo Handa wrote:
[...]
> > Yes. Data corruption still happens.
> 
> I guess I managed to reproduce finally. Will investigate further.

One limitation of the current MMF_UNSTABLE implementation is that it
still keeps the new page mapped and only sends EFAULT/kill to the
consumer. If somebody tries to re-read the same content nothing will
really happen. I went this way because it was much simpler and memory
consumers usually do not retry on EFAULT. Maybe this is not the case
here.

I've been staring into iov_iter_copy_from_user_atomic which I
believe should be the common write path which reads the user buffer
where the corruption caused by the oom_reaper would come from.
iov_iter_fault_in_readable should be called before this function. If
this happened after MMF_UNSTABLE was set then we should get EFAULT and
bail out early. Let's assume this wasn't the case. Then we should get
down to iov_iter_copy_from_user_atomic and that one shouldn't copy any
data because __copy_from_user_inatomic says

 * If copying succeeds, the return value must be 0.  If some data cannot be
 * fetched, it is permitted to copy less than had been fetched; the only
 * hard requirement is that not storing anything at all (i.e. returning size)
 * should happen only when nothing could be copied.  In other words, you don't
 * have to squeeze as much as possible - it is allowed, but not necessary.

which should be our case.

I was testing with xfs (but generic_perform_write seems to be doing the
same thing) and that one does
if (unlikely(copied == 0)) {
/*
 * If we were unable to copy any data at all, we must
 * fall back to a single segment length write.
 *
 * If we didn't fallback here, we could livelock
 * because not all segments in the iov can be copied at
 * once without a pagefault.
 */
bytes = min_t(unsigned long, PAGE_SIZE - offset,
iov_iter_single_seg_count(i));
goto again;
}

and that again will go through iov_iter_fault_in_readable again and that
will succeed now.

And that's why we still see the corruption. That, however, means that
the MMF_UNSTABLE implementation has to be more complex and we have to
hook into all anonymous memory fault paths which I hoped I could avoid
previously.

This is a rough first draft that passes the test case from Tetsuo on my
system. It will need much more eyes on it and I will return to it with a
fresh brain next week. I would appreciate as much testing as possible.

Note that this is on top of the previous attempt for the fix but I will
squash the result into one patch because the previous one is not
sufficient.
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86975dec0ba1..1fbc78d423d7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -550,6 +550,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault 
*vmf, struct page *page,
struct mem_cgroup *memcg;
pgtable_t pgtable;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+   int ret;
 
VM_BUG_ON_PAGE(!PageCompound(page), page);
 
@@ -561,9 +562,8 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault 
*vmf, struct page *page,
 
pgtable = pte_alloc_one(vma->vm_mm, haddr);
if (unlikely(!pgtable)) {
-   mem_cgroup_cancel_charge(page, memcg, true);
-   put_page(page);
-   return VM_FAULT_OOM;
+   ret = VM_FAULT_OOM;
+   goto release;
}
 
clear_huge_page(page, haddr, HPAGE_PMD_NR);
@@ -583,6 +583,15 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault 
*vmf, struct page *page,
} else {
pmd_t entry;
 
+   /*
+* range could have been already torn down by
+* the oom reaper
+*/
+   if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
+   spin_unlock(vmf->ptl);
+   ret = VM_FAULT_SIGBUS;
+   goto release;
+   }
/* Deliver the page fault to userland */
if (userfaultfd_missing(vma)) {
int ret;
@@ -610,6 +619,13 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault 
*vmf, struct page *page,
}
 
return 0;
+release:
+   if (pgtable)
+   pte_free(vma->vm_mm, pgtable);
+   mem_cgroup_cancel_charge(page, memcg, true);
+   put_page(page);
+   return ret;
+
 }
 
 /*
@@ -688,7 +704,14 @@ int do_huge_pmd_anonymous_page(struct vm_fault *vmf)
ret = 0;
set = false;
if (pmd_none(*vmf->pmd)) {
-   if 

Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer

2017-08-04 Thread Michal Hocko
On Fri 04-08-17 19:41:42, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> > > Well, while lockdep warning is gone, this problem is remaining.
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index edabf6f..1e06c29 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, 
> > > unsigned long address,
> > > /*
> > >  * This mm has been already reaped by the oom reaper and so the
> > >  * refault cannot be trusted in general. Anonymous refaults would
> > > -* lose data and give a zero page instead e.g. This is especially
> > > -* problem for use_mm() because regular tasks will just die and
> > > -* the corrupted data will not be visible anywhere while kthread
> > > -* will outlive the oom victim and potentially propagate the date
> > > -* further.
> > > +* lose data and give a zero page instead e.g.
> > >  */
> > > -   if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > > -   && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > > +   if (unlikely(!(ret & VM_FAULT_ERROR)
> > > +&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> > > +   if (ret & VM_FAULT_RETRY)
> > > +   down_read(&vma->vm_mm->mmap_sem);
> > > ret = VM_FAULT_SIGBUS;
> > > +   }
> > > 
> > > return ret;
> > >  }
> > 
> > I have re-read your email again and I guess I misread previously. Are
> > you saying that the data corruption happens with the both patches
> > applied?
> 
> Yes. Data corruption still happens.

I guess I managed to reproduce finally. Will investigate further.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer

2017-08-04 Thread Tetsuo Handa
Michal Hocko wrote:
> On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> > Well, while lockdep warning is gone, this problem is remaining.
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index edabf6f..1e06c29 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, 
> > unsigned long address,
> > /*
> >  * This mm has been already reaped by the oom reaper and so the
> >  * refault cannot be trusted in general. Anonymous refaults would
> > -* lose data and give a zero page instead e.g. This is especially
> > -* problem for use_mm() because regular tasks will just die and
> > -* the corrupted data will not be visible anywhere while kthread
> > -* will outlive the oom victim and potentially propagate the date
> > -* further.
> > +* lose data and give a zero page instead e.g.
> >  */
> > -   if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > -   && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > +   if (unlikely(!(ret & VM_FAULT_ERROR)
> > +&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> > +   if (ret & VM_FAULT_RETRY)
> > +   down_read(&vma->vm_mm->mmap_sem);
> > ret = VM_FAULT_SIGBUS;
> > +   }
> > 
> > return ret;
> >  }
> 
> I have re-read your email again and I guess I misread previously. Are
> you saying that the data corruption happens with the both patches
> applied?

Yes. Data corruption still happens.


Re: Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer

2017-08-04 Thread Michal Hocko
On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> Well, while lockdep warning is gone, this problem is remaining.
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index edabf6f..1e06c29 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, 
> unsigned long address,
> /*
>  * This mm has been already reaped by the oom reaper and so the
>  * refault cannot be trusted in general. Anonymous refaults would
> -* lose data and give a zero page instead e.g. This is especially
> -* problem for use_mm() because regular tasks will just die and
> -* the corrupted data will not be visible anywhere while kthread
> -* will outlive the oom victim and potentially propagate the date
> -* further.
> +* lose data and give a zero page instead e.g.
>  */
> -   if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> -   && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> +   if (unlikely(!(ret & VM_FAULT_ERROR)
> +&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> +   if (ret & VM_FAULT_RETRY)
> +   down_read(&vma->vm_mm->mmap_sem);
> ret = VM_FAULT_SIGBUS;
> +   }
> 
> return ret;
>  }

I have re-read your email again and I guess I misread previously. Are
you saying that the data corruption happens with the both patches
applied?

> 
> $ cat /tmp/file.* | od -b | head
> 000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
> *
> 42033 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
> *
> 42034 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
> *
> 45733 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
> *
> 45734 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
> *

-- 
Michal Hocko
SUSE Labs


Re: Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer

2017-08-04 Thread Michal Hocko
On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> Well, while lockdep warning is gone, this problem is remaining.

Ohh, I should have been more specific. Both patches have to be applied.
I have based this one first because it should go to stable. The later
one needs a trivial conflict resolution. I will send both of them as a
reply to this email!

Thanks for retesting. It matches my testing results.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer

2017-08-04 Thread Michal Hocko
On Fri 04-08-17 15:46:46, Tetsuo Handa wrote:
> Michal Hocko wrote:
> >  So there is a race window when some threads
> > won't have fatal_signal_pending while the oom_reaper could start
> > unmapping the address space. generic_perform_write could then write
> > zero page to the page cache and corrupt data.
> 
> Oh, simple generic_perform_write() ?
> 
> > 
> > The race window is rather small and close to impossible to happen but it
> > would be better to have it covered.
> 
> OK, I confirmed that this problem is easily reproducible using below 
> reproducer.

Yeah, I can imagine this could be triggered artificially. I am somehow
more skeptical about real life oom scenarios to trigger this though.
Anyway, thanks for your test case!
 
> Applying your patch seems to avoid this problem, but as far as I tested
> your patch seems to trivially trigger some lock-related problem.
> Is your patch really safe?

> --
> [   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or 
> sacrifice child
> [   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, 
> anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
> [   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
> [   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
> [   58.557480] [ cut here ]
> [   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 
> lock_release+0x172/0x1e0
> [   58.569076] Modules linked in: nf_conntrack_netbios_ns 
> nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT 
> nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute 
> bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 
> ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
> nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter 
> ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr 
> vmw_balloon vmw_vmci shpchp sg i2c_piix4 parport_pc parport ip_tables xfs 
> libcrc32c sr_mod sd_mod cdrom ata_generic pata_acpi serio_raw mptspi 
> scsi_transport_spi mptscsih ahci e1000 libahci ata_piix mptbase libata
> [   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 
> 4.13.0-rc3-next-20170803+ #142
> [   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
> Desktop Reference Platform, BIOS 6.00 07/02/2015
> [   58.609790] task: 9d90df888040 task.stack: a07084854000
> [   58.613944] RIP: 0010:lock_release+0x172/0x1e0
> [   58.617622] RSP: :a07084857e58 EFLAGS: 00010082
> [   58.621533] RAX: 001f RBX: 9d90df888040 RCX: 
> 
> [   58.626074] RDX:  RSI: 0001 RDI: 
> a30d4ba4
> [   58.630572] RBP: a07084857e98 R08:  R09: 
> 0001
> [   58.635016] R10:  R11: 001f R12: 
> a07084857f58
> [   58.639694] R13: 9d90f60d6cd0 R14:  R15: 
> a305cb6e
> [   58.644200] FS:  7fb932730740() GS:9d90f9f8() 
> knlGS:
> [   58.648989] CS:  0010 DS:  ES:  CR0: 80050033
> [   58.652903] CR2: 0040092f CR3: 000135229000 CR4: 
> 000606e0
> [   58.657280] Call Trace:
> [   58.659989]  up_read+0x1a/0x40
> [   58.662825]  __do_page_fault+0x28e/0x4c0
> [   58.665946]  do_page_fault+0x30/0x80
> [   58.668911]  page_fault+0x28/0x30

OK, I know what is going on here. The page fault must have returned with
VM_FAULT_RETRY, in which case mmap_sem is dropped before returning. My
patch overwrites this error code, so the page fault path doesn't know
that the lock is no longer held and releases it unconditionally. This is
a preexisting problem, introduced by 3f70dc38cec2 ("mm: make sure that
kthreads will not refault oom reaped memory"), which I should have
considered.

I believe the easiest way around this is the following patch:
---
From dd31779f763bbe2aa86100f804656ac680c49d35 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Fri, 4 Aug 2017 09:36:34 +0200
Subject: [PATCH] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced
 SIGBUS

Tetsuo Handa has noticed that the MMF_UNSTABLE SIGBUS path in
handle_mm_fault causes a lockdep splat:
[   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice 
child
[   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, 
anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
[   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
[   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
[   58.557480] [ cut here ]
[   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 
lock_release+0x172/0x1e0
[   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 
4.13.0-rc3-next-20170803+ #142
[   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 07/02/2015
[   58.609790] task: 9d90df888040 
