RE: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
> -----Original Message-----
> From: Michal Hocko [mailto:mho...@kernel.org]
> Sent: Friday, August 04, 2017 10:57 PM
> To: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: linux...@kvack.org; a...@linux-foundation.org; 陶文苇 <wenwei@alibaba-inc.com>;
> o...@redhat.com; rient...@google.com; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] mm, oom: fix potential data corruption when
> oom_reaper races with writer
>
> On Fri 04-08-17 13:00:47, Michal Hocko wrote:
> > On Fri 04-08-17 19:41:42, Tetsuo Handa wrote:
> [...]
> > > Yes. Data corruption still happens.
> >
> > I guess I managed to reproduce finally. Will investigate further.
>
> One limitation of the current MMF_UNSTABLE implementation is that it still
> keeps the new page mapped and only sends EFAULT/kill to the consumer. If
> somebody tries to re-read the same content nothing will really happen. I
> went this way because it was much simpler and memory consumers usually
> do not retry on EFAULT. Maybe this is not the case here.
>
> I've been staring into iov_iter_copy_from_user_atomic which I believe
> should be the common write path which reads the user buffer where the
> corruption caused by the oom_reaper would come from.
> iov_iter_fault_in_readable should be called before this function. If this
> happened after MMF_UNSTABLE was set then we should get EFAULT and bail
> out early. Let's assume this wasn't the case. Then we should get down to
> iov_iter_copy_from_user_atomic and that one shouldn't copy any data
> because __copy_from_user_inatomic says
>
>  * If copying succeeds, the return value must be 0. If some data cannot be
>  * fetched, it is permitted to copy less than had been fetched; the only
>  * hard requirement is that not storing anything at all (i.e. returning size)
>  * should happen only when nothing could be copied. In other words, you don't
>  * have to squeeze as much as possible - it is allowed, but not necessary.
>
> which should be our case.
>
> I was testing with xfs (but generic_perform_write seems to be doing the same
> thing) and that one does
>
> 	if (unlikely(copied == 0)) {
> 		/*
> 		 * If we were unable to copy any data at all, we must
> 		 * fall back to a single segment length write.
> 		 *
> 		 * If we didn't fallback here, we could livelock
> 		 * because not all segments in the iov can be copied at
> 		 * once without a pagefault.
> 		 */
> 		bytes = min_t(unsigned long, PAGE_SIZE - offset,
> 			      iov_iter_single_seg_count(i));
> 		goto again;
> 	}
>
> and that again will go through iov_iter_fault_in_readable again and that will
> succeed now.

Agree, didn't notice this case before.

> And that's why we still see the corruption. That, however, means that the
> MMF_UNSTABLE implementation has to be more complex and we have to
> hook into all anonymous memory fault paths which I hoped I could avoid
> previously.
>
> This is a rough first draft that passes the test case from Tetsuo on my system.
> It will need much more eyes on it and I will return to it with a fresh brain next
> week. I would appreciate as much testing as possible.
>
> Note that this is on top of the previous attempt for the fix but I will squash
> the result into one patch because the previous one is not sufficient.
>
> ---
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 86975dec0ba1..1fbc78d423d7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -550,6 +550,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
>  	struct mem_cgroup *memcg;
>  	pgtable_t pgtable;
>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> +	int ret;
>
>  	VM_BUG_ON_PAGE(!PageCompound(page), page);
>
> @@ -561,9 +562,8 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
>
>  	pgtable = pte_alloc_one(vma->vm_mm, haddr);
>  	if (unlikely(!pgtable)) {
> -		mem_cgroup_cancel_charge(page, memcg, true);
> -		put_page(page);
> -		return VM_FAULT_OOM;
> +		ret = VM_FAULT_OOM;
> +		goto release;
>  	}
>
>  	clear_huge_page(page, haddr, HPAGE_PMD_NR);
> @@ -583,6 +583,15 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
>  	} else {
>  		pmd_t entry;
>
> +		/*
> +		 * range could have been already torn down by
> +
Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
Michal Hocko wrote:
> And that's why we still see the corruption. That, however, means that
> the MMF_UNSTABLE implementation has to be more complex and we have to
> hook into all anonymous memory fault paths which I hoped I could avoid
> previously.

I don't understand mm internals including pte/ptl etc., but I guess that
the direction is correct. Since the OOM reaper basically does

	Set MMF_UNSTABLE flag on mm_struct.
	For each reapable page in mm_struct {
		Take ptl lock.
		Remove pte.
		Release ptl lock.
	}

the page fault handler will need to check MMF_UNSTABLE with lock held.

	For each faulted page in mm_struct {
		Take ptl lock.
		Add pte only if MMF_UNSTABLE flag is not set.
		Release ptl lock.
	}
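The ordering sketched above can be modeled in user-space C. This is only an illustration, not kernel code: all names are invented, a pthread mutex stands in for the page-table lock (ptl), and a plain flag stands in for MMF_UNSTABLE in mm->flags.

```c
#include <pthread.h>
#include <stdbool.h>

/* Toy model: one "pte" slot guarded by a lock standing in for ptl. */
struct toy_mm {
	pthread_mutex_t ptl;
	bool mmf_unstable;	/* stands in for MMF_UNSTABLE */
	int pte;		/* 0 == not present */
};

/* What the reaper does: set the flag, then tear down ptes under ptl. */
static void reap(struct toy_mm *mm)
{
	mm->mmf_unstable = true;	/* set before unmapping starts */
	pthread_mutex_lock(&mm->ptl);
	mm->pte = 0;			/* remove pte */
	pthread_mutex_unlock(&mm->ptl);
}

/* What the fault handler must do: re-check the flag under the same
 * lock, so a fault racing with the reaper cannot re-install a mapping. */
static int fault(struct toy_mm *mm, int new_pte)
{
	int ret = 0;			/* 0 == mapped, -1 == SIGBUS */

	pthread_mutex_lock(&mm->ptl);
	if (mm->mmf_unstable)
		ret = -1;		/* refuse: address space is unstable */
	else
		mm->pte = new_pte;
	pthread_mutex_unlock(&mm->ptl);
	return ret;
}
```

Once reap() has run, every later fault() refuses to install a mapping instead of silently handing out a fresh zero page — the behaviour the real fix has to enforce in each anonymous fault path.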
Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
On Fri 04-08-17 13:00:47, Michal Hocko wrote:
> On Fri 04-08-17 19:41:42, Tetsuo Handa wrote:
[...]
> > Yes. Data corruption still happens.
>
> I guess I managed to reproduce finally. Will investigate further.

One limitation of the current MMF_UNSTABLE implementation is that it still
keeps the new page mapped and only sends EFAULT/kill to the consumer. If
somebody tries to re-read the same content nothing will really happen. I
went this way because it was much simpler and memory consumers usually
do not retry on EFAULT. Maybe this is not the case here.

I've been staring into iov_iter_copy_from_user_atomic which I believe
should be the common write path which reads the user buffer where the
corruption caused by the oom_reaper would come from.
iov_iter_fault_in_readable should be called before this function. If this
happened after MMF_UNSTABLE was set then we should get EFAULT and bail
out early. Let's assume this wasn't the case. Then we should get down to
iov_iter_copy_from_user_atomic and that one shouldn't copy any data
because __copy_from_user_inatomic says

 * If copying succeeds, the return value must be 0. If some data cannot be
 * fetched, it is permitted to copy less than had been fetched; the only
 * hard requirement is that not storing anything at all (i.e. returning size)
 * should happen only when nothing could be copied. In other words, you don't
 * have to squeeze as much as possible - it is allowed, but not necessary.

which should be our case.

I was testing with xfs (but generic_perform_write seems to be doing the same
thing) and that one does

	if (unlikely(copied == 0)) {
		/*
		 * If we were unable to copy any data at all, we must
		 * fall back to a single segment length write.
		 *
		 * If we didn't fallback here, we could livelock
		 * because not all segments in the iov can be copied at
		 * once without a pagefault.
		 */
		bytes = min_t(unsigned long, PAGE_SIZE - offset,
			      iov_iter_single_seg_count(i));
		goto again;
	}

and that again will go through iov_iter_fault_in_readable again and that will
succeed now.

And that's why we still see the corruption. That, however, means that the
MMF_UNSTABLE implementation has to be more complex and we have to
hook into all anonymous memory fault paths which I hoped I could avoid
previously.

This is a rough first draft that passes the test case from Tetsuo on my
system. It will need much more eyes on it and I will return to it with a
fresh brain next week. I would appreciate as much testing as possible.

Note that this is on top of the previous attempt for the fix but I will
squash the result into one patch because the previous one is not sufficient.

---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86975dec0ba1..1fbc78d423d7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -550,6 +550,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	int ret;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
@@ -561,9 +562,8 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 
 	pgtable = pte_alloc_one(vma->vm_mm, haddr);
 	if (unlikely(!pgtable)) {
-		mem_cgroup_cancel_charge(page, memcg, true);
-		put_page(page);
-		return VM_FAULT_OOM;
+		ret = VM_FAULT_OOM;
+		goto release;
 	}
 
 	clear_huge_page(page, haddr, HPAGE_PMD_NR);
@@ -583,6 +583,15 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	} else {
 		pmd_t entry;
 
+		/*
+		 * range could have been already torn down by
+		 * the oom reaper
+		 */
+		if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
+			spin_unlock(vmf->ptl);
+			ret = VM_FAULT_SIGBUS;
+			goto release;
+		}
 		/* Deliver the page fault to userland */
 		if (userfaultfd_missing(vma)) {
 			int ret;
@@ -610,6 +619,13 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	}
 	return 0;
+release:
+	if (pgtable)
+		pte_free(vma->vm_mm, pgtable);
+	mem_cgroup_cancel_charge(page, memcg, true);
+	put_page(page);
+	return ret;
+
 }
 
 /*
@@ -688,7 +704,14 @@ int do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		ret = 0;
 		set = false;
 		if (pmd_none(*vmf->pmd)) {
-			if
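The retry behaviour described above — the atomic copy fails, iov_iter_fault_in_readable then maps a fresh zero page because the reaper has already torn the old one down, and the retry happily copies zeros — can be reproduced with a small user-space model. All names here are invented for illustration; this is not the kernel code.

```c
#include <stdbool.h>
#include <string.h>

/* Toy model of the write(2) retry loop. The user buffer is a single
 * "page" that the oom reaper has already unmapped. */
struct toy_page {
	bool present;
	char data[16];
};

/* __copy_from_user_inatomic stand-in: must not fault, so it copies
 * nothing when the source page is absent (returns bytes NOT copied). */
static size_t copy_atomic(char *dst, struct toy_page *src, size_t len)
{
	if (!src->present)
		return len;		/* nothing copied */
	memcpy(dst, src->data, len);
	return 0;
}

/* iov_iter_fault_in_readable stand-in: faults the page back in. After
 * the reaper tore it down, the anonymous refault gives a zero page. */
static void fault_in(struct toy_page *p)
{
	if (!p->present) {
		memset(p->data, 0, sizeof(p->data));	/* old data lost */
		p->present = true;
	}
}

/* generic_perform_write-style loop: on copied == 0, fault in and retry. */
static void perform_write(char *dst, struct toy_page *src, size_t len)
{
	for (;;) {
		size_t not_copied = copy_atomic(dst, src, len);
		if (not_copied == 0)
			return;
		fault_in(src);	/* succeeds now -> retry copies zeros */
	}
}
```

After the reaper clears `present`, perform_write() terminates — but what lands in the destination is the zero page, not the original payload, which is exactly the corruption being discussed.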
Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
On Fri 04-08-17 19:41:42, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> > > Well, while lockdep warning is gone, this problem is remaining.
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index edabf6f..1e06c29 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > >  	/*
> > >  	 * This mm has been already reaped by the oom reaper and so the
> > >  	 * refault cannot be trusted in general. Anonymous refaults would
> > > -	 * lose data and give a zero page instead e.g. This is especially
> > > -	 * problem for use_mm() because regular tasks will just die and
> > > -	 * the corrupted data will not be visible anywhere while kthread
> > > -	 * will outlive the oom victim and potentially propagate the date
> > > -	 * further.
> > > +	 * lose data and give a zero page instead e.g.
> > >  	 */
> > > -	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > > -				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > > +	if (unlikely(!(ret & VM_FAULT_ERROR)
> > > +		     && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> > > +		if (ret & VM_FAULT_RETRY)
> > > +			down_read(&vma->vm_mm->mmap_sem);
> > >  		ret = VM_FAULT_SIGBUS;
> > > +	}
> > >
> > >  	return ret;
> > >  }
> >
> > I have re-read your email again and I guess I misread previously. Are
> > you saying that the data corruption happens with both patches applied?
>
> Yes. Data corruption still happens.

I guess I managed to reproduce finally. Will investigate further.

Thanks!
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
Michal Hocko wrote:
> On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> > Well, while lockdep warning is gone, this problem is remaining.
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index edabf6f..1e06c29 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >  	/*
> >  	 * This mm has been already reaped by the oom reaper and so the
> >  	 * refault cannot be trusted in general. Anonymous refaults would
> > -	 * lose data and give a zero page instead e.g. This is especially
> > -	 * problem for use_mm() because regular tasks will just die and
> > -	 * the corrupted data will not be visible anywhere while kthread
> > -	 * will outlive the oom victim and potentially propagate the date
> > -	 * further.
> > +	 * lose data and give a zero page instead e.g.
> >  	 */
> > -	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > -				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > +	if (unlikely(!(ret & VM_FAULT_ERROR)
> > +		     && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> > +		if (ret & VM_FAULT_RETRY)
> > +			down_read(&vma->vm_mm->mmap_sem);
> >  		ret = VM_FAULT_SIGBUS;
> > +	}
> >
> >  	return ret;
> >  }
>
> I have re-read your email again and I guess I misread previously. Are
> you saying that the data corruption happens with both patches applied?

Yes. Data corruption still happens.
Re: Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> Well, while lockdep warning is gone, this problem is remaining.
>
> diff --git a/mm/memory.c b/mm/memory.c
> index edabf6f..1e06c29 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  	/*
>  	 * This mm has been already reaped by the oom reaper and so the
>  	 * refault cannot be trusted in general. Anonymous refaults would
> -	 * lose data and give a zero page instead e.g. This is especially
> -	 * problem for use_mm() because regular tasks will just die and
> -	 * the corrupted data will not be visible anywhere while kthread
> -	 * will outlive the oom victim and potentially propagate the date
> -	 * further.
> +	 * lose data and give a zero page instead e.g.
>  	 */
> -	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> -				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> +	if (unlikely(!(ret & VM_FAULT_ERROR)
> +		     && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> +		if (ret & VM_FAULT_RETRY)
> +			down_read(&vma->vm_mm->mmap_sem);
>  		ret = VM_FAULT_SIGBUS;
> +	}
>
>  	return ret;
>  }

I have re-read your email again and I guess I misread previously. Are
you saying that the data corruption happens with both patches applied?

> $ cat /tmp/file.* | od -b | head
> 000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
> *
> 42033 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
> *
> 42034 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
> *
> 45733 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
> *
> 45734 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
> *
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org
--
Michal Hocko
SUSE Labs
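The extra down_read() in the hunk above compensates for the fact that a VM_FAULT_RETRY result means the fault path has already dropped mmap_sem, while the arch fault handler will still unlock unconditionally once the result is rewritten to SIGBUS. A user-space sketch of that accounting (all names invented; a plain depth counter stands in for the real semaphore):

```c
/* Toy fault-return flags and a read-lock depth counter standing in
 * for mmap_sem. */
#define VM_FAULT_RETRY	0x1
#define VM_FAULT_SIGBUS	0x2

static int mmap_sem_depth;

static void down_read(void) { mmap_sem_depth++; }
static void up_read(void)   { mmap_sem_depth--; }

/* Inner fault path: on RETRY it has already dropped mmap_sem. */
static int do_fault(int want_retry)
{
	if (want_retry) {
		up_read();
		return VM_FAULT_RETRY;
	}
	return 0;
}

/* The workaround from the hunk above: before rewriting the result to
 * SIGBUS, re-take the lock if RETRY was set, so the caller's
 * unconditional unlock stays balanced. */
static int handle_mm_fault(int want_retry, int mm_unstable)
{
	int ret = do_fault(want_retry);

	if (mm_unstable) {
		if (ret & VM_FAULT_RETRY)
			down_read();
		ret = VM_FAULT_SIGBUS;
	}
	return ret;
}

/* Arch fault handler: takes the lock, faults, then unlocks unless the
 * (possibly rewritten) result says RETRY. */
static int page_fault(int want_retry, int mm_unstable)
{
	int ret;

	down_read();
	ret = handle_mm_fault(want_retry, mm_unstable);
	if (!(ret & VM_FAULT_RETRY))
		up_read();
	return ret;
}
```

With the re-acquire in place the depth counter returns to zero on every path; drop it and the SIGBUS-after-RETRY path unlocks twice.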
Re: Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> Well, while lockdep warning is gone, this problem is remaining.

Ohh, I should have been more specific. Both patches have to be applied.
I have posted this one first because it should go to stable. The later
one needs a trivial conflict resolution. I will send both of them as a
reply to this email.

Thanks for retesting. It matches my testing results.
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
On Fri 04-08-17 15:46:46, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > So there is a race window when some threads won't have
> > fatal_signal_pending while the oom_reaper could start unmapping the
> > address space. generic_perform_write could then write zero page to
> > the page cache and corrupt data.
>
> Oh, simple generic_perform_write() ?
>
> > The race window is rather small and close to impossible to happen
> > but it would be better to have it covered.
>
> OK, I confirmed that this problem is easily reproducible using below
> reproducer.

Yeah, I can imagine this could be triggered artificially. I am somehow
more skeptical about real life oom scenarios to trigger this though.
Anyway, thanks for your test case!

> Applying your patch seems to avoid this problem, but as far as I tested
> your patch seems to trivially trigger some lock related problem.
> Is your patch really safe?
> --
> [   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
> [   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
> [   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
> [   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
> [   58.557480] [ cut here ]
> [   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
> [   58.569076] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon vmw_vmci shpchp sg i2c_piix4 parport_pc parport ip_tables xfs libcrc32c sr_mod sd_mod cdrom ata_generic pata_acpi serio_raw mptspi scsi_transport_spi mptscsih ahci e1000 libahci ata_piix mptbase libata
> [   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
> [   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> [   58.609790] task: 9d90df888040 task.stack: a07084854000
> [   58.613944] RIP: 0010:lock_release+0x172/0x1e0
> [   58.617622] RSP: :a07084857e58 EFLAGS: 00010082
> [   58.621533] RAX: 001f RBX: 9d90df888040 RCX:
> [   58.626074] RDX: RSI: 0001 RDI: a30d4ba4
> [   58.630572] RBP: a07084857e98 R08: R09: 0001
> [   58.635016] R10: R11: 001f R12: a07084857f58
> [   58.639694] R13: 9d90f60d6cd0 R14: R15: a305cb6e
> [   58.644200] FS: 7fb932730740() GS:9d90f9f8() knlGS:
> [   58.648989] CS: 0010 DS: ES: CR0: 80050033
> [   58.652903] CR2: 0040092f CR3: 000135229000 CR4: 000606e0
> [   58.657280] Call Trace:
> [   58.659989] up_read+0x1a/0x40
> [   58.662825] __do_page_fault+0x28e/0x4c0
> [   58.665946] do_page_fault+0x30/0x80
> [   58.668911] page_fault+0x28/0x30

OK, I know what is going on here. The page fault must have returned with
VM_FAULT_RETRY after the callee dropped mmap_sem. My patch overwrites
this error code, so the page fault path doesn't know that the lock is no
longer held and releases it unconditionally. This is a preexisting
problem introduced by 3f70dc38cec2 ("mm: make sure that kthreads will
not refault oom reaped memory"). I should have considered this option.
I believe the easiest way around this is the following patch
---
>From dd31779f763bbe2aa86100f804656ac680c49d35 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Fri, 4 Aug 2017 09:36:34 +0200
Subject: [PATCH] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced
 SIGBUS

Tetsuo Handa has noticed that the MMF_UNSTABLE SIGBUS path in
handle_mm_fault causes a lockdep splat

[   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
[   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
[   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
[   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
[   58.557480] [ cut here ]
[   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
[   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
[   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   58.609790] task: 9d90df888040