Re: [PATCH v3] mm, memcg: fix reclaim deadlock with writeback

2018-12-14 Thread Michal Hocko
On Thu 13-12-18 17:04:00, Johannes Weiner wrote:
[...]
> Acked-by: Johannes Weiner 

Thanks!

> Just one nit:
> 
> > @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
> > struct vm_area_struct *vma = vmf->vma;
> > vm_fault_t ret;
> >  
> > +	/*
> > +	 * Preallocate pte before we take page_lock because this might lead to
> > +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> > +	 */
> > +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> > +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
> > +		if (!vmf->prealloc_pte)
> > +			return VM_FAULT_OOM;
> > +		smp_wmb(); /* See comment in __pte_alloc() */
> > +	}
> 
> Could you be more specific in the deadlock comment? git blame will
> work fine for a while, but it becomes a pain to find corresponding
> patches after stuff gets moved around for years.
> 
> In particular the race diagram between reclaim with a page lock held
> and the fs doing SetPageWriteback batches before kicking off IO would
> be useful directly in the code, IMO.

This?

diff --git a/mm/memory.c b/mm/memory.c
index bb78e90a9b70..ece221e4da6d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2995,7 +2995,18 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 
 	/*
 	 * Preallocate pte before we take page_lock because this might lead to
-	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 * deadlocks for memcg reclaim which waits for pages under writeback:
+	 *				lock_page(A)
+	 *				SetPageWriteback(A)
+	 *				unlock_page(A)
+	 * lock_page(B)
+	 *				lock_page(B)
+	 * pte_alloc_one
+	 *   shrink_page_list
+	 *     wait_on_page_writeback(A)
+	 *				SetPageWriteback(B)
+	 *				unlock_page(B)
+	 *				# flush A, B to clear the writeback
 	 */
 	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
 		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
-- 
Michal Hocko
SUSE Labs
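
The race diagram in the comment above can also be read as a wait-for
cycle. What follows is only a minimal userspace sketch of that reading,
not kernel code: the names wait_model, deadlocked(), alloc_under_page_lock()
and alloc_before_page_lock() are hypothetical and exist only for this
illustration. Each task is blocked on at most one resource, and each
resource is released only when its owner makes progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the diagram, not kernel code. */
enum task { TASK_NONE = -1, TASK_FAULT, TASK_WRITEBACK, NR_TASKS };
enum res { RES_NONE = -1, RES_LOCK_B, RES_WRITEBACK_A, NR_RES };

struct wait_model {
	enum task owner[NR_RES];	/* task that must run to release it */
	enum res waits_for[NR_TASKS];	/* what each task is blocked on */
};

/* Follow task -> awaited resource -> owner; reaching start again is a cycle. */
static bool deadlocked(const struct wait_model *m, enum task start)
{
	enum task t = start;

	assert(start >= 0 && start < NR_TASKS);
	for (int hops = 0; hops < NR_TASKS; hops++) {
		enum res r = m->waits_for[t];

		if (r == RES_NONE)
			return false;	/* t can make progress */
		t = m->owner[r];
		if (t == TASK_NONE)
			return false;	/* resource is free */
		if (t == start)
			return true;	/* wait-for cycle */
	}
	return false;
}

/* Before the patch: the pte is allocated with page B already locked. */
static struct wait_model alloc_under_page_lock(void)
{
	struct wait_model m = {
		.owner = { [RES_LOCK_B] = TASK_FAULT,
			   [RES_WRITEBACK_A] = TASK_WRITEBACK },
		.waits_for = { [TASK_FAULT] = RES_WRITEBACK_A,
			       [TASK_WRITEBACK] = RES_LOCK_B },
	};
	return m;
}

/* After the patch: reclaim may still wait on A, but B is not locked yet. */
static struct wait_model alloc_before_page_lock(void)
{
	struct wait_model m = {
		.owner = { [RES_LOCK_B] = TASK_NONE,
			   [RES_WRITEBACK_A] = TASK_WRITEBACK },
		.waits_for = { [TASK_FAULT] = RES_WRITEBACK_A,
			       [TASK_WRITEBACK] = RES_NONE },
	};
	return m;
}
```

Under these assumptions, deadlocked() reports a cycle for
alloc_under_page_lock() from either task, and no cycle for
alloc_before_page_lock(), because the writeback side is then free to
lock B, finish its batch and submit the IO that clears writeback on A.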


Re: [PATCH v3] mm, memcg: fix reclaim deadlock with writeback

2018-12-13 Thread Liu Bo
On Thu, Dec 13, 2018 at 10:22:21AM +0100, Michal Hocko wrote:
> From: Michal Hocko 
> 
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [] wait_on_page_bit+0x82/0xa0
> [] shrink_page_list+0x907/0x960
> [] shrink_inactive_list+0x2c7/0x680
> [] shrink_node_memcg+0x404/0x830
> [] shrink_node+0xd8/0x300
> [] do_try_to_free_pages+0x10d/0x330
> [] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [] try_charge+0x14d/0x720
> [] memcg_kmem_charge_memcg+0x3c/0xa0
> [] memcg_kmem_charge+0x7e/0xd0
> [] __alloc_pages_nodemask+0x178/0x260
> [] alloc_pages_current+0x95/0x140
> [] pte_alloc_one+0x17/0x40
> [] __pte_alloc+0x1e/0x110
> [] alloc_set_pte+0x5fe/0xc20
> [] do_fault+0x103/0x970
> [] handle_mm_fault+0x61e/0xd10
> [] __do_page_fault+0x252/0x4d0
> [] do_page_fault+0x30/0x80
> [] page_fault+0x28/0x30
> [] 0x
> 
> task2:
> [] __lock_page+0x86/0xa0
> [] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [] ext4_writepages+0x479/0xd60
> [] do_writepages+0x1e/0x30
> [] __writeback_single_inode+0x45/0x320
> [] writeback_sb_inodes+0x272/0x600
> [] __writeback_inodes_wb+0x92/0xc0
> [] wb_writeback+0x268/0x300
> [] wb_workfn+0xb4/0x390
> [] process_one_work+0x189/0x420
> [] worker_thread+0x4e/0x4b0
> [] kthread+0xe6/0x100
> [] ret_from_fork+0x41/0x50
> [] 0x
> 
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
> : bit of the page which task1 has locked.
> 
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially an ABBA deadlock:
> 					lock_page(A)
> 					SetPageWriteback(A)
> 					unlock_page(A)
> lock_page(B)
> 					lock_page(B)
> pte_alloc_one
>   shrink_page_list
>     wait_on_page_writeback(A)
> 					SetPageWriteback(B)
> 					unlock_page(B)
> 
> 					# flush A, B to clear the writeback
> 
> This accumulation of more pages to flush is used by several filesystems
> to generate more optimal IO patterns.
> 
> Waiting for the writeback in the legacy memcg controller is a workaround
> for premature OOM killer invocations because there is no dirty IO
> throttling available for the controller. There is no easy way around
> that unfortunately. Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock. We already have handy
> infrastructure for that, so simply reuse the fault-around pattern
> which already does this.
> 
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> made while a fs page is locked, but they should be really rare. I am not
> aware of a better solution unfortunately.
> 

Thanks for the update.

Looks good to me.

Reviewed-by: Liu Bo 

thanks,
-liubo

> Reported-and-Debugged-by: Liu Bo 
> Cc: stable
> Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
> Signed-off-by: Michal Hocko 
> ---
>  mm/memory.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 4ad2d293ddc2..bb78e90a9b70 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>   struct vm_area_struct *vma = vmf->vma;
>   vm_fault_t ret;
>  
> +	/*
> +	 * Preallocate pte before we take page_lock because this might lead to
> +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> +	 */
> +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
> +		if (!vmf->prealloc_pte)
> +			return VM_FAULT_OOM;
> +		smp_wmb(); /* See comment in __pte_alloc() */
> +	}
> +
>   ret = vma->vm_ops->fault(vmf);
>   if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
>   VM_FAULT_DONE_COW)))
> -- 
> 2.19.2
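
The ordering the patch establishes (allocate first, lock second, install
last) can be sketched outside the kernel. This is a simplified userspace
model under stated assumptions, not the mm/memory.c code: fault_ctx,
alloc_table() and fault_sketch() are hypothetical names, calloc() stands
in for pte_alloc_one(), a boolean stands in for the page lock, and the
smp_wmb() from the real patch is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the vm_fault state touched by the patch. */
struct fault_ctx {
	void *table;		/* installed page table; NULL models pmd_none() */
	void *prealloc;		/* models vmf->prealloc_pte */
	bool page_locked;	/* models the locked page returned by ->fault() */
	int allocs_under_lock;	/* counts the allocations the patch avoids */
};

enum fault_result { FAULT_OK, FAULT_OOM };

/* Models pte_alloc_one(): an allocation that may have to wait in reclaim. */
static void *alloc_table(struct fault_ctx *ctx)
{
	if (ctx->page_locked)
		ctx->allocs_under_lock++;
	return calloc(1, 64);
}

static enum fault_result fault_sketch(struct fault_ctx *ctx)
{
	/* Step 1: preallocate while no page lock is held. */
	if (!ctx->table && !ctx->prealloc) {
		ctx->prealloc = alloc_table(ctx);
		if (!ctx->prealloc)
			return FAULT_OOM;
	}

	/* Step 2: the fault handler hands back a locked page. */
	ctx->page_locked = true;

	/* Step 3: install the preallocated table; nothing is allocated here. */
	if (!ctx->table) {
		assert(ctx->prealloc != NULL);
		ctx->table = ctx->prealloc;
		ctx->prealloc = NULL;
	}

	ctx->page_locked = false;
	return FAULT_OK;
}
```

Calling fault_sketch() on a zero-initialised fault_ctx should leave
table set, prealloc cleared and allocs_under_lock at 0, which is the
property the patch relies on: the only allocation that can enter memcg
reclaim happens before any page is locked.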


Re: [PATCH v3] mm, memcg: fix reclaim deadlock with writeback

2018-12-13 Thread Johannes Weiner
On Thu, Dec 13, 2018 at 10:22:21AM +0100, Michal Hocko wrote:
> From: Michal Hocko 
> 
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [] wait_on_page_bit+0x82/0xa0
> [] shrink_page_list+0x907/0x960
> [] shrink_inactive_list+0x2c7/0x680
> [] shrink_node_memcg+0x404/0x830
> [] shrink_node+0xd8/0x300
> [] do_try_to_free_pages+0x10d/0x330
> [] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [] try_charge+0x14d/0x720
> [] memcg_kmem_charge_memcg+0x3c/0xa0
> [] memcg_kmem_charge+0x7e/0xd0
> [] __alloc_pages_nodemask+0x178/0x260
> [] alloc_pages_current+0x95/0x140
> [] pte_alloc_one+0x17/0x40
> [] __pte_alloc+0x1e/0x110
> [] alloc_set_pte+0x5fe/0xc20
> [] do_fault+0x103/0x970
> [] handle_mm_fault+0x61e/0xd10
> [] __do_page_fault+0x252/0x4d0
> [] do_page_fault+0x30/0x80
> [] page_fault+0x28/0x30
> [] 0x
> 
> task2:
> [] __lock_page+0x86/0xa0
> [] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [] ext4_writepages+0x479/0xd60
> [] do_writepages+0x1e/0x30
> [] __writeback_single_inode+0x45/0x320
> [] writeback_sb_inodes+0x272/0x600
> [] __writeback_inodes_wb+0x92/0xc0
> [] wb_writeback+0x268/0x300
> [] wb_workfn+0xb4/0x390
> [] process_one_work+0x189/0x420
> [] worker_thread+0x4e/0x4b0
> [] kthread+0xe6/0x100
> [] ret_from_fork+0x41/0x50
> [] 0x
> 
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
> : bit of the page which task1 has locked.
> 
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially an ABBA deadlock:
> 					lock_page(A)
> 					SetPageWriteback(A)
> 					unlock_page(A)
> lock_page(B)
> 					lock_page(B)
> pte_alloc_one
>   shrink_page_list
>     wait_on_page_writeback(A)
> 					SetPageWriteback(B)
> 					unlock_page(B)
> 
> 					# flush A, B to clear the writeback
> 
> This accumulation of more pages to flush is used by several filesystems
> to generate more optimal IO patterns.
> 
> Waiting for the writeback in the legacy memcg controller is a workaround
> for premature OOM killer invocations because there is no dirty IO
> throttling available for the controller. There is no easy way around
> that unfortunately. Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock. We already have handy
> infrastructure for that, so simply reuse the fault-around pattern
> which already does this.
> 
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> made while a fs page is locked, but they should be really rare. I am not
> aware of a better solution unfortunately.
> 
> Reported-and-Debugged-by: Liu Bo 
> Cc: stable
> Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
> Signed-off-by: Michal Hocko 

Acked-by: Johannes Weiner 

Just one nit:

> @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>   struct vm_area_struct *vma = vmf->vma;
>   vm_fault_t ret;
>  
> +	/*
> +	 * Preallocate pte before we take page_lock because this might lead to
> +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> +	 */
> +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
> +		if (!vmf->prealloc_pte)
> +			return VM_FAULT_OOM;
> +		smp_wmb(); /* See comment in __pte_alloc() */
> +	}

Could you be more specific in the deadlock comment? git blame will
work fine for a while, but it becomes a pain to find corresponding
patches after stuff gets moved around for years.

In particular the race diagram between reclaim with a page lock held
and the fs doing SetPageWriteback batches before kicking off IO would
be useful directly in the code, IMO.


Re: [PATCH v3] mm, memcg: fix reclaim deadlock with writeback

2018-12-13 Thread Michal Hocko
On Thu 13-12-18 13:41:47, Kirill A. Shutemov wrote:
> On Thu, Dec 13, 2018 at 10:22:21AM +0100, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> > ext4 writeback
> > task1:
> > [] wait_on_page_bit+0x82/0xa0
> > [] shrink_page_list+0x907/0x960
> > [] shrink_inactive_list+0x2c7/0x680
> > [] shrink_node_memcg+0x404/0x830
> > [] shrink_node+0xd8/0x300
> > [] do_try_to_free_pages+0x10d/0x330
> > [] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> > [] try_charge+0x14d/0x720
> > [] memcg_kmem_charge_memcg+0x3c/0xa0
> > [] memcg_kmem_charge+0x7e/0xd0
> > [] __alloc_pages_nodemask+0x178/0x260
> > [] alloc_pages_current+0x95/0x140
> > [] pte_alloc_one+0x17/0x40
> > [] __pte_alloc+0x1e/0x110
> > [] alloc_set_pte+0x5fe/0xc20
> > [] do_fault+0x103/0x970
> > [] handle_mm_fault+0x61e/0xd10
> > [] __do_page_fault+0x252/0x4d0
> > [] do_page_fault+0x30/0x80
> > [] page_fault+0x28/0x30
> > [] 0x
> > 
> > task2:
> > [] __lock_page+0x86/0xa0
> > [] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> > [] ext4_writepages+0x479/0xd60
> > [] do_writepages+0x1e/0x30
> > [] __writeback_single_inode+0x45/0x320
> > [] writeback_sb_inodes+0x272/0x600
> > [] __writeback_inodes_wb+0x92/0xc0
> > [] wb_writeback+0x268/0x300
> > [] wb_workfn+0xb4/0x390
> > [] process_one_work+0x189/0x420
> > [] worker_thread+0x4e/0x4b0
> > [] kthread+0xe6/0x100
> > [] ret_from_fork+0x41/0x50
> > [] 0x
> > 
> > He adds
> > : task1 is waiting for the PageWriteback bit of the page that task2 has
> > : collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
> > : bit of the page which task1 has locked.
> > 
> > More precisely task1 is handling a page fault and it has a page locked
> > while it charges a new page table to a memcg. That in turn hits a memory
> > limit reclaim and the memcg reclaim for legacy controller is waiting on
> > the writeback but that is never going to finish because the writeback
> > itself is waiting for the page locked in the #PF path. So this is
> > essentially an ABBA deadlock:
> > 					lock_page(A)
> > 					SetPageWriteback(A)
> > 					unlock_page(A)
> > lock_page(B)
> > 					lock_page(B)
> > pte_alloc_one
> >   shrink_page_list
> >     wait_on_page_writeback(A)
> > 					SetPageWriteback(B)
> > 					unlock_page(B)
> > 
> > 					# flush A, B to clear the writeback
> > 
> > This accumulation of more pages to flush is used by several filesystems
> > to generate more optimal IO patterns.
> > 
> > Waiting for the writeback in the legacy memcg controller is a workaround
> > for premature OOM killer invocations because there is no dirty IO
> > throttling available for the controller. There is no easy way around
> > that unfortunately. Therefore fix this specific issue by pre-allocating
> > the page table outside of the page lock. We already have handy
> > infrastructure for that, so simply reuse the fault-around pattern
> > which already does this.
> > 
> > There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> > made while a fs page is locked, but they should be really rare. I am not
> > aware of a better solution unfortunately.
> > 
> > Reported-and-Debugged-by: Liu Bo 
> > Cc: stable
> > Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
> > Signed-off-by: Michal Hocko 
> 
> Acked-by: Kirill A. Shutemov 

Thanks!

> Will you take care of converting vmf_insert_* to use the pre-allocated
> page table?

I can try, but I would appreciate it if somebody more familiar with the
code could do that. I am busy as hell and I do not want to promise
something I will likely not get to soon.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] mm, memcg: fix reclaim deadlock with writeback

2018-12-13 Thread Kirill A. Shutemov
On Thu, Dec 13, 2018 at 10:22:21AM +0100, Michal Hocko wrote:
> From: Michal Hocko 
> 
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [] wait_on_page_bit+0x82/0xa0
> [] shrink_page_list+0x907/0x960
> [] shrink_inactive_list+0x2c7/0x680
> [] shrink_node_memcg+0x404/0x830
> [] shrink_node+0xd8/0x300
> [] do_try_to_free_pages+0x10d/0x330
> [] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [] try_charge+0x14d/0x720
> [] memcg_kmem_charge_memcg+0x3c/0xa0
> [] memcg_kmem_charge+0x7e/0xd0
> [] __alloc_pages_nodemask+0x178/0x260
> [] alloc_pages_current+0x95/0x140
> [] pte_alloc_one+0x17/0x40
> [] __pte_alloc+0x1e/0x110
> [] alloc_set_pte+0x5fe/0xc20
> [] do_fault+0x103/0x970
> [] handle_mm_fault+0x61e/0xd10
> [] __do_page_fault+0x252/0x4d0
> [] do_page_fault+0x30/0x80
> [] page_fault+0x28/0x30
> [] 0x
> 
> task2:
> [] __lock_page+0x86/0xa0
> [] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [] ext4_writepages+0x479/0xd60
> [] do_writepages+0x1e/0x30
> [] __writeback_single_inode+0x45/0x320
> [] writeback_sb_inodes+0x272/0x600
> [] __writeback_inodes_wb+0x92/0xc0
> [] wb_writeback+0x268/0x300
> [] wb_workfn+0xb4/0x390
> [] process_one_work+0x189/0x420
> [] worker_thread+0x4e/0x4b0
> [] kthread+0xe6/0x100
> [] ret_from_fork+0x41/0x50
> [] 0x
> 
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
> : bit of the page which task1 has locked.
> 
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially an ABBA deadlock:
> 					lock_page(A)
> 					SetPageWriteback(A)
> 					unlock_page(A)
> lock_page(B)
> 					lock_page(B)
> pte_alloc_one
>   shrink_page_list
>     wait_on_page_writeback(A)
> 					SetPageWriteback(B)
> 					unlock_page(B)
> 
> 					# flush A, B to clear the writeback
> 
> This accumulation of more pages to flush is used by several filesystems
> to generate more optimal IO patterns.
> 
> Waiting for the writeback in the legacy memcg controller is a workaround
> for premature OOM killer invocations because there is no dirty IO
> throttling available for the controller. There is no easy way around
> that unfortunately. Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock. We already have handy
> infrastructure for that, so simply reuse the fault-around pattern
> which already does this.
> 
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> made while a fs page is locked, but they should be really rare. I am not
> aware of a better solution unfortunately.
> 
> Reported-and-Debugged-by: Liu Bo 
> Cc: stable
> Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
> Signed-off-by: Michal Hocko 

Acked-by: Kirill A. Shutemov 

Will you take care of converting vmf_insert_* to use the pre-allocated
page table?

-- 
 Kirill A. Shutemov