Re: [PATCH v3] writeback: safer lock nesting
On Tue 10-04-18 13:48:37, Andrew Morton wrote:
> On Tue, 10 Apr 2018 08:33:57 +0200 Michal Hocko wrote:
> > > Reported-by: Wang Long
> > > Signed-off-by: Greg Thelen
> > > Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
> >
> > Not a stable material IMHO
>
> Why's that? Wang Long said he's observed the deadlock three times?

I thought it was just too unlikely to hit all the conditions. My fault, I
completely missed/forgot the fact that Wang Long is seeing the issue
happen. No real objection to the stable backport from me.
--
Michal Hocko
SUSE Labs
Re: [PATCH v3] writeback: safer lock nesting
On Tue, Apr 10, 2018 at 1:38 PM Andrew Morton wrote: > On Mon, 9 Apr 2018 17:59:08 -0700 Greg Thelen wrote: > > lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if > > the page's memcg is undergoing move accounting, which occurs when a > > process leaves its memcg for a new one that has > > memory.move_charge_at_immigrate set. > > > > unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the > > given inode is switching writeback domains. Switches occur when enough > > writes are issued from a new domain. > > > > This existing pattern is thus suspicious: > > lock_page_memcg(page); > > unlocked_inode_to_wb_begin(inode, &locked); > > ... > > unlocked_inode_to_wb_end(inode, locked); > > unlock_page_memcg(page); > > > > If both inode switch and process memcg migration are both in-flight then > > unlocked_inode_to_wb_end() will unconditionally enable interrupts while > > still holding the lock_page_memcg() irq spinlock. This suggests the > > possibility of deadlock if an interrupt occurs before > > unlock_page_memcg(). > > > > truncate > > __cancel_dirty_page > > lock_page_memcg > > unlocked_inode_to_wb_begin > > unlocked_inode_to_wb_end > > > > > > end_page_writeback > > test_clear_page_writeback > > lock_page_memcg > > > > unlock_page_memcg > > > > Due to configuration limitations this deadlock is not currently possible > > because we don't mix cgroup writeback (a cgroupv2 feature) and > > memory.move_charge_at_immigrate (a cgroupv1 feature). > > > > If the kernel is hacked to always claim inode switching and memcg > > moving_account, then this script triggers lockup in less than a minute: > > cd /mnt/cgroup/memory > > mkdir a b > > echo 1 > a/memory.move_charge_at_immigrate > > echo 1 > b/memory.move_charge_at_immigrate > > ( > > echo $BASHPID > a/cgroup.procs > > while true; do > > dd if=/dev/zero of=/mnt/big bs=1M count=256 > > done > > ) & > > while true; do > > sync > > done & > > sleep 1h & > > SLEEP=$! 
> > while true; do
> >     echo $SLEEP > a/cgroup.procs
> >     echo $SLEEP > b/cgroup.procs
> > done
> >
> > Given the deadlock is not currently possible, it's debatable if there's
> > any reason to modify the kernel. I suggest we do, to prevent future
> > surprises.
> >
> > ...
> >
> > Changelog since v2:
> > - explicitly initialize wb_lock_cookie to silence compiler warnings.
>
> But only in some places. What's up with that?

I annotated it in the places my compiler was complaining about. But you're
right. It's better to init all 4.

> > ...
> >
> > --- a/include/linux/backing-dev.h
> > +++ b/include/linux/backing-dev.h
> > @@ -346,7 +346,7 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
> >  /**
> >   * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
> >   * @inode: target inode
> > - * @lockedp: temp bool output param, to be passed to the end function
> > + * @cookie: output param, to be passed to the end function
> >   *
> >   * The caller wants to access the wb associated with @inode but isn't
> >   * holding inode->i_lock, mapping->tree_lock or wb->list_lock. This
> > @@ -354,12 +354,11 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
> >   * association doesn't change until the transaction is finished with
> >   * unlocked_inode_to_wb_end().
> >   *
> > - * The caller must call unlocked_inode_to_wb_end() with *@lockdep
> > - * afterwards and can't sleep during transaction. IRQ may or may not be
> > - * disabled on return.
> > + * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and
> > + * can't sleep during transaction. IRQ may or may not be disabled on return.
> >   */
>
> Grammar is a bit awkward here,

> > ...
> > > > --- a/mm/page-writeback.c > > +++ b/mm/page-writeback.c > > @@ -2501,13 +2501,13 @@ void account_page_redirty(struct page *page) > > if (mapping && mapping_cap_account_dirty(mapping)) { > > struct inode *inode = mapping->host; > > struct bdi_writeback *wb; > > - bool locked; > > + struct wb_lock_cookie cookie = {0}; > Trivia: it's better to use "= {}" here. That has the same effect and > it doesn't assume that the first field is a scalar. And indeed, the > first field is a bool so it should be {false}! Nod. Thanks for the tip. > So... > --- a/include/linux/backing-dev.h~writeback-safer-lock-nesting-fix > +++ a/include/linux/backing-dev.h > @@ -355,7 +355,8 @@ static inline struct bdi_writeback *inod > * unlocked_inode_to_wb_end(). > * > * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and > - * can't sleep during transaction. IRQ may o
Re: [PATCH v3] writeback: safer lock nesting
On Tue, Apr 10, 2018 at 1:15 AM Wang Long wrote: > > lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if > > the page's memcg is undergoing move accounting, which occurs when a > > process leaves its memcg for a new one that has > > memory.move_charge_at_immigrate set. > > > > unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the > > given inode is switching writeback domains. Switches occur when enough > > writes are issued from a new domain. > > > > This existing pattern is thus suspicious: > > lock_page_memcg(page); > > unlocked_inode_to_wb_begin(inode, &locked); > > ... > > unlocked_inode_to_wb_end(inode, locked); > > unlock_page_memcg(page); > > > > If both inode switch and process memcg migration are both in-flight then > > unlocked_inode_to_wb_end() will unconditionally enable interrupts while > > still holding the lock_page_memcg() irq spinlock. This suggests the > > possibility of deadlock if an interrupt occurs before > > unlock_page_memcg(). > > > > truncate > > __cancel_dirty_page > > lock_page_memcg > > unlocked_inode_to_wb_begin > > unlocked_inode_to_wb_end > > > > > > end_page_writeback > > test_clear_page_writeback > > lock_page_memcg > > > > unlock_page_memcg > > > > Due to configuration limitations this deadlock is not currently possible > > because we don't mix cgroup writeback (a cgroupv2 feature) and > > memory.move_charge_at_immigrate (a cgroupv1 feature). > > > > If the kernel is hacked to always claim inode switching and memcg > > moving_account, then this script triggers lockup in less than a minute: > >cd /mnt/cgroup/memory > >mkdir a b > >echo 1 > a/memory.move_charge_at_immigrate > >echo 1 > b/memory.move_charge_at_immigrate > >( > > echo $BASHPID > a/cgroup.procs > > while true; do > >dd if=/dev/zero of=/mnt/big bs=1M count=256 > > done > >) & > >while true; do > > sync > >done & > >sleep 1h & > >SLEEP=$! 
> > while true; do
> >     echo $SLEEP > a/cgroup.procs
> >     echo $SLEEP > b/cgroup.procs
> > done
> >
> > Given the deadlock is not currently possible, it's debatable if there's
> > any reason to modify the kernel. I suggest we do, to prevent future
> > surprises.
>
> This deadlock has occurred three times in our environment. It is better
> to cc stable and backport it.

That's interesting. Are you using cgroup v1 or v2? Do you enable
memory.move_charge_at_immigrate? I assume you've been using 4.4 stable.
I'll take a closer look at a 4.4 stable backport.
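The hazard discussed above can be modeled in plain userspace C. This is a hedged sketch, not kernel code: `fake_irq_save()`, `fake_irq_restore()`, `buggy_end()`, `fixed_begin()`, `fixed_end()`, and `demo()` are invented stand-ins for the `local_irq_save()`/`local_irq_restore()` semantics, used only to show why an end() helper that unconditionally re-enables IRQs is unsafe inside an outer IRQ-disabled section, and how a flags-carrying cookie avoids that.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the CPU interrupt-enable flag. */
static bool irqs_enabled = true;

/* Stand-in for local_irq_save(): disable IRQs, return prior state. */
static unsigned long fake_irq_save(void)
{
	unsigned long flags = irqs_enabled;

	irqs_enabled = false;
	return flags;
}

/* Stand-in for local_irq_restore(). */
static void fake_irq_restore(unsigned long flags)
{
	irqs_enabled = flags;
}

/* What the patch introduces: a cookie recording whether the inner lock
 * was taken and what the IRQ state was beforehand. */
struct wb_lock_cookie {
	bool locked;
	unsigned long flags;
};

/* Buggy inner end(): unconditionally re-enables IRQs, like a bare
 * spin_unlock_irq(). */
static void buggy_end(void)
{
	irqs_enabled = true;
}

static void fixed_begin(struct wb_lock_cookie *cookie)
{
	cookie->locked = true;
	cookie->flags = fake_irq_save();
}

static void fixed_end(struct wb_lock_cookie *cookie)
{
	if (cookie->locked)
		fake_irq_restore(cookie->flags);
}

/* Returns true if IRQs were (wrongly) enabled while still inside the
 * outer IRQ-disabled section -- the deadlock window. */
static bool demo(bool use_fix)
{
	unsigned long outer = fake_irq_save();	/* lock_page_memcg() analogue */
	bool leaked;

	if (use_fix) {
		struct wb_lock_cookie cookie = {};

		fixed_begin(&cookie);
		fixed_end(&cookie);	/* restores "disabled" state */
	} else {
		(void)fake_irq_save();	/* inner begin */
		buggy_end();		/* inner end: IRQs back on! */
	}
	leaked = irqs_enabled;
	fake_irq_restore(outer);
	return leaked;
}
```

Running `demo(false)` reproduces the window the changelog describes; `demo(true)` shows the cookie keeping the outer section IRQ-disabled throughout.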
Re: [PATCH v3] writeback: safer lock nesting
On Tue, 10 Apr 2018 08:33:57 +0200 Michal Hocko wrote:
> > Reported-by: Wang Long
> > Signed-off-by: Greg Thelen
> > Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
>
> Not a stable material IMHO

Why's that? Wang Long said he's observed the deadlock three times?
Re: [PATCH v3] writeback: safer lock nesting
On Mon, 9 Apr 2018 17:59:08 -0700 Greg Thelen wrote: > lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if > the page's memcg is undergoing move accounting, which occurs when a > process leaves its memcg for a new one that has > memory.move_charge_at_immigrate set. > > unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the > given inode is switching writeback domains. Switches occur when enough > writes are issued from a new domain. > > This existing pattern is thus suspicious: > lock_page_memcg(page); > unlocked_inode_to_wb_begin(inode, &locked); > ... > unlocked_inode_to_wb_end(inode, locked); > unlock_page_memcg(page); > > If both inode switch and process memcg migration are both in-flight then > unlocked_inode_to_wb_end() will unconditionally enable interrupts while > still holding the lock_page_memcg() irq spinlock. This suggests the > possibility of deadlock if an interrupt occurs before > unlock_page_memcg(). > > truncate > __cancel_dirty_page > lock_page_memcg > unlocked_inode_to_wb_begin > unlocked_inode_to_wb_end > > > end_page_writeback > test_clear_page_writeback > lock_page_memcg > > unlock_page_memcg > > Due to configuration limitations this deadlock is not currently possible > because we don't mix cgroup writeback (a cgroupv2 feature) and > memory.move_charge_at_immigrate (a cgroupv1 feature). > > If the kernel is hacked to always claim inode switching and memcg > moving_account, then this script triggers lockup in less than a minute: > cd /mnt/cgroup/memory > mkdir a b > echo 1 > a/memory.move_charge_at_immigrate > echo 1 > b/memory.move_charge_at_immigrate > ( > echo $BASHPID > a/cgroup.procs > while true; do > dd if=/dev/zero of=/mnt/big bs=1M count=256 > done > ) & > while true; do > sync > done & > sleep 1h & > SLEEP=$! 
> while true; do > echo $SLEEP > a/cgroup.procs > echo $SLEEP > b/cgroup.procs > done > > Given the deadlock is not currently possible, it's debatable if there's > any reason to modify the kernel. I suggest we should to prevent future > surprises. > > ... > > Changelog since v2: > - explicitly initialize wb_lock_cookie to silence compiler warnings. But only in some places. What's up with that? > > ... > > --- a/include/linux/backing-dev.h > +++ b/include/linux/backing-dev.h > @@ -346,7 +346,7 @@ static inline struct bdi_writeback *inode_to_wb(const > struct inode *inode) > /** > * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction > * @inode: target inode > - * @lockedp: temp bool output param, to be passed to the end function > + * @cookie: output param, to be passed to the end function > * > * The caller wants to access the wb associated with @inode but isn't > * holding inode->i_lock, mapping->tree_lock or wb->list_lock. This > @@ -354,12 +354,11 @@ static inline struct bdi_writeback *inode_to_wb(const > struct inode *inode) > * association doesn't change until the transaction is finished with > * unlocked_inode_to_wb_end(). > * > - * The caller must call unlocked_inode_to_wb_end() with *@lockdep > - * afterwards and can't sleep during transaction. IRQ may or may not be > - * disabled on return. > + * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards > and > + * can't sleep during transaction. IRQ may or may not be disabled on return. > */ Grammar is a bit awkward here, > > ... > > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -2501,13 +2501,13 @@ void account_page_redirty(struct page *page) > if (mapping && mapping_cap_account_dirty(mapping)) { > struct inode *inode = mapping->host; > struct bdi_writeback *wb; > - bool locked; > + struct wb_lock_cookie cookie = {0}; Trivia: it's better to use "= {}" here. That has the same effect and it doesn't assume that the first field is a scalar. 
And indeed, the first field is a bool so it should be {false}! So... --- a/include/linux/backing-dev.h~writeback-safer-lock-nesting-fix +++ a/include/linux/backing-dev.h @@ -355,7 +355,8 @@ static inline struct bdi_writeback *inod * unlocked_inode_to_wb_end(). * * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and - * can't sleep during transaction. IRQ may or may not be disabled on return. + * can't sleep during the transaction. IRQs may or may not be disabled on + * return. */ static inline struct bdi_writeback * unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie) --- a/mm/page-writeback.c~writeback-safer-lock-nesting-fix +++ a/mm/page-writeback.c @@ -2501,7 +2501,7 @@ void account_page_redirty(struct page *p if (mapping && mappi
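Andrew's initializer tip can be illustrated with a small standalone sketch (the struct is taken from the patch; `make_cookie()` is a hypothetical helper, not kernel code). `= {0}` zero-fills the whole struct but textually assumes the first member is a scalar; `= {}` (a GNU C extension, standardized in C23) and an initializer matching the first member's actual type both express the intent without that assumption:

```c
#include <assert.h>
#include <stdbool.h>

struct wb_lock_cookie {
	bool locked;
	unsigned long flags;
};

/* Hypothetical helper demonstrating the initializer styles discussed
 * in the review above. */
static struct wb_lock_cookie make_cookie(void)
{
	/* "= {0}" would also zero-fill, but reads as if the first member
	 * were a scalar 0. Initializing with the member's own type (or a
	 * designated initializer) is clearer: */
	struct wb_lock_cookie cookie = { .locked = false };

	/* Members without an explicit initializer (here .flags) are
	 * zero-initialized. */
	return cookie;
}
```

Either spelling leaves `.flags` zeroed, which is what silences the "may be used uninitialized" warnings mentioned in the v2 changelog.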
Re: [PATCH v3] writeback: safer lock nesting
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
given inode is switching writeback domains. Switches occur when enough
writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If an inode switch and a process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock. This suggests the
possibility of deadlock if an interrupt occurs before
unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    <interrupts mistakenly enabled>
                    <interrupt>
                    end_page_writeback
                    test_clear_page_writeback
                    lock_page_memcg
                    <deadlock>
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:

    cd /mnt/cgroup/memory
    mkdir a b
    echo 1 > a/memory.move_charge_at_immigrate
    echo 1 > b/memory.move_charge_at_immigrate
    (
        echo $BASHPID > a/cgroup.procs
        while true; do
            dd if=/dev/zero of=/mnt/big bs=1M count=256
        done
    ) &
    while true; do
        sync
    done &
    sleep 1h &
    SLEEP=$!
    while true; do
        echo $SLEEP > a/cgroup.procs
        echo $SLEEP > b/cgroup.procs
    done

Given the deadlock is not currently possible, it's debatable if there's
any reason to modify the kernel. I suggest we do, to prevent future
surprises.

This deadlock has occurred three times in our environment.
It is better to cc stable kernel and backport it.

Acked-by: Wang Long

thanks

Reported-by: Wang Long
Signed-off-by: Greg Thelen
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
---
Changelog since v2:
- explicitly initialize wb_lock_cookie to silence compiler warnings.

Changelog since v1:
- add wb_lock_cookie to record lock context.

 fs/fs-writeback.c                |  7 ++++---
 include/linux/backing-dev-defs.h |  5 +++++
 include/linux/backing-dev.h      | 30 ++++++++++++++++--------------
 mm/page-writeback.c              | 18 +++++++++---------
 4 files changed, 34 insertions(+), 26 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1280f915079b..f4b2f6625913 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -745,11 +745,12 @@ int inode_congested(struct inode *inode, int cong_bits)
 	 */
 	if (inode && inode_to_wb_is_valid(inode)) {
 		struct bdi_writeback *wb;
-		bool locked, congested;
+		struct wb_lock_cookie lock_cookie;
+		bool congested;
 
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
 		congested = wb_congested(wb, cong_bits);
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &lock_cookie);
 		return congested;
 	}
 
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index bfe86b54f6c1..0bd432a4d7bd 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -223,6 +223,11 @@ static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 	set_wb_congested(bdi->wb.congested, sync);
 }
 
+struct wb_lock_cookie {
+	bool locked;
+	unsigned long flags;
+};
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3e4ce54d84ab..1d744c61d996 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -346,7 +346,7 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
 /**
  * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
  * @inode: target inode
- * @lockedp: temp bool output param, to be passed to the end function
+ * @cookie: output param, to be passed to the end function
  *
  * The caller wants to access the wb associated with @inode but isn't
  * holding inode->i_lock, mapping->tree_lock or wb->list_lock. This
@@ -354,12 +354,11 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
  * associatio
Re: [PATCH v3] writeback: safer lock nesting
On Mon 09-04-18 17:59:08, Greg Thelen wrote: > lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if > the page's memcg is undergoing move accounting, which occurs when a > process leaves its memcg for a new one that has > memory.move_charge_at_immigrate set. > > unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the > given inode is switching writeback domains. Switches occur when enough > writes are issued from a new domain. > > This existing pattern is thus suspicious: > lock_page_memcg(page); > unlocked_inode_to_wb_begin(inode, &locked); > ... > unlocked_inode_to_wb_end(inode, locked); > unlock_page_memcg(page); > > If both inode switch and process memcg migration are both in-flight then > unlocked_inode_to_wb_end() will unconditionally enable interrupts while > still holding the lock_page_memcg() irq spinlock. This suggests the > possibility of deadlock if an interrupt occurs before > unlock_page_memcg(). > > truncate > __cancel_dirty_page > lock_page_memcg > unlocked_inode_to_wb_begin > unlocked_inode_to_wb_end > > > end_page_writeback > test_clear_page_writeback > lock_page_memcg > > unlock_page_memcg > > Due to configuration limitations this deadlock is not currently possible > because we don't mix cgroup writeback (a cgroupv2 feature) and > memory.move_charge_at_immigrate (a cgroupv1 feature). > > If the kernel is hacked to always claim inode switching and memcg > moving_account, then this script triggers lockup in less than a minute: > cd /mnt/cgroup/memory > mkdir a b > echo 1 > a/memory.move_charge_at_immigrate > echo 1 > b/memory.move_charge_at_immigrate > ( > echo $BASHPID > a/cgroup.procs > while true; do > dd if=/dev/zero of=/mnt/big bs=1M count=256 > done > ) & > while true; do > sync > done & > sleep 1h & > SLEEP=$! > while true; do > echo $SLEEP > a/cgroup.procs > echo $SLEEP > b/cgroup.procs > done > Very nice changelog! 
> Given the deadlock is not currently possible, it's debatable if there's > any reason to modify the kernel. I suggest we should to prevent future > surprises. Agreed! > Reported-by: Wang Long > Signed-off-by: Greg Thelen > Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613 Not a stable material IMHO but Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates") AFAIU Acked-by: Michal Hocko Thanks! > --- > Changelog since v2: > - explicitly initialize wb_lock_cookie to silence compiler warnings. > > Changelog since v1: > - add wb_lock_cookie to record lock context. > > fs/fs-writeback.c| 7 --- > include/linux/backing-dev-defs.h | 5 + > include/linux/backing-dev.h | 30 -- > mm/page-writeback.c | 18 +- > 4 files changed, 34 insertions(+), 26 deletions(-) > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 1280f915079b..f4b2f6625913 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -745,11 +745,12 @@ int inode_congested(struct inode *inode, int cong_bits) >*/ > if (inode && inode_to_wb_is_valid(inode)) { > struct bdi_writeback *wb; > - bool locked, congested; > + struct wb_lock_cookie lock_cookie; > + bool congested; > > - wb = unlocked_inode_to_wb_begin(inode, &locked); > + wb = unlocked_inode_to_wb_begin(inode, &lock_cookie); > congested = wb_congested(wb, cong_bits); > - unlocked_inode_to_wb_end(inode, locked); > + unlocked_inode_to_wb_end(inode, &lock_cookie); > return congested; > } > > diff --git a/include/linux/backing-dev-defs.h > b/include/linux/backing-dev-defs.h > index bfe86b54f6c1..0bd432a4d7bd 100644 > --- a/include/linux/backing-dev-defs.h > +++ b/include/linux/backing-dev-defs.h > @@ -223,6 +223,11 @@ static inline void set_bdi_congested(struct > backing_dev_info *bdi, int sync) > set_wb_congested(bdi->wb.congested, sync); > } > > +struct wb_lock_cookie { > + bool locked; > + unsigned long flags; > +}; > + > #ifdef CONFIG_CGROUP_WRITEBACK > > /** > diff --git 
a/include/linux/backing-dev.h b/include/linux/backing-dev.h > index 3e4ce54d84ab..1d744c61d996 100644 > --- a/include/linux/backing-dev.h > +++ b/include/linux/backing-dev.h > @@ -346,7 +346,7 @@ static inline struct bdi_writeback *inode_to_wb(const > struct inode *inode) > /** > * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction > * @inode: target inode > - * @lockedp: temp bool output param, to be passed to the end function > + * @cookie: output param, to be passed to the end function > * >