Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
Dave, I am sorry for delay. On 10/10, Dave Chinner wrote: > > So, it's time to waste more time explaining why lockdep is telling > us about something that *isn't a bug*. > [... snip ...] OK, thanks. I am not surprised although I have to admit I wasn't sure. > Basically, what we are seeing here is yet another case of "lockdep > is just smart enough to be really dumb" because we cannot fully > express or cleanly annotate the contexts in which it is being asked > to validate. Yes... perhaps we can add the new lockdep helpers to avoid the false- positives like this one, but so far it is not clear to me what we can do. Somehow we need to tell it to to avoid check_prev_add() because we know that the work function won't take sb_internal, but at the same time we should complain if it actually does this. Lets ignore this patch for now. Oleg.
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On Sun, Oct 09, 2016 at 06:14:57PM +0200, Oleg Nesterov wrote: > On 10/08, Dave Chinner wrote: > > > > On Fri, Oct 07, 2016 at 07:15:18PM +0200, Oleg Nesterov wrote: > > > > > > > > > > --- x/fs/xfs/xfs_trans.c > > > > > +++ x/fs/xfs/xfs_trans.c > > > > > @@ -245,7 +245,8 @@ xfs_trans_alloc( > > > > > atomic_inc(&mp->m_active_trans); > > > > > > > > > > tp = kmem_zone_zalloc(xfs_trans_zone, > > > > > - (flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP); > > > > > + (flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT)) > > > > > + ? KM_NOFS : KM_SLEEP); > > > > > tp->t_magic = XFS_TRANS_HEADER_MAGIC; > > > > > tp->t_flags = flags; > > > > > tp->t_mountp = mp; > > > > > > > > Brief examination says caller should set XFS_TRANS_NOFS, not change > > > > the implementation to make XFS_TRANS_NO_WRITECOUNT flag to also mean > > > > XFS_TRANS_NOFS. > > > > > > I didn't mean the change above can fix the problem, and I don't really > > > understand your suggestion. > > > > xfs_syncsb() does: > > > > tp = xfs_trans_alloc(... , XFS_TRANS_NO_WRITECOUNT, ); > > > > but it's running in a GFP_NOFS context when a freeze is being > > finalised. SO, rather than changing what XFS_TRANS_NO_WRITECOUNT > > does in xfs_trans_alloc(), we should tell it to do a GFP_NOFS > > allocation. i.e. > > > > tp = xfs_trans_alloc(... , XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT, > > ); > > Ah. This is clear but I am not sure it is enough, > > > > Obviously any GFP_FS allocation in xfs_fs_freeze() > > > paths will trigger the same warning. > > > > Of which there should be none except for that xfs_trans_alloc() > > call. > > Really? /Should/ is an indication of *intent*. Reality is that there may be *bugs*. We all know that testing can't prove the absence of bugs, so even after years and years of exercising the code with producing any evidence there may still be problems. So, it's time to waste more time explaining why lockdep is telling us about something that *isn't a bug*. > Again, I can be easily wrong, but when I look at xfs_freeze_fs() > paths I can see > > xfs_fs_freeze()->xfs_quiesce_attr()->xfs_log_quiesce()->xfs_log_unmount_write() > ->xfs_log_reserve()->xlog_ticket_alloc(KM_SLEEP) So, the problem being indicated here is that memory reclaim might either try to a) write back dirty pages (which require allocation transactions), b) might run a shrinker that requires running a transaction or c) we might run periodic background inode reclaim. For the case of a), this /can't happen/ because we've already run the part of a freeze that stops pages being dirtied and then written them all back and made them clean. So we won't run transactions from page cache reclaim and so we can't deadlock. For the case of b), well, that's even easier - the only shrinker path we care about here is inode reclaim through super_cache_scan(). Before that shrinker runs anything it calls: if (!trylock_super(sb)) return SHRINK_STOP; Now, we're running memory allocation for the freeze context, which means we are holding the sb->s_umount semaphore in write mode. That means the shrinker is going to /fail to lock the superblock/ and therefore not run any reclaim on that superblock. IOWs, while we hold the s_umount lock in write mode across a memory allocation, the shrinkers run in GFP_NOFS mode automatically. So we can't run transactions from memory reclaim and so we will can't deadlock. For the case of c), xfs_quiesce_attr() does this: /* force the log to unpin objects from the now complete transactions */ xfs_log_force(mp, XFS_LOG_SYNC); /* reclaim inodes to do any IO before the freeze completes */ xfs_reclaim_inodes(mp, 0); xfs_reclaim_inodes(mp, SYNC_WAIT); We basically unpin, clean, and reclaim all the unused inodes the XFS inode cache. With the shrinker not reclaiming any inodes, and there being no cached, dirty, unreclaimed inodes remaining in the cache, there can be no background memory allocations, transactions or IO triggered memory reclaim in this filesystem. At this point, memory reclaim /should never block/ trying to reclaim objects from this filesystem that require transactions to free. >From this, it should be obvious that we don't even need to change the code in xfs_syncsb() - in the freeze context that the allocation is run we've got a clean filesystem where memory reclaim won't block on the filesystem being frozen, so the code is safe as it stands. > > > just for testing purposes and after that I got another warning below. I > > > didn't > > > read it carefully yet, but _at first glance_ it looks like the lock > > > inversion > > > uncovered by 2/2, although I can be easily wrong. > > > cancel_delayed_work_sync(l_work) > > > under sb_internal can hang if xfs_log_worker() waits for this rwsem?` > > > > Actually: I *can't read it*. I've got no fucking clue what lockdep > > is trying to say
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On 10/08, Dave Chinner wrote: > > On Fri, Oct 07, 2016 at 07:15:18PM +0200, Oleg Nesterov wrote: > > > > > > > > --- x/fs/xfs/xfs_trans.c > > > > +++ x/fs/xfs/xfs_trans.c > > > > @@ -245,7 +245,8 @@ xfs_trans_alloc( > > > > atomic_inc(&mp->m_active_trans); > > > > > > > > tp = kmem_zone_zalloc(xfs_trans_zone, > > > > - (flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP); > > > > + (flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT)) > > > > + ? KM_NOFS : KM_SLEEP); > > > > tp->t_magic = XFS_TRANS_HEADER_MAGIC; > > > > tp->t_flags = flags; > > > > tp->t_mountp = mp; > > > > > > Brief examination says caller should set XFS_TRANS_NOFS, not change > > > the implementation to make XFS_TRANS_NO_WRITECOUNT flag to also mean > > > XFS_TRANS_NOFS. > > > > I didn't mean the change above can fix the problem, and I don't really > > understand your suggestion. > > xfs_syncsb() does: > > tp = xfs_trans_alloc(... , XFS_TRANS_NO_WRITECOUNT, ); > > but it's running in a GFP_NOFS context when a freeze is being > finalised. SO, rather than changing what XFS_TRANS_NO_WRITECOUNT > does in xfs_trans_alloc(), we should tell it to do a GFP_NOFS > allocation. i.e. > > tp = xfs_trans_alloc(... , XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT, > ); Ah. This is clear but I am not sure it is enough, > > Obviously any GFP_FS allocation in xfs_fs_freeze() > > paths will trigger the same warning. > > Of which there should be none except for that xfs_trans_alloc() > call. Really? Again, I can be easily wrong, but when I look at xfs_freeze_fs() paths I can see xfs_fs_freeze()->xfs_quiesce_attr()->xfs_log_quiesce()->xfs_log_unmount_write() ->xfs_log_reserve()->xlog_ticket_alloc(KM_SLEEP) at least. But I can test the change above, perhaps this call chain is not possible... > > I added this hack > > > > --- a/fs/xfs/xfs_super.c > > +++ b/fs/xfs/xfs_super.c > > @@ -1333,10 +1333,15 @@ xfs_fs_freeze( > > struct super_block *sb) > > { > > struct xfs_mount*mp = XFS_M(sb); > > + int ret; > > > > + current->flags |= PF_FSTRANS; // tell kmem_flags_convert() to > > remove GFP_FS > > xfs_save_resvblks(mp); > > xfs_quiesce_attr(mp); > > - return xfs_sync_sb(mp, true); > > + ret = xfs_sync_sb(mp, true); > > + current->flags &= ~PF_FSTRANS; > > + > > + return ret; > > } > > /me shudders don't worry, this debugging change won't escape my testing machine! > > just for testing purposes and after that I got another warning below. I > > didn't > > read it carefully yet, but _at first glance_ it looks like the lock > > inversion > > uncovered by 2/2, although I can be easily wrong. > > cancel_delayed_work_sync(l_work) > > under sb_internal can hang if xfs_log_worker() waits for this rwsem?` > > Actually: I *can't read it*. I've got no fucking clue what lockdep > is trying to say here. This /looks/ like a lockdep is getting > confused I can almost never understand what lockdep tells me, it is too clever for me. But this time I think it is right. Suppose that freeze_super() races with xfs_log_worker() callback. freeze_super() takes sb_internal lock and xfs_log_quiesce() calls cancel_delayed_work_sync(l_work). This will sleep until xfs_log_worker() finishes. xfs_log_worker() does a __GFP_FS alloc, triggers reclaim, and blocks on the same sb_internal lock. Say, in xfs_do_writepage()->xfs_trans_alloc() path. Deadlock. The worker thread can't take sb_internal hold by freeze_super(), cancel_delayed_work_sync() will sleep forever because xfs_log_worker() can't finish. So xfs_log_worker() should run in a GFP_NOFS context too. And perhaps the change above in xfs_trans_alloc() or in xfs_sync_sb() can help if it doesn't do other allocatiions, I dunno. Oleg.
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On Fri, Oct 07, 2016 at 07:15:18PM +0200, Oleg Nesterov wrote: > On 10/07, Dave Chinner wrote: > > > > On Thu, Oct 06, 2016 at 07:17:58PM +0200, Oleg Nesterov wrote: > > > Probably false positive? Although when I look at the comment above > > > xfs_sync_sb() > > > I think that may be sometging like below makes sense, but I know > > > absolutely nothing > > > about fs/ and XFS in particular. > > > > > > Oleg. > > > > > > > > > --- x/fs/xfs/xfs_trans.c > > > +++ x/fs/xfs/xfs_trans.c > > > @@ -245,7 +245,8 @@ xfs_trans_alloc( > > > atomic_inc(&mp->m_active_trans); > > > > > > tp = kmem_zone_zalloc(xfs_trans_zone, > > > - (flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP); > > > + (flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT)) > > > + ? KM_NOFS : KM_SLEEP); > > > tp->t_magic = XFS_TRANS_HEADER_MAGIC; > > > tp->t_flags = flags; > > > tp->t_mountp = mp; > > > > Brief examination says caller should set XFS_TRANS_NOFS, not change > > the implementation to make XFS_TRANS_NO_WRITECOUNT flag to also mean > > XFS_TRANS_NOFS. > > I didn't mean the change above can fix the problem, and I don't really > understand your suggestion. xfs_syncsb() does: tp = xfs_trans_alloc(... , XFS_TRANS_NO_WRITECOUNT, ); but it's running in a GFP_NOFS context when a freeze is being finalised. SO, rather than changing what XFS_TRANS_NO_WRITECOUNT does in xfs_trans_alloc(), we should tell it to do a GFP_NOFS allocation. i.e. tp = xfs_trans_alloc(... , XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT, ); and nothing inside xfs_trans_alloc() changes at all. > Obviously any GFP_FS allocation in xfs_fs_freeze() > paths will trigger the same warning. Of which there should be none except for that xfs_trans_alloc() call. > I added this hack > > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -1333,10 +1333,15 @@ xfs_fs_freeze( > struct super_block *sb) >{ > struct xfs_mount*mp = XFS_M(sb); > + int ret; > > + current->flags |= PF_FSTRANS; // tell kmem_flags_convert() to > remove GFP_FS > xfs_save_resvblks(mp); > xfs_quiesce_attr(mp); > - return xfs_sync_sb(mp, true); > + ret = xfs_sync_sb(mp, true); > + current->flags &= ~PF_FSTRANS; > + > + return ret; >} /me shudders > just for testing purposes and after that I got another warning below. I didn't > read it carefully yet, but _at first glance_ it looks like the lock inversion > uncovered by 2/2, although I can be easily wrong. > cancel_delayed_work_sync(l_work) > under sb_internal can hang if xfs_log_worker() waits for this rwsem?` Actually: I *can't read it*. I've got no fucking clue what lockdep is trying to say here. This /looks/ like a lockdep is getting confused between memory reclaim contexts (which /aren't locks/ but overload interrupt levels) and freeze contexts (which /aren't locks/) and workqueue locks which /aren't nested/ inside transactions or freeze contexts. But, really, I can't follow it because I have to guess at what "lock contexts" are not locks but are something else. cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On 10/07, Dave Chinner wrote: > > On Thu, Oct 06, 2016 at 07:17:58PM +0200, Oleg Nesterov wrote: > > Probably false positive? Although when I look at the comment above > > xfs_sync_sb() > > I think that may be sometging like below makes sense, but I know absolutely > > nothing > > about fs/ and XFS in particular. > > > > Oleg. > > > > > > --- x/fs/xfs/xfs_trans.c > > +++ x/fs/xfs/xfs_trans.c > > @@ -245,7 +245,8 @@ xfs_trans_alloc( > > atomic_inc(&mp->m_active_trans); > > > > tp = kmem_zone_zalloc(xfs_trans_zone, > > - (flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP); > > + (flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT)) > > + ? KM_NOFS : KM_SLEEP); > > tp->t_magic = XFS_TRANS_HEADER_MAGIC; > > tp->t_flags = flags; > > tp->t_mountp = mp; > > Brief examination says caller should set XFS_TRANS_NOFS, not change > the implementation to make XFS_TRANS_NO_WRITECOUNT flag to also mean > XFS_TRANS_NOFS. I didn't mean the change above can fix the problem, and I don't really understand your suggestion. Obviously any GFP_FS allocation in xfs_fs_freeze() paths will trigger the same warning. I added this hack --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1333,10 +1333,15 @@ xfs_fs_freeze( struct super_block *sb) { struct xfs_mount*mp = XFS_M(sb); + int ret; + current->flags |= PF_FSTRANS; // tell kmem_flags_convert() to remove GFP_FS xfs_save_resvblks(mp); xfs_quiesce_attr(mp); - return xfs_sync_sb(mp, true); + ret = xfs_sync_sb(mp, true); + current->flags &= ~PF_FSTRANS; + + return ret; } just for testing purposes and after that I got another warning below. I didn't read it carefully yet, but _at first glance_ it looks like the lock inversion uncovered by 2/2, although I can be easily wrong. cancel_delayed_work_sync(l_work) under sb_internal can hang if xfs_log_worker() waits for this rwsem?` Oleg. = [ INFO: possible irq lock inversion dependency detected ] 4.8.0+ #10 Tainted: GW - kswapd0/32 just changed the state of lock: (sb_internal){.?}, at: [] __sb_start_write+0xb7/0xf0 but this lock took another, RECLAIM_FS-unsafe lock in the past: ((&(&log->l_work)->work)){+.+.+.} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0CPU1 lock((&(&log->l_work)->work)); local_irq_disable(); lock(sb_internal); lock((&(&log->l_work)->work)); lock(sb_internal); *** DEADLOCK *** no locks held by kswapd0/32. the shortest dependencies between 2nd lock and 1st lock: -> ((&(&log->l_work)->work)){+.+.+.} ops: 803 { HARDIRQ-ON-W at: [] __lock_acquire+0x611/0x1870 [] lock_acquire+0x10d/0x200 [] process_one_work+0x1c9/0x690 [] worker_thread+0x4e/0x480 [] kthread+0xf3/0x110 [] ret_from_fork+0x1f/0x40 SOFTIRQ-ON-W at: [] __lock_acquire+0x643/0x1870 [] lock_acquire+0x10d/0x200 [] process_one_work+0x1c9/0x690 [] worker_thread+0x4e/0x480 [] kthread+0xf3/0x110 [] ret_from_fork+0x1f/0x40 RECLAIM_FS-ON-W at: [] mark_held_locks+0x6f/0xa0 [] lockdep_trace_alloc+0xd3/0x120 [] kmem_cache_alloc+0x2f/0x280 [] kmem_zone_alloc+0x81/0x120 [] xfs_trans_alloc+0x6c/0x130 [] xfs_sync_sb+0x39/0x80 [] xfs_log_worker+0xd4/0xf0 [] process_one_work+0x1f0/0x690 [] worker_thread+0x4e/0x480 [] kthread+0xf3/0x110 [] ret_from_fork+0x1f/0x40 INITIAL USE at: [] __lock_acquire+0x24b/0x1870 [] lock_acquire+0x10d/0x200 [] process_one_work+0x1c9/0x690 [] worker_thread+0x4e/0x480 [] kthread+0xf3/0x110 [] ret_from_fork+0x1f/0x40 } ... key at: [] __key.130101+0x0/0x8 ... acquired at: [] lock_acquire+0x10d/0x200 [] flush_work+0x4c/0x2c0 [] __cancel_work_timer+0x131/0x1f0 [] cancel_delayed_work_sync+0x13/0x20 [] xfs_log_quiesce+0x34/0x3b0 [] xfs_quiesce_attr+0x6d/0xb0 [] xfs_fs_freeze+0x33/0
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On 10/06, Johannes Weiner wrote: > > On Tue, Oct 04, 2016 at 01:48:00PM +0200, Michal Hocko wrote: > > Johannes is already looking into this > > http://lkml.kernel.org/r/20161004093216.ga21...@cmpxchg.org > > > > On Tue 04-10-16 13:43:43, Oleg Nesterov wrote: > > > because of kernel bug: > > > > > > [ 2730.242537] run fstests generic/274 at 2016-10-04 05:17:34 > > > [ 2730.738352] XFS (loop1): Mounting V5 Filesystem > > > [ 2730.741451] XFS (loop1): Ending clean mount > > > [ 2744.508698] [ cut here ] > > > [ 2744.509190] kernel BUG at ./include/linux/swap.h:276! > > > > > > static inline void workingset_node_shadows_dec(struct radix_tree_node > > > *node) > > > { > > > VM_BUG_ON(!workingset_node_shadows(node)); > > > node->count -= 1U << RADIX_TREE_COUNT_SHIFT; > > We tracked this one down and Linus merged a fix for this issue: > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d3798ae8c6f3767c726403c2ca6ecc317752c9dd Confirm, generic/274 no longer crashes the kernel. Thanks. Oleg.
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On Thu, Oct 06, 2016 at 07:17:58PM +0200, Oleg Nesterov wrote: > Probably false positive? Although when I look at the comment above > xfs_sync_sb() > I think that may be sometging like below makes sense, but I know absolutely > nothing > about fs/ and XFS in particular. > > Oleg. > > > --- x/fs/xfs/xfs_trans.c > +++ x/fs/xfs/xfs_trans.c > @@ -245,7 +245,8 @@ xfs_trans_alloc( > atomic_inc(&mp->m_active_trans); > > tp = kmem_zone_zalloc(xfs_trans_zone, > - (flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP); > + (flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT)) > + ? KM_NOFS : KM_SLEEP); > tp->t_magic = XFS_TRANS_HEADER_MAGIC; > tp->t_flags = flags; > tp->t_mountp = mp; Brief examination says caller should set XFS_TRANS_NOFS, not change the implementation to make XFS_TRANS_NO_WRITECOUNT flag to also mean XFS_TRANS_NOFS. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On 10/05, Oleg Nesterov wrote: > > On 10/05, Dave Chinner wrote: > > > > On Tue, Oct 04, 2016 at 01:43:43PM +0200, Oleg Nesterov wrote: > > > > > plus the following warnings: > > > > > > [ 1894.500040] run fstests generic/070 at 2016-10-04 05:03:39 > > > [ 1895.076655] = > > > [ 1895.077136] [ INFO: inconsistent lock state ] > > > [ 1895.077574] 4.8.0 #1 Not tainted > > > [ 1895.077900] - > > > [ 1895.078330] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} > > > usage. > > > [ 1895.078993] fsstress/18239 [HC0[0]:SC0[0]:HE1:SE1] takes: > > > [ 1895.079522] (&xfs_nondir_ilock_class){?-}, at: > > > [] xfs_ilock+0x165/0x210 [xfs] > > > [ 1895.080529] {IN-RECLAIM_FS-W} state was registered at: > > > > And that is a bug in the lockdep annotations for memory allocation because > > it > > fails to take into account the current task flags that are set via > > memalloc_noio_save() to prevent vmalloc from doing GFP_KERNEL allocations. > > i.e. > > in _xfs_buf_map_pages(): > > OK, I see... > > I'll re-test with the following change: > > --- a/kernel/locking/lockdep.c > +++ b/kernel/locking/lockdep.c > @@ -2867,7 +2867,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, > unsigned long flags) > return; > > /* We're only interested __GFP_FS allocations for now */ > - if (!(gfp_mask & __GFP_FS)) > + if ((curr->flags & PF_MEMALLOC_NOIO) || !(gfp_mask & __GFP_FS)) > return; > and with the change above "./check -b auto" finishes without lockdep warnings, probably I should send this patch to lockdep maintainers. Now, with 2/2 applied I got the following: [ INFO: inconsistent lock state ] 4.8.0+ #4 Tainted: GW - inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage. kswapd0/32 [HC0[0]:SC0[0]:HE1:SE1] takes: (sb_internal){+?}, at: [] __sb_start_write+0xb7/0xf0 {RECLAIM_FS-ON-W} state was registered at: [] mark_held_locks+0x6f/0xa0 [] lockdep_trace_alloc+0xd3/0x120 [] kmem_cache_alloc+0x2f/0x280 [] kmem_zone_alloc+0x81/0x120 [xfs] [] xfs_trans_alloc+0x6c/0x130 [xfs] [] xfs_sync_sb+0x39/0x80 [xfs] [] xfs_log_sbcount+0x4d/0x50 [xfs] [] xfs_quiesce_attr+0x57/0xb0 [xfs] [] xfs_fs_freeze+0x21/0x40 [xfs] [] freeze_super+0xcf/0x190 [] do_vfs_ioctl+0x55f/0x6c0 [] SyS_ioctl+0x79/0x90 [] entry_SYSCALL_64_fastpath+0x1f/0xbd irq event stamp: 36471805 hardirqs last enabled at (36471805): [] clear_page_dirty_for_io+0x1ed/0x2e0 hardirqs last disabled at (36471804): [] clear_page_dirty_for_io+0x1bd/0x2e0 softirqs last enabled at (36468590): [] __do_softirq+0x37a/0x44d softirqs last disabled at (36468579): [] irq_exit+0xe5/0xf0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 lock(sb_internal); lock(sb_internal); *** DEADLOCK *** no locks held by kswapd0/32. stack backtrace: CPU: 0 PID: 32 Comm: kswapd0 Tainted: GW 4.8.0+ #4 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 0086 028a3434 880139b2b520 91449193 880139b1a680 928c1e70 880139b2b570 91106c75 0001 88010001 000a Call Trace: [] dump_stack+0x85/0xc2 [] print_usage_bug+0x215/0x240 [] mark_lock+0x58b/0x650 [] ? print_shortest_lock_dependencies+0x1a0/0x1a0 [] __lock_acquire+0x36d/0x1870 [] lock_acquire+0x10d/0x200 [] ? __sb_start_write+0xb7/0xf0 [] percpu_down_read+0x3c/0x90 [] ? __sb_start_write+0xb7/0xf0 [] __sb_start_write+0xb7/0xf0 [] xfs_trans_alloc+0xe3/0x130 [xfs] [] xfs_iomap_write_allocate+0x1f7/0x380 [xfs] [] ? xfs_map_blocks+0xe3/0x380 [xfs] [] ? rcu_read_lock_sched_held+0x58/0x60 [] xfs_map_blocks+0x22a/0x380 [xfs] [] xfs_do_writepage+0x188/0x6c0 [xfs] [] xfs_vm_writepage+0x3b/0x70 [xfs] [] pageout.isra.46+0x190/0x380 [] shrink_page_list+0x9ab/0xa70 [] shrink_inactive_list+0x252/0x5d0 [] shrink_node_memcg+0x5af/0x790 [] shrink_node+0xe1/0x320 [] kswapd+0x387/0x8b0 Probably false positive? Although when I look at the comment above xfs_sync_sb() I think that may be sometging like below makes sense, but I know absolutely nothing about fs/ and XFS in particular. Oleg. --- x/fs/xfs/xfs_trans.c +++ x/fs/xfs/xfs_trans.c @@ -245,7 +245,8 @@ xfs_trans_alloc( atomic_inc(&mp->m_active_trans
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On Tue, Oct 04, 2016 at 01:48:00PM +0200, Michal Hocko wrote: > Johannes is already looking into this > http://lkml.kernel.org/r/20161004093216.ga21...@cmpxchg.org > > On Tue 04-10-16 13:43:43, Oleg Nesterov wrote: > > because of kernel bug: > > > > [ 2730.242537] run fstests generic/274 at 2016-10-04 05:17:34 > > [ 2730.738352] XFS (loop1): Mounting V5 Filesystem > > [ 2730.741451] XFS (loop1): Ending clean mount > > [ 2744.508698] [ cut here ] > > [ 2744.509190] kernel BUG at ./include/linux/swap.h:276! > > > > static inline void workingset_node_shadows_dec(struct radix_tree_node *node) > > { > > VM_BUG_ON(!workingset_node_shadows(node)); > > node->count -= 1U << RADIX_TREE_COUNT_SHIFT; We tracked this one down and Linus merged a fix for this issue: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d3798ae8c6f3767c726403c2ca6ecc317752c9dd Let me know if this still fires on kernels with that commit. Thanks!
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On Wed 05-10-16 18:44:32, Oleg Nesterov wrote: > On 10/05, Dave Chinner wrote: > > > > On Tue, Oct 04, 2016 at 01:43:43PM +0200, Oleg Nesterov wrote: > > > > > plus the following warnings: > > > > > > [ 1894.500040] run fstests generic/070 at 2016-10-04 05:03:39 > > > [ 1895.076655] = > > > [ 1895.077136] [ INFO: inconsistent lock state ] > > > [ 1895.077574] 4.8.0 #1 Not tainted > > > [ 1895.077900] - > > > [ 1895.078330] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} > > > usage. > > > [ 1895.078993] fsstress/18239 [HC0[0]:SC0[0]:HE1:SE1] takes: > > > [ 1895.079522] (&xfs_nondir_ilock_class){?-}, at: > > > [] xfs_ilock+0x165/0x210 [xfs] > > > [ 1895.080529] {IN-RECLAIM_FS-W} state was registered at: > > > > And that is a bug in the lockdep annotations for memory allocation because > > it > > fails to take into account the current task flags that are set via > > memalloc_noio_save() to prevent vmalloc from doing GFP_KERNEL allocations. > > i.e. > > in _xfs_buf_map_pages(): > > OK, I see... > > I'll re-test with the following change: > > --- a/kernel/locking/lockdep.c > +++ b/kernel/locking/lockdep.c > @@ -2867,7 +2867,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, > unsigned long flags) > return; > > /* We're only interested __GFP_FS allocations for now */ > - if (!(gfp_mask & __GFP_FS)) > + if ((curr->flags & PF_MEMALLOC_NOIO) || !(gfp_mask & __GFP_FS)) > return; > > > Hmm. This is off-topic and most probably I missed something... but at > first glance we can simplify/improve the reclaim-fs lockdep annotations: > > 1. add the global "struct lockdep_map reclaim_fs_map" > > 2. change __lockdep_trace_alloc > > - mark_held_locks(curr, RECLAIM_FS); > + lock_map_acquire(&reclaim_fs_map); > + lock_map_release(&reclaim_fs_map); > > 3. turn lockdep_set/clear_current_reclaim_state() into > > void lockdep_set_current_reclaim_state(gfp_t gfp_mask) > { > if (gfp_mask & __GFP_FS) > lock_map_acquire(&reclaim_fs_map); > } > > void lockdep_clear_current_reclaim_state(gfp_t gfp_mask) > { > if (gfp_mask & __GFP_FS) > lock_map_release(&reclaim_fs_map); > } > > and now we can remove task_struct->lockdep_reclaim_gfp and all other > RECLAIM_FS hacks in lockdep.c. Plus we can easily extend this logic to > check more GFP_ flags. Yeah, looks possible to me. I've added Peter to CC since he's most likely to know. Honza -- Jan Kara SUSE Labs, CR
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On 10/05, Dave Chinner wrote: > > On Tue, Oct 04, 2016 at 01:43:43PM +0200, Oleg Nesterov wrote: > > > plus the following warnings: > > > > [ 1894.500040] run fstests generic/070 at 2016-10-04 05:03:39 > > [ 1895.076655] = > > [ 1895.077136] [ INFO: inconsistent lock state ] > > [ 1895.077574] 4.8.0 #1 Not tainted > > [ 1895.077900] - > > [ 1895.078330] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} > > usage. > > [ 1895.078993] fsstress/18239 [HC0[0]:SC0[0]:HE1:SE1] takes: > > [ 1895.079522] (&xfs_nondir_ilock_class){?-}, at: > > [] xfs_ilock+0x165/0x210 [xfs] > > [ 1895.080529] {IN-RECLAIM_FS-W} state was registered at: > > And that is a bug in the lockdep annotations for memory allocation because it > fails to take into account the current task flags that are set via > memalloc_noio_save() to prevent vmalloc from doing GFP_KERNEL allocations. > i.e. > in _xfs_buf_map_pages(): OK, I see... I'll re-test with the following change: --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -2867,7 +2867,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags) return; /* We're only interested __GFP_FS allocations for now */ - if (!(gfp_mask & __GFP_FS)) + if ((curr->flags & PF_MEMALLOC_NOIO) || !(gfp_mask & __GFP_FS)) return; Hmm. This is off-topic and most probably I missed something... but at first glance we can simplify/improve the reclaim-fs lockdep annotations: 1. add the global "struct lockdep_map reclaim_fs_map" 2. change __lockdep_trace_alloc - mark_held_locks(curr, RECLAIM_FS); + lock_map_acquire(&reclaim_fs_map); + lock_map_release(&reclaim_fs_map); 3. turn lockdep_set/clear_current_reclaim_state() into void lockdep_set_current_reclaim_state(gfp_t gfp_mask) { if (gfp_mask & __GFP_FS) lock_map_acquire(&reclaim_fs_map); } void lockdep_clear_current_reclaim_state(gfp_t gfp_mask) { if (gfp_mask & __GFP_FS) lock_map_release(&reclaim_fs_map); } and now we can remove task_struct->lockdep_reclaim_gfp and all other RECLAIM_FS hacks in lockdep.c. Plus we can easily extend this logic to check more GFP_ flags. No? Oleg.
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On 10/05, Dave Chinner wrote: > > On Tue, Oct 04, 2016 at 06:58:27PM +0200, Oleg Nesterov wrote: > > I removed this test and then the next run (after reboot) hangs at xfs/073 > > with > > a lot of errors in dmesg like > > > > XFS (loop2): Failing async write on buffer block 0x9600790. Retrying > > async write. > > blk_update_request: I/O error, dev loop2, sector 8389920 > > loop: Write error at byte offset 4295647232, length 4096. > > tests will dump lots of errors into dmesg. That doesn't mean there's > a problem - many tests are designed to exercise error paths. That, > however, looks like a loop device problem. > > xfs/073 is using loop devices internally itself, so this ends up > with XFS on loop2 on XFS on loop1. That loop2 device is 100GB in > size, and the test copies the xfstests source tree to a loopback > filesystem image in $SCRATCH_DEV mounted on $TEST_DIR/$$. > > The issue is, most likely, that your TEST_DIR and SCRATCH_MNT are > rooted in the xfstests source tree. Hence copying the xfstests > source tree will also try to copy the 8GB image files into the > filesystems that the image files contain. Which, clearly, will > eventually result in ENOSPC errors when writing to the underlying > loop device... > > Put your TEST_DIR and SCRATCHMNT mount points outside the xfstests > directory, and this should go away. Yes, thanks, xfs/073 no longer hangs. Oleg.
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On Tue, Oct 04, 2016 at 06:58:27PM +0200, Oleg Nesterov wrote: > I removed this test and then the next run (after reboot) hangs at xfs/073 with > a lot of errors in dmesg like > > XFS (loop2): Failing async write on buffer block 0x9600790. Retrying > async write. > blk_update_request: I/O error, dev loop2, sector 8389920 > loop: Write error at byte offset 4295647232, length 4096. tests will dump lots of errors into dmesg. That doesn't mean there's a problem - many tests are designed to exercise error paths. That, however, looks like a loop device problem. xfs/073 is using loop devices internally itself, so this ends up with XFS on loop2 on XFS on loop1. That loop2 device is 100GB in size, and the test copies the xfstests source tree to a loopback filesystem image in $SCRATCH_DEV mounted on $TEST_DIR/$$. The issue is, most likely, that your TEST_DIR and SCRATCH_MNT are rooted in the xfstests source tree. Hence copying the xfstests source tree will also try to copy the 8GB image files into the filesystems that the image files contain. Which, clearly, will eventually result in ENOSPC errors when writing to the underlying loop device... Put your TEST_DIR and SCRATCHMNT mount points outside the xfstests directory, and this should go away. Most people use /mnt/test and /mnt/scratch for these Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On Tue, Oct 04, 2016 at 01:43:43PM +0200, Oleg Nesterov wrote: > On 10/03, Oleg Nesterov wrote: > > > > On 10/03, Dave Chinner wrote: > > > > > > On Fri, Sep 30, 2016 at 07:14:34PM +0200, Oleg Nesterov wrote: > > > > On 09/27, Oleg Nesterov wrote: > > > > > > > > > Jan, I gave up. > > > > > > > > Whatever I did xfstests-dev/check reports a lot of failures, kernel > > > > bugs, > > > > and finally either crashes the kernel or hangs. > > > > > > Did you run the auto group like I suggested? That's the set of tests > > > that should complete successfully with minimal failures and without > > > crashing the machine. > > > > OK, I'll reserve a testing machine again. > > Fedora 23 running in KVM guest > > kernel v4.8 clean, see .config below > > xfstests from git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git > last commit 0bbd20b104765c75faaf8e28a54c32df505ce7fd ("btrfs: test > free space tree mount options") > > script: > > dd if=/dev/zero of=TEST.img bs=1MiB count=8192 > dd if=/dev/zero of=SCRATCH.img bs=1MiB count=8192 > > losetup --find --show TEST.img > losetup --find --show SCRATCH.img > > mkfs.xfs -f /dev/loop0 > mkfs.xfs -f /dev/loop1 > > mkdir -p TEST SCRATCH > > mount /dev/loop0 TEST > mount /dev/loop1 SCRATCH > > TEST_DEV=/dev/loop0 TEST_DIR=TEST SCRATCH_DEV=/dev/loop1 > SCRATCH_MNT=SCRATCH ./check -g auto Looks sane. > This time it hangs after generic/274: > > --- tests/generic/274.out 2016-10-04 04:23:24.209006171 -0400 > +++ /root/XFS/xfstests-dev/results//generic/274.out.bad 2016-10-04 > 05:17:49.291742498 -0400 > @@ -3,3 +3,4 @@ > preallocation test > -- > done > +./common/rc: line 344: 4693 Segmentation fault exit > ... > (Run 'diff -u tests/generic/274.out > /root/XFS/xfstests-dev/results//generic/274.out.bad' to see the entire diff) > > because of kernel bug: > > [ 2730.242537] run fstests generic/274 at 2016-10-04 05:17:34 > [ 2730.738352] XFS (loop1): Mounting V5 Filesystem > [ 2730.741451] XFS (loop1): Ending clean mount > [ 2744.508698] [ cut here ] > [ 2744.509190] kernel BUG at ./include/linux/swap.h:276! > > static inline void workingset_node_shadows_dec(struct radix_tree_node *node) > { > VM_BUG_ON(!workingset_node_shadows(node)); > node->count -= 1U << RADIX_TREE_COUNT_SHIFT; > } Linus had a massive rant about this yesterday. Regression was introduced between v4.8-rc8 and v4.8 and it took Linus's machine down. xfstests found a filesystem related kernel bug, as it's supposed to. > plus the following warnings: > > [ 1894.500040] run fstests generic/070 at 2016-10-04 05:03:39 > [ 1895.076655] = > [ 1895.077136] [ INFO: inconsistent lock state ] > [ 1895.077574] 4.8.0 #1 Not tainted > [ 1895.077900] - > [ 1895.078330] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} > usage. > [ 1895.078993] fsstress/18239 [HC0[0]:SC0[0]:HE1:SE1] takes: > [ 1895.079522] (&xfs_nondir_ilock_class){?-}, at: > [] xfs_ilock+0x165/0x210 [xfs] > [ 1895.080529] {IN-RECLAIM_FS-W} state was registered at: [snip kswapd doing XFS inode reclaim] > [ 1895.102565] CPU: 0 PID: 18239 Comm: fsstress Not tainted 4.8.0 #1 > [ 1895.103160] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 > [ 1895.103721] 0086 8f20d9c8 880038b73818 > b6449133 > [ 1895.104486] 880038a5cd00 b78bcd70 880038b73868 > b6106c75 > [ 1895.105257] 0001 8801 > 0008 > [ 1895.106034] Call Trace: > [ 1895.106281] [] dump_stack+0x85/0xc2 > [ 1895.106789] [] print_usage_bug+0x215/0x240 > [ 1895.107352] [] mark_lock+0x58b/0x650 > [ 1895.107888] [] ? check_usage_forwards+0x160/0x160 > [ 1895.108506] [] mark_held_locks+0x6f/0xa0 > [ 1895.109085] [] lockdep_trace_alloc+0xca/0x110 > [ 1895.109695] [] > kmem_cache_alloc_node_trace+0x39/0x2a0 > [ 1895.110414] [] ? vm_map_ram+0x2de/0x490 > [ 1895.110995] [] vm_map_ram+0x2de/0x490 > [ 1895.111536] [] ? vm_map_ram+0x4b/0x490 > [ 1895.112142] [] _xfs_buf_map_pages+0x6c/0x110 [xfs] > [ 1895.112837] [] xfs_buf_get_map+0x2c7/0x470 [xfs] > [ 1895.113500] [] xfs_attr_rmtval_set+0x2c9/0x450 > [xfs] > [ 1895.114235] [] xfs_attr_node_addname+0x4c9/0x5d0 > [xfs] > [ 1895.114948] [] xfs_attr_set+0x3dc/0x460 [xfs] > [ 1895.115593] [] xfs_xattr_set+0x4f/0x90 [xfs] > [ 1895.116221] [] generic_setxattr+0x70/0x80 > [ 1895.116798] [] __vfs_setxattr_noperm+0xaf/0x1a0 > [ 1895.117445] [] ? security_inode_setxattr+0x65/0xb0 > [ 1895.118121] [] vfs_setxattr+0xa7/0xb0 > [ 1895
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On 10/04, Oleg Nesterov wrote: > > This time it hangs after generic/274: > > --- tests/generic/274.out 2016-10-04 04:23:24.209006171 -0400 > +++ /root/XFS/xfstests-dev/results//generic/274.out.bad 2016-10-04 > 05:17:49.291742498 -0400 > @@ -3,3 +3,4 @@ > preallocation test > -- > done > +./common/rc: line 344: 4693 Segmentation fault exit > ... > (Run 'diff -u tests/generic/274.out > /root/XFS/xfstests-dev/results//generic/274.out.bad' to see the entire diff) > > because of kernel bug: > > [ 2730.242537] run fstests generic/274 at 2016-10-04 05:17:34 > [ 2730.738352] XFS (loop1): Mounting V5 Filesystem > [ 2730.741451] XFS (loop1): Ending clean mount > [ 2744.508698] [ cut here ] > [ 2744.509190] kernel BUG at ./include/linux/swap.h:276! I removed this test and then the next run (after reboot) hangs at xfs/073 with a lot of errors in dmesg like XFS (loop2): Failing async write on buffer block 0x9600790. Retrying async write. blk_update_request: I/O error, dev loop2, sector 8389920 loop: Write error at byte offset 4295647232, length 4096. Oleg.
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On 10/03, Dave Chinner wrote: > > On Fri, Sep 30, 2016 at 07:14:34PM +0200, Oleg Nesterov wrote: > > On 09/27, Oleg Nesterov wrote: > > > > > > > > > It seems that generic/001 just hangs on my laptop. With or without this > > > change. > > > Or perhaps I didn't wait enough... > > > > /usr/bin/awk spins in user-mode forever, according to strace it doesn't do > > any syscalls. I didn't even try to investigate. > > Are you using gawk, or some other version of awk? gawk doesn't have > any problems gawk, awk is symlink. And somehow this depends on userspace environment, it doesn't hang on the full-blown Fedora 23. > > Jan, I gave up. > > > > Whatever I did xfstests-dev/check reports a lot of failures, kernel bugs, > > and finally either crashes the kernel or hangs. > > Did you run the auto group like I suggested? That's the set of tests > that should complete successfully with minimal failures and without > crashing the machine. OK, I'll reserve a testing machine again. Oleg.
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On Fri, Sep 30, 2016 at 07:14:34PM +0200, Oleg Nesterov wrote: > On 09/27, Oleg Nesterov wrote: > > > > On 09/27, Jan Kara wrote: > > > > > > You can run either: > > > > > > ./check -g freeze > > > > passed all 6 tests. > > > > > to check just the freezing tests or > > > > > > ./check > > > > > > to run all sensible tests which is what I'd do (but it will take couple of > > > hours to pass). If that passes, chances are good there are no easy false > > > positives. > > > > It seems that generic/001 just hangs on my laptop. With or without this > > change. > > Or perhaps I didn't wait enough... > > /usr/bin/awk spins in user-mode forever, according to strace it doesn't do > any syscalls. I didn't even try to investigate. Are you using gawk, or some other version of awk? gawk doesn't have any problems (it is listed as a prereq package in the README file) that anyone has reported, and there are a lot of people using it... > > Or perhaps something is wrong with my very > > limited testing environment. I'll reserve a testing machine tomorrow. > > Jan, I gave up. > > Whatever I did xfstests-dev/check reports a lot of failures, kernel bugs, > and finally either crashes the kernel or hangs. Did you run the auto group like I suggested? That's the set of tests that should complete successfully with minimal failures and without crashing the machine. If you're running this group and there's failures, hangs and crashes all over the place, then you need to start reporting bugs because that should not be happening on any kernel Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On 09/27, Oleg Nesterov wrote: > > On 09/27, Jan Kara wrote: > > > > You can run either: > > > > ./check -g freeze > > passed all 6 tests. > > > to check just the freezing tests or > > > > ./check > > > > to run all sensible tests which is what I'd do (but it will take couple of > > hours to pass). If that passes, chances are good there are no easy false > > positives. > > It seems that generic/001 just hangs on my laptop. With or without this > change. > Or perhaps I didn't wait enough... /usr/bin/awk spins in user-mode forever, according to strace it doesn't do any syscalls. I didn't even try to investigate. > Or perhaps something is wrong with my very > limited testing environment. I'll reserve a testing machine tomorrow. Jan, I gave up. Whatever I did xfstests-dev/check reports a lot of failures, kernel bugs, and finally either crashes the kernel or hangs. With or without this change, and I didn't notice any warning from lockdep. So I hope we can apply this patch. At least 1/2 which fixes a bug. Oleg.
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On 09/27, Jan Kara wrote: > > On Mon 26-09-16 18:55:25, Oleg Nesterov wrote: > > > > Heh ;) if only I knew how to test this... I ran the following script > > under qemu > > > > mkfs.xfs -f /dev/vda > > mkfs.xfs -f /dev/vdb > > > > mkdir -p TEST SCRATCH > > > > TEST_DEV=/dev/vda TEST_DIR=TEST SCRATCH_DEV=/dev/vdb > > SCRATCH_MNT=SCRATCH \ > > ./check `grep -il freeze tests/*/???` > > You can run either: > > ./check -g freeze passed all 6 tests. > to check just the freezing tests or > > ./check > > to run all sensible tests which is what I'd do (but it will take couple of > hours to pass). If that passes, chances are good there are no easy false > positives. It seems that generic/001 just hangs on my laptop. With or without this change. Or perhaps I didn't wait enough... Or perhaps something is wrong with my very limited testing environment. I'll reserve a testing machine tomorrow. > > And yes, I'm afraid this change can uncover some false positives later. > > But at the same time potentially it can find the real problems. > > Well, sure it's not an end of world if there is some false positive - we > can just revert the change - but lockdep false positives are always > annoying because they take time to analyze and until they are fixed, you > are unable to see other probles found by lockdep... Yes, yes, agreed. > > It would be nice to remove another hack in __sb_start_write under > > ifdef(CONFIG_LOCKDEP), but iirc XFS actually takes the same rw_sem twice > > for reading, so we can't do this. > > Yes, and I don't really consider this a hack. Ah, sorry, I didn't try to blame XFS/fs. I meant, this "force_trylock" hack doesn't look nice. Perhaps we can use rwsem_acquire_nest() instead. > Reviewed-by: Jan Kara Thanks! Oleg.
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On Tue, Sep 27, 2016 at 08:51:35AM +0200, Jan Kara wrote: > On Mon 26-09-16 18:55:25, Oleg Nesterov wrote: > > On 09/26, Jan Kara wrote: > > > > > > On Mon 26-09-16 18:08:06, Oleg Nesterov wrote: > > > > +/* > > > > + * Tell lockdep we are holding these locks before we call > > > > ->unfreeze_fs(sb). > > > > + */ > > > > +static void sb_freeze_acquire(struct super_block *sb) > > > > > > Can we call this lockdep_sb_freeze_acquire() or something like that so > > > that > > > it is clear this is only about lockdep annotations? Similarly with > > > sb_freeze_unlock()... > > > > OK, thanks, done. See V2 below. > > > > > and I hope you really tested > > > there are no more lockdep false positives ;). > > > > Heh ;) if only I knew how to test this... I ran the following script > > under qemu > > > > mkfs.xfs -f /dev/vda > > mkfs.xfs -f /dev/vdb > > > > mkdir -p TEST SCRATCH > > > > TEST_DEV=/dev/vda TEST_DIR=TEST SCRATCH_DEV=/dev/vdb > > SCRATCH_MNT=SCRATCH \ > > ./check `grep -il freeze tests/*/???` > > You can run either: > > ./check -g freeze > > to check just the freezing tests or > > ./check Better for regression testing is: check -g auto so that is skips all the tests that are broken or likely to crash the machine on some debug check. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
On Mon 26-09-16 18:55:25, Oleg Nesterov wrote: > On 09/26, Jan Kara wrote: > > > > On Mon 26-09-16 18:08:06, Oleg Nesterov wrote: > > > +/* > > > + * Tell lockdep we are holding these locks before we call > > > ->unfreeze_fs(sb). > > > + */ > > > +static void sb_freeze_acquire(struct super_block *sb) > > > > Can we call this lockdep_sb_freeze_acquire() or something like that so that > > it is clear this is only about lockdep annotations? Similarly with > > sb_freeze_unlock()... > > OK, thanks, done. See V2 below. > > > and I hope you really tested > > there are no more lockdep false positives ;). > > Heh ;) if only I knew how to test this... I ran the following script > under qemu > > mkfs.xfs -f /dev/vda > mkfs.xfs -f /dev/vdb > > mkdir -p TEST SCRATCH > > TEST_DEV=/dev/vda TEST_DIR=TEST SCRATCH_DEV=/dev/vdb > SCRATCH_MNT=SCRATCH \ > ./check `grep -il freeze tests/*/???` You can run either: ./check -g freeze to check just the freezing tests or ./check to run all sensible tests which is what I'd do (but it will take couple of hours to pass). If that passes, chances are good there are no easy false positives. > And yes, I'm afraid this change can uncover some false positives later. > But at the same time potentially it can find the real problems. Well, sure it's not an end of world if there is some false positive - we can just revert the change - but lockdep false positives are always annoying because they take time to analyze and until they are fixed, you are unable to see other probles found by lockdep... > It would be nice to remove another hack in __sb_start_write under > ifdef(CONFIG_LOCKDEP), but iirc XFS actually takes the same rw_sem twice > for reading, so we can't do this. Yes, and I don't really consider this a hack. Filesystem freezing was never meant to have all the properties of classical rwsems in kernel - I just happened to notice that the semantics is close to rwsems and so decided to model it as such in lockdep to get some lockdep verification for it. In particular "read lock recursion" has always been OK with freeze protection by design and I don't see strong motivation for changing that. > --- > From 70e774533ab6319f9b90a490b036150ad9604af7 Mon Sep 17 00:00:00 2001 > From: Oleg Nesterov > Date: Mon, 26 Sep 2016 17:23:24 +0200 > Subject: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and > thaw_super() paths > > sb_wait_write()->percpu_rwsem_release() fools lockdep to avoid the > false-positives. Now that xfs was fixed by Dave's commit dbad7c993053 > ("xfs: stop holding ILOCK over filldir callbacks") we can remove it and > change freeze_super() and thaw_super() to run with s_writers.rw_sem locks > held; we add two trivial helpers for that, lockdep_sb_freeze_release() > and lockdep_sb_freeze_acquire(). > > xfstests-dev/check `grep -il freeze tests/*/???` does not trigger any > warning from lockdep. > > Signed-off-by: Oleg Nesterov Thanks. The patch looks good. You can add: Reviewed-by: Jan Kara Honza > --- > fs/super.c | 37 + > 1 file changed, 25 insertions(+), 12 deletions(-) > > diff --git a/fs/super.c b/fs/super.c > index 2549896c..0447afe 100644 > --- a/fs/super.c > +++ b/fs/super.c > @@ -1214,25 +1214,34 @@ EXPORT_SYMBOL(__sb_start_write); > static void sb_wait_write(struct super_block *sb, int level) > { > percpu_down_write(sb->s_writers.rw_sem + level-1); > - /* > - * We are going to return to userspace and forget about this lock, the > - * ownership goes to the caller of thaw_super() which does unlock. > - * > - * FIXME: we should do this before return from freeze_super() after we > - * called sync_filesystem(sb) and s_op->freeze_fs(sb), and thaw_super() > - * should re-acquire these locks before s_op->unfreeze_fs(sb). However > - * this leads to lockdep false-positives, so currently we do the early > - * release right after acquire. > - */ > - percpu_rwsem_release(sb->s_writers.rw_sem + level-1, 0, _THIS_IP_); > } > > -static void sb_freeze_unlock(struct super_block *sb) > +/* > + * We are going to return to userspace and forget about these locks, the > + * ownership goes to the caller of thaw_super() which does unlock(). > + */ > +static void lockdep_sb_freeze_release(struct super_block *sb) > +{ > + int level; > + > + for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--) > + percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, > _THIS_IP_); > +} > + > +/* > + * Tell lockdep we are holding these locks before we call ->unfreeze_fs(sb). > + */ > +static void lockdep_sb_freeze_acquire(struct super_block *sb) > { > int level; > > for (level = 0; level < SB_FREEZE_LEVELS; ++level) > percpu_rwsem_acquire(s