Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
Al Viro: > Different ->i_mutex; you are holding one on the parent directory already. Let me make sure. In your scenario, - processA writes something into the union, and the unioning fs operates the writable layer. After sb_start_write() succeeds, processA should not block by the reason of fsfreeze. - processC causes the copyup. The current aufs implementation holds parent->i_mutex on the writable layer during the copyup. The parent->i_mutex can make processA blocking. Now I am considering the copyup approach you suggested in another mail, and I am going to replace your "unlink the target, re-link later, no dir lock during copyup" by "make it hidden instead of unlinking, rename the correct name later, no dir lock during copyup" since I am not sure all FSs can operate "->link with the unlinked one". I guess most FS can handle it, but I don't want to make sure everything particulary remote fs, journals. To make a file "hidden", I guess I can use the aufs "doubley whiteouted" approach. As you might know, aufs prepends the ".wh." prefix to the filename as whiteout. With one more prefix, the name loses the role of whiteout. Aufs simply ignores such doubly whiteouted name. The demerit is that aufs has to limit the length of the name. J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Sat, Mar 23, 2013 at 11:49:11AM +0900, J. R. Okajima wrote: > > Al Viro: > > The scenario, BTW, looks so: > > process A does sb_start_write() (on your upper layer) > > process B tries to freeze said upper layer and blocks, waiting for A to > > finish > > process C grabs ->i_mutex in your upper layer > > process C does vfs_write(), which blocks, since there's a pending attempt to > > freeze > > process A tries to grab ->i_mutex held by C and blocks > > According to latest mm/filemap.c:generic_file_aio_write(), > sb_start_write(inode->i_sb); > mutex_lock(>i_mutex); > ret = __generic_file_aio_write(iocb, iov, nr_segs, >ki_pos); > mutex_unlock(>i_mutex); > ::: > sb_end_write(inode->i_sb); > > Process C would block *BEFORE* i_mutex by sb_start_write()? No? Different ->i_mutex; you are holding one on the parent directory already. That's the problem - you have ->i_mutex nested both inside that sucker (as it ought to) and outside. Which tends to do bad things, obviously, in particular because something like mkdir(2) will do sb_start_write() (from mnt_want_write(), called by kern_path_create()) before grabbing directory ->i_mutex. Thus the activity with lifting the bastard out of ->aio_write(), etc. in vfs.git#experimental - *any* union-like variant will need the ability to pull sb_start_write() outside of locking the parent directory on copyup. And yes, it's a common prerequisite to anything doing copyups - aufs is in the same boat as overlayfs and unionmount. Same deadlock for all three of them. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
Al Viro: > The scenario, BTW, looks so: > process A does sb_start_write() (on your upper layer) > process B tries to freeze said upper layer and blocks, waiting for A to finish > process C grabs ->i_mutex in your upper layer > process C does vfs_write(), which blocks, since there's a pending attempt to > freeze > process A tries to grab ->i_mutex held by C and blocks According to latest mm/filemap.c:generic_file_aio_write(), sb_start_write(inode->i_sb); mutex_lock(>i_mutex); ret = __generic_file_aio_write(iocb, iov, nr_segs, >ki_pos); mutex_unlock(>i_mutex); ::: sb_end_write(inode->i_sb); Process C would block *BEFORE* i_mutex by sb_start_write()? No? Honestly speaking I didn't pay attention about the freeze feature since I don't use it. I am making aufs to support it now. But I don't know how to test it... J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Fri, Mar 22, 2013 at 06:11:11PM +, Al Viro wrote: > On Sat, Mar 23, 2013 at 02:37:55AM +0900, J. R. Okajima wrote: > > > > David Howells: > > > Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock > > > might operate, so it's possible that this is a false alarm. Maybe Jan > > > Kara can > > > illuminate further, so I've added him to the cc list. > > > > It is related to the design of UnionMount, isn't it? > > UnionMount is not a filesystem and doen't have its own superblock. > > If it was a fs, then > > - vfs_truncate() acquires sb_writers for the unioning-fs. > > - the unioning-fs may call vfs_truncate() again for the underlying fs. > > - this time, sb_writers is for the underlying fs which is a different > > sb_writers object from the already acquired one. > > So there would be no deadlock. > > Doesn't help the situation with copyup - witness overlayfs stepping into the > same deadlock on copyup. It wants ->i_mutex held on directory in upper layer > and it tries to write to file it has created in there. The problem is > with the upper layer superblock getting frozen; having a separate one for > union is irrelevant. Let me check how aufs does... Aha. Your > au_do_copy_file() ends up calling vfs_write() on the file opened in > upper layer. And AFAICS it's called with ->i_mutex held on the directory > in upper layer, so you've got the same deadlock, sorry. The scenario, BTW, looks so: process A does sb_start_write() (on your upper layer) process B tries to freeze said upper layer and blocks, waiting for A to finish process C grabs ->i_mutex in your upper layer process C does vfs_write(), which blocks, since there's a pending attempt to freeze process A tries to grab ->i_mutex held by C and blocks -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Sat, Mar 23, 2013 at 02:37:55AM +0900, J. R. Okajima wrote: > > David Howells: > > Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock > > might operate, so it's possible that this is a false alarm. Maybe Jan Kara > > can > > illuminate further, so I've added him to the cc list. > > It is related to the design of UnionMount, isn't it? > UnionMount is not a filesystem and doen't have its own superblock. > If it was a fs, then > - vfs_truncate() acquires sb_writers for the unioning-fs. > - the unioning-fs may call vfs_truncate() again for the underlying fs. > - this time, sb_writers is for the underlying fs which is a different > sb_writers object from the already acquired one. > So there would be no deadlock. Doesn't help the situation with copyup - witness overlayfs stepping into the same deadlock on copyup. It wants ->i_mutex held on directory in upper layer and it tries to write to file it has created in there. The problem is with the upper layer superblock getting frozen; having a separate one for union is irrelevant. Let me check how aufs does... Aha. Your au_do_copy_file() ends up calling vfs_write() on the file opened in upper layer. And AFAICS it's called with ->i_mutex held on the directory in upper layer, so you've got the same deadlock, sorry. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
David Howells: > Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock > might operate, so it's possible that this is a false alarm. Maybe Jan Kara > can > illuminate further, so I've added him to the cc list. It is related to the design of UnionMount, isn't it? UnionMount is not a filesystem and doen't have its own superblock. If it was a fs, then - vfs_truncate() acquires sb_writers for the unioning-fs. - the unioning-fs may call vfs_truncate() again for the underlying fs. - this time, sb_writers is for the underlying fs which is a different sb_writers object from the already acquired one. So there would be no deadlock. Still lockdep will produce the message since sb_writers doesn't have the lockdep class. Of course, we can introduce the lock class for it, or call lockdep_off()/on() simply in order to stop the message. But, as long as the unioning feature is not implemented as a fs, the solution will not be so easy. I am afraid UnionMount will need to introduce a new counter (or a new flag) to indicate the task entered the union, and adjust the lock class or decide to call lockdep_off() for sb_writers. I don't think it is a good idea. J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
David Howells: Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock might operate, so it's possible that this is a false alarm. Maybe Jan Kara can illuminate further, so I've added him to the cc list. It is related to the design of UnionMount, isn't it? UnionMount is not a filesystem and doen't have its own superblock. If it was a fs, then - vfs_truncate() acquires sb_writers for the unioning-fs. - the unioning-fs may call vfs_truncate() again for the underlying fs. - this time, sb_writers is for the underlying fs which is a different sb_writers object from the already acquired one. So there would be no deadlock. Still lockdep will produce the message since sb_writers doesn't have the lockdep class. Of course, we can introduce the lock class for it, or call lockdep_off()/on() simply in order to stop the message. But, as long as the unioning feature is not implemented as a fs, the solution will not be so easy. I am afraid UnionMount will need to introduce a new counter (or a new flag) to indicate the task entered the union, and adjust the lock class or decide to call lockdep_off() for sb_writers. I don't think it is a good idea. J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Sat, Mar 23, 2013 at 02:37:55AM +0900, J. R. Okajima wrote: David Howells: Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock might operate, so it's possible that this is a false alarm. Maybe Jan Kara can illuminate further, so I've added him to the cc list. It is related to the design of UnionMount, isn't it? UnionMount is not a filesystem and doen't have its own superblock. If it was a fs, then - vfs_truncate() acquires sb_writers for the unioning-fs. - the unioning-fs may call vfs_truncate() again for the underlying fs. - this time, sb_writers is for the underlying fs which is a different sb_writers object from the already acquired one. So there would be no deadlock. Doesn't help the situation with copyup - witness overlayfs stepping into the same deadlock on copyup. It wants -i_mutex held on directory in upper layer and it tries to write to file it has created in there. The problem is with the upper layer superblock getting frozen; having a separate one for union is irrelevant. Let me check how aufs does... Aha. Your au_do_copy_file() ends up calling vfs_write() on the file opened in upper layer. And AFAICS it's called with -i_mutex held on the directory in upper layer, so you've got the same deadlock, sorry. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Fri, Mar 22, 2013 at 06:11:11PM +, Al Viro wrote: On Sat, Mar 23, 2013 at 02:37:55AM +0900, J. R. Okajima wrote: David Howells: Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock might operate, so it's possible that this is a false alarm. Maybe Jan Kara can illuminate further, so I've added him to the cc list. It is related to the design of UnionMount, isn't it? UnionMount is not a filesystem and doen't have its own superblock. If it was a fs, then - vfs_truncate() acquires sb_writers for the unioning-fs. - the unioning-fs may call vfs_truncate() again for the underlying fs. - this time, sb_writers is for the underlying fs which is a different sb_writers object from the already acquired one. So there would be no deadlock. Doesn't help the situation with copyup - witness overlayfs stepping into the same deadlock on copyup. It wants -i_mutex held on directory in upper layer and it tries to write to file it has created in there. The problem is with the upper layer superblock getting frozen; having a separate one for union is irrelevant. Let me check how aufs does... Aha. Your au_do_copy_file() ends up calling vfs_write() on the file opened in upper layer. And AFAICS it's called with -i_mutex held on the directory in upper layer, so you've got the same deadlock, sorry. The scenario, BTW, looks so: process A does sb_start_write() (on your upper layer) process B tries to freeze said upper layer and blocks, waiting for A to finish process C grabs -i_mutex in your upper layer process C does vfs_write(), which blocks, since there's a pending attempt to freeze process A tries to grab -i_mutex held by C and blocks -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
Al Viro: The scenario, BTW, looks so: process A does sb_start_write() (on your upper layer) process B tries to freeze said upper layer and blocks, waiting for A to finish process C grabs -i_mutex in your upper layer process C does vfs_write(), which blocks, since there's a pending attempt to freeze process A tries to grab -i_mutex held by C and blocks According to latest mm/filemap.c:generic_file_aio_write(), sb_start_write(inode-i_sb); mutex_lock(inode-i_mutex); ret = __generic_file_aio_write(iocb, iov, nr_segs, iocb-ki_pos); mutex_unlock(inode-i_mutex); ::: sb_end_write(inode-i_sb); Process C would block *BEFORE* i_mutex by sb_start_write()? No? Honestly speaking I didn't pay attention about the freeze feature since I don't use it. I am making aufs to support it now. But I don't know how to test it... J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Sat, Mar 23, 2013 at 11:49:11AM +0900, J. R. Okajima wrote: Al Viro: The scenario, BTW, looks so: process A does sb_start_write() (on your upper layer) process B tries to freeze said upper layer and blocks, waiting for A to finish process C grabs -i_mutex in your upper layer process C does vfs_write(), which blocks, since there's a pending attempt to freeze process A tries to grab -i_mutex held by C and blocks According to latest mm/filemap.c:generic_file_aio_write(), sb_start_write(inode-i_sb); mutex_lock(inode-i_mutex); ret = __generic_file_aio_write(iocb, iov, nr_segs, iocb-ki_pos); mutex_unlock(inode-i_mutex); ::: sb_end_write(inode-i_sb); Process C would block *BEFORE* i_mutex by sb_start_write()? No? Different -i_mutex; you are holding one on the parent directory already. That's the problem - you have -i_mutex nested both inside that sucker (as it ought to) and outside. Which tends to do bad things, obviously, in particular because something like mkdir(2) will do sb_start_write() (from mnt_want_write(), called by kern_path_create()) before grabbing directory -i_mutex. Thus the activity with lifting the bastard out of -aio_write(), etc. in vfs.git#experimental - *any* union-like variant will need the ability to pull sb_start_write() outside of locking the parent directory on copyup. And yes, it's a common prerequisite to anything doing copyups - aufs is in the same boat as overlayfs and unionmount. Same deadlock for all three of them. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
Al Viro: Different -i_mutex; you are holding one on the parent directory already. Let me make sure. In your scenario, - processA writes something into the union, and the unioning fs operates the writable layer. After sb_start_write() succeeds, processA should not block by the reason of fsfreeze. - processC causes the copyup. The current aufs implementation holds parent-i_mutex on the writable layer during the copyup. The parent-i_mutex can make processA blocking. Now I am considering the copyup approach you suggested in another mail, and I am going to replace your unlink the target, re-link later, no dir lock during copyup by make it hidden instead of unlinking, rename the correct name later, no dir lock during copyup since I am not sure all FSs can operate -link with the unlinked one. I guess most FS can handle it, but I don't want to make sure everything particulary remote fs, journals. To make a file hidden, I guess I can use the aufs doubley whiteouted approach. As you might know, aufs prepends the .wh. prefix to the filename as whiteout. With one more prefix, the name loses the role of whiteout. Aufs simply ignores such doubly whiteouted name. The demerit is that aufs has to limit the length of the name. J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Wed 20-03-13 21:48:13, Al Viro wrote: > On Wed, Mar 20, 2013 at 08:52:22PM +0100, Jan Kara wrote: > > > do_bio_filebacked(), with some ugliness between that and callsite. Note, > > > BTW, that we have a pair of possible vfs_fsync() calls in there; how do > > > those > > > interact with freeze? > > Freezing code takes care that all dirty data is synced before fs is > > frozen and no new dirty data can be created before fs is thawed. So > > vfs_fsync() should just return without doing anything on frozen filesystem. > > Um... How does it interact with vfs_fsync() already in progress when you > ask to freeze it? So the exact sequence of freezing is: sb->s_writers.frozen = SB_FREEZE_WRITE; smp_wmb(); sb_wait_write(sb, SB_FREEZE_WRITE); Now there are no processes in sb_start_write() - sb_end_write() section. Then we do the same for SB_FREEZE_PAGEFAULT. After this noone should be able to dirty a page or inode. Writeback or vfs_fsync() may be still running (so fs can be creating new transactions in the journal for writeback etc.). sync_filesystem(sb); After this there should be no dirty data so although we can still be somewhere inside vfs_fsync() it should have nothing to do. Now we freeze to state SB_FREEZE_FS (nop for ext4, but for XFS it may interact e.g. with inode reclaim trimming preallocated blocks) and we are done. > Anyway, I've pulled the fscker out of ->aio_write, ->write and ->splice_write; > on that pathway it's in the do_splice_from() (see vfs.git#experimental). > > ... and now, for something *really* nasty: where do mandatory file locks > belong in the locking hierarchy? Relative to fsfreeze one, for starters, > but both for unionmount and overlayfs we need to decide where they live > relative to ->i_mutex on directories. Hum, interesting question :). Relative to fsfreeze, it doesn't seem to matter much, does it? We don't actually lock / unlock these from write paths needing freeze protection, we only wait for them. But maybe I miss some ugly case you have in mind. Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Wed, Mar 20, 2013 at 08:52:22PM +0100, Jan Kara wrote: > > do_bio_filebacked(), with some ugliness between that and callsite. Note, > > BTW, that we have a pair of possible vfs_fsync() calls in there; how do > > those > > interact with freeze? > Freezing code takes care that all dirty data is synced before fs is > frozen and no new dirty data can be created before fs is thawed. So > vfs_fsync() should just return without doing anything on frozen filesystem. Um... How does it interact with vfs_fsync() already in progress when you ask to freeze it? Anyway, I've pulled the fscker out of ->aio_write, ->write and ->splice_write; on that pathway it's in the do_splice_from() (see vfs.git#experimental). ... and now, for something *really* nasty: where do mandatory file locks belong in the locking hierarchy? Relative to fsfreeze one, for starters, but both for unionmount and overlayfs we need to decide where they live relative to ->i_mutex on directories. And that, BTW, may be the strongest argument so far in favour of the scheme I'd suggested for copyup-via-opened-but-unlinked... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Wed 20-03-13 02:33:08, Al Viro wrote: > On Tue, Mar 19, 2013 at 10:10:32PM +, Al Viro wrote: > > > OK, it's going to be an interesting series - aforementioned tentative patch > > was badly incomplete ;-/ > > The interesting question is how far do we want to lift that. ->aio_write() > part is trivial - see vfs.git#experimental; the trouble begins with > ->splice_write(). For *everything* except default_file_splice_write(), > lifting into the caller (do_splice_from()) is the right thing to do. > > default_file_splice_write(), however, it trickier; there we end up calling > vfs_write() (via an ugly callchain). And _that_ is a real bitch. Granted, > vfs_write() is somewhat an overkill there (we'd already done rw_verify_area() > and access_ok() is pointless due to set_fs() we do around vfs_write() > there) and we'd already lifted it up to do_sync_write(). But if we lift > it any further, we'll need to deal with ->write() callers in the tree. > Current situation: > > fs/coredump.c:662: return access_ok(VERIFY_READ, addr, nr) && > file->f_op->write(file, addr, nr, >f_pos) == nr; > arch/powerpc/platforms/cell/spufs/coredump.c:63:written = > file->f_op->write(file, addr, nr, >f_pos); > > for these guys we might actually want to lift all way up to do_coredump() > > drivers/staging/comedi/drivers/serial2002.c:91: result = f->f_op->write(f, > buf, count, >f_pos); > fs/autofs4/waitq.c:73: (wr = > file->f_op->write(file,data,bytes,>f_pos)) > 0) { > > not regular files, unless I'm seriously misreading the code. > > kernel/acct.c:553: file->f_op->write(file, (char *), > > BTW, this is probably where we want to deal with your acct deadlock. > > fs/compat.c:1103: fn = (io_fn_t)file->f_op->write; > fs/read_write.c:435:ret = file->f_op->write(file, buf, > count, pos); > fs/read_write.c:732:fn = (io_fn_t)file->f_op->write; > > syscalls - the question here is whether we lift it up to vfs_write/vfs_writev/ > compat_writev, or actually take it further. > > fs/cachefiles/rdwr.c:967: ret = file->f_op->write( > > cachefiles_write_page(); no fucking idea what locks might be held by caller > and potentially that's a rather nasty source of PITA > > fs/coda/file.c:84: ret = host_file->f_op->write(host_file, buf, count, > ppos); > > coda writing to file in cache on local fs. Potentially a nasty bugger, since > it's hard to lift any further - the caller has no idea that the thing is > on CODA, let alone what happens to hold the local cache. > > drivers/block/loop.c:234: bw = file->f_op->write(file, buf, len, ); > > do_bio_filebacked(), with some ugliness between that and callsite. Note, > BTW, that we have a pair of possible vfs_fsync() calls in there; how do those > interact with freeze? Freezing code takes care that all dirty data is synced before fs is frozen and no new dirty data can be created before fs is thawed. So vfs_fsync() should just return without doing anything on frozen filesystem. > This does *not* touch the current callers of vfs_write()/vfs_writev(); any of > those called while holding ->i_mutex on a directory (or mnt_want_write(), for > that matter) is a deadlock right now. > > And we'd better start thinking about how we'll backport that crap - deadlock > in e.g. xfs ->splice_write() had been there since last summer ;-/ Yeah but noone really noticed because in practice the code isn't stressed much. Much bigger problems had been there for years before they were fixed last summer without anybody complaining... So I'm not sure how hard do we want to try to backport this. Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
Al Viro wrote: > fs/cachefiles/rdwr.c:967: ret = file->f_op->write( > > cachefiles_write_page(); no fucking idea what locks might be held by caller > and potentially that's a rather nasty source of PITA The caller of cachefiles_write_page() (ie. fscache_write_op()) holds no locks over the call. __fscache_write_page() queues the pages for writing to the cache from the netfs's read-side routines and returns immediately. fscache_write_op() then picks them up in the background and passes them over to the backend (eg. cachefiles_write_page()) which writes them out to the cache. So I think it should be safe to call file_start/end_write() or whatever around the file->f_op->write() calls. David -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 10:24 PM, Al Viro wrote: > it still might make sense to implement > something as a layer, but some parts of that sucker may be better off as > fs primitives. Hell, we could, in theory, implement xattrs as a layer; > just look at how reiserfs had done them. We could do the same for hardlinks > (look at qnx4, if you wish to see that hack). Of for symlinks (sysvfs). > Or for opened-and-unlinked files (sillyrename could be done as a generic > layer). Or for permissions/ownership/arbitrary names (umsdos, and that > _was_ very similar to layering). It's just that often an underlying fs > has a better way of doing that. IMO whiteouts are in that class. My gut feeling is that whiteouts (and all that's related) are just too specialized. All the examples you listed are more general purpose. I know one that could be useful for a variety of things, an "exchange" operation. While similar to rename, the semantics are much cleaner and so the implementation becomes easier as well. But that one doesn't quite solve the rename-with-whiteout thing. For that a three way permutation would be needed where one of the three is an unlinked "floating" object. That one is a really generic op but we'd need to introduce linking unlinked objects into the tree which comes with its own problems. But I think it may be worth trying to come up with something more generally useful before adding whiteouts and various overlay-specific flags to filesystem code. Thanks, Miklos -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 10:24 PM, Al Viro v...@zeniv.linux.org.uk wrote: it still might make sense to implement something as a layer, but some parts of that sucker may be better off as fs primitives. Hell, we could, in theory, implement xattrs as a layer; just look at how reiserfs had done them. We could do the same for hardlinks (look at qnx4, if you wish to see that hack). Of for symlinks (sysvfs). Or for opened-and-unlinked files (sillyrename could be done as a generic layer). Or for permissions/ownership/arbitrary names (umsdos, and that _was_ very similar to layering). It's just that often an underlying fs has a better way of doing that. IMO whiteouts are in that class. My gut feeling is that whiteouts (and all that's related) are just too specialized. All the examples you listed are more general purpose. I know one that could be useful for a variety of things, an exchange operation. While similar to rename, the semantics are much cleaner and so the implementation becomes easier as well. But that one doesn't quite solve the rename-with-whiteout thing. For that a three way permutation would be needed where one of the three is an unlinked floating object. That one is a really generic op but we'd need to introduce linking unlinked objects into the tree which comes with its own problems. But I think it may be worth trying to come up with something more generally useful before adding whiteouts and various overlay-specific flags to filesystem code. Thanks, Miklos -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
Al Viro v...@zeniv.linux.org.uk wrote: fs/cachefiles/rdwr.c:967: ret = file-f_op-write( cachefiles_write_page(); no fucking idea what locks might be held by caller and potentially that's a rather nasty source of PITA The caller of cachefiles_write_page() (ie. fscache_write_op()) holds no locks over the call. __fscache_write_page() queues the pages for writing to the cache from the netfs's read-side routines and returns immediately. fscache_write_op() then picks them up in the background and passes them over to the backend (eg. cachefiles_write_page()) which writes them out to the cache. So I think it should be safe to call file_start/end_write() or whatever around the file-f_op-write() calls. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Wed 20-03-13 02:33:08, Al Viro wrote: On Tue, Mar 19, 2013 at 10:10:32PM +, Al Viro wrote: OK, it's going to be an interesting series - aforementioned tentative patch was badly incomplete ;-/ The interesting question is how far do we want to lift that. -aio_write() part is trivial - see vfs.git#experimental; the trouble begins with -splice_write(). For *everything* except default_file_splice_write(), lifting into the caller (do_splice_from()) is the right thing to do. default_file_splice_write(), however, it trickier; there we end up calling vfs_write() (via an ugly callchain). And _that_ is a real bitch. Granted, vfs_write() is somewhat an overkill there (we'd already done rw_verify_area() and access_ok() is pointless due to set_fs() we do around vfs_write() there) and we'd already lifted it up to do_sync_write(). But if we lift it any further, we'll need to deal with -write() callers in the tree. Current situation: fs/coredump.c:662: return access_ok(VERIFY_READ, addr, nr) file-f_op-write(file, addr, nr, file-f_pos) == nr; arch/powerpc/platforms/cell/spufs/coredump.c:63:written = file-f_op-write(file, addr, nr, file-f_pos); for these guys we might actually want to lift all way up to do_coredump() drivers/staging/comedi/drivers/serial2002.c:91: result = f-f_op-write(f, buf, count, f-f_pos); fs/autofs4/waitq.c:73: (wr = file-f_op-write(file,data,bytes,file-f_pos)) 0) { not regular files, unless I'm seriously misreading the code. kernel/acct.c:553: file-f_op-write(file, (char *)ac, BTW, this is probably where we want to deal with your acct deadlock. fs/compat.c:1103: fn = (io_fn_t)file-f_op-write; fs/read_write.c:435:ret = file-f_op-write(file, buf, count, pos); fs/read_write.c:732:fn = (io_fn_t)file-f_op-write; syscalls - the question here is whether we lift it up to vfs_write/vfs_writev/ compat_writev, or actually take it further. fs/cachefiles/rdwr.c:967: ret = file-f_op-write( cachefiles_write_page(); no fucking idea what locks might be held by caller and potentially that's a rather nasty source of PITA fs/coda/file.c:84: ret = host_file-f_op-write(host_file, buf, count, ppos); coda writing to file in cache on local fs. Potentially a nasty bugger, since it's hard to lift any further - the caller has no idea that the thing is on CODA, let alone what happens to hold the local cache. drivers/block/loop.c:234: bw = file-f_op-write(file, buf, len, pos); do_bio_filebacked(), with some ugliness between that and callsite. Note, BTW, that we have a pair of possible vfs_fsync() calls in there; how do those interact with freeze? Freezing code takes care that all dirty data is synced before fs is frozen and no new dirty data can be created before fs is thawed. So vfs_fsync() should just return without doing anything on frozen filesystem. This does *not* touch the current callers of vfs_write()/vfs_writev(); any of those called while holding -i_mutex on a directory (or mnt_want_write(), for that matter) is a deadlock right now. And we'd better start thinking about how we'll backport that crap - deadlock in e.g. xfs -splice_write() had been there since last summer ;-/ Yeah but noone really noticed because in practice the code isn't stressed much. Much bigger problems had been there for years before they were fixed last summer without anybody complaining... So I'm not sure how hard do we want to try to backport this. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Wed, Mar 20, 2013 at 08:52:22PM +0100, Jan Kara wrote: do_bio_filebacked(), with some ugliness between that and callsite. Note, BTW, that we have a pair of possible vfs_fsync() calls in there; how do those interact with freeze? Freezing code takes care that all dirty data is synced before fs is frozen and no new dirty data can be created before fs is thawed. So vfs_fsync() should just return without doing anything on frozen filesystem. Um... How does it interact with vfs_fsync() already in progress when you ask to freeze it? Anyway, I've pulled the fscker out of -aio_write, -write and -splice_write; on that pathway it's in the do_splice_from() (see vfs.git#experimental). ... and now, for something *really* nasty: where do mandatory file locks belong in the locking hierarchy? Relative to fsfreeze one, for starters, but both for unionmount and overlayfs we need to decide where they live relative to -i_mutex on directories. And that, BTW, may be the strongest argument so far in favour of the scheme I'd suggested for copyup-via-opened-but-unlinked... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Wed 20-03-13 21:48:13, Al Viro wrote: On Wed, Mar 20, 2013 at 08:52:22PM +0100, Jan Kara wrote: do_bio_filebacked(), with some ugliness between that and callsite. Note, BTW, that we have a pair of possible vfs_fsync() calls in there; how do those interact with freeze? Freezing code takes care that all dirty data is synced before fs is frozen and no new dirty data can be created before fs is thawed. So vfs_fsync() should just return without doing anything on frozen filesystem. Um... How does it interact with vfs_fsync() already in progress when you ask to freeze it? So the exact sequence of freezing is: sb-s_writers.frozen = SB_FREEZE_WRITE; smp_wmb(); sb_wait_write(sb, SB_FREEZE_WRITE); Now there are no processes in sb_start_write() - sb_end_write() section. Then we do the same for SB_FREEZE_PAGEFAULT. After this noone should be able to dirty a page or inode. Writeback or vfs_fsync() may be still running (so fs can be creating new transactions in the journal for writeback etc.). sync_filesystem(sb); After this there should be no dirty data so although we can still be somewhere inside vfs_fsync() it should have nothing to do. Now we freeze to state SB_FREEZE_FS (nop for ext4, but for XFS it may interact e.g. with inode reclaim trimming preallocated blocks) and we are done. Anyway, I've pulled the fscker out of -aio_write, -write and -splice_write; on that pathway it's in the do_splice_from() (see vfs.git#experimental). ... and now, for something *really* nasty: where do mandatory file locks belong in the locking hierarchy? Relative to fsfreeze one, for starters, but both for unionmount and overlayfs we need to decide where they live relative to -i_mutex on directories. Hum, interesting question :). Relative to fsfreeze, it doesn't seem to matter much, does it? We don't actually lock / unlock these from write paths needing freeze protection, we only wait for them. But maybe I miss some ugly case you have in mind. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 10:10:32PM +, Al Viro wrote: > OK, it's going to be an interesting series - aforementioned tentative patch > was badly incomplete ;-/ The interesting question is how far do we want to lift that. ->aio_write() part is trivial - see vfs.git#experimental; the trouble begins with ->splice_write(). For *everything* except default_file_splice_write(), lifting into the caller (do_splice_from()) is the right thing to do. default_file_splice_write(), however, it trickier; there we end up calling vfs_write() (via an ugly callchain). And _that_ is a real bitch. Granted, vfs_write() is somewhat an overkill there (we'd already done rw_verify_area() and access_ok() is pointless due to set_fs() we do around vfs_write() there) and we'd already lifted it up to do_sync_write(). But if we lift it any further, we'll need to deal with ->write() callers in the tree. Current situation: fs/coredump.c:662: return access_ok(VERIFY_READ, addr, nr) && file->f_op->write(file, addr, nr, >f_pos) == nr; arch/powerpc/platforms/cell/spufs/coredump.c:63:written = file->f_op->write(file, addr, nr, >f_pos); for these guys we might actually want to lift all way up to do_coredump() drivers/staging/comedi/drivers/serial2002.c:91: result = f->f_op->write(f, buf, count, >f_pos); fs/autofs4/waitq.c:73: (wr = file->f_op->write(file,data,bytes,>f_pos)) > 0) { not regular files, unless I'm seriously misreading the code. kernel/acct.c:553: file->f_op->write(file, (char *), BTW, this is probably where we want to deal with your acct deadlock. fs/compat.c:1103: fn = (io_fn_t)file->f_op->write; fs/read_write.c:435:ret = file->f_op->write(file, buf, count, pos); fs/read_write.c:732:fn = (io_fn_t)file->f_op->write; syscalls - the question here is whether we lift it up to vfs_write/vfs_writev/ compat_writev, or actually take it further. fs/cachefiles/rdwr.c:967: ret = file->f_op->write( cachefiles_write_page(); no fucking idea what locks might be held by caller and potentially that's a rather nasty source of PITA fs/coda/file.c:84: ret = host_file->f_op->write(host_file, buf, count, ppos); coda writing to file in cache on local fs. Potentially a nasty bugger, since it's hard to lift any further - the caller has no idea that the thing is on CODA, let alone what happens to hold the local cache. drivers/block/loop.c:234: bw = file->f_op->write(file, buf, len, ); do_bio_filebacked(), with some ugliness between that and callsite. Note, BTW, that we have a pair of possible vfs_fsync() calls in there; how do those interact with freeze? This does *not* touch the current callers of vfs_write()/vfs_writev(); any of those called while holding ->i_mutex on a directory (or mnt_want_write(), for that matter) is a deadlock right now. And we'd better start thinking about how we'll backport that crap - deadlock in e.g. xfs ->splice_write() had been there since last summer ;-/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 09:38:31PM +, Al Viro wrote: > On Tue, Mar 19, 2013 at 09:25:43PM +0100, Jan Kara wrote: > > > BTW, having sb_start_write() buried in individual ->splice_write() is > > > asking for trouble; could you describe the rules for that? E.g. where > > > does it nest wrt filesystem-private locks? XFS iolock, for example... > > Generally, the freeze protection should be the outermost lock taken (so > > that we mitigate possibility of blocking readers when waiting for fs to > > unfreeze). So it ranks above i_mutex, or XFS' ilock and iolock. > > Welcome to deadlock, then: > xfs_file_splice_write() > ... > xfs_ilock(ip, XFS_IOLOCK_EXCL); > ... > ret = generic_file_splice_write(pipe, outfilp, ppos, count, flags); > > > It seems that I screwed this up for ->splice_write() :-| If we are going to > > move out sb_start_write() out of filesystems' hands into do_splice_from() > > then we should likely do the same with ->aio_write(). Hmm? ... and then there's ext4_file_dio_write(), which cheerfully ignores freeze. ... and udf expanding from inline files to separately allocated before it gets to sb_start_write() ... and a bunch of guys doing generic_write_sync() after generic_aio_file_write() ... and a lot of other fun stuff. Ouch... OK, it's going to be an interesting series - aforementioned tentative patch was badly incomplete ;-/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 09:25:43PM +0100, Jan Kara wrote: > > BTW, having sb_start_write() buried in individual ->splice_write() is > > asking for trouble; could you describe the rules for that? E.g. where > > does it nest wrt filesystem-private locks? XFS iolock, for example... > Generally, the freeze protection should be the outermost lock taken (so > that we mitigate possibility of blocking readers when waiting for fs to > unfreeze). So it ranks above i_mutex, or XFS' ilock and iolock. Welcome to deadlock, then: xfs_file_splice_write() ... xfs_ilock(ip, XFS_IOLOCK_EXCL); ... ret = generic_file_splice_write(pipe, outfilp, ppos, count, flags); > It seems that I screwed this up for ->splice_write() :-| If we are going to > move out sb_start_write() out of filesystems' hands into do_splice_from() > then we should likely do the same with ->aio_write(). Hmm? Yes, I've a tentative patch doing just that; will push tonight. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 07:32:42PM +0100, Miklos Szeredi wrote: > > * victim is negative => create a whiteout > > * victim is a directory, parent opaque => rmdir > > * victim is a non-directory, parent opaque => unlink > > * victim is positive, parent _not_ opaque => replace with whiteout > > * old_dir in case of ->rename() is opaque => normal rename > > * old_dir in case of ->rename() is not opaque => leave whiteout > > behind > > Non-unioned => opaque, of course (nothing showing through it). > > > > I dunnow. Overloading common paths with overlay/union specific things > doesn't look very clean to me. We certainly can add a couple of new methods; the question is, would that buy us anything? This ->rename_and_whiteout would be practically identical to ->rename on most of filesystems; the only difference for something like ext* is that instead of deleting the old directory entry in the end we would change its inumber to 0 and type to 14. Everything else would be the same. Moreover, turning a positive entry into a whiteout differs from unlink/rmdir resp. in exactly the same way. FWIW, creating a whiteout from scratch might be better off as a separate method - it's closer to ->mknod() and friends. But whiteout from positive and whiteout-on-rename probably isn't worth separating from the normal methods. All we end up "overloading" is actual removal of directory entry, which is a rather minor part of the logics in those. And merging ->unlink/->rmdir is definitely a good idea, regardless of anything union-related; just look at the existing instances of both. > > Getting good behaviour on rename interrupted by crash is going to be _very_ > > tricky with any strategy other than whiteouts-in-fs, AFAICS. > > > > One idea is to add a journal to the overlay itself (yeah, namespace issues). ... and extra complexity from hell. If the underlying fs has a journal of its own, we are getting all kinds of interesting interplay between those; if not, replaying the journal in overlay will buy you nothing, since the damage to underlying objects makes it all moot. Layering is a very nice tool for quick prototyping; no arguments about that. It allows to avoid modifying $BIGNUM filesystems and for something that lives out of tree it's a very big benefit. Once the thing is merged, though, that benefit becomes much smaller; it still might make sense to implement something as a layer, but some parts of that sucker may be better off as fs primitives. Hell, we could, in theory, implement xattrs as a layer; just look at how reiserfs had done them. We could do the same for hardlinks (look at qnx4, if you wish to see that hack). Of for symlinks (sysvfs). Or for opened-and-unlinked files (sillyrename could be done as a generic layer). Or for permissions/ownership/arbitrary names (umsdos, and that _was_ very similar to layering). It's just that often an underlying fs has a better way of doing that. IMO whiteouts are in that class. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Mon 18-03-13 23:01:03, Al Viro wrote: > On Mon, Mar 18, 2013 at 09:53:34PM +, Al Viro wrote: > > On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote: > > > IMO the deadlock is real. In freeze_super() we wait for all writers to > > > the filesystem to finish while blocking beginning of any further writes. > > > So > > > we have a deadlock scenario like: > > > > > > THREAD1 THREAD2 THREAD3 > > > mnt_want_write() mutex_lock(>i_mutex); > > > ... freeze_super() > > > block on mutex_lock(>i_mutex) > > > sb_wait_write(sb, > > > SB_FREEZE_WRITE); > > > block in sb_start_write() > > > > The bug is on fsfreeze side and this is not the only problem related to it. > > I've missed the implications when I applied "fs: Add freezing handling > > to mnt_want_write() / mnt_drop_write()" last June ;-/ > > > > The thing is, until then mnt_want_write() used to be a counter; it could be > > nested. Now any such nesting is a deadlock you've just described. This > > is seriously wrong, IMO. > > > > BTW, having sb_start_write() buried in individual ->splice_write() is > > asking for trouble; could you describe the rules for that? E.g. where > > does it nest wrt filesystem-private locks? XFS iolock, for example... > > I'm looking at the existing callers and I really wonder if we ought to > push sb_start_write() from ->splice_write()/->aio_write()/etc. into the > callers. Yeah, that should be OK. > Something like file_start_write()/file_end_write(), with check for file > being regular one might be a good starting point. As it is, copyup is > really fucked both in unionmount and overlayfs... Makes sense. I can do the patch. BTW, for months I'm trying to push to you a patch which creates a function like file_start_write() which returns EAGAIN if the file is open with O_NONBLOCK and fs is frozen (this allows me to solve a deadlock with bsd process accounting to frozen fs). After this change the patch will become trivial so I'll add it to the series and hopefully it won't get ignored. Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Mon 18-03-13 21:53:34, Al Viro wrote: > On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote: > > IMO the deadlock is real. In freeze_super() we wait for all writers to > > the filesystem to finish while blocking beginning of any further writes. So > > we have a deadlock scenario like: > > > > THREAD1 THREAD2 THREAD3 > > mnt_want_write()mutex_lock(>i_mutex); > > ... freeze_super() > > block on mutex_lock(>i_mutex) > > sb_wait_write(sb, > > SB_FREEZE_WRITE); > > block in sb_start_write() > > The bug is on fsfreeze side and this is not the only problem related to it. > I've missed the implications when I applied "fs: Add freezing handling > to mnt_want_write() / mnt_drop_write()" last June ;-/ > > The thing is, until then mnt_want_write() used to be a counter; it could be > nested. Now any such nesting is a deadlock you've just described. This > is seriously wrong, IMO. Well, but sb_start_write() has to be blocking (blocks when fs is frozen) and you have to get it somewhere. It seems only natural to get the counter from original mnt_want_write() at the same place and use one function for that. Whether I should have changed the name from mnt_want_write() to something else is questionable... > BTW, having sb_start_write() buried in individual ->splice_write() is > asking for trouble; could you describe the rules for that? E.g. where > does it nest wrt filesystem-private locks? XFS iolock, for example... Generally, the freeze protection should be the outermost lock taken (so that we mitigate possibility of blocking readers when waiting for fs to unfreeze). So it ranks above i_mutex, or XFS' ilock and iolock. It seems that I screwed this up for ->splice_write() :-| If we are going to move out sb_start_write() out of filesystems' hands into do_splice_from() then we should likely do the same with ->aio_write(). Hmm? Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 6:03 PM, Al Viro wrote: > On Tue, Mar 19, 2013 at 11:29:41AM +0100, Miklos Szeredi wrote: > >> Copy up is a once-in-a-lifetime event for an object. Optimizing it is >> way down in the list of things to do. I'd drop splice in a jiffy if >> it's in the way. > > What makes you think that write is any better? Same deadlock there - check > generic_file_aio_write(), it calls the same sb_start_write()... IOW, > switching from splice to write won't help at all. Okay, I missed that. Yeah, that needs fixing... >> Much more interesting question: what happens if we crash during a >> rename? Whiteout implemented in the filesystem won't save us. And >> the results are interesting: old versions of files become visible and >> similar fun. Far from likely to happen, but ... >> >> Add a rename-with-whiteout primitive on filesystems? That one is not >> going to be as simple as plain whiteout. Or? > > Umm... If/when we start caring about that kind of atomicity (and I agree > that we ought to) overlayfs approach to whiteouts will actually have much > harder time - it doesn't take much to teach a journalling fs how to do that > kind of ->rename() in a single transaction; the only question is how to tell > it that we want to leave a whiteout behind us. Hell knows; one variant is > to add a flag, of course. Another might be more interesting - we want some > kind of "directory is opaque" flag, so if we start reshuffling the methods, > we might try to merge unlink/rmdir/whiteout. Rules: > * victim is negative => create a whiteout > * victim is a directory, parent opaque => rmdir > * victim is a non-directory, parent opaque => unlink > * victim is positive, parent _not_ opaque => replace with whiteout > * old_dir in case of ->rename() is opaque => normal rename > * old_dir in case of ->rename() is not opaque => leave whiteout behind > Non-unioned => opaque, of course (nothing showing through it). > I dunnow. Overloading common paths with overlay/union specific things doesn't look very clean to me. I have a similar problem with union-mounts: it's hooking into lots of common paths in the VFS for the sake of a very specialized feature. > Getting good behaviour on rename interrupted by crash is going to be _very_ > tricky with any strategy other than whiteouts-in-fs, AFAICS. > One idea is to add a journal to the overlay itself (yeah, namespace issues). Thanks, Miklos -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 11:29:41AM +0100, Miklos Szeredi wrote: > Copy up is a once-in-a-lifetime event for an object. Optimizing it is > way down in the list of things to do. I'd drop splice in a jiffy if > it's in the way. What makes you think that write is any better? Same deadlock there - check generic_file_aio_write(), it calls the same sb_start_write()... IOW, switching from splice to write won't help at all. > Much more interesting question: what happens if we crash during a > rename? Whiteout implemented in the filesystem won't save us. And > the results are interesting: old versions of files become visible and > similar fun. Far from likely to happen, but ... > > Add a rename-with-whiteout primitive on filesystems? That one is not > going to be as simple as plain whiteout. Or? Umm... If/when we start caring about that kind of atomicity (and I agree that we ought to) overlayfs approach to whiteouts will actually have much harder time - it doesn't take much to teach a journalling fs how to do that kind of ->rename() in a single transaction; the only question is how to tell it that we want to leave a whiteout behind us. Hell knows; one variant is to add a flag, of course. Another might be more interesting - we want some kind of "directory is opaque" flag, so if we start reshuffling the methods, we might try to merge unlink/rmdir/whiteout. Rules: * victim is negative => create a whiteout * victim is a directory, parent opaque => rmdir * victim is a non-directory, parent opaque => unlink * victim is positive, parent _not_ opaque => replace with whiteout * old_dir in case of ->rename() is opaque => normal rename * old_dir in case of ->rename() is not opaque => leave whiteout behind Non-unioned => opaque, of course (nothing showing through it). Getting good behaviour on rename interrupted by crash is going to be _very_ tricky with any strategy other than whiteouts-in-fs, AFAICS. Again, I have no problem whatsoever with changing the set of directory methods, as long as the replacement is sane. We'd done that kind of thing before and it's not a problem. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 12:04 PM, David Howells wrote: > Miklos Szeredi wrote: > >> > BTW, I wonder what's the right locking for that sucker; overlayfs is >> > probably too heavy - we are talking about copying a file from one fs to >> > another, which can obviously take quite a while, so holding ->i_mutex on >> > _parent_ all along is asking for very serious contention. >> >> Copy up is a once-in-a-lifetime event for an object. Optimizing it is >> way down in the list of things to do. I'd drop splice in a jiffy if >> it's in the way. > > Yes, but it could block the parent directory for a long time. I suspect it's > fine if you can RCU walk through the parent, but if you have to grab a lock on > it... Right. Lets look at it this way: users of an overlay accept that an operation X can take T time, where T is much longer than would be on a normal filesystem. Then why would they complain that operation Y (which happens to bump into the parent lock of X) also takes T? If copy up of huge files happens more then very very occasionally, then the overlay will be basically unusable anyway. It's just not what it is designed for, so why try to optimize this case? Thanks, Miklos -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
Miklos Szeredi wrote: > > BTW, I wonder what's the right locking for that sucker; overlayfs is > > probably too heavy - we are talking about copying a file from one fs to > > another, which can obviously take quite a while, so holding ->i_mutex on > > _parent_ all along is asking for very serious contention. > > Copy up is a once-in-a-lifetime event for an object. Optimizing it is > way down in the list of things to do. I'd drop splice in a jiffy if > it's in the way. Yes, but it could block the parent directory for a long time. I suspect it's fine if you can RCU walk through the parent, but if you have to grab a lock on it... David -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 2:38 AM, Al Viro wrote: > On Mon, Mar 18, 2013 at 11:01:03PM +, Al Viro wrote: > >> I'm looking at the existing callers and I really wonder if we ought to >> push sb_start_write() from ->splice_write()/->aio_write()/etc. into the >> callers. >> >> Something like file_start_write()/file_end_write(), with check for file >> being regular one might be a good starting point. As it is, copyup is >> really fucked both in unionmount and overlayfs... > > BTW, I wonder what's the right locking for that sucker; overlayfs is probably > too heavy - we are talking about copying a file from one fs to another, which > can obviously take quite a while, so holding ->i_mutex on _parent_ all along > is asking for very serious contention. Copy up is a once-in-a-lifetime event for an object. Optimizing it is way down in the list of things to do. I'd drop splice in a jiffy if it's in the way. Much more interesting question: what happens if we crash during a rename? Whiteout implemented in the filesystem won't save us. And the results are interesting: old versions of files become visible and similar fun. Far from likely to happen, but ... Add a rename-with-whiteout primitive on filesystems? That one is not going to be as simple as plain whiteout. Or? Thanks, Miklos -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
Al Viro: > BTW, I wonder what's the right locking for that sucker; overlayfs is probably > too heavy - we are talking about copying a file from one fs to another, which > can obviously take quite a while, so holding ->i_mutex on _parent_ all along > is asking for very serious contention. OTOH, there's a pile of unpleasant Yes, holding parent->i_mutex can be longer. Using splice function for copy-up is simple and (probably) fastest. But it doesn't support a "hole" in the file (sparse file). All holes are filled with NUL byte and consumes a disk block on the upper layer. It is a problem, especially for users who have smaller tmpfs as his upper layer. The copy-up with considering a hole may cost more, but it can save the storage consumtion. > Another fun issue is copyup vs. copyup - we want to sit and wait for copyup > attempt in progress to complete, rather than start another one in parallel. > And whoever comes the second must check if copyup has succeeded, obviously - > it's possible to have user run into 5% limit and fail copyup, followed by > root doing it successfully. "5% limit" means the reserved are for a superuser on a filesystem, right? As far as I know, overlayfs (UnionMount too?) solves this problem as changing the task credential and the capability. But I am not sure whether it solves the similar problem around the resource limit like RLIMIT_CPU, RLIMIT_FSIZE or something. > Another one: overwriting rename() vs. copyup. Similar to unlink() vs. > copyup(). Hmm, do you mean that, just after the parent dir lock on the underlying fs, the copyup routine should confirm whether the target is still alive? If so, I agree. > Another one: read-only open() vs. copyup(). Hell knows - we obviously don't > want to open a partially copied file; might want to wait or pretend that this > open() has come first and give it the underlying file. The same goes for > stat() vs. copyup(). Exactly. Moreover users don't want to refer to the lower file which is obsoleted by copyup. > FWIW, something like "lock parent, ->create(), ->unlink(), unlock parent, > copy data and metadata, lock parent, allocate a new dentry in covering layer > and do ->lookup() on it, verify that it is negative and not a whiteout, lock > child, use ->link() to put it into directory, unlock everything" would > probably DTRT for unlink/copyup and rename/copyup. The rest... hell knows; Please let me make sure. You are saying, - create the file on the upper layer - get the "struct file" object - hide it from users - before copying the file, unlock the parent in order to stop the long period locking - copy the file without the parent lock. it doesn't matter since the file is invisible to users. - confirm that the target name is still available for copyup - make the completed file visible by ->link() It is ineteresting to ->link() with the unlinked file. While vfs_unlink() rejects such case, it may not matter for the underlying fs. Need to verify FS including jounals. J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
Al Viro: BTW, I wonder what's the right locking for that sucker; overlayfs is probably too heavy - we are talking about copying a file from one fs to another, which can obviously take quite a while, so holding -i_mutex on _parent_ all along is asking for very serious contention. OTOH, there's a pile of unpleasant Yes, holding parent-i_mutex can be longer. Using splice function for copy-up is simple and (probably) fastest. But it doesn't support a hole in the file (sparse file). All holes are filled with NUL byte and consumes a disk block on the upper layer. It is a problem, especially for users who have smaller tmpfs as his upper layer. The copy-up with considering a hole may cost more, but it can save the storage consumtion. Another fun issue is copyup vs. copyup - we want to sit and wait for copyup attempt in progress to complete, rather than start another one in parallel. And whoever comes the second must check if copyup has succeeded, obviously - it's possible to have user run into 5% limit and fail copyup, followed by root doing it successfully. 5% limit means the reserved are for a superuser on a filesystem, right? As far as I know, overlayfs (UnionMount too?) solves this problem as changing the task credential and the capability. But I am not sure whether it solves the similar problem around the resource limit like RLIMIT_CPU, RLIMIT_FSIZE or something. Another one: overwriting rename() vs. copyup. Similar to unlink() vs. copyup(). Hmm, do you mean that, just after the parent dir lock on the underlying fs, the copyup routine should confirm whether the target is still alive? If so, I agree. Another one: read-only open() vs. copyup(). Hell knows - we obviously don't want to open a partially copied file; might want to wait or pretend that this open() has come first and give it the underlying file. The same goes for stat() vs. copyup(). Exactly. Moreover users don't want to refer to the lower file which is obsoleted by copyup. FWIW, something like lock parent, -create(), -unlink(), unlock parent, copy data and metadata, lock parent, allocate a new dentry in covering layer and do -lookup() on it, verify that it is negative and not a whiteout, lock child, use -link() to put it into directory, unlock everything would probably DTRT for unlink/copyup and rename/copyup. The rest... hell knows; Please let me make sure. You are saying, - create the file on the upper layer - get the struct file object - hide it from users - before copying the file, unlock the parent in order to stop the long period locking - copy the file without the parent lock. it doesn't matter since the file is invisible to users. - confirm that the target name is still available for copyup - make the completed file visible by -link() It is ineteresting to -link() with the unlinked file. While vfs_unlink() rejects such case, it may not matter for the underlying fs. Need to verify FS including jounals. J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 2:38 AM, Al Viro v...@zeniv.linux.org.uk wrote: On Mon, Mar 18, 2013 at 11:01:03PM +, Al Viro wrote: I'm looking at the existing callers and I really wonder if we ought to push sb_start_write() from -splice_write()/-aio_write()/etc. into the callers. Something like file_start_write()/file_end_write(), with check for file being regular one might be a good starting point. As it is, copyup is really fucked both in unionmount and overlayfs... BTW, I wonder what's the right locking for that sucker; overlayfs is probably too heavy - we are talking about copying a file from one fs to another, which can obviously take quite a while, so holding -i_mutex on _parent_ all along is asking for very serious contention. Copy up is a once-in-a-lifetime event for an object. Optimizing it is way down in the list of things to do. I'd drop splice in a jiffy if it's in the way. Much more interesting question: what happens if we crash during a rename? Whiteout implemented in the filesystem won't save us. And the results are interesting: old versions of files become visible and similar fun. Far from likely to happen, but ... Add a rename-with-whiteout primitive on filesystems? That one is not going to be as simple as plain whiteout. Or? Thanks, Miklos -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
Miklos Szeredi mik...@szeredi.hu wrote: BTW, I wonder what's the right locking for that sucker; overlayfs is probably too heavy - we are talking about copying a file from one fs to another, which can obviously take quite a while, so holding -i_mutex on _parent_ all along is asking for very serious contention. Copy up is a once-in-a-lifetime event for an object. Optimizing it is way down in the list of things to do. I'd drop splice in a jiffy if it's in the way. Yes, but it could block the parent directory for a long time. I suspect it's fine if you can RCU walk through the parent, but if you have to grab a lock on it... David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 12:04 PM, David Howells dhowe...@redhat.com wrote: Miklos Szeredi mik...@szeredi.hu wrote: BTW, I wonder what's the right locking for that sucker; overlayfs is probably too heavy - we are talking about copying a file from one fs to another, which can obviously take quite a while, so holding -i_mutex on _parent_ all along is asking for very serious contention. Copy up is a once-in-a-lifetime event for an object. Optimizing it is way down in the list of things to do. I'd drop splice in a jiffy if it's in the way. Yes, but it could block the parent directory for a long time. I suspect it's fine if you can RCU walk through the parent, but if you have to grab a lock on it... Right. Lets look at it this way: users of an overlay accept that an operation X can take T time, where T is much longer than would be on a normal filesystem. Then why would they complain that operation Y (which happens to bump into the parent lock of X) also takes T? If copy up of huge files happens more then very very occasionally, then the overlay will be basically unusable anyway. It's just not what it is designed for, so why try to optimize this case? Thanks, Miklos -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 11:29:41AM +0100, Miklos Szeredi wrote: Copy up is a once-in-a-lifetime event for an object. Optimizing it is way down in the list of things to do. I'd drop splice in a jiffy if it's in the way. What makes you think that write is any better? Same deadlock there - check generic_file_aio_write(), it calls the same sb_start_write()... IOW, switching from splice to write won't help at all. Much more interesting question: what happens if we crash during a rename? Whiteout implemented in the filesystem won't save us. And the results are interesting: old versions of files become visible and similar fun. Far from likely to happen, but ... Add a rename-with-whiteout primitive on filesystems? That one is not going to be as simple as plain whiteout. Or? Umm... If/when we start caring about that kind of atomicity (and I agree that we ought to) overlayfs approach to whiteouts will actually have much harder time - it doesn't take much to teach a journalling fs how to do that kind of -rename() in a single transaction; the only question is how to tell it that we want to leave a whiteout behind us. Hell knows; one variant is to add a flag, of course. Another might be more interesting - we want some kind of directory is opaque flag, so if we start reshuffling the methods, we might try to merge unlink/rmdir/whiteout. Rules: * victim is negative = create a whiteout * victim is a directory, parent opaque = rmdir * victim is a non-directory, parent opaque = unlink * victim is positive, parent _not_ opaque = replace with whiteout * old_dir in case of -rename() is opaque = normal rename * old_dir in case of -rename() is not opaque = leave whiteout behind Non-unioned = opaque, of course (nothing showing through it). Getting good behaviour on rename interrupted by crash is going to be _very_ tricky with any strategy other than whiteouts-in-fs, AFAICS. Again, I have no problem whatsoever with changing the set of directory methods, as long as the replacement is sane. We'd done that kind of thing before and it's not a problem. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 6:03 PM, Al Viro v...@zeniv.linux.org.uk wrote: On Tue, Mar 19, 2013 at 11:29:41AM +0100, Miklos Szeredi wrote: Copy up is a once-in-a-lifetime event for an object. Optimizing it is way down in the list of things to do. I'd drop splice in a jiffy if it's in the way. What makes you think that write is any better? Same deadlock there - check generic_file_aio_write(), it calls the same sb_start_write()... IOW, switching from splice to write won't help at all. Okay, I missed that. Yeah, that needs fixing... Much more interesting question: what happens if we crash during a rename? Whiteout implemented in the filesystem won't save us. And the results are interesting: old versions of files become visible and similar fun. Far from likely to happen, but ... Add a rename-with-whiteout primitive on filesystems? That one is not going to be as simple as plain whiteout. Or? Umm... If/when we start caring about that kind of atomicity (and I agree that we ought to) overlayfs approach to whiteouts will actually have much harder time - it doesn't take much to teach a journalling fs how to do that kind of -rename() in a single transaction; the only question is how to tell it that we want to leave a whiteout behind us. Hell knows; one variant is to add a flag, of course. Another might be more interesting - we want some kind of directory is opaque flag, so if we start reshuffling the methods, we might try to merge unlink/rmdir/whiteout. Rules: * victim is negative = create a whiteout * victim is a directory, parent opaque = rmdir * victim is a non-directory, parent opaque = unlink * victim is positive, parent _not_ opaque = replace with whiteout * old_dir in case of -rename() is opaque = normal rename * old_dir in case of -rename() is not opaque = leave whiteout behind Non-unioned = opaque, of course (nothing showing through it). I dunnow. Overloading common paths with overlay/union specific things doesn't look very clean to me. I have a similar problem with union-mounts: it's hooking into lots of common paths in the VFS for the sake of a very specialized feature. Getting good behaviour on rename interrupted by crash is going to be _very_ tricky with any strategy other than whiteouts-in-fs, AFAICS. One idea is to add a journal to the overlay itself (yeah, namespace issues). Thanks, Miklos -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Mon 18-03-13 21:53:34, Al Viro wrote: On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote: IMO the deadlock is real. In freeze_super() we wait for all writers to the filesystem to finish while blocking beginning of any further writes. So we have a deadlock scenario like: THREAD1 THREAD2 THREAD3 mnt_want_write()mutex_lock(inode-i_mutex); ... freeze_super() block on mutex_lock(inode-i_mutex) sb_wait_write(sb, SB_FREEZE_WRITE); block in sb_start_write() The bug is on fsfreeze side and this is not the only problem related to it. I've missed the implications when I applied fs: Add freezing handling to mnt_want_write() / mnt_drop_write() last June ;-/ The thing is, until then mnt_want_write() used to be a counter; it could be nested. Now any such nesting is a deadlock you've just described. This is seriously wrong, IMO. Well, but sb_start_write() has to be blocking (blocks when fs is frozen) and you have to get it somewhere. It seems only natural to get the counter from original mnt_want_write() at the same place and use one function for that. Whether I should have changed the name from mnt_want_write() to something else is questionable... BTW, having sb_start_write() buried in individual -splice_write() is asking for trouble; could you describe the rules for that? E.g. where does it nest wrt filesystem-private locks? XFS iolock, for example... Generally, the freeze protection should be the outermost lock taken (so that we mitigate possibility of blocking readers when waiting for fs to unfreeze). So it ranks above i_mutex, or XFS' ilock and iolock. It seems that I screwed this up for -splice_write() :-| If we are going to move out sb_start_write() out of filesystems' hands into do_splice_from() then we should likely do the same with -aio_write(). Hmm? Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Mon 18-03-13 23:01:03, Al Viro wrote: On Mon, Mar 18, 2013 at 09:53:34PM +, Al Viro wrote: On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote: IMO the deadlock is real. In freeze_super() we wait for all writers to the filesystem to finish while blocking beginning of any further writes. So we have a deadlock scenario like: THREAD1 THREAD2 THREAD3 mnt_want_write() mutex_lock(inode-i_mutex); ... freeze_super() block on mutex_lock(inode-i_mutex) sb_wait_write(sb, SB_FREEZE_WRITE); block in sb_start_write() The bug is on fsfreeze side and this is not the only problem related to it. I've missed the implications when I applied fs: Add freezing handling to mnt_want_write() / mnt_drop_write() last June ;-/ The thing is, until then mnt_want_write() used to be a counter; it could be nested. Now any such nesting is a deadlock you've just described. This is seriously wrong, IMO. BTW, having sb_start_write() buried in individual -splice_write() is asking for trouble; could you describe the rules for that? E.g. where does it nest wrt filesystem-private locks? XFS iolock, for example... I'm looking at the existing callers and I really wonder if we ought to push sb_start_write() from -splice_write()/-aio_write()/etc. into the callers. Yeah, that should be OK. Something like file_start_write()/file_end_write(), with check for file being regular one might be a good starting point. As it is, copyup is really fucked both in unionmount and overlayfs... Makes sense. I can do the patch. BTW, for months I'm trying to push to you a patch which creates a function like file_start_write() which returns EAGAIN if the file is open with O_NONBLOCK and fs is frozen (this allows me to solve a deadlock with bsd process accounting to frozen fs). After this change the patch will become trivial so I'll add it to the series and hopefully it won't get ignored. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 07:32:42PM +0100, Miklos Szeredi wrote: * victim is negative = create a whiteout * victim is a directory, parent opaque = rmdir * victim is a non-directory, parent opaque = unlink * victim is positive, parent _not_ opaque = replace with whiteout * old_dir in case of -rename() is opaque = normal rename * old_dir in case of -rename() is not opaque = leave whiteout behind Non-unioned = opaque, of course (nothing showing through it). I dunnow. Overloading common paths with overlay/union specific things doesn't look very clean to me. We certainly can add a couple of new methods; the question is, would that buy us anything? This -rename_and_whiteout would be practically identical to -rename on most of filesystems; the only difference for something like ext* is that instead of deleting the old directory entry in the end we would change its inumber to 0 and type to 14. Everything else would be the same. Moreover, turning a positive entry into a whiteout differs from unlink/rmdir resp. in exactly the same way. FWIW, creating a whiteout from scratch might be better off as a separate method - it's closer to -mknod() and friends. But whiteout from positive and whiteout-on-rename probably isn't worth separating from the normal methods. All we end up overloading is actual removal of directory entry, which is a rather minor part of the logics in those. And merging -unlink/-rmdir is definitely a good idea, regardless of anything union-related; just look at the existing instances of both. Getting good behaviour on rename interrupted by crash is going to be _very_ tricky with any strategy other than whiteouts-in-fs, AFAICS. One idea is to add a journal to the overlay itself (yeah, namespace issues). ... and extra complexity from hell. If the underlying fs has a journal of its own, we are getting all kinds of interesting interplay between those; if not, replaying the journal in overlay will buy you nothing, since the damage to underlying objects makes it all moot. Layering is a very nice tool for quick prototyping; no arguments about that. It allows to avoid modifying $BIGNUM filesystems and for something that lives out of tree it's a very big benefit. Once the thing is merged, though, that benefit becomes much smaller; it still might make sense to implement something as a layer, but some parts of that sucker may be better off as fs primitives. Hell, we could, in theory, implement xattrs as a layer; just look at how reiserfs had done them. We could do the same for hardlinks (look at qnx4, if you wish to see that hack). Of for symlinks (sysvfs). Or for opened-and-unlinked files (sillyrename could be done as a generic layer). Or for permissions/ownership/arbitrary names (umsdos, and that _was_ very similar to layering). It's just that often an underlying fs has a better way of doing that. IMO whiteouts are in that class. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 09:25:43PM +0100, Jan Kara wrote: BTW, having sb_start_write() buried in individual -splice_write() is asking for trouble; could you describe the rules for that? E.g. where does it nest wrt filesystem-private locks? XFS iolock, for example... Generally, the freeze protection should be the outermost lock taken (so that we mitigate possibility of blocking readers when waiting for fs to unfreeze). So it ranks above i_mutex, or XFS' ilock and iolock. Welcome to deadlock, then: xfs_file_splice_write() ... xfs_ilock(ip, XFS_IOLOCK_EXCL); ... ret = generic_file_splice_write(pipe, outfilp, ppos, count, flags); It seems that I screwed this up for -splice_write() :-| If we are going to move out sb_start_write() out of filesystems' hands into do_splice_from() then we should likely do the same with -aio_write(). Hmm? Yes, I've a tentative patch doing just that; will push tonight. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 09:38:31PM +, Al Viro wrote: On Tue, Mar 19, 2013 at 09:25:43PM +0100, Jan Kara wrote: BTW, having sb_start_write() buried in individual -splice_write() is asking for trouble; could you describe the rules for that? E.g. where does it nest wrt filesystem-private locks? XFS iolock, for example... Generally, the freeze protection should be the outermost lock taken (so that we mitigate possibility of blocking readers when waiting for fs to unfreeze). So it ranks above i_mutex, or XFS' ilock and iolock. Welcome to deadlock, then: xfs_file_splice_write() ... xfs_ilock(ip, XFS_IOLOCK_EXCL); ... ret = generic_file_splice_write(pipe, outfilp, ppos, count, flags); It seems that I screwed this up for -splice_write() :-| If we are going to move out sb_start_write() out of filesystems' hands into do_splice_from() then we should likely do the same with -aio_write(). Hmm? ... and then there's ext4_file_dio_write(), which cheerfully ignores freeze. ... and udf expanding from inline files to separately allocated before it gets to sb_start_write() ... and a bunch of guys doing generic_write_sync() after generic_aio_file_write() ... and a lot of other fun stuff. Ouch... OK, it's going to be an interesting series - aforementioned tentative patch was badly incomplete ;-/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Tue, Mar 19, 2013 at 10:10:32PM +, Al Viro wrote: OK, it's going to be an interesting series - aforementioned tentative patch was badly incomplete ;-/ The interesting question is how far do we want to lift that. -aio_write() part is trivial - see vfs.git#experimental; the trouble begins with -splice_write(). For *everything* except default_file_splice_write(), lifting into the caller (do_splice_from()) is the right thing to do. default_file_splice_write(), however, it trickier; there we end up calling vfs_write() (via an ugly callchain). And _that_ is a real bitch. Granted, vfs_write() is somewhat an overkill there (we'd already done rw_verify_area() and access_ok() is pointless due to set_fs() we do around vfs_write() there) and we'd already lifted it up to do_sync_write(). But if we lift it any further, we'll need to deal with -write() callers in the tree. Current situation: fs/coredump.c:662: return access_ok(VERIFY_READ, addr, nr) file-f_op-write(file, addr, nr, file-f_pos) == nr; arch/powerpc/platforms/cell/spufs/coredump.c:63:written = file-f_op-write(file, addr, nr, file-f_pos); for these guys we might actually want to lift all way up to do_coredump() drivers/staging/comedi/drivers/serial2002.c:91: result = f-f_op-write(f, buf, count, f-f_pos); fs/autofs4/waitq.c:73: (wr = file-f_op-write(file,data,bytes,file-f_pos)) 0) { not regular files, unless I'm seriously misreading the code. kernel/acct.c:553: file-f_op-write(file, (char *)ac, BTW, this is probably where we want to deal with your acct deadlock. fs/compat.c:1103: fn = (io_fn_t)file-f_op-write; fs/read_write.c:435:ret = file-f_op-write(file, buf, count, pos); fs/read_write.c:732:fn = (io_fn_t)file-f_op-write; syscalls - the question here is whether we lift it up to vfs_write/vfs_writev/ compat_writev, or actually take it further. fs/cachefiles/rdwr.c:967: ret = file-f_op-write( cachefiles_write_page(); no fucking idea what locks might be held by caller and potentially that's a rather nasty source of PITA fs/coda/file.c:84: ret = host_file-f_op-write(host_file, buf, count, ppos); coda writing to file in cache on local fs. Potentially a nasty bugger, since it's hard to lift any further - the caller has no idea that the thing is on CODA, let alone what happens to hold the local cache. drivers/block/loop.c:234: bw = file-f_op-write(file, buf, len, pos); do_bio_filebacked(), with some ugliness between that and callsite. Note, BTW, that we have a pair of possible vfs_fsync() calls in there; how do those interact with freeze? This does *not* touch the current callers of vfs_write()/vfs_writev(); any of those called while holding -i_mutex on a directory (or mnt_want_write(), for that matter) is a deadlock right now. And we'd better start thinking about how we'll backport that crap - deadlock in e.g. xfs -splice_write() had been there since last summer ;-/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Mon, Mar 18, 2013 at 11:01:03PM +, Al Viro wrote: > I'm looking at the existing callers and I really wonder if we ought to > push sb_start_write() from ->splice_write()/->aio_write()/etc. into the > callers. > > Something like file_start_write()/file_end_write(), with check for file > being regular one might be a good starting point. As it is, copyup is > really fucked both in unionmount and overlayfs... BTW, I wonder what's the right locking for that sucker; overlayfs is probably too heavy - we are talking about copying a file from one fs to another, which can obviously take quite a while, so holding ->i_mutex on _parent_ all along is asking for very serious contention. OTOH, there's a pile of unpleasant races that need to be dealt with; consider e.g. chmod("lower_file", 0644) vs. unlink("lower_file"). The former means creating a copy of lower_file in the covering layer (and chmod of that copy once it's finished). The latter means creating a whiteout in the covering layer. No matter which comes first, the result *must* be whiteout in directory + no stray copies left in covering layer. chmod() may legitimately return -ENOENT or 0, but resulting state of fs is unambiguous. Holding ->i_mutex on parent would, of course, suffice to serialize them, but it's not particulary nice thing to do wrt contention. Another fun issue is copyup vs. copyup - we want to sit and wait for copyup attempt in progress to complete, rather than start another one in parallel. And whoever comes the second must check if copyup has succeeded, obviously - it's possible to have user run into 5% limit and fail copyup, followed by root doing it successfully. Another one: overwriting rename() vs. copyup. Similar to unlink() vs. copyup(). Another one: read-only open() vs. copyup(). Hell knows - we obviously don't want to open a partially copied file; might want to wait or pretend that this open() has come first and give it the underlying file. The same goes for stat() vs. copyup(). FWIW, something like "lock parent, ->create(), ->unlink(), unlock parent, copy data and metadata, lock parent, allocate a new dentry in covering layer and do ->lookup() on it, verify that it is negative and not a whiteout, lock child, use ->link() to put it into directory, unlock everything" would probably DTRT for unlink/copyup and rename/copyup. The rest... hell knows; ->i_mutex on child is a non-starter, since we couldn't use the normal ->write() or ->splice_write() under it. One possibility is to allocate a structure for copyup in progress, make it easily located (hash by lower dentry/upper sb pair) and serialize on that. Hell knows... One potential unpleasantness here is dcache lookup coming between ->create() and ->unlink(); OTOH, there had been fairly reasonable requests for something like atomic combination of open and unlink and it's not like _that_ would be hard to implement as a new method - pretty much everything local could just take part of ->create() and that would be it. Which might make sense on its own - open() flag creating a new temporary file, unlinked from the very beginning and thus killed when we close the damn thing. Objections? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Mon, Mar 18, 2013 at 09:53:34PM +, Al Viro wrote: > On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote: > > IMO the deadlock is real. In freeze_super() we wait for all writers to > > the filesystem to finish while blocking beginning of any further writes. So > > we have a deadlock scenario like: > > > > THREAD1 THREAD2 THREAD3 > > mnt_want_write()mutex_lock(>i_mutex); > > ... freeze_super() > > block on mutex_lock(>i_mutex) > > sb_wait_write(sb, > > SB_FREEZE_WRITE); > > block in sb_start_write() > > The bug is on fsfreeze side and this is not the only problem related to it. > I've missed the implications when I applied "fs: Add freezing handling > to mnt_want_write() / mnt_drop_write()" last June ;-/ > > The thing is, until then mnt_want_write() used to be a counter; it could be > nested. Now any such nesting is a deadlock you've just described. This > is seriously wrong, IMO. > > BTW, having sb_start_write() buried in individual ->splice_write() is > asking for trouble; could you describe the rules for that? E.g. where > does it nest wrt filesystem-private locks? XFS iolock, for example... I'm looking at the existing callers and I really wonder if we ought to push sb_start_write() from ->splice_write()/->aio_write()/etc. into the callers. Something like file_start_write()/file_end_write(), with check for file being regular one might be a good starting point. As it is, copyup is really fucked both in unionmount and overlayfs... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote: > IMO the deadlock is real. In freeze_super() we wait for all writers to > the filesystem to finish while blocking beginning of any further writes. So > we have a deadlock scenario like: > > THREAD1 THREAD2 THREAD3 > mnt_want_write() mutex_lock(>i_mutex); > ... freeze_super() > block on mutex_lock(>i_mutex) > sb_wait_write(sb, > SB_FREEZE_WRITE); > block in sb_start_write() The bug is on fsfreeze side and this is not the only problem related to it. I've missed the implications when I applied "fs: Add freezing handling to mnt_want_write() / mnt_drop_write()" last June ;-/ The thing is, until then mnt_want_write() used to be a counter; it could be nested. Now any such nesting is a deadlock you've just described. This is seriously wrong, IMO. BTW, having sb_start_write() buried in individual ->splice_write() is asking for trouble; could you describe the rules for that? E.g. where does it nest wrt filesystem-private locks? XFS iolock, for example... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Sun 17-03-13 13:06:59, David Howells wrote: > Miklos Szeredi wrote: > > > Export do_splice_direct() to modules. Needed by overlay filesystem. > > Apparently you cannot call this from any function that is holding an i_mutex > if the target of the splice uses generic_file_splice_write(). > > The problem is a potential deadlock situation: > > We have places already that do: > > mnt_want_write() > mutex_lock() > > This can be found in do_last() for example. > > However, mnt_want_write() calls sb_start_write() as does > generic_file_splice_write(). So now in ovl_copy_up_locked() you're adding: > > mutex_lock() > sb_start_write() > > which lockdep reports as a potential ABBA deadlock. > > Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock > might operate, so it's possible that this is a false alarm. Maybe Jan Kara > can > illuminate further, so I've added him to the cc list. IMO the deadlock is real. In freeze_super() we wait for all writers to the filesystem to finish while blocking beginning of any further writes. So we have a deadlock scenario like: THREAD1 THREAD2 THREAD3 mnt_want_write()mutex_lock(>i_mutex); ... freeze_super() block on mutex_lock(>i_mutex) sb_wait_write(sb, SB_FREEZE_WRITE); block in sb_start_write() Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Sun 17-03-13 13:06:59, David Howells wrote: Miklos Szeredi mik...@szeredi.hu wrote: Export do_splice_direct() to modules. Needed by overlay filesystem. Apparently you cannot call this from any function that is holding an i_mutex if the target of the splice uses generic_file_splice_write(). The problem is a potential deadlock situation: We have places already that do: mnt_want_write() mutex_lock() This can be found in do_last() for example. However, mnt_want_write() calls sb_start_write() as does generic_file_splice_write(). So now in ovl_copy_up_locked() you're adding: mutex_lock() sb_start_write() which lockdep reports as a potential ABBA deadlock. Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock might operate, so it's possible that this is a false alarm. Maybe Jan Kara can illuminate further, so I've added him to the cc list. IMO the deadlock is real. In freeze_super() we wait for all writers to the filesystem to finish while blocking beginning of any further writes. So we have a deadlock scenario like: THREAD1 THREAD2 THREAD3 mnt_want_write()mutex_lock(inode-i_mutex); ... freeze_super() block on mutex_lock(inode-i_mutex) sb_wait_write(sb, SB_FREEZE_WRITE); block in sb_start_write() Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote: IMO the deadlock is real. In freeze_super() we wait for all writers to the filesystem to finish while blocking beginning of any further writes. So we have a deadlock scenario like: THREAD1 THREAD2 THREAD3 mnt_want_write() mutex_lock(inode-i_mutex); ... freeze_super() block on mutex_lock(inode-i_mutex) sb_wait_write(sb, SB_FREEZE_WRITE); block in sb_start_write() The bug is on fsfreeze side and this is not the only problem related to it. I've missed the implications when I applied fs: Add freezing handling to mnt_want_write() / mnt_drop_write() last June ;-/ The thing is, until then mnt_want_write() used to be a counter; it could be nested. Now any such nesting is a deadlock you've just described. This is seriously wrong, IMO. BTW, having sb_start_write() buried in individual -splice_write() is asking for trouble; could you describe the rules for that? E.g. where does it nest wrt filesystem-private locks? XFS iolock, for example... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Mon, Mar 18, 2013 at 09:53:34PM +, Al Viro wrote: On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote: IMO the deadlock is real. In freeze_super() we wait for all writers to the filesystem to finish while blocking beginning of any further writes. So we have a deadlock scenario like: THREAD1 THREAD2 THREAD3 mnt_want_write()mutex_lock(inode-i_mutex); ... freeze_super() block on mutex_lock(inode-i_mutex) sb_wait_write(sb, SB_FREEZE_WRITE); block in sb_start_write() The bug is on fsfreeze side and this is not the only problem related to it. I've missed the implications when I applied fs: Add freezing handling to mnt_want_write() / mnt_drop_write() last June ;-/ The thing is, until then mnt_want_write() used to be a counter; it could be nested. Now any such nesting is a deadlock you've just described. This is seriously wrong, IMO. BTW, having sb_start_write() buried in individual -splice_write() is asking for trouble; could you describe the rules for that? E.g. where does it nest wrt filesystem-private locks? XFS iolock, for example... I'm looking at the existing callers and I really wonder if we ought to push sb_start_write() from -splice_write()/-aio_write()/etc. into the callers. Something like file_start_write()/file_end_write(), with check for file being regular one might be a good starting point. As it is, copyup is really fucked both in unionmount and overlayfs... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Mon, Mar 18, 2013 at 11:01:03PM +, Al Viro wrote: I'm looking at the existing callers and I really wonder if we ought to push sb_start_write() from -splice_write()/-aio_write()/etc. into the callers. Something like file_start_write()/file_end_write(), with check for file being regular one might be a good starting point. As it is, copyup is really fucked both in unionmount and overlayfs... BTW, I wonder what's the right locking for that sucker; overlayfs is probably too heavy - we are talking about copying a file from one fs to another, which can obviously take quite a while, so holding -i_mutex on _parent_ all along is asking for very serious contention. OTOH, there's a pile of unpleasant races that need to be dealt with; consider e.g. chmod(lower_file, 0644) vs. unlink(lower_file). The former means creating a copy of lower_file in the covering layer (and chmod of that copy once it's finished). The latter means creating a whiteout in the covering layer. No matter which comes first, the result *must* be whiteout in directory + no stray copies left in covering layer. chmod() may legitimately return -ENOENT or 0, but resulting state of fs is unambiguous. Holding -i_mutex on parent would, of course, suffice to serialize them, but it's not particulary nice thing to do wrt contention. Another fun issue is copyup vs. copyup - we want to sit and wait for copyup attempt in progress to complete, rather than start another one in parallel. And whoever comes the second must check if copyup has succeeded, obviously - it's possible to have user run into 5% limit and fail copyup, followed by root doing it successfully. Another one: overwriting rename() vs. copyup. Similar to unlink() vs. copyup(). Another one: read-only open() vs. copyup(). Hell knows - we obviously don't want to open a partially copied file; might want to wait or pretend that this open() has come first and give it the underlying file. The same goes for stat() vs. copyup(). FWIW, something like lock parent, -create(), -unlink(), unlock parent, copy data and metadata, lock parent, allocate a new dentry in covering layer and do -lookup() on it, verify that it is negative and not a whiteout, lock child, use -link() to put it into directory, unlock everything would probably DTRT for unlink/copyup and rename/copyup. The rest... hell knows; -i_mutex on child is a non-starter, since we couldn't use the normal -write() or -splice_write() under it. One possibility is to allocate a structure for copyup in progress, make it easily located (hash by lower dentry/upper sb pair) and serialize on that. Hell knows... One potential unpleasantness here is dcache lookup coming between -create() and -unlink(); OTOH, there had been fairly reasonable requests for something like atomic combination of open and unlink and it's not like _that_ would be hard to implement as a new method - pretty much everything local could just take part of -create() and that would be it. Which might make sense on its own - open() flag creating a new temporary file, unlinked from the very beginning and thus killed when we close the damn thing. Objections? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Sun, Mar 17, 2013 at 01:06:59PM +, David Howells wrote: > Miklos Szeredi wrote: > > > Export do_splice_direct() to modules. Needed by overlay filesystem. > > Apparently you cannot call this from any function that is holding an i_mutex > if the target of the splice uses generic_file_splice_write(). > > The problem is a potential deadlock situation: > > We have places already that do: > > mnt_want_write() > mutex_lock() > > This can be found in do_last() for example. > > However, mnt_want_write() calls sb_start_write() as does > generic_file_splice_write(). So now in ovl_copy_up_locked() you're adding: > > mutex_lock() > sb_start_write() > > which lockdep reports as a potential ABBA deadlock. > > Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock > might operate, so it's possible that this is a false alarm. Maybe Jan Kara > can > illuminate further, so I've added him to the cc list. > > I've attached the report I got with unionmount. There's plenty of problems with splice locking that can lead to deadlocks. Here's another that's been known for ages: http://oss.sgi.com/archives/xfs/2011-08/msg00168.html Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
Miklos Szeredi wrote: > Export do_splice_direct() to modules. Needed by overlay filesystem. Apparently you cannot call this from any function that is holding an i_mutex if the target of the splice uses generic_file_splice_write(). The problem is a potential deadlock situation: We have places already that do: mnt_want_write() mutex_lock() This can be found in do_last() for example. However, mnt_want_write() calls sb_start_write() as does generic_file_splice_write(). So now in ovl_copy_up_locked() you're adding: mutex_lock() sb_start_write() which lockdep reports as a potential ABBA deadlock. Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock might operate, so it's possible that this is a false alarm. Maybe Jan Kara can illuminate further, so I've added him to the cc list. I've attached the report I got with unionmount. David --- [ INFO: possible recursive locking detected ] 3.9.0-rc1-fsdevel+ #934 Not tainted - fs-op/4476 is trying to acquire lock: (sb_writers#4){.+.+.+}, at: [] generic_file_splice_write+0x5d/0x14b but task is already holding lock: (sb_writers#4){.+.+.+}, at: [] mnt_want_write+0x1f/0x46 other info that might help us debug this: Possible unsafe locking scenario: CPU0 lock(sb_writers#4); lock(sb_writers#4); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by fs-op/4476: #0: (sb_writers#4){.+.+.+}, at: [] mnt_want_write+0x1f/0x46 #1: (>i_mutex_dir_key[1]){+.+.+.}, at: [] __union_copy_up+0x9a/0x132 stack backtrace: Pid: 4476, comm: fs-op Not tainted 3.9.0-rc1-fsdevel+ #934 Call Trace: [] __lock_acquire+0x86a/0x16cf [] ? page_cache_pipe_buf_release+0x1b/0x1b [] lock_acquire+0x57/0x6d [] ? generic_file_splice_write+0x5d/0x14b [] __sb_start_write+0x10d/0x15d [] ? generic_file_splice_write+0x5d/0x14b [] generic_file_splice_write+0x5d/0x14b [] do_splice_from+0x74/0x91 [] direct_splice_actor+0x1e/0x20 [] splice_direct_to_actor+0xc2/0x17e [] ? do_splice_from+0x91/0x91 [] do_splice_direct+0x47/0x5a [] __union_copy_up_locked+0x171/0x2b2 [] __union_copy_up+0xea/0x132 [] vfs_truncate+0x15e/0x289 [] do_sys_truncate+0x46/0x83 [] sys_truncate+0x9/0xb [] system_call_fastpath+0x16/0x1b -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
Miklos Szeredi mik...@szeredi.hu wrote: Export do_splice_direct() to modules. Needed by overlay filesystem. Apparently you cannot call this from any function that is holding an i_mutex if the target of the splice uses generic_file_splice_write(). The problem is a potential deadlock situation: We have places already that do: mnt_want_write() mutex_lock() This can be found in do_last() for example. However, mnt_want_write() calls sb_start_write() as does generic_file_splice_write(). So now in ovl_copy_up_locked() you're adding: mutex_lock() sb_start_write() which lockdep reports as a potential ABBA deadlock. Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock might operate, so it's possible that this is a false alarm. Maybe Jan Kara can illuminate further, so I've added him to the cc list. I've attached the report I got with unionmount. David --- [ INFO: possible recursive locking detected ] 3.9.0-rc1-fsdevel+ #934 Not tainted - fs-op/4476 is trying to acquire lock: (sb_writers#4){.+.+.+}, at: [811087a4] generic_file_splice_write+0x5d/0x14b but task is already holding lock: (sb_writers#4){.+.+.+}, at: [810ff97c] mnt_want_write+0x1f/0x46 other info that might help us debug this: Possible unsafe locking scenario: CPU0 lock(sb_writers#4); lock(sb_writers#4); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by fs-op/4476: #0: (sb_writers#4){.+.+.+}, at: [810ff97c] mnt_want_write+0x1f/0x46 #1: (type-i_mutex_dir_key[1]){+.+.+.}, at: [81131c74] __union_copy_up+0x9a/0x132 stack backtrace: Pid: 4476, comm: fs-op Not tainted 3.9.0-rc1-fsdevel+ #934 Call Trace: [81070398] __lock_acquire+0x86a/0x16cf [811081cc] ? page_cache_pipe_buf_release+0x1b/0x1b [810715e2] lock_acquire+0x57/0x6d [811087a4] ? generic_file_splice_write+0x5d/0x14b [810e3314] __sb_start_write+0x10d/0x15d [811087a4] ? generic_file_splice_write+0x5d/0x14b [811087a4] generic_file_splice_write+0x5d/0x14b [811083d5] do_splice_from+0x74/0x91 [81108410] direct_splice_actor+0x1e/0x20 [8110868b] splice_direct_to_actor+0xc2/0x17e [811083f2] ? do_splice_from+0x91/0x91 [8110999d] do_splice_direct+0x47/0x5a [81131a99] __union_copy_up_locked+0x171/0x2b2 [81131cc4] __union_copy_up+0xea/0x132 [810e02ca] vfs_truncate+0x15e/0x289 [810e043b] do_sys_truncate+0x46/0x83 [810e05cf] sys_truncate+0x9/0xb [81456f92] system_call_fastpath+0x16/0x1b -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Sun, Mar 17, 2013 at 01:06:59PM +, David Howells wrote: Miklos Szeredi mik...@szeredi.hu wrote: Export do_splice_direct() to modules. Needed by overlay filesystem. Apparently you cannot call this from any function that is holding an i_mutex if the target of the splice uses generic_file_splice_write(). The problem is a potential deadlock situation: We have places already that do: mnt_want_write() mutex_lock() This can be found in do_last() for example. However, mnt_want_write() calls sb_start_write() as does generic_file_splice_write(). So now in ovl_copy_up_locked() you're adding: mutex_lock() sb_start_write() which lockdep reports as a potential ABBA deadlock. Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock might operate, so it's possible that this is a false alarm. Maybe Jan Kara can illuminate further, so I've added him to the cc list. I've attached the report I got with unionmount. There's plenty of problems with splice locking that can lead to deadlocks. Here's another that's been known for ages: http://oss.sgi.com/archives/xfs/2011-08/msg00168.html Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Wed, 13 Mar 2013 15:16:26 +0100 Miklos Szeredi wrote: > From: Miklos Szeredi > > Export do_splice_direct() to modules. Needed by overlay filesystem. > > Signed-off-by: Miklos Szeredi > --- > fs/splice.c |1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/splice.c b/fs/splice.c > index 718bd00..0e8f44a 100644 > --- a/fs/splice.c > +++ b/fs/splice.c > @@ -1308,6 +1308,7 @@ long do_splice_direct(struct file *in, loff_t *ppos, > struct file *out, > > return ret; > } > +EXPORT_SYMBOL(do_splice_direct); _GPL? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Wed, 13 Mar 2013 15:16:26 +0100 Miklos Szeredi mik...@szeredi.hu wrote: From: Miklos Szeredi mszer...@suse.cz Export do_splice_direct() to modules. Needed by overlay filesystem. Signed-off-by: Miklos Szeredi mszer...@suse.cz --- fs/splice.c |1 + 1 file changed, 1 insertion(+) diff --git a/fs/splice.c b/fs/splice.c index 718bd00..0e8f44a 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1308,6 +1308,7 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out, return ret; } +EXPORT_SYMBOL(do_splice_direct); _GPL? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/