Re: [RFC][PATCHSET] sorting out RCU-delayed stuff in ->destroy_inode()
On Tue, Apr 30, 2019 at 05:00:43AM +0100, Al Viro wrote: > Where would you put that synchronize_rcu()? Doing that before ->put_super() > is too early - inode references might be dropped in there. OTOH, doing > that after that point means that while struct super_block itself will be > there, any number of data structures hanging from it might be not. > > So we are still very limited in what we can do inside ->free_inode() > instance *and* we get bunch of synchronize_rcu() for no good reason. > > Note that for normal lockless accesses (lockless ->d_revalidate(), ->d_hash(), > etc.) we are just fine with having struct super_block freeing RCU-delayed > (along with any data structures we might need) - the superblock had > been seen at some point after we'd taken rcu_read_lock(), so its > freeing won't happen until we drop it. So we don't need synchronize_rcu() > for that. > > Here the problem is that we are dealing with another RCU callback; > synchronize_rcu() would be needed for it, but it will only protect that > intermediate dereference of ->i_sb; any rcu-delayed stuff scheduled > from inside ->put_super() would not be ordered wrt ->free_inode(). > And if we are doing that just for the sake of that one dereference, > we might as well do it before scheduling i_callback(). > > PS: we *are* guaranteed that module will still be there > (unregister_filesystem() > does synchronize_rcu() and rcu_barrier() is done before kmem_cache_destroy() > in assorted exit_foo_fs()). After playing with that for a while, I think that adding barriers on superblock freeing (or shutdown) should wait, assuming we do them at all. Right now no ->free_inode() instances look at superblock or anything associated with it; moreover, there's no good candidate code that could be moved there and would benefit from such access. So we don't have any material to see what could be useful to protect. Access to ->i_sb->s_op->free_inode itself is the only exception and moving that to before the rcu delay is both less invasive and a _lot_ more robust than playing with synchronize_rcu(). We can do that without growing struct inode or storing it for long periods - ->i_fop is only accessed for struct inode with positive refcount, so we can put that into anon union with the ->free_inode value, setting it just before we schedule execution of i_callback() (and before the direct call of the same in alloc_inode() failure exit). IMO the following is the sane incremental for the coming window purposes; if we get a convincing case for ->free_inode() doing something that could benefit from being ordered wrt parts of fs shutdown, we can always deal with synchronize_rcu() later. Existing instances will be fine, and IMO separating RCU-delayed parts of inode destruction from the rest is worthwhile on its own. Objections? diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index 9d80f9e0855e..b8d3ddd8b8db 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -655,3 +655,11 @@ in your dentry operations instead. * if ->free_inode() is non-NULL, it gets scheduled by call_rcu() * combination of NULL ->destroy_inode and NULL ->free_inode is treated as NULL/free_inode_nonrcu, to preserve the compatibility. + + Note that the callback (be it via ->free_inode() or explicit call_rcu() + in ->destroy_inode()) is *NOT* ordered wrt superblock destruction; + as the matter of fact, the superblock and all associated structures + might be already gone. The filesystem driver is guaranteed to be still + there, but that's it. Freeing memory in the callback is fine; doing + more than that is possible, but requires a lot of care and is best + avoided. diff --git a/fs/inode.c b/fs/inode.c index fb45590d284e..627e1766503a 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -211,8 +211,8 @@ EXPORT_SYMBOL(free_inode_nonrcu); static void i_callback(struct rcu_head *head) { struct inode *inode = container_of(head, struct inode, i_rcu); - if (inode->i_sb->s_op->free_inode) - inode->i_sb->s_op->free_inode(inode); + if (inode->free_inode) + inode->free_inode(inode); else free_inode_nonrcu(inode); } @@ -236,6 +236,7 @@ static struct inode *alloc_inode(struct super_block *sb) if (!ops->free_inode) return NULL; } + inode->free_inode = ops->free_inode; i_callback(>i_rcu); return NULL; } @@ -276,6 +277,7 @@ static void destroy_inode(struct inode *inode) if (!ops->free_inode) return; } + inode->free_inode = ops->free_inode; call_rcu(>i_rcu, i_callback); } diff --git a/include/linux/fs.h b/include/linux/fs.h index 2e9b9f87caca..92732286b748 100644 --- a/include/linux/fs.h
Re: [RFC][PATCHSET] sorting out RCU-delayed stuff in ->destroy_inode()
> On Apr 29, 2019, at 10:26 PM, Al Viro wrote: > > On Mon, Apr 29, 2019 at 10:18:04PM -0600, Andreas Dilger wrote: >>> >>> void*i_private; /* fs or device private pointer */ >>> + void (*free_inode)(struct inode *); >> >> It seems like a waste to increase the size of every struct inode just to >> access >> a static pointer. Is this the only place that ->free_inode() is called? Why >> not move the ->free_inode() pointer into inode->i_fop->free_inode() so that >> it >> is still directly accessible at this point. > > i_op, surely? Yes, i_op is what I was thinking. > In any case, increasing sizeof(struct inode) is not a problem - > if anything, I'd turn ->i_fop into an anon union with that. As in, > > diff --git a/fs/inode.c b/fs/inode.c > index fb45590d284e..627e1766503a 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -211,8 +211,8 @@ EXPORT_SYMBOL(free_inode_nonrcu); > static void i_callback(struct rcu_head *head) > { > struct inode *inode = container_of(head, struct inode, i_rcu); > - if (inode->i_sb->s_op->free_inode) > - inode->i_sb->s_op->free_inode(inode); > + if (inode->free_inode) > + inode->free_inode(inode); > else > free_inode_nonrcu(inode); > } > @@ -236,6 +236,7 @@ static struct inode *alloc_inode(struct super_block *sb) > if (!ops->free_inode) > return NULL; > } > + inode->free_inode = ops->free_inode; > i_callback(>i_rcu); > return NULL; > } > @@ -276,6 +277,7 @@ static void destroy_inode(struct inode *inode) > if (!ops->free_inode) > return; > } > + inode->free_inode = ops->free_inode; > call_rcu(>i_rcu, i_callback); > } This seems like kind of a hack. I guess your goal is to have ->free_inode accessible regardless of whether the filesystem has installed its own ->i_op methods or not, and i_fop is no longer used by this point. That said, this seems better than increasing the size of struct inode. > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 2e9b9f87caca..92732286b748 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -694,7 +694,10 @@ struct inode { > #ifdef CONFIG_IMA > atomic_ti_readcount; /* struct files open RO */ > #endif > - const struct file_operations*i_fop; /* former > ->i_op->default_file_ops */ > + union { > + const struct file_operations*i_fop; /* former > ->i_op->default_file_ops */ > + void (*free_inode)(struct inode *); > + }; Cheers, Andreas signature.asc Description: Message signed with OpenPGP
Re: [RFC][PATCHSET] sorting out RCU-delayed stuff in ->destroy_inode()
On Mon, Apr 29, 2019 at 10:18:04PM -0600, Andreas Dilger wrote: > > > > void*i_private; /* fs or device private pointer */ > > + void (*free_inode)(struct inode *); > > It seems like a waste to increase the size of every struct inode just to > access > a static pointer. Is this the only place that ->free_inode() is called? Why > not move the ->free_inode() pointer into inode->i_fop->free_inode() so that it > is still directly accessible at this point. i_op, surely? In any case, increasing sizeof(struct inode) is not a problem - if anything, I'd turn ->i_fop into an anon union with that. As in, diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index 9d80f9e0855e..b8d3ddd8b8db 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -655,3 +655,11 @@ in your dentry operations instead. * if ->free_inode() is non-NULL, it gets scheduled by call_rcu() * combination of NULL ->destroy_inode and NULL ->free_inode is treated as NULL/free_inode_nonrcu, to preserve the compatibility. + + Note that the callback (be it via ->free_inode() or explicit call_rcu() + in ->destroy_inode()) is *NOT* ordered wrt superblock destruction; + as the matter of fact, the superblock and all associated structures + might be already gone. The filesystem driver is guaranteed to be still + there, but that's it. Freeing memory in the callback is fine; doing + more than that is possible, but requires a lot of care and is best + avoided. diff --git a/fs/inode.c b/fs/inode.c index fb45590d284e..627e1766503a 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -211,8 +211,8 @@ EXPORT_SYMBOL(free_inode_nonrcu); static void i_callback(struct rcu_head *head) { struct inode *inode = container_of(head, struct inode, i_rcu); - if (inode->i_sb->s_op->free_inode) - inode->i_sb->s_op->free_inode(inode); + if (inode->free_inode) + inode->free_inode(inode); else free_inode_nonrcu(inode); } @@ -236,6 +236,7 @@ static struct inode *alloc_inode(struct super_block *sb) if (!ops->free_inode) return NULL; } + inode->free_inode = ops->free_inode; i_callback(>i_rcu); return NULL; } @@ -276,6 +277,7 @@ static void destroy_inode(struct inode *inode) if (!ops->free_inode) return; } + inode->free_inode = ops->free_inode; call_rcu(>i_rcu, i_callback); } diff --git a/include/linux/fs.h b/include/linux/fs.h index 2e9b9f87caca..92732286b748 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -694,7 +694,10 @@ struct inode { #ifdef CONFIG_IMA atomic_ti_readcount; /* struct files open RO */ #endif - const struct file_operations*i_fop; /* former ->i_op->default_file_ops */ + union { + const struct file_operations*i_fop; /* former ->i_op->default_file_ops */ + void (*free_inode)(struct inode *); + }; struct file_lock_context*i_flctx; struct address_spacei_data; struct list_headi_devices;
Re: [RFC][PATCHSET] sorting out RCU-delayed stuff in ->destroy_inode()
On Apr 29, 2019, at 9:09 PM, Al Viro wrote: > > On Tue, Apr 16, 2019 at 11:01:16AM -0700, Linus Torvalds wrote: >> >> I only skimmed through the actual filesystem (and one networking) >> patches, but they looked like trivial conversions to a better >> interface. > > ... except that this callback can (and always could) get executed after > freeing struct super_block. So we can't just dereference ->i_sb->s_op > and expect to survive; the table ->s_op pointed to will still be there, > but ->i_sb might very well have been freed, with all its contents overwritten. > We need to copy the callback into struct inode itself, unfortunately. > The following incremental fixes it; I'm going to fold it into the first > commit in there. > > diff --git a/fs/inode.c b/fs/inode.c > index fb45590d284e..855dad43b11d 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -164,6 +164,7 @@ int inode_init_always(struct super_block *sb, struct > inode *inode) > inode->i_wb_frn_avg_time = 0; > inode->i_wb_frn_history = 0; > #endif > + inode->free_inode = sb->s_op->free_inode; > > if (security_inode_alloc(inode)) > goto out; > @@ -211,8 +212,8 @@ EXPORT_SYMBOL(free_inode_nonrcu); > static void i_callback(struct rcu_head *head) > { > struct inode *inode = container_of(head, struct inode, i_rcu); > - if (inode->i_sb->s_op->free_inode) > - inode->i_sb->s_op->free_inode(inode); > + if (inode->free_inode) > + inode->free_inode(inode); > else > free_inode_nonrcu(inode); > } > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 2e9b9f87caca..5ed6b39e588e 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -718,6 +718,7 @@ struct inode { > #endif > > void*i_private; /* fs or device private pointer */ > + void (*free_inode)(struct inode *); It seems like a waste to increase the size of every struct inode just to access a static pointer. Is this the only place that ->free_inode() is called? Why not move the ->free_inode() pointer into inode->i_fop->free_inode() so that it is still directly accessible at this point. Cheers, Andreas signature.asc Description: Message signed with OpenPGP
Re: [RFC][PATCHSET] sorting out RCU-delayed stuff in ->destroy_inode()
On Mon, Apr 29, 2019 at 08:37:29PM -0700, Linus Torvalds wrote: > On Mon, Apr 29, 2019, 20:09 Al Viro wrote: > > > > > ... except that this callback can (and always could) get executed after > > freeing struct super_block. > > > > Ugh. > > That food looks nasty. Shouldn't the super block freeing wait for the > filesystem to be all done instead? Do a rcu synchronization or something? > > Adding that pointer looks really wrong to me. I'd much rather delay the sb > freeing. Is there some reason that can't be done that I'm missing? Where would you put that synchronize_rcu()? Doing that before ->put_super() is too early - inode references might be dropped in there. OTOH, doing that after that point means that while struct super_block itself will be there, any number of data structures hanging from it might be not. So we are still very limited in what we can do inside ->free_inode() instance *and* we get bunch of synchronize_rcu() for no good reason. Note that for normal lockless accesses (lockless ->d_revalidate(), ->d_hash(), etc.) we are just fine with having struct super_block freeing RCU-delayed (along with any data structures we might need) - the superblock had been seen at some point after we'd taken rcu_read_lock(), so its freeing won't happen until we drop it. So we don't need synchronize_rcu() for that. Here the problem is that we are dealing with another RCU callback; synchronize_rcu() would be needed for it, but it will only protect that intermediate dereference of ->i_sb; any rcu-delayed stuff scheduled from inside ->put_super() would not be ordered wrt ->free_inode(). And if we are doing that just for the sake of that one dereference, we might as well do it before scheduling i_callback(). PS: we *are* guaranteed that module will still be there (unregister_filesystem() does synchronize_rcu() and rcu_barrier() is done before kmem_cache_destroy() in assorted exit_foo_fs()).
Re: [RFC][PATCHSET] sorting out RCU-delayed stuff in ->destroy_inode()
On Tue, Apr 16, 2019 at 11:01:16AM -0700, Linus Torvalds wrote: > On Tue, Apr 16, 2019 at 10:49 AM Al Viro wrote: > > > > 83 files changed, 241 insertions(+), 516 deletions(-) > > I think this single line is pretty convincing on its own. Ignoring > docs and fs/inode.c, we have > > 80 files changed, 190 insertions(+), 494 deletions(-) > > IOW, just over 300 lines of boiler plate code removed. > > The additions are > > - Ten more lines of actual code in fs/inode.c (and that's not > actually added complexity, it looks simpler if anything - most of it > is the new "i_callback()" helper function) > > - 19 lines of doc updates. > > So it absolutely looks fine to me. > > I only skimmed through the actual filesystem (and one networking) > patches, but they looked like trivial conversions to a better > interface. ... except that this callback can (and always could) get executed after freeing struct super_block. So we can't just dereference ->i_sb->s_op and expect to survive; the table ->s_op pointed to will still be there, but ->i_sb might very well have been freed, with all its contents overwritten. We need to copy the callback into struct inode itself, unfortunately. The following incremental fixes it; I'm going to fold it into the first commit in there. diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index 9d80f9e0855e..b8d3ddd8b8db 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -655,3 +655,11 @@ in your dentry operations instead. * if ->free_inode() is non-NULL, it gets scheduled by call_rcu() * combination of NULL ->destroy_inode and NULL ->free_inode is treated as NULL/free_inode_nonrcu, to preserve the compatibility. + + Note that the callback (be it via ->free_inode() or explicit call_rcu() + in ->destroy_inode()) is *NOT* ordered wrt superblock destruction; + as the matter of fact, the superblock and all associated structures + might be already gone. The filesystem driver is guaranteed to be still + there, but that's it. Freeing memory in the callback is fine; doing + more than that is possible, but requires a lot of care and is best + avoided. diff --git a/fs/inode.c b/fs/inode.c index fb45590d284e..855dad43b11d 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -164,6 +164,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode) inode->i_wb_frn_avg_time = 0; inode->i_wb_frn_history = 0; #endif + inode->free_inode = sb->s_op->free_inode; if (security_inode_alloc(inode)) goto out; @@ -211,8 +212,8 @@ EXPORT_SYMBOL(free_inode_nonrcu); static void i_callback(struct rcu_head *head) { struct inode *inode = container_of(head, struct inode, i_rcu); - if (inode->i_sb->s_op->free_inode) - inode->i_sb->s_op->free_inode(inode); + if (inode->free_inode) + inode->free_inode(inode); else free_inode_nonrcu(inode); } diff --git a/include/linux/fs.h b/include/linux/fs.h index 2e9b9f87caca..5ed6b39e588e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -718,6 +718,7 @@ struct inode { #endif void*i_private; /* fs or device private pointer */ + void (*free_inode)(struct inode *); } __randomize_layout; static inline unsigned int i_blocksize(const struct inode *node)
Re: [RFC][PATCHSET] sorting out RCU-delayed stuff in ->destroy_inode()
On Tue, Apr 16, 2019 at 06:49:00PM +0100, Al Viro wrote: > We have a lot of boilerplate in ->destroy_inode() > instances, and several filesystems got the things wrong > in that area. The patchset below attempts to deal with that. > > New method (void ->free_inode(inode)) is introduced, > and RCU-delayed parts of ->destroy_inode() are moved there. > The change is backwards-compatible - unmodified filesystem > will behave as it used to. Rules: > ->destroy_inode ->free_inode > f g f(), rcu-delayed g() > f NULLf() > NULLg rcu-delayed g() > NULLNULLrcu-delayed free_inode_nonrcu() > IOW, NULL/NULL acts as NULL/free_inode_nonrcu. > > For a lot of filesystems ->destroy_inode() used to consist > only of call_rcu(foo_i_callback, >i_rcu). Those simply get > rid of ->destroy_inode() and have the callback (with saner prototype) > become their ->free_inode(). The simplified API looks good to me. For btrfs and affs bits: Acked-by: David Sterba
Re: [RFC][PATCHSET] sorting out RCU-delayed stuff in ->destroy_inode()
On Tue, Apr 16, 2019 at 10:49 AM Al Viro wrote: > > 83 files changed, 241 insertions(+), 516 deletions(-) I think this single line is pretty convincing on its own. Ignoring docs and fs/inode.c, we have 80 files changed, 190 insertions(+), 494 deletions(-) IOW, just over 300 lines of boiler plate code removed. The additions are - Ten more lines of actual code in fs/inode.c (and that's not actually added complexity, it looks simpler if anything - most of it is the new "i_callback()" helper function) - 19 lines of doc updates. So it absolutely looks fine to me. I only skimmed through the actual filesystem (and one networking) patches, but they looked like trivial conversions to a better interface. Linus
[RFC][PATCHSET] sorting out RCU-delayed stuff in ->destroy_inode()
We have a lot of boilerplate in ->destroy_inode() instances, and several filesystems got the things wrong in that area. The patchset below attempts to deal with that. New method (void ->free_inode(inode)) is introduced, and RCU-delayed parts of ->destroy_inode() are moved there. The change is backwards-compatible - unmodified filesystem will behave as it used to. Rules: ->destroy_inode ->free_inode f g f(), rcu-delayed g() f NULLf() NULLg rcu-delayed g() NULLNULLrcu-delayed free_inode_nonrcu() IOW, NULL/NULL acts as NULL/free_inode_nonrcu. For a lot of filesystems ->destroy_inode() used to consist only of call_rcu(foo_i_callback, >i_rcu). Those simply get rid of ->destroy_inode() and have the callback (with saner prototype) become their ->free_inode(). Filesystems with NULL ->destroy_inode() are simply left as-is and so are the filesystems that don't have RCU-delayed call (pipefs, xfs, btrfs-tests). Filesystems that have both synchronous work and RCU-delayed call of a callback are more interesting. In any case, the callback can be converted to ->free_inode(). Sometimes that's all we can reasonably do there - the rest is left in ->destroy_inode() and that's it. However, for some of those we can do more: * some of the synchronous stuff can just as well live in RCU callback; such can be moved to ->free_inode(). * some of the synchronous stuff is a better fit for ->evict_inode(); e.g. the code that's undoing something done after the ->alloc_inode() or sanity checks on the inode state. I've done that in the obvious cases; the few non-obvious are up to fs maintainers - they can be done as followups at any point. The series lives in vfs.git#work.icache; patchbomb in followups. Overview: * a couple of missed fixes for ->i_link freed to early; -stable fodder: securityfs: fix use-after-free on symlink traversal apparmorfs: fix use-after-free on symlink traversal * infrastructure: new inode method: ->free_inode() * simple conversions (->destroy_inode() consisting only of call_rcu()) spufs: switch to ->free_inode() erofs: switch to ->free_inode() 9p: switch to ->free_inode() adfs: switch to ->free_inode() affs: switch to ->free_inode() befs: switch to ->free_inode() bfs: switch to ->free_inode() bdev: switch to ->free_inode() cifs: switch to ->free_inode() debugfs: switch to ->free_inode() efs: switch to ->free_inode() ext2: switch to ->free_inode() f2fs: switch to ->free_inode() fat: switch to ->free_inode() freevxfs: switch to ->free_inode() gfs2: switch to ->free_inode() hfs: switch to ->free_inode() hfsplus: switch to ->free_inode() hostfs: switch to ->free_inode() hpfs: switch to ->free_inode() isofs: switch to ->free_inode() jffs2: switch to ->free_inode() minix: switch to ->free_inode() nfs{,4}: switch to ->free_inode() nilfs2: switch to ->free_inode() dlmfs: switch to ->free_inode() ocfs2: switch to ->free_inode() openpromfs: switch to ->free_inode() procfs: switch to ->free_inode() qnx4: switch to ->free_inode() qnx6: switch to ->free_inode() reiserfs: convert to ->free_inode() romfs: convert to ->free_inode() squashfs: switch to ->free_inode() ubifs: switch to ->free_inode() udf: switch to ->free_inode() sysv: switch to ->free_inode() coda: switch to ->free_inode() ufs: switch to ->free_inode() mqueue: switch to ->free_inode() bpf: switch to ->free_inode() rpcpipe: switch to ->free_inode() apparmor: switch to ->free_inode() securityfs: switch to ->free_inode() ntfs: switch to ->free_inode() * cases where ->destroy_inode() contains both synchronous and delayed parts; fuse, jfs have their ->destroy_inode() dissolved and I'd like an ACK from their maintainers: dax: make use of ->free_inode() afs: switch to use of ->free_inode() btrfs: use ->free_inode() ceph: use ->free_inode() ecryptfs: make use of ->free_inode() ext4: make use of ->free_inode() fuse: switch to ->free_inode() jfs: switch to ->free_inode() overlayfs: make use of ->free_inode() hugetlb: make use of ->free_inode() shmem: make use of ->free_inode() orangefs: make use of ->free_inode() * sockets: sockfs is a case where everything can be moved to ->free_inode(); we are RCU-delaying the freeing of socket_wq anyway, so we might as well combine that with freeing the socket_alloc itself. That allows to get rid of separate allocations for those, which simplifies the things nicely. We obviously need an ACK from networking folks on the last pair