Re: [PATCH] blkdev: Fix blkdev_open to release the bdev on error
On 8 December 2015 at 11:08, Suzuki K. Poulose wrote:
> On 08/12/15 07:58, Al Viro wrote:
>>
>> On Mon, Dec 07, 2015 at 10:49:05AM -0800, Linus Torvalds wrote:
>>>
>>> On Mon, Dec 7, 2015 at 10:05 AM, Suzuki K. Poulose
>>> wrote:
>
> ...
>
>> Anyway, the fix for 9p bogosity follows; it definitely fixes a bug there,
>> and I'm fairly sure that it fixes the bug that had been reported.
>> A confirmation would be nice, of course...
>>
>> Signed-off-by: Al Viro
>> ---
>> diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
>> index 699941e..5110785 100644
>> --- a/fs/9p/vfs_inode.c
>> +++ b/fs/9p/vfs_inode.c
>> @@ -451,9 +451,9 @@ void v9fs_evict_inode(struct inode *inode)
>>  {
>>  	struct v9fs_inode *v9inode = V9FS_I(inode);
>>
>> -	truncate_inode_pages_final(inode->i_mapping);
>> +	truncate_inode_pages_final(&inode->i_data);
>>  	clear_inode(inode);
>> -	filemap_fdatawrite(inode->i_mapping);
>> +	filemap_fdatawrite(&inode->i_data);
>>
>>  	v9fs_cache_inode_put_cookie(inode);
>>  	/* clunk the fid stashed in writeback_fid */
>>
>
> This patch fixes the problem :
>
> Tested-by: Suzuki K. Poulose
>
> Thanks
> Suzuki

FWIW, I think I reported the same issue here:

http://sourceforge.net/p/v9fs/mailman/message/34661239/

And Al's patch fixed it here too.

Thanks,

Vegard
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] blkdev: Fix blkdev_open to release the bdev on error
On 08/12/15 07:58, Al Viro wrote:
> On Mon, Dec 07, 2015 at 10:49:05AM -0800, Linus Torvalds wrote:
>> On Mon, Dec 7, 2015 at 10:05 AM, Suzuki K. Poulose wrote:

...

> Anyway, the fix for 9p bogosity follows; it definitely fixes a bug there,
> and I'm fairly sure that it fixes the bug that had been reported.
> A confirmation would be nice, of course...
>
> Signed-off-by: Al Viro
> ---
> diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
> index 699941e..5110785 100644
> --- a/fs/9p/vfs_inode.c
> +++ b/fs/9p/vfs_inode.c
> @@ -451,9 +451,9 @@ void v9fs_evict_inode(struct inode *inode)
>  {
>  	struct v9fs_inode *v9inode = V9FS_I(inode);
>
> -	truncate_inode_pages_final(inode->i_mapping);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
> -	filemap_fdatawrite(inode->i_mapping);
> +	filemap_fdatawrite(&inode->i_data);
>
>  	v9fs_cache_inode_put_cookie(inode);
>  	/* clunk the fid stashed in writeback_fid */

This patch fixes the problem:

Tested-by: Suzuki K. Poulose

Thanks
Suzuki
Re: [PATCH] blkdev: Fix blkdev_open to release the bdev on error
On 08/12/15 07:25, Al Viro wrote:
> On Mon, Dec 07, 2015 at 06:05:03PM +, Suzuki K. Poulose wrote:
>> blkdev_open() doesn't release the bdev it attached to a given
>> inode if blkdev_get() fails (e.g., due to absence of a device).
>> This can cause kernel crashes when the original filesystem
>> tries to flush the data during evict_inode.
>>
>> This can be triggered easily with virtio-9p fs using the following
>> simple steps.
>
> ??? How can filesystem type affect the behaviour of block devices?

...

> We should not do bd_forget() upon failing open() - what for? As long as
> ->i_rdev remains the same, the pointer to struct bdev is valid. It doesn't
> pin bdev down; having it (or any other alias) opened does. When we decide
> to evict bdev, *all* aliasing inodes are dissociated from it; none of them
> is open at that point, so we are OK. When an aliasing inode gets evicted,
> we have it dissociated from its ->i_bdev (if any). Since we only access
> the ->i_mapping of aliasing inode while it's open, those places are fine
> and anything that wants ->i_data of alias will simply find it empty.

Thanks for the detailed explanation. Surely my patch was not cooked up on
the full understanding of the bdev fs. Things are much more clear now.

> Could you confirm that the patch below fixes your problem?

Yes, it does solve the issue.

Thanks
Suzuki
Re: [PATCH] blkdev: Fix blkdev_open to release the bdev on error
On Mon, Dec 07, 2015 at 10:49:05AM -0800, Linus Torvalds wrote:
> On Mon, Dec 7, 2015 at 10:05 AM, Suzuki K. Poulose
> wrote:
> > blkdev_open() doesn't release the bdev it attached to a given
> > inode if blkdev_get() fails (e.g., due to absence of a device).
> > This can cause kernel crashes when the original filesystem
> > tries to flush the data during evict_inode.
>
> Ugh. This code is a mess. Al, can you please comment?
>
> So what happens is that when "blkdev_get()" fails, it will do a
> bdput() on the bdev.

Yes.

> But blkdev_open() hasn't done a bdget(). It's done a bd_acquire().
> Which will do the whole "add inodes to bd_inodes".

Yes.

> And yes,
> bd_forget() will undo that.

It would, but there's no reason to drop the cached pointer to bdev.

> IOW, the patch looks simple and apparently fixes an oops, but I'd like
> much more of an explanation for what happens, because it all feels
> wrong to me. Why doesn't the bdput() end up undoing the bd_acquire()
> properly?

Because it doesn't work that way. ->i_bdev is just a cached result of
lookup by device number. Once found, it stays there for as long as neither
the struct inode nor struct block_device are freed. It does *NOT* pin
struct block_device.

Note that we have two kinds of block device inodes - ones coallocated with
struct block_device (those are unique per major/minor, live on bdevfs and
can't be seen directly) and ones aliasing the first kind. Those live on
normal filesystems. Pagecache lives in ->i_data of the bdevfs inode;
aliasing ones have their ->i_data empty and ->i_mapping pointing to
->i_data of the bdevfs inode. That guarantees the cache coherency between
those guys.

Now, simply having ->i_bdev point to struct block_device does not affect
the lifetime of the latter in any way. All aliases are dissociated from
block_device when bdevfs inode is evicted; block_device is dissociated from
aliasing inode when that aliasing inode is evicted. bdev_lock provides
the atomicity there.

_Opening_ an alias (any of them) does pin block_device down. So when an
aliasing inode is open, we can safely use its ->i_mapping in normal
pagecache-related code and have everything work correctly. Accessing
->i_mapping when inode isn't open is valid only if filesystem code is sure
it's pointing to its own ->i_data (and pointless in any case).

And that's what 9p ->evict_inode() is doing - it's trying to evict not
the pages in its ->i_data (which would be empty for block device), but
the pages in its ->i_mapping. IOW, the pagecache shared by all aliasing
inodes. Which is obviously bogus, regardless of lifetime rules violation -
mknod /tmp/foo b 8 1 && dd count=1 </tmp/foo >/dev/null && rm /tmp/foo
should not blow the cache of /dev/sda1, no matter which fs type we happen
to use for /tmp. And 9p will try to do just that.

Fortunately, no other ->evict_inode() instance is doing anything of that
sort, so we just need to fix that bogosity in 9p one.

As for the bdev eviction, bdput() acts exactly like iput(). In fact, it
is iput() of the coallocated bdevfs inode. It can stay around with zero
refcount; same as any other inode, memory pressure would eventually push
them out. It does *NOT* pin the driver when not opened, BTW.

Anyway, the fix for 9p bogosity follows; it definitely fixes a bug there,
and I'm fairly sure that it fixes the bug that had been reported.
A confirmation would be nice, of course...

Signed-off-by: Al Viro
---
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 699941e..5110785 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -451,9 +451,9 @@ void v9fs_evict_inode(struct inode *inode)
 {
 	struct v9fs_inode *v9inode = V9FS_I(inode);
 
-	truncate_inode_pages_final(inode->i_mapping);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
-	filemap_fdatawrite(inode->i_mapping);
+	filemap_fdatawrite(&inode->i_data);
 
 	v9fs_cache_inode_put_cookie(inode);
 	/* clunk the fid stashed in writeback_fid */
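The relationship between ->i_data and ->i_mapping that this message describes can be modelled in user space. The following is only a toy sketch: every toy_* name is hypothetical, the structs are drastically simplified stand-ins rather than the kernel's definitions, and "truncating" a mapping is reduced to zeroing a page counter. It is meant to show why evicting through ->i_mapping from an aliasing inode hits the shared bdevfs pagecache, while evicting through &inode->i_data does not.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct address_space; NOT the kernel definition. */
struct toy_address_space {
	unsigned long nrpages;	/* number of cached pages */
};

/* Toy stand-in for struct inode; only the two fields under discussion. */
struct toy_inode {
	struct toy_address_space i_data;	/* embedded, owned by this inode */
	struct toy_address_space *i_mapping;	/* where pagecache lookups go */
};

/* The bdevfs inode owns the block device's pagecache in its own i_data. */
void toy_bdevfs_inode_init(struct toy_inode *bd, unsigned long cached)
{
	bd->i_data.nrpages = cached;
	bd->i_mapping = &bd->i_data;
}

/* An alias on a normal filesystem: empty i_data, i_mapping redirected. */
void toy_alias_init(struct toy_inode *alias, struct toy_inode *bd)
{
	alias->i_data.nrpages = 0;
	alias->i_mapping = &bd->i_data;
}

/* Stand-in for dropping every page of a mapping. */
void toy_truncate(struct toy_address_space *mapping)
{
	mapping->nrpages = 0;
}

/* Eviction through i_mapping: for an alias this is the shared cache. */
void toy_evict_via_mapping(struct toy_inode *inode)
{
	toy_truncate(inode->i_mapping);
}

/* Eviction through the embedded i_data: touches only the inode's own pages. */
void toy_evict_via_data(struct toy_inode *inode)
{
	toy_truncate(&inode->i_data);
}
```

In this model, calling toy_evict_via_data() on an alias leaves the bdevfs inode's page count alone, whereas toy_evict_via_mapping() on the same alias wipes it, which mirrors the "should not blow the cache of /dev/sda1" complaint.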
Re: [PATCH] blkdev: Fix blkdev_open to release the bdev on error
On Mon, Dec 07, 2015 at 06:05:03PM +, Suzuki K. Poulose wrote:
> blkdev_open() doesn't release the bdev it attached to a given
> inode if blkdev_get() fails (e.g., due to absence of a device).
> This can cause kernel crashes when the original filesystem
> tries to flush the data during evict_inode.
>
> This can be triggered easily with virtio-9p fs using the following
> simple steps.

??? How can filesystem type affect the behaviour of block devices?
Having
	mknod /tmp/splat b 8 1; rm /tmp/splat
try to evict the pagecache of /dev/sda1 is simply wrong, no matter what
type /tmp happens to have. And they must share pagecache, or you'll
get one hell of cache coherency problems.

As it is, that pagecache belongs to inode on bdevfs (see fs/block_dev.c;
not mountable anywhere visible, the one and only mount is internal). That
inode is tied to struct bdev, ditto for its lifetime. Block device inodes
on anything else have their ->i_mapping pointing to the corresponding
(unique for given major/minor) inode on bdevfs; that gives us the coherency,
but that also means that their *own* pagecache (->i_data) is empty.

Which is just fine, since inode eviction should get rid of everything in
its embedded struct address_space. In case of block device inodes on ext2,
9p, etc. that amounts to no pages at all. In case of bdevfs, it contains
the page cache of block device.

Aha...
	truncate_inode_pages_final(inode->i_mapping);
	clear_inode(inode);
	filemap_fdatawrite(inode->i_mapping);
in there is obviously wrong - it should be
	truncate_inode_pages_final(&inode->i_data);
	clear_inode(inode);
	filemap_fdatawrite(&inode->i_data);
and if you check other filesystems' ->evict_inode() you'll see the same
thing there.

We should not do bd_forget() upon failing open() - what for? As long as
->i_rdev remains the same, the pointer to struct bdev is valid. It doesn't
pin bdev down; having it (or any other alias) opened does. When we decide
to evict bdev, *all* aliasing inodes are dissociated from it; none of them
is open at that point, so we are OK. When an aliasing inode gets evicted,
we have it dissociated from its ->i_bdev (if any). Since we only access
the ->i_mapping of aliasing inode while it's open, those places are fine
and anything that wants ->i_data of alias will simply find it empty.

AFAICS, the cause of your oopsen is that 9p evict_inode is accessing the
object it has no business to touch. Could you confirm that the patch below
fixes your problem?

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 699941e..5110785 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -451,9 +451,9 @@ void v9fs_evict_inode(struct inode *inode)
 {
 	struct v9fs_inode *v9inode = V9FS_I(inode);
 
-	truncate_inode_pages_final(inode->i_mapping);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
-	filemap_fdatawrite(inode->i_mapping);
+	filemap_fdatawrite(&inode->i_data);
 
 	v9fs_cache_inode_put_cookie(inode);
 	/* clunk the fid stashed in writeback_fid */
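The lifetime rule in this message (a cached ->i_bdev pointer does not pin the bdev, an open does, and evicting the bdev dissociates every alias) can be illustrated with a small user-space model. This is a sketch under loud assumptions: all toy_* names and the fixed-size alias array are hypothetical, there is no locking (the kernel uses bdev_lock for atomicity), and the "openers" counter merely stands in for whatever keeps an opened device pinned.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ALIASES 4

struct toy_bdev;

/* Toy aliasing inode: holds only a cached lookup result. */
struct toy_alias {
	struct toy_bdev *i_bdev;	/* cached pointer; does NOT pin the bdev */
};

/* Toy block device with a list of the aliases that cache it. */
struct toy_bdev {
	int openers;			/* > 0 means some alias is open (pinned) */
	struct toy_alias *aliases[TOY_MAX_ALIASES];
	size_t nr_aliases;
};

/*
 * Cache the bdev in the alias. The list insertion is conditional: an
 * alias that already caches this bdev is not added a second time.
 * Returns 0 on success, -1 if the toy list is full.
 */
int toy_acquire(struct toy_alias *a, struct toy_bdev *b)
{
	if (a->i_bdev == b)
		return 0;
	if (b->nr_aliases == TOY_MAX_ALIASES)
		return -1;
	b->aliases[b->nr_aliases++] = a;
	a->i_bdev = b;
	return 0;
}

/*
 * Evict the bdev: refused while anything has it open; otherwise every
 * alias is dissociated. Returns 0 on success, -1 if still pinned.
 */
int toy_evict_bdev(struct toy_bdev *b)
{
	if (b->openers > 0)
		return -1;
	for (size_t i = 0; i < b->nr_aliases; i++)
		b->aliases[i]->i_bdev = NULL;
	b->nr_aliases = 0;
	return 0;
}
```

A failed open in this model simply leaves the cached pointer in place with openers at zero; nothing needs to be undone, and a later eviction of the bdev clears the pointer anyway, which is the argument against calling bd_forget() on the error path.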
Re: [PATCH] blkdev: Fix blkdev_open to release the bdev on error
On Mon, Dec 7, 2015 at 10:05 AM, Suzuki K. Poulose wrote:
> blkdev_open() doesn't release the bdev it attached to a given
> inode if blkdev_get() fails (e.g., due to absence of a device).
> This can cause kernel crashes when the original filesystem
> tries to flush the data during evict_inode.

Ugh. This code is a mess. Al, can you please comment?

So what happens is that when "blkdev_get()" fails, it will do a
bdput() on the bdev.

But blkdev_open() hasn't done a bdget(). It's done a bd_acquire().
Which will do the whole "add inodes to bd_inodes". And yes,
bd_forget() will undo that.

HOWEVER.

bd_forget() will undo that unconditionally, but bd_acquire() has *not*
unconditionally done that bd_inodes list operation. It might already
have been there.

So as far as I can tell, the patch here undoes things potentially too much.

Shouldn't the last bdput() already end up doing a bd_forget()? We'd have

    bdput -> iput -> iput_final -> evict -> bd_forget.

but the fact that Suzuki shows an oops clearly shows that something is
badly wrong.

IOW, the patch looks simple and apparently fixes an oops, but I'd like
much more of an explanation for what happens, because it all feels
wrong to me. Why doesn't the bdput() end up undoing the bd_acquire()
properly?

                 Linus
[PATCH] blkdev: Fix blkdev_open to release the bdev on error
blkdev_open() doesn't release the bdev it attached to a given
inode if blkdev_get() fails (e.g., due to absence of a device).
This can cause kernel crashes when the original filesystem
tries to flush the data during evict_inode.

This can be triggered easily with virtio-9p fs using the following
simple steps.

root@localhost:~# mknod disk b 9 1
root@localhost:~# cat disk
Unable to handle kernel NULL pointer dereference at virtual address 0214
pgd = bea4
[0214] *pgd=be9eb831, *pte=, *ppte=
Internal error: Oops: 17 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 1094 Comm: cat Not tainted 4.3.0 #3
Hardware name: Generic DT based system
task: bf186600 ti: be822000 task.ti: be822000
PC is at blk_get_backing_dev_info+0x4/0x10
LR is at __filemap_fdatawrite_range+0x88/0x94
pc : [<80317e00>]    lr : [<801995d4>]    psr: 60010013
sp : be823db0  ip :  fp : 0024
r10: fffa  r9 : be86a240  r8 : bec87e58
r7 : 0001  r6 : 80615640  r5 : 7fff  r4 : bec03354
r3 :  r2 : bf006c00  r1 : 7fff  r0 : bec03200
Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5383d  Table: bea4006a  DAC: 0051
Process cat (pid: 1094, stack limit = 0xbe822210)
[... stack contents trimmed ...]
[<80317e00>] (blk_get_backing_dev_info) from [<801995d4>] (__filemap_fdatawrite_range+0x88/0x94)
[<801995d4>] (__filemap_fdatawrite_range) from [<80199608>] (filemap_fdatawrite+0x28/0x30)
[<80199608>] (filemap_fdatawrite) from [<802fa830>] (v9fs_evict_inode+0x20/0x3c)
[<802fa830>] (v9fs_evict_inode) from [<801ef4fc>] (evict+0xb0/0x188)
[<801ef4fc>] (evict) from [<801eb998>] (__dentry_kill+0x1ec/0x250)
[<801eb998>] (__dentry_kill) from [<801ec2d8>] (dput+0x188/0x28c)
[<801ec2d8>] (dput) from [<801e0858>] (path_put+0x10/0x1c)
[<801e0858>] (path_put) from [<801e08a0>] (terminate_walk+0x3c/0x98)
[<801e08a0>] (terminate_walk) from [<801e3d54>] (path_openat+0x1ec/0xeac)
[<801e3d54>] (path_openat) from [<801e56c8>] (do_filp_open+0x60/0xb4)
[<801e56c8>] (do_filp_open) from [<801d7850>] (do_sys_open+0x124/0x1d0)
[<801d7850>] (do_sys_open) from [<80107340>] (ret_fast_syscall+0x0/0x3c)
Code: 806d3ca0 80941b7c 807175d8 e590305c (e5930214)
---[ end trace b61b160a3217ae29 ]---

Fixes: e525fd89d380c4a94c0d63913a1dd1a593ed25e7
Cc: Tejun Heo
Cc: sta...@vger.kernel.org
Cc: Al Viro
Signed-off-by: Suzuki K. Poulose
---
 fs/block_dev.c |    7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c25639e..7d7f322 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1484,6 +1484,7 @@ EXPORT_SYMBOL(blkdev_get_by_dev);
 
 static int blkdev_open(struct inode * inode, struct file * filp)
 {
+	int rc;
 	struct block_device *bdev;
 
 	/*
@@ -1507,7 +1508,11 @@ static int blkdev_open(struct inode * inode, struct file * filp)
 
 	filp->f_mapping = bdev->bd_inode->i_mapping;
 
-	return blkdev_get(bdev, filp->f_mode, filp);
+	rc = blkdev_get(bdev, filp->f_mode, filp);
+	if (rc)
+		bd_forget(inode);
+
+	return rc;
 }
 
 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
-- 
1.7.9.5