Re: [RFC 0/2] readdir() as an inode operation
On Tue, Oct 30, 2007 at 04:26:04PM +0100, Jan Kara wrote: > > This is a first try to move readdir() to become an inode operation. This is > > necessary for a VFS implementation of "something like union-mounts" where a > > readdir() needs to read the directory contents of multiple directories. > > Besides that the new interface is no longer giving the struct file to the > > filesystem implementations anymore. > > > > Comments, please? > Hmm, are you sure there are no users which keep some per-struct-file > information for directories? File offset is one such obvious thing which > you've handled but actually filesystem with more complicated structure > of directory may remember some hints about where we really are, keep > some readahead information or so... For example, the ext3 filesystem, when it is supported hash tree, does exactly this. See ext3_htree_store_dirent() in fs/ext3/dir.c and ext3_htree_fill_tree() in fs/ext3/namei.c. So your patch would break ext3 htree support. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal to improve filesystem/block snapshot interaction
On Wed, Oct 31, 2007 at 03:01:58PM +1100, Greg Banks wrote: > On Wed, Oct 31, 2007 at 10:56:52AM +1100, David Chinner wrote: > > On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: > > > On Tuesday October 30, [EMAIL PROTECTED] wrote: > > > > BIO_HINT_RELEASE > > > > The bio's block extent is no longer in use by the filesystem > > > > and will not be read in the future. Any storage used to back > > > > the extent may be released without any threat to filesystem > > > > or data integrity. > > > > > > If the allocation unit of the storage device (e.g. a few MB) does not > > > match the allocation unit of the filesystem (e.g. a few KB) then for > > > this to be useful either the storage device must start recording tiny > > > allocations, or the filesystem should re-release areas as they grow. > > > i.e. when releasing a range of a device, look in the filesystem's usage > > > records for the largest surrounding free space, and release all of that. > > > > I figured that the easiest way around this is reporting free space > > extents, not the amoutn actually freed. e.g. > > > > 4k in file A @ block 10 > > 4k in file B @ block 11 > > 4k free space @ block 12 > > 4k in file C @ block 13 > > 1008k in free space at block 14. > > > > If we free file A, we report that we've released an extent of 4k @ block 10. > > if we then free file B, we report we've released an extent of 12k @ block > > 10. > > If we then free file C, we report a release of 1024k @ block 10. > > > > Then the underlying device knows what the aggregated free space regions > > are and can easily release large regions without needing to track tiny > > allocations and frees done by the filesystem. > > If you could do that in the filesystem, it certainly solve the problem. > In which case I'll explicitly allow for the hint's extent to overlap > extents previous extents thus hinted, and define the semantics > for overlaps. I think I'll rename the hint to BIO_HINT_RELEASED, > I think that will make the semantics a little clearer. I think that can be done - i wouldn't have mentioned it if I didn't think it was possible to implement ;). It will require a further btree lookup once the free transaction hits the disk, but I think that's pretty easy to do. I'd probably hook xfs_alloc_clear_busy() to do this. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/2] readdir() as an inode operation
On Tue, Oct 30, 2007 at 04:26:04PM +0100, Jan Kara wrote: > > This is a first try to move readdir() to become an inode operation. This is > > necessary for a VFS implementation of "something like union-mounts" where a > > readdir() needs to read the directory contents of multiple directories. > > Besides that the new interface is no longer giving the struct file to the > > filesystem implementations anymore. > > > > Comments, please? > Hmm, are you sure there are no users which keep some per-struct-file > information for directories? File offset is one such obvious thing which > you've handled but actually filesystem with more complicated structure > of directory may remember some hints about where we really are, keep > some readahead information or so... The hfsplus code keeps some extra data in the ->private_data of directories, although I've lost track of the exact benefit from not looking at the code in some years. As I recall it was a performance enhancement due to the fact that the data for all directories is effectively in a single file. I believe it was easier to keep the current position as a structure rather than just an offset and be careful about the tracking. The hfs code is almost identical to the hfsplus code in this area. Brad Boyer [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal to improve filesystem/block snapshot interaction
On Wed, Oct 31, 2007 at 10:56:52AM +1100, David Chinner wrote: > On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: > > On Tuesday October 30, [EMAIL PROTECTED] wrote: > > > BIO_HINT_RELEASE > > > The bio's block extent is no longer in use by the filesystem > > > and will not be read in the future. Any storage used to back > > > the extent may be released without any threat to filesystem > > > or data integrity. > > > > If the allocation unit of the storage device (e.g. a few MB) does not > > match the allocation unit of the filesystem (e.g. a few KB) then for > > this to be useful either the storage device must start recording tiny > > allocations, or the filesystem should re-release areas as they grow. > > i.e. when releasing a range of a device, look in the filesystem's usage > > records for the largest surrounding free space, and release all of that. > > I figured that the easiest way around this is reporting free space > extents, not the amoutn actually freed. e.g. > > 4k in file A @ block 10 > 4k in file B @ block 11 > 4k free space @ block 12 > 4k in file C @ block 13 > 1008k in free space at block 14. > > If we free file A, we report that we've released an extent of 4k @ block 10. > if we then free file B, we report we've released an extent of 12k @ block 10. > If we then free file C, we report a release of 1024k @ block 10. > > Then the underlying device knows what the aggregated free space regions > are and can easily release large regions without needing to track tiny > allocations and frees done by the filesystem. If you could do that in the filesystem, it certainly solve the problem. In which case I'll explicitly allow for the hint's extent to overlap extents previous extents thus hinted, and define the semantics for overlaps. I think I'll rename the hint to BIO_HINT_RELEASED, I think that will make the semantics a little clearer. Greg. -- Greg Banks, R&D Software Engineer, SGI Australian Software Group. Apparently, I'm Bedevere. Which MPHG character are you? I don't speak for SGI. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal to improve filesystem/block snapshot interaction
On Tue, Oct 30, 2007 at 06:35:08PM +0900, Dongjun Shin wrote: > On 10/30/07, Greg Banks <[EMAIL PROTECTED]> wrote: > > > > BIO_HINT_RELEASE > > The bio's block extent is no longer in use by the filesystem > > and will not be read in the future. Any storage used to back > > the extent may be released without any threat to filesystem > > or data integrity. > > > > I'd like to second the proposal, but it would be more useful to bring the hint > down to the physical devices. > > There is an ongoing discussion about adding 'Trim' ATA command for notifying > the drive about the deleted blocks. > > http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf What an interesting document. Am I reading the change markup correctly, did it get *simpler* in the last revision? Wow. I agree that BIO_HINT_RELEASE would be a good match for the proposed Trim command. But I don't think we'll ever be issuing Trims with more than a single LBA Range Entry, that feature seems unhelpful. The Trim proposal doesn't specify what happens when a sector which is already deallocated is deallocated again, presumably this is supposed to be harmless? Greg. -- Greg Banks, R&D Software Engineer, SGI Australian Software Group. Apparently, I'm Bedevere. Which MPHG character are you? I don't speak for SGI. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: migratepage failures on reiserfs
On Tue, 2007-10-30 at 16:08 -0700, Badari wrote: > Chris Mason wrote: > > On Tue, 30 Oct 2007 13:54:05 -0800 [cut] > > The easy way to narrow our search is to try without data=ordered, it is > > certainly complicating things. > > > > I can try that, its my root filesystem :( You meant to write "can't?" Download BusyBox and build it into an initramfs. It's pretty easy, you can do it yourself. Or you could download the Debian mkinitramfs (I think) package and look at it. Or the Fedora equivalent (I think mkinitrd). Then you can boot into that and mount your / with whatever options you like. Here's what I use for my own custom BusyBox initramfs /init script: #!/bin/sh get_arg() { local arg="$1" local x=`cat /proc/cmdline` for i in $x; do if [ "${i%=*}" = "$arg" ]; then echo ${i#*=} break fi done } do_switch() { mount -t proc none /proc local root=`get_arg root` local flags=`get_arg rootflags` mount "$root" /new ${flags:+-o "$flags"} umount /proc cd /new exec switch_root . /sbin/init } do_shell() { exec /sbin/init } # The following will wait 2 seconds for Enter before booting. trap "do_switch" ALRM target=$$ ( sleep 2; kill -ALRM $target ) & alarm=$! echo -n "Press Enter for a shell: " while read action; do kill $alarm break done do_shell -- Zan Lynx <[EMAIL PROTECTED]> signature.asc Description: This is a digitally signed message part
Re: Proposal to improve filesystem/block snapshot interaction
On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: > On Tuesday October 30, [EMAIL PROTECTED] wrote: > > BIO_HINT_RELEASE > > The bio's block extent is no longer in use by the filesystem > > and will not be read in the future. Any storage used to back > > the extent may be released without any threat to filesystem > > or data integrity. > > If the allocation unit of the storage device (e.g. a few MB) does not > match the allocation unit of the filesystem (e.g. a few KB) then for > this to be useful either the storage device must start recording tiny > allocations, or the filesystem should re-release areas as they grow. > i.e. when releasing a range of a device, look in the filesystem's usage > records for the largest surrounding free space, and release all of that. I figured that the easiest way around this is reporting free space extents, not the amoutn actually freed. e.g. 4k in file A @ block 10 4k in file B @ block 11 4k free space @ block 12 4k in file C @ block 13 1008k in free space at block 14. If we free file A, we report that we've released an extent of 4k @ block 10. if we then free file B, we report we've released an extent of 12k @ block 10. If we then free file C, we report a release of 1024k @ block 10. Then the underlying device knows what the aggregated free space regions are and can easily release large regions without needing to track tiny allocations and frees done by the filesystem. > I guess that is equally domain specific, but the difference is that if > you try to read from the DONTCOW part of the snapshot, you get bad > old data, where as if you try to access the subordinate device of a > snapshot, you get an IO error - which is probably safer. If you read from a DONTCOW region you should get zeros back - it's a hole in the snapshot. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal to improve filesystem/block snapshot interaction
On 10/30/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote: > On Tuesday 30 October 2007, Dongjun Shin wrote: > > There is an ongoing discussion about adding 'Trim' ATA command for notifying > > the drive about the deleted blocks. > > > > http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf > > > > This is especially useful for the storage device like Solid State Drive > > (SSD). > > > This make me curious, why would t13 want to invent a new command when > there is already the erase command from CFA? > > It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE > should probably be mapped to CFA_ERASE (0xc0) on drives that support it: > http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf > IHMO, the main difference is that it requires the physical operation or not. The CFA_ERAES erases the free blocks, it requires the physical erase operation. But in trim case, it just unmapped the free blocks at FTL level. it doesn't require the physical operation. It's time saving and we can do a lot of works at FTL level internally. Thank you, Kyungmin Park I - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal to improve filesystem/block snapshot interaction
On 10/31/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote: > On Tuesday 30 October 2007, Jörn Engel wrote: > > On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote: > > > On 10/30/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote: > > > > > > > > Not sure. Why shouldn't you be able to reorder the hints provided that > > > > they don't overlap with read/write bios for the same block? > > > > > > You're right. The bios can be reordered if they don't overlap with hint. > > > > I would keep things simpler. Bios can be reordered, full stop. If an > > erase and a write overlap, the caller (filesystem?) has to add a > > barrier. > > I thought bios were already ordered if they affect the same blocks. > Either way, I agree that an erase should not be treated special on > the bio layer, its ordering should be handled the same way we do it > for writes. > To support the new ATA command (trim, or dataset), the suggested hint is not enough. We have to send the bio with data (at least one sector or more) since the new ATA command requests the dataset information. And also we have to strictly follow the order using barrier or other methods at filesystem level For example, the delete operation in ext3. 1. delete some file 2. ext3_delete_inode() called 3. ... -> ext3_free_blocks_sb() releases the free blocks 4. If it sends the hints here, it breaks the ext3 power off recovery scheme since it trims the data from given information after reboot 5. after transaction, all dirty pages are flushed. after this work, we can trim the free blocks safely. Another approach is modifying the block framework. At I/O scheduler, it don't merge the hint bio (in my terminology, bio control info) with general bio. In this case we also consider the reordering problem. I'm not sure it is possible at this time. Thank you, Kyungmin Park - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: migratepage failures on reiserfs
Chris Mason wrote: On Tue, 30 Oct 2007 13:54:05 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote: On Tue, 2007-10-30 at 13:54 -0400, Chris Mason wrote: On Tue, 30 Oct 2007 10:27:04 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote: Hi, While testing hotplug memory remove, I ran into this issue. Given a range of pages hotplug memory remove tries to migrate those pages. migrate_pages() keeps failing to migrate pages containing pagecache pages for reiserfs files. I noticed that reiserfs doesn't have ->migratepage() ops. So, fallback_migrate_page() code tries to do try_to_release_page(). try_to_release_page() fails to drop_buffers() since b_count == 1. Here is what my debug shows: migrate pages failed pfn 258111/flags 3f801 bh cb53f6e0 flags 110029 count 1 Any one know why the b_count == 1 and not getting dropped to zero ? If these are file data pages, the count is probably elevated as part of the data=ordered tracking. You can verify this via b_private, or just mount data=writeback to double check. Chris, That was my first assumption. But after looking at reiserfs_releasepage (), realized that it would do reiserfs_free_jh() and clears the b_private. I couldn't easily find out who has the ref. against this bh. bh cbdaaf00 flags 110029 count 1 private 0 If I'm reading this correctly the buffer is BH_Lock | BH_Req, perhaps it is currently under IO? Its BH_Req | BH_Uptodate. Its not under IO. The page isn't locked, but data=ordered does IO directly on the buffer heads, without taking the page lock. The easy way to narrow our search is to try without data=ordered, it is certainly complicating things. I can try that, its my root filesystem :( Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] 0/4 fs/ioctl.c coding style, function renaming/factoring
On Tue, 30 Oct 2007 15:39:55 -0400 Erez Zadok <[EMAIL PROTECTED]> wrote: > This series of 4 proposed patches (take 3) changes fs/ioctl.c and Unionfs as > follows. The problem is of course that you need these in your tree for ongoing development, but 2.6.25 is months away. They look OK to me so I suggest that you go ahead and commit them to your git tree and I'll drop them again. Please resend them for merging in the 2.6.25-rc1 merge window. unionfs has been hanging around for a long time now and we should work towards getting it into 2.6.25 or dropped from -mm (sorry). Right now would be a great time to get this process underway. I have only a partial memory of what the sticking points were, and I have basically zero knowledge of what was done to address them. So could you please, over the next few weeks: - Send out a description of what the issues were, and how they were addressed - If issues remain, describe their impact and possible workarounds, all that stuff. - If it mostly-survives all that design-level review and consideration then let's go for it: get all the patches cleaned up and consolidated and get them emailed out for review no later than 2.6.24-rc5. Thanks. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: migratepage failures on reiserfs
On Tue, 30 Oct 2007 13:54:05 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote: > On Tue, 2007-10-30 at 13:54 -0400, Chris Mason wrote: > > On Tue, 30 Oct 2007 10:27:04 -0800 > > Badari Pulavarty <[EMAIL PROTECTED]> wrote: > > > > > Hi, > > > > > > While testing hotplug memory remove, I ran into this issue. Given > > > a range of pages hotplug memory remove tries to migrate those > > > pages. > > > > > > migrate_pages() keeps failing to migrate pages containing > > > pagecache pages for reiserfs files. I noticed that reiserfs > > > doesn't have ->migratepage() ops. So, fallback_migrate_page() > > > code tries to do try_to_release_page(). try_to_release_page() > > > fails to drop_buffers() since b_count == 1. Here is what my debug > > > shows: > > > > > > migrate pages failed pfn 258111/flags 3f801 > > > bh cb53f6e0 flags 110029 count 1 > > > > > > Any one know why the b_count == 1 and not getting dropped to > > > zero ? > > > > If these are file data pages, the count is probably elevated as > > part of the data=ordered tracking. You can verify this via > > b_private, or just mount data=writeback to double check. > > > Chris, > > That was my first assumption. But after looking at > reiserfs_releasepage (), realized that it would do reiserfs_free_jh() > and clears the b_private. I couldn't easily find out who has the ref. > against this bh. > > bh cbdaaf00 flags 110029 count 1 private 0 > If I'm reading this correctly the buffer is BH_Lock | BH_Req, perhaps it is currently under IO? The page isn't locked, but data=ordered does IO directly on the buffer heads, without taking the page lock. The easy way to narrow our search is to try without data=ordered, it is certainly complicating things. -chris - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: migratepage failures on reiserfs
On Tue, 2007-10-30 at 13:54 -0400, Chris Mason wrote: > On Tue, 30 Oct 2007 10:27:04 -0800 > Badari Pulavarty <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > While testing hotplug memory remove, I ran into this issue. Given a > > range of pages hotplug memory remove tries to migrate those pages. > > > > migrate_pages() keeps failing to migrate pages containing pagecache > > pages for reiserfs files. I noticed that reiserfs doesn't have > > ->migratepage() ops. So, fallback_migrate_page() code tries to > > do try_to_release_page(). try_to_release_page() fails to > > drop_buffers() since b_count == 1. Here is what my debug shows: > > > > migrate pages failed pfn 258111/flags 3f801 > > bh cb53f6e0 flags 110029 count 1 > > > > Any one know why the b_count == 1 and not getting dropped to zero ? > > If these are file data pages, the count is probably elevated as part of > the data=ordered tracking. You can verify this via b_private, or just > mount data=writeback to double check. Chris, That was my first assumption. But after looking at reiserfs_releasepage (), realized that it would do reiserfs_free_jh() and clears the b_private. I couldn't easily find out who has the ref. against this bh. bh cbdaaf00 flags 110029 count 1 private 0 Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] VFS: factor out three helpers for FIBMAP/FIONBIO/FIOASYNC file ioctls
Signed-off-by: Erez Zadok <[EMAIL PROTECTED]> --- fs/ioctl.c | 129 +++- 1 files changed, 75 insertions(+), 54 deletions(-) diff --git a/fs/ioctl.c b/fs/ioctl.c index 1ab7b7d..cd8c1a3 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -53,32 +53,34 @@ long vfs_ioctl(struct file *filp, unsigned int cmd, } EXPORT_SYMBOL_GPL(vfs_ioctl); +static int ioctl_fibmap(struct file *filp, int __user *p) +{ + struct address_space *mapping = filp->f_mapping; + int res, block; + + /* do we support this mess? */ + if (!mapping->a_ops->bmap) + return -EINVAL; + if (!capable(CAP_SYS_RAWIO)) + return -EPERM; + res = get_user(block, p); + if (res) + return res; + lock_kernel(); + res = mapping->a_ops->bmap(mapping, block); + unlock_kernel(); + return put_user(res, p); +} + static int file_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { - int error; - int block; struct inode *inode = filp->f_path.dentry->d_inode; int __user *p = (int __user *)arg; switch (cmd) { case FIBMAP: - { - struct address_space *mapping = filp->f_mapping; - int res; - /* do we support this mess? */ - if (!mapping->a_ops->bmap) - return -EINVAL; - if (!capable(CAP_SYS_RAWIO)) - return -EPERM; - error = get_user(block, p); - if (error) - return error; - lock_kernel(); - res = mapping->a_ops->bmap(mapping, block); - unlock_kernel(); - return put_user(res, p); - } + return ioctl_fibmap(filp, p); case FIGETBSZ: return put_user(inode->i_sb->s_blocksize, p); case FIONREAD: @@ -88,6 +90,57 @@ static int file_ioctl(struct file *filp, unsigned int cmd, return vfs_ioctl(filp, cmd, arg); } +static int ioctl_fionbio(struct file *filp, int __user *argp) +{ + unsigned int flag; + int on, error; + + error = get_user(on, argp); + if (error) + return error; + flag = O_NONBLOCK; +#ifdef __sparc__ + /* SunOS compatibility item. */ + if (O_NONBLOCK != O_NDELAY) + flag |= O_NDELAY; +#endif + if (on) + filp->f_flags |= flag; + else + filp->f_flags &= ~flag; + return error; +} + +static int ioctl_fioasync(unsigned int fd, struct file *filp, + int __user *argp) +{ + unsigned int flag; + int on, error; + + error = get_user(on, argp); + if (error) + return error; + flag = on ? FASYNC : 0; + + /* Did FASYNC state change ? */ + if ((flag ^ filp->f_flags) & FASYNC) { + if (filp->f_op && filp->f_op->fasync) { + lock_kernel(); + error = filp->f_op->fasync(fd, filp, on); + unlock_kernel(); + } else + error = -ENOTTY; + } + if (error) + return error; + + if (on) + filp->f_flags |= FASYNC; + else + filp->f_flags &= ~FASYNC; + return error; +} + /* * When you add any new common ioctls to the switches above and below * please update compat_sys_ioctl() too. @@ -98,8 +151,8 @@ static int file_ioctl(struct file *filp, unsigned int cmd, int do_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, unsigned long arg) { - unsigned int flag; - int on, error = 0; + int error = 0; + int __user *argp = (int __user *)arg; switch (cmd) { case FIOCLEX: @@ -111,43 +164,11 @@ int do_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, break; case FIONBIO: - error = get_user(on, (int __user *)arg); - if (error) - break; - flag = O_NONBLOCK; -#ifdef __sparc__ - /* SunOS compatibility item. */ - if (O_NONBLOCK != O_NDELAY) - flag |= O_NDELAY; -#endif - if (on) - filp->f_flags |= flag; - else - filp->f_flags &= ~flag; + error = ioctl_fionbio(filp, argp); break; case FIOASYNC: - error = get_user(on, (int __user *)arg); - if (error) - break; - flag = on ? FASYNC : 0; - - /* Did FASYNC state change ? */ - if ((flag ^ filp->f_flags) & FASYNC) { - if (filp->f_op && filp->f_op->fasync) { - lock_kernel(); - error = filp->f_o
[PATCH 1/4] VFS: apply coding standards to fs/ioctl.c
Signed-off-by: Erez Zadok <[EMAIL PROTECTED]> --- fs/ioctl.c | 164 +++- 1 files changed, 84 insertions(+), 80 deletions(-) diff --git a/fs/ioctl.c b/fs/ioctl.c index c2a773e..652cacf 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -12,8 +12,8 @@ #include #include #include +#include -#include #include static long do_ioctl(struct file *filp, unsigned int cmd, @@ -45,31 +45,31 @@ static int file_ioctl(struct file *filp, unsigned int cmd, { int error; int block; - struct inode * inode = filp->f_path.dentry->d_inode; + struct inode *inode = filp->f_path.dentry->d_inode; int __user *p = (int __user *)arg; switch (cmd) { - case FIBMAP: - { - struct address_space *mapping = filp->f_mapping; - int res; - /* do we support this mess? */ - if (!mapping->a_ops->bmap) - return -EINVAL; - if (!capable(CAP_SYS_RAWIO)) - return -EPERM; - if ((error = get_user(block, p)) != 0) - return error; - - lock_kernel(); - res = mapping->a_ops->bmap(mapping, block); - unlock_kernel(); - return put_user(res, p); - } - case FIGETBSZ: - return put_user(inode->i_sb->s_blocksize, p); - case FIONREAD: - return put_user(i_size_read(inode) - filp->f_pos, p); + case FIBMAP: + { + struct address_space *mapping = filp->f_mapping; + int res; + /* do we support this mess? */ + if (!mapping->a_ops->bmap) + return -EINVAL; + if (!capable(CAP_SYS_RAWIO)) + return -EPERM; + error = get_user(block, p); + if (error) + return error; + lock_kernel(); + res = mapping->a_ops->bmap(mapping, block); + unlock_kernel(); + return put_user(res, p); + } + case FIGETBSZ: + return put_user(inode->i_sb->s_blocksize, p); + case FIONREAD: + return put_user(i_size_read(inode) - filp->f_pos, p); } return do_ioctl(filp, cmd, arg); @@ -82,81 +82,85 @@ static int file_ioctl(struct file *filp, unsigned int cmd, * vfs_ioctl() is not for drivers and not intended to be EXPORT_SYMBOL()'d. * It's just a simple helper for sys_ioctl and compat_sys_ioctl. */ -int vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, unsigned long arg) +int vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, + unsigned long arg) { unsigned int flag; int on, error = 0; switch (cmd) { - case FIOCLEX: - set_close_on_exec(fd, 1); - break; + case FIOCLEX: + set_close_on_exec(fd, 1); + break; - case FIONCLEX: - set_close_on_exec(fd, 0); - break; + case FIONCLEX: + set_close_on_exec(fd, 0); + break; - case FIONBIO: - if ((error = get_user(on, (int __user *)arg)) != 0) - break; - flag = O_NONBLOCK; + case FIONBIO: + error = get_user(on, (int __user *)arg); + if (error) + break; + flag = O_NONBLOCK; #ifdef __sparc__ - /* SunOS compatibility item. */ - if(O_NONBLOCK != O_NDELAY) - flag |= O_NDELAY; + /* SunOS compatibility item. */ + if (O_NONBLOCK != O_NDELAY) + flag |= O_NDELAY; #endif - if (on) - filp->f_flags |= flag; - else - filp->f_flags &= ~flag; + if (on) + filp->f_flags |= flag; + else + filp->f_flags &= ~flag; + break; + + case FIOASYNC: + error = get_user(on, (int __user *)arg); + if (error) break; - - case FIOASYNC: - if ((error = get_user(on, (int __user *)arg)) != 0) - break; - flag = on ? FASYNC : 0; - - /* Did FASYNC state change ? */ - if ((flag ^ filp->f_flags) & FASYNC) { - if (filp->f_op && filp->f_op->fasync) { -
[PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names
Rename old vfs_ioctl to do_ioctl, because the comment above it clearly indicates that it is an internal function not to be exported to modules; therefore it should have a more traditional do_XXX name. The new do_ioctl is exported in fs.h but not to modules. Rename the old do_ioctl to vfs_ioctl because the names vfs_XXX should preferably be reserved to callable VFS functions which modules may call, as many other vfs_XXX functions already do. Export the new vfs_ioctl to GPL modules so others can use it (including Unionfs and eCryptfs). Add DocBook for new vfs_ioctl. Signed-off-by: Erez Zadok <[EMAIL PROTECTED]> --- fs/compat_ioctl.c |2 +- fs/ioctl.c | 29 + include/linux/fs.h |4 +++- 3 files changed, 25 insertions(+), 10 deletions(-) diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c index a4284cc..a1604ce 100644 --- a/fs/compat_ioctl.c +++ b/fs/compat_ioctl.c @@ -2972,7 +2972,7 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd, } do_ioctl: - error = vfs_ioctl(filp, fd, cmd, arg); + error = do_ioctl(filp, fd, cmd, arg); out_fput: fput_light(filp, fput_needed); out: diff --git a/fs/ioctl.c b/fs/ioctl.c index 652cacf..1ab7b7d 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -16,8 +16,20 @@ #include -static long do_ioctl(struct file *filp, unsigned int cmd, - unsigned long arg) +/** + * vfs_ioctl - call filesystem specific ioctl methods + * @filp: [in] open file to invoke ioctl method on + * @cmd: [in] ioctl command to execute + * @arg: [in/out] command-specific argument for ioctl + * + * Invokes filesystem specific ->unlocked_ioctl, if one exists; otherwise + * invokes * filesystem specific ->ioctl method. If neither method exists, + * returns -ENOTTY. + * + * Returns 0 on success, -errno on error. + */ +long vfs_ioctl(struct file *filp, unsigned int cmd, + unsigned long arg) { int error = -ENOTTY; @@ -39,6 +51,7 @@ static long do_ioctl(struct file *filp, unsigned int cmd, out: return error; } +EXPORT_SYMBOL_GPL(vfs_ioctl); static int file_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) @@ -72,18 +85,18 @@ static int file_ioctl(struct file *filp, unsigned int cmd, return put_user(i_size_read(inode) - filp->f_pos, p); } - return do_ioctl(filp, cmd, arg); + return vfs_ioctl(filp, cmd, arg); } /* * When you add any new common ioctls to the switches above and below * please update compat_sys_ioctl() too. * - * vfs_ioctl() is not for drivers and not intended to be EXPORT_SYMBOL()'d. + * do_ioctl() is not for drivers and not intended to be EXPORT_SYMBOL()'d. * It's just a simple helper for sys_ioctl and compat_sys_ioctl. */ -int vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, - unsigned long arg) +int do_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, +unsigned long arg) { unsigned int flag; int on, error = 0; @@ -152,7 +165,7 @@ int vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, if (S_ISREG(filp->f_path.dentry->d_inode->i_mode)) error = file_ioctl(filp, cmd, arg); else - error = do_ioctl(filp, cmd, arg); + error = vfs_ioctl(filp, cmd, arg); break; } return error; @@ -172,7 +185,7 @@ asmlinkage long sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg) if (error) goto out_fput; - error = vfs_ioctl(filp, fd, cmd, arg); + error = do_ioctl(filp, fd, cmd, arg); out_fput: fput_light(filp, fput_needed); out: diff --git a/include/linux/fs.h b/include/linux/fs.h index b3ec4a4..d513f16 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1924,7 +1924,9 @@ extern int vfs_stat_fd(int dfd, char __user *, struct kstat *); extern int vfs_lstat_fd(int dfd, char __user *, struct kstat *); extern int vfs_fstat(unsigned int, struct kstat *); -extern int vfs_ioctl(struct file *, unsigned int, unsigned int, unsigned long); +extern long vfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg); +extern int do_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, + unsigned long arg); extern void get_filesystem(struct file_system_type *fs); extern void put_filesystem(struct file_system_type *fs); -- 1.5.2.2 - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] Unionfs: use vfs_ioctl
Signed-off-by: Erez Zadok <[EMAIL PROTECTED]> --- fs/unionfs/commonfops.c | 36 ++-- 1 files changed, 6 insertions(+), 30 deletions(-) diff --git a/fs/unionfs/commonfops.c b/fs/unionfs/commonfops.c index 7654bcb..c99b519 100644 --- a/fs/unionfs/commonfops.c +++ b/fs/unionfs/commonfops.c @@ -661,35 +661,6 @@ out: return err; } -/* pass the ioctl to the lower fs */ -static long do_ioctl(struct file *file, unsigned int cmd, unsigned long arg) -{ - struct file *lower_file; - int err; - - lower_file = unionfs_lower_file(file); - - err = security_file_ioctl(lower_file, cmd, arg); - if (err) - goto out; - - err = -ENOTTY; - if (!lower_file || !lower_file->f_op) - goto out; - if (lower_file->f_op->unlocked_ioctl) { - err = lower_file->f_op->unlocked_ioctl(lower_file, cmd, arg); - } else if (lower_file->f_op->ioctl) { - lock_kernel(); - err = lower_file->f_op->ioctl( - lower_file->f_path.dentry->d_inode, - lower_file, cmd, arg); - unlock_kernel(); - } - -out: - return err; -} - /* * return to user-space the branch indices containing the file in question * @@ -756,6 +727,7 @@ out: long unionfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { long err; + struct file *lower_file; unionfs_read_lock(file->f_path.dentry->d_sb); @@ -779,7 +751,11 @@ long unionfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg) default: /* pass the ioctl down */ - err = do_ioctl(file, cmd, arg); + lower_file = unionfs_lower_file(file); + if (lower_file) + err = vfs_ioctl(lower_file, cmd, arg); + else + err = -ENOTTY; break; } -- 1.5.2.2 - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3] 0/4 fs/ioctl.c coding style, function renaming/factoring
This series of 4 proposed patches (take 3) changes fs/ioctl.c and Unionfs as follows. This series is against v2.6.24-rc1-423-g97855b4. Patch 1: just applies coding standards to fs/ioctl.c (while I'm at it, I figured it's worth cleaning VFS files one at a time). Patch 2: does two things: (a) Renames the old vfs_ioctl to do_ioctl, because the comment above it clearly indicates that it is an internal function not to be exported to modules; therefore it should have a more traditional do_XXX "internal function" name. The new do_ioctl is exported in fs.h but not to modules. (b) Renames the old (static) do_ioctl to vfs_ioctl because the names vfs_XXX should preferably be reserved to callable VFS functions which modules may call, as other vfs_XXX functions already do. Export the new vfs_ioctl to (GPL) modules so others can use it (including Unionfs and eCryptfs). Patch 3: factors out the switch statements' cases for FIBMAP/FIONBIO/FIOASYNC, into three small static helper functions. Patch 4: demonstrates how Unionfs can use the new vfs_ioctl. I successfully tested unionfs with this new exported vfs_ioctl. (eCryptfs could do the same.) I'd like to propose that the first 3 patches be merged in -mm and even mainline, pending review. Erez Zadok (4): VFS: apply coding standards to fs/ioctl.c VFS: swap do_ioctl and vfs_ioctl names VFS: factor out three helpers for FIBMAP/FIONBIO/FIOASYNC file ioctls Unionfs: use vfs_ioctl fs/compat_ioctl.c |2 fs/ioctl.c | 224 fs/unionfs/commonfops.c | 36 +-- include/linux/fs.h |4 4 files changed, 141 insertions(+), 125 deletions(-) Cheers, Erez. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] VFS: factor out three helpers for FIBMAP/FIONBIO/FIOASYNC file ioctls
On Tue, Oct 30, 2007 at 01:49:48PM -0400, Erez Zadok wrote: > BTW, what's the origin of this oddity in fs/ioctl.c: > > #ifdef __sparc__ > /* SunOS compatibility item. */ > if (O_NONBLOCK != O_NDELAY) > flag |= O_NDELAY; > #endif > > It seems rather odd to have architecture-specific code in the VFS, no? When Dave did the sparc port he followed sunos ABIs for lots of things. When these ABIs are hidden behind ioctl arguments they can't easily be handled in arch code and we need warts like that. It's definitively not recommended for new ports.. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: migratepage failures on reiserfs
On Tue, 30 Oct 2007 10:27:04 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote: > Hi, > > While testing hotplug memory remove, I ran into this issue. Given a > range of pages hotplug memory remove tries to migrate those pages. > > migrate_pages() keeps failing to migrate pages containing pagecache > pages for reiserfs files. I noticed that reiserfs doesn't have > ->migratepage() ops. So, fallback_migrate_page() code tries to > do try_to_release_page(). try_to_release_page() fails to > drop_buffers() since b_count == 1. Here is what my debug shows: > > migrate pages failed pfn 258111/flags 3f801 > bh cb53f6e0 flags 110029 count 1 > > Any one know why the b_count == 1 and not getting dropped to zero ? If these are file data pages, the count is probably elevated as part of the data=ordered tracking. You can verify this via b_private, or just mount data=writeback to double check. -chris - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] VFS: factor out three helpers for FIBMAP/FIONBIO/FIOASYNC file ioctls
BTW, what's the origin of this oddity in fs/ioctl.c: #ifdef __sparc__ /* SunOS compatibility item. */ if (O_NONBLOCK != O_NDELAY) flag |= O_NDELAY; #endif It seems rather odd to have architecture-specific code in the VFS, no? Erez. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names
In message <[EMAIL PROTECTED]>, Christoph Hellwig writes: > On Tue, Oct 30, 2007 at 08:22:40AM -0700, Randy Dunlap wrote: > > They are just treated as part of the parameter explanation text. > > I don't see any problem with them. > > Well, it's completely inconsistant with any other kerneldoc.. If it doesn't hurt, and kerneldoc will present it as such, then I'd leave the [in/out] in place. Ioctls are the few places where you can have variables used as both input and output; so _somehow_ we need to capture that behavior. Erez. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
migratepage failures on reiserfs
Hi, While testing hotplug memory remove, I ran into this issue. Given a range of pages hotplug memory remove tries to migrate those pages. migrate_pages() keeps failing to migrate pages containing pagecache pages for reiserfs files. I noticed that reiserfs doesn't have ->migratepage() ops. So, fallback_migrate_page() code tries to do try_to_release_page(). try_to_release_page() fails to drop_buffers() since b_count == 1. Here is what my debug shows: migrate pages failed pfn 258111/flags 3f801 bh cb53f6e0 flags 110029 count 1 Any one know why the b_count == 1 and not getting dropped to zero ? Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names
On Tue, 30 Oct 2007 17:14:36 + Christoph Hellwig wrote: > On Tue, Oct 30, 2007 at 08:22:40AM -0700, Randy Dunlap wrote: > > They are just treated as part of the parameter explanation text. > > I don't see any problem with them. > > Well, it's completely inconsistant with any other kerneldoc.. True. and it has an advantage. --- ~Randy - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names
On Tue, Oct 30, 2007 at 08:22:40AM -0700, Randy Dunlap wrote: > They are just treated as part of the parameter explanation text. > I don't see any problem with them. Well, it's completely inconsistant with any other kerneldoc.. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal to improve filesystem/block snapshot interaction
On Tuesday 30 October 2007, Jörn Engel wrote: > On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote: > > On 10/30/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote: > > > > > > Not sure. Why shouldn't you be able to reorder the hints provided that > > > they don't overlap with read/write bios for the same block? > > > > You're right. The bios can be reordered if they don't overlap with hint. > > I would keep things simpler. Bios can be reordered, full stop. If an > erase and a write overlap, the caller (filesystem?) has to add a > barrier. I thought bios were already ordered if they affect the same blocks. Either way, I agree that an erase should not be treated special on the bio layer, its ordering should be handled the same way we do it for writes. Arnd <>< - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Mon, Oct 29, 2007 at 08:06:04AM +, Alan Cox wrote: > On Sun, 28 Oct 2007 13:43:21 -0400 > "J. Bruce Fields" <[EMAIL PROTECTED]> wrote: > > > From: J. Bruce Fields <[EMAIL PROTECTED]> > > > > We currently attempt to return -EDEALK to blocking fcntl() file locking > > requests that would create a cycle in the graph of tasks waiting on > > locks. > > > > This is inefficient: in the general case it requires us determining > > whether we're adding a cycle to an arbitrary directed acyclic graph. > > And this calculation has to be performed while holding a lock (currently > > the BKL) that prevents that graph from changing. > > > > It has historically been a source of bugs; most recently it was noticed > > that it could loop indefinitely while holding the BKL. > > > > It seems unlikely to be useful to applications: > > - The difficulty of implementation has kept standards from > > requiring it. (E.g. SUSv3 : "Since implementation of full > > deadlock detection is not always feasible, the [EDEADLK] error > > was made optional.") So portable applications may not be able to > > depend on it. > > - It only detects deadlocks that involve nothing but local posix > > file locks; deadlocks involving network filesystems or other kinds > > of locks or resources are missed. > > > > It therefore seems best to remove deadlock detection. > > > > Signed-off-by: J. Bruce Fields <[EMAIL PROTECTED]> > > > NAK. This is an ABI change OK, fair enough. > and one that was rejected before when this was last discussed in > detail. That previous discussion (http://marc.info/?t=10865285743&r=1&w=2) read the spec wrong and--now that I reread it--failed to address the original bug report. In fact it looks to me like the actual bug reported: http://bugzilla.kernel.org/show_bug.cgi?id=2829 was probably a minor variant of the one which George Davis stumbled on again here. That's a little depressing. > Moving it out of BKL makes a ton of sense, That would at least reduce the impact of bugs here, yeah. It would be great to figure out how to start getting locks.c and lockd out from under the BKL, but I don't personally have the time now. (And I believe there's been a failed attempt or two so it's not completely easy.) > even adding a "don't check" flag makes a lot of sense. That's an idea I'll keep in mind, though I'm not convinced it'll help much (or that applications would actually use it). > The failure case for removing this feature is obscure and hard to debug > application hangs for the afflicted programs - not nice for users at all. Yeah. I'd still be curious for any data about applications that actually rely on posix deadlock detection, though. --b. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal to improve filesystem/block snapshot interaction
On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote: > On 10/30/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote: > > > > Not sure. Why shouldn't you be able to reorder the hints provided that > > they don't overlap with read/write bios for the same block? > > You're right. The bios can be reordered if they don't overlap with hint. I would keep things simpler. Bios can be reordered, full stop. If an erase and a write overlap, the caller (filesystem?) has to add a barrier. Jörn -- My second remark is that our intellectual powers are rather geared to master static relations and that our powers to visualize processes evolving in time are relatively poorly developed. -- Edsger W. Dijkstra - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Mon, Oct 29, 2007 at 10:15:19AM +0100, Jiri Kosina wrote: > On Sun, 28 Oct 2007, J. Bruce Fields wrote: > > > But, OK, if we can identify unshared current->files at the time we put a > > task to sleep, then a slight modification of our current algorithm might > > be sufficient to detect any deadlock that involves purely posix file > > locks and processes. And we can tell people that avoiding deadlock is > > their problem as soon as any task with a shared current->files is > > involved. (Ditto, I assume, if nfsd/lockd acquires a lock.) > > Don't forget that comparing file_lock->fl_owner (i.e. current->files) is > not the only way how lock ownership could be computed (there could be > specific file_lock->fl_lmops->fl_compare_owner() and all of them should > be teached this new semantics, right?). Right. So, for example, this would handle both cases as described above. We're turning off deadlock detection in the problematic cases just by neglecting to add such waiting lockers to the blocked_list (which is what we search to look for lock cycles). (It'd be nice if we didn't have to search that list at all. There should be some way to set up the data structures so we can follow the lock dependencies without having to scan through all the blocked locks at each step. But I haven't quite figured out how to do that yet. And perhaps it's not important to optimize cases where there are lots of blocks.) --b. diff --git a/fs/locks.c b/fs/locks.c index 8b8388e..5446305 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -512,6 +512,8 @@ static void locks_delete_block(struct file_lock *waiter) unlock_kernel(); } +static int posix_owner_shared(struct file_lock *caller_fl); + /* Insert waiter into blocker's block list. * We use a circular list so that processes can be easily woken up in * the order they blocked. The documentation doesn't require this but @@ -523,7 +525,7 @@ static void locks_insert_block(struct file_lock *blocker, BUG_ON(!list_empty(&waiter->fl_block)); list_add_tail(&waiter->fl_block, &blocker->fl_block); waiter->fl_next = blocker; - if (IS_POSIX(blocker)) + if (IS_POSIX(blocker) && !posix_owner_shared(waiter)) list_add(&waiter->fl_link, &blocked_list); } @@ -683,46 +685,79 @@ posix_test_lock(struct file *filp, struct file_lock *fl) EXPORT_SYMBOL(posix_test_lock); -/* This function tests for deadlock condition before putting a process to - * sleep. The detection scheme is no longer recursive. Recursive was neat, - * but dangerous - we risked stack corruption if the lock data was bad, or - * if the recursion was too deep for any other reason. - * - * We rely on the fact that a task can only be on one lock's wait queue - * at a time. When we find blocked_task on a wait queue we can re-search - * with blocked_task equal to that queue's owner, until either blocked_task - * isn't found, or blocked_task is found on a queue owned by my_task. - * - * Note: the above assumption may not be true when handling lock requests - * from a broken NFS client. But broken NFS clients have a lot more to - * worry about than proper deadlock detection anyway... --okir - * - * However, the failure of this assumption (also possible in the case of - * multiple tasks sharing the same open file table) also means there's no - * guarantee that the loop below will terminate. As a hack, we give up - * after a few iterations. +/* + * Deadlock detection: + * + * We attempt to detect deadlocks that are due purely to posix file + * locks. + * + * We assume that a task can be waiting for at most one lock at a time. + * So for any acquired lock, the process holding that lock may be + * waiting on at most one other lock. That lock in turns may be held by + * someone waiting for at most one other lock. Given a requested lock + * caller_fl which is about to wait for a conflicting lock block_fl, we + * follow this chain of waiters to ensure we are not about to create a + * cycle. + * + * Since we do this before we ever put a process to sleep on a lock, we + * are ensured that there is never a cycle; that is what guarantees that + * the while() loop in posix_locks_deadlock() eventually completes. + * + * Note: the above assumption may not be true when handling lock + * requests from a broken NFS client. It may also fail in the presence + * of tasks (such as posix threads) sharing the same open file table. + * + * We don't necessarily care about returning EDEALK correctly in such + * cases, but we do need to avoid cycles in the lock dependency graph in + * order to ensure the loop in posix_locks_deadlock eventually + * terminates. To that end, we enforce the assumption above by refusing + * to return EDEADLK or add to the list of blocked locks in any case + * where a lock owner might be able to block on more than one lock. */ -#define MAX_DEADLK_ITERATIONS 10 +static int posix_owner_shared(struct file_lock *caller_fl) +{ + /* +* The cal
Re: [PATCH, RESEND] locks: fix possible infinite loop in posix deadlock detection
On Tue, 30 Oct 2007 11:20:02 -0400 "J. Bruce Fields" <[EMAIL PROTECTED]> wrote: > From: J. Bruce Fields <[EMAIL PROTECTED]> > > It's currently possible to send posix_locks_deadlock() into an infinite > loop (under the BKL). > > For now, fix this just by bailing out after a few iterations. We may > want to fix this in a way that better clarifies the semantics of > deadlock detection. But that will take more time, and this minimal fix > is probably adequate for any realistic scenario, and is simple enough to > be appropriate for applying to stable kernels now. > > Thanks to George Davis for reporting the problem. > > Cc: "George G. Davis" <[EMAIL PROTECTED]> > Signed-off-by: J. Bruce Fields <[EMAIL PROTECTED]> Acked-by: Alan Cox <[EMAIL PROTECTED]> Its a good fix for now and I doubt any real world user has that complex a locking pattern to break. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/2] readdir() as an inode operation
> This is a first try to move readdir() to become an inode operation. This is > necessary for a VFS implementation of "something like union-mounts" where a > readdir() needs to read the directory contents of multiple directories. > Besides that the new interface is no longer giving the struct file to the > filesystem implementations anymore. > > Comments, please? Hmm, are you sure there are no users which keep some per-struct-file information for directories? File offset is one such obvious thing which you've handled but actually filesystem with more complicated structure of directory may remember some hints about where we really are, keep some readahead information or so... Honza -- Jan Kara <[EMAIL PROTECTED]> SuSE CR Labs - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names
On Tue, 30 Oct 2007 09:56:53 + Christoph Hellwig wrote: > On Sun, Oct 28, 2007 at 08:40:56PM -0400, Erez Zadok wrote: > > +/** > > + * vfs_ioctl - call filesystem specific ioctl methods > > + * > > + * @filp: [in] open file to invoke ioctl method on > > + * @cmd: [in] ioctl command to execute > > + * @arg: [in/out] command-specific argument for ioctl > > I've never seen these in/out annotations and doubt they're valid in > kerneldoc. Randy? They are just treated as part of the parameter explanation text. I don't see any problem with them. > > + * Invokes filesystem specific ->unlock_ioctl, if one exists; otherwise > > unlocked_ioctl --- ~Randy - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH, RESEND] locks: fix possible infinite loop in posix deadlock detection
From: J. Bruce Fields <[EMAIL PROTECTED]> It's currently possible to send posix_locks_deadlock() into an infinite loop (under the BKL). For now, fix this just by bailing out after a few iterations. We may want to fix this in a way that better clarifies the semantics of deadlock detection. But that will take more time, and this minimal fix is probably adequate for any realistic scenario, and is simple enough to be appropriate for applying to stable kernels now. Thanks to George Davis for reporting the problem. Cc: "George G. Davis" <[EMAIL PROTECTED]> Signed-off-by: J. Bruce Fields <[EMAIL PROTECTED]> --- fs/locks.c | 11 +++ 1 files changed, 11 insertions(+), 0 deletions(-) I didn't see objections to this quick fix (just to the followup that attempts to rip out posix deadlock detection entirely), so I'm resending with just comment modifications. I haven't given up on a more comprehensive solution, but I think we really need to apply some fix now. --b. diff --git a/fs/locks.c b/fs/locks.c index 0127a28..8b8388e 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -696,17 +696,28 @@ EXPORT_SYMBOL(posix_test_lock); * Note: the above assumption may not be true when handling lock requests * from a broken NFS client. But broken NFS clients have a lot more to * worry about than proper deadlock detection anyway... --okir + * + * However, the failure of this assumption (also possible in the case of + * multiple tasks sharing the same open file table) also means there's no + * guarantee that the loop below will terminate. As a hack, we give up + * after a few iterations. */ + +#define MAX_DEADLK_ITERATIONS 10 + static int posix_locks_deadlock(struct file_lock *caller_fl, struct file_lock *block_fl) { struct file_lock *fl; + int i = 0; next_task: if (posix_same_owner(caller_fl, block_fl)) return 1; list_for_each_entry(fl, &blocked_list, fl_link) { if (posix_same_owner(fl, block_fl)) { + if (i++ > MAX_DEADLK_ITERATIONS) + return 0; fl = fl->fl_next; block_fl = fl; goto next_task; -- 1.5.3.4.208.gc990 - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal to improve filesystem/block snapshot interaction
On Tue, 30 October 2007 18:35:08 +0900, Dongjun Shin wrote: > On 10/30/07, Greg Banks <[EMAIL PROTECTED]> wrote: > > > > BIO_HINT_RELEASE > > The bio's block extent is no longer in use by the filesystem > > and will not be read in the future. Any storage used to back > > the extent may be released without any threat to filesystem > > or data integrity. > > I'd like to second the proposal, but it would be more useful to bring the hint > down to the physical devices. Absolutely. Logfs would love to have an erase operation for block devices as well. However the above doesn't quite match my needs, because the blocks _will_ be read in the future. There are two reasons for reading things back later. The good one is to determine whether the segment was erased or not. Reads should return either valid data or one of (all-0xff, all-0x00, -ESOMETHING). Having a dedicated error code would be best. And getting the device erasesize would be useful as well, for obvious reasons. Jörn -- When you close your hand, you own nothing. When you open it up, you own the whole world. -- Li Mu Bai in Tiger & Dragon - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal to improve filesystem/block snapshot interaction
On 10/30/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote: > > Not sure. Why shouldn't you be able to reorder the hints provided that > they don't overlap with read/write bios for the same block? > You're right. The bios can be reordered if they don't overlap with hint. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal to improve filesystem/block snapshot interaction
On Tuesday 30 October 2007, Dongjun Shin wrote: > Anyway, BIO_HINT_RELEASE could destroy the content of the blocks > after being passed to the device. I think that other bio should not be > reordered accross that hint (just like barrier). Not sure. Why shouldn't you be able to reorder the hints provided that they don't overlap with read/write bios for the same block? Arnd <>< - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal to improve filesystem/block snapshot interaction
On 10/30/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote: > This make me curious, why would t13 want to invent a new command when > there is already the erase command from CFA? > > It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE > should probably be mapped to CFA_ERASE (0xc0) on drives that support it: > http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf > I'm not sure about the background. However, it's definitely a sign that passing the deleted block info to the flash-based storage is useful. Anyway, BIO_HINT_RELEASE could destroy the content of the blocks after being passed to the device. I think that other bio should not be reordered accross that hint (just like barrier). Dongjun - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal to improve filesystem/block snapshot interaction
On Tuesday 30 October 2007, Dongjun Shin wrote: > There is an ongoing discussion about adding 'Trim' ATA command for notifying > the drive about the deleted blocks. > > http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf > > This is especially useful for the storage device like Solid State Drive (SSD). > This make me curious, why would t13 want to invent a new command when there is already the erase command from CFA? It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE should probably be mapped to CFA_ERASE (0xc0) on drives that support it: http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf Arnd <>< - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] VFS: factor out three helpers for FIBMAP/FIONBIO/FIOASYNC file ioctls
> +static int __ioctl_fibmap(struct file *filp, int __user *p) I'd say kill the __ prefix for all the functions you're adding. > +static int __ioctl_fionbio(struct file *filp, unsigned long arg) > +static int __ioctl_fioasync(unsigned int fd, struct file *filp, > + unsigned long arg) For these two I'd add a void __user *argp = (void __user *)arg; in the caller and then just pass argp down. That way we have the cast in one place. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names
On Sun, Oct 28, 2007 at 08:40:56PM -0400, Erez Zadok wrote: > +/** > + * vfs_ioctl - call filesystem specific ioctl methods > + * > + * @filp: [in] open file to invoke ioctl method on > + * @cmd: [in] ioctl command to execute > + * @arg: [in/out] command-specific argument for ioctl I've never seen these in/out annotations and doubt they're valid in kerneldoc. Randy? > + * Invokes filesystem specific ->unlock_ioctl, if one exists; otherwise unlocked_ioctl - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] VFS: apply coding standards to fs/ioctl.c
On Sun, Oct 28, 2007 at 07:57:47PM -0700, Daniel Phillips wrote: > On 10/28/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote: > > While you're at it, it's probably worth splitting this out into > > a small helper function. > > Why? Is the same pattern called from more than one place? Becauase it's a lot more readable. ioctl subcommands are invidividual functionality, and separating them out into small self-contained functions makes the code a lot more readable. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] VFS: apply coding standards to fs/ioctl.c
On Sun, Oct 28, 2007 at 02:05:16PM -0400, Erez Zadok wrote: > > Sure. I assume you mean an internal function to encapsulate the entire case > statement's code, one for each of the FIO* cases. Yes. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal to improve filesystem/block snapshot interaction
On 10/30/07, Greg Banks <[EMAIL PROTECTED]> wrote: > > BIO_HINT_RELEASE > The bio's block extent is no longer in use by the filesystem > and will not be read in the future. Any storage used to back > the extent may be released without any threat to filesystem > or data integrity. > I'd like to second the proposal, but it would be more useful to bring the hint down to the physical devices. There is an ongoing discussion about adding 'Trim' ATA command for notifying the drive about the deleted blocks. http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf This is especially useful for the storage device like Solid State Drive (SSD). Dongjun - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html