date:20071030

Re: [RFC 0/2] readdir() as an inode operation

2007-10-30 Thread Theodore Tso

On Tue, Oct 30, 2007 at 04:26:04PM +0100, Jan Kara wrote:
> > This is a first try to move readdir() to become an inode operation. This is
> > necessary for a VFS implementation of "something like union-mounts" where a
> > readdir() needs to read the directory contents of multiple directories.
> > Besides that the new interface is no longer giving the struct file to the
> > filesystem implementations anymore.
> > 
> > Comments, please?
>   Hmm, are you sure there are no users which keep some per-struct-file
> information for directories? File offset is one such obvious thing which
> you've handled but actually filesystem with more complicated structure
> of directory may remember some hints about where we really are, keep
> some readahead information or so...

For example, the ext3 filesystem, when it is supported hash tree, does
exactly this.  See ext3_htree_store_dirent() in fs/ext3/dir.c and
ext3_htree_fill_tree() in fs/ext3/namei.c.

So your patch would break ext3 htree support.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread David Chinner

On Wed, Oct 31, 2007 at 03:01:58PM +1100, Greg Banks wrote:
> On Wed, Oct 31, 2007 at 10:56:52AM +1100, David Chinner wrote:
> > On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
> > > On Tuesday October 30, [EMAIL PROTECTED] wrote:
> > > > BIO_HINT_RELEASE
> > > > The bio's block extent is no longer in use by the filesystem
> > > > and will not be read in the future.  Any storage used to back
> > > > the extent may be released without any threat to filesystem
> > > > or data integrity.
> > > 
> > > If the allocation unit of the storage device (e.g. a few MB) does not
> > > match the allocation unit of the filesystem (e.g. a few KB) then for
> > > this to be useful either the storage device must start recording tiny
> > > allocations, or the filesystem should re-release areas as they grow.
> > > i.e. when releasing a range of a device, look in the filesystem's usage
> > > records for the largest surrounding free space, and release all of that.
> > 
> > I figured that the easiest way around this is reporting free space
> > extents, not the amoutn actually freed. e.g.
> > 
> > 4k in file A @ block 10
> > 4k in file B @ block 11
> > 4k free space @ block 12
> > 4k in file C @ block 13
> > 1008k in free space at block 14.
> > 
> > If we free file A, we report that we've released an extent of 4k @ block 10.
> > if we then free file B, we report we've released an extent of 12k @ block 
> > 10.
> > If we then free file C, we report a release of 1024k @ block 10.
> > 
> > Then the underlying device knows what the aggregated free space regions
> > are and can easily release large regions without needing to track tiny
> > allocations and frees done by the filesystem.
> 
> If you could do that in the filesystem, it certainly solve the problem.
> In which case I'll explicitly allow for the hint's extent to overlap
> extents previous extents thus hinted, and define the semantics
> for overlaps.  I think I'll rename the hint to BIO_HINT_RELEASED,
> I think that will make the semantics a little clearer.

I think that can be done - i wouldn't have mentioned it if I didn't
think it was possible to implement ;).

It will require a further btree lookup once the free transaction
hits the disk, but I think that's pretty easy to do. I'd probably
hook xfs_alloc_clear_busy() to do this.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC 0/2] readdir() as an inode operation

2007-10-30 Thread Brad Boyer

On Tue, Oct 30, 2007 at 04:26:04PM +0100, Jan Kara wrote:
> > This is a first try to move readdir() to become an inode operation. This is
> > necessary for a VFS implementation of "something like union-mounts" where a
> > readdir() needs to read the directory contents of multiple directories.
> > Besides that the new interface is no longer giving the struct file to the
> > filesystem implementations anymore.
> > 
> > Comments, please?
>   Hmm, are you sure there are no users which keep some per-struct-file
> information for directories? File offset is one such obvious thing which
> you've handled but actually filesystem with more complicated structure
> of directory may remember some hints about where we really are, keep
> some readahead information or so...

The hfsplus code keeps some extra data in the ->private_data of directories,
although I've lost track of the exact benefit from not looking at the code
in some years. As I recall it was a performance enhancement due to the
fact that the data for all directories is effectively in a single file.
I believe it was easier to keep the current position as a structure
rather than just an offset and be careful about the tracking. The hfs
code is almost identical to the hfsplus code in this area.

Brad Boyer
[EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Greg Banks

On Wed, Oct 31, 2007 at 10:56:52AM +1100, David Chinner wrote:
> On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
> > On Tuesday October 30, [EMAIL PROTECTED] wrote:
> > > BIO_HINT_RELEASE
> > > The bio's block extent is no longer in use by the filesystem
> > > and will not be read in the future.  Any storage used to back
> > > the extent may be released without any threat to filesystem
> > > or data integrity.
> > 
> > If the allocation unit of the storage device (e.g. a few MB) does not
> > match the allocation unit of the filesystem (e.g. a few KB) then for
> > this to be useful either the storage device must start recording tiny
> > allocations, or the filesystem should re-release areas as they grow.
> > i.e. when releasing a range of a device, look in the filesystem's usage
> > records for the largest surrounding free space, and release all of that.
> 
> I figured that the easiest way around this is reporting free space
> extents, not the amoutn actually freed. e.g.
> 
>   4k in file A @ block 10
>   4k in file B @ block 11
>   4k free space @ block 12
>   4k in file C @ block 13
>   1008k in free space at block 14.
> 
> If we free file A, we report that we've released an extent of 4k @ block 10.
> if we then free file B, we report we've released an extent of 12k @ block 10.
> If we then free file C, we report a release of 1024k @ block 10.
> 
> Then the underlying device knows what the aggregated free space regions
> are and can easily release large regions without needing to track tiny
> allocations and frees done by the filesystem.

If you could do that in the filesystem, it certainly solve the problem.
In which case I'll explicitly allow for the hint's extent to overlap
extents previous extents thus hinted, and define the semantics
for overlaps.  I think I'll rename the hint to BIO_HINT_RELEASED,
I think that will make the semantics a little clearer.

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Greg Banks

On Tue, Oct 30, 2007 at 06:35:08PM +0900, Dongjun Shin wrote:
> On 10/30/07, Greg Banks <[EMAIL PROTECTED]> wrote:
> >
> > BIO_HINT_RELEASE
> > The bio's block extent is no longer in use by the filesystem
> > and will not be read in the future.  Any storage used to back
> > the extent may be released without any threat to filesystem
> > or data integrity.
> >
> 
> I'd like to second the proposal, but it would be more useful to bring the hint
> down to the physical devices.
> 
> There is an ongoing discussion about adding 'Trim' ATA command for notifying
> the drive about the deleted blocks.
> 
> http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf

What an interesting document.  Am I reading the change markup correctly,
did it get *simpler* in the last revision?  Wow.

I agree that BIO_HINT_RELEASE would be a good match for the proposed
Trim command.  But I don't think we'll ever be issuing Trims with
more than a single LBA Range Entry, that feature seems unhelpful.

The Trim proposal doesn't specify what happens when a sector which
is already deallocated is deallocated again, presumably this is
supposed to be harmless?

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: migratepage failures on reiserfs

2007-10-30 Thread Zan Lynx

On Tue, 2007-10-30 at 16:08 -0700, Badari wrote:
> Chris Mason wrote:
> > On Tue, 30 Oct 2007 13:54:05 -0800
[cut]
> > The easy way to narrow our search is to try without data=ordered, it is
> > certainly complicating things.
> >   
> 
> I can try that, its my root filesystem :(

You meant to write "can't?"

Download BusyBox and build it into an initramfs.  It's pretty easy, you
can do it yourself.  Or you could download the Debian mkinitramfs (I
think) package and look at it.  Or the Fedora equivalent (I think
mkinitrd).

Then you can boot into that and mount your / with whatever options you
like.

Here's what I use for my own custom BusyBox initramfs /init script:
#!/bin/sh

get_arg() {
local arg="$1"
local x=`cat /proc/cmdline`
for i in $x; do
if [ "${i%=*}" = "$arg" ]; then
echo ${i#*=}
break
fi
done
}

do_switch() {
mount -t proc none /proc
local root=`get_arg root`
local flags=`get_arg rootflags`
mount "$root" /new ${flags:+-o "$flags"}
umount /proc
cd /new
exec switch_root . /sbin/init
}

do_shell() {
exec /sbin/init
}

# The following will wait 2 seconds for Enter before booting.
trap "do_switch" ALRM

target=$$
( sleep 2; kill -ALRM $target ) &
alarm=$!

echo -n "Press Enter for a shell: "
while read action; do
kill $alarm
break
done
do_shell

-- 
Zan Lynx <[EMAIL PROTECTED]>

signature.asc
Description: This is a digitally signed message part

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread David Chinner

On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
> On Tuesday October 30, [EMAIL PROTECTED] wrote:
> > BIO_HINT_RELEASE
> > The bio's block extent is no longer in use by the filesystem
> > and will not be read in the future.  Any storage used to back
> > the extent may be released without any threat to filesystem
> > or data integrity.
> 
> If the allocation unit of the storage device (e.g. a few MB) does not
> match the allocation unit of the filesystem (e.g. a few KB) then for
> this to be useful either the storage device must start recording tiny
> allocations, or the filesystem should re-release areas as they grow.
> i.e. when releasing a range of a device, look in the filesystem's usage
> records for the largest surrounding free space, and release all of that.

I figured that the easiest way around this is reporting free space
extents, not the amoutn actually freed. e.g.

4k in file A @ block 10
4k in file B @ block 11
4k free space @ block 12
4k in file C @ block 13
1008k in free space at block 14.

If we free file A, we report that we've released an extent of 4k @ block 10.
if we then free file B, we report we've released an extent of 12k @ block 10.
If we then free file C, we report a release of 1024k @ block 10.

Then the underlying device knows what the aggregated free space regions
are and can easily release large regions without needing to track tiny
allocations and frees done by the filesystem.

> I guess that is equally domain specific, but the difference is that if
> you try to read from the DONTCOW part of the snapshot, you get bad
> old data, where as if you try to access the subordinate device of a
> snapshot, you get an IO error - which is probably safer.

If you read from a DONTCOW region you should get zeros back - it's
a hole in the snapshot.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Kyungmin Park

On 10/30/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote:
> On Tuesday 30 October 2007, Dongjun Shin wrote:
> > There is an ongoing discussion about adding 'Trim' ATA command for notifying
> > the drive about the deleted blocks.
> >
> > http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf
> >
> > This is especially useful for the storage device like Solid State Drive 
> > (SSD).
> >
> This make me curious, why would t13 want to invent a new command when
> there is already the erase command from CFA?
>
> It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE
> should probably be mapped to CFA_ERASE (0xc0) on drives that support it:
> http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf
>

IHMO, the main difference is that it requires the physical operation or not.
The CFA_ERAES erases the free blocks, it requires the physical erase operation.
But in trim case, it just unmapped the free blocks at FTL level. it
doesn't require the physical operation.
It's time saving and we can do a lot of works at FTL level internally.

Thank you,
Kyungmin Park

I
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Kyungmin Park

On 10/31/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote:
> On Tuesday 30 October 2007, Jörn Engel wrote:
> > On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote:
> > > On 10/30/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Not sure. Why shouldn't you be able to reorder the hints provided that
> > > > they don't overlap with read/write bios for the same block?
> > >
> > > You're right. The bios can be reordered if they don't overlap with hint.
> >
> > I would keep things simpler. Bios can be reordered, full stop. If an
> > erase and a write overlap, the caller (filesystem?) has to add a
> > barrier.
>
> I thought bios were already ordered if they affect the same blocks.
> Either way, I agree that an erase should not be treated special on
> the bio layer, its ordering should be handled the same way we do it
> for writes.
>

To support the new ATA command (trim, or dataset), the suggested hint
is not enough.
We have to send the bio with data (at least one sector or more) since
the new ATA command requests the dataset information.

And also we have to strictly follow the order using barrier or other
methods at filesystem level
For example, the delete operation in ext3.
1. delete some file
2. ext3_delete_inode() called
3. ... -> ext3_free_blocks_sb() releases the free blocks
4. If it sends the hints here, it breaks the ext3 power off recovery
scheme since it trims the data from given information after reboot
5. after transaction, all dirty pages are flushed. after this work, we
can trim the free blocks safely.

Another approach is modifying the block framework.
At  I/O scheduler, it don't merge the hint bio (in my terminology, bio
control info) with general bio. In this case we also consider the
reordering problem.
I'm not sure it is possible at this time.

Thank you,
Kyungmin Park
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: migratepage failures on reiserfs

2007-10-30 Thread Badari


Chris Mason wrote:

On Tue, 30 Oct 2007 13:54:05 -0800
Badari Pulavarty <[EMAIL PROTECTED]> wrote:

  

On Tue, 2007-10-30 at 13:54 -0400, Chris Mason wrote:


On Tue, 30 Oct 2007 10:27:04 -0800
Badari Pulavarty <[EMAIL PROTECTED]> wrote:

  

Hi,

While testing hotplug memory remove, I ran into this issue. Given
a range of pages hotplug memory remove tries to migrate those
pages.

migrate_pages() keeps failing to migrate pages containing
pagecache pages for reiserfs files. I noticed that reiserfs
doesn't have ->migratepage() ops. So, fallback_migrate_page()
code tries to do try_to_release_page(). try_to_release_page()
fails to drop_buffers() since b_count == 1. Here is what my debug
shows:

migrate pages failed pfn 258111/flags 3f801
bh cb53f6e0 flags 110029 count 1

Any one know why the b_count == 1 and not getting dropped to
zero ? 


If these are file data pages, the count is probably elevated as
part of the data=ordered tracking.  You can verify this via
b_private, or just mount data=writeback to double check.
  

Chris,

That was my first assumption. But after looking at
reiserfs_releasepage (), realized that it would do reiserfs_free_jh()
and clears the b_private. I couldn't easily find out who has the ref.
against this bh.

bh cbdaaf00 flags 110029 count 1 private 0




If I'm reading this correctly the buffer is BH_Lock | BH_Req, perhaps
it is currently under IO?
  

Its BH_Req | BH_Uptodate. Its not under IO.

The page isn't locked, but data=ordered does IO directly on the buffer
heads, without taking the page lock.

The easy way to narrow our search is to try without data=ordered, it is
certainly complicating things.
  


I can try that, its my root filesystem :(

Thanks,
Badari
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3] 0/4 fs/ioctl.c coding style, function renaming/factoring

2007-10-30 Thread Andrew Morton

On Tue, 30 Oct 2007 15:39:55 -0400
Erez Zadok <[EMAIL PROTECTED]> wrote:

> This series of 4 proposed patches (take 3) changes fs/ioctl.c and Unionfs as
> follows.

The problem is of course that you need these in your tree for ongoing
development, but 2.6.25 is months away.  They look OK to me so I suggest
that you go ahead and commit them to your git tree and I'll drop them
again.  Please resend them for merging in the 2.6.25-rc1 merge window.

unionfs has been hanging around for a long time now and we should work
towards getting it into 2.6.25 or dropped from -mm (sorry).  Right now
would be a great time to get this process underway.  

I have only a partial memory of what the sticking points were, and I have
basically zero knowledge of what was done to address them.  So could you
please, over the next few weeks:

- Send out a description of what the issues were, and how they were addressed

- If issues remain, describe their impact and possible workarounds, all
  that stuff.

- If it mostly-survives all that design-level review and consideration
  then let's go for it: get all the patches cleaned up and consolidated and
  get them emailed out for review no later than 2.6.24-rc5.

Thanks.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: migratepage failures on reiserfs

2007-10-30 Thread Chris Mason

On Tue, 30 Oct 2007 13:54:05 -0800
Badari Pulavarty <[EMAIL PROTECTED]> wrote:

> On Tue, 2007-10-30 at 13:54 -0400, Chris Mason wrote:
> > On Tue, 30 Oct 2007 10:27:04 -0800
> > Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> > 
> > > Hi,
> > > 
> > > While testing hotplug memory remove, I ran into this issue. Given
> > > a range of pages hotplug memory remove tries to migrate those
> > > pages.
> > > 
> > > migrate_pages() keeps failing to migrate pages containing
> > > pagecache pages for reiserfs files. I noticed that reiserfs
> > > doesn't have ->migratepage() ops. So, fallback_migrate_page()
> > > code tries to do try_to_release_page(). try_to_release_page()
> > > fails to drop_buffers() since b_count == 1. Here is what my debug
> > > shows:
> > > 
> > >   migrate pages failed pfn 258111/flags 3f801
> > >   bh cb53f6e0 flags 110029 count 1
> > >   
> > > Any one know why the b_count == 1 and not getting dropped to
> > > zero ? 
> > 
> > If these are file data pages, the count is probably elevated as
> > part of the data=ordered tracking.  You can verify this via
> > b_private, or just mount data=writeback to double check.
> 
> 
> Chris,
> 
> That was my first assumption. But after looking at
> reiserfs_releasepage (), realized that it would do reiserfs_free_jh()
> and clears the b_private. I couldn't easily find out who has the ref.
> against this bh.
> 
> bh cbdaaf00 flags 110029 count 1 private 0
> 

If I'm reading this correctly the buffer is BH_Lock | BH_Req, perhaps
it is currently under IO?

The page isn't locked, but data=ordered does IO directly on the buffer
heads, without taking the page lock.

The easy way to narrow our search is to try without data=ordered, it is
certainly complicating things.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: migratepage failures on reiserfs

2007-10-30 Thread Badari Pulavarty

On Tue, 2007-10-30 at 13:54 -0400, Chris Mason wrote:
> On Tue, 30 Oct 2007 10:27:04 -0800
> Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> 
> > Hi,
> > 
> > While testing hotplug memory remove, I ran into this issue. Given a
> > range of pages hotplug memory remove tries to migrate those pages.
> > 
> > migrate_pages() keeps failing to migrate pages containing pagecache
> > pages for reiserfs files. I noticed that reiserfs doesn't have 
> > ->migratepage() ops. So, fallback_migrate_page() code tries to
> > do try_to_release_page(). try_to_release_page() fails to
> > drop_buffers() since b_count == 1. Here is what my debug shows:
> > 
> > migrate pages failed pfn 258111/flags 3f801
> > bh cb53f6e0 flags 110029 count 1
> > 
> > Any one know why the b_count == 1 and not getting dropped to zero ? 
> 
> If these are file data pages, the count is probably elevated as part of
> the data=ordered tracking.  You can verify this via b_private, or just
> mount data=writeback to double check.


Chris,

That was my first assumption. But after looking at reiserfs_releasepage
(), realized that it would do reiserfs_free_jh() and clears the
b_private. I couldn't easily find out who has the ref. against this
bh.

bh cbdaaf00 flags 110029 count 1 private 0

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/4] VFS: factor out three helpers for FIBMAP/FIONBIO/FIOASYNC file ioctls

2007-10-30 Thread Erez Zadok

Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
---
 fs/ioctl.c |  129 +++-
 1 files changed, 75 insertions(+), 54 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 1ab7b7d..cd8c1a3 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -53,32 +53,34 @@ long vfs_ioctl(struct file *filp, unsigned int cmd,
 }
 EXPORT_SYMBOL_GPL(vfs_ioctl);
 
+static int ioctl_fibmap(struct file *filp, int __user *p)
+{
+   struct address_space *mapping = filp->f_mapping;
+   int res, block;
+
+   /* do we support this mess? */
+   if (!mapping->a_ops->bmap)
+   return -EINVAL;
+   if (!capable(CAP_SYS_RAWIO))
+   return -EPERM;
+   res = get_user(block, p);
+   if (res)
+   return res;
+   lock_kernel();
+   res = mapping->a_ops->bmap(mapping, block);
+   unlock_kernel();
+   return put_user(res, p);
+}
+
 static int file_ioctl(struct file *filp, unsigned int cmd,
unsigned long arg)
 {
-   int error;
-   int block;
struct inode *inode = filp->f_path.dentry->d_inode;
int __user *p = (int __user *)arg;
 
switch (cmd) {
case FIBMAP:
-   {
-   struct address_space *mapping = filp->f_mapping;
-   int res;
-   /* do we support this mess? */
-   if (!mapping->a_ops->bmap)
-   return -EINVAL;
-   if (!capable(CAP_SYS_RAWIO))
-   return -EPERM;
-   error = get_user(block, p);
-   if (error)
-   return error;
-   lock_kernel();
-   res = mapping->a_ops->bmap(mapping, block);
-   unlock_kernel();
-   return put_user(res, p);
-   }
+   return ioctl_fibmap(filp, p);
case FIGETBSZ:
return put_user(inode->i_sb->s_blocksize, p);
case FIONREAD:
@@ -88,6 +90,57 @@ static int file_ioctl(struct file *filp, unsigned int cmd,
return vfs_ioctl(filp, cmd, arg);
 }
 
+static int ioctl_fionbio(struct file *filp, int __user *argp)
+{
+   unsigned int flag;
+   int on, error;
+
+   error = get_user(on, argp);
+   if (error)
+   return error;
+   flag = O_NONBLOCK;
+#ifdef __sparc__
+   /* SunOS compatibility item. */
+   if (O_NONBLOCK != O_NDELAY)
+   flag |= O_NDELAY;
+#endif
+   if (on)
+   filp->f_flags |= flag;
+   else
+   filp->f_flags &= ~flag;
+   return error;
+}
+
+static int ioctl_fioasync(unsigned int fd, struct file *filp,
+ int __user *argp)
+{
+   unsigned int flag;
+   int on, error;
+
+   error = get_user(on, argp);
+   if (error)
+   return error;
+   flag = on ? FASYNC : 0;
+
+   /* Did FASYNC state change ? */
+   if ((flag ^ filp->f_flags) & FASYNC) {
+   if (filp->f_op && filp->f_op->fasync) {
+   lock_kernel();
+   error = filp->f_op->fasync(fd, filp, on);
+   unlock_kernel();
+   } else
+   error = -ENOTTY;
+   }
+   if (error)
+   return error;
+
+   if (on)
+   filp->f_flags |= FASYNC;
+   else
+   filp->f_flags &= ~FASYNC;
+   return error;
+}
+
 /*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
@@ -98,8 +151,8 @@ static int file_ioctl(struct file *filp, unsigned int cmd,
 int do_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 unsigned long arg)
 {
-   unsigned int flag;
-   int on, error = 0;
+   int error = 0;
+   int __user *argp = (int __user *)arg;
 
switch (cmd) {
case FIOCLEX:
@@ -111,43 +164,11 @@ int do_ioctl(struct file *filp, unsigned int fd, unsigned 
int cmd,
break;
 
case FIONBIO:
-   error = get_user(on, (int __user *)arg);
-   if (error)
-   break;
-   flag = O_NONBLOCK;
-#ifdef __sparc__
-   /* SunOS compatibility item. */
-   if (O_NONBLOCK != O_NDELAY)
-   flag |= O_NDELAY;
-#endif
-   if (on)
-   filp->f_flags |= flag;
-   else
-   filp->f_flags &= ~flag;
+   error = ioctl_fionbio(filp, argp);
break;
 
case FIOASYNC:
-   error = get_user(on, (int __user *)arg);
-   if (error)
-   break;
-   flag = on ? FASYNC : 0;
-
-   /* Did FASYNC state change ? */
-   if ((flag ^ filp->f_flags) & FASYNC) {
-   if (filp->f_op && filp->f_op->fasync) {
-   lock_kernel();
-   error = filp->f_o

[PATCH 1/4] VFS: apply coding standards to fs/ioctl.c

2007-10-30 Thread Erez Zadok

Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
---
 fs/ioctl.c |  164 +++-
 1 files changed, 84 insertions(+), 80 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index c2a773e..652cacf 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -12,8 +12,8 @@
 #include 
 #include 
 #include 
+#include 
 
-#include 
 #include 
 
 static long do_ioctl(struct file *filp, unsigned int cmd,
@@ -45,31 +45,31 @@ static int file_ioctl(struct file *filp, unsigned int cmd,
 {
int error;
int block;
-   struct inode * inode = filp->f_path.dentry->d_inode;
+   struct inode *inode = filp->f_path.dentry->d_inode;
int __user *p = (int __user *)arg;
 
switch (cmd) {
-   case FIBMAP:
-   {
-   struct address_space *mapping = filp->f_mapping;
-   int res;
-   /* do we support this mess? */
-   if (!mapping->a_ops->bmap)
-   return -EINVAL;
-   if (!capable(CAP_SYS_RAWIO))
-   return -EPERM;
-   if ((error = get_user(block, p)) != 0)
-   return error;
-
-   lock_kernel();
-   res = mapping->a_ops->bmap(mapping, block);
-   unlock_kernel();
-   return put_user(res, p);
-   }
-   case FIGETBSZ:
-   return put_user(inode->i_sb->s_blocksize, p);
-   case FIONREAD:
-   return put_user(i_size_read(inode) - filp->f_pos, p);
+   case FIBMAP:
+   {
+   struct address_space *mapping = filp->f_mapping;
+   int res;
+   /* do we support this mess? */
+   if (!mapping->a_ops->bmap)
+   return -EINVAL;
+   if (!capable(CAP_SYS_RAWIO))
+   return -EPERM;
+   error = get_user(block, p);
+   if (error)
+   return error;
+   lock_kernel();
+   res = mapping->a_ops->bmap(mapping, block);
+   unlock_kernel();
+   return put_user(res, p);
+   }
+   case FIGETBSZ:
+   return put_user(inode->i_sb->s_blocksize, p);
+   case FIONREAD:
+   return put_user(i_size_read(inode) - filp->f_pos, p);
}
 
return do_ioctl(filp, cmd, arg);
@@ -82,81 +82,85 @@ static int file_ioctl(struct file *filp, unsigned int cmd,
  * vfs_ioctl() is not for drivers and not intended to be EXPORT_SYMBOL()'d.
  * It's just a simple helper for sys_ioctl and compat_sys_ioctl.
  */
-int vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, unsigned 
long arg)
+int vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
+ unsigned long arg)
 {
unsigned int flag;
int on, error = 0;
 
switch (cmd) {
-   case FIOCLEX:
-   set_close_on_exec(fd, 1);
-   break;
+   case FIOCLEX:
+   set_close_on_exec(fd, 1);
+   break;
 
-   case FIONCLEX:
-   set_close_on_exec(fd, 0);
-   break;
+   case FIONCLEX:
+   set_close_on_exec(fd, 0);
+   break;
 
-   case FIONBIO:
-   if ((error = get_user(on, (int __user *)arg)) != 0)
-   break;
-   flag = O_NONBLOCK;
+   case FIONBIO:
+   error = get_user(on, (int __user *)arg);
+   if (error)
+   break;
+   flag = O_NONBLOCK;
 #ifdef __sparc__
-   /* SunOS compatibility item. */
-   if(O_NONBLOCK != O_NDELAY)
-   flag |= O_NDELAY;
+   /* SunOS compatibility item. */
+   if (O_NONBLOCK != O_NDELAY)
+   flag |= O_NDELAY;
 #endif
-   if (on)
-   filp->f_flags |= flag;
-   else
-   filp->f_flags &= ~flag;
+   if (on)
+   filp->f_flags |= flag;
+   else
+   filp->f_flags &= ~flag;
+   break;
+
+   case FIOASYNC:
+   error = get_user(on, (int __user *)arg);
+   if (error)
break;
-
-   case FIOASYNC:
-   if ((error = get_user(on, (int __user *)arg)) != 0)
-   break;
-   flag = on ? FASYNC : 0;
-
-   /* Did FASYNC state change ? */
-   if ((flag ^ filp->f_flags) & FASYNC) {
-   if (filp->f_op && filp->f_op->fasync) {
-

[PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names

2007-10-30 Thread Erez Zadok

Rename old vfs_ioctl to do_ioctl, because the comment above it clearly
indicates that it is an internal function not to be exported to modules;
therefore it should have a more traditional do_XXX name.  The new do_ioctl
is exported in fs.h but not to modules.

Rename the old do_ioctl to vfs_ioctl because the names vfs_XXX should
preferably be reserved to callable VFS functions which modules may call, as
many other vfs_XXX functions already do.  Export the new vfs_ioctl to GPL
modules so others can use it (including Unionfs and eCryptfs).  Add DocBook
for new vfs_ioctl.

Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
---
 fs/compat_ioctl.c  |2 +-
 fs/ioctl.c |   29 +
 include/linux/fs.h |4 +++-
 3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index a4284cc..a1604ce 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -2972,7 +2972,7 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, 
unsigned int cmd,
}
 
  do_ioctl:
-   error = vfs_ioctl(filp, fd, cmd, arg);
+   error = do_ioctl(filp, fd, cmd, arg);
  out_fput:
fput_light(filp, fput_needed);
  out:
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 652cacf..1ab7b7d 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -16,8 +16,20 @@
 
 #include 
 
-static long do_ioctl(struct file *filp, unsigned int cmd,
-   unsigned long arg)
+/**
+ * vfs_ioctl - call filesystem specific ioctl methods
+ * @filp: [in] open file to invoke ioctl method on
+ * @cmd:  [in] ioctl command to execute
+ * @arg:  [in/out] command-specific argument for ioctl
+ *
+ * Invokes filesystem specific ->unlocked_ioctl, if one exists; otherwise
+ * invokes * filesystem specific ->ioctl method.  If neither method exists,
+ * returns -ENOTTY.
+ *
+ * Returns 0 on success, -errno on error.
+ */
+long vfs_ioctl(struct file *filp, unsigned int cmd,
+  unsigned long arg)
 {
int error = -ENOTTY;
 
@@ -39,6 +51,7 @@ static long do_ioctl(struct file *filp, unsigned int cmd,
  out:
return error;
 }
+EXPORT_SYMBOL_GPL(vfs_ioctl);
 
 static int file_ioctl(struct file *filp, unsigned int cmd,
unsigned long arg)
@@ -72,18 +85,18 @@ static int file_ioctl(struct file *filp, unsigned int cmd,
return put_user(i_size_read(inode) - filp->f_pos, p);
}
 
-   return do_ioctl(filp, cmd, arg);
+   return vfs_ioctl(filp, cmd, arg);
 }
 
 /*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
  *
- * vfs_ioctl() is not for drivers and not intended to be EXPORT_SYMBOL()'d.
+ * do_ioctl() is not for drivers and not intended to be EXPORT_SYMBOL()'d.
  * It's just a simple helper for sys_ioctl and compat_sys_ioctl.
  */
-int vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
- unsigned long arg)
+int do_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
+unsigned long arg)
 {
unsigned int flag;
int on, error = 0;
@@ -152,7 +165,7 @@ int vfs_ioctl(struct file *filp, unsigned int fd, unsigned 
int cmd,
if (S_ISREG(filp->f_path.dentry->d_inode->i_mode))
error = file_ioctl(filp, cmd, arg);
else
-   error = do_ioctl(filp, cmd, arg);
+   error = vfs_ioctl(filp, cmd, arg);
break;
}
return error;
@@ -172,7 +185,7 @@ asmlinkage long sys_ioctl(unsigned int fd, unsigned int 
cmd, unsigned long arg)
if (error)
goto out_fput;
 
-   error = vfs_ioctl(filp, fd, cmd, arg);
+   error = do_ioctl(filp, fd, cmd, arg);
  out_fput:
fput_light(filp, fput_needed);
  out:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..d513f16 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1924,7 +1924,9 @@ extern int vfs_stat_fd(int dfd, char __user *, struct 
kstat *);
 extern int vfs_lstat_fd(int dfd, char __user *, struct kstat *);
 extern int vfs_fstat(unsigned int, struct kstat *);
 
-extern int vfs_ioctl(struct file *, unsigned int, unsigned int, unsigned long);
+extern long vfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg);
+extern int do_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
+   unsigned long arg);
 
 extern void get_filesystem(struct file_system_type *fs);
 extern void put_filesystem(struct file_system_type *fs);
-- 
1.5.2.2

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 4/4] Unionfs: use vfs_ioctl

2007-10-30 Thread Erez Zadok

Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
---
 fs/unionfs/commonfops.c |   36 ++--
 1 files changed, 6 insertions(+), 30 deletions(-)

diff --git a/fs/unionfs/commonfops.c b/fs/unionfs/commonfops.c
index 7654bcb..c99b519 100644
--- a/fs/unionfs/commonfops.c
+++ b/fs/unionfs/commonfops.c
@@ -661,35 +661,6 @@ out:
return err;
 }
 
-/* pass the ioctl to the lower fs */
-static long do_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
-{
-   struct file *lower_file;
-   int err;
-
-   lower_file = unionfs_lower_file(file);
-
-   err = security_file_ioctl(lower_file, cmd, arg);
-   if (err)
-   goto out;
-
-   err = -ENOTTY;
-   if (!lower_file || !lower_file->f_op)
-   goto out;
-   if (lower_file->f_op->unlocked_ioctl) {
-   err = lower_file->f_op->unlocked_ioctl(lower_file, cmd, arg);
-   } else if (lower_file->f_op->ioctl) {
-   lock_kernel();
-   err = lower_file->f_op->ioctl(
-   lower_file->f_path.dentry->d_inode,
-   lower_file, cmd, arg);
-   unlock_kernel();
-   }
-
-out:
-   return err;
-}
-
 /*
  * return to user-space the branch indices containing the file in question
  *
@@ -756,6 +727,7 @@ out:
 long unionfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
long err;
+   struct file *lower_file;
 
unionfs_read_lock(file->f_path.dentry->d_sb);
 
@@ -779,7 +751,11 @@ long unionfs_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 
default:
/* pass the ioctl down */
-   err = do_ioctl(file, cmd, arg);
+   lower_file = unionfs_lower_file(file);
+   if (lower_file)
+   err = vfs_ioctl(lower_file, cmd, arg);
+   else
+   err = -ENOTTY;
break;
}
 
-- 
1.5.2.2

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3] 0/4 fs/ioctl.c coding style, function renaming/factoring

2007-10-30 Thread Erez Zadok


This series of 4 proposed patches (take 3) changes fs/ioctl.c and Unionfs as
follows.  This series is against v2.6.24-rc1-423-g97855b4.

Patch 1: just applies coding standards to fs/ioctl.c (while I'm at it, I
figured it's worth cleaning VFS files one at a time).

Patch 2: does two things:

(a) Renames the old vfs_ioctl to do_ioctl, because the comment above it
clearly indicates that it is an internal function not to be exported to
modules; therefore it should have a more traditional do_XXX "internal
function" name.  The new do_ioctl is exported in fs.h but not to
modules.

(b) Renames the old (static) do_ioctl to vfs_ioctl because the names vfs_XXX
should preferably be reserved to callable VFS functions which modules
may call, as other vfs_XXX functions already do.  Export the new
vfs_ioctl to (GPL) modules so others can use it (including Unionfs and
eCryptfs).

Patch 3: factors out the switch statements' cases for
FIBMAP/FIONBIO/FIOASYNC, into three small static helper functions.

Patch 4: demonstrates how Unionfs can use the new vfs_ioctl.  I successfully
tested unionfs with this new exported vfs_ioctl.  (eCryptfs could do the
same.)

I'd like to propose that the first 3 patches be merged in -mm and even
mainline, pending review.

Erez Zadok (4):
  VFS: apply coding standards to fs/ioctl.c
  VFS: swap do_ioctl and vfs_ioctl names
  VFS: factor out three helpers for FIBMAP/FIONBIO/FIOASYNC file ioctls
  Unionfs: use vfs_ioctl

 fs/compat_ioctl.c   |2 
 fs/ioctl.c  |  224 
 fs/unionfs/commonfops.c |   36 +--
 include/linux/fs.h  |4 
 4 files changed, 141 insertions(+), 125 deletions(-)

Cheers,
Erez.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] VFS: factor out three helpers for FIBMAP/FIONBIO/FIOASYNC file ioctls

2007-10-30 Thread Christoph Hellwig

On Tue, Oct 30, 2007 at 01:49:48PM -0400, Erez Zadok wrote:
> BTW, what's the origin of this oddity in fs/ioctl.c:
> 
> #ifdef __sparc__
>   /* SunOS compatibility item. */
>   if (O_NONBLOCK != O_NDELAY)
>   flag |= O_NDELAY;
> #endif
> 
> It seems rather odd to have architecture-specific code in the VFS, no?

When Dave did the sparc port he followed sunos ABIs for lots of things.
When these ABIs are hidden behind ioctl arguments they can't easily
be handled in arch code and we need warts like that.  It's definitively
not recommended for new ports..

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: migratepage failures on reiserfs

2007-10-30 Thread Chris Mason

On Tue, 30 Oct 2007 10:27:04 -0800
Badari Pulavarty <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> While testing hotplug memory remove, I ran into this issue. Given a
> range of pages hotplug memory remove tries to migrate those pages.
> 
> migrate_pages() keeps failing to migrate pages containing pagecache
> pages for reiserfs files. I noticed that reiserfs doesn't have 
> ->migratepage() ops. So, fallback_migrate_page() code tries to
> do try_to_release_page(). try_to_release_page() fails to
> drop_buffers() since b_count == 1. Here is what my debug shows:
> 
>   migrate pages failed pfn 258111/flags 3f801
>   bh cb53f6e0 flags 110029 count 1
>   
> Any one know why the b_count == 1 and not getting dropped to zero ? 

If these are file data pages, the count is probably elevated as part of
the data=ordered tracking.  You can verify this via b_private, or just
mount data=writeback to double check.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] VFS: factor out three helpers for FIBMAP/FIONBIO/FIOASYNC file ioctls

2007-10-30 Thread Erez Zadok

BTW, what's the origin of this oddity in fs/ioctl.c:

#ifdef __sparc__
/* SunOS compatibility item. */
if (O_NONBLOCK != O_NDELAY)
flag |= O_NDELAY;
#endif

It seems rather odd to have architecture-specific code in the VFS, no?

Erez.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names

2007-10-30 Thread Erez Zadok

In message <[EMAIL PROTECTED]>, Christoph Hellwig writes:
> On Tue, Oct 30, 2007 at 08:22:40AM -0700, Randy Dunlap wrote:
> > They are just treated as part of the parameter explanation text.
> > I don't see any problem with them.
> 
> Well, it's completely inconsistant with any other kerneldoc..

If it doesn't hurt, and kerneldoc will present it as such, then I'd leave
the [in/out] in place.  Ioctls are the few places where you can have
variables used as both input and output; so _somehow_ we need to capture
that behavior.

Erez.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

migratepage failures on reiserfs

2007-10-30 Thread Badari Pulavarty

Hi,

While testing hotplug memory remove, I ran into this issue. Given a
range of pages hotplug memory remove tries to migrate those pages.

migrate_pages() keeps failing to migrate pages containing pagecache
pages for reiserfs files. I noticed that reiserfs doesn't have 
->migratepage() ops. So, fallback_migrate_page() code tries to
do try_to_release_page(). try_to_release_page() fails to drop_buffers()
since b_count == 1. Here is what my debug shows:

migrate pages failed pfn 258111/flags 3f801
bh cb53f6e0 flags 110029 count 1

Any one know why the b_count == 1 and not getting dropped to zero ? 

Thanks,
Badari 

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names

2007-10-30 Thread Randy Dunlap

On Tue, 30 Oct 2007 17:14:36 + Christoph Hellwig wrote:

> On Tue, Oct 30, 2007 at 08:22:40AM -0700, Randy Dunlap wrote:
> > They are just treated as part of the parameter explanation text.
> > I don't see any problem with them.
> 
> Well, it's completely inconsistant with any other kerneldoc..

True.  and it has an advantage.

---
~Randy
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names

2007-10-30 Thread Christoph Hellwig

On Tue, Oct 30, 2007 at 08:22:40AM -0700, Randy Dunlap wrote:
> They are just treated as part of the parameter explanation text.
> I don't see any problem with them.

Well, it's completely inconsistant with any other kerneldoc..


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Arnd Bergmann

On Tuesday 30 October 2007, Jörn Engel wrote:
> On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote:
> > On 10/30/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote:
> > >
> > > Not sure. Why shouldn't you be able to reorder the hints provided that
> > > they don't overlap with read/write bios for the same block?
> > 
> > You're right. The bios can be reordered if they don't overlap with hint.
> 
> I would keep things simpler.  Bios can be reordered, full stop.  If an
> erase and a write overlap, the caller (filesystem?) has to add a
> barrier.

I thought bios were already ordered if they affect the same blocks.
Either way, I agree that an erase should not be treated special on
the bio layer, its ordering should be handled the same way we do it
for writes.

Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-30 Thread J. Bruce Fields

On Mon, Oct 29, 2007 at 08:06:04AM +, Alan Cox wrote:
> On Sun, 28 Oct 2007 13:43:21 -0400
> "J. Bruce Fields" <[EMAIL PROTECTED]> wrote:
> 
> > From: J. Bruce Fields <[EMAIL PROTECTED]>
> > 
> > We currently attempt to return -EDEALK to blocking fcntl() file locking
> > requests that would create a cycle in the graph of tasks waiting on
> > locks.
> > 
> > This is inefficient: in the general case it requires us determining
> > whether we're adding a cycle to an arbitrary directed acyclic graph.
> > And this calculation has to be performed while holding a lock (currently
> > the BKL) that prevents that graph from changing.
> > 
> > It has historically been a source of bugs; most recently it was noticed
> > that it could loop indefinitely while holding the BKL.
> > 
> > It seems unlikely to be useful to applications:
> > - The difficulty of implementation has kept standards from
> >   requiring it.  (E.g. SUSv3 : "Since implementation of full
> >   deadlock detection is not always feasible, the [EDEADLK] error
> >   was made optional.")  So portable applications may not be able to
> >   depend on it.
> > - It only detects deadlocks that involve nothing but local posix
> >   file locks; deadlocks involving network filesystems or other kinds
> >   of locks or resources are missed.
> > 
> > It therefore seems best to remove deadlock detection.
> > 
> > Signed-off-by: J. Bruce Fields <[EMAIL PROTECTED]>
> 
> 
> NAK. This is an ABI change

OK, fair enough.

> and one that was rejected before when this was last discussed in
> detail.

That previous discussion (http://marc.info/?t=10865285743&r=1&w=2)
read the spec wrong and--now that I reread it--failed to address the
original bug report.  In fact it looks to me like the actual bug
reported:

http://bugzilla.kernel.org/show_bug.cgi?id=2829

was probably a minor variant of the one which George Davis stumbled on
again here.  That's a little depressing.

> Moving it out of BKL makes a ton of sense,

That would at least reduce the impact of bugs here, yeah.  It would be
great to figure out how to start getting locks.c and lockd out from
under the BKL, but I don't personally have the time now.  (And I believe
there's been a failed attempt or two so it's not completely easy.)

> even adding a "don't check" flag makes a lot of sense.

That's an idea I'll keep in mind, though I'm not convinced it'll help
much (or that applications would actually use it).

> The failure case for removing this feature is obscure and hard to debug
> application hangs for the afflicted programs - not nice for users at all.

Yeah.  I'd still be curious for any data about applications that
actually rely on posix deadlock detection, though.

--b.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Jörn Engel

On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote:
> On 10/30/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote:
> >
> > Not sure. Why shouldn't you be able to reorder the hints provided that
> > they don't overlap with read/write bios for the same block?
> 
> You're right. The bios can be reordered if they don't overlap with hint.

I would keep things simpler.  Bios can be reordered, full stop.  If an
erase and a write overlap, the caller (filesystem?) has to add a
barrier.

Jörn

-- 
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-30 Thread J. Bruce Fields

On Mon, Oct 29, 2007 at 10:15:19AM +0100, Jiri Kosina wrote:
> On Sun, 28 Oct 2007, J. Bruce Fields wrote:
> 
> > But, OK, if we can identify unshared current->files at the time we put a 
> > task to sleep, then a slight modification of our current algorithm might 
> > be sufficient to detect any deadlock that involves purely posix file 
> > locks and processes.  And we can tell people that avoiding deadlock is 
> > their problem as soon as any task with a shared current->files is 
> > involved.  (Ditto, I assume, if nfsd/lockd acquires a lock.)
> 
> Don't forget that comparing file_lock->fl_owner (i.e. current->files) is 
> not the only way how lock ownership could be computed (there could be 
> specific file_lock->fl_lmops->fl_compare_owner() and all of them should 
> be teached this new semantics, right?).

Right.  So, for example, this would handle both cases as described
above.

We're turning off deadlock detection in the problematic cases just by
neglecting to add such waiting lockers to the blocked_list (which is
what we search to look for lock cycles).

(It'd be nice if we didn't have to search that list at all.  There
should be some way to set up the data structures so we can follow the
lock dependencies without having to scan through all the blocked locks
at each step.  But I haven't quite figured out how to do that yet.  And
perhaps it's not important to optimize cases where there are lots of
blocks.)

--b.

diff --git a/fs/locks.c b/fs/locks.c
index 8b8388e..5446305 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -512,6 +512,8 @@ static void locks_delete_block(struct file_lock *waiter)
unlock_kernel();
 }
 
+static int posix_owner_shared(struct file_lock *caller_fl);
+
 /* Insert waiter into blocker's block list.
  * We use a circular list so that processes can be easily woken up in
  * the order they blocked. The documentation doesn't require this but
@@ -523,7 +525,7 @@ static void locks_insert_block(struct file_lock *blocker,
BUG_ON(!list_empty(&waiter->fl_block));
list_add_tail(&waiter->fl_block, &blocker->fl_block);
waiter->fl_next = blocker;
-   if (IS_POSIX(blocker))
+   if (IS_POSIX(blocker) && !posix_owner_shared(waiter))
list_add(&waiter->fl_link, &blocked_list);
 }
 
@@ -683,46 +685,79 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
 
 EXPORT_SYMBOL(posix_test_lock);
 
-/* This function tests for deadlock condition before putting a process to
- * sleep. The detection scheme is no longer recursive. Recursive was neat,
- * but dangerous - we risked stack corruption if the lock data was bad, or
- * if the recursion was too deep for any other reason.
- *
- * We rely on the fact that a task can only be on one lock's wait queue
- * at a time. When we find blocked_task on a wait queue we can re-search
- * with blocked_task equal to that queue's owner, until either blocked_task
- * isn't found, or blocked_task is found on a queue owned by my_task.
- *
- * Note: the above assumption may not be true when handling lock requests
- * from a broken NFS client. But broken NFS clients have a lot more to
- * worry about than proper deadlock detection anyway... --okir
- *
- * However, the failure of this assumption (also possible in the case of
- * multiple tasks sharing the same open file table) also means there's no
- * guarantee that the loop below will terminate.  As a hack, we give up
- * after a few iterations.
+/*
+ * Deadlock detection: 
+ *
+ * We attempt to detect deadlocks that are due purely to posix file
+ * locks.
+ *
+ * We assume that a task can be waiting for at most one lock at a time.
+ * So for any acquired lock, the process holding that lock may be
+ * waiting on at most one other lock.  That lock in turns may be held by
+ * someone waiting for at most one other lock.  Given a requested lock
+ * caller_fl which is about to wait for a conflicting lock block_fl, we
+ * follow this chain of waiters to ensure we are not about to create a
+ * cycle.
+ *
+ * Since we do this before we ever put a process to sleep on a lock, we
+ * are ensured that there is never a cycle; that is what guarantees that
+ * the while() loop in posix_locks_deadlock() eventually completes.
+ *
+ * Note: the above assumption may not be true when handling lock
+ * requests from a broken NFS client. It may also fail in the presence
+ * of tasks (such as posix threads) sharing the same open file table.
+ *
+ * We don't necessarily care about returning EDEALK correctly in such
+ * cases, but we do need to avoid cycles in the lock dependency graph in
+ * order to ensure the loop in posix_locks_deadlock eventually
+ * terminates.  To that end, we enforce the assumption above by refusing
+ * to return EDEADLK or add to the list of blocked locks in any case
+ * where a lock owner might be able to block on more than one lock.
  */
 
-#define MAX_DEADLK_ITERATIONS 10
+static int posix_owner_shared(struct file_lock *caller_fl)
+{
+   /*
+* The cal

Re: [PATCH, RESEND] locks: fix possible infinite loop in posix deadlock detection

2007-10-30 Thread Alan Cox

On Tue, 30 Oct 2007 11:20:02 -0400
"J. Bruce Fields" <[EMAIL PROTECTED]> wrote:

> From: J. Bruce Fields <[EMAIL PROTECTED]>
> 
> It's currently possible to send posix_locks_deadlock() into an infinite
> loop (under the BKL).
> 
> For now, fix this just by bailing out after a few iterations.  We may
> want to fix this in a way that better clarifies the semantics of
> deadlock detection.  But that will take more time, and this minimal fix
> is probably adequate for any realistic scenario, and is simple enough to
> be appropriate for applying to stable kernels now.
> 
> Thanks to George Davis for reporting the problem.
> 
> Cc: "George G. Davis" <[EMAIL PROTECTED]>
> Signed-off-by: J. Bruce Fields <[EMAIL PROTECTED]>

Acked-by: Alan Cox <[EMAIL PROTECTED]>

Its a good fix for now and I doubt any real world user has that complex a
locking pattern to break.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC 0/2] readdir() as an inode operation

2007-10-30 Thread Jan Kara

> This is a first try to move readdir() to become an inode operation. This is
> necessary for a VFS implementation of "something like union-mounts" where a
> readdir() needs to read the directory contents of multiple directories.
> Besides that the new interface is no longer giving the struct file to the
> filesystem implementations anymore.
> 
> Comments, please?
  Hmm, are you sure there are no users which keep some per-struct-file
information for directories? File offset is one such obvious thing which
you've handled but actually filesystem with more complicated structure
of directory may remember some hints about where we really are, keep
some readahead information or so...

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names

2007-10-30 Thread Randy Dunlap

On Tue, 30 Oct 2007 09:56:53 + Christoph Hellwig wrote:

> On Sun, Oct 28, 2007 at 08:40:56PM -0400, Erez Zadok wrote:
> > +/**
> > + * vfs_ioctl - call filesystem specific ioctl methods
> > + *
> > + * @filp: [in] open file to invoke ioctl method on
> > + * @cmd:  [in] ioctl command to execute
> > + * @arg:  [in/out] command-specific argument for ioctl
> 
> I've never seen these in/out annotations and doubt they're valid in
> kerneldoc.  Randy?

They are just treated as part of the parameter explanation text.
I don't see any problem with them.

> > + * Invokes filesystem specific ->unlock_ioctl, if one exists; otherwise
> 
> unlocked_ioctl


---
~Randy
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH, RESEND] locks: fix possible infinite loop in posix deadlock detection

2007-10-30 Thread J. Bruce Fields

From: J. Bruce Fields <[EMAIL PROTECTED]>

It's currently possible to send posix_locks_deadlock() into an infinite
loop (under the BKL).

For now, fix this just by bailing out after a few iterations.  We may
want to fix this in a way that better clarifies the semantics of
deadlock detection.  But that will take more time, and this minimal fix
is probably adequate for any realistic scenario, and is simple enough to
be appropriate for applying to stable kernels now.

Thanks to George Davis for reporting the problem.

Cc: "George G. Davis" <[EMAIL PROTECTED]>
Signed-off-by: J. Bruce Fields <[EMAIL PROTECTED]>
---
 fs/locks.c |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

I didn't see objections to this quick fix (just to the followup that
attempts to rip out posix deadlock detection entirely), so I'm
resending with just comment modifications.

I haven't given up on a more comprehensive solution, but I think we
really need to apply some fix now.

--b.

diff --git a/fs/locks.c b/fs/locks.c
index 0127a28..8b8388e 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -696,17 +696,28 @@ EXPORT_SYMBOL(posix_test_lock);
  * Note: the above assumption may not be true when handling lock requests
  * from a broken NFS client. But broken NFS clients have a lot more to
  * worry about than proper deadlock detection anyway... --okir
+ *
+ * However, the failure of this assumption (also possible in the case of
+ * multiple tasks sharing the same open file table) also means there's no
+ * guarantee that the loop below will terminate.  As a hack, we give up
+ * after a few iterations.
  */
+
+#define MAX_DEADLK_ITERATIONS 10
+
 static int posix_locks_deadlock(struct file_lock *caller_fl,
struct file_lock *block_fl)
 {
struct file_lock *fl;
+   int i = 0;
 
 next_task:
if (posix_same_owner(caller_fl, block_fl))
return 1;
list_for_each_entry(fl, &blocked_list, fl_link) {
if (posix_same_owner(fl, block_fl)) {
+   if (i++ > MAX_DEADLK_ITERATIONS)
+   return 0;
fl = fl->fl_next;
block_fl = fl;
goto next_task;
-- 
1.5.3.4.208.gc990

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Jörn Engel

On Tue, 30 October 2007 18:35:08 +0900, Dongjun Shin wrote:
> On 10/30/07, Greg Banks <[EMAIL PROTECTED]> wrote:
> >
> > BIO_HINT_RELEASE
> > The bio's block extent is no longer in use by the filesystem
> > and will not be read in the future.  Any storage used to back
> > the extent may be released without any threat to filesystem
> > or data integrity.
> 
> I'd like to second the proposal, but it would be more useful to bring the hint
> down to the physical devices.

Absolutely.  Logfs would love to have an erase operation for block
devices as well.  However the above doesn't quite match my needs,
because the blocks _will_ be read in the future.

There are two reasons for reading things back later.  The good one is to
determine whether the segment was erased or not.  Reads should return
either valid data or one of (all-0xff, all-0x00, -ESOMETHING).  Having
a dedicated error code would be best.

And getting the device erasesize would be useful as well, for obvious
reasons.

Jörn

-- 
When you close your hand, you own nothing. When you open it up, you
own the whole world.
-- Li Mu Bai in Tiger & Dragon
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Dongjun Shin

On 10/30/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote:
>
> Not sure. Why shouldn't you be able to reorder the hints provided that
> they don't overlap with read/write bios for the same block?
>

You're right. The bios can be reordered if they don't overlap with hint.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Arnd Bergmann

On Tuesday 30 October 2007, Dongjun Shin wrote:
> Anyway, BIO_HINT_RELEASE could destroy the content of the blocks
> after being passed to the device. I think that other bio should not be
> reordered accross that hint (just like barrier).

Not sure. Why shouldn't you be able to reorder the hints provided that
they don't overlap with read/write bios for the same block?

Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Dongjun Shin

On 10/30/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote:
> This make me curious, why would t13 want to invent a new command when
> there is already the erase command from CFA?
>
> It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE
> should probably be mapped to CFA_ERASE (0xc0) on drives that support it:
> http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf
>

I'm not sure about the background.
However, it's definitely a sign that passing the deleted block info
to the flash-based storage is useful.

Anyway, BIO_HINT_RELEASE could destroy the content of the blocks
after being passed to the device. I think that other bio should not be
reordered
accross that hint (just like barrier).

Dongjun
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Arnd Bergmann

On Tuesday 30 October 2007, Dongjun Shin wrote:
> There is an ongoing discussion about adding 'Trim' ATA command for notifying
> the drive about the deleted blocks.
> 
> http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf
> 
> This is especially useful for the storage device like Solid State Drive (SSD).
> 
This make me curious, why would t13 want to invent a new command when
there is already the erase command from CFA? 

It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE
should probably be mapped to CFA_ERASE (0xc0) on drives that support it:
http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf

Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] VFS: factor out three helpers for FIBMAP/FIONBIO/FIOASYNC file ioctls

2007-10-30 Thread Christoph Hellwig

> +static int __ioctl_fibmap(struct file *filp, int __user *p)

I'd say kill the __ prefix for all the functions you're adding.

> +static int __ioctl_fionbio(struct file *filp, unsigned long arg)

> +static int __ioctl_fioasync(unsigned int fd, struct file *filp,
> + unsigned long arg)

For these two I'd add a

void __user *argp = (void __user *)arg;

in the caller and then just pass argp down.  That way we have the cast in
one place.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names

2007-10-30 Thread Christoph Hellwig

On Sun, Oct 28, 2007 at 08:40:56PM -0400, Erez Zadok wrote:
> +/**
> + * vfs_ioctl - call filesystem specific ioctl methods
> + *
> + * @filp: [in] open file to invoke ioctl method on
> + * @cmd:  [in] ioctl command to execute
> + * @arg:  [in/out] command-specific argument for ioctl

I've never seen these in/out annotations and doubt they're valid in
kerneldoc.  Randy?

> + * Invokes filesystem specific ->unlock_ioctl, if one exists; otherwise

unlocked_ioctl

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] VFS: apply coding standards to fs/ioctl.c

2007-10-30 Thread Christoph Hellwig

On Sun, Oct 28, 2007 at 07:57:47PM -0700, Daniel Phillips wrote:
> On 10/28/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> > While you're at it, it's probably worth splitting this out into
> > a small helper function.
> 
> Why? Is the same pattern called from more than one place?

Becauase it's a lot more readable.  ioctl subcommands are invidividual
functionality, and separating them out into small self-contained functions
makes the code a lot more readable.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] VFS: apply coding standards to fs/ioctl.c

2007-10-30 Thread Christoph Hellwig

On Sun, Oct 28, 2007 at 02:05:16PM -0400, Erez Zadok wrote:
> 
> Sure.  I assume you mean an internal function to encapsulate the entire case
> statement's code, one for each of the FIO* cases.

Yes.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Dongjun Shin

On 10/30/07, Greg Banks <[EMAIL PROTECTED]> wrote:
>
> BIO_HINT_RELEASE
> The bio's block extent is no longer in use by the filesystem
> and will not be read in the future.  Any storage used to back
> the extent may be released without any threat to filesystem
> or data integrity.
>

I'd like to second the proposal, but it would be more useful to bring the hint
down to the physical devices.

There is an ongoing discussion about adding 'Trim' ATA command for notifying
the drive about the deleted blocks.

http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf

This is especially useful for the storage device like Solid State Drive (SSD).

Dongjun
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

43 matches

Mail list logo