Re: What to do about subvolumes?

2010-12-08 Thread Andreas Dilger
On 2010-12-08, at 16:07, Neil Brown wrote:
> On Mon, 6 Dec 2010 11:48:45 -0500 "J. Bruce Fields" 
> wrote:
> 
>> On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
>>> Any chance we can add a ->get_fsid(sb, inode) method to
>>> export_operations (or something simiar), that allows the
>>> filesystem to generate an FSID based on the volume and
>>> inode that is being exported?
>> 
>> No objection from here.
> 
> My standard objection here is that you cannot guarantee that the
> fsid is 100% guarantied to be unique across all filesystems in
> the system (including filesystems mounted from dm snapshots of
> filesystems that are currently mounted).  NFSd needs this uniqueness.

Sure, but you also cannot guarantee that the devno is constant across reboots, 
yet NFS continues to use this much-less-constant value...

> This is only really an objection if user-space cannot over-ride
> the fsid provided by the filesystem.

Agreed.  It definitely makes sense to allow this, for whatever strange 
circumstances might arise.  However, defaulting to using the filesystem UUID 
definitely makes the most sense, and looking at the nfs-utils mountd code, it 
seems that this is already standard behaviour for local block devices 
(excluding "btrfs" filesystems).

> I'd be very happy to see an interface to user-space whereby
> user-space can get a reasonably unique fsid for a given
> filesystem.

Hmm, maybe I'm missing something, but why does userspace need to be able to get 
this value?  I would think that nfsd gets it from the filesystem directly in 
the kernel, but if a "uuid=" option is present in the exports file that is 
preferentially used over the value from the filesystem.

That said, I think Aneesh's open_by_handle patchset also made the UUID visible 
in /proc//mountinfo, after the filesystems stored it in
sb->s_uuid at mount time.  That _should_ make it visible for non-block 
mountpoints as well, assuming they fill in s_uuid.

> Whether this is an export_operations method or some field in the
> 'struct super' which gets copied out doesn't matter to me.

Since Aneesh has already developed patches, is there any objection to using 
those (last sent to linux-fsdevel on 2010-10-29):

[PATCH -V22 12/14] vfs: Export file system uuid via /proc//mountinfo
[PATCH -V22 13/14] ext3: Copy fs UUID to superblock.
[PATCH -V22 14/14] ext4: Copy fs UUID to superblock

Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix compile warning in fs/btrfs/inode.c

2010-12-08 Thread liubo
On 12/08/2010 06:01 PM, liubo wrote:
> While compiling btrfs, I got belows:
> 
>   CC [M]  fs/btrfs/inode.o
> fs/btrfs/inode.c: In function ‘btrfs_end_dio_bio’:
> fs/btrfs/inode.c:5720: warning: format ‘%lu’ expects type ‘long unsigned 
> int’, but argument 4 has type ‘sector_t’
>   LD [M]  fs/btrfs/btrfs.o
>   Building modules, stage 2.
>   MODPOST 1 modules
>   LD [M]  fs/btrfs/btrfs.ko
> 
> This fixes the compile warning.
> 

Sorry, plz ignore this.

Have seen someone post patch to fix this.

thanks,
Liu Bo

> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/inode.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0f34cae..eff5aef 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5713,8 +5713,8 @@ static void btrfs_end_dio_bio(struct bio *bio, int err)
>   if (err) {
>   printk(KERN_ERR "btrfs direct IO failed ino %lu rw %lu "
> "disk_bytenr %lu len %u err no %d\n",
> -   dip->inode->i_ino, bio->bi_rw, bio->bi_sector,
> -   bio->bi_size, err);
> +   dip->inode->i_ino, bio->bi_rw,
> +   (unsigned long)bio->bi_sector, bio->bi_size, err);
>   dip->errors = 1;
>  
>   /*

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix compile warning in fs/btrfs/inode.c

2010-12-08 Thread Tsutomu Itoh
(2010/12/08 19:01), liubo wrote:
> While compiling btrfs, I got belows:
> 
>   CC [M]  fs/btrfs/inode.o
> fs/btrfs/inode.c: In function ‘btrfs_end_dio_bio’:
> fs/btrfs/inode.c:5720: warning: format ‘%lu’ expects type ‘long unsigned 
> int’, but argument 4 has type ‘sector_t’
>   LD [M]  fs/btrfs/btrfs.o
>   Building modules, stage 2.
>   MODPOST 1 modules
>   LD [M]  fs/btrfs/btrfs.ko
> 
> This fixes the compile warning.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/inode.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0f34cae..eff5aef 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5713,8 +5713,8 @@ static void btrfs_end_dio_bio(struct bio *bio, int err)
>   if (err) {
>   printk(KERN_ERR "btrfs direct IO failed ino %lu rw %lu "
> "disk_bytenr %lu len %u err no %d\n",
> -   dip->inode->i_ino, bio->bi_rw, bio->bi_sector,
> -   bio->bi_size, err);
> +   dip->inode->i_ino, bio->bi_rw,
> +   (unsigned long)bio->bi_sector, bio->bi_size, err);

sector_t is defined by include/linux/types.h as follows.

  #ifdef CONFIG_LBDAF
  typedef u64 sector_t;
  typedef u64 blkcnt_t;
  #else
  typedef unsigned long sector_t;
  typedef unsigned long blkcnt_t;
  #endif

Therefore, I think that you should change the code as follows. 
 
printk(KERN_ERR "btrfs direct IO failed ino %lu rw %lu "
  "disk_bytenr %llu len %u err no %d\n",
  dip->inode->i_ino, bio->bi_rw,
  (u64)bio->bi_sector, bio->bi_size, err);

>   dip->errors = 1;
>  
>   /*

Regards,
Itoh

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-08 Thread Neil Brown
On Mon, 6 Dec 2010 11:48:45 -0500 "J. Bruce Fields" 
wrote:

> On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
> > On 2010-12-03, at 15:45, J. Bruce Fields wrote:
> > > We're using statfs64.fs_fsid for this; I believe that's both stable
> > > across reboots and distinguishes between subvolumes, so that's OK.
> > > 
> > > (That said, since fs_fsid doesn't work for other filesystems, we depend
> > > on an explicit check for a filesystem type of "btrfs", which is
> > > awful--btrfs won't always be the only filesystem that wants to do this
> > > kind of thing, etc.)
> > 
> > Sigh, I wanted to be able to specify the NFS FSID directly from within the 
> > kernel for Lustre many years already.  Glad to see that this is moving 
> > forward.
> > 
> > Any chance we can add a ->get_fsid(sb, inode) method to export_operations
> > (or something simiar), that allows the filesystem to generate an FSID based 
> > on the volume and inode that is being exported?
> 
> No objection from here.

My standard objection here is that you cannot guarantee that the fsid is 100%
guarantied to be unique across all filesystems in the system (including
filesystems mounted from dm snapshots of filesystems that are currently
mounted).  NFSd needs this uniqueness.

This is only really an objection if user-space cannot over-ride the fsid
provided by the filesystem.

I'd be very happy to see an interface to user-space whereby user-space can
get a reasonably unique fsid for a given filesystem.  Whether this is an
export_operations method or some field in the 'struct super' which gets
copied out doesn't matter to me.

NeilBrown


> 
> (Though I don't understand the inode argument--aren't "subvolumes"
> usually expected to have separate superblocks?)
> 
> --b.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


ENOSPC on heterogeneous raid 0

2010-12-08 Thread William Sheffler
Hello btrfs community.

First off, thanks for all your hard work... I have been following
btrfs with interest for several years now and very much look forward
to the day it replaces ext4. The real killer feature (of btrfs
specifically) for me is the ability to add *and remove* devices from a
filesystem, as this allows rolling upgrades of my server's disks. I
have a 16 port 3ware 1650SE on which I have a number of small raid
units and it will be fantastic to be able to remove the oldest,
upgrade, and add the new storage back. I had previously been using
ZFS, but since ZFS doesn't allow removal of devices, this rolling
upgrade strategy doesn't work.

My question is this: can btrfs handle striping (raid 0) across
heterogeneous devices? I seem to be losing any capacity on the larger
disk beyond what is available on the smaller disk. I really hope there
is some simple fix!

Here is a bit more info:

Upon the release of ubuntu 10.10, I decided to give Btrfs v0.19 a
shot. I created a raid5 on the 3ware card with 4 disks and did:

mkfs.btrfs /dev/sdc -L hugeR1

After moving some data onto the filesystem, I created a mirror on the
3ware card and added it to the fielsystem:

btrfs device add /dev/sdd /huge

then did:

btrfs filesystem balance /huge

and waited a few days. Unfortunately, I am now running to ENOSPC
errors with the filesystem only (in theory?) half full:

 df -h /huge
Filesystem    Size  Used Avail Use% Mounted on
/dev/sdc  5.5T  2.8T  2.8T  50% /huge

btrfs filesystem show
Label: 'hugeR1'  uuid: 543aefd1-ad28-4685-a6fc-
15fbdaa9a591
    Total devices 2 FS bytes used 2.70TB
    devid    1 size 4.09TB used 1.40TB path /dev/sdc
    devid    2 size 1.36TB used 1.31TB path /dev/sdd

I gather btrfs is spreading the data evenly across the two disks, but
in this case, I would like 3/4 of the data on devid 1 and 1/4 on devid
2.

For reference, I am on bunutu 10.10 with kernel 2.6.35-22:

uname -a
Linux huge 2.6.35-22-generic #35-Ubuntu SMP Sat Oct 16 20:36:48 UTC
2010 i686 GNU/Linux



Thank you very much for your help, and please let me know if I can
provide any additional information!

Will Sheffler


--
William H. Sheffler Ph.D.
Senior Fellow
Baker Lab
University of Washington
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-08 Thread Andreas Dilger
On 2010-12-06, at 09:48, J. Bruce Fields wrote:
On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
>> Any chance we can add a ->get_fsid(sb, inode) method to
>> export_operations (or something simiar), that allows the filesystem to
>> generate an FSID based on the volume and inode that is being exported?
> 
> No objection from here.
> 
> (Though I don't understand the inode argument--aren't "subvolumes"
> usually expected to have separate superblocks?)

I thought that if two directories from the same filesystem are both being 
exported at the same time that they would need to have different FSID values, 
hence the inode parameter to allow generating an FSID that is a function of 
both the filesystem (sb) and the directory being exported (inode)?

Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-08 Thread Andreas Dilger
On 2010-12-08, at 10:27, J. Bruce Fields wrote:
> On Wed, Dec 08, 2010 at 10:16:29AM -0700, Andreas Dilger wrote:
>> It looks like mk_fsid() is only actually using the UUID if it is specified 
>> in the /etc/exports file (AFAICS, this depends on ex_uuid being set from a 
>> uuid="..." option).
> 
> No, if you look at the nfs-utils source you'll find mountd sets a uuid
> by default (in utils/mountd/cache.c:uuid_by_path()).

Unfortunately, this only works for block devices, not network filesystems.

>> There was a patch in the open_by_handle() patch series that added an s_uuid 
>> field to the superblock, that could be used if no uuid= option is specified 
>> in the /etc/exports file.
> 
> Agreed that doing this in the kernel would probably be simpler.

Agreed.

Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fsck, parent transid verify failed

2010-12-08 Thread Tommy Jonsson
>> Build the latest tools, then:
>>
>> btrfsck -s 1 /dev/xxx
>> btrfsck -s 2 /dev/xxx
>>
>> If either of these work we have an easy way to get it mounted.  Just let
>> me know.
>>
>> -chris
>>

> $ btrfsck -s 1 /dev/sda
> using SB copy 1, bytenr 67108864
> parent transid verify failed on 2721514774528 wanted 39651 found 39649
> btrfsck: disk-io.c:739: open_ctree_fd: Assertion `!(!tree_root->node)' failed.
> Aborted
>
> $ btrfsck -s 2 /dev/sda
> using SB copy 2, bytenr 274877906944
> parent transid verify failed on 2721514774528 wanted 39651 found 39649
> btrfsck: disk-io.c:739: open_ctree_fd: Assertion `!(!tree_root->node)' failed.
> Aborted
>
> Tried "btrfs-debug-tree /dev/sda" and "btrfs-debug-tree -e /dev/sda" :
> parent transid verify failed on 2721514774528 wanted 39651 found 39649
> btrfs-debug-tree: disk-io.c:739: open_ctree_fd: Assertion
> `!(!tree_root->node)' failed.
>
> dmesg said:
> [268375.903581] device fsid 734a485d12c77872-9b0b5aa408670db4 devid 2 transid 
> 39650 /dev/sdd
> [268375.904241] device fsid 734a485d12c77872-9b0b5aa408670db4 devid 1 transid 
> 39651 /dev/sdc
> [268375.904526] device fsid 734a485d12c77872-9b0b5aa408670db4 devid 3 transid 
> 39651 /dev/sda

Sorry to be a bother, but do you have any other suggestions ?

Thanks!

-tommy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] fs/vfs/security: pass last path component to LSM on inode creation

2010-12-08 Thread Eric Paris
SELinux would like to implement a new labeling behavior of newly created
inodes.  We currently label new inodes based on the parent and the creating
process.  This new behavior would also take into account the name of the
new object when deciding the new label.  This is not the (supposed) full path,
just the last component of the path.

This is very useful because creating /etc/shadow is different than creating
/etc/passwd but the kernel hooks are unable to differentiate these
operations.  We currently require that userspace realize it is doing some
difficult operation like that and than userspace jumps through SELinux hoops
to get things set up correctly.  This patch does not implement new
behavior, that is obviously contained in a seperate SELinux patch, but it
does pass the needed name down to the correct LSM hook.  If no such name
exists it is fine to pass NULL.

Signed-off-by: Eric Paris 
---

 fs/btrfs/inode.c   |   13 +++--
 fs/btrfs/xattr.c   |6 --
 fs/btrfs/xattr.h   |3 ++-
 fs/ext2/ext2.h |2 +-
 fs/ext2/ialloc.c   |5 +++--
 fs/ext2/namei.c|8 
 fs/ext2/xattr.h|6 --
 fs/ext2/xattr_security.c   |5 +++--
 fs/ext3/ialloc.c   |5 +++--
 fs/ext3/namei.c|8 
 fs/ext3/xattr.h|4 ++--
 fs/ext3/xattr_security.c   |5 +++--
 fs/ext4/ialloc.c   |2 +-
 fs/ext4/xattr.h|4 ++--
 fs/ext4/xattr_security.c   |5 +++--
 fs/gfs2/inode.c|7 ---
 fs/jffs2/dir.c |9 -
 fs/jffs2/nodelist.h|2 +-
 fs/jffs2/security.c|5 +++--
 fs/jffs2/write.c   |   18 ++
 fs/jffs2/xattr.h   |5 +++--
 fs/jfs/jfs_xattr.h |5 +++--
 fs/jfs/namei.c |8 
 fs/jfs/xattr.c |6 --
 fs/ocfs2/namei.c   |4 ++--
 fs/ocfs2/refcounttree.c|3 ++-
 fs/ocfs2/xattr.c   |   10 ++
 fs/ocfs2/xattr.h   |4 +++-
 fs/reiserfs/namei.c|9 +
 fs/reiserfs/xattr_security.c   |3 ++-
 fs/xfs/linux-2.6/xfs_iops.c|9 +
 include/linux/ext3_fs.h|3 ++-
 include/linux/reiserfs_xattr.h |2 ++
 include/linux/security.h   |9 +++--
 mm/shmem.c |9 +
 security/capability.c  |3 ++-
 security/security.c|6 --
 security/selinux/hooks.c   |5 +++--
 security/smack/smack_lsm.c |5 -
 39 files changed, 136 insertions(+), 94 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8039390..ffc6e15 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -90,13 +90,14 @@ static noinline int cow_file_range(struct inode *inode,
   unsigned long *nr_written, int unlock);
 
 static int btrfs_init_inode_security(struct btrfs_trans_handle *trans,
-struct inode *inode,  struct inode *dir)
+struct inode *inode,  struct inode *dir,
+const struct qstr *qstr)
 {
int err;
 
err = btrfs_init_acl(trans, inode, dir);
if (!err)
-   err = btrfs_xattr_security_init(trans, inode, dir);
+   err = btrfs_xattr_security_init(trans, inode, dir, qstr);
return err;
 }
 
@@ -4675,7 +4676,7 @@ static int btrfs_mknod(struct inode *dir, struct dentry 
*dentry,
if (IS_ERR(inode))
goto out_unlock;
 
-   err = btrfs_init_inode_security(trans, inode, dir);
+   err = btrfs_init_inode_security(trans, inode, dir, &dentry->d_name);
if (err) {
drop_inode = 1;
goto out_unlock;
@@ -4736,7 +4737,7 @@ static int btrfs_create(struct inode *dir, struct dentry 
*dentry,
if (IS_ERR(inode))
goto out_unlock;
 
-   err = btrfs_init_inode_security(trans, inode, dir);
+   err = btrfs_init_inode_security(trans, inode, dir, &dentry->d_name);
if (err) {
drop_inode = 1;
goto out_unlock;
@@ -4864,7 +4865,7 @@ static int btrfs_mkdir(struct inode *dir, struct dentry 
*dentry, int mode)
 
drop_on_err = 1;
 
-   err = btrfs_init_inode_security(trans, inode, dir);
+   err = btrfs_init_inode_security(trans, inode, dir, &dentry->d_name);
if (err)
goto out_fail;
 
@@ -6940,7 +6941,7 @@ static int btrfs_symlink(struct inode *dir, struct dentry 
*dentry,
if (IS_ERR(inode))
goto out_unlock;
 
-   err = btrfs_init_inode_security(trans, inode, dir);
+   err = btrfs_init_inode_security(trans, inode, dir, &dentry->d_name);
if (err) {
drop_inode = 1;
goto 

[PATCH] Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots

2010-12-08 Thread Josef Bacik
Not being able to delete an orphan item isn't a horrible thing.  The worst that
happens is the next time around we try and do the orphan cleanup and we can't
find the referenced object and just delete the item and move on.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8aed05e..8c26441 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6354,7 +6354,6 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
if (ret > 0) {
ret = btrfs_del_orphan_item(trans, tree_root,
root->root_key.objectid);
-   BUG_ON(ret);
}
}
 
-- 
1.6.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fixup return code for btrfs_del_orphan_item

2010-12-08 Thread Josef Bacik
If the orphan item doesn't exist, we return 1, which doesn't make any sense to
the callers.  Instead return -ENOENT if we didn't find the item.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/orphan.c |6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/orphan.c b/fs/btrfs/orphan.c
index 79cba5f..f8be250 100644
--- a/fs/btrfs/orphan.c
+++ b/fs/btrfs/orphan.c
@@ -56,8 +56,12 @@ int btrfs_del_orphan_item(struct btrfs_trans_handle *trans,
return -ENOMEM;
 
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
-   if (ret)
+   if (ret < 0)
goto out;
+   if (ret) {
+   ret = -ENOENT;
+   goto out;
+   }
 
ret = btrfs_del_item(trans, root, path);
 
-- 
1.6.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-08 Thread J. Bruce Fields
On Wed, Dec 08, 2010 at 10:16:29AM -0700, Andreas Dilger wrote:
> On 2010-12-07, at 10:02, Trond Myklebust wrote:
> 
> > On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote:
> >> It's just as stable as a real dev_t in the times of hotplug and udev.
> >> As long as you don't touch anything including not upgrading the kernel
> >> it's remain stable, otherwise it will break.  That's why modern
> >> nfs-utils default to using the uuid-based filehandle schemes instead of
> >> the dev_t based ones.  At least that's what I told - I really hope it's
> >> using the real UUIDs from the filesystem and not the horrible fsid hack
> >> that was once added - for some filesystems like XFS that field does not
> >> actually have any relation to the UUID historically.  And while we
> >> could have changed that it's too late now that nfs was hacked into
> >> abusing that field.
> > 
> > IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but
> > they won't fit into the NFSv2 32-byte filehandles, so there is an
> > '8-byte fsid' and '4-byte fsid + inode number' workaround for that...
> > 
> > See the mk_fsid() helper in fs/nfsd/nfsfh.h
> 
> It looks like mk_fsid() is only actually using the UUID if it is specified in 
> the /etc/exports file (AFAICS, this depends on ex_uuid being set from a 
> uuid="..." option).

No, if you look at the nfs-utils source you'll find mountd sets a uuid
by default (in utils/mountd/cache.c:uuid_by_path()).

> There was a patch in the open_by_handle() patch series that added an s_uuid 
> field to the superblock, that could be used if no uuid= option is specified 
> in the /etc/exports file.

Agreed that doing this in the kernel would probably be simpler.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-08 Thread Andreas Dilger
On 2010-12-07, at 10:02, Trond Myklebust wrote:

> On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote:
>> It's just as stable as a real dev_t in the times of hotplug and udev.
>> As long as you don't touch anything including not upgrading the kernel
>> it's remain stable, otherwise it will break.  That's why modern
>> nfs-utils default to using the uuid-based filehandle schemes instead of
>> the dev_t based ones.  At least that's what I told - I really hope it's
>> using the real UUIDs from the filesystem and not the horrible fsid hack
>> that was once added - for some filesystems like XFS that field does not
>> actually have any relation to the UUID historically.  And while we
>> could have changed that it's too late now that nfs was hacked into
>> abusing that field.
> 
> IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but
> they won't fit into the NFSv2 32-byte filehandles, so there is an
> '8-byte fsid' and '4-byte fsid + inode number' workaround for that...
> 
> See the mk_fsid() helper in fs/nfsd/nfsfh.h

It looks like mk_fsid() is only actually using the UUID if it is specified in 
the /etc/exports file (AFAICS, this depends on ex_uuid being set from a 
uuid="..." option).

There was a patch in the open_by_handle() patch series that added an s_uuid 
field to the superblock, that could be used if no uuid= option is specified in 
the /etc/exports file.

Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)

2010-12-08 Thread Jon Nelson
On Tue, Dec 7, 2010 at 9:37 PM, Jon Nelson  wrote:
> On Tue, Dec 7, 2010 at 1:35 PM, Ted Ts'o  wrote:
>> On Tue, Dec 07, 2010 at 01:22:43PM -0500, Mike Snitzer wrote:
>>> > 1. create a database (from bash):
>>> >
>>> > createdb test
>>> >
>>> > 2. place the following contents in a file (I used 't.sql'):
>>> >
>>> > begin;
>>> > create temporary table foo as select x as a, ARRAY[x] as b FROM
>>> > generate_series(1, 1000 ) AS x;
>>> > create index foo_a_idx on foo (a);
>>> > create index foo_b_idx on foo USING GIN (b);
>>> > rollback;
>>> >
>>> > 3. execute that sql:
>>> >
>>> > psql -f t.sql --echo-all test
>>> >
>>> > With 2.6.34.7 I can re-run [3] all day long, as many times as I want,
>>> > without issue.
>>> >
>>> > With 2.6.37-rc4-13 (the currently-installed KOTD kernel) if tails
>>> > pretty frequently.
>>
>> So I just tried to reproduce this on an Ubuntu 10.04 system running
>> 2.6.37-rc5 (completely stock except for a few apparmor patches that I
>> needed to keep the apparmor userspace from complaining).  I'm using
>> Postgres 8.4.5-0ubuntu10.04.
>>
>> Using the above procedure, I wasn't able to reproduce.  Then I
>> realized this might have been because I was using an SSD root file
>> system (which is secured using LUKS/dm-crypt, with LVM on top of
>> dm-crypt).  So I mounted a file system on a 5400 rpm SSD disk, which
>> is also protected using LUKS/dm-crypt with LVM on top.  I then
>> executed the PostgresQL commands:
>>
>> CREATE TABLESPACE test LOCATION '/kbuild/postgres';
>> SET default_tablespace = test;
>> COMMIT
>> \quit
>>
>> I then re-ran the above proceduing, and verified that all of the I/O
>> was going to the 5400rpm laptop disk.
>>
>> I then ran the above procedure a half-dozen times, and I still haven't
>> been able to reproduce any Postgresql errors or kernel errors.
>>
>> Jon, can you help me identify what might be different with your run
>> and mine?  What version of Postgres are you using?
>
> One difference is the location of the transaction logs (pg_xlog). In
> my case, /var/lib/pgsql/data *is* mountpoint for the test volume
> (actually, it's a symlink to the mount point). In your case, that is
> not so. Perhaps that makes a difference?  pgsql_tmp might also be on
> two different volumes in your case (I can't be sure).


I grabbed a Kubuntu iso and installed Kubuntu 10.10, and then upgraded
to 'natty', and eventually to 2.6.37-8-generic.

With that install, and postgresql's "data" (/var/lib/postgresql/data)
being located on a LUKS+ext4 volume, I easily observe the behavior.

Does this help?

-- 
Jon
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: do not do fast caching if we are allocating blocks for tree_root

2010-12-08 Thread Josef Bacik
Since the fast caching uses normal tree locking, we can possibly deadlock if we
get to the caching via a btrfs_search_slot() on the tree_root.  So just check to
see if the root we are on is the tree root, and just don't do the fast caching.

Reported-by: Sage Weil 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c |   18 --
 1 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 03f4738..8aed05e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -434,6 +434,7 @@ err:
 
 static int cache_block_group(struct btrfs_block_group_cache *cache,
 struct btrfs_trans_handle *trans,
+struct btrfs_root *root,
 int load_cache_only)
 {
struct btrfs_fs_info *fs_info = cache->fs_info;
@@ -447,9 +448,12 @@ static int cache_block_group(struct 
btrfs_block_group_cache *cache,
 
/*
 * We can't do the read from on-disk cache during a commit since we need
-* to have the normal tree locking.
+* to have the normal tree locking.  Also if we are currently trying to
+* allocate blocks for the tree root we can't do the fast caching since
+* we likely hold important locks.
 */
-   if (!trans->transaction->in_commit) {
+   if (!trans->transaction->in_commit &&
+   (root && root != root->fs_info->tree_root)) {
spin_lock(&cache->lock);
if (cache->cached != BTRFS_CACHE_NO) {
spin_unlock(&cache->lock);
@@ -4089,7 +4093,7 @@ static int update_block_group(struct btrfs_trans_handle 
*trans,
 * space back to the block group, otherwise we will leak space.
 */
if (!alloc && cache->cached == BTRFS_CACHE_NO)
-   cache_block_group(cache, trans, 1);
+   cache_block_group(cache, trans, NULL, 1);
 
byte_in_group = bytenr - cache->key.objectid;
WARN_ON(byte_in_group > cache->key.offset);
@@ -4984,7 +4988,8 @@ have_block_group:
if (unlikely(block_group->cached == BTRFS_CACHE_NO)) {
u64 free_percent;
 
-   ret = cache_block_group(block_group, trans, 1);
+   ret = cache_block_group(block_group, trans,
+   orig_root, 1);
if (block_group->cached == BTRFS_CACHE_FINISHED)
goto have_block_group;
 
@@ -5008,7 +5013,8 @@ have_block_group:
if (loop > LOOP_CACHING_NOWAIT ||
(loop > LOOP_FIND_IDEAL &&
 atomic_read(&space_info->caching_threads) < 2)) {
-   ret = cache_block_group(block_group, trans, 0);
+   ret = cache_block_group(block_group, trans,
+   orig_root, 0);
BUG_ON(ret);
}
found_uncached_bg = true;
@@ -5561,7 +5567,7 @@ int btrfs_alloc_logged_file_extent(struct 
btrfs_trans_handle *trans,
u64 num_bytes = ins->offset;
 
block_group = btrfs_lookup_block_group(root->fs_info, ins->objectid);
-   cache_block_group(block_group, trans, 0);
+   cache_block_group(block_group, trans, NULL, 0);
caching_ctl = get_caching_control(block_group);
 
if (!caching_ctl) {
-- 
1.6.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs subvolume snapshot hung in btrfs_commit_transaction

2010-12-08 Thread Ian! D. Allen
I've been exercising btrfs doing a continuous loop of:

- delete an old snapshot to keep disk space about the same
- create snapshot from previous snapshot
- rsync root into new snapshot

I have room for 150 snapshots on disk.  I delete the oldest, create
the newest, do the rsync into the newest, repeat.  It hung today on
snapshot 564:

$ ps uww 24575
USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
root 24575  0.0  0.0   6224   332 pts/10   DN   07:35   0:00 btrfs 
subvolume snapshot /mnt/sde1/snap564 /mnt/sde1/snap565

$ ps lww 24575
F   UID   PID  PPID PRI  NIVSZ   RSS WCHAN  STAT TTYTIME COMMAND
4 0 24575 27716  35   -   6224   332 btrfs_ DN   pts/10 0:00 btrfs 
subvolume snapshot /mnt/sde1/snap564 /mnt/sde1/snap565

$ ps -o wchan 24575
WCHAN
btrfs_commit_transaction

No messages in "dmesg" or kernel log.  Anyone want me to run some other
debug tests to find out what is wrong?  Anything that tries to access
anything inside the btrfs file system /dev/sde1 hangs uninterruptably:

1 0  1863 2  20   0  0 0 wait_f D?  0:29 
[btrfs-transacti]
4 0  4933  4925  20   0  26524  2864 lookup D+   pts/10 0:02 /bin/bash
1   777 27995  7318  20   0  26576  1784 vfs_re D+   pts/52 0:00 bash
0   777 29395  7284  20   0  21856   688 vfs_re Dpts/51 0:00 ls -abp 
--color=auto /mnt/sde1
0   777 29510  7284  20   0  21856   692 vfs_re Dpts/51 0:00 /bin/ls 
/mnt/sde1

$ ps -o wchan 1863
WCHAN
wait_for_commit

$ ps -o wchan 27995
WCHAN
vfs_readdir

-- 
| Ian! D. Allen  -  idal...@idallen.ca  -  Ottawa, Ontario, Canada
| Home Page: http://idallen.com/   Contact Improv: http://contactimprov.ca/
| College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/
| Defend digital freedom:  http://eff.org/  and have fun:  http://fools.ca/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)

2010-12-08 Thread Chris Mason
Excerpts from Jon Nelson's message of 2010-12-07 22:29:26 -0500:
> On Tue, Dec 7, 2010 at 3:02 PM, Chris Mason  wrote:
> > Excerpts from Jon Nelson's message of 2010-12-07 15:48:58 -0500:
> >> On Tue, Dec 7, 2010 at 2:41 PM, Chris Mason  wrote:
> >> > Excerpts from Jon Nelson's message of 2010-12-07 15:25:47 -0500:
> >> >> On Tue, Dec 7, 2010 at 2:02 PM, Chris Mason  
> >> >> wrote:
> >> >> > Excerpts from Jon Nelson's message of 2010-12-07 14:34:40 -0500:
> >> >> >> On Tue, Dec 7, 2010 at 12:52 PM, Chris Mason 
> >> >> >>  wrote:
> >> >> >> >> postgresql errors. Typically, header corruption but from the 
> >> >> >> >> limited
> >> >> >> >> visibility I've had into this via strace, what I see is zeroed 
> >> >> >> >> pages
> >> >> >> >> where there shouldn't be.
> >> >> >> >
> >> >> >> > This sounds a lot like a bug higher up than dm-crypt.  Zeros tend 
> >> >> >> > to
> >> >> >> > come from some piece of code explicitly filling a page with zeros, 
> >> >> >> > and
> >> >> >> > that often happens in the corner cases for O_DIRECT and a few other
> >> >> >> > places in the filesystem.
> >> >> >> >
> >> >> >> > Have you tried triggering this with a regular block device?
> >> >> >>
> >> >> >> I just tried the whole set of tests, but with /dev/sdb directly (as
> >> >> >> ext4) without any crypt-y bits.
> >> >> >> It takes more iterations but out of 6 tests I had one failure: same
> >> >> >> type of thing, 'invalid page header in block '.
> >> >> >>
> >> >> >> I can't guarantee that it is a full-page of zeroes, just what I saw
> >> >> >> from the (limited) stracing I did.
> >> >> >
> >> >> > Fantastic. Now for our usual suspects:
> 
> Maybe not so fantastic. I kept testing and had no more failures. At
> all. After 40+ iterations I gave up.
> I went back to trying ext4 on a LUKS volume. The 'hit' ratio went to
> something like 1 in 3, or better.
> 
> I will continue to do testing with and without LUKS. I did /not/
> reboot between tests, but I do start with a fresh postgres database.
> 

Once we trigger once without dm-crypt, dm-crypt is off the hook.  Just
to verify, when you say without luks, you mean without any crypto bits
in use at all on the filesystems postgres uses?

Usually the trick to reproducing filesystem corruptions is adding memory
pressure.  The corruption is probably a bad interaction between reads
and writes, and we need to make sure the reads actually happen.

http://oss.oracle.com/~mason/pin_ram.c

gcc -Wall -o pin_ram pin_ram.c

pin_ram -m 80%-of-your-ram-in-mb

The idea is to trigger constant reads without having to swap heavily.
80% might be too much.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix compile warning in fs/btrfs/inode.c

2010-12-08 Thread liubo
While compiling btrfs, I got belows:

  CC [M]  fs/btrfs/inode.o
fs/btrfs/inode.c: In function ‘btrfs_end_dio_bio’:
fs/btrfs/inode.c:5720: warning: format ‘%lu’ expects type ‘long unsigned int’, 
but argument 4 has type ‘sector_t’
  LD [M]  fs/btrfs/btrfs.o
  Building modules, stage 2.
  MODPOST 1 modules
  LD [M]  fs/btrfs/btrfs.ko

This fixes the compile warning.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0f34cae..eff5aef 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5713,8 +5713,8 @@ static void btrfs_end_dio_bio(struct bio *bio, int err)
if (err) {
printk(KERN_ERR "btrfs direct IO failed ino %lu rw %lu "
  "disk_bytenr %lu len %u err no %d\n",
- dip->inode->i_ino, bio->bi_rw, bio->bi_sector,
- bio->bi_size, err);
+ dip->inode->i_ino, bio->bi_rw,
+ (unsigned long)bio->bi_sector, bio->bi_size, err);
dip->errors = 1;
 
/*
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hunt for 2.6.37 dm-crypt+ext4 corruption?

2010-12-08 Thread Milan Broz

On 12/08/2010 04:29 AM, Jon Nelson wrote:
> Maybe not so fantastic. I kept testing and had no more failures. At
> all. After 40+ iterations I gave up.
> I went back to trying ext4 on a LUKS volume. The 'hit' ratio went to
> something like 1 in 3, or better.

Encryption usually propagates bit corruption (not sure if it is
in this case). But in principle if there is one bit corrupted,
after decryption the whole sector is corrupted.
(That's why bit media errors have usually more serious impact
with FDE.)

Isn't there random noise instead of zeroes when reading
sparse files?
We should probably write some test focusing on sparse files
handling here...

Milan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html