Re: What to do about subvolumes?
On 2010-12-08, at 16:07, Neil Brown wrote: > On Mon, 6 Dec 2010 11:48:45 -0500 "J. Bruce Fields" > wrote: > >> On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote: >>> Any chance we can add a ->get_fsid(sb, inode) method to >>> export_operations (or something simiar), that allows the >>> filesystem to generate an FSID based on the volume and >>> inode that is being exported? >> >> No objection from here. > > My standard objection here is that you cannot guarantee that the > fsid is 100% guarantied to be unique across all filesystems in > the system (including filesystems mounted from dm snapshots of > filesystems that are currently mounted). NFSd needs this uniqueness. Sure, but you also cannot guarantee that the devno is constant across reboots, yet NFS continues to use this much-less-constant value... > This is only really an objection if user-space cannot over-ride > the fsid provided by the filesystem. Agreed. It definitely makes sense to allow this, for whatever strange circumstances might arise. However, defaulting to using the filesystem UUID definitely makes the most sense, and looking at the nfs-utils mountd code, it seems that this is already standard behaviour for local block devices (excluding "btrfs" filesystems). > I'd be very happy to see an interface to user-space whereby > user-space can get a reasonably unique fsid for a given > filesystem. Hmm, maybe I'm missing something, but why does userspace need to be able to get this value? I would think that nfsd gets it from the filesystem directly in the kernel, but if a "uuid=" option is present in the exports file that is preferentially used over the value from the filesystem. That said, I think Aneesh's open_by_handle patchset also made the UUID visible in /proc//mountinfo, after the filesystems stored it in sb->s_uuid at mount time. That _should_ make it visible for non-block mountpoints as well, assuming they fill in s_uuid. > Whether this is an export_operations method or some field in the > 'struct super' which gets copied out doesn't matter to me. Since Aneesh has already developed patches, is there any objection to using those (last sent to linux-fsdevel on 2010-10-29): [PATCH -V22 12/14] vfs: Export file system uuid via /proc//mountinfo [PATCH -V22 13/14] ext3: Copy fs UUID to superblock. [PATCH -V22 14/14] ext4: Copy fs UUID to superblock Cheers, Andreas -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix compile warning in fs/btrfs/inode.c
On 12/08/2010 06:01 PM, liubo wrote: > While compiling btrfs, I got belows: > > CC [M] fs/btrfs/inode.o > fs/btrfs/inode.c: In function ‘btrfs_end_dio_bio’: > fs/btrfs/inode.c:5720: warning: format ‘%lu’ expects type ‘long unsigned > int’, but argument 4 has type ‘sector_t’ > LD [M] fs/btrfs/btrfs.o > Building modules, stage 2. > MODPOST 1 modules > LD [M] fs/btrfs/btrfs.ko > > This fixes the compile warning. > Sorry, plz ignore this. Have seen someone post patch to fix this. thanks, Liu Bo > Signed-off-by: Liu Bo > --- > fs/btrfs/inode.c |4 ++-- > 1 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 0f34cae..eff5aef 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -5713,8 +5713,8 @@ static void btrfs_end_dio_bio(struct bio *bio, int err) > if (err) { > printk(KERN_ERR "btrfs direct IO failed ino %lu rw %lu " > "disk_bytenr %lu len %u err no %d\n", > - dip->inode->i_ino, bio->bi_rw, bio->bi_sector, > - bio->bi_size, err); > + dip->inode->i_ino, bio->bi_rw, > + (unsigned long)bio->bi_sector, bio->bi_size, err); > dip->errors = 1; > > /* -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix compile warning in fs/btrfs/inode.c
(2010/12/08 19:01), liubo wrote: > While compiling btrfs, I got belows: > > CC [M] fs/btrfs/inode.o > fs/btrfs/inode.c: In function ‘btrfs_end_dio_bio’: > fs/btrfs/inode.c:5720: warning: format ‘%lu’ expects type ‘long unsigned > int’, but argument 4 has type ‘sector_t’ > LD [M] fs/btrfs/btrfs.o > Building modules, stage 2. > MODPOST 1 modules > LD [M] fs/btrfs/btrfs.ko > > This fixes the compile warning. > > Signed-off-by: Liu Bo > --- > fs/btrfs/inode.c |4 ++-- > 1 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 0f34cae..eff5aef 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -5713,8 +5713,8 @@ static void btrfs_end_dio_bio(struct bio *bio, int err) > if (err) { > printk(KERN_ERR "btrfs direct IO failed ino %lu rw %lu " > "disk_bytenr %lu len %u err no %d\n", > - dip->inode->i_ino, bio->bi_rw, bio->bi_sector, > - bio->bi_size, err); > + dip->inode->i_ino, bio->bi_rw, > + (unsigned long)bio->bi_sector, bio->bi_size, err); sector_t is defined by include/linux/types.h as follows. #ifdef CONFIG_LBDAF typedef u64 sector_t; typedef u64 blkcnt_t; #else typedef unsigned long sector_t; typedef unsigned long blkcnt_t; #endif Therefore, I think that you should change the code as follows. printk(KERN_ERR "btrfs direct IO failed ino %lu rw %lu " "disk_bytenr %llu len %u err no %d\n", dip->inode->i_ino, bio->bi_rw, (u64)bio->bi_sector, bio->bi_size, err); > dip->errors = 1; > > /* Regards, Itoh -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What to do about subvolumes?
On Mon, 6 Dec 2010 11:48:45 -0500 "J. Bruce Fields" wrote: > On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote: > > On 2010-12-03, at 15:45, J. Bruce Fields wrote: > > > We're using statfs64.fs_fsid for this; I believe that's both stable > > > across reboots and distinguishes between subvolumes, so that's OK. > > > > > > (That said, since fs_fsid doesn't work for other filesystems, we depend > > > on an explicit check for a filesystem type of "btrfs", which is > > > awful--btrfs won't always be the only filesystem that wants to do this > > > kind of thing, etc.) > > > > Sigh, I wanted to be able to specify the NFS FSID directly from within the > > kernel for Lustre many years already. Glad to see that this is moving > > forward. > > > > Any chance we can add a ->get_fsid(sb, inode) method to export_operations > > (or something simiar), that allows the filesystem to generate an FSID based > > on the volume and inode that is being exported? > > No objection from here. My standard objection here is that you cannot guarantee that the fsid is 100% guarantied to be unique across all filesystems in the system (including filesystems mounted from dm snapshots of filesystems that are currently mounted). NFSd needs this uniqueness. This is only really an objection if user-space cannot over-ride the fsid provided by the filesystem. I'd be very happy to see an interface to user-space whereby user-space can get a reasonably unique fsid for a given filesystem. Whether this is an export_operations method or some field in the 'struct super' which gets copied out doesn't matter to me. NeilBrown > > (Though I don't understand the inode argument--aren't "subvolumes" > usually expected to have separate superblocks?) > > --b. > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
ENOSPC on heterogeneous raid 0
Hello btrfs community. First off, thanks for all your hard work... I have been following btrfs with interest for several years now and very much look forward to the day it replaces ext4. The real killer feature (of btrfs specifically) for me is the ability to add *and remove* devices from a filesystem, as this allows rolling upgrades of my server's disks. I have a 16 port 3ware 1650SE on which I have a number of small raid units and it will be fantastic to be able to remove the oldest, upgrade, and add the new storage back. I had previously been using ZFS, but since ZFS doesn't allow removal of devices, this rolling upgrade strategy doesn't work. My question is this: can btrfs handle striping (raid 0) across heterogeneous devices? I seem to be losing any capacity on the larger disk beyond what is available on the smaller disk. I really hope there is some simple fix! Here is a bit more info: Upon the release of ubuntu 10.10, I decided to give Btrfs v0.19 a shot. I created a raid5 on the 3ware card with 4 disks and did: mkfs.btrfs /dev/sdc -L hugeR1 After moving some data onto the filesystem, I created a mirror on the 3ware card and added it to the fielsystem: btrfs device add /dev/sdd /huge then did: btrfs filesystem balance /huge and waited a few days. Unfortunately, I am now running to ENOSPC errors with the filesystem only (in theory?) half full: df -h /huge Filesystem Size Used Avail Use% Mounted on /dev/sdc 5.5T 2.8T 2.8T 50% /huge btrfs filesystem show Label: 'hugeR1' uuid: 543aefd1-ad28-4685-a6fc- 15fbdaa9a591 Total devices 2 FS bytes used 2.70TB devid 1 size 4.09TB used 1.40TB path /dev/sdc devid 2 size 1.36TB used 1.31TB path /dev/sdd I gather btrfs is spreading the data evenly across the two disks, but in this case, I would like 3/4 of the data on devid 1 and 1/4 on devid 2. For reference, I am on bunutu 10.10 with kernel 2.6.35-22: uname -a Linux huge 2.6.35-22-generic #35-Ubuntu SMP Sat Oct 16 20:36:48 UTC 2010 i686 GNU/Linux Thank you very much for your help, and please let me know if I can provide any additional information! Will Sheffler -- William H. Sheffler Ph.D. Senior Fellow Baker Lab University of Washington -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What to do about subvolumes?
On 2010-12-06, at 09:48, J. Bruce Fields wrote: On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote: >> Any chance we can add a ->get_fsid(sb, inode) method to >> export_operations (or something simiar), that allows the filesystem to >> generate an FSID based on the volume and inode that is being exported? > > No objection from here. > > (Though I don't understand the inode argument--aren't "subvolumes" > usually expected to have separate superblocks?) I thought that if two directories from the same filesystem are both being exported at the same time that they would need to have different FSID values, hence the inode parameter to allow generating an FSID that is a function of both the filesystem (sb) and the directory being exported (inode)? Cheers, Andreas -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What to do about subvolumes?
On 2010-12-08, at 10:27, J. Bruce Fields wrote: > On Wed, Dec 08, 2010 at 10:16:29AM -0700, Andreas Dilger wrote: >> It looks like mk_fsid() is only actually using the UUID if it is specified >> in the /etc/exports file (AFAICS, this depends on ex_uuid being set from a >> uuid="..." option). > > No, if you look at the nfs-utils source you'll find mountd sets a uuid > by default (in utils/mountd/cache.c:uuid_by_path()). Unfortunately, this only works for block devices, not network filesystems. >> There was a patch in the open_by_handle() patch series that added an s_uuid >> field to the superblock, that could be used if no uuid= option is specified >> in the /etc/exports file. > > Agreed that doing this in the kernel would probably be simpler. Agreed. Cheers, Andreas -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fsck, parent transid verify failed
>> Build the latest tools, then: >> >> btrfsck -s 1 /dev/xxx >> btrfsck -s 2 /dev/xxx >> >> If either of these work we have an easy way to get it mounted. Just let >> me know. >> >> -chris >> > $ btrfsck -s 1 /dev/sda > using SB copy 1, bytenr 67108864 > parent transid verify failed on 2721514774528 wanted 39651 found 39649 > btrfsck: disk-io.c:739: open_ctree_fd: Assertion `!(!tree_root->node)' failed. > Aborted > > $ btrfsck -s 2 /dev/sda > using SB copy 2, bytenr 274877906944 > parent transid verify failed on 2721514774528 wanted 39651 found 39649 > btrfsck: disk-io.c:739: open_ctree_fd: Assertion `!(!tree_root->node)' failed. > Aborted > > Tried "btrfs-debug-tree /dev/sda" and "btrfs-debug-tree -e /dev/sda" : > parent transid verify failed on 2721514774528 wanted 39651 found 39649 > btrfs-debug-tree: disk-io.c:739: open_ctree_fd: Assertion > `!(!tree_root->node)' failed. > > dmesg said: > [268375.903581] device fsid 734a485d12c77872-9b0b5aa408670db4 devid 2 transid > 39650 /dev/sdd > [268375.904241] device fsid 734a485d12c77872-9b0b5aa408670db4 devid 1 transid > 39651 /dev/sdc > [268375.904526] device fsid 734a485d12c77872-9b0b5aa408670db4 devid 3 transid > 39651 /dev/sda Sorry to be a bother, but do you have any other suggestions ? Thanks! -tommy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] fs/vfs/security: pass last path component to LSM on inode creation
SELinux would like to implement a new labeling behavior of newly created inodes. We currently label new inodes based on the parent and the creating process. This new behavior would also take into account the name of the new object when deciding the new label. This is not the (supposed) full path, just the last component of the path. This is very useful because creating /etc/shadow is different than creating /etc/passwd but the kernel hooks are unable to differentiate these operations. We currently require that userspace realize it is doing some difficult operation like that and than userspace jumps through SELinux hoops to get things set up correctly. This patch does not implement new behavior, that is obviously contained in a seperate SELinux patch, but it does pass the needed name down to the correct LSM hook. If no such name exists it is fine to pass NULL. Signed-off-by: Eric Paris --- fs/btrfs/inode.c | 13 +++-- fs/btrfs/xattr.c |6 -- fs/btrfs/xattr.h |3 ++- fs/ext2/ext2.h |2 +- fs/ext2/ialloc.c |5 +++-- fs/ext2/namei.c|8 fs/ext2/xattr.h|6 -- fs/ext2/xattr_security.c |5 +++-- fs/ext3/ialloc.c |5 +++-- fs/ext3/namei.c|8 fs/ext3/xattr.h|4 ++-- fs/ext3/xattr_security.c |5 +++-- fs/ext4/ialloc.c |2 +- fs/ext4/xattr.h|4 ++-- fs/ext4/xattr_security.c |5 +++-- fs/gfs2/inode.c|7 --- fs/jffs2/dir.c |9 - fs/jffs2/nodelist.h|2 +- fs/jffs2/security.c|5 +++-- fs/jffs2/write.c | 18 ++ fs/jffs2/xattr.h |5 +++-- fs/jfs/jfs_xattr.h |5 +++-- fs/jfs/namei.c |8 fs/jfs/xattr.c |6 -- fs/ocfs2/namei.c |4 ++-- fs/ocfs2/refcounttree.c|3 ++- fs/ocfs2/xattr.c | 10 ++ fs/ocfs2/xattr.h |4 +++- fs/reiserfs/namei.c|9 + fs/reiserfs/xattr_security.c |3 ++- fs/xfs/linux-2.6/xfs_iops.c|9 + include/linux/ext3_fs.h|3 ++- include/linux/reiserfs_xattr.h |2 ++ include/linux/security.h |9 +++-- mm/shmem.c |9 + security/capability.c |3 ++- security/security.c|6 -- security/selinux/hooks.c |5 +++-- security/smack/smack_lsm.c |5 - 39 files changed, 136 insertions(+), 94 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 8039390..ffc6e15 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -90,13 +90,14 @@ static noinline int cow_file_range(struct inode *inode, unsigned long *nr_written, int unlock); static int btrfs_init_inode_security(struct btrfs_trans_handle *trans, -struct inode *inode, struct inode *dir) +struct inode *inode, struct inode *dir, +const struct qstr *qstr) { int err; err = btrfs_init_acl(trans, inode, dir); if (!err) - err = btrfs_xattr_security_init(trans, inode, dir); + err = btrfs_xattr_security_init(trans, inode, dir, qstr); return err; } @@ -4675,7 +4676,7 @@ static int btrfs_mknod(struct inode *dir, struct dentry *dentry, if (IS_ERR(inode)) goto out_unlock; - err = btrfs_init_inode_security(trans, inode, dir); + err = btrfs_init_inode_security(trans, inode, dir, &dentry->d_name); if (err) { drop_inode = 1; goto out_unlock; @@ -4736,7 +4737,7 @@ static int btrfs_create(struct inode *dir, struct dentry *dentry, if (IS_ERR(inode)) goto out_unlock; - err = btrfs_init_inode_security(trans, inode, dir); + err = btrfs_init_inode_security(trans, inode, dir, &dentry->d_name); if (err) { drop_inode = 1; goto out_unlock; @@ -4864,7 +4865,7 @@ static int btrfs_mkdir(struct inode *dir, struct dentry *dentry, int mode) drop_on_err = 1; - err = btrfs_init_inode_security(trans, inode, dir); + err = btrfs_init_inode_security(trans, inode, dir, &dentry->d_name); if (err) goto out_fail; @@ -6940,7 +6941,7 @@ static int btrfs_symlink(struct inode *dir, struct dentry *dentry, if (IS_ERR(inode)) goto out_unlock; - err = btrfs_init_inode_security(trans, inode, dir); + err = btrfs_init_inode_security(trans, inode, dir, &dentry->d_name); if (err) { drop_inode = 1; goto
[PATCH] Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
Not being able to delete an orphan item isn't a horrible thing. The worst that happens is the next time around we try and do the orphan cleanup and we can't find the referenced object and just delete the item and move on. Thanks, Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c |1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 8aed05e..8c26441 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -6354,7 +6354,6 @@ int btrfs_drop_snapshot(struct btrfs_root *root, if (ret > 0) { ret = btrfs_del_orphan_item(trans, tree_root, root->root_key.objectid); - BUG_ON(ret); } } -- 1.6.6.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: fixup return code for btrfs_del_orphan_item
If the orphan item doesn't exist, we return 1, which doesn't make any sense to the callers. Instead return -ENOENT if we didn't find the item. Thanks, Signed-off-by: Josef Bacik --- fs/btrfs/orphan.c |6 +- 1 files changed, 5 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/orphan.c b/fs/btrfs/orphan.c index 79cba5f..f8be250 100644 --- a/fs/btrfs/orphan.c +++ b/fs/btrfs/orphan.c @@ -56,8 +56,12 @@ int btrfs_del_orphan_item(struct btrfs_trans_handle *trans, return -ENOMEM; ret = btrfs_search_slot(trans, root, &key, path, -1, 1); - if (ret) + if (ret < 0) goto out; + if (ret) { + ret = -ENOENT; + goto out; + } ret = btrfs_del_item(trans, root, path); -- 1.6.6.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What to do about subvolumes?
On Wed, Dec 08, 2010 at 10:16:29AM -0700, Andreas Dilger wrote: > On 2010-12-07, at 10:02, Trond Myklebust wrote: > > > On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote: > >> It's just as stable as a real dev_t in the times of hotplug and udev. > >> As long as you don't touch anything including not upgrading the kernel > >> it's remain stable, otherwise it will break. That's why modern > >> nfs-utils default to using the uuid-based filehandle schemes instead of > >> the dev_t based ones. At least that's what I told - I really hope it's > >> using the real UUIDs from the filesystem and not the horrible fsid hack > >> that was once added - for some filesystems like XFS that field does not > >> actually have any relation to the UUID historically. And while we > >> could have changed that it's too late now that nfs was hacked into > >> abusing that field. > > > > IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but > > they won't fit into the NFSv2 32-byte filehandles, so there is an > > '8-byte fsid' and '4-byte fsid + inode number' workaround for that... > > > > See the mk_fsid() helper in fs/nfsd/nfsfh.h > > It looks like mk_fsid() is only actually using the UUID if it is specified in > the /etc/exports file (AFAICS, this depends on ex_uuid being set from a > uuid="..." option). No, if you look at the nfs-utils source you'll find mountd sets a uuid by default (in utils/mountd/cache.c:uuid_by_path()). > There was a patch in the open_by_handle() patch series that added an s_uuid > field to the superblock, that could be used if no uuid= option is specified > in the /etc/exports file. Agreed that doing this in the kernel would probably be simpler. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What to do about subvolumes?
On 2010-12-07, at 10:02, Trond Myklebust wrote: > On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote: >> It's just as stable as a real dev_t in the times of hotplug and udev. >> As long as you don't touch anything including not upgrading the kernel >> it's remain stable, otherwise it will break. That's why modern >> nfs-utils default to using the uuid-based filehandle schemes instead of >> the dev_t based ones. At least that's what I told - I really hope it's >> using the real UUIDs from the filesystem and not the horrible fsid hack >> that was once added - for some filesystems like XFS that field does not >> actually have any relation to the UUID historically. And while we >> could have changed that it's too late now that nfs was hacked into >> abusing that field. > > IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but > they won't fit into the NFSv2 32-byte filehandles, so there is an > '8-byte fsid' and '4-byte fsid + inode number' workaround for that... > > See the mk_fsid() helper in fs/nfsd/nfsfh.h It looks like mk_fsid() is only actually using the UUID if it is specified in the /etc/exports file (AFAICS, this depends on ex_uuid being set from a uuid="..." option). There was a patch in the open_by_handle() patch series that added an s_uuid field to the superblock, that could be used if no uuid= option is specified in the /etc/exports file. Cheers, Andreas -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)
On Tue, Dec 7, 2010 at 9:37 PM, Jon Nelson wrote: > On Tue, Dec 7, 2010 at 1:35 PM, Ted Ts'o wrote: >> On Tue, Dec 07, 2010 at 01:22:43PM -0500, Mike Snitzer wrote: >>> > 1. create a database (from bash): >>> > >>> > createdb test >>> > >>> > 2. place the following contents in a file (I used 't.sql'): >>> > >>> > begin; >>> > create temporary table foo as select x as a, ARRAY[x] as b FROM >>> > generate_series(1, 1000 ) AS x; >>> > create index foo_a_idx on foo (a); >>> > create index foo_b_idx on foo USING GIN (b); >>> > rollback; >>> > >>> > 3. execute that sql: >>> > >>> > psql -f t.sql --echo-all test >>> > >>> > With 2.6.34.7 I can re-run [3] all day long, as many times as I want, >>> > without issue. >>> > >>> > With 2.6.37-rc4-13 (the currently-installed KOTD kernel) if tails >>> > pretty frequently. >> >> So I just tried to reproduce this on an Ubuntu 10.04 system running >> 2.6.37-rc5 (completely stock except for a few apparmor patches that I >> needed to keep the apparmor userspace from complaining). I'm using >> Postgres 8.4.5-0ubuntu10.04. >> >> Using the above procedure, I wasn't able to reproduce. Then I >> realized this might have been because I was using an SSD root file >> system (which is secured using LUKS/dm-crypt, with LVM on top of >> dm-crypt). So I mounted a file system on a 5400 rpm SSD disk, which >> is also protected using LUKS/dm-crypt with LVM on top. I then >> executed the PostgresQL commands: >> >> CREATE TABLESPACE test LOCATION '/kbuild/postgres'; >> SET default_tablespace = test; >> COMMIT >> \quit >> >> I then re-ran the above proceduing, and verified that all of the I/O >> was going to the 5400rpm laptop disk. >> >> I then ran the above procedure a half-dozen times, and I still haven't >> been able to reproduce any Postgresql errors or kernel errors. >> >> Jon, can you help me identify what might be different with your run >> and mine? What version of Postgres are you using? > > One difference is the location of the transaction logs (pg_xlog). In > my case, /var/lib/pgsql/data *is* mountpoint for the test volume > (actually, it's a symlink to the mount point). In your case, that is > not so. Perhaps that makes a difference? pgsql_tmp might also be on > two different volumes in your case (I can't be sure). I grabbed a Kubuntu iso and installed Kubuntu 10.10, and then upgraded to 'natty', and eventually to 2.6.37-8-generic. With that install, and postgresql's "data" (/var/lib/postgresql/data) being located on a LUKS+ext4 volume, I easily observe the behavior. Does this help? -- Jon -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: do not do fast caching if we are allocating blocks for tree_root
Since the fast caching uses normal tree locking, we can possibly deadlock if we get to the caching via a btrfs_search_slot() on the tree_root. So just check to see if the root we are on is the tree root, and just don't do the fast caching. Reported-by: Sage Weil Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 18 -- 1 files changed, 12 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 03f4738..8aed05e 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -434,6 +434,7 @@ err: static int cache_block_group(struct btrfs_block_group_cache *cache, struct btrfs_trans_handle *trans, +struct btrfs_root *root, int load_cache_only) { struct btrfs_fs_info *fs_info = cache->fs_info; @@ -447,9 +448,12 @@ static int cache_block_group(struct btrfs_block_group_cache *cache, /* * We can't do the read from on-disk cache during a commit since we need -* to have the normal tree locking. +* to have the normal tree locking. Also if we are currently trying to +* allocate blocks for the tree root we can't do the fast caching since +* we likely hold important locks. */ - if (!trans->transaction->in_commit) { + if (!trans->transaction->in_commit && + (root && root != root->fs_info->tree_root)) { spin_lock(&cache->lock); if (cache->cached != BTRFS_CACHE_NO) { spin_unlock(&cache->lock); @@ -4089,7 +4093,7 @@ static int update_block_group(struct btrfs_trans_handle *trans, * space back to the block group, otherwise we will leak space. */ if (!alloc && cache->cached == BTRFS_CACHE_NO) - cache_block_group(cache, trans, 1); + cache_block_group(cache, trans, NULL, 1); byte_in_group = bytenr - cache->key.objectid; WARN_ON(byte_in_group > cache->key.offset); @@ -4984,7 +4988,8 @@ have_block_group: if (unlikely(block_group->cached == BTRFS_CACHE_NO)) { u64 free_percent; - ret = cache_block_group(block_group, trans, 1); + ret = cache_block_group(block_group, trans, + orig_root, 1); if (block_group->cached == BTRFS_CACHE_FINISHED) goto have_block_group; @@ -5008,7 +5013,8 @@ have_block_group: if (loop > LOOP_CACHING_NOWAIT || (loop > LOOP_FIND_IDEAL && atomic_read(&space_info->caching_threads) < 2)) { - ret = cache_block_group(block_group, trans, 0); + ret = cache_block_group(block_group, trans, + orig_root, 0); BUG_ON(ret); } found_uncached_bg = true; @@ -5561,7 +5567,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans, u64 num_bytes = ins->offset; block_group = btrfs_lookup_block_group(root->fs_info, ins->objectid); - cache_block_group(block_group, trans, 0); + cache_block_group(block_group, trans, NULL, 0); caching_ctl = get_caching_control(block_group); if (!caching_ctl) { -- 1.6.6.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
btrfs subvolume snapshot hung in btrfs_commit_transaction
I've been exercising btrfs doing a continuous loop of: - delete an old snapshot to keep disk space about the same - create snapshot from previous snapshot - rsync root into new snapshot I have room for 150 snapshots on disk. I delete the oldest, create the newest, do the rsync into the newest, repeat. It hung today on snapshot 564: $ ps uww 24575 USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND root 24575 0.0 0.0 6224 332 pts/10 DN 07:35 0:00 btrfs subvolume snapshot /mnt/sde1/snap564 /mnt/sde1/snap565 $ ps lww 24575 F UID PID PPID PRI NIVSZ RSS WCHAN STAT TTYTIME COMMAND 4 0 24575 27716 35 - 6224 332 btrfs_ DN pts/10 0:00 btrfs subvolume snapshot /mnt/sde1/snap564 /mnt/sde1/snap565 $ ps -o wchan 24575 WCHAN btrfs_commit_transaction No messages in "dmesg" or kernel log. Anyone want me to run some other debug tests to find out what is wrong? Anything that tries to access anything inside the btrfs file system /dev/sde1 hangs uninterruptably: 1 0 1863 2 20 0 0 0 wait_f D? 0:29 [btrfs-transacti] 4 0 4933 4925 20 0 26524 2864 lookup D+ pts/10 0:02 /bin/bash 1 777 27995 7318 20 0 26576 1784 vfs_re D+ pts/52 0:00 bash 0 777 29395 7284 20 0 21856 688 vfs_re Dpts/51 0:00 ls -abp --color=auto /mnt/sde1 0 777 29510 7284 20 0 21856 692 vfs_re Dpts/51 0:00 /bin/ls /mnt/sde1 $ ps -o wchan 1863 WCHAN wait_for_commit $ ps -o wchan 27995 WCHAN vfs_readdir -- | Ian! D. Allen - idal...@idallen.ca - Ottawa, Ontario, Canada | Home Page: http://idallen.com/ Contact Improv: http://contactimprov.ca/ | College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/ | Defend digital freedom: http://eff.org/ and have fun: http://fools.ca/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)
Excerpts from Jon Nelson's message of 2010-12-07 22:29:26 -0500: > On Tue, Dec 7, 2010 at 3:02 PM, Chris Mason wrote: > > Excerpts from Jon Nelson's message of 2010-12-07 15:48:58 -0500: > >> On Tue, Dec 7, 2010 at 2:41 PM, Chris Mason wrote: > >> > Excerpts from Jon Nelson's message of 2010-12-07 15:25:47 -0500: > >> >> On Tue, Dec 7, 2010 at 2:02 PM, Chris Mason > >> >> wrote: > >> >> > Excerpts from Jon Nelson's message of 2010-12-07 14:34:40 -0500: > >> >> >> On Tue, Dec 7, 2010 at 12:52 PM, Chris Mason > >> >> >> wrote: > >> >> >> >> postgresql errors. Typically, header corruption but from the > >> >> >> >> limited > >> >> >> >> visibility I've had into this via strace, what I see is zeroed > >> >> >> >> pages > >> >> >> >> where there shouldn't be. > >> >> >> > > >> >> >> > This sounds a lot like a bug higher up than dm-crypt. Zeros tend > >> >> >> > to > >> >> >> > come from some piece of code explicitly filling a page with zeros, > >> >> >> > and > >> >> >> > that often happens in the corner cases for O_DIRECT and a few other > >> >> >> > places in the filesystem. > >> >> >> > > >> >> >> > Have you tried triggering this with a regular block device? > >> >> >> > >> >> >> I just tried the whole set of tests, but with /dev/sdb directly (as > >> >> >> ext4) without any crypt-y bits. > >> >> >> It takes more iterations but out of 6 tests I had one failure: same > >> >> >> type of thing, 'invalid page header in block '. > >> >> >> > >> >> >> I can't guarantee that it is a full-page of zeroes, just what I saw > >> >> >> from the (limited) stracing I did. > >> >> > > >> >> > Fantastic. Now for our usual suspects: > > Maybe not so fantastic. I kept testing and had no more failures. At > all. After 40+ iterations I gave up. > I went back to trying ext4 on a LUKS volume. The 'hit' ratio went to > something like 1 in 3, or better. > > I will continue to do testing with and without LUKS. I did /not/ > reboot between tests, but I do start with a fresh postgres database. > Once we trigger once without dm-crypt, dm-crypt is off the hook. Just to verify, when you say without luks, you mean without any crypto bits in use at all on the filesystems postgres uses? Usually the trick to reproducing filesystem corruptions is adding memory pressure. The corruption is probably a bad interaction between reads and writes, and we need to make sure the reads actually happen. http://oss.oracle.com/~mason/pin_ram.c gcc -Wall -o pin_ram pin_ram.c pin_ram -m 80%-of-your-ram-in-mb The idea is to trigger constant reads without having to swap heavily. 80% might be too much. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: fix compile warning in fs/btrfs/inode.c
While compiling btrfs, I got belows: CC [M] fs/btrfs/inode.o fs/btrfs/inode.c: In function ‘btrfs_end_dio_bio’: fs/btrfs/inode.c:5720: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 4 has type ‘sector_t’ LD [M] fs/btrfs/btrfs.o Building modules, stage 2. MODPOST 1 modules LD [M] fs/btrfs/btrfs.ko This fixes the compile warning. Signed-off-by: Liu Bo --- fs/btrfs/inode.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 0f34cae..eff5aef 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5713,8 +5713,8 @@ static void btrfs_end_dio_bio(struct bio *bio, int err) if (err) { printk(KERN_ERR "btrfs direct IO failed ino %lu rw %lu " "disk_bytenr %lu len %u err no %d\n", - dip->inode->i_ino, bio->bi_rw, bio->bi_sector, - bio->bi_size, err); + dip->inode->i_ino, bio->bi_rw, + (unsigned long)bio->bi_sector, bio->bi_size, err); dip->errors = 1; /* -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: hunt for 2.6.37 dm-crypt+ext4 corruption?
On 12/08/2010 04:29 AM, Jon Nelson wrote: > Maybe not so fantastic. I kept testing and had no more failures. At > all. After 40+ iterations I gave up. > I went back to trying ext4 on a LUKS volume. The 'hit' ratio went to > something like 1 in 3, or better. Encryption usually propagates bit corruption (not sure if it is in this case). But in principle if there is one bit corrupted, after decryption the whole sector is corrupted. (That's why bit media errors have usually more serious impact with FDE.) Isn't there random noise instead of zeroes when reading sparse files? We should probably write some test focusing on sparse files handling here... Milan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html