[RFC PATCH 0/5 v3] Btrfs: Add readonly support to replace BUG_ON phrase
Btrfs has a number of BUG_ON()s, which may lead btrfs to unpleasant panic. Meanwhile, they are very ugly and should be handled more propriately. There are mainly two ways to deal with these BUG_ON()s. 1. For those errors which can be handled well by callers, we just return their error number to callers. 2. For others, We can force the filesystem readonly when it hits errors, which is what this patchset has done. Replaced BUG_ON() with the interface provided in this patchset, we will get error infomation via dmesg. Since btrfs is now readonly, we can save our data safely and umount it, then a btrfsck is recommended. By these ways, we can protect our filesystem from panic caused by those BUG_ONs. We still need a incompat flag to make old kernels happy. This patchset needs more test. v2-v3: - since btrfs may do log replay after crash, even it is mounted as readonly, and we have add a readonly check at start transaction time, it needs to set and to restore readonly flags around log replay. v1-v2: - in order to avoid deadlock thing, move write super stuff from error handle path to unmount time. - remove BTRFS_SUPER_FLAG_VALID, just use BTRFS_SUPER_FLAG_ERROR to make it simple. - add MS_RDONLY check at start of a transaction instead of commit transaction. --- fs/btrfs/ctree.h | 19 ++ fs/btrfs/disk-io.c | 56 +- fs/btrfs/super.c | 88 fs/btrfs/transaction.c |3 ++ 4 files changed, 164 insertions(+), 2 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 3/5 v3] Btrfs: add readonly support for error handle
This patch provide a new error handle interface for those errors that handled by current BUG_ONs. In order to protect btrfs from panic, when it comes to those BUG_ON errors, the interface forces btrfs readonly and saves the FS state to disk. And the filesystem can be umounted, although mabye with some warning in kernel dmesg. Then btrfsck is helpful to recover btrfs. v1-v2: move write super stuff from error handle path to unmount in order to avoid deadlock. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/ctree.h |8 + fs/btrfs/super.c | 88 ++ 2 files changed, 96 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 92b5ca2..fc9b6a0 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2552,6 +2552,14 @@ ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size); /* super.c */ int btrfs_parse_options(struct btrfs_root *root, char *options); int btrfs_sync_fs(struct super_block *sb, int wait); +void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function, +unsigned int line, int errno); + +#define btrfs_std_error(fs_info, errno)\ +do { \ + if ((errno))\ + __btrfs_std_error((fs_info), __func__, __LINE__, (errno));\ +} while (0) /* acl.c */ #ifdef CONFIG_BTRFS_FS_POSIX_ACL diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 718b10d..07c58f9 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -54,6 +54,94 @@ static const struct super_operations btrfs_super_ops; +static const char *btrfs_decode_error(struct btrfs_fs_info *fs_info, int errno, + char nbuf[16]) +{ + char *errstr = NULL; + + switch (errno) { + case -EIO: + errstr = IO failure; + break; + case -ENOMEM: + errstr = Out of memory; + break; + case -EROFS: + errstr = Readonly filesystem; + break; + default: + if (nbuf) { + if (snprintf(nbuf, 16, error %d, -errno) = 0) + errstr = nbuf; + } + break; + } + + return errstr; +} + +static void __save_error_info(struct btrfs_fs_info *fs_info) +{ + struct btrfs_super_block *disk_super = fs_info-super_copy; + + fs_info-fs_state = BTRFS_SUPER_FLAG_ERROR; + disk_super-flags |= cpu_to_le64(BTRFS_SUPER_FLAG_ERROR); + + mutex_lock(fs_info-trans_mutex); + memcpy(fs_info-super_for_commit, disk_super, + sizeof(fs_info-super_for_commit)); + mutex_unlock(fs_info-trans_mutex); +} + +/* NOTE: + * We move write_super stuff at umount in order to avoid deadlock + * for umount hold all lock. + */ +static void save_error_info(struct btrfs_fs_info *fs_info) +{ + __save_error_info(fs_info); +} + +/* btrfs handle error by forcing the filesystem readonly */ +static void btrfs_handle_error(struct btrfs_fs_info *fs_info) +{ + struct super_block *sb = fs_info-sb; + + if (sb-s_flags MS_RDONLY) + return; + + if (fs_info-fs_state BTRFS_SUPER_FLAG_ERROR) { + sb-s_flags |= MS_RDONLY; + printk(KERN_INFO btrfs is forced readonly\n); + } +} + +/* + * __btrfs_std_error decodes expected errors from the caller and + * invokes the approciate error response. + */ +void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function, +unsigned int line, int errno) +{ + struct super_block *sb = fs_info-sb; + char nbuf[16]; + const char *errstr; + + /* +* Special case: if the error is EROFS, and we're already +* under MS_RDONLY, then it is safe here. +*/ + if (errno == -EROFS (sb-s_flags MS_RDONLY)) + return; + + errstr = btrfs_decode_error(fs_info, errno, nbuf); + printk(KERN_CRIT BTRFS error (device %s) in %s:%d: %s\n, + sb-s_id, function, line, errstr); + save_error_info(fs_info); + + btrfs_handle_error(fs_info); +} + static void btrfs_put_super(struct super_block *sb) { struct btrfs_root *root = btrfs_sb(sb); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 2/5 v3] Btrfs: avoid transaction stuff when btrfs is readonly
When the filesystem is readonly, avoid transaction stuff by checking MS_RDONLY at start transaction time. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/transaction.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 1fffbc0..14a597d 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -181,6 +181,9 @@ static struct btrfs_trans_handle *start_transaction(struct btrfs_root *root, struct btrfs_trans_handle *h; struct btrfs_transaction *cur_trans; int ret; + + if (root-fs_info-sb-s_flags MS_RDONLY) + return ERR_PTR(-EROFS); again: h = kmem_cache_alloc(btrfs_trans_handle_cachep, GFP_NOFS); if (!h) -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 1/5 v3] Btrfs: add filesystem state for error handle
Add filesystem state and a flags to tell if the filesystem is valid or insane now. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/ctree.h | 11 +++ 1 files changed, 11 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 8db9234..92b5ca2 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -294,6 +294,14 @@ static inline unsigned long btrfs_chunk_item_size(int num_stripes) #define BTRFS_FSID_SIZE 16 #define BTRFS_HEADER_FLAG_WRITTEN (1ULL 0) #define BTRFS_HEADER_FLAG_RELOC(1ULL 1) + +/* + * File system states + */ + +/* Errors detected */ +#define BTRFS_SUPER_FLAG_ERROR (1ULL 2) + #define BTRFS_SUPER_FLAG_SEEDING (1ULL 32) #define BTRFS_SUPER_FLAG_METADUMP (1ULL 33) @@ -1050,6 +1058,9 @@ struct btrfs_fs_info { unsigned metadata_ratio; void *bdev_holder; + + /* filesystem state */ + u64 fs_state; }; /* -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 4/5 v3] Btrfs: deal with filesystem state at mount, umount
Since there is a filesystem state, we should deal with it carefully at mount, umount and remount. - At mount, the FS state should be checked if there is error on these FS. If it does have, btrfsck is recommended. - At umount, the FS state should be saved into disk for consistency. v2-v3: do write super stuff at umount time. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/disk-io.c | 47 ++- 1 files changed, 46 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index b40dfe4..15d795a 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -43,6 +43,8 @@ static struct extent_io_ops btree_extent_io_ops; static void end_workqueue_fn(struct btrfs_work *work); static void free_fs_root(struct btrfs_root *root); +static void btrfs_check_super_valid(struct btrfs_fs_info *fs_info, + int read_only); /* * end_io_wq structs are used to do processing in task context when an IO is @@ -1700,6 +1702,11 @@ struct btrfs_root *open_ctree(struct super_block *sb, if (!btrfs_super_root(disk_super)) goto fail_iput; + /* check filesystem state */ + fs_info-fs_state |= btrfs_super_flags(disk_super); + + btrfs_check_super_valid(fs_info, sb-s_flags MS_RDONLY); + ret = btrfs_parse_options(tree_root, options); if (ret) { err = ret; @@ -2405,10 +2412,17 @@ int btrfs_commit_super(struct btrfs_root *root) up_write(root-fs_info-cleanup_work_sem); trans = btrfs_join_transaction(root, 1); + if (IS_ERR(trans)) + return PTR_ERR(trans); + ret = btrfs_commit_transaction(trans, root); BUG_ON(ret); + /* run commit again to drop the original snapshot */ trans = btrfs_join_transaction(root, 1); + if (IS_ERR(trans)) + return PTR_ERR(trans); + btrfs_commit_transaction(trans, root); ret = btrfs_write_and_wait_transaction(NULL, root); BUG_ON(ret); @@ -2426,8 +2440,28 @@ int close_ctree(struct btrfs_root *root) smp_mb(); btrfs_put_block_group_cache(fs_info); + + /* +* Here come 2 situations when btrfs flips readonly: +* +* 1. when btrfs flips readonly somewhere else before +* btrfs_commit_super, sb-s_flags has MS_RDONLY flag, +* and btrfs will skip to write sb directly to keep +* ERROR state on disk. +* +* 2. when btrfs flips readonly just in btrfs_commit_super, +* and in such case, btrfs cannnot write sb via btrfs_commit_super, +* and since fs_state has been set BTRFS_SUPER_FLAG_ERROR flag, +* btrfs will directly write sb. +*/ if (!(fs_info-sb-s_flags MS_RDONLY)) { - ret = btrfs_commit_super(root); + ret = btrfs_commit_super(root); + if (ret) + printk(KERN_ERR btrfs: commit super ret %d\n, ret); + } + + if (fs_info-fs_state BTRFS_SUPER_FLAG_ERROR) { + ret = write_ctree_super(NULL, root, 0); if (ret) printk(KERN_ERR btrfs: commit super ret %d\n, ret); } @@ -2603,6 +2637,17 @@ out: return 0; } +static void btrfs_check_super_valid(struct btrfs_fs_info *fs_info, + int read_only) +{ + if (read_only) + return; + + if (fs_info-fs_state BTRFS_SUPER_FLAG_ERROR) + printk(KERN_WARNING warning: mount fs with errors, + running btrfsck is recommended\n); +} + static struct extent_io_ops btree_extent_io_ops = { .write_cache_pages_lock_hook = btree_lock_page_hook, .readpage_end_io_hook = btree_readpage_end_io_hook, -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 5/5 v3] Btrfs: avoid log replay when btrfs is insane
btrfs may do log replay even as mounted readonly, since we have added readonly check at start transaction time, in order to keep the original attribute, it needs to set and to restore readonly flags around log replay. However, we do not permit log replay when btrfs is insane, and log replay can start once btrfs is mounted in good state. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/disk-io.c |9 - 1 files changed, 8 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 15d795a..727e156 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1937,9 +1937,14 @@ struct btrfs_root *open_ctree(struct super_block *sb, btrfs_set_opt(fs_info-mount_opt, SSD); } - if (btrfs_super_log_root(disk_super) != 0) { + if (btrfs_super_log_root(disk_super) != 0 + !(fs_info-fs_state BTRFS_SUPER_FLAG_ERROR)) { u64 bytenr = btrfs_super_log_root(disk_super); + unsigned int s_flags = sb-s_flags; + if (s_flags MS_RDONLY) + sb-s_flags = ~MS_RDONLY; + if (fs_devices-rw_devices == 0) { printk(KERN_WARNING Btrfs log replay required on RO media\n); @@ -1969,6 +1974,8 @@ struct btrfs_root *open_ctree(struct super_block *sb, ret = btrfs_commit_super(tree_root); BUG_ON(ret); } + + sb-s_flags = s_flags; } ret = btrfs_find_orphan_roots(tree_root); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC-PATCH] Re: mounting arbitrary directories
On Mon, Nov 29, 2010 at 11:42 AM, C Anthony Risinger anth...@extof.me wrote: On Sun, Nov 28, 2010 at 4:07 AM, C Anthony Risinger anth...@extof.me wrote: On Nov 27, 2010, at 10:22 PM, Calvin Walton calvin.wal...@gmail.com wrote: On Sat, 2010-11-27 at 21:19 -0600, C Anthony Risinger wrote: eg. if i have a regular directory (not a subvolume) in the top- level: /__boot i can mount it with: mount -o subvol=__boot /dev/sda /mnt The 'subvol' option actually works using the same mechanism as a bind mount. The fact that it works using a directory is purely a coincidence - I would not expect it to be officially supported, and you shouldn't rely on it. Hrrrm... Well I thought I'd be clever and use this little trick one time to update my kernel... Anyways I oops out like 3 times in a row (last func was something.clone in each IIRC) and now I'm stuck with only my mobile since my server isn't set up yet and my SSD just failed on my netbook yesterday :-( So, if anyone can help me recover this partition long enough to grab a few things... I would be most grateful :-) I think I have everything critical, but I'd rather take another look :-) Right now I can't mount; mount is failing w/bad superblock. /me promises to never recklessly sabotage himself after being warned 6 hrs earlier any suggestions for me? i dd'ed the corrupt partition to an external disk because i need the machine back up and running, but if possible i'd like to get the image running again. i mounted it via loopback and as expected got the same errors: (will post the dmesg output next message -- left at home) open_ctree_fd failed wanted found instead the mount command itself fails with bad superblock or ... i have seen others withe similar crash reports and IIRC correctly were able to recover it (seems like something just didn't get updated right before it locked...) any ideas? dmesg output was: -- device label root devid 1 transid 61235 /dev/loop0 parent transid verify failed on 11335634944 wanted 61235 found 61237 parent transid verify failed on 11335634944 wanted 61235 found 61237 parent transid verify failed on 11335634944 wanted 61235 found 61237 btrfs: open_ctree failed -- i tried btrfsck as suggested in the recent thread: -- # sudo losetup /dev/loop0 /mnt/btrfs.img # sudo btrfsck -s 1 /dev/loop0 using SB copy 1, bytenr 67108864 parent transid verify failed on 11335634944 wanted 61235 found 61237 parent transid verify failed on 11335634944 wanted 61235 found 61237 parent transid verify failed on 11335634944 wanted 61235 found 61237 btrfsck: disk-io.c:739: open_ctree_fd: Assertion `!(!tree_root-node)' failed. # sudo btrfsck -s 2 /dev/loop0 using SB copy 2, bytenr 274877906944 No valid Btrfs found on /dev/loop0 -- s... does this mean i'm totally boned for ever mounting this image again? or should i keep it around for later? or anything else i can try/do? C Anthony -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Fwd: What to do about subvolumes?
2010/12/2 David Pottage da...@electric-spoon.com: Therefore I think it is a bad idea if potentially different files on btrfs can have the same inode number. It will break all sorts of tools. Instead of maintaining a big complicated reference count of used inode numbers, could btrfs use bit masks to create a the userland visible inode number from the subvolume id and the real internal inode number. Something like: userland_inode = ( volume_id 48 ) internal_inode; Please forgive me if this is impossible, or if that C snippet is syntactically incorrect. I am not a filesystem or kernel developer, and I have not coded in C for many years. -- David Pottage Expanding on the idea: what about a pool of IDs for subvolumes and inode numbers inside a subvolume having the subvolume ID as a prefix? It gives each inode a unique number, doesn't require cheating the userland and is less costly than keeping reference count for each inode. The obvious downside that I can see is limitation on number of subvolumes that it would be possible to create. It also lowers the maximum number of inodes in a filesystem (because of bits taken up by subvolume ID). I expect there are also less-than obvious downsides. Just an idea by a kernel and FS ignorant. -- Paweł Brodacki -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What to do about subvolumes?
On Thu, Dec 02, 2010 at 11:25:01PM -0500, Chris Ball wrote: Hi Josef, 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode to say Hey, I'm a subvolume and then we can do all of the appropriate magic that way. This unfortunately will be an incompatible format change, but the sooner we get this adressed the easier it will be in the long run. Obviously when I say format change I mean via the incompat bits we have, so old fs's won't be broken and such. Sorry if I've missed this elsewhere in the thread -- will we still have an efficient operation for enumerating subvolumes and snapshots, and how will that work? We're going to want tools like plymouth and grub to be able to list all snapshots without running a large scan. Yeah the idea is we want to fix the problems with the design without breaking anything that currently works. So all the changes I want to make are going to be invisible for the user. Thanks, Josef -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: deal with space cache errors better
Currently if the space cache inode generation number doesn't match the generation number in the space cache header we will just fail to load the space cache, but we won't mark the space cache as an error, so we'll keep getting that error each time somebody tries to cache that block group until we actually clear the thing. Fix this by marking the space cache as having an error so we only get the message once. This patch also makes it so that we don't try and setup space cache for a block group that isn't cached, since we won't be able to write it out anyway. None of these problems are actual problems, they are just annoying and sub-optimal. Thanks, Signed-off-by: Josef Bacik jo...@redhat.com --- fs/btrfs/extent-tree.c | 10 ++ fs/btrfs/free-space-cache.c | 12 +++- 2 files changed, 13 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 87aae66..03f4738 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2746,6 +2746,7 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group, struct btrfs_root *root = block_group-fs_info-tree_root; struct inode *inode = NULL; u64 alloc_hint = 0; + int dcs = BTRFS_DC_ERROR; int num_pages = 0; int retries = 0; int ret = 0; @@ -2800,6 +2801,8 @@ again: spin_lock(block_group-lock); if (block_group-cached != BTRFS_CACHE_FINISHED) { + /* We're not cached, don't bother trying to write stuff out */ + dcs = BTRFS_DC_WRITTEN; spin_unlock(block_group-lock); goto out_put; } @@ -2826,6 +2829,8 @@ again: ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, num_pages, num_pages, num_pages, alloc_hint); + if (!ret) + dcs = BTRFS_DC_SETUP; btrfs_free_reserved_data_space(inode, num_pages); out_put: iput(inode); @@ -2833,10 +2838,7 @@ out_free: btrfs_release_path(root, path); out: spin_lock(block_group-lock); - if (ret) - block_group-disk_cache_state = BTRFS_DC_ERROR; - else - block_group-disk_cache_state = BTRFS_DC_SETUP; + block_group-disk_cache_state = dcs; spin_unlock(block_group-lock); return ret; diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 22ee0dc..60d6842 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -290,7 +290,7 @@ int load_free_space_cache(struct btrfs_fs_info *fs_info, (unsigned long long)BTRFS_I(inode)-generation, (unsigned long long)generation, (unsigned long long)block_group-key.objectid); - goto out; + goto free_cache; } if (!num_entries) @@ -524,6 +524,12 @@ int btrfs_write_out_cache(struct btrfs_root *root, return 0; } + node = rb_first(block_group-free_space_offset); + if (!node) { + iput(inode); + return 0; + } + last_index = (i_size_read(inode) - 1) PAGE_CACHE_SHIFT; filemap_write_and_wait(inode-i_mapping); btrfs_wait_ordered_range(inode, inode-i_size @@ -543,10 +549,6 @@ int btrfs_write_out_cache(struct btrfs_root *root, */ first_page_offset = (sizeof(u32) * num_checksums) + sizeof(u64); - node = rb_first(block_group-free_space_offset); - if (!node) - goto out_free; - /* * Lock all pages first so we can lock the extent safely. * -- 1.6.6.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What to do about subvolumes?
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote: Hello, Various people have complained about how BTRFS deals with subvolumes recently, specifically the fact that they all have the same inode number, and there's no discrete seperation from one subvolume to another. Christoph asked that I lay out a basic design document of how we want subvolumes to work so we can hash everything out now, fix what is broken, and then move forward with a design that everybody is more or less happy with. I apologize in advance for how freaking long this email is going to be. I assume that most people are generally familiar with how BTRFS works, so I'm not going to bother explaining in great detail some stuff. === What are subvolumes? === They are just another tree. In BTRFS we have various b-trees to describe the filesystem. A few of them are filesystem wide, such as the extent tree, chunk tree, root tree etc. The tree's that hold the actual filesystem data, that is inodes and such, are kept in their own b-tree. This is how subvolumes and snapshots appear on disk, they are simply new b-trees with all of the file data contained within them. === What do subvolumes look like? === All the user sees are directories. They act like any other directory acts, with a few exceptions 1) You cannot hardlink between subvolumes. This is because subvolumes have their own inode numbers and such, think of them as seperate mounts in this case, you cannot hardlink between two mounts because the link needs to point to the same on disk inode, which is impossible between two different filesystems. The same is true for subvolumes, they have their own trees with their own inodes and inode numbers, so it's impossible to hardlink between them. 1a) In case it wasn't clear from above, each subvolume has their own inode numbers, so you can have the same inode numbers used between two different subvolumes, since they are two different trees. 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's extra metadata to keep track of them, so you have to use one of our ioctls to delete subvolumes/snapshots. But permissions and everything else they are the same. There is one tricky thing. When you create a subvolume, the directory inode that is created in the parent subvolume has the inode number of 256. So if you have a bunch of subvolumes in the same parent subvolume, you are going to have a bunch of directories with the inode number of 256. This is so when users cd into a subvolume we can know its a subvolume and do all the normal voodoo to start looking in the subvolumes tree instead of the parent subvolumes tree. This is where things go a bit sideways. We had serious problems with NFS, but thankfully NFS gives us a bunch of hooks to get around these problems. CIFS/Samba do not, so we will have problems there, not to mention any other userspace application that looks at inode numbers. === How do we want subvolumes to work from a user perspective? === 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets. 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating. 3) Subvolumes shouldn't need to be specifically mounted. This is also important, we don't want users to have to go around mounting their subvolumes up manually one-by-one. Today users just cd into subvolumes and it works, just like cd'ing into a directory. === Quotas === This is a huge topic in and of itself, but Christoph mentioned wanting to have an idea of what we wanted to do with it, so I'm putting it here. There are really 2 things here 1) Limiting the size of subvolumes. This is really easy for us, just create a subvolume and at creation time set a maximum size it can grow to and not let it go farther than that. Nice, simple and straightforward. 2) Normal quotas, via the quota tools. This just comes down to how do we want to charge users, do we want to do it per subvolume, or per filesystem. My vote is per filesystem. Obviously this will make it tricky with snapshots, but I think if we're just charging the diff's between the original volume and the snapshot to the user then that will be the easiest for people to understand, rather than making a snapshot all of a sudden count the users currently used quota * 2. === What do we do? === This is where I expect to see the most discussion. Here is what I want to do 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode to say Hey, I'm a subvolume and then we can do all of the appropriate magic that way.
Re: What to do about subvolumes?
On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote: So now that I've actually looked at everything, it looks like the semantics are all right for subvolumes 1) readdir - we return the root id in d_ino, which is unique across the fs Though Michael Vrable pointed out an apparent collision with normal inode numbers on the parent filesystem? 2) stat - we return 256 for all subvolumes, because that is their inode number 3) dev_t - we setup an anon super for all volumes, so they all get their own dev_t, which is set properly for all of their children, see below [r...@test1244 btrfs-test]# stat . File: `.' Size: 20 Blocks: 8 IO Block: 4096 directory Device: 15h/21d Inode: 256 Links: 1 Access: (0555/dr-xr-xr-x) Uid: (0/root) Gid: (0/root) Access: 2010-12-03 15:35:41.931679393 -0500 Modify: 2010-12-03 15:35:20.405679493 -0500 Change: 2010-12-03 15:35:20.405679493 -0500 [r...@test1244 btrfs-test]# stat foo File: `foo' Size: 12 Blocks: 0 IO Block: 4096 directory Device: 19h/25d Inode: 256 Links: 1 Access: (0700/drwx--) Uid: (0/root) Gid: (0/root) Access: 2010-12-03 15:35:17.501679393 -0500 Modify: 2010-12-03 15:35:59.150680051 -0500 Change: 2010-12-03 15:35:59.150680051 -0500 [r...@test1244 btrfs-test]# stat foo/foobar File: `foo/foobar' Size: 0 Blocks: 0 IO Block: 4096 regular empty file Device: 19h/25d Inode: 257 Links: 1 Access: (0644/-rw-r--r--) Uid: (0/root) Gid: (0/root) Access: 2010-12-03 15:35:59.150680051 -0500 Modify: 2010-12-03 15:35:59.150680051 -0500 Change: 2010-12-03 15:35:59.150680051 -0500 So as far as the user is concerned, everything should come out right. Obviously we had to do the NFS trickery still because as far as VFS is concerned the subvolumes are all on the same mount. So the question is this (and really this is directed at Christoph and Bruce and anybody else who may care), is this good enough, or do we want to have a seperate vfsmount for each subvolume? Thanks, For nfsd's purposes, we need to be able find out about filesystems in two different ways: 1. Lookup by filehandle: we need to be able to identify which subvolume we're dealing with from a filehandle. 2. Lookup by path: we need to notice when we cross into a subvolume. Looks like #1 already works. Not #2: the current nfsd code just checks for mountpoints. We could modify nfsd to also check whether dev_t changed each time it did a lookup. I suppose it would work, though it's annoying to have to do it just for the case of btrfs. As far as I can tell, crossing into a subvolume is like crossing a mountpoint in every way except for the lack of a separate vfsmount. I'd worry that the inconsistency will end up requiring more special cases down the road, but I don't have any in mind. --b. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What to do about subvolumes?
On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote: On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote: Hello, Various people have complained about how BTRFS deals with subvolumes recently, specifically the fact that they all have the same inode number, and there's no discrete seperation from one subvolume to another. Christoph asked that I lay out a basic design document of how we want subvolumes to work so we can hash everything out now, fix what is broken, and then move forward with a design that everybody is more or less happy with. I apologize in advance for how freaking long this email is going to be. I assume that most people are generally familiar with how BTRFS works, so I'm not going to bother explaining in great detail some stuff. are things that cannot be fixed now. Some of these changes will require incompat format changes, but it's either we fix it now, or later on down the road when BTRFS starts getting used in production really find out how many things our current scheme breaks and then have to do the changes then. Thanks, So now that I've actually looked at everything, it looks like the semantics are all right for subvolumes 1) readdir - we return the root id in d_ino, which is unique across the fs 2) stat - we return 256 for all subvolumes, because that is their inode number 3) dev_t - we setup an anon super for all volumes, so they all get their own dev_t, which is set properly for all of their children, see below A property of NFS fileshandles is that they must be stable across server reboots. Is this anon dev_t used as part of the NFS filehandle and if so how can you guarantee that it is stable? Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What to do about subvolumes?
Excerpts from Dave Chinner's message of 2010-12-03 17:27:56 -0500: On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote: On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote: Hello, Various people have complained about how BTRFS deals with subvolumes recently, specifically the fact that they all have the same inode number, and there's no discrete seperation from one subvolume to another. Christoph asked that I lay out a basic design document of how we want subvolumes to work so we can hash everything out now, fix what is broken, and then move forward with a design that everybody is more or less happy with. I apologize in advance for how freaking long this email is going to be. I assume that most people are generally familiar with how BTRFS works, so I'm not going to bother explaining in great detail some stuff. are things that cannot be fixed now. Some of these changes will require incompat format changes, but it's either we fix it now, or later on down the road when BTRFS starts getting used in production really find out how many things our current scheme breaks and then have to do the changes then. Thanks, So now that I've actually looked at everything, it looks like the semantics are all right for subvolumes 1) readdir - we return the root id in d_ino, which is unique across the fs 2) stat - we return 256 for all subvolumes, because that is their inode number 3) dev_t - we setup an anon super for all volumes, so they all get their own dev_t, which is set properly for all of their children, see below A property of NFS fileshandles is that they must be stable across server reboots. Is this anon dev_t used as part of the NFS filehandle and if so how can you guarantee that it is stable? It isn't today, that's something we'll have to address. -chris -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What to do about subvolumes?
On Fri, Dec 03, 2010 at 05:29:24PM -0500, Chris Mason wrote: Excerpts from Dave Chinner's message of 2010-12-03 17:27:56 -0500: On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote: On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote: Hello, Various people have complained about how BTRFS deals with subvolumes recently, specifically the fact that they all have the same inode number, and there's no discrete seperation from one subvolume to another. Christoph asked that I lay out a basic design document of how we want subvolumes to work so we can hash everything out now, fix what is broken, and then move forward with a design that everybody is more or less happy with. I apologize in advance for how freaking long this email is going to be. I assume that most people are generally familiar with how BTRFS works, so I'm not going to bother explaining in great detail some stuff. are things that cannot be fixed now. Some of these changes will require incompat format changes, but it's either we fix it now, or later on down the road when BTRFS starts getting used in production really find out how many things our current scheme breaks and then have to do the changes then. Thanks, So now that I've actually looked at everything, it looks like the semantics are all right for subvolumes 1) readdir - we return the root id in d_ino, which is unique across the fs 2) stat - we return 256 for all subvolumes, because that is their inode number 3) dev_t - we setup an anon super for all volumes, so they all get their own dev_t, which is set properly for all of their children, see below A property of NFS fileshandles is that they must be stable across server reboots. Is this anon dev_t used as part of the NFS filehandle and if so how can you guarantee that it is stable? It isn't today, that's something we'll have to address. We're using statfs64.fs_fsid for this; I believe that's both stable across reboots and distinguishes between subvolumes, so that's OK. (That said, since fs_fsid doesn't work for other filesystems, we depend on an explicit check for a filesystem type of btrfs, which is awful--btrfs won't always be the only filesystem that wants to do this kind of thing, etc.) --b. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What to do about subvolumes?
On 2010-12-03, at 15:45, J. Bruce Fields wrote: We're using statfs64.fs_fsid for this; I believe that's both stable across reboots and distinguishes between subvolumes, so that's OK. (That said, since fs_fsid doesn't work for other filesystems, we depend on an explicit check for a filesystem type of btrfs, which is awful--btrfs won't always be the only filesystem that wants to do this kind of thing, etc.) Sigh, I wanted to be able to specify the NFS FSID directly from within the kernel for Lustre many years already. Glad to see that this is moving forward. Any chance we can add a -get_fsid(sb, inode) method to export_operations (or something simiar), that allows the filesystem to generate an FSID based on the volume and inode that is being exported? Cheers, Andreas -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html