[RFC PATCH 0/5 v3] Btrfs: Add readonly support to replace BUG_ON phrase

2010-12-03 Thread liubo
Btrfs has a number of BUG_ON()s, which may lead btrfs to unpleasant panic.
Meanwhile, they are very ugly and should be handled more propriately.

There are mainly two ways to deal with these BUG_ON()s.

1. For those errors which can be handled well by callers, we just return their
error number to callers.

2. For others, We can force the filesystem readonly when it hits errors, which
 is what this patchset has done. Replaced BUG_ON() with the interface provided
 in this patchset, we will get error infomation via dmesg. Since btrfs is now 
readonly, we can save our data safely and umount it, then a btrfsck is 
recommended.

By these ways, we can protect our filesystem from panic caused by those 
BUG_ONs.

We still need a incompat flag to make old kernels happy.

This patchset needs more test.

v2-v3:
- since btrfs may do log replay after crash, even it is mounted as readonly,
  and we have add a readonly check at start transaction time, it needs to set
  and to restore readonly flags around log replay.

v1-v2:
- in order to avoid deadlock thing, move write super stuff from error handle
  path to unmount time.
- remove BTRFS_SUPER_FLAG_VALID, just use BTRFS_SUPER_FLAG_ERROR to make it
  simple.
- add MS_RDONLY check at start of a transaction instead of commit transaction.
---
 fs/btrfs/ctree.h   |   19 ++
 fs/btrfs/disk-io.c |   56 +-
 fs/btrfs/super.c   |   88 
 fs/btrfs/transaction.c |3 ++
 4 files changed, 164 insertions(+), 2 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 3/5 v3] Btrfs: add readonly support for error handle

2010-12-03 Thread liubo
This patch provide a new error handle interface for those errors that handled
by current BUG_ONs.

In order to protect btrfs from panic, when it comes to those BUG_ON errors, 
the interface forces btrfs readonly and saves the FS state to disk. And the 
filesystem can be umounted, although mabye with some warning in kernel dmesg.
Then btrfsck is helpful to recover btrfs.

v1-v2:
move write super stuff from error handle path to unmount in order to avoid
deadlock.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/ctree.h |8 +
 fs/btrfs/super.c |   88 ++
 2 files changed, 96 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 92b5ca2..fc9b6a0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2552,6 +2552,14 @@ ssize_t btrfs_listxattr(struct dentry *dentry, char 
*buffer, size_t size);
 /* super.c */
 int btrfs_parse_options(struct btrfs_root *root, char *options);
 int btrfs_sync_fs(struct super_block *sb, int wait);
+void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function,
+unsigned int line, int errno);
+
+#define btrfs_std_error(fs_info, errno)\
+do {   \
+   if ((errno))\
+   __btrfs_std_error((fs_info), __func__, __LINE__, (errno));\
+} while (0)
 
 /* acl.c */
 #ifdef CONFIG_BTRFS_FS_POSIX_ACL
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 718b10d..07c58f9 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -54,6 +54,94 @@
 
 static const struct super_operations btrfs_super_ops;
 
+static const char *btrfs_decode_error(struct btrfs_fs_info *fs_info, int errno,
+ char nbuf[16])
+{
+   char *errstr = NULL;
+
+   switch (errno) {
+   case -EIO:
+   errstr = IO failure;
+   break;
+   case -ENOMEM:
+   errstr = Out of memory;
+   break;
+   case -EROFS:
+   errstr = Readonly filesystem;
+   break;
+   default:
+   if (nbuf) {
+   if (snprintf(nbuf, 16, error %d, -errno) = 0)
+   errstr = nbuf;
+   }
+   break;
+   }
+
+   return errstr;
+}
+
+static void __save_error_info(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_super_block *disk_super = fs_info-super_copy;
+
+   fs_info-fs_state = BTRFS_SUPER_FLAG_ERROR;
+   disk_super-flags |= cpu_to_le64(BTRFS_SUPER_FLAG_ERROR);
+
+   mutex_lock(fs_info-trans_mutex);
+   memcpy(fs_info-super_for_commit, disk_super,
+  sizeof(fs_info-super_for_commit));
+   mutex_unlock(fs_info-trans_mutex);
+}
+
+/* NOTE:
+ * We move write_super stuff at umount in order to avoid deadlock
+ * for umount hold all lock.
+ */
+static void save_error_info(struct btrfs_fs_info *fs_info)
+{
+   __save_error_info(fs_info);
+}
+
+/* btrfs handle error by forcing the filesystem readonly */
+static void btrfs_handle_error(struct btrfs_fs_info *fs_info)
+{
+   struct super_block *sb = fs_info-sb;
+
+   if (sb-s_flags  MS_RDONLY)
+   return;
+
+   if (fs_info-fs_state  BTRFS_SUPER_FLAG_ERROR) {
+   sb-s_flags |= MS_RDONLY;
+   printk(KERN_INFO btrfs is forced readonly\n);
+   }
+}
+
+/*
+ * __btrfs_std_error decodes expected errors from the caller and
+ * invokes the approciate error response.
+ */
+void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function,
+unsigned int line, int errno)
+{
+   struct super_block *sb = fs_info-sb;
+   char nbuf[16];
+   const char *errstr;
+
+   /*
+* Special case: if the error is EROFS, and we're already
+* under MS_RDONLY, then it is safe here.
+*/
+   if (errno == -EROFS  (sb-s_flags  MS_RDONLY))
+   return;
+
+   errstr = btrfs_decode_error(fs_info, errno, nbuf);
+   printk(KERN_CRIT BTRFS error (device %s) in %s:%d: %s\n,
+   sb-s_id, function, line, errstr);
+   save_error_info(fs_info);
+
+   btrfs_handle_error(fs_info);
+}
+
 static void btrfs_put_super(struct super_block *sb)
 {
struct btrfs_root *root = btrfs_sb(sb);
-- 
1.7.1
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 2/5 v3] Btrfs: avoid transaction stuff when btrfs is readonly

2010-12-03 Thread liubo
When the filesystem is readonly, avoid transaction stuff by checking MS_RDONLY
at start transaction time.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/transaction.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1fffbc0..14a597d 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -181,6 +181,9 @@ static struct btrfs_trans_handle *start_transaction(struct 
btrfs_root *root,
struct btrfs_trans_handle *h;
struct btrfs_transaction *cur_trans;
int ret;
+
+   if (root-fs_info-sb-s_flags  MS_RDONLY)
+   return ERR_PTR(-EROFS);
 again:
h = kmem_cache_alloc(btrfs_trans_handle_cachep, GFP_NOFS);
if (!h)
-- 
1.7.1
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 1/5 v3] Btrfs: add filesystem state for error handle

2010-12-03 Thread liubo
Add filesystem state and a flags to tell if the filesystem is 
valid or insane now.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/ctree.h |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8db9234..92b5ca2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -294,6 +294,14 @@ static inline unsigned long btrfs_chunk_item_size(int 
num_stripes)
 #define BTRFS_FSID_SIZE 16
 #define BTRFS_HEADER_FLAG_WRITTEN  (1ULL  0)
 #define BTRFS_HEADER_FLAG_RELOC(1ULL  1)
+
+/*
+ * File system states
+ */
+
+/* Errors detected */
+#define BTRFS_SUPER_FLAG_ERROR (1ULL  2)
+
 #define BTRFS_SUPER_FLAG_SEEDING   (1ULL  32)
 #define BTRFS_SUPER_FLAG_METADUMP  (1ULL  33)
 
@@ -1050,6 +1058,9 @@ struct btrfs_fs_info {
unsigned metadata_ratio;
 
void *bdev_holder;
+
+   /* filesystem state */
+   u64 fs_state;
 };
 
 /*
-- 
1.7.1
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 4/5 v3] Btrfs: deal with filesystem state at mount, umount

2010-12-03 Thread liubo
Since there is a filesystem state, we should deal with it carefully at mount,
umount and remount.

- At mount, the FS state should be checked if there is error on these FS.
  If it does have, btrfsck is recommended.
- At umount, the FS state should be saved into disk for consistency.

v2-v3:
do write super stuff at umount time.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/disk-io.c |   47 ++-
 1 files changed, 46 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b40dfe4..15d795a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -43,6 +43,8 @@
 static struct extent_io_ops btree_extent_io_ops;
 static void end_workqueue_fn(struct btrfs_work *work);
 static void free_fs_root(struct btrfs_root *root);
+static void btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
+   int read_only);
 
 /*
  * end_io_wq structs are used to do processing in task context when an IO is
@@ -1700,6 +1702,11 @@ struct btrfs_root *open_ctree(struct super_block *sb,
if (!btrfs_super_root(disk_super))
goto fail_iput;
 
+   /* check filesystem state */
+   fs_info-fs_state |= btrfs_super_flags(disk_super);
+
+   btrfs_check_super_valid(fs_info, sb-s_flags  MS_RDONLY);
+
ret = btrfs_parse_options(tree_root, options);
if (ret) {
err = ret;
@@ -2405,10 +2412,17 @@ int btrfs_commit_super(struct btrfs_root *root)
up_write(root-fs_info-cleanup_work_sem);
 
trans = btrfs_join_transaction(root, 1);
+   if (IS_ERR(trans))
+   return PTR_ERR(trans);
+
ret = btrfs_commit_transaction(trans, root);
BUG_ON(ret);
+
/* run commit again to drop the original snapshot */
trans = btrfs_join_transaction(root, 1);
+   if (IS_ERR(trans))
+   return PTR_ERR(trans);
+
btrfs_commit_transaction(trans, root);
ret = btrfs_write_and_wait_transaction(NULL, root);
BUG_ON(ret);
@@ -2426,8 +2440,28 @@ int close_ctree(struct btrfs_root *root)
smp_mb();
 
btrfs_put_block_group_cache(fs_info);
+
+   /*
+* Here come 2 situations when btrfs flips readonly:
+*
+* 1. when btrfs flips readonly somewhere else before
+* btrfs_commit_super, sb-s_flags has MS_RDONLY flag,
+* and btrfs will skip to write sb directly to keep
+* ERROR state on disk.
+*
+* 2. when btrfs flips readonly just in btrfs_commit_super,
+* and in such case, btrfs cannnot write sb via btrfs_commit_super,
+* and since fs_state has been set BTRFS_SUPER_FLAG_ERROR flag,
+* btrfs will directly write sb.
+*/
if (!(fs_info-sb-s_flags  MS_RDONLY)) {
-   ret =  btrfs_commit_super(root);
+   ret = btrfs_commit_super(root);
+   if (ret)
+   printk(KERN_ERR btrfs: commit super ret %d\n, ret);
+   }
+
+   if (fs_info-fs_state  BTRFS_SUPER_FLAG_ERROR) {
+   ret = write_ctree_super(NULL, root, 0);
if (ret)
printk(KERN_ERR btrfs: commit super ret %d\n, ret);
}
@@ -2603,6 +2637,17 @@ out:
return 0;
 }
 
+static void btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
+ int read_only)
+{
+   if (read_only)
+   return;
+
+   if (fs_info-fs_state  BTRFS_SUPER_FLAG_ERROR)
+   printk(KERN_WARNING warning: mount fs with errors, 
+  running btrfsck is recommended\n);
+}
+
 static struct extent_io_ops btree_extent_io_ops = {
.write_cache_pages_lock_hook = btree_lock_page_hook,
.readpage_end_io_hook = btree_readpage_end_io_hook,
-- 
1.7.1
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 5/5 v3] Btrfs: avoid log replay when btrfs is insane

2010-12-03 Thread liubo
btrfs may do log replay even as mounted readonly, since we have added
readonly check at start transaction time, in order to keep the original
attribute, it needs to set and to restore readonly flags around log
replay.
However, we do not permit log replay when btrfs is insane, and log replay
can start once btrfs is mounted in good state.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/disk-io.c |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 15d795a..727e156 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1937,9 +1937,14 @@ struct btrfs_root *open_ctree(struct super_block *sb,
btrfs_set_opt(fs_info-mount_opt, SSD);
}
 
-   if (btrfs_super_log_root(disk_super) != 0) {
+   if (btrfs_super_log_root(disk_super) != 0 
+   !(fs_info-fs_state  BTRFS_SUPER_FLAG_ERROR)) {
u64 bytenr = btrfs_super_log_root(disk_super);
 
+   unsigned int s_flags = sb-s_flags;
+   if (s_flags  MS_RDONLY)
+   sb-s_flags = ~MS_RDONLY;
+
if (fs_devices-rw_devices == 0) {
printk(KERN_WARNING Btrfs log replay required 
   on RO media\n);
@@ -1969,6 +1974,8 @@ struct btrfs_root *open_ctree(struct super_block *sb,
ret =  btrfs_commit_super(tree_root);
BUG_ON(ret);
}
+
+   sb-s_flags = s_flags;
}
 
ret = btrfs_find_orphan_roots(tree_root);
-- 
1.7.1
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC-PATCH] Re: mounting arbitrary directories

2010-12-03 Thread C Anthony Risinger
On Mon, Nov 29, 2010 at 11:42 AM, C Anthony Risinger anth...@extof.me wrote:
 On Sun, Nov 28, 2010 at 4:07 AM, C Anthony Risinger anth...@extof.me wrote:
 On Nov 27, 2010, at 10:22 PM, Calvin Walton calvin.wal...@gmail.com
 wrote:
 On Sat, 2010-11-27 at 21:19 -0600, C Anthony Risinger wrote:

 eg. if i have a regular directory (not a subvolume) in the top-
 level:

 /__boot

 i can mount it with:

 mount -o subvol=__boot /dev/sda /mnt

 The 'subvol' option actually works using the same mechanism as a bind
 mount. The fact that it works using a directory is purely a
 coincidence
 - I would not expect it to be officially supported, and you shouldn't
 rely on it.

 Hrrrm... Well I thought I'd be clever and use this little trick one
 time to update my kernel... Anyways I oops out like 3 times in a row
 (last func was something.clone in each IIRC) and now I'm stuck with
 only my mobile since my server isn't set up yet and my  SSD just
 failed on my netbook yesterday :-(

 So, if anyone can help me recover this partition long enough to
 grab a few things... I would be most grateful :-) I think I have
 everything critical, but I'd rather take another look :-) Right now I
 can't mount; mount is failing w/bad superblock.

 /me promises to never recklessly sabotage himself after being warned 
 6 hrs earlier

 any suggestions for me?  i dd'ed the corrupt partition to an external
 disk because i need the machine back up and running, but if possible
 i'd like to get the image running again.  i mounted it via loopback
 and as expected got the same errors:

 (will post the dmesg output next message -- left at home)
 open_ctree_fd failed
 wanted  found  instead

 the mount command itself fails with bad superblock or ...

 i have seen others withe similar crash reports and IIRC correctly were
 able to recover it (seems like something just didn't get updated right
 before it locked...)

 any ideas?

dmesg output was:

--
device label root devid 1 transid 61235 /dev/loop0
parent transid verify failed on 11335634944 wanted 61235 found 61237
parent transid verify failed on 11335634944 wanted 61235 found 61237
parent transid verify failed on 11335634944 wanted 61235 found 61237
btrfs: open_ctree failed
--

i tried btrfsck as suggested in the recent thread:

--
# sudo losetup /dev/loop0 /mnt/btrfs.img

# sudo btrfsck -s 1 /dev/loop0
using SB copy 1, bytenr 67108864
parent transid verify failed on 11335634944 wanted 61235 found 61237
parent transid verify failed on 11335634944 wanted 61235 found 61237
parent transid verify failed on 11335634944 wanted 61235 found 61237
btrfsck: disk-io.c:739: open_ctree_fd: Assertion `!(!tree_root-node)' failed.

# sudo btrfsck -s 2 /dev/loop0
using SB copy 2, bytenr 274877906944
No valid Btrfs found on /dev/loop0
--

s... does this mean i'm totally boned for ever mounting this image
again?  or should i keep it around for later?  or anything else i can
try/do?

C Anthony
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fwd: What to do about subvolumes?

2010-12-03 Thread Paweł Brodacki
2010/12/2 David Pottage da...@electric-spoon.com:

 Therefore I think it is a bad idea if potentially different files on btrfs
 can have the same inode number. It will break all sorts of tools.

 Instead of maintaining a big complicated reference count of used inode
 numbers, could btrfs use bit masks to create a the userland visible inode
 number from the subvolume id and the real internal inode number. Something
 like:

 userland_inode = ( volume_id  48 )  internal_inode;

 Please forgive me if this is impossible, or if that C snippet is
 syntactically incorrect. I am not a filesystem or kernel developer, and I
 have not coded in C for many years.

 --
 David Pottage


Expanding on the idea: what about a pool of IDs for subvolumes and
inode numbers inside a subvolume having the subvolume ID as a prefix?
It gives each inode a unique number, doesn't require cheating the
userland and is less costly than keeping reference count for each
inode. The obvious downside that I can see is limitation on number of
subvolumes that it would be possible to create. It also lowers the
maximum number of inodes in a filesystem (because of bits taken up by
subvolume ID). I expect there are also less-than obvious downsides.

Just an idea by a kernel and FS ignorant.

--
Paweł Brodacki
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-03 Thread Josef Bacik
On Thu, Dec 02, 2010 at 11:25:01PM -0500, Chris Ball wrote:
 Hi Josef,
 
 1) Scrap the 256 inode number thing.  Instead we'll just put a
 flag in the inode to say Hey, I'm a subvolume and then we can
 do all of the appropriate magic that way.  This unfortunately
 will be an incompatible format change, but the sooner we get this
 adressed the easier it will be in the long run.  Obviously when I
 say format change I mean via the incompat bits we have, so old
 fs's won't be broken and such.
 
 Sorry if I've missed this elsewhere in the thread -- will we still
 have an efficient operation for enumerating subvolumes and snapshots,
 and how will that work?  We're going to want tools like plymouth and
 grub to be able to list all snapshots without running a large scan.


Yeah the idea is we want to fix the problems with the design without breaking
anything that currently works.  So all the changes I want to make are going to
be invisible for the user.  Thanks,

Josef 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: deal with space cache errors better

2010-12-03 Thread Josef Bacik
Currently if the space cache inode generation number doesn't match the
generation number in the space cache header we will just fail to load the space
cache, but we won't mark the space cache as an error, so we'll keep getting that
error each time somebody tries to cache that block group until we actually clear
the thing.  Fix this by marking the space cache as having an error so we only
get the message once.  This patch also makes it so that we don't try and setup
space cache for a block group that isn't cached, since we won't be able to write
it out anyway.  None of these problems are actual problems, they are just
annoying and sub-optimal.  Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/btrfs/extent-tree.c  |   10 ++
 fs/btrfs/free-space-cache.c |   12 +++-
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 87aae66..03f4738 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2746,6 +2746,7 @@ static int cache_save_setup(struct 
btrfs_block_group_cache *block_group,
struct btrfs_root *root = block_group-fs_info-tree_root;
struct inode *inode = NULL;
u64 alloc_hint = 0;
+   int dcs = BTRFS_DC_ERROR;
int num_pages = 0;
int retries = 0;
int ret = 0;
@@ -2800,6 +2801,8 @@ again:
 
spin_lock(block_group-lock);
if (block_group-cached != BTRFS_CACHE_FINISHED) {
+   /* We're not cached, don't bother trying to write stuff out */
+   dcs = BTRFS_DC_WRITTEN;
spin_unlock(block_group-lock);
goto out_put;
}
@@ -2826,6 +2829,8 @@ again:
ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, num_pages,
  num_pages, num_pages,
  alloc_hint);
+   if (!ret)
+   dcs = BTRFS_DC_SETUP;
btrfs_free_reserved_data_space(inode, num_pages);
 out_put:
iput(inode);
@@ -2833,10 +2838,7 @@ out_free:
btrfs_release_path(root, path);
 out:
spin_lock(block_group-lock);
-   if (ret)
-   block_group-disk_cache_state = BTRFS_DC_ERROR;
-   else
-   block_group-disk_cache_state = BTRFS_DC_SETUP;
+   block_group-disk_cache_state = dcs;
spin_unlock(block_group-lock);
 
return ret;
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 22ee0dc..60d6842 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -290,7 +290,7 @@ int load_free_space_cache(struct btrfs_fs_info *fs_info,
   (unsigned long long)BTRFS_I(inode)-generation,
   (unsigned long long)generation,
   (unsigned long long)block_group-key.objectid);
-   goto out;
+   goto free_cache;
}
 
if (!num_entries)
@@ -524,6 +524,12 @@ int btrfs_write_out_cache(struct btrfs_root *root,
return 0;
}
 
+   node = rb_first(block_group-free_space_offset);
+   if (!node) {
+   iput(inode);
+   return 0;
+   }
+
last_index = (i_size_read(inode) - 1)  PAGE_CACHE_SHIFT;
filemap_write_and_wait(inode-i_mapping);
btrfs_wait_ordered_range(inode, inode-i_size 
@@ -543,10 +549,6 @@ int btrfs_write_out_cache(struct btrfs_root *root,
 */
first_page_offset = (sizeof(u32) * num_checksums) + sizeof(u64);
 
-   node = rb_first(block_group-free_space_offset);
-   if (!node)
-   goto out_free;
-
/*
 * Lock all pages first so we can lock the extent safely.
 *
-- 
1.6.6.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-03 Thread Josef Bacik
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
 Hello,
 
 Various people have complained about how BTRFS deals with subvolumes recently,
 specifically the fact that they all have the same inode number, and there's no
 discrete seperation from one subvolume to another.  Christoph asked that I lay
 out a basic design document of how we want subvolumes to work so we can hash
 everything out now, fix what is broken, and then move forward with a design 
 that
 everybody is more or less happy with.  I apologize in advance for how freaking
 long this email is going to be.  I assume that most people are generally
 familiar with how BTRFS works, so I'm not going to bother explaining in great
 detail some stuff.
 
 === What are subvolumes? ===
 
 They are just another tree.  In BTRFS we have various b-trees to describe the
 filesystem.  A few of them are filesystem wide, such as the extent tree, chunk
 tree, root tree etc.  The tree's that hold the actual filesystem data, that is
 inodes and such, are kept in their own b-tree.  This is how subvolumes and
 snapshots appear on disk, they are simply new b-trees with all of the file 
 data
 contained within them.
 
 === What do subvolumes look like? ===
 
 All the user sees are directories.  They act like any other directory acts, 
 with
 a few exceptions
 
 1) You cannot hardlink between subvolumes.  This is because subvolumes have
 their own inode numbers and such, think of them as seperate mounts in this 
 case,
 you cannot hardlink between two mounts because the link needs to point to the
 same on disk inode, which is impossible between two different filesystems.  
 The
 same is true for subvolumes, they have their own trees with their own inodes 
 and
 inode numbers, so it's impossible to hardlink between them.
 
 1a) In case it wasn't clear from above, each subvolume has their own inode
 numbers, so you can have the same inode numbers used between two different
 subvolumes, since they are two different trees.
 
 2) Obviously you can't just rm -rf subvolumes.  Because they are roots there's
 extra metadata to keep track of them, so you have to use one of our ioctls to
 delete subvolumes/snapshots.
 
 But permissions and everything else they are the same.
 
 There is one tricky thing.  When you create a subvolume, the directory inode
 that is created in the parent subvolume has the inode number of 256.  So if 
 you
 have a bunch of subvolumes in the same parent subvolume, you are going to 
 have a
 bunch of directories with the inode number of 256.  This is so when users cd
 into a subvolume we can know its a subvolume and do all the normal voodoo to
 start looking in the subvolumes tree instead of the parent subvolumes tree.
 
 This is where things go a bit sideways.  We had serious problems with NFS, but
 thankfully NFS gives us a bunch of hooks to get around these problems.
 CIFS/Samba do not, so we will have problems there, not to mention any other
 userspace application that looks at inode numbers.
 
 === How do we want subvolumes to work from a user perspective? ===
 
 1) Users need to be able to create their own subvolumes.  The permission
 semantics will be absolutely the same as creating directories, so I don't 
 think
 this is too tricky.  We want this because you can only take snapshots of
 subvolumes, and so it is important that users be able to create their own
 discrete snapshottable targets.
 
 2) Users need to be able to snapshot their subvolumes.  This is basically the
 same as #1, but it bears repeating.
 
 3) Subvolumes shouldn't need to be specifically mounted.  This is also
 important, we don't want users to have to go around mounting their subvolumes 
 up
 manually one-by-one.  Today users just cd into subvolumes and it works, just
 like cd'ing into a directory.
 
 === Quotas ===
 
 This is a huge topic in and of itself, but Christoph mentioned wanting to have
 an idea of what we wanted to do with it, so I'm putting it here.  There are
 really 2 things here
 
 1) Limiting the size of subvolumes.  This is really easy for us, just create a
 subvolume and at creation time set a maximum size it can grow to and not let 
 it
 go farther than that.  Nice, simple and straightforward.
 
 2) Normal quotas, via the quota tools.  This just comes down to how do we want
 to charge users, do we want to do it per subvolume, or per filesystem.  My 
 vote
 is per filesystem.  Obviously this will make it tricky with snapshots, but I
 think if we're just charging the diff's between the original volume and the
 snapshot to the user then that will be the easiest for people to understand,
 rather than making a snapshot all of a sudden count the users currently used
 quota * 2.
 
 === What do we do? ===
 
 This is where I expect to see the most discussion.  Here is what I want to do
 
 1) Scrap the 256 inode number thing.  Instead we'll just put a flag in the 
 inode
 to say Hey, I'm a subvolume and then we can do all of the appropriate magic
 that way. 

Re: What to do about subvolumes?

2010-12-03 Thread J. Bruce Fields
On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
 So now that I've actually looked at everything, it looks like the semantics 
 are
 all right for subvolumes
 
 1) readdir - we return the root id in d_ino, which is unique across the fs

Though Michael Vrable pointed out an apparent collision with normal
inode numbers on the parent filesystem?

 2) stat - we return 256 for all subvolumes, because that is their inode number
 3) dev_t - we setup an anon super for all volumes, so they all get their own
 dev_t, which is set properly for all of their children, see below
 
 [r...@test1244 btrfs-test]# stat .
   File: `.'
   Size: 20  Blocks: 8  IO Block: 4096   directory
 Device: 15h/21d Inode: 256 Links: 1
 Access: (0555/dr-xr-xr-x)  Uid: (0/root)   Gid: (0/root)
 Access: 2010-12-03 15:35:41.931679393 -0500
 Modify: 2010-12-03 15:35:20.405679493 -0500
 Change: 2010-12-03 15:35:20.405679493 -0500
 
 [r...@test1244 btrfs-test]# stat foo
   File: `foo'
   Size: 12  Blocks: 0  IO Block: 4096   directory
 Device: 19h/25d Inode: 256 Links: 1
 Access: (0700/drwx--)  Uid: (0/root)   Gid: (0/root)
 Access: 2010-12-03 15:35:17.501679393 -0500
 Modify: 2010-12-03 15:35:59.150680051 -0500
 Change: 2010-12-03 15:35:59.150680051 -0500
 
 [r...@test1244 btrfs-test]# stat foo/foobar 
   File: `foo/foobar'
   Size: 0   Blocks: 0  IO Block: 4096   regular empty file
 Device: 19h/25d Inode: 257 Links: 1
 Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
 Access: 2010-12-03 15:35:59.150680051 -0500
 Modify: 2010-12-03 15:35:59.150680051 -0500
 Change: 2010-12-03 15:35:59.150680051 -0500
 
 So as far as the user is concerned, everything should come out right.  
 Obviously
 we had to do the NFS trickery still because as far as VFS is concerned the
 subvolumes are all on the same mount.  So the question is this (and really 
 this
 is directed at Christoph and Bruce and anybody else who may care), is this 
 good
 enough, or do we want to have a seperate vfsmount for each subvolume?  Thanks,

For nfsd's purposes, we need to be able find out about filesystems in
two different ways:

1. Lookup by filehandle: we need to be able to identify which
subvolume we're dealing with from a filehandle.
2. Lookup by path: we need to notice when we cross into a
subvolume.

Looks like #1 already works.  Not #2: the current nfsd code just checks
for mountpoints.  We could modify nfsd to also check whether dev_t
changed each time it did a lookup.  I suppose it would work, though it's
annoying to have to do it just for the case of btrfs.

As far as I can tell, crossing into a subvolume is like crossing a
mountpoint in every way except for the lack of a separate vfsmount.  I'd
worry that the inconsistency will end up requiring more special cases
down the road, but I don't have any in mind.

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-03 Thread Dave Chinner
On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
 On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
  Hello,
  
  Various people have complained about how BTRFS deals with subvolumes 
  recently,
  specifically the fact that they all have the same inode number, and there's 
  no
  discrete seperation from one subvolume to another.  Christoph asked that I 
  lay
  out a basic design document of how we want subvolumes to work so we can hash
  everything out now, fix what is broken, and then move forward with a design 
  that
  everybody is more or less happy with.  I apologize in advance for how 
  freaking
  long this email is going to be.  I assume that most people are generally
  familiar with how BTRFS works, so I'm not going to bother explaining in 
  great
  detail some stuff.
  

  are things that cannot be fixed now.  Some of these changes will require
  incompat format changes, but it's either we fix it now, or later on down the
  road when BTRFS starts getting used in production really find out how many
  things our current scheme breaks and then have to do the changes then.  
  Thanks,
  
 
 So now that I've actually looked at everything, it looks like the semantics 
 are
 all right for subvolumes
 
 1) readdir - we return the root id in d_ino, which is unique across the fs
 2) stat - we return 256 for all subvolumes, because that is their inode number
 3) dev_t - we setup an anon super for all volumes, so they all get their own
 dev_t, which is set properly for all of their children, see below

A property of NFS fileshandles is that they must be stable across
server reboots. Is this anon dev_t used as part of the NFS
filehandle and if so how can you guarantee that it is stable?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-03 Thread Chris Mason
Excerpts from Dave Chinner's message of 2010-12-03 17:27:56 -0500:
 On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
  On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
   Hello,
   
   Various people have complained about how BTRFS deals with subvolumes 
   recently,
   specifically the fact that they all have the same inode number, and 
   there's no
   discrete seperation from one subvolume to another.  Christoph asked that 
   I lay
   out a basic design document of how we want subvolumes to work so we can 
   hash
   everything out now, fix what is broken, and then move forward with a 
   design that
   everybody is more or less happy with.  I apologize in advance for how 
   freaking
   long this email is going to be.  I assume that most people are generally
   familiar with how BTRFS works, so I'm not going to bother explaining in 
   great
   detail some stuff.
   
 
   are things that cannot be fixed now.  Some of these changes will require
   incompat format changes, but it's either we fix it now, or later on down 
   the
   road when BTRFS starts getting used in production really find out how many
   things our current scheme breaks and then have to do the changes then.  
   Thanks,
   
  
  So now that I've actually looked at everything, it looks like the semantics 
  are
  all right for subvolumes
  
  1) readdir - we return the root id in d_ino, which is unique across the fs
  2) stat - we return 256 for all subvolumes, because that is their inode 
  number
  3) dev_t - we setup an anon super for all volumes, so they all get their own
  dev_t, which is set properly for all of their children, see below
 
 A property of NFS fileshandles is that they must be stable across
 server reboots. Is this anon dev_t used as part of the NFS
 filehandle and if so how can you guarantee that it is stable?

It isn't today, that's something we'll have to address.

-chris
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-03 Thread J. Bruce Fields
On Fri, Dec 03, 2010 at 05:29:24PM -0500, Chris Mason wrote:
 Excerpts from Dave Chinner's message of 2010-12-03 17:27:56 -0500:
  On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
   On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
Hello,

Various people have complained about how BTRFS deals with subvolumes 
recently,
specifically the fact that they all have the same inode number, and 
there's no
discrete seperation from one subvolume to another.  Christoph asked 
that I lay
out a basic design document of how we want subvolumes to work so we can 
hash
everything out now, fix what is broken, and then move forward with a 
design that
everybody is more or less happy with.  I apologize in advance for how 
freaking
long this email is going to be.  I assume that most people are generally
familiar with how BTRFS works, so I'm not going to bother explaining in 
great
detail some stuff.

  
are things that cannot be fixed now.  Some of these changes will require
incompat format changes, but it's either we fix it now, or later on 
down the
road when BTRFS starts getting used in production really find out how 
many
things our current scheme breaks and then have to do the changes then.  
Thanks,

   
   So now that I've actually looked at everything, it looks like the 
   semantics are
   all right for subvolumes
   
   1) readdir - we return the root id in d_ino, which is unique across the fs
   2) stat - we return 256 for all subvolumes, because that is their inode 
   number
   3) dev_t - we setup an anon super for all volumes, so they all get their 
   own
   dev_t, which is set properly for all of their children, see below
  
  A property of NFS fileshandles is that they must be stable across
  server reboots. Is this anon dev_t used as part of the NFS
  filehandle and if so how can you guarantee that it is stable?
 
 It isn't today, that's something we'll have to address.

We're using statfs64.fs_fsid for this; I believe that's both stable
across reboots and distinguishes between subvolumes, so that's OK.

(That said, since fs_fsid doesn't work for other filesystems, we depend
on an explicit check for a filesystem type of btrfs, which is
awful--btrfs won't always be the only filesystem that wants to do this
kind of thing, etc.)

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-03 Thread Andreas Dilger
On 2010-12-03, at 15:45, J. Bruce Fields wrote:
 We're using statfs64.fs_fsid for this; I believe that's both stable
 across reboots and distinguishes between subvolumes, so that's OK.
 
 (That said, since fs_fsid doesn't work for other filesystems, we depend
 on an explicit check for a filesystem type of btrfs, which is
 awful--btrfs won't always be the only filesystem that wants to do this
 kind of thing, etc.)

Sigh, I wanted to be able to specify the NFS FSID directly from within the 
kernel for Lustre many years already.  Glad to see that this is moving forward.

Any chance we can add a -get_fsid(sb, inode) method to export_operations
(or something simiar), that allows the filesystem to generate an FSID based on 
the volume and inode that is being exported?

Cheers, Andreas





--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html