Re: [RFC PATCH 2/4 v2] Btrfs: avoid transaction stuff when readonly

2010-12-01 Thread liubo
On 12/02/2010 01:41 PM, Mike Fedyk wrote:
> On Wed, Dec 1, 2010 at 8:28 PM, Yan, Zheng  wrote:
>> On Thu, Dec 2, 2010 at 11:42 AM, liubo  wrote:
>>> On 12/01/2010 06:20 PM, liubo wrote:
 When the filesystem is readonly, avoid transaction stuff by checking
 MS_RDONLY at start transaction time.

>>> This patch may lead btrfs to panic.
>>>
>>> Since btrfs allows transactions under a readonly fs state, which is a bit
>>> weird, btrfs does not even check the transaction returned from
>>> start_transaction, although it may return -ENOMEM.
>> btrfs may do log replay even when mounted readonly.
>>
> 
> What part is logged besides tree roots and/or superblocks?

The log tree is used for log replay after a crash, and for fast fsync and
O_SYNC; it logs inodes.




Re: [RFC PATCH 2/4 v2] Btrfs: avoid transaction stuff when readonly

2010-12-01 Thread Mike Fedyk
On Wed, Dec 1, 2010 at 8:28 PM, Yan, Zheng  wrote:
> On Thu, Dec 2, 2010 at 11:42 AM, liubo  wrote:
>> On 12/01/2010 06:20 PM, liubo wrote:
>>> When the filesystem is readonly, avoid transaction stuff by checking
>>> MS_RDONLY at start transaction time.
>>>
>>
>> This patch may lead btrfs to panic.
>>
>> Since btrfs allows transactions under a readonly fs state, which is a bit
>> weird, btrfs does not even check the transaction returned from
>> start_transaction, although it may return -ENOMEM.
>
> btrfs may do log replay even when mounted readonly.
>

What part is logged besides tree roots and/or superblocks?


Re: [RFC PATCH 2/4 v2] Btrfs: avoid transaction stuff when readonly

2010-12-01 Thread liubo
On 12/02/2010 12:28 PM, Yan, Zheng wrote:
> On Thu, Dec 2, 2010 at 11:42 AM, liubo  wrote:
>> On 12/01/2010 06:20 PM, liubo wrote:
>>> When the filesystem is readonly, avoid transaction stuff by checking
>>> MS_RDONLY at start transaction time.
>>>
>> This patch may lead btrfs to panic.
>>
>> Since btrfs allows transactions under a readonly fs state, which is a bit
>> weird, btrfs does not even check the transaction returned from
>> start_transaction, although it may return -ENOMEM.
> 
> btrfs may do log replay even when mounted readonly.

Yeah, that is right.

Log replay may indeed take place when btrfs is mounted readonly, but after
the FS is broken, should btrfs still be willing to do log replay in that
case?

thanks,
Liu Bo




Re: What to do about subvolumes?

2010-12-01 Thread Mike Fedyk
On Wed, Dec 1, 2010 at 3:32 PM, Freddie Cash  wrote:
> On Wed, Dec 1, 2010 at 1:28 PM, Hugo Mills  wrote:
>> On Wed, Dec 01, 2010 at 12:24:28PM -0800, Freddie Cash wrote:
>>> On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills  wrote:
>>> >>  The idea is you are only charged for what blocks
>>> >> you have on the disk.  Thanks,
>>> >
>>> >   My point was that it's perfectly possible to have blocks on the
>>> > disk that are effectively owned by two people, and that the person to
>>> > charge for those blocks is, to me, far from clear. You either end up
>>> > charging twice for a single set of blocks on the disk, or you end up
>>> > in a situation where one person's actions can cause another person's
>>> > quota to fill up. Neither of these is particularly obvious behaviour.
>>>
>>> As a sysadmin and as a user, quotas shouldn't be about "physical
>>> blocks of storage used" but should be about "logical storage used".
>>>
>>> IOW, if the filesystem is compressed, using 1 GB of physical space to
>>> store 10 GB of data, my "quota used" should be 10 GB.
>>>
>>> Similar for deduplication.  The quota is based on the storage *before*
>>> the file is deduped.  Not after.
>>>
>>> Similar for snapshots.  If UserA has 10 GB of quota used, I snapshot
>>> their filesystem, then my "quota used" would be 10 GB as well.  As
>>> data in my snapshot changes, my "quota used" is updated to reflect
>>> that (change 1 GB of data compared to snapshot, use 1 GB of quota).
>>
>>   So if I've got 10G of data, and I snapshot it, I've just used
>> another 10G of quota?
>
> Sorry, forgot the "per user" bit above.
>
> If UserA has 10 GB of data, then UserB snapshots it, UserB's quota
> usage is 10 GB.
>
> If UserA has 10 GB of data and snapshots it, then only 10 GB of quota
> usage is used, as there is 0 difference between the snapshot and the
> filesystem.  As UserA modifies data, their quota usage increases by
> the amount that is modified (ie 10 GB data, snapshot, modify 1 GB data
> == 11 GB quota usage).
>
> If you combine the two scenarios, you end up with:
>  - UserA has 10 GB of data == 10 GB quota usage
>  - UserB snapshots UserA's filesystem (clone), so UserB has 10 GB
> quota usage (even though 0 blocks have changed on disk)

Please define where the owner of a subvolume/snapshot is stored.

To my knowledge, when you make a snapshot, you get the same set of
files with the same set of owners and groups.  Whichever user takes
the snapshot, that does not change unless chown or chgrp is used.

Also, a non-root user (or a process without CAP_whatever) should not be
able to snapshot a subvolume whose root directory is not owned by the
user attempting the snapshot.  Otherwise you end up with the same
security and quota issues that hard links have when you don't have
separate filesystems.

You could have separate subvolumes for / and /home/foo, and user foo
could snapshot / to /home/foo/exploit_later_001.  Then foo can just
wait for an exploit to come along for one of the binaries or libs in
/home/foo/exploit_later_001 and own the system.

Yes, snapshot creation should be more restricted than hard links, for
good reason.

I have other questions but the answer to this fundamental game changer
may solve many of the mentioned issues.

>  - UserA snapshots UserA's filesystem == no change to quota usage (no
> blocks on disk have changed)
>  - UserA modifies 1 GB of data in the filesystem == 1 GB new quota
> usage (11 GB total) (1 GB of blocks owned by UserA have changed, plus
> the 10 GB in the snapshot)
>  - UserB still only has 10 GB quota usage, since their snapshot
> hasn't changed (0 blocks changed)
>
> If UserA deletes their filesystem and all their snapshots, freeing up
> 11 GB of quota usage on their account, UserB's quota will still be 10
> GB, and the blocks on the disk aren't actually removed (still
> referenced by UserB's snapshot).
>
> Basically, within a user's account, only the data unique to a snapshot
> should count toward the quota.
>
> Across accounts, the original (root) snapshot would count completely
> to the new user's quota, and then only data unique to subsequent
> snapshots would count.
>
> I hope that makes it more clear.  :)  All the different layers and
> whatnot get confusing.  :)


Re: [RFC PATCH 2/4 v2] Btrfs: avoid transaction stuff when readonly

2010-12-01 Thread Yan, Zheng
On Thu, Dec 2, 2010 at 11:42 AM, liubo  wrote:
> On 12/01/2010 06:20 PM, liubo wrote:
>> When the filesystem is readonly, avoid transaction stuff by checking
>> MS_RDONLY at start transaction time.
>>
>
> This patch may lead btrfs to panic.
>
> Since btrfs allows transactions under a readonly fs state, which is a bit
> weird, btrfs does not even check the transaction returned from
> start_transaction, although it may return -ENOMEM.

btrfs may do log replay even when mounted readonly.


Re: [RFC PATCH 2/4 v2] Btrfs: avoid transaction stuff when readonly

2010-12-01 Thread liubo
On 12/01/2010 06:20 PM, liubo wrote:
> When the filesystem is readonly, avoid transaction stuff by checking
> MS_RDONLY at start transaction time.
> 

This patch may lead btrfs to panic.

Since btrfs allows transactions under a readonly fs state, which is a bit
weird, btrfs does not even check the transaction returned from
start_transaction, although it may return -ENOMEM.

With this patch, if btrfs flips readonly or is mounted readonly, starting a
transaction will return -EROFS. So we need to check the returned transaction
more carefully, rather than just leaving it alone.
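
Callers would then have to adopt the kernel's ERR_PTR()/IS_ERR() convention
for the returned handle. A minimal, self-contained userspace sketch of that
pattern (the helpers mirror include/linux/err.h; start_transaction() here is
a stand-in for illustration, not the real btrfs function):

#include <errno.h>
#include <stdio.h>

#define MAX_ERRNO	4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* stand-in: fails with -EROFS when the fs is readonly, as in the patch */
static void *start_transaction(int readonly)
{
	static char handle;

	return readonly ? ERR_PTR(-EROFS) : &handle;
}

int main(void)
{
	void *trans = start_transaction(1);

	if (IS_ERR(trans)) {	/* every caller now needs this check */
		printf("start_transaction: %ld\n", PTR_ERR(trans));
		return 1;
	}
	return 0;
}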

thanks,
Liu Bo

> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/transaction.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 1fffbc0..14a597d 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -181,6 +181,9 @@ static struct btrfs_trans_handle *start_transaction(struct btrfs_root *root,
>  	struct btrfs_trans_handle *h;
>  	struct btrfs_transaction *cur_trans;
>  	int ret;
> +
> +	if (root->fs_info->sb->s_flags & MS_RDONLY)
> +		return ERR_PTR(-EROFS);
>  again:
>  	h = kmem_cache_alloc(btrfs_trans_handle_cachep, GFP_NOFS);
>  	if (!h)



Re: [RFC PATCH 4/4 v2] Btrfs: deal with filesystem state at mount, umount

2010-12-01 Thread liubo
On 12/02/2010 10:29 AM, Tsutomu Itoh wrote:
> Hi,
> 
> I found 1 typo.
> 
> (2010/12/01 19:21), liubo wrote:
>> Since there is a filesystem state, we should deal with it carefully at mount,
>>  umount and remount.
>>
>> - At mount, the FS state should be checked for errors.  If there are
>>   any, running btrfsck is recommended.
>> - At umount, the FS state should be saved into disk for consistency.
>>
>> Signed-off-by: Liu Bo 
>> ---
>>  fs/btrfs/disk-io.c |   47 ++-
>>  1 files changed, 46 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index b40dfe4..663d360 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -43,6 +43,8 @@
>>  static struct extent_io_ops btree_extent_io_ops;
>>  static void end_workqueue_fn(struct btrfs_work *work);
>>  static void free_fs_root(struct btrfs_root *root);
>> +static void btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
>> + int read_only);
>>  
>>  /*
>>   * end_io_wq structs are used to do processing in task context when an IO is
>> @@ -1700,6 +1702,11 @@ struct btrfs_root *open_ctree(struct super_block *sb,
>>  if (!btrfs_super_root(disk_super))
>>  goto fail_iput;
>>  
>> +/* check filesystem state */
>> +fs_info->fs_state |= btrfs_super_flags(disk_super);
>> +
>> +btrfs_check_super_valid(fs_info, sb->s_flags & MS_RDONLY);
>> +
>>  ret = btrfs_parse_options(tree_root, options);
>>  if (ret) {
>>  err = ret;
>> @@ -2405,10 +2412,17 @@ int btrfs_commit_super(struct btrfs_root *root)
>>  up_write(&root->fs_info->cleanup_work_sem);
>>  
>>  trans = btrfs_join_transaction(root, 1);
>> +if (IS_ERR(trans))
>> +return PTR_ERR(trans);
>> +
>>  ret = btrfs_commit_transaction(trans, root);
>>  BUG_ON(ret);
>> +
>>  /* run commit again to drop the original snapshot */
>>  trans = btrfs_join_transaction(root, 1);
>> +if (IS_ERR(trans))
>> +return PTR_ERR(trans);
>> +
>>  btrfs_commit_transaction(trans, root);
>>  ret = btrfs_write_and_wait_transaction(NULL, root);
>>  BUG_ON(ret);
>> @@ -2426,8 +2440,28 @@ int close_ctree(struct btrfs_root *root)
>>  smp_mb();
>>  
>>  btrfs_put_block_group_cache(fs_info);
>> +
>> +/*
>> + * Here come 2 situations when btrfs flips readonly:
>> + *
>> + * 1. when btrfs flips readonly somewhere else before
>> + * btrfs_commit_super, sb->s_flags has MS_RDONLY flag,
>> + * and btrfs will skip to write sb directly to keep
>> + * ERROR state on disk.
>> + *
>> + * 2. when btrfs flips readonly just in btrfs_commit_super,
>> + * and in such case, btrfs cannot write sb via btrfs_commit_super,
>> + * and since fs_state has been set BTRFS_SUPER_FLAG_ERROR flag,
>> + * btrfs will directly write sb.
>> + */
>>  if (!(fs_info->sb->s_flags & MS_RDONLY)) {
>> -ret =  btrfs_commit_super(root);
>> +ret = btrfs_commit_super(root);
>> +if (ret)
>> +printk(KERN_ERR "btrfs: commit super ret %d\n", ret);
>> +}
>> +
>> +if (fs_info->fs_state & BTRFS_SUPER_FLAG_ERROR) {
>> +ret = write_ctree_super(NULL, root, 0);
>>  if (ret)
>>  printk(KERN_ERR "btrfs: commit super ret %d\n", ret);
>>  }
>> @@ -2603,6 +2637,17 @@ out:
>>  return 0;
>>  }
>>  
>> +static void btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
>> +  int read_only)
>> +{
>> +if (read_only)
>> +return;
>> +
>> +if (fs_info->fs_state & BTRFS_SUPER_FLAG_ERROR)
>> +printk(KERN_WARNING "warning: mount fs with errors, "
>> +   "running btfsck is recommended\n");
> 
> btfsck -> btrfsck

ahh, my fault, sorry for my carelessness.

Thanks a lot for reviewing.

thanks,
Liu Bo

> 
>> +}
>> +
>>  static struct extent_io_ops btree_extent_io_ops = {
>>  .write_cache_pages_lock_hook = btree_lock_page_hook,
>>  .readpage_end_io_hook = btree_readpage_end_io_hook,
> 
> Regards,
> Itoh
> 



Re: [RFC PATCH 4/4 v2] Btrfs: deal with filesystem state at mount, umount

2010-12-01 Thread Tsutomu Itoh
Hi,

I found 1 typo.

(2010/12/01 19:21), liubo wrote:
> Since there is a filesystem state, we should deal with it carefully at mount,
>  umount and remount.
> 
> - At mount, the FS state should be checked for errors.  If there are
>   any, running btrfsck is recommended.
> - At umount, the FS state should be saved into disk for consistency.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/disk-io.c |   47 ++-
>  1 files changed, 46 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b40dfe4..663d360 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -43,6 +43,8 @@
>  static struct extent_io_ops btree_extent_io_ops;
>  static void end_workqueue_fn(struct btrfs_work *work);
>  static void free_fs_root(struct btrfs_root *root);
> +static void btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
> +  int read_only);
>  
>  /*
>   * end_io_wq structs are used to do processing in task context when an IO is
> @@ -1700,6 +1702,11 @@ struct btrfs_root *open_ctree(struct super_block *sb,
>   if (!btrfs_super_root(disk_super))
>   goto fail_iput;
>  
> + /* check filesystem state */
> + fs_info->fs_state |= btrfs_super_flags(disk_super);
> +
> + btrfs_check_super_valid(fs_info, sb->s_flags & MS_RDONLY);
> +
>   ret = btrfs_parse_options(tree_root, options);
>   if (ret) {
>   err = ret;
> @@ -2405,10 +2412,17 @@ int btrfs_commit_super(struct btrfs_root *root)
>   up_write(&root->fs_info->cleanup_work_sem);
>  
>   trans = btrfs_join_transaction(root, 1);
> + if (IS_ERR(trans))
> + return PTR_ERR(trans);
> +
>   ret = btrfs_commit_transaction(trans, root);
>   BUG_ON(ret);
> +
>   /* run commit again to drop the original snapshot */
>   trans = btrfs_join_transaction(root, 1);
> + if (IS_ERR(trans))
> + return PTR_ERR(trans);
> +
>   btrfs_commit_transaction(trans, root);
>   ret = btrfs_write_and_wait_transaction(NULL, root);
>   BUG_ON(ret);
> @@ -2426,8 +2440,28 @@ int close_ctree(struct btrfs_root *root)
>   smp_mb();
>  
>   btrfs_put_block_group_cache(fs_info);
> +
> + /*
> +  * Here come 2 situations when btrfs flips readonly:
> +  *
> +  * 1. when btrfs flips readonly somewhere else before
> +  * btrfs_commit_super, sb->s_flags has MS_RDONLY flag,
> +  * and btrfs will skip to write sb directly to keep
> +  * ERROR state on disk.
> +  *
> +  * 2. when btrfs flips readonly just in btrfs_commit_super,
> +  * and in such case, btrfs cannot write sb via btrfs_commit_super,
> +  * and since fs_state has been set BTRFS_SUPER_FLAG_ERROR flag,
> +  * btrfs will directly write sb.
> +  */
>   if (!(fs_info->sb->s_flags & MS_RDONLY)) {
> - ret =  btrfs_commit_super(root);
> + ret = btrfs_commit_super(root);
> + if (ret)
> + printk(KERN_ERR "btrfs: commit super ret %d\n", ret);
> + }
> +
> + if (fs_info->fs_state & BTRFS_SUPER_FLAG_ERROR) {
> + ret = write_ctree_super(NULL, root, 0);
>   if (ret)
>   printk(KERN_ERR "btrfs: commit super ret %d\n", ret);
>   }
> @@ -2603,6 +2637,17 @@ out:
>   return 0;
>  }
>  
> +static void btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
> +   int read_only)
> +{
> + if (read_only)
> + return;
> +
> + if (fs_info->fs_state & BTRFS_SUPER_FLAG_ERROR)
> + printk(KERN_WARNING "warning: mount fs with errors, "
> +"running btfsck is recommended\n");

btfsck -> btrfsck

> +}
> +
>  static struct extent_io_ops btree_extent_io_ops = {
>   .write_cache_pages_lock_hook = btree_lock_page_hook,
>   .readpage_end_io_hook = btree_readpage_end_io_hook,

Regards,
Itoh



Re: What to do about subvolumes?

2010-12-01 Thread Michael Vrable

On Wed, Dec 01, 2010 at 03:09:52PM -0500, Josef Bacik wrote:
> On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote:
> > I think you're already fine:
> > 
> > # mkdir TMP
> > # dd if=/dev/zero of=TMP-image bs=1M count=512
> > # mkfs.btrfs TMP-image
> > # mount -oloop TMP-image TMP/
> > # btrfs subvolume create sub-a
> > # btrfs subvolume create sub-b
> > ../readdir-inos .
> > . 256 256
> > .. 256 4130609
> > sub-a 256 256
> > sub-b 257 256
> > 
> > Where readdir-inos is my silly test program below, and the first 
> > number is from readdir, the second from stat.
> 
> Heh as soon as I typed my email I went and actually looked at the 
> code, looks like for readdir we fill in the root id, which will be 
> unique, so hotdamn we are good and I don't have to use a stupid 
> incompat flag.  Thanks for checking that :),

Except, aren't the inode numbers within a filesystem and the subvolume 
tree IDs allocated out of separate namespaces?  I don't think there's 
anything preventing a file/directory from having an inode number that 
clashes with one of the snapshots.

In fact, this already happens in the example above: "." (inode 256 in 
the root subvolume) and "sub-a" (subvolume ID 256).

(Though I still don't understand the semantics well enough to say 
whether we need all the inode numbers returned by readdir to be 
distinct.)


--Michael Vrable


Re: What to do about subvolumes?

2010-12-01 Thread Freddie Cash
On Wed, Dec 1, 2010 at 1:28 PM, Hugo Mills  wrote:
> On Wed, Dec 01, 2010 at 12:24:28PM -0800, Freddie Cash wrote:
>> On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills  wrote:
>> >>  The idea is you are only charged for what blocks
>> >> you have on the disk.  Thanks,
>> >
>> >   My point was that it's perfectly possible to have blocks on the
>> > disk that are effectively owned by two people, and that the person to
>> > charge for those blocks is, to me, far from clear. You either end up
>> > charging twice for a single set of blocks on the disk, or you end up
>> > in a situation where one person's actions can cause another person's
>> > quota to fill up. Neither of these is particularly obvious behaviour.
>>
>> As a sysadmin and as a user, quotas shouldn't be about "physical
>> blocks of storage used" but should be about "logical storage used".
>>
>> IOW, if the filesystem is compressed, using 1 GB of physical space to
>> store 10 GB of data, my "quota used" should be 10 GB.
>>
>> Similar for deduplication.  The quota is based on the storage *before*
>> the file is deduped.  Not after.
>>
>> Similar for snapshots.  If UserA has 10 GB of quota used, I snapshot
>> their filesystem, then my "quota used" would be 10 GB as well.  As
>> data in my snapshot changes, my "quota used" is updated to reflect
>> that (change 1 GB of data compared to snapshot, use 1 GB of quota).
>
>   So if I've got 10G of data, and I snapshot it, I've just used
> another 10G of quota?

Sorry, forgot the "per user" bit above.

If UserA has 10 GB of data, then UserB snapshots it, UserB's quota
usage is 10 GB.

If UserA has 10 GB of data and snapshots it, then only 10 GB of quota
usage is used, as there is 0 difference between the snapshot and the
filesystem.  As UserA modifies data, their quota usage increases by
the amount that is modified (ie 10 GB data, snapshot, modify 1 GB data
== 11 GB quota usage).

If you combine the two scenarios, you end up with:
  - UserA has 10 GB of data == 10 GB quota usage
  - UserB snapshots UserA's filesystem (clone), so UserB has 10 GB
quota usage (even though 0 blocks have changed on disk)
  - UserA snapshots UserA's filesystem == no change to quota usage (no
blocks on disk have changed)
  - UserA modifies 1 GB of data in the filesystem == 1 GB new quota
usage (11 GB total) (1 GB of blocks owned by UserA have changed, plus
the 10 GB in the snapshot)
  - UserB still only has 10 GB quota usage, since their snapshot
hasn't changed (0 blocks changed)

If UserA deletes their filesystem and all their snapshots, freeing up
11 GB of quota usage on their account, UserB's quota will still be 10
GB, and the blocks on the disk aren't actually removed (still
referenced by UserB's snapshot).

Basically, within a user's account, only the data unique to a snapshot
should count toward the quota.

Across accounts, the original (root) snapshot would count completely
to the new user's quota, and then only data unique to subsequent
snapshots would count.

I hope that makes it more clear.  :)  All the different layers and
whatnot get confusing.  :)
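
Since the arithmetic above is easy to lose track of, here is a toy C model
of the proposed logical accounting (the struct and the numbers are just
this example scenario, not any real btrfs interface):

#include <stdio.h>

/* toy model: a cross-user snapshot charges its full logical size to
 * the new owner; after that, each owner is charged only for the data
 * they change */
struct usage { const char *user; long gb; };

int main(void)
{
	struct usage a = { "UserA", 10 };	/* 10 GB of data */
	struct usage b = { "UserB", 0 };

	b.gb += a.gb;	/* UserB snapshots UserA's subvolume */
	/* UserA snapshots its own subvolume: no extra charge */
	a.gb += 1;	/* UserA then modifies 1 GB */

	/* prints "UserA: 11 GB, UserB: 10 GB" */
	printf("%s: %ld GB, %s: %ld GB\n", a.user, a.gb, b.user, b.gb);
	return 0;
}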

-- 
Freddie Cash
fjwc...@gmail.com


Re: disk space caching generation mismatch

2010-12-01 Thread Johannes Hirte
On Wednesday 01 December 2010 22:22:45 Johannes Hirte wrote:
> On Wednesday 01 December 2010 21:03:13 Josef Bacik wrote:
> > On Wed, Dec 01, 2010 at 08:56:14PM +0100, Johannes Hirte wrote:
> > > On Wednesday 01 December 2010 18:40:18 Josef Bacik wrote:
> > > > On Wed, Dec 01, 2010 at 05:46:14PM +0100, Johannes Hirte wrote:
> > > > > After enabling disk space caching I've observed several log entries 
> > > > > like this:
> > > > > 
> > > > > btrfs: free space inode generation (0) did not match free space cache 
> > > > > generation (169594) for block group 15464398848
> > > > > 
> > > > > I'm not sure, but it seems this happens on every reboot. Is this 
> > > > > something to
> > > > > worry about?
> > > > > 
> > > > 
> > > > So that usually means 1 of a couple of things
> > > > 
> > > > 1) You didn't have space for us to save the free space cache
> > > > 2) When trying to write out the cache we hit one of those cases where 
> > > > we would
> > > > deadlock so we couldn't write the cache out
> > > > 
> > > > It's nothing to worry about, it's doing what it is supposed to.  
> > > > However I'd
> > > > like to know why we're not able to write out the cache.  Are you 
> > > > running close
> > > > to full?  Thanks,
> > > > 
> > > > Josef
> > > >
> > > 
> > > I think there should be enough free space:
> > > 
> > 
> > Hmm well then we're hitting one of the other corner cases.  Can you run with
> > this debug thread and reboot.  Hopefully it will tell me why we're not 
> > saving
> > the free space cache. Thanks,
> > 
> > Josef
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 87aae66..4fd5659 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -2794,13 +2794,17 @@ again:
> > if (i_size_read(inode) > 0) {
> > ret = btrfs_truncate_free_space_cache(root, trans, path,
> >   inode);
> > -   if (ret)
> > +   if (ret) {
> > +   printk(KERN_ERR "truncate free space cache failed for 
> > %llu, %d\n",
> > +  block_group->key.objectid, ret);
> > goto out_put;
> > +   }
> > }
> >  
> > spin_lock(&block_group->lock);
> > if (block_group->cached != BTRFS_CACHE_FINISHED) {
> > spin_unlock(&block_group->lock);
> > +   printk(KERN_ERR "block group %llu not cached\n", 
> > block_group->key.objectid);
> > goto out_put;
> > }
> > spin_unlock(&block_group->lock);
> > @@ -2820,8 +2824,10 @@ again:
> > num_pages *= PAGE_CACHE_SIZE;
> >  
> > ret = btrfs_check_data_free_space(inode, num_pages);
> > -   if (ret)
> > +   if (ret) {
> > +   printk(KERN_ERR "not enough free space for cache %llu\n", 
> > block_group->key.objectid);
> > goto out_put;
> > +   }
> >  
> > ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, num_pages,
> >   num_pages, num_pages,
> > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> > index 22ee0dc..0078172 100644
> > --- a/fs/btrfs/free-space-cache.c
> > +++ b/fs/btrfs/free-space-cache.c
> > @@ -511,6 +511,8 @@ int btrfs_write_out_cache(struct btrfs_root *root,
> > spin_lock(&block_group->lock);
> > if (block_group->disk_cache_state < BTRFS_DC_SETUP) {
> > spin_unlock(&block_group->lock);
> > +   printk(KERN_ERR "block group %llu, wrong dcs %d\n", 
> > block_group->key.objectid,
> > +  block_group->disk_cache_state);
> > return 0;
> > }
> > spin_unlock(&block_group->lock);
> > @@ -520,6 +522,7 @@ int btrfs_write_out_cache(struct btrfs_root *root,
> > return 0;
> >  
> > if (!i_size_read(inode)) {
> > +   printk(KERN_ERR "no allocated space for block group %llu\n", 
> > block_group->key.objectid);
> > iput(inode);
> > return 0;
> > }
> > @@ -771,6 +774,7 @@ out_free:
> > block_group->disk_cache_state = BTRFS_DC_ERROR;
> > spin_unlock(&block_group->lock);
> > BTRFS_I(inode)->generation = 0;
> > +   printk(KERN_ERR "problem writing out block group cache for 
> > %llu\n", block_group->key.objectid);
> > }
> > kfree(checksums);
> > btrfs_update_inode(trans, root, inode);
> > 
> 
> This is from dmesg shortly after reboot with the debug patch:
> 
> btrfs: free space inode generation (0) did not match free space cache generation (116974) for block group 14256439296
> btrfs: free space inode generation (0) did not match free space cache generation (116974) for block group 14256439296
> btrfs: free space inode generation (0) did not match free space cache generation (116974) for block group 14256439296
> btrfs: free space inode generation (0) did not match free space cache generation (116974) for block group 14256439296
> btrfs: free space inode generation (0) did not match free space c

Re: What to do about subvolumes?

2010-12-01 Thread Hugo Mills
On Wed, Dec 01, 2010 at 12:24:28PM -0800, Freddie Cash wrote:
> On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills  wrote:
> >>  The idea is you are only charged for what blocks
> >> you have on the disk.  Thanks,
> >
> >   My point was that it's perfectly possible to have blocks on the
> > disk that are effectively owned by two people, and that the person to
> > charge for those blocks is, to me, far from clear. You either end up
> > charging twice for a single set of blocks on the disk, or you end up
> > in a situation where one person's actions can cause another person's
> > quota to fill up. Neither of these is particularly obvious behaviour.
> 
> As a sysadmin and as a user, quotas shouldn't be about "physical
> blocks of storage used" but should be about "logical storage used".
> 
> IOW, if the filesystem is compressed, using 1 GB of physical space to
> store 10 GB of data, my "quota used" should be 10 GB.
> 
> Similar for deduplication.  The quota is based on the storage *before*
> the file is deduped.  Not after.
> 
> Similar for snapshots.  If UserA has 10 GB of quota used, I snapshot
> their filesystem, then my "quota used" would be 10 GB as well.  As
> data in my snapshot changes, my "quota used" is updated to reflect
> that (change 1 GB of data compared to snapshot, use 1 GB of quota).

   So if I've got 10G of data, and I snapshot it, I've just used
another 10G of quota?

> You have to (or at least should) keep two sets of stats for storage usage:
>   - logical amount used ("real" file size, before compression, before
> de-dupe, before snapshots, etc)
>   - physical amount used (what's actually written to disk)
> 
> User-level quotas are based on the logical storage used.
> Admin-level quotas (if you want to implement them) would be based on
> physical storage used.
> 
> Thus, the output of things like df, du, ls would show the "logical"
> storage used and file sizes.  And you would either have an additional
> option to those apps (--real or something) to show the "actual"
> storage used and file sizes as stored on disk.
> 
> Trying to make quotas and disk usage utilities to work based on what's
> physically on disk is just backwards, imo.  And prone to a lot of
> confusion.

   Trying to make quotas work based on what's physically on the disk
appears to have serious issues on the semantics of "using up space",
so I agree with you on this point (and, indeed, it was the point I was
trying to make).

   However, doing it that way also effectively penalises users and
prevents (or severely discourages) them from using the advanced
functions of the filesystem. There's no benefit (in disk usage terms)
to the user in using a snapshot -- they might as well use plain cp.

   Hugo.

-- 
=== Hugo Mills: h...@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- I believe that it's closely correlated with ---   
   the aeroswine coefficient.




hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)

2010-12-01 Thread Mike Snitzer
On Wed, Dec 01 2010 at  3:45pm -0500,
Milan Broz  wrote:

> 
> On 12/01/2010 08:34 PM, Jon Nelson wrote:
> > Perhaps this is useful: for myself, I found that when I started using
> > 2.6.37rc3, postgresql started having a *lot* of problems with
> > corruption. Specifically, I noted zeroed pages, corruption in headers,
> > all sorts of stuff on /newly created/ tables, especially during index
> > creation. I had a fairly high hit rate of failure. I backed off to
> > 2.6.34.7 and have *zero* problems (in fact, prior to 2.6.37rc3, I had
> > never had a corruption issue with postgresql). I ran on 2.6.36 for a
> > few weeks as well, without issue.
> > 
> > I am using kcrypt with lvm on top of that, and ext4 on top of that.
> 
> With unpatched dmcrypt (IOW with Linus' git)? Then it must be an ext4 or
> dm-core problem because there were no patches for dm-crypt...

Matt and Jon,

If you'd be up to it: could you try testing your dm-crypt+ext4
corruption reproducers against the following two 2.6.37-rc commits:

1) 1de3e3df917459422cb2aecac440febc8879d410
then
2) bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc

Then, depending on results of no corruption for those commits, bonus
points for testing the same commits but with Andi and Milan's latest
dm-crypt cpu scalability patch applied too:
https://patchwork.kernel.org/patch/365542/

Thanks!
Mike


Re: disk space caching generation mismatch

2010-12-01 Thread Johannes Hirte
On Wednesday 01 December 2010 21:03:13 Josef Bacik wrote:
> On Wed, Dec 01, 2010 at 08:56:14PM +0100, Johannes Hirte wrote:
> > On Wednesday 01 December 2010 18:40:18 Josef Bacik wrote:
> > > On Wed, Dec 01, 2010 at 05:46:14PM +0100, Johannes Hirte wrote:
> > > > After enabling disk space caching I've observed several log entries 
> > > > like this:
> > > > 
> > > > btrfs: free space inode generation (0) did not match free space cache 
> > > > generation (169594) for block group 15464398848
> > > > 
> > > > I'm not sure, but it seems this happens on every reboot. Is this 
> > > > something to
> > > > worry about?
> > > > 
> > > 
> > > So that usually means 1 of a couple of things
> > > 
> > > 1) You didn't have space for us to save the free space cache
> > > 2) When trying to write out the cache we hit one of those cases where we 
> > > would
> > > deadlock so we couldn't write the cache out
> > > 
> > > It's nothing to worry about, it's doing what it is supposed to.  However 
> > > I'd
> > > like to know why we're not able to write out the cache.  Are you running 
> > > close
> > > to full?  Thanks,
> > > 
> > > Josef
> > >
> > 
> > I think there should be enough free space:
> > 
> 
> Hmm well then we're hitting one of the other corner cases.  Can you run with
> this debug thread and reboot.  Hopefully it will tell me why we're not saving
> the free space cache. Thanks,
> 
> Josef
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 87aae66..4fd5659 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2794,13 +2794,17 @@ again:
>   if (i_size_read(inode) > 0) {
>   ret = btrfs_truncate_free_space_cache(root, trans, path,
> inode);
> - if (ret)
> + if (ret) {
> + printk(KERN_ERR "truncate free space cache failed for 
> %llu, %d\n",
> +block_group->key.objectid, ret);
>   goto out_put;
> + }
>   }
>  
>   spin_lock(&block_group->lock);
>   if (block_group->cached != BTRFS_CACHE_FINISHED) {
>   spin_unlock(&block_group->lock);
> + printk(KERN_ERR "block group %llu not cached\n", 
> block_group->key.objectid);
>   goto out_put;
>   }
>   spin_unlock(&block_group->lock);
> @@ -2820,8 +2824,10 @@ again:
>   num_pages *= PAGE_CACHE_SIZE;
>  
>   ret = btrfs_check_data_free_space(inode, num_pages);
> - if (ret)
> + if (ret) {
> + printk(KERN_ERR "not enough free space for cache %llu\n", 
> block_group->key.objectid);
>   goto out_put;
> + }
>  
>   ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, num_pages,
> num_pages, num_pages,
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 22ee0dc..0078172 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -511,6 +511,8 @@ int btrfs_write_out_cache(struct btrfs_root *root,
>   spin_lock(&block_group->lock);
>   if (block_group->disk_cache_state < BTRFS_DC_SETUP) {
>   spin_unlock(&block_group->lock);
> + printk(KERN_ERR "block group %llu, wrong dcs %d\n", 
> block_group->key.objectid,
> +block_group->disk_cache_state);
>   return 0;
>   }
>   spin_unlock(&block_group->lock);
> @@ -520,6 +522,7 @@ int btrfs_write_out_cache(struct btrfs_root *root,
>   return 0;
>  
>   if (!i_size_read(inode)) {
> + printk(KERN_ERR "no allocated space for block group %llu\n", 
> block_group->key.objectid);
>   iput(inode);
>   return 0;
>   }
> @@ -771,6 +774,7 @@ out_free:
>   block_group->disk_cache_state = BTRFS_DC_ERROR;
>   spin_unlock(&block_group->lock);
>   BTRFS_I(inode)->generation = 0;
> + printk(KERN_ERR "problem writing out block group cache for 
> %llu\n", block_group->key.objectid);
>   }
>   kfree(checksums);
>   btrfs_update_inode(trans, root, inode);
> 

This is from dmesg shortly after reboot with the debug patch:

btrfs: free space inode generation (0) did not match free space cache generation (116974) for block group 14256439296
btrfs: free space inode generation (0) did not match free space cache generation (116974) for block group 14256439296
btrfs: free space inode generation (0) did not match free space cache generation (116974) for block group 14256439296
btrfs: free space inode generation (0) did not match free space cache generation (116974) for block group 14256439296
btrfs: free space inode generation (0) did not match free space cache generation (177986) for block group 5398069248
btrfs: free space inode generation (0) did not match free space cache generation (177986) for block group 5398069248
block group 5398069248 not cached
block group 1

Re: What to do about subvolumes?

2010-12-01 Thread Jeff Layton
On Wed, 1 Dec 2010 21:46:03 +0100
Goffredo Baroncelli  wrote:

> On Wednesday, 01 December, 2010, Jeff Layton wrote:
> > A more common use case than CIFS or samba is going to be things like
> > backup programs. They commonly look at inode numbers in order to
> > identify hardlinks and may be horribly confused when there are files
> > that have a link count >1 and inode number collisions with other files.
> > 
> > That probably qualifies as an "enterprise-ready" show stopper...
> 
> I hope that a backup program uses the pair (inode, fsid) to identify whether
> two files are hardlinked... otherwise a backup of two mounted filesystems
> can be quite dangerous...
> 
> 
> From the statfs(2) man page:
> [..]
> The f_fsid field
> [...]
> The general idea is that f_fsid contains some random stuff such that the pair 
> (f_fsid,ino) uniquely determines a file.  Some operating systems use (a 
> variation on) the device number, or the device number combined  with  the  
> file-system  type.   Several  OSes restrict giving out the f_fsid field to 
> the 
> superuser only (and zero it for unprivileged users), because this field is 
> used in the filehandle of the file system when NFS-exported, and giving it 
> out 
> is a security concern.
> 
> 
> And the btrfs_statfs function returns a different fsid for every subvolume.
> 

Ahh, interesting. I've never read that blurb on f_fsid...

Unfortunately, it looks like not all filesystems fill that field out.
NFS and CIFS leave it conspicuously blank. Those are probably bugs...

OTOH, the GLibc docs say this:

dev_t st_dev
Identifies the device containing the file. The st_ino and st_dev,
taken together, uniquely identify the file. The st_dev value is not
necessarily consistent across reboots or system crashes, however. 

...and it's always been my understanding that a st_dev/st_ino
combination should be unique.

Is there some definitive POSIX statement on why one should prefer to
use f_fsid over st_dev in this situation?
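
For comparison, here is a small sketch that prints both identifiers for one
path (the f_fsid.__val accessor is the glibc layout, so this is a Linux-only
illustration, not a portable interface):

#include <stdio.h>
#include <sys/stat.h>
#include <sys/vfs.h>

/* print (st_dev, st_ino) and f_fsid for a path, so the two schemes
 * can be compared across btrfs subvolume boundaries */
int main(int argc, char *argv[])
{
	struct stat st;
	struct statfs stf;

	if (argc != 2)
		return 1;
	if (stat(argv[1], &st) || statfs(argv[1], &stf))
		return 1;
	printf("%s: st_dev=%lu st_ino=%lu f_fsid=%x:%x\n", argv[1],
	       (unsigned long)st.st_dev, (unsigned long)st.st_ino,
	       (unsigned)stf.f_fsid.__val[0], (unsigned)stf.f_fsid.__val[1]);
	return 0;
}

Run against a file in the root subvolume and one inside a snapshot, this
would make visible the per-subvolume f_fsid that btrfs_statfs returns, per
the quoted note above.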

-- 
Jeff Layton 


Re: dm-crypt barrier support is effective

2010-12-01 Thread Milan Broz

On 12/01/2010 08:34 PM, Jon Nelson wrote:
> Perhaps this is useful: for myself, I found that when I started using
> 2.6.37rc3, postgresql started having a *lot* of problems with
> corruption. Specifically, I noted zeroed pages, corruption in headers,
> all sorts of stuff on /newly created/ tables, especially during index
> creation. I had a fairly high hit rate of failure. I backed off to
> 2.6.34.7 and have *zero* problems (in fact, prior to 2.6.37rc3, I had
> never had a corruption issue with postgresql). I ran on 2.6.36 for a
> few weeks as well, without issue.
> 
> I am using kcrypt with lvm on top of that, and ext4 on top of that.

With unpatched dmcrypt (IOW with Linus' git)? Then it must be an ext4 or
dm-core problem because there were no patches for dm-crypt...

Anyway, thanks for hint how to reproduce it.

Milan


Re: What to do about subvolumes?

2010-12-01 Thread Goffredo Baroncelli
On Wednesday, 01 December, 2010, Jeff Layton wrote:
> A more common use case than CIFS or samba is going to be things like
> backup programs. They commonly look at inode numbers in order to
> identify hardlinks and may be horribly confused when there are files
> that have a link count >1 and inode number collisions with other files.
> 
> That probably qualifies as an "enterprise-ready" show stopper...

I hope that a backup program uses the pair (inode, fsid) to identify whether
two files are hardlinked... otherwise a backup of two mounted filesystems
can be quite dangerous...


From the statfs(2) man page:
[..]
The f_fsid field
[...]
The general idea is that f_fsid contains some random stuff such that the pair 
(f_fsid,ino) uniquely determines a file.  Some operating systems use (a 
variation on) the device number, or the device number combined  with  the  
file-system  type.   Several  OSes restrict giving out the f_fsid field to the 
superuser only (and zero it for unprivileged users), because this field is 
used in the filehandle of the file system when NFS-exported, and giving it out 
is a security concern.


And the btrfs_statfs function returns a different fsid for every subvolume.

-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) 
Key fingerprint = 4769 7E51 5293 D36C 814E  C054 BF04 F161 3DC5 0512


Re: What to do about subvolumes?

2010-12-01 Thread Freddie Cash
On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills  wrote:
> On Wed, Dec 01, 2010 at 12:38:30PM -0500, Josef Bacik wrote:
>> If you delete your subvolume A, like use the btrfs tool to delete it, you 
>> will
>> only be stuck with what you changed in snapshot B.  So if you only changed 
>> 5gig
>> worth of information, and you deleted the original subvolume, you would have
>> 5gig charged to your quota.
>
>   This doesn't work, though, if the owners of the "original" and
> "new" subvolume are different:
>
> Case 1:
>
>  * Porthos creates 10G data.
>  * Athos makes a snapshot of Porthos's data.
>  * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
>   Porthos's data to Athos.
>  * Porthos deletes his copy of the data.
>
> Case 2:
>
>  * Porthos creates 10G of data.
>  * Athos makes a snapshot of Porthos's data.
>  * Porthos deletes his copy of the data.
>  * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
>   Porthos's data to Athos.
>
> Case 3:
>
>  * Porthos creates 10G data.
>  * Athos makes a snapshot of Porthos's data.
>  * Aramis makes a snapshot of Porthos's data.
>  * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
>   Porthos's data to Athos.
>  * Porthos deletes his copy of the data.
>
> Case 4:
>
>  * Porthos creates 10G data.
>  * Athos makes a snapshot of Porthos's data.
>  * Aramis makes a snapshot of Athos's data.
>  * Porthos deletes his copy of the data.
>   [Consider also Richelieu changing ownerships of Athos's and Aramis's
>   data at alternative points in this sequence]
>
>   In each of these, who gets charged (and how much) for their copy of
> the data?
>
>>  The idea is you are only charged for what blocks
>> you have on the disk.  Thanks,
>
>   My point was that it's perfectly possible to have blocks on the
> disk that are effectively owned by two people, and that the person to
> charge for those blocks is, to me, far from clear. You either end up
> charging twice for a single set of blocks on the disk, or you end up
> in a situation where one person's actions can cause another person's
> quota to fill up. Neither of these is particularly obvious behaviour.

As a sysadmin and as a user, quotas shouldn't be about "physical
blocks of storage used" but should be about "logical storage used".

IOW, if the filesystem is compressed, using 1 GB of physical space to
store 10 GB of data, my "quota used" should be 10 GB.

Similar for deduplication.  The quota is based on the storage *before*
the file is deduped.  Not after.

Similar for snapshots.  If UserA has 10 GB of quota used, I snapshot
their filesystem, then my "quota used" would be 10 GB as well.  As
data in my snapshot changes, my "quota used" is updated to reflect
that (change 1 GB of data compared to snapshot, use 1 GB of quota).

You have to (or at least should) keep two sets of stats for storage usage:
  - logical amount used ("real" file size, before compression, before
de-dupe, before snapshots, etc)
  - physical amount used (what's actually written to disk)

User-level quotas are based on the logical storage used.
Admin-level quotas (if you want to implement them) would be based on
physical storage used.

Thus, the output of things like df, du, ls would show the "logical"
storage used and file sizes.  And you would either have an additional
option to those apps (--real or something) to show the "actual"
storage used and file sizes as stored on disk.

Trying to make quotas and disk usage utilities to work based on what's
physically on disk is just backwards, imo.  And prone to a lot of
confusion.

-- 
Freddie Cash
fjwc...@gmail.com


Re: What to do about subvolumes?

2010-12-01 Thread J. Bruce Fields
On Wed, Dec 01, 2010 at 03:09:52PM -0500, Josef Bacik wrote:
> On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote:
> > On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote:
> > > Oh well crud, I was hoping that I could leave the inode numbers as 256 for
> > > everything, but I forgot about readdir.  So the inode item in the parent 
> > > would
> > > have to have a unique inode number that would get spit out in readdir, 
> > > but then
> > > if we stat'ed the directory we'd get 256 for the inode number.  Oh well,
> > > incompat flag it is then.
> > 
> > I think you're already fine:
> > 
> > # mkdir TMP
> > # dd if=/dev/zero of=TMP-image bs=1M count=512
> > # mkfs.btrfs TMP-image
> > # mount -oloop TMP-image TMP/
> > # btrfs subvolume create sub-a
> > # btrfs subvolume create sub-b
> > ../readdir-inos .
> > . 256 256
> > .. 256 4130609
> > sub-a 256 256
> > sub-b 257 256
> > 
> > Where readdir-inos is my silly test program below, and the first number is 
> > from
> > readdir, the second from stat.
> >
> 
> Heh as soon as I typed my email I went and actually looked at the code, looks
> like for readdir we fill in the root id, which will be unique, so hotdamn we 
> are
> good and I don't have to use a stupid incompat flag.  Thanks for checking that
> :),

My only complaint was just about how you said this:

"When you create a subvolume, the directory inode that is
created in the parent subvolume has the inode number of 256"

If you revise that you might want to clarify.  (Maybe "Every subvolume
has a root directory inode with inode number 256"?)

The way you've stated it sounds like you're talking about the
readdir-returned number, which would normally come from the inode that
has been covered up by the mount, and which really is an inode in the
parent filesystem.

--b.


Re: What to do about subvolumes?

2010-12-01 Thread Josef Bacik
On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote:
> On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote:
> > Oh well crud, I was hoping that I could leave the inode numbers as 256 for
> > everything, but I forgot about readdir.  So the inode item in the parent 
> > would
> > have to have a unique inode number that would get spit out in readdir, but 
> > then
> > if we stat'ed the directory we'd get 256 for the inode number.  Oh well,
> > incompat flag it is then.
> 
> I think you're already fine:
> 
>   # mkdir TMP
>   # dd if=/dev/zero of=TMP-image bs=1M count=512
>   # mkfs.btrfs TMP-image
>   # mount -oloop TMP-image TMP/
>   # btrfs subvolume create sub-a
>   # btrfs subvolume create sub-b
>   ../readdir-inos .
>   . 256 256
>   .. 256 4130609
>   sub-a 256 256
>   sub-b 257 256
> 
> Where readdir-inos is my silly test program below, and the first number is 
> from
> readdir, the second from stat.
>

Heh as soon as I typed my email I went and actually looked at the code, looks
like for readdir we fill in the root id, which will be unique, so hotdamn we are
good and I don't have to use a stupid incompat flag.  Thanks for checking that
:),

Josef


Re: What to do about subvolumes?

2010-12-01 Thread Jeff Layton
On Wed, 1 Dec 2010 09:21:36 -0500
Josef Bacik  wrote:

> There is one tricky thing.  When you create a subvolume, the directory inode
> that is created in the parent subvolume has the inode number of 256.  So if 
> you
> have a bunch of subvolumes in the same parent subvolume, you are going to 
> have a
> bunch of directories with the inode number of 256.  This is so when users cd
> into a subvolume we can know it's a subvolume and do all the normal voodoo to
> start looking in the subvolumes tree instead of the parent subvolumes tree.
> 
> This is where things go a bit sideways.  We had serious problems with NFS, but
> thankfully NFS gives us a bunch of hooks to get around these problems.
> CIFS/Samba do not, so we will have problems there, not to mention any other
> userspace application that looks at inode numbers.

A more common use case than CIFS or samba is going to be things like
backup programs. They commonly look at inode numbers in order to
identify hardlinks and may be horribly confused when there are files
that have a link count >1 and inode number collisions with other files.

That probably qualifies as an "enterprise-ready" show stopper...
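
To make that failure mode concrete, here is a minimal sketch of the
(st_dev, st_ino) bookkeeping such tools typically do (illustrative code,
not taken from any particular backup program); duplicate inode numbers
with a link count >1 would collide in exactly this table:

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/* remember each (dev, ino) pair seen with st_nlink > 1; a repeat is
 * treated as "same file", which is only safe if inode numbers are
 * unique within one st_dev */
struct seen { dev_t dev; ino_t ino; };
static struct seen *tab;
static size_t ntab;

static int visit(const char *path, const struct stat *st,
		 int type, struct FTW *ftwbuf)
{
	size_t i;

	if (type != FTW_F || st->st_nlink <= 1)
		return 0;
	for (i = 0; i < ntab; i++)
		if (tab[i].dev == st->st_dev && tab[i].ino == st->st_ino) {
			printf("hardlink of earlier file: %s\n", path);
			return 0;
		}
	tab = realloc(tab, (ntab + 1) * sizeof(*tab));	/* unchecked: sketch */
	tab[ntab].dev = st->st_dev;
	tab[ntab].ino = st->st_ino;
	ntab++;
	return 0;
}

int main(int argc, char *argv[])
{
	return nftw(argc > 1 ? argv[1] : ".", visit, 16, FTW_PHYS) < 0;
}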

> === What do we do? ===
> 
> This is where I expect to see the most discussion.  Here is what I want to do
> 
> 1) Scrap the 256 inode number thing.  Instead we'll just put a flag in the 
> inode
> to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
> that way.  This unfortunately will be an incompatible format change, but the
> sooner we get this addressed the easier it will be in the long run.  Obviously
> when I say format change I mean via the incompat bits we have, so old fs's 
> won't
> be broken and such.
> 
> 2) Do something like NFS's referral mounts when we cd into a subvolume.  Now 
> we
> just do dentry trickery, but that doesn't make the boundary between subvolumes
> clear, so it will confuse people (and samba) when they walk into a subvolume 
> and
> all of a sudden the inode numbers are the same as in the directory behind 
> them.
> With doing the referral mount thing, each subvolume appears to be its own 
> mount
> and that way things like NFS and samba will work properly.
> 

Sounds like you're on the right track.

The key concept is really that an inode number should be unique within
the scope of the st_dev. The simplest solution for you here is simply to
give each subvol its own st_dev and mount it up via a shrinkable mount
automagically when someone walks into the directory. In addition to the
examples of this in NFS, CIFS does this for DFS referrals.

Today, this is mostly done by hijacking the follow_link operation, but
David Howells proposed some patches a while back to do this via a more
formalized interface. It may be reasonable to target this work on top
of that, depending on the state of those changes...

-- 
Jeff Layton 


Re: disk space caching generation mismatch

2010-12-01 Thread Josef Bacik
On Wed, Dec 01, 2010 at 08:56:14PM +0100, Johannes Hirte wrote:
> On Wednesday 01 December 2010 18:40:18 Josef Bacik wrote:
> > On Wed, Dec 01, 2010 at 05:46:14PM +0100, Johannes Hirte wrote:
> > > After enabling disk space caching I've observed several log entries like 
> > > this:
> > > 
> > > btrfs: free space inode generation (0) did not match free space cache 
> > > generation (169594) for block group 15464398848
> > > 
> > > I'm not sure, but it seems this happens on every reboot. Is this 
> > > something to
> > > worry about?
> > > 
> > 
> > So that usually means 1 of a couple of things
> > 
> > 1) You didn't have space for us to save the free space cache
> > 2) When trying to write out the cache we hit one of those cases where we 
> > would
> > deadlock so we couldn't write the cache out
> > 
> > It's nothing to worry about, it's doing what it is supposed to.  However I'd
> > like to know why we're not able to write out the cache.  Are you running 
> > close
> > to full?  Thanks,
> > 
> > Josef
> >
> 
> I think there should be enough free space:
> 

Hmm well then we're hitting one of the other corner cases.  Can you run with
this debug thread and reboot.  Hopefully it will tell me why we're not saving
the free space cache. Thanks,

Josef

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 87aae66..4fd5659 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2794,13 +2794,17 @@ again:
if (i_size_read(inode) > 0) {
ret = btrfs_truncate_free_space_cache(root, trans, path,
  inode);
-   if (ret)
+   if (ret) {
+   printk(KERN_ERR "truncate free space cache failed for 
%llu, %d\n",
+  block_group->key.objectid, ret);
goto out_put;
+   }
}
 
spin_lock(&block_group->lock);
if (block_group->cached != BTRFS_CACHE_FINISHED) {
spin_unlock(&block_group->lock);
+   printk(KERN_ERR "block group %llu not cached\n", 
block_group->key.objectid);
goto out_put;
}
spin_unlock(&block_group->lock);
@@ -2820,8 +2824,10 @@ again:
num_pages *= PAGE_CACHE_SIZE;
 
ret = btrfs_check_data_free_space(inode, num_pages);
-   if (ret)
+   if (ret) {
+   printk(KERN_ERR "not enough free space for cache %llu\n", 
block_group->key.objectid);
goto out_put;
+   }
 
ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, num_pages,
  num_pages, num_pages,
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 22ee0dc..0078172 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -511,6 +511,8 @@ int btrfs_write_out_cache(struct btrfs_root *root,
spin_lock(&block_group->lock);
if (block_group->disk_cache_state < BTRFS_DC_SETUP) {
spin_unlock(&block_group->lock);
+   printk(KERN_ERR "block group %llu, wrong dcs %d\n", 
block_group->key.objectid,
+  block_group->disk_cache_state);
return 0;
}
spin_unlock(&block_group->lock);
@@ -520,6 +522,7 @@ int btrfs_write_out_cache(struct btrfs_root *root,
return 0;
 
if (!i_size_read(inode)) {
+   printk(KERN_ERR "no allocated space for block group %llu\n", 
block_group->key.objectid);
iput(inode);
return 0;
}
@@ -771,6 +774,7 @@ out_free:
block_group->disk_cache_state = BTRFS_DC_ERROR;
spin_unlock(&block_group->lock);
BTRFS_I(inode)->generation = 0;
+   printk(KERN_ERR "problem writing out block group cache for 
%llu\n", block_group->key.objectid);
}
kfree(checksums);
btrfs_update_inode(trans, root, inode);


Re: dm-crypt barrier support is effective

2010-12-01 Thread Heinz Diehl
On 01.12.2010, Milan Broz wrote: 

> Anyway, I run several tests on 2.6.37-rc3+ and see no integrity
> problems (using xfs,ext3 and ext4 over dmcrypt).

Not that this might help, but just for testing purposes, I have run all 
the -rcX from 2.6.36 on with Milan's patch (XFS filesystem) under heavy 
load and disk i/o, and have not encountered a single problem or corruption.



Re: What to do about subvolumes?

2010-12-01 Thread J. Bruce Fields
On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote:
> Oh well crud, I was hoping that I could leave the inode numbers as 256 for
> everything, but I forgot about readdir.  So the inode item in the parent would
> have to have a unique inode number that would get spit out in readdir, but 
> then
> if we stat'ed the directory we'd get 256 for the inode number.  Oh well,
> incompat flag it is then.

I think you're already fine:

# mkdir TMP
# dd if=/dev/zero of=TMP-image bs=1M count=512
# mkfs.btrfs TMP-image
# mount -oloop TMP-image TMP/
# btrfs subvolume create sub-a
# btrfs subvolume create sub-b
../readdir-inos .
. 256 256
.. 256 4130609
sub-a 256 256
sub-b 257 256

Where readdir-inos is my silly test program below, and the first number is from
readdir, the second from stat.

?

--b.

#include <stdio.h>
#include <unistd.h>
#include <dirent.h>
#include <sys/stat.h>
#include <err.h>

/* demonstrate that for mountpoints, readdir returns the ino of the
 * mounted-on directory, while stat returns the ino of the mounted
 * directory. */

int main(int argc, char *argv[])
{
        struct dirent *de;
        int ret;
        DIR *d;

        if (argc != 2)
                errx(1, "usage: %s <dir>", argv[0]);
        ret = chdir(argv[1]);
        if (ret)
                err(1, "chdir %s", argv[1]);
        d = opendir(".");
        if (!d)
                err(1, "opendir .");
        while ((de = readdir(d))) {
                struct stat st;

                ret = stat(de->d_name, &st);
                if (ret)
                        err(1, "stat %s", de->d_name);
                printf("%s %llu %llu\n", de->d_name,
                       (unsigned long long)de->d_ino,
                       (unsigned long long)st.st_ino);
        }
        return 0;
}



Re: What to do about subvolumes?

2010-12-01 Thread Josef Bacik
On Wed, Dec 01, 2010 at 02:44:04PM -0500, J. Bruce Fields wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > Hello,
> > 
> > Various people have complained about how BTRFS deals with subvolumes 
> > recently,
> > specifically the fact that they all have the same inode number, and there's 
> > no
> > discrete seperation from one subvolume to another.  Christoph asked that I 
> > lay
> > out a basic design document of how we want subvolumes to work so we can hash
> > everything out now, fix what is broken, and then move forward with a design 
> > that
> > everybody is more or less happy with.  I apologize in advance for how 
> > freaking
> > long this email is going to be.  I assume that most people are generally
> > familiar with how BTRFS works, so I'm not going to bother explaining in 
> > great
> > detail some stuff.
> > 
> > === What are subvolumes? ===
> > 
> > They are just another tree.  In BTRFS we have various b-trees to describe 
> > the
> > filesystem.  A few of them are filesystem wide, such as the extent tree, 
> > chunk
> > tree, root tree etc.  The tree's that hold the actual filesystem data, that 
> > is
> > inodes and such, are kept in their own b-tree.  This is how subvolumes and
> > snapshots appear on disk, they are simply new b-trees with all of the file 
> > data
> > contained within them.
> > 
> > === What do subvolumes look like? ===
> > 
> > All the user sees are directories.  They act like any other directory acts, 
> > with
> > a few exceptions
> > 
> > 1) You cannot hardlink between subvolumes.  This is because subvolumes have
> > their own inode numbers and such, think of them as seperate mounts in this 
> > case,
> > you cannot hardlink between two mounts because the link needs to point to 
> > the
> > same on disk inode, which is impossible between two different filesystems.  
> > The
> > same is true for subvolumes, they have their own trees with their own 
> > inodes and
> > inode numbers, so it's impossible to hardlink between them.
> 
> OK, so I'm unclear: would it be possible for nfsd to export subvolumes
> independently?
> 

Yeah.

> For that to work, we need to be able to take an inode that we just
> looked up by filehandle, and see which subvolume it belongs in.  So if
> two subvolumes can point to the same inode, it doesn't work, but if
> st_dev is different between them, e.g., that'd be fine.  Sounds like
> you're seeing the latter is possible, good!
> 

So you can't have the same inode (the same on-disk object) in two subvolumes,
since they are different trees.  But you *can* have the same inode numbers
appear in two subvolumes, again because they are different trees.

> > 
> > 1a) In case it wasn't clear from above, each subvolume has their own inode
> > numbers, so you can have the same inode numbers used between two different
> > subvolumes, since they are two different trees.
> > 
> > 2) Obviously you can't just rm -rf subvolumes.  Because they are roots 
> > there's
> > extra metadata to keep track of them, so you have to use one of our ioctls 
> > to
> > delete subvolumes/snapshots.
> > 
> > But permissions and everything else they are the same.
> > 
> > There is one tricky thing.  When you create a subvolume, the directory inode
> > that is created in the parent subvolume has the inode number of 256.
> 
> Is that the right way to say this?  Doing a quick test, the inode
> numbers that a readdir of the parent directory returns *are* distinct.
> It's just the inode number that you get when you stat that is different.
> 
> Which is all fine and normal, *if* you treat this as a real mountpoint
> with its own vfsmount, st_dev, etc.
> 

Oh well crud, I was hoping that I could leave the inode numbers as 256 for
everything, but I forgot about readdir.  So the inode item in the parent would
have to have a unique inode number that would get spit out in readdir, but then
if we stat'ed the directory we'd get 256 for the inode number.  Oh well,
incompat flag it is then.

> > === How do we want subvolumes to work from a user perspective? ===
> > 
> > 1) Users need to be able to create their own subvolumes.  The permission
> > semantics will be absolutely the same as creating directories, so I don't 
> > think
> > this is too tricky.  We want this because you can only take snapshots of
> > subvolumes, and so it is important that users be able to create their own
> > discrete snapshottable targets.
> > 
> > 2) Users need to be able to snapshot their subvolumes.  This is basically 
> > the
> > same as #1, but it bears repeating.
> > 
> > 3) Subvolumes shouldn't need to be specifically mounted.  This is also
> > important, we don't want users to have to go around mounting their 
> > subvolumes up
> > manually one-by-one.  Today users just cd into subvolumes and it works, just
> > like cd'ing into a directory.
> 
> And the separate nfsd exports is another thing I'd really love to see
> work: currently you can export a subtree of a filesystem if you want,
> but it's trivial to escape the subtree

Re: disk space caching generation mismatch

2010-12-01 Thread Johannes Hirte
On Wednesday 01 December 2010 18:40:18 Josef Bacik wrote:
> On Wed, Dec 01, 2010 at 05:46:14PM +0100, Johannes Hirte wrote:
> > After enabling disk space caching I've observed several log entries like 
> > this:
> > 
> > btrfs: free space inode generation (0) did not match free space cache 
> > generation (169594) for block group 15464398848
> > 
> > I'm not sure, but it seems this happens on every reboot. Is this something 
> > to
> > worry about?
> > 
> 
> So that usually means 1 of a couple of things
> 
> 1) You didn't have space for us to save the free space cache
> 2) When trying to write out the cache we hit one of those cases where we would
> deadlock so we couldn't write the cache out
> 
> It's nothing to worry about, it's doing what it is supposed to.  However I'd
> like to know why we're not able to write out the cache.  Are you running close
> to full?  Thanks,
> 
> Josef
>

I think there should be enough free space:

df -h

Filesystem            Size  Used Avail Use% Mounted on
rootfs 41G   29G  8.4G  78% /
/dev/root  41G   29G  8.4G  78% /
rc-svcdir 1.0M  112K  912K  11% /lib/rc/init.d
udev   10M  284K  9.8M   3% /dev
shm  1008M 0 1008M   0% /dev/shm
/dev/sda3 108G   90G   15G  87% /home

btrfs filesystem df /

Data: total=34.48GB, used=26.13GB
System, DUP: total=8.00MB, used=12.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=2.75GB, used=1.26GB
Metadata: total=8.00MB, used=0.00

btrfs filesystem df /home

Data: total=88.01GB, used=84.84GB
System, DUP: total=8.00MB, used=20.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=4.00GB, used=2.43GB
Metadata: total=8.00MB, used=0.00


Re: What to do about subvolumes?

2010-12-01 Thread J. Bruce Fields
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> Hello,
> 
> Various people have complained about how BTRFS deals with subvolumes recently,
> specifically the fact that they all have the same inode number, and there's no
> discrete seperation from one subvolume to another.  Christoph asked that I lay
> out a basic design document of how we want subvolumes to work so we can hash
> everything out now, fix what is broken, and then move forward with a design 
> that
> everybody is more or less happy with.  I apologize in advance for how freaking
> long this email is going to be.  I assume that most people are generally
> familiar with how BTRFS works, so I'm not going to bother explaining in great
> detail some stuff.
> 
> === What are subvolumes? ===
> 
> They are just another tree.  In BTRFS we have various b-trees to describe the
> filesystem.  A few of them are filesystem wide, such as the extent tree, chunk
> tree, root tree etc.  The tree's that hold the actual filesystem data, that is
> inodes and such, are kept in their own b-tree.  This is how subvolumes and
> snapshots appear on disk, they are simply new b-trees with all of the file 
> data
> contained within them.
> 
> === What do subvolumes look like? ===
> 
> All the user sees are directories.  They act like any other directory acts, 
> with
> a few exceptions
> 
> 1) You cannot hardlink between subvolumes.  This is because subvolumes have
> their own inode numbers and such, think of them as seperate mounts in this 
> case,
> you cannot hardlink between two mounts because the link needs to point to the
> same on disk inode, which is impossible between two different filesystems.  
> The
> same is true for subvolumes, they have their own trees with their own inodes 
> and
> inode numbers, so it's impossible to hardlink between them.

OK, so I'm unclear: would it be possible for nfsd to export subvolumes
independently?

For that to work, we need to be able to take an inode that we just
looked up by filehandle, and see which subvolume it belongs in.  So if
two subvolumes can point to the same inode, it doesn't work, but if
st_dev is different between them, e.g., that'd be fine.  Sounds like
you're seeing the latter is possible, good!

> 
> 1a) In case it wasn't clear from above, each subvolume has their own inode
> numbers, so you can have the same inode numbers used between two different
> subvolumes, since they are two different trees.
> 
> 2) Obviously you can't just rm -rf subvolumes.  Because they are roots there's
> extra metadata to keep track of them, so you have to use one of our ioctls to
> delete subvolumes/snapshots.
> 
> But permissions and everything else they are the same.
> 
> There is one tricky thing.  When you create a subvolume, the directory inode
> that is created in the parent subvolume has the inode number of 256.

Is that the right way to say this?  Doing a quick test, the inode
numbers that a readdir of the parent directory returns *are* distinct.
It's just the inode number that you get when you stat that is different.

Which is all fine and normal, *if* you treat this as a real mountpoint
with its own vfsmount, st_dev, etc.

> === How do we want subvolumes to work from a user perspective? ===
> 
> 1) Users need to be able to create their own subvolumes.  The permission
> semantics will be absolutely the same as creating directories, so I don't 
> think
> this is too tricky.  We want this because you can only take snapshots of
> subvolumes, and so it is important that users be able to create their own
> discrete snapshottable targets.
> 
> 2) Users need to be able to snapshot their subvolumes.  This is basically the
> same as #1, but it bears repeating.
> 
> 3) Subvolumes shouldn't need to be specifically mounted.  This is also
> important, we don't want users to have to go around mounting their subvolumes 
> up
> manually one-by-one.  Today users just cd into subvolumes and it works, just
> like cd'ing into a directory.

And the separate nfsd exports is another thing I'd really love to see
work: currently you can export a subtree of a filesystem if you want,
but it's trivial to escape the subtree by guessing filehandles.  So this
gives us an easy way for administrators to create secure separate
exports without having to manage entirely separate volumes.

If subvolumes got real mountpoints and so on, this would be easy.

--b.


Re: What to do about subvolumes?

2010-12-01 Thread Hugo Mills
On Wed, Dec 01, 2010 at 12:38:30PM -0500, Josef Bacik wrote:
> On Wed, Dec 01, 2010 at 04:38:00PM +, Hugo Mills wrote:
> > On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > > === Quotas ===
> > > 
> > > This is a huge topic in and of itself, but Christoph mentioned wanting to 
> > > have
> > > an idea of what we wanted to do with it, so I'm putting it here.  There 
> > > are
> > > really 2 things here
> > > 
> > > 1) Limiting the size of subvolumes.  This is really easy for us, just 
> > > create a
> > > subvolume and at creation time set a maximum size it can grow to and not 
> > > let it
> > > go farther than that.  Nice, simple and straightforward.
> > > 
> > > 2) Normal quotas, via the quota tools.  This just comes down to how do we 
> > > want
> > > to charge users, do we want to do it per subvolume, or per filesystem.  
> > > My vote
> > > is per filesystem.  Obviously this will make it tricky with snapshots, 
> > > but I
> > > think if we're just charging the diff's between the original volume and 
> > > the
> > > snapshot to the user then that will be the easiest for people to 
> > > understand,
> > > rather than making a snapshot all of a sudden count the users currently 
> > > used
> > > quota * 2.
> > 
> >This is going to be tricky to get the semantics right, I suspect.
> > 
> >Say you've created a subvolume, A, containing 10G of Useful Stuff
> > (say, a base image for VMs). This counts 10G against your quota. Now,
> > I come along and snapshot that subvolume (as a writable subvolume) --
> > call it B. This is essentially free for me, because I've got a COW
> > copy of your subvolume (and the original counts against your quota).
> > 
> >If I now modify a file in subvolume B, the full modified section
> > goes onto my quota. This is all well and good. But what happens if you
> > delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
> > files.  Worse, what happens if someone else had made a snapshot of A,
> > too? Who gets the 10G added to their quota, me or them? What if I'd
> > filled up my quota? Would that stop you from deleting your copy,
> > because my copy can't be charged against my quota? Would I just end up
> > unexpectedly 10G over quota?
> > 
> 
> If you delete your subvolume A, like use the btrfs tool to delete it, you will
> only be stuck with what you changed in snapshot B.  So if you only changed 
> 5gig
> worth of information, and you deleted the original subvolume, you would have
> 5gig charged to your quota.

   This doesn't work, though, if the owners of the "original" and
"new" subvolume are different:

Case 1:

 * Porthos creates 10G data.
 * Athos makes a snapshot of Porthos's data.
 * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
   Porthos's data to Athos.
 * Porthos deletes his copy of the data.

Case 2:

 * Porthos creates 10G of data.
 * Athos makes a snapshot of Porthos's data.
 * Porthos deletes his copy of the data.
 * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
   Porthos's data to Athos.

Case 3:

 * Porthos creates 10G data.
 * Athos makes a snapshot of Porthos's data.
 * Aramis makes a snapshot of Porthos's data.
 * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
   Porthos's data to Athos.
 * Porthos deletes his copy of the data.

Case 4:

 * Porthos creates 10G data.
 * Athos makes a snapshot of Porthos's data.
 * Aramis makes a snapshot of Athos's data.
 * Porthos deletes his copy of the data.
   [Consider also Richelieu changing ownerships of Athos's and Aramis's
   data at alternative points in this sequence]

   In each of these, who gets charged (and how much) for their copy of
the data?

>  The idea is you are only charged for what blocks
> you have on the disk.  Thanks,

   My point was that it's perfectly possible to have blocks on the
disk that are effectively owned by two people, and that the person to
charge for those blocks is, to me, far from clear. You either end up
charging twice for a single set of blocks on the disk, or you end up
in a situation where one person's actions can cause another person's
quota to fill up. Neither of these is particularly obvious behaviour.

   Hugo.

-- 
=== Hugo Mills: h...@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- I believe that it's closely correlated with ---   
   the aeroswine coefficient.




Re: dm-crypt barrier support is effective

2010-12-01 Thread Jon Nelson
On Wed, Dec 1, 2010 at 12:24 PM, Milan Broz  wrote:
> On 12/01/2010 06:35 PM, Matt wrote:
>> Thanks for pointing to v6 ! I hadn't noticed that there was a new one :)
>>
>> Well, so I'll restore my box to a working/productive state and will
>> try out v6 (I'm pretty confident that it'll work without problems).
>
> It's the same as the previous one, just with a fixed header (to track it
> properly in patchwork); the second patch adds some read optimisation,
> nothing that should help here.
>
> Anyway, I run several tests on 2.6.37-rc3+ and see no integrity
> problems (using xfs,ext3 and ext4 over dmcrypt).
>
> So please try to check which change causes these problems for you,
> it can be something completely unrelated to these patches.
>
> (If anyone knows how to trigger some corruption with btrfs/dmcrypt,
> let me know; I am not able to reproduce it either.)

Perhaps this is useful: for myself, I found that when I started using
2.6.37rc3, postgresql started having a *lot* of problems with
corruption. Specifically, I noted zeroed pages, corruption in headers,
all sorts of stuff on /newly created/ tables, especially during index
creation. I had a fairly high hit rate of failure. I backed off to
2.6.34.7 and have *zero* problems (in fact, prior to 2.6.37rc3, I had
never had a corruption issue with postgresql). I ran on 2.6.36 for a
few weeks as well, without issue.

I am using dm-crypt with lvm on top of that, and ext4 on top of that.

-- 
Jon


Re: What to do about subvolumes?

2010-12-01 Thread Goffredo Baroncelli
On Wednesday, 01 December, 2010, you (C Anthony Risinger) wrote:
[...]
> i forgot to mention, but a quick 'n dirty solution would be to simply
> not enable users to do this by accident.  mkfs.btrfs could create a
> new subvol, then mark it as default... this way the user has to
> manually mount with id=0, or remark 0 as the default.
> 
> effectively, users would unknowingly be installing into a
> subvolume, rather than the top-level root (apologies if my terminology
> is incorrect).

I fully agree: it fulfills the KISS principle :-)

> C Anthony
> 


-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) 
Key fingerprint = 4769 7E51 5293 D36C 814E  C054 BF04 F161 3DC5 0512


Re: What to do about subvolumes?

2010-12-01 Thread C Anthony Risinger
On Wed, Dec 1, 2010 at 12:48 PM, C Anthony Risinger  wrote:
> On Wed, Dec 1, 2010 at 12:36 PM, Josef Bacik  wrote:
>> On Wed, Dec 01, 2010 at 07:33:39PM +0100, Goffredo Baroncelli wrote:
>>
>>> Another point that I would like to discuss is how to manage the "pivoting"
>>> between the subvolumes. One of the most beautiful features of btrfs is the
>>> snapshot capability. In fact it is possible to make a snapshot of the root
>>> of the filesystem and to mount it on a subsequent reboot.
>>> But it is very complicated to manage the pivoting of a snapshot of a root
>>> filesystem, because I cannot delete the "old root" due to the fact that the
>>> "new root" is placed in the "old root".
>>>
>>> A possible solution is not to put the root of the filesystem (where /usr,
>>> /etc are placed) in the root of the btrfs filesystem; rather, it should be
>>> accepted from the beginning that the root of a filesystem should be placed
>>> in a subvolume which in turn is placed in the root of a btrfs filesystem...
>>>
>>> I am open to other opinions.
>>>
>>
>> Agreed, one of the things that Chris and I have discussed is the possiblity 
>> of
>> just having dangling roots, since really the directories are just an easy 
>> way to
>> get to the subvolumes.  This would let you delete the original volume and use
>> the snapshot from then on out.  Something to do in the future for sure.
>
> i would really like to see a solution to this particular issue.  i may
> be missing something, but the dangling subvol roots don't seem to
> address the management of the root volume itself.
>
> for example... most people will install their whole system into the
> real root (id=5), but this renders the system unmanageable, because
> there is no way to ever empty it without manually issuing an `rm -rf`.
>
> i'm having a really hard time controlling this with the initramfs hook
> i provide for archlinux users.  the hook requires a specific structure
> "underneath" what the user perceives as /, but i can only accomplish
> this for new installs -- for existing installs i can setup the proper
> "subroot" structure, and snapshot their current root... but i cannot
> remove the stagnant files in the real root (id=5) that will never,
> ever be accessed again.
>
> ... or does dangling roots address this?

i forgot to mention, but a quick 'n dirty solution would be to simply
not enable users to do this by accident.  mkfs.btrfs could create a
new subvol, then mark it as default... this way the user has to
manually mount with id=0, or remark 0 as the default.

effectively, users would unknowingly be installing into a
subvolume, rather than the top-level root (apologies if my terminology
is incorrect).
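
as an aside, the "mark it as default" half of that suggestion already has a
kernel interface.  A minimal, illustrative userspace sketch is below; it
assumes the BTRFS_IOC_DEFAULT_SUBVOL ABI (a plain __u64 subvolume id, as in
the kernel's btrfs ioctl header) and is not part of any proposed patch -- the
btrfs tool's "subvolume set-default" subcommand drives the same ioctl:

/* set-default.c: mark subvolume <id> as the one mounted when no
 * subvol= option is given.  Sketch only; take <id> from the output
 * of 'btrfs subvolume list'. */
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>
#include <err.h>

#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, __u64)

int main(int argc, char *argv[])
{
        __u64 objectid;
        int fd;

        if (argc != 3)
                errx(1, "usage: %s <btrfs-mountpoint> <subvol-id>", argv[0]);
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
                err(1, "open %s", argv[1]);
        objectid = strtoull(argv[2], NULL, 10);
        if (ioctl(fd, BTRFS_IOC_DEFAULT_SUBVOL, &objectid) < 0)
                err(1, "BTRFS_IOC_DEFAULT_SUBVOL");
        close(fd);
        return 0;
}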

C Anthony


Re: disk space caching generation mismatch

2010-12-01 Thread Josef Bacik
On Wed, Dec 01, 2010 at 05:46:14PM +0100, Johannes Hirte wrote:
> After enabling disk space caching I've observed several log entries like this:
> 
> btrfs: free space inode generation (0) did not match free space cache 
> generation (169594) for block group 15464398848
> 
> I'm not sure, but it seems this happens on every reboot. Is this something to
> worry about?
> 

So that usually means 1 of a couple of things

1) You didn't have space for us to save the free space cache
2) When trying to write out the cache we hit one of those cases where we would
deadlock so we couldn't write the cache out

It's nothing to worry about, it's doing what it is supposed to.  However I'd
like to know why we're not able to write out the cache.  Are you running close
to full?  Thanks,

Josef


Re: What to do about subvolumes?

2010-12-01 Thread C Anthony Risinger
On Wed, Dec 1, 2010 at 12:36 PM, Josef Bacik  wrote:
> On Wed, Dec 01, 2010 at 07:33:39PM +0100, Goffredo Baroncelli wrote:
>
>> Another point that I would like to discuss is how to manage the "pivoting"
>> between the subvolumes. One of the most beautiful features of btrfs is the
>> snapshot capability. In fact it is possible to make a snapshot of the root
>> of the filesystem and to mount it on a subsequent reboot.
>> But it is very complicated to manage the pivoting of a snapshot of a root
>> filesystem, because I cannot delete the "old root" due to the fact that the
>> "new root" is placed in the "old root".
>>
>> A possible solution is not to put the root of the filesystem (where /usr,
>> /etc are placed) in the root of the btrfs filesystem; rather, it should be
>> accepted from the beginning that the root of a filesystem should be placed
>> in a subvolume which in turn is placed in the root of a btrfs filesystem...
>>
>> I am open to other opinions.
>>
>
> Agreed, one of the things that Chris and I have discussed is the possiblity of
> just having dangling roots, since really the directories are just an easy way 
> to
> get to the subvolumes.  This would let you delete the original volume and use
> the snapshot from then on out.  Something to do in the future for sure.

i would really like to see a solution to this particular issue.  i may
be missing something, but the dangling subvol roots don't seem to
address the management of the root volume itself.

for example... most people will install their whole system into the
real root (id=5), but this renders the system unmanageable, because
there is no way to ever empty it without manually issuing an `rm -rf`.

i'm having a really hard time controlling this with the initramfs hook
i provide for archlinux users.  the hook requires a specific structure
"underneath" what the user perceives as /, but i can only accomplish
this for new installs -- for existing installs i can setup the proper
"subroot" structure, and snapshot their current root... but i cannot
remove the stagnant files in the real root (id=5) that will never,
ever be accessed again.

... or does dangling roots address this?

C Anthony


Re: What to do about subvolumes?

2010-12-01 Thread Josef Bacik
On Wed, Dec 01, 2010 at 07:33:39PM +0100, Goffredo Baroncelli wrote:
> On Wednesday, 01 December, 2010, Josef Bacik wrote:
> > Hello,
> > 
> 
> Hi Josef
> 
> > 
> > === What are subvolumes? ===
> > 
> > They are just another tree.  In BTRFS we have various b-trees to describe 
> the
> > filesystem.  A few of them are filesystem wide, such as the extent tree, 
> chunk
> > tree, root tree etc.  The tree's that hold the actual filesystem data, that 
> is
> > inodes and such, are kept in their own b-tree.  This is how subvolumes and
> > snapshots appear on disk, they are simply new b-trees with all of the file 
> data
> > contained within them.
> > 
> > === What do subvolumes look like? ===
> > 
> [...]
> > 
> > 2) Obviously you can't just rm -rf subvolumes.  Because they are roots 
> there's
> > extra metadata to keep track of them, so you have to use one of our ioctls 
> to
> > delete subvolumes/snapshots.
> 
> Sorry, but I can't understand this sentence. It is clear that a directory and
> a subvolume have totally different on-disk formats. But why would it not be
> possible to remove a subvolume via the normal rmdir(2) syscall? I posted a
> patch some months ago: when the rmdir is invoked on a subvolume, the same
> action as the ioctl BTRFS_IOC_SNAP_DESTROY is performed.
> 
> See https://patchwork.kernel.org/patch/260301/
>  

Oh hey, that's cool.  That would be reasonable, I think.  I was just saying
that currently we can't remove subvolumes/snapshots via rm, not that it isn't
possible at all.  So I think what you did would be a good thing to have.

> [...]
> > 
> > There is one tricky thing.  When you create a subvolume, the directory inode
> > that is created in the parent subvolume has the inode number of 256.  So if 
> you
> > have a bunch of subvolumes in the same parent subvolume, you are going to 
> have a
> > bunch of directories with the inode number of 256.  This is so when users cd
> > into a subvolume we can know its a subvolume and do all the normal voodoo to
> > start looking in the subvolumes tree instead of the parent subvolumes tree.
> > 
> > This is where things go a bit sideways.  We had serious problems with NFS, 
> but
> > thankfully NFS gives us a bunch of hooks to get around these problems.
> > CIFS/Samba do not, so we will have problems there, not to mention any other
> > userspace application that looks at inode numbers.
> 
> How is this, or how should it be, different from a mounted filesystem?
> For example:
> 
> # cd /tmp
> # btrfs subvolume create sub-a
> # btrfs subvolume create sub-b
> # mkdir mount-a; mkdir mount-b
> # mount /dev/sda6 mount-a # an ext4 fs
> # mount /dev/sdb2 mount-b # an ext3 fs
> # $ stat -c "%8i %n" sub-a sub-b mount-a mount-b
>  256 sub-a
>  256 sub-b
>2 mount-a
>2 mount-b
> 
> In this case the inode-number returned are equal for both the mounted 
> filesystems and the subvolumes. However, the fsid is different.
> 
> # stat -fc "%8i %n" sub-a sub-b mount-a mount-b .
> cdc937c1a203df74 sub-a
> cdc937c1a203df77 sub-b
> b27d147f003561c8 mount-a
> d49e1a3d2333d2e1 mount-b
> cdc937c1a203df75 .
> 
> Moreover I suggest looking at the difference between the inode numbers
> returned by readdir(3) and stat(3).
>

Yeah, you are right, the inode numbering can probably stay the same; we just
need to make them logically different mounts so things like NFS and Samba
still work right.

> [...]
> > I feel like I'm forgetting something here, hopefully somebody will point it 
> out.
> > 
> 
> Another point that I would like to discuss is how to manage the "pivoting"
> between the subvolumes. One of the most beautiful features of btrfs is the
> snapshot capability. In fact it is possible to make a snapshot of the root
> of the filesystem and to mount it on a subsequent reboot.
> But it is very complicated to manage the pivoting of a snapshot of a root
> filesystem, because I cannot delete the "old root" due to the fact that the
> "new root" is placed in the "old root".
> 
> A possible solution is not to put the root of the filesystem (where /usr,
> /etc are placed) in the root of the btrfs filesystem; rather, it should be
> accepted from the beginning that the root of a filesystem should be placed
> in a subvolume which in turn is placed in the root of a btrfs filesystem...
> 
> I am open to other opinions.
> 

Agreed, one of the things that Chris and I have discussed is the possiblity of
just having dangling roots, since really the directories are just an easy way to
get to the subvolumes.  This would let you delete the original volume and use
the snapshot from then on out.  Something to do in the future for sure.  Thanks,

Josef


Re: On Removing BUG_ON macros

2010-12-01 Thread Josef Bacik
On Thu, Nov 11, 2010 at 12:32:06PM +0800, Ian Kent wrote:
> On Mon, 2010-11-08 at 23:02 +0800, Ian Kent wrote:
> > On Mon, 2010-11-08 at 09:15 -0500, Josef Bacik wrote:
> > > On Mon, Nov 08, 2010 at 10:06:13PM +0800, Ian Kent wrote:
> > > > On Mon, 2010-11-08 at 07:42 -0500, Josef Bacik wrote:
> > > > > On Mon, Nov 08, 2010 at 10:54:07AM +0800, Ian Kent wrote:
> > > > > > On Sun, 2010-11-07 at 09:51 -0500, Josef Bacik wrote:
> > > > > > > On Sun, Nov 07, 2010 at 04:16:47PM +0900, Yoshinori Sano wrote:
> > > > > > > > This is a question I've posted on the #btrfs IRC channel today.
> > > > > > > > hyperair adviced me to contact with Josef Bacik or Chris Mason.
> > > > > > > > So, I post my question to this maling list.
> > > > > > > > 
> > > > > > > > Here are my post on the IRC:
> > > > > > > > 
> > > > > > > > Actually, I want to remove BUG_ON(ret) around the Btrfs code.
> > > > > > > > The motivation is to make the Btrfs code more robust.
> > > > > > > > First of all, is this meaningless?
> > > > > > > > 
> > > > > > > > For example, there are code like the following:
> > > > > > > > 
> > > > > > > > struct btrfs_path *path;
> > > > > > > > path = btrfs_alloc_path();
> > > > > > > > BUG_ON(!path);
> > > > > > > > 
> > > > > > > > This is a frequenty used pattern of current Btrfs code.
> > > > > > > > A btrfs_alloc_path()'s caller has to deal with the allocation 
> > > > > > > > failure
> > > > > > > > instead of using BUG_ON.  However, (this is what most 
> > > > > > > > interesting
> > > > > > > > thing for me) can the caller do any proper error handlings here?
> > > > > > > > I mean, is this a critical situation where we cannot recover 
> > > > > > > > from?
> > > > > > > >
> > > > > > > 
> > > > > > > No we're just lazy ;).  Tho making sure the caller can recover 
> > > > > > > from getting
> > > > > > > -ENOMEM is very important, which is why in some of these paths we 
> > > > > > > just do BUG_ON
> > > > > > > since fixing the callers is tricky.  A good strategy for things 
> > > > > > > like this is to
> > > > > > > do something like
> > > > > > > 
> > > > > > > static int foo = 1;
> > > > > > > 
> > > > > > > path = btrfs_alloc_path();
> > > > > > > if (!path || !(foo % 1000))
> > > > > > >   return -ENOMEM;
> > > > > > > foo++;
> > > > > > 
> > > > > > Hahaha, I love it.
> > > > > > 
> > > > > > So, return ENOMEM every 1000 times we call the containing function!
> > > > > > 
> > > > > > > 
> > > > > > > that way you can catch all the callers and make sure we're 
> > > > > > > handling the error
> > > > > > > all the way up the chain properly.  Thanks,
> > > > > > 
> > > > > > Yeah, I suspect this approach will be a bit confusing though.
> > > > > > 
> > > > > > I believe that it will be more effective, although time consuming, 
> > > > > > to
> > > > > > work through the call tree function by function. Although, as I have
> > > > > > said, the problem is working out what needs to be done to recover,
> > > > > > rather than working out what the callers are. I'm not at all sure 
> > > > > > yet
> > > > > > but I also suspect that it may not be possible to recover in some 
> > > > > > cases,
> > > > > > which will likely lead to serious rework of some subsystems (but, 
> > > > > > hey,
> > > > > > who am I to say, I really don't have any clue yet).
> > > > > >
> > > > > 
> > > > > So we talked about this at plumbers.  First thing we need is a way to 
> > > > > flip the
> > > > > filesystem read only, that way we can deal with the simple corruption 
> > > > > cases.
> > > > 
> > > > Right, yes.
> > > > 
> > > > > And then we can start looking at these harder cases where it's really 
> > > > > unclear
> > > > > about how to recover.
> > > > 
> > > > I have a way to go before I will even understand these cases.
> > > > 
> > > > > 
> > > > > Thankfully because we're COW we really shouldn't have any cases that 
> > > > > we have to
> > > > > unwind anything, we just fail the operation and go on our happy merry 
> > > > > way.  The
> > > > > only tricky thing is where we get ENOMEM when say inserting the 
> > > > > metadata for
> > > > > data after writing out the data, since that will leave data just 
> > > > > sitting around.
> > > > > Probably should look at what NFS does with dirty pages when the 
> > > > > server hangs up.
> > > > 
> > > > OK, that's a though for me to focus on while I'm trying to work out
> > > > what's going on ... mmm.
> > > > 
> > > > Indeed, a large proportion of these are handling ENOMEM.
> > > > 
> > > > I somehow suspect you're heavily focused on disk io itself when I'm still
> > > > back thinking about house keeping of operations, in the process of being
> > > > queued and those currently being processed, the later being the
> > > > difficult case. But I'll eventually get to worrying about io as part of
> > > > that process. It's also worth mentioning that my scope is also quite
> > > > narrow at this stage, focusing largely on the transaction subsystem,
> > > > alt

Re: What to do about subvolumes?

2010-12-01 Thread Goffredo Baroncelli
On Wednesday, 01 December, 2010, Josef Bacik wrote:
> Hello,
> 

Hi Josef

> 
> === What are subvolumes? ===
> 
> They are just another tree.  In BTRFS we have various b-trees to describe 
the
> filesystem.  A few of them are filesystem wide, such as the extent tree, 
chunk
> tree, root tree etc.  The tree's that hold the actual filesystem data, that 
is
> inodes and such, are kept in their own b-tree.  This is how subvolumes and
> snapshots appear on disk, they are simply new b-trees with all of the file 
data
> contained within them.
> 
> === What do subvolumes look like? ===
> 
[...]
> 
> 2) Obviously you can't just rm -rf subvolumes.  Because they are roots 
there's
> extra metadata to keep track of them, so you have to use one of our ioctls 
to
> delete subvolumes/snapshots.

Sorry, but I can't understand this sentence. It is clear that a directory and 
a subvolume have totally different on-disk formats. But why would it not be 
possible to remove a subvolume via the normal rmdir(2) syscall? I posted a 
patch some months ago: when the rmdir is invoked on a subvolume, the same 
action as the ioctl BTRFS_IOC_SNAP_DESTROY is performed.

See https://patchwork.kernel.org/patch/260301/
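
For anyone curious what that deletion currently looks like from userspace
(i.e. the thing the patch would let rmdir(2) trigger implicitly), here is a
minimal sketch.  It assumes the vol_args ABI from the kernel's btrfs ioctl
header and is purely illustrative, not part of the patch:

/* snap-destroy.c: delete a subvolume/snapshot via BTRFS_IOC_SNAP_DESTROY
 * on its parent directory, as the btrfs tool does.  Sketch only. */
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>
#include <err.h>

#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_PATH_NAME_MAX 4087

struct btrfs_ioctl_vol_args {
        __s64 fd;
        char name[BTRFS_PATH_NAME_MAX + 1];
};

#define BTRFS_IOC_SNAP_DESTROY \
        _IOW(BTRFS_IOCTL_MAGIC, 15, struct btrfs_ioctl_vol_args)

int main(int argc, char *argv[])
{
        struct btrfs_ioctl_vol_args args;
        int fd;

        if (argc != 3)
                errx(1, "usage: %s <parent-dir> <subvol-name>", argv[0]);
        fd = open(argv[1], O_RDONLY);   /* parent of the doomed subvolume */
        if (fd < 0)
                err(1, "open %s", argv[1]);
        memset(&args, 0, sizeof(args));
        strncpy(args.name, argv[2], BTRFS_PATH_NAME_MAX);
        if (ioctl(fd, BTRFS_IOC_SNAP_DESTROY, &args) < 0)
                err(1, "BTRFS_IOC_SNAP_DESTROY");
        close(fd);
        return 0;
}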
 
[...]
> 
> There is one tricky thing.  When you create a subvolume, the directory inode
> that is created in the parent subvolume has the inode number of 256.  So if 
you
> have a bunch of subvolumes in the same parent subvolume, you are going to 
have a
> bunch of directories with the inode number of 256.  This is so when users cd
> into a subvolume we can know its a subvolume and do all the normal voodoo to
> start looking in the subvolumes tree instead of the parent subvolumes tree.
> 
> This is where things go a bit sideways.  We had serious problems with NFS, 
but
> thankfully NFS gives us a bunch of hooks to get around these problems.
> CIFS/Samba do not, so we will have problems there, not to mention any other
> userspace application that looks at inode numbers.

How is this, or how should it be, different from a mounted filesystem?
For example:

# cd /tmp
# btrfs subvolume create sub-a
# btrfs subvolume create sub-b
# mkdir mount-a; mkdir mount-b
# mount /dev/sda6 mount-a   # an ext4 fs
# mount /dev/sdb2 mount-b   # an ext3 fs
# $ stat -c "%8i %n" sub-a sub-b mount-a mount-b
 256 sub-a
 256 sub-b
   2 mount-a
   2 mount-b

In this case the inode-number returned are equal for both the mounted 
filesystems and the subvolumes. However, the fsid is different.

# stat -fc "%8i %n" sub-a sub-b mount-a mount-b .
cdc937c1a203df74 sub-a
cdc937c1a203df77 sub-b
b27d147f003561c8 mount-a
d49e1a3d2333d2e1 mount-b
cdc937c1a203df75 .

Moreover I suggest looking at the difference between the inode numbers
returned by readdir(3) and stat(3).
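
To make the fsid point concrete: userspace can already detect that it is
crossing a subvolume (or mount) boundary by comparing the statfs(2) f_fsid
values of a directory and its parent.  A small illustrative sketch (naive
path handling, no btrfs-specific calls assumed):

/* fsid-boundary.c: report whether <dir> sits on a different fsid than
 * its parent, i.e. whether entering it crosses a subvolume or mount
 * boundary.  Sketch only. */
#include <stdio.h>
#include <string.h>
#include <sys/vfs.h>
#include <err.h>

int main(int argc, char *argv[])
{
        struct statfs child, parent;
        char buf[4096];

        if (argc != 2)
                errx(1, "usage: %s <dir>", argv[0]);
        if (statfs(argv[1], &child) < 0)
                err(1, "statfs %s", argv[1]);
        snprintf(buf, sizeof(buf), "%s/..", argv[1]);
        if (statfs(buf, &parent) < 0)
                err(1, "statfs %s", buf);
        printf("%s: %s boundary\n", argv[1],
               memcmp(&child.f_fsid, &parent.f_fsid,
                      sizeof(child.f_fsid)) ? "crosses a" : "no");
        return 0;
}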

[...]
> I feel like I'm forgetting something here, hopefully somebody will point it 
out.
> 

Another point that I would like to discuss is how to manage the "pivoting"
between the subvolumes. One of the most beautiful features of btrfs is the
snapshot capability. In fact it is possible to make a snapshot of the root
of the filesystem and to mount it on a subsequent reboot.
But it is very complicated to manage the pivoting of a snapshot of a root
filesystem, because I cannot delete the "old root" due to the fact that the
"new root" is placed in the "old root".

A possible solution is not to put the root of the filesystem (where /usr,
/etc are placed) in the root of the btrfs filesystem; rather, it should be
accepted from the beginning that the root of a filesystem should be placed
in a subvolume which in turn is placed in the root of a btrfs filesystem...

I am open to other opinions.

> === Conclusion ===
> 
> There are definitely some wonky things with subvolumes, but I don't think 
they
> are things that cannot be fixed now.  Some of these changes will require
> incompat format changes, but it's either we fix it now, or later on down the
> road when BTRFS starts getting used in production really find out how many
> things our current scheme breaks and then have to do the changes then.  
Thanks,
> 
> Josef
> 


-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) 
Key fingerprint = 4769 7E51 5293 D36C 814E  C054 BF04 F161 3DC5 0512


Re: dm-crypt barrier support is effective

2010-12-01 Thread Milan Broz
On 12/01/2010 06:35 PM, Matt wrote:
> Thanks for pointing to v6 ! I hadn't noticed that there was a new one :)
> 
> Well, so I'll restore my box to a working/productive state and will
> try out v6 (I'm pretty confident that it'll work without problems).

It's the same as the previous one, just with a fixed header (to track it
properly in patchwork); the second patch adds some read optimisation,
nothing that should help here.

Anyway, I run several tests on 2.6.37-rc3+ and see no integrity
problems (using xfs,ext3 and ext4 over dmcrypt).

So please try to check which change causes these problems for you,
it can be something completely unrelated to these patches.

(If anyone knows how to trigger some corruption with btrfs/dmcrypt,
let me know; I am not able to reproduce it either.)

Milan


Re: What to do about subvolumes?

2010-12-01 Thread Josef Bacik
On Wed, Dec 01, 2010 at 04:38:00PM +, Hugo Mills wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > === Quotas ===
> > 
> > This is a huge topic in and of itself, but Christoph mentioned wanting to 
> > have
> > an idea of what we wanted to do with it, so I'm putting it here.  There are
> > really 2 things here
> > 
> > 1) Limiting the size of subvolumes.  This is really easy for us, just 
> > create a
> > subvolume and at creation time set a maximum size it can grow to and not 
> > let it
> > go farther than that.  Nice, simple and straightforward.
> > 
> > 2) Normal quotas, via the quota tools.  This just comes down to how do we 
> > want
> > to charge users, do we want to do it per subvolume, or per filesystem.  My 
> > vote
> > is per filesystem.  Obviously this will make it tricky with snapshots, but I
> > think if we're just charging the diff's between the original volume and the
> > snapshot to the user then that will be the easiest for people to understand,
> > rather than making a snapshot all of a sudden count the users currently used
> > quota * 2.
> 
>This is going to be tricky to get the semantics right, I suspect.
> 
>Say you've created a subvolume, A, containing 10G of Useful Stuff
> (say, a base image for VMs). This counts 10G against your quota. Now,
> I come along and snapshot that subvolume (as a writable subvolume) --
> call it B. This is essentially free for me, because I've got a COW
> copy of your subvolume (and the original counts against your quota).
> 
>If I now modify a file in subvolume B, the full modified section
> goes onto my quota. This is all well and good. But what happens if you
> delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
> files.  Worse, what happens if someone else had made a snapshot of A,
> too? Who gets the 10G added to their quota, me or them? What if I'd
> filled up my quota? Would that stop you from deleting your copy,
> because my copy can't be charged against my quota? Would I just end up
> unexpectedly 10G over quota?
> 

If you delete your subvolume A (i.e. use the btrfs tool to delete it), you will
only be stuck with what you changed in snapshot B.  So if you only changed 5gig
worth of information, and you deleted the original subvolume, you would have
5gig charged to your quota.  The idea is that you are only charged for the
blocks you actually have on the disk.  Thanks,

Josef


Re: dm-crypt barrier support is effective

2010-12-01 Thread Matt
On Wed, Dec 1, 2010 at 5:52 PM, Mike Snitzer  wrote:
> On Wed, Dec 01 2010 at 11:05am -0500,
> Matt  wrote:
>
>> On Mon, Nov 15, 2010 at 12:24 AM, Matt  wrote:
>> > On Sun, Nov 14, 2010 at 10:54 PM, Milan Broz  wrote:
>> >> On 11/14/2010 10:49 PM, Matt wrote:
>> >>> only with the dm-crypt scaling patch I could observe the data-corruption
>> >>
>> >> even with v5 I sent on Friday?
>> >>
>> >> Are you sure that it is not related to some fs problem in 2.6.37-rc1?
>> >>
>> >> If it works on 2.6.36 without problems, it is probably problems somewhere
>> >> else (flush/fua conversion was trivial here - DM is still doing full flush
>> >> and there are no other changes in code IMHO.)
>> >>
>> >> Milan
>> >>
>> >
>> > Hi Milan,
>> >
>> > I'm aware of your new v5 patch (which should include several
>> > improvements (or potential fixes in my case) over the v3 patch)
>> >
>> > as I already wrote my schedule unfortunately currently doesn't allow
>> > me to test it
>> >
>> > * in the case of no corruption it would be nice to have 2.6.37-rc* running 
>> > :)
>> >
>> > * in the case of data corruption that would mean restoring my system -
>> > since it's my production box and right now I don't have a fallback at
>> > reach
>> > at earliest I could give it a shot at the beginning of December. Then
>> > I could also test reiserfs and ext4 as a system partition to rule out
>> > that it's
>> > a ext4-specific thing (currently I'm running reiserfs on my 
>> > system-partition).
>> >
>> > Thanks !
>> >
>> > Matt
>> >
>>
>>
>> OK guys,
>>
>> I've updated my system to latest glibc 2.12.1-r3 (on gentoo) and gcc
>> hardened 4.5.1-r1 with 1.4 patchset which also uses pie (that one
>> should fix problems with graphite)
>>
>> not much system changes besides that,
>>
>> with those it worked fine with 2.6.36 and I couldn't observe any
>> filesystem corruption
>
> So dm-crypt cpu scalability v5 with 2.6.36 worked fine.
>
>> the bad news is: I'm again seeing corruption (!) [on ext4, on the /
>> (root) partition]:
>
> ...
>
>> ===> so the No.1 trigger of this kind of corruption where files are
>> empty, missing or the content gets corrupted (at least for me) is
>> compiling software which is part of the system (e.g. emerge -e
>> system);
>>
>> the system is Gentoo ~amd64; with binutils 2.20.51.0.12 (afaik this
>> one has changed from 2.20.51.0.10 to 2.20.51.0.12 from my last
>> report); gcc 4.5.1 (Gentoo Hardened 4.5.1-r1 p1.4, pie-0.4.5) <--
>> works fine with 2.6.36 and 2.6.36.1
>>
>> I'm not sure whether benchmarks would have the same "impact"
>
> Seems this emerge is a good test if it reliably induces the corruption.
>
>> the kernel currently running is 2.6.37-rc4 with the [PATCH v5] dm
>> crypt: scale to multiple CPUs
>>
>> besides that additional patchsets are applied (I apologize that it's
>> not only plain vanilla with the dm-crypt patch):
>> * Prevent kswapd dumping excessive amounts of memory in response to
>> high-order allocation
>> * ext4: coordinate data-only flush requests sent by fsync
>> * vmscan: protect executable page from inactive list scan
>> * writeback livelock fixes v2
>
> Have you actually experienced any of the issues the above patches are
> meant to address?  Seems you're applying patches guessing/hoping
> that they'll fix the dm-crypt corruption.
>
>> I originally had hoped that the mentioned patch in "ext4: coordinate
>> data-only flush requests sent by fsync", namely: "md: Call
>> blk_queue_flush() to establish flush/fua" and additional changes &
>> fixes to 2.6.37-rc4 would once and for all fix problems but it didn't
>
> That md patch doesn't help DM at all.  And the ext4 coordination patch
> is completely bleeding edge and actually broken (especially as it relates to
> DM -- but that breakage is only a concern for request-based DM,
> e.g. DM-mpath), anyway see:
> https://www.redhat.com/archives/dm-devel/2010-November/msg00185.html
>
> I'm not sure which patches you're using for the ext4 fsync changes but
> please don't use them at all.  It is purely an optimization for
> extremely heavy fsync workloads and is only getting in the way at this
> point.
>
>> I'm also using the the writeback livelock fixes and the dm-crypt scale
>> to multiple CPUs with 2.6.36 so those generally work fine
>>
>> so it has to be something that changed from 2.6.36->2.6.37 within
>> dm-crypt or other parts that gets stressed and breaks during usage of
>> the "[PATCH v5] dm crypt: scale to multiple CPUs" patch
>>
>> the other included patches surely won't be the cause for that (100%).
>>
>> Filesystem corruption only seems to occur on the / (root) where the
>> system resides -
>
> We need better fault isolation; you've introduced enough change that it
> isn't helping zero in on what your particular problem is.  Milan has
> tested the latest version of the dm-crypt cpu scalability patch quite a
> bit and hasn't seen any corruption -- but clearly the corruption you're
> seeing is a real concern and we need to get to the bottom of it.
>
> I'd rea

Re: dm-crypt barrier support is effective

2010-12-01 Thread Mike Snitzer
On Wed, Dec 01 2010 at 11:05am -0500,
Matt  wrote:

> On Mon, Nov 15, 2010 at 12:24 AM, Matt  wrote:
> > On Sun, Nov 14, 2010 at 10:54 PM, Milan Broz  wrote:
> >> On 11/14/2010 10:49 PM, Matt wrote:
> >>> only with the dm-crypt scaling patch I could observe the data-corruption
> >>
> >> even with v5 I sent on Friday?
> >>
> >> Are you sure that it is not related to some fs problem in 2.6.37-rc1?
> >>
> >> If it works on 2.6.36 without problems, it is probably problems somewhere
> >> else (flush/fua conversion was trivial here - DM is still doing full flush
> >> and there are no other changes in code IMHO.)
> >>
> >> Milan
> >>
> >
> > Hi Milan,
> >
> > I'm aware of your new v5 patch (which should include several
> > improvements (or potential fixes in my case) over the v3 patch)
> >
> > as I already wrote my schedule unfortunately currently doesn't allow
> > me to test it
> >
> > * in the case of no corruption it would be nice to have 2.6.37-rc* running 
> > :)
> >
> > * in the case of data corruption that would mean restoring my system -
> > since it's my production box and right now I don't have a fallback at
> > reach
> > at earliest I could give it a shot at the beginning of December. Then
> > I could also test reiserfs and ext4 as a system partition to rule out
> > that it's
> > a ext4-specific thing (currently I'm running reiserfs on my 
> > system-partition).
> >
> > Thanks !
> >
> > Matt
> >
> 
> 
> OK guys,
> 
> I've updated my system to latest glibc 2.12.1-r3 (on gentoo) and gcc
> hardened 4.5.1-r1 with 1.4 patchset which also uses pie (that one
> should fix problems with graphite)
> 
> not much system changes besides that,
> 
> with those it worked fine with 2.6.36 and I couldn't observe any
> filesystem corruption

So dm-crypt cpu scalability v5 with 2.6.36 worked fine.

> the bad news is: I'm again seeing corruption (!) [on ext4, on the /
> (root) partition]:

...

> ===> so the No.1 trigger of this kind of corruption where files are
> empty, missing or the content gets corrupted (at least for me) is
> compiling software which is part of the system (e.g. emerge -e
> system);
> 
> the system is Gentoo ~amd64; with binutils 2.20.51.0.12 (afaik this
> one has changed from 2.20.51.0.10 to 2.20.51.0.12 from my last
> report); gcc 4.5.1 (Gentoo Hardened 4.5.1-r1 p1.4, pie-0.4.5) <--
> works fine with 2.6.36 and 2.6.36.1
> 
> I'm not sure whether benchmarks would have the same "impact"

Seems this emerge is a good test if it reliably induces the corruption.

> the kernel currently running is 2.6.37-rc4 with the [PATCH v5] dm
> crypt: scale to multiple CPUs
> 
> besides that additional patchsets are applied (I apologize that it's
> not only plain vanilla with the dm-crypt patch):
> * Prevent kswapd dumping excessive amounts of memory in response to
> high-order allocation
> * ext4: coordinate data-only flush requests sent by fsync
> * vmscan: protect executable page from inactive list scan
> * writeback livelock fixes v2

Have you actually experienced any of the issues the above patches are
meant to address?  Seems you're applying patches guessing/hoping
that they'll fix the dm-crypt corruption.

> I originally had hoped that the mentioned patch in "ext4: coordinate
> data-only flush requests sent by fsync", namely: "md: Call
> blk_queue_flush() to establish flush/fua" and additional changes &
> fixes to 2.6.37-rc4 would once and for all fix problems but it didn't

That md patch doesn't help DM at all.  And the ext4 coordination patch
is completely bleeding edge and actually broken (especially as it relates to
DM -- but that breakage is only a concern for request-based DM,
e.g. DM-mpath), anyway see: 
https://www.redhat.com/archives/dm-devel/2010-November/msg00185.html

I'm not sure which patches you're using for the ext4 fsync changes but
please don't use them at all.  It is purely an optimization for
extremely heavy fsync workloads and is only getting in the way at this
point.

> I'm also using the the writeback livelock fixes and the dm-crypt scale
> to multiple CPUs with 2.6.36 so those generally work fine
> 
> so it has to be something that changed from 2.6.36->2.6.37 within
> dm-crypt or other parts that gets stressed and breaks during usage of
> the "[PATCH v5] dm crypt: scale to multiple CPUs" patch
> 
> the other included patches surely won't be the cause for that (100%).
> 
> Filesystem corruption only seems to occur on the / (root) where the
> system resides -

We need better fault isolation; you've introduced enough change that it
isn't helping zero in on what your particular problem is.  Milan has
tested the latest version of the dm-crypt cpu scalability patch quite a
bit and hasn't seen any corruption -- but clearly the corruption you're
seeing is a real concern and we need to get to the bottom of it.

I'd really appreciate it if you could just use Linus' latest linux-2.6
tree plus Milan's latest patch (technically v6 even though it wasn't
labeled as such): https://patchwork.kernel.org/patch/36554

Re: What to do about subvolumes?

2010-12-01 Thread C Anthony Risinger
On Wed, Dec 1, 2010 at 10:38 AM, Hugo Mills  wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
>> === Quotas ===
>>
>> This is a huge topic in and of itself, but Christoph mentioned wanting to 
>> have
>> an idea of what we wanted to do with it, so I'm putting it here.  There are
>> really 2 things here
>>
>> 1) Limiting the size of subvolumes.  This is really easy for us, just create 
>> a
>> subvolume and at creation time set a maximum size it can grow to and not let 
>> it
>> go farther than that.  Nice, simple and straightforward.
>>
>> 2) Normal quotas, via the quota tools.  This just comes down to how do we 
>> want
>> to charge users, do we want to do it per subvolume, or per filesystem.  My 
>> vote
>> is per filesystem.  Obviously this will make it tricky with snapshots, but I
>> think if we're just charging the diff's between the original volume and the
>> snapshot to the user then that will be the easiest for people to understand,
>> rather than making a snapshot all of a sudden count the users currently used
>> quota * 2.
>
>   This is going to be tricky to get the semantics right, I suspect.
>
>   Say you've created a subvolume, A, containing 10G of Useful Stuff
> (say, a base image for VMs). This counts 10G against your quota. Now,
> I come along and snapshot that subvolume (as a writable subvolume) --
> call it B. This is essentially free for me, because I've got a COW
> copy of your subvolume (and the original counts against your quota).
>
>   If I now modify a file in subvolume B, the full modified section
> goes onto my quota. This is all well and good. But what happens if you
> delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
> files.  Worse, what happens if someone else had made a snapshot of A,
> too? Who gets the 10G added to their quota, me or them? What if I'd
> filled up my quota? Would that stop you from deleting your copy,
> because my copy can't be charged against my quota? Would I just end up
> unexpectedly 10G over quota?
>
>   This is a whole gigantic can of worms, as far as I can see, and I
> don't think it's going to be possible to implement quotas, even on a
> filesystem level, until there's some good and functional model for
> dealing with all the implications of COW copies. :(

i'd expect that as a separate user, you should both be whacked 10G.
imo, the whole benefit of transparent COW is to the administrator's
advantage, thus i would even think the _uncompressed_ volume size
would go against quota (which could possibly be artificially inflated
to account for the space saving of compression).  users just need a
nice steadily predictable number to monitor.

though maybe these users could be grouped, such that the COW'ed
portions of the files they share are balanced across each user's quota,
but this would have to be a sort of "opt in" thing, else you get
wild fluctuations because of other users' actions.  additionally, some
users could be marked as "system", where COW'ing their subvol results
in 0 quota -- you only pay for what you change -- but if the system
subvol gets removed, then you pay for it all.  in this way you would
have to keep reusing system subvols to get any advantage as a regular
user.

i dont know the existing systems though so i dont know what it would
take to do such balancing.

C Anthony


Re: What to do about subvolumes?

2010-12-01 Thread Mike Hommey
On Wed, Dec 01, 2010 at 04:38:00PM +, Hugo Mills wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > === Quotas ===
> > 
> > This is a huge topic in and of itself, but Christoph mentioned wanting to 
> > have
> > an idea of what we wanted to do with it, so I'm putting it here.  There are
> > really 2 things here
> > 
> > 1) Limiting the size of subvolumes.  This is really easy for us, just 
> > create a
> > subvolume and at creation time set a maximum size it can grow to and not 
> > let it
> > go farther than that.  Nice, simple and straightforward.
> > 
> > 2) Normal quotas, via the quota tools.  This just comes down to how do we 
> > want
> > to charge users, do we want to do it per subvolume, or per filesystem.  My 
> > vote
> > is per filesystem.  Obviously this will make it tricky with snapshots, but I
> > think if we're just charging the diffs between the original volume and the
> > snapshot to the user then that will be the easiest for people to understand,
> > rather than making a snapshot all of a sudden count the user's currently used
> > quota * 2.
> 
>This is going to be tricky to get the semantics right, I suspect.
> 
>Say you've created a subvolume, A, containing 10G of Useful Stuff
> (say, a base image for VMs). This counts 10G against your quota. Now,
> I come along and snapshot that subvolume (as a writable subvolume) --
> call it B. This is essentially free for me, because I've got a COW
> copy of your subvolume (and the original counts against your quota).
> 
>If I now modify a file in subvolume B, the full modified section
> goes onto my quota. This is all well and good. But what happens if you
> delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
> files.  Worse, what happens if someone else had made a snapshot of A,
> too? Who gets the 10G added to their quota, me or them? What if I'd
> filled up my quota? Would that stop you from deleting your copy,
> because my copy can't be charged against my quota? Would I just end up
> unexpectedly 10G over quota?
> 
>This is a whole gigantic can of worms, as far as I can see, and I
> don't think it's going to be possible to implement quotas, even on a
> filesystem level, until there's some good and functional model for
> dealing with all the implications of COW copies. :(

In your case, it would sound fair that everyone is "simply" charged 10G.
What Josef is referring to would probably only apply to volumes and
snapshots owned by the same user: If I have a subvolume of 10G, and a
snapshot of it where I only changed 1G, the charged quota would be 11G,
not 20G.

Mike


Re: What to do about subvolumes?

2010-12-01 Thread Gordan Bobic

Hugo Mills wrote:

On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:

=== Quotas ===

This is a huge topic in and of itself, but Christoph mentioned wanting to have
an idea of what we wanted to do with it, so I'm putting it here.  There are
really 2 things here

1) Limiting the size of subvolumes.  This is really easy for us, just create a
subvolume and at creation time set a maximum size it can grow to and not let it
go farther than that.  Nice, simple and straightforward.

2) Normal quotas, via the quota tools.  This just comes down to how do we want
to charge users, do we want to do it per subvolume, or per filesystem.  My vote
is per filesystem.  Obviously this will make it tricky with snapshots, but I
think if we're just charging the diffs between the original volume and the
snapshot to the user then that will be the easiest for people to understand,
rather than making a snapshot all of a sudden count the user's currently used
quota * 2.


   This is going to be tricky to get the semantics right, I suspect.

   Say you've created a subvolume, A, containing 10G of Useful Stuff
(say, a base image for VMs). This counts 10G against your quota. Now,
I come along and snapshot that subvolume (as a writable subvolume) --
call it B. This is essentially free for me, because I've got a COW
copy of your subvolume (and the original counts against your quota).

   If I now modify a file in subvolume B, the full modified section
goes onto my quota. This is all well and good. But what happens if you
delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
files.  Worse, what happens if someone else had made a snapshot of A,
too? Who gets the 10G added to their quota, me or them? What if I'd
filled up my quota? Would that stop you from deleting your copy,
because my copy can't be charged against my quota? Would I just end up
unexpectedly 10G over quota?

   This is a whole gigantic can of worms, as far as I can see, and I
don't think it's going to be possible to implement quotas, even on a
filesystem level, until there's some good and functional model for
dealing with all the implications of COW copies. :(


I would argue that a simple and probably correct solution is to have the 
files count toward the quota of everyone who has a COW copy. i.e. if I 
have a volume A and you make a snapshot B, the du content of B should 
count toward your quota as well, rather than being "free". I don't see 
any reason why this would not be the correct and intuitive way to do it. 
Simply treat it as you would transparent block-level deduplication.


Gordan


disk space caching generation mismatch

2010-12-01 Thread Johannes Hirte
After enabling disk space caching I've observed several log entries like this:

btrfs: free space inode generation (0) did not match free space cache 
generation (169594) for block group 15464398848

I'm not sure, but it seems this happens on every reboot. Is this something to
worry about?

regards,
  Johannes


Re: What to do about subvolumes?

2010-12-01 Thread Hugo Mills
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> === Quotas ===
> 
> This is a huge topic in and of itself, but Christoph mentioned wanting to have
> an idea of what we wanted to do with it, so I'm putting it here.  There are
> really 2 things here
> 
> 1) Limiting the size of subvolumes.  This is really easy for us, just create a
> subvolume and at creation time set a maximum size it can grow to and not let 
> it
> go farther than that.  Nice, simple and straightforward.
> 
> 2) Normal quotas, via the quota tools.  This just comes down to how do we want
> to charge users, do we want to do it per subvolume, or per filesystem.  My 
> vote
> is per filesystem.  Obviously this will make it tricky with snapshots, but I
> think if we're just charging the diffs between the original volume and the
> snapshot to the user then that will be the easiest for people to understand,
> rather than making a snapshot all of a sudden count the user's currently used
> quota * 2.

   This is going to be tricky to get the semantics right, I suspect.

   Say you've created a subvolume, A, containing 10G of Useful Stuff
(say, a base image for VMs). This counts 10G against your quota. Now,
I come along and snapshot that subvolume (as a writable subvolume) --
call it B. This is essentially free for me, because I've got a COW
copy of your subvolume (and the original counts against your quota).

   If I now modify a file in subvolume B, the full modified section
goes onto my quota. This is all well and good. But what happens if you
delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
files.  Worse, what happens if someone else had made a snapshot of A,
too? Who gets the 10G added to their quota, me or them? What if I'd
filled up my quota? Would that stop you from deleting your copy,
because my copy can't be charged against my quota? Would I just end up
unexpectedly 10G over quota?

   This is a whole gigantic can of worms, as far as I can see, and I
don't think it's going to be possible to implement quotas, even on a
filesystem level, until there's some good and functional model for
dealing with all the implications of COW copies. :(

   Hugo.

-- 
=== Hugo Mills: h...@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- I believe that it's closely correlated with ---   
   the aeroswine coefficient.




Re: What to do about subvolumes?

2010-12-01 Thread Mike Hommey
On Wed, Dec 01, 2010 at 11:01:37AM -0500, Chris Mason wrote:
> Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> > On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik  wrote:
> > >
> > > === How do we want subvolumes to work from a user perspective? ===
> > >
> > > 1) Users need to be able to create their own subvolumes.  The permission
> > > semantics will be absolutely the same as creating directories, so I don't 
> > > think
> > > this is too tricky.  We want this because you can only take snapshots of
> > > subvolumes, and so it is important that users be able to create their own
> > > discrete snapshottable targets.
> > >
> > > 2) Users need to be able to snapshot their subvolumes.  This is 
> > > basically the
> > > same as #1, but it bears repeating.
> > 
> > could it be possible to convert a directory into a volume?  or at
> > least base a snapshot off it?
> 
> I'm afraid this turns into the same complexity as creating a new volume
> and copying all the files/dirs in by hand.

Except you wouldn't have to copy data, only metadata.

Mike


Re: What to do about subvolumes?

2010-12-01 Thread Chris Mason
Excerpts from C Anthony Risinger's message of 2010-12-01 11:03:23 -0500:
> On Wed, Dec 1, 2010 at 10:01 AM, Chris Mason  wrote:
> > Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> >> On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik  wrote:
> >> >
> >> > === How do we want subvolumes to work from a user perspective? ===
> >> >
> >> > 1) Users need to be able to create their own subvolumes.  The permission
> >> > semantics will be absolutely the same as creating directories, so I 
> >> > don't think
> >> > this is too tricky.  We want this because you can only take snapshots of
> >> > subvolumes, and so it is important that users be able to create their own
> >> > discrete snapshottable targets.
> >> >
> >> > 2) Users need to be able to snapshot their subvolumes.  This is 
> >> > basically the
> >> > same as #1, but it bears repeating.
> >>
> >> could it be possible to convert a directory into a volume?  or at
> >> least base a snapshot off it?
> >
> > I'm afraid this turns into the same complexity as creating a new volume
> > and copying all the files/dirs in by hand.
> 
> ok; if i create an empty volume, and use cp --reflink, it would have
> the desired effect though, right?

Almost, for no good reason at all our cp --reflink doesn't reflink
across subvols.  I'll get that fixed up.

-chris
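
For reference, the reflink copy that cp --reflink performs on btrfs boils
down to a single clone ioctl, which makes the destination share the source
file's extents instead of copying the data. A minimal userspace sketch,
assuming the ioctl number from the btrfs ABI (0x94/9 -- verify against your
ioctl.h before relying on it):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define BTRFS_IOCTL_MAGIC       0x94
#define BTRFS_IOC_CLONE         _IOW(BTRFS_IOCTL_MAGIC, 9, int)

/* make dst share src's extents instead of copying the data */
static int reflink_file(const char *src, const char *dst)
{
        int s = open(src, O_RDONLY);
        int d = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        int ret = -1;

        if (s >= 0 && d >= 0)
                ret = ioctl(d, BTRFS_IOC_CLONE, s);
        if (s >= 0)
                close(s);
        if (d >= 0)
                close(d);
        return ret;
}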


Re: dm-crypt barrier support is effective

2010-12-01 Thread Matt
On Mon, Nov 15, 2010 at 12:24 AM, Matt  wrote:
> On Sun, Nov 14, 2010 at 10:54 PM, Milan Broz  wrote:
>> On 11/14/2010 10:49 PM, Matt wrote:
>>> only with the dm-crypt scaling patch I could observe the data-corruption
>>
>> even with v5 I sent on Friday?
>>
>> Are you sure that it is not related to some fs problem in 2.6.37-rc1?
>>
>> If it works on 2.6.36 without problems, it is probably problems somewhere
>> else (flush/fua conversion was trivial here - DM is still doing full flush
>> and there are no other changes in code IMHO.)
>>
>> Milan
>>
>
> Hi Milan,
>
> I'm aware of your new v5 patch (which should include several
> improvements (or potential fixes in my case) over the v3 patch)
>
> as I already wrote my schedule unfortunately currently doesn't allow
> me to test it
>
> * in the case of no corruption it would be nice to have 2.6.37-rc* running :)
>
> * in the case of data corruption that would mean restoring my system -
> since it's my production box and right now I don't have a fallback at
> reach
> at earliest I could give it a shot at the beginning of December. Then
> I could also test reiserfs and ext4 as a system partition to rule out
> that it's
> a ext4-specific thing (currently I'm running reiserfs on my system-partition).
>
> Thanks !
>
> Matt
>


OK guys,

I've updated my system to latest glibc 2.12.1-r3 (on gentoo) and gcc
hardened 4.5.1-r1 with 1.4 patchset which also uses pie (that one
should fix problems with graphite)

not much system changes besides that,

with those it worked fine with 2.6.36 and I couldn't observe any
filesystem corruption



the bad news is: I'm again seeing corruption (!) [on ext4, on the /
(root) partition]:

I was re-emerging/re-installing stuff - pretty trivial stuff actually
(which worked fine in the past): emerging gnome-base programs (gconf,
librsvg, nautilus, gnome-mount, gnome-vfs, gvfs, imagemagick,
xine-lib) and some others: terminal (from xfce), vtwm, rman, vala
(library), xclock, xload, atk, gtk+, vte

during that I noticed some corruption and programs kept failing to
configure/compile, saying that g++ was missing; I re-extracted gcc
(of which I had previously made a backup tarball); that seemed to help
for some time until programs again failed with some corrupted files
from gcc

so I re-emerged gcc (compiling it) and after it had finished, the same
error occurred that I had already written about in a previous email:
the content of /etc/env.d/03opengl got corrupted - but NOT the whole file:

normally it's
# Configuration file for eselect
# This file has been automatically generated.
LDPATH=
OPENGL_PROFILE=
<-- where the path to the graphics-drivers and the opengl profile is written;

in this case of the corruption the file contained only garbage symbols


I have no clue how this file could be connected with gcc


===> so the No.1 trigger of this kind of corruption - where files are
empty, missing, or have their content corrupted - (at least for me) is
compiling software which is part of the system (e.g. emerge -e
system);

the system is Gentoo ~amd64; with binutils 2.20.51.0.12 (afaik this
one has changed from 2.20.51.0.10 to 2.20.51.0.12 from my last
report); gcc 4.5.1 (Gentoo Hardened 4.5.1-r1 p1.4, pie-0.4.5) <--
works fine with 2.6.36 and 2.6.36.1

I'm not sure whether benchmarks would have the same "impact"



the kernel currently running is 2.6.37-rc4 with the [PATCH v5] dm
crypt: scale to multiple CPUs

besides that additional patchsets are applied (I apologize that it's
not only plain vanilla with the dm-crypt patch):
* Prevent kswapd dumping excessive amounts of memory in response to
high-order allocation
* ext4: coordinate data-only flush requests sent by fsync
* vmscan: protect executable page from inactive list scan
* writeback livelock fixes v2

I originally had hoped that the mentioned patch in "ext4: coordinate
data-only flush requests sent by fsync", namely: "md: Call
blk_queue_flush() to establish flush/fua" and additional changes &
fixes to 2.6.37-rc4 would once and for all fix problems but it didn't

I'm also using the writeback livelock fixes and the dm-crypt scale
to multiple CPUs with 2.6.36 so those generally work fine

so it has to be something that changed from 2.6.36->2.6.37 within
dm-crypt or other parts that gets stressed and breaks during usage of
the "[PATCH v5] dm crypt: scale to multiple CPUs" patch

the other included patches surely won't be the cause for that (100%).

Filesystem corruption only seems to occur on the / (root) where the
system resides -

Fortunately I haven't encountered any corruption on my /home partition
which also uses ext4 and during rsync'ing from /home to other data
partitions with ext4 and xfs (I don't want to try to seriously corrupt
any of my data so I played it safe from the beginning and didn't use
anything heavy such as virtualmachines, etc.) - browsing the web,
using firefox & chromium, amarok, etc. worked fine so far

the system is in a pretty "new" state - which means I extracted it
from a tarball out o

Re: What to do about subvolumes?

2010-12-01 Thread C Anthony Risinger
On Wed, Dec 1, 2010 at 10:01 AM, Chris Mason  wrote:
> Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
>> On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik  wrote:
>> >
>> > === How do we want subvolumes to work from a user perspective? ===
>> >
>> > 1) Users need to be able to create their own subvolumes.  The permission
>> > semantics will be absolutely the same as creating directories, so I don't 
>> > think
>> > this is too tricky.  We want this because you can only take snapshots of
>> > subvolumes, and so it is important that users be able to create their own
>> > discrete snapshottable targets.
>> >
>> > 2) Users need to be able to snapshot their subvolumes.  This is basically 
>> > the
>> > same as #1, but it bears repeating.
>>
>> could it be possible to convert a directory into a volume?  or at
>> least base a snapshot off it?
>
> I'm afraid this turns into the same complexity as creating a new volume
> and copying all the files/dirs in by hand.

ok; if i create an empty volume, and use cp --reflink, it would have
the desired effect though, right?

C Anthony


Re: What to do about subvolumes?

2010-12-01 Thread Chris Mason
Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik  wrote:
> >
> > === How do we want subvolumes to work from a user perspective? ===
> >
> > 1) Users need to be able to create their own subvolumes.  The permission
> > semantics will be absolutely the same as creating directories, so I don't 
> > think
> > this is too tricky.  We want this because you can only take snapshots of
> > subvolumes, and so it is important that users be able to create their own
> > discrete snapshottable targets.
> >
> > 2) Users need to be able to snapshot their subvolumes.  This is basically 
> > the
> > same as #1, but it bears repeating.
> 
> could it be possible to convert a directory into a volume?  or at
> least base a snapshot off it?

I'm afraid this turns into the same complexity as creating a new volume
and copying all the files/dirs in by hand.

-chris


Re: What to do about subvolumes?

2010-12-01 Thread Chris Mason
Excerpts from Josef Bacik's message of 2010-12-01 09:21:36 -0500:
> Hello,
> 
> Various people have complained about how BTRFS deals with subvolumes recently,
> specifically the fact that they all have the same inode number, and there's no
> discrete seperation from one subvolume to another.  Christoph asked that I lay
> out a basic design document of how we want subvolumes to work so we can hash
> everything out now, fix what is broken, and then move forward with a design 
> that
> everybody is more or less happy with.  I apologize in advance for how freaking
> long this email is going to be.  I assume that most people are generally
> familiar with how BTRFS works, so I'm not going to bother explaining in great
> detail some stuff.

Thanks for writing this up.

> === What do we do? ===
> 
> This is where I expect to see the most discussion.  Here is what I want to do
> 
> 1) Scrap the 256 inode number thing.  Instead we'll just put a flag in the 
> inode
> to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
> that way.  This unfortunately will be an incompatible format change, but the
> sooner we get this addressed the easier it will be in the long run.  Obviously
> when I say format change I mean via the incompat bits we have, so old fs's 
> won't
> be broken and such.

If they don't have inode number 256, what inode number do they have?
I'm assuming you mean the subvolume is given an inode number in the
parent directory just like any other dir,  but this doesn't get rid of
the duplicate inode problem.  I think it ends up making it less clear,
but I'm open to suggestions ;)

We could give each subvol a different devt, which is something Christoph
had asked about as well.

-chris


Re: What to do about subvolumes?

2010-12-01 Thread C Anthony Risinger
On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik  wrote:
>
> === How do we want subvolumes to work from a user perspective? ===
>
> 1) Users need to be able to create their own subvolumes.  The permission
> semantics will be absolutely the same as creating directories, so I don't 
> think
> this is too tricky.  We want this because you can only take snapshots of
> subvolumes, and so it is important that users be able to create their own
> discrete snapshottable targets.
>
> 2) Users need to be able to snapshot their subvolumes.  This is basically the
> same as #1, but it bears repeating.

could it be possible to convert a directory into a volume?  or at
least base a snapshot off it?

C Anthony


Re: What to do about subvolumes?

2010-12-01 Thread Mike Hommey
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> 1) Users need to be able to create their own subvolumes.  The permission
> semantics will be absolutely the same as creating directories, so I don't 
> think
> this is too tricky.  We want this because you can only take snapshots of
> subvolumes, and so it is important that users be able to create their own
> discrete snapshottable targets.
> 
> 2) Users need to be able to snapshot their subvolumes.  This is basically the
> same as #1, but it bears repeating.
> 
> 3) Subvolumes shouldn't need to be specifically mounted.  This is also
> important, we don't want users to have to go around mounting their subvolumes 
> up
> manually one-by-one.  Today users just cd into subvolumes and it works, just
> like cd'ing into a directory.

It would be helpful to be able to create subvolumes off existing
directories, instead of creating a subvolume and having to copy all the
data around.

Mike


What to do about subvolumes?

2010-12-01 Thread Josef Bacik
Hello,

Various people have complained about how BTRFS deals with subvolumes recently,
specifically the fact that they all have the same inode number, and there's no
discrete separation from one subvolume to another.  Christoph asked that I lay
out a basic design document of how we want subvolumes to work so we can hash
everything out now, fix what is broken, and then move forward with a design that
everybody is more or less happy with.  I apologize in advance for how freaking
long this email is going to be.  I assume that most people are generally
familiar with how BTRFS works, so I'm not going to bother explaining in great
detail some stuff.

=== What are subvolumes? ===

They are just another tree.  In BTRFS we have various b-trees to describe the
filesystem.  A few of them are filesystem wide, such as the extent tree, chunk
tree, root tree, etc.  The trees that hold the actual filesystem data, that is
inodes and such, are kept in their own b-tree.  This is how subvolumes and
snapshots appear on disk, they are simply new b-trees with all of the file data
contained within them.

=== What do subvolumes look like? ===

All the user sees are directories.  They act like any other directory acts, with
a few exceptions

1) You cannot hardlink between subvolumes.  This is because subvolumes have
their own inode numbers and such, think of them as separate mounts in this case,
you cannot hardlink between two mounts because the link needs to point to the
same on disk inode, which is impossible between two different filesystems.  The
same is true for subvolumes, they have their own trees with their own inodes and
inode numbers, so it's impossible to hardlink between them.

1a) In case it wasn't clear from above, each subvolume has its own inode
numbers, so you can have the same inode numbers used between two different
subvolumes, since they are two different trees.

2) Obviously you can't just rm -rf subvolumes.  Because they are roots, there's
extra metadata to keep track of them, so you have to use one of our ioctls to
delete subvolumes/snapshots.

But permissions and everything else they are the same.
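
To illustrate what "use one of our ioctls" means in practice, here is a
minimal userspace sketch of deleting a subvolume or snapshot. The struct
layout and ioctl number are copied from the btrfs ioctl ABI as of this
writing; double-check them against your ioctl.h rather than trusting this
sketch:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/types.h>

#define BTRFS_IOCTL_MAGIC       0x94
#define BTRFS_PATH_NAME_MAX     4087
struct btrfs_ioctl_vol_args {
        __s64   fd;                             /* unused for destroy */
        char    name[BTRFS_PATH_NAME_MAX + 1];
};
#define BTRFS_IOC_SNAP_DESTROY \
        _IOW(BTRFS_IOCTL_MAGIC, 15, struct btrfs_ioctl_vol_args)

/* delete the subvolume/snapshot "name" living in the directory "parent" */
static int subvol_destroy(const char *parent, const char *name)
{
        struct btrfs_ioctl_vol_args args = { .fd = 0 };
        int ret, fd = open(parent, O_RDONLY);

        if (fd < 0)
                return -1;
        strncpy(args.name, name, BTRFS_PATH_NAME_MAX);
        ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY, &args);
        if (ret < 0)
                perror("BTRFS_IOC_SNAP_DESTROY");
        close(fd);
        return ret;
}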

There is one tricky thing.  When you create a subvolume, the directory inode
that is created in the parent subvolume has the inode number of 256.  So if you
have a bunch of subvolumes in the same parent subvolume, you are going to have a
bunch of directories with the inode number of 256.  This is so when users cd
into a subvolume we can know it's a subvolume and do all the normal voodoo to
start looking in the subvolumes tree instead of the parent subvolumes tree.

This is where things go a bit sideways.  We had serious problems with NFS, but
thankfully NFS gives us a bunch of hooks to get around these problems.
CIFS/Samba do not, so we will have problems there, not to mention any other
userspace application that looks at inode numbers.
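
To make the problem concrete, a tiny sketch (paths invented) of what such an
application sees today: both subvolume roots report st_ino 256 while sharing
one st_dev, so the classic "same dev and ino means same file" assumption
misfires:

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
        struct stat a, b;

        if (stat("/mnt/btrfs/subvol_a", &a) || stat("/mnt/btrfs/subvol_b", &b))
                return 1;
        /* both lines typically print ino=256 with the same dev */
        printf("a: dev=%llu ino=%llu\n",
               (unsigned long long)a.st_dev, (unsigned long long)a.st_ino);
        printf("b: dev=%llu ino=%llu\n",
               (unsigned long long)b.st_dev, (unsigned long long)b.st_ino);
        return 0;
}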

=== How do we want subvolumes to work from a user perspective? ===

1) Users need to be able to create their own subvolumes.  The permission
semantics will be absolutely the same as creating directories, so I don't think
this is too tricky.  We want this because you can only take snapshots of
subvolumes, and so it is important that users be able to create their own
discrete snapshottable targets.

2) Users need to be able to snapshot their subvolumes.  This is basically the
same as #1, but it bears repeating.

3) Subvolumes shouldn't need to be specifically mounted.  This is also
important, we don't want users to have to go around mounting their subvolumes up
manually one-by-one.  Today users just cd into subvolumes and it works, just
like cd'ing into a directory.
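
For completeness, 1) and 2) map onto the existing subvolume-create and
snapshot-create ioctls. A sketch reusing the struct btrfs_ioctl_vol_args
declaration from the deletion example above (again, verify the constants
against your headers before relying on them):

#define BTRFS_IOC_SNAP_CREATE \
        _IOW(BTRFS_IOCTL_MAGIC, 1, struct btrfs_ioctl_vol_args)
#define BTRFS_IOC_SUBVOL_CREATE \
        _IOW(BTRFS_IOCTL_MAGIC, 14, struct btrfs_ioctl_vol_args)

/* create an empty subvolume "name" inside the directory open at parent_fd */
static int subvol_create(int parent_fd, const char *name)
{
        struct btrfs_ioctl_vol_args args = { .fd = 0 };

        strncpy(args.name, name, BTRFS_PATH_NAME_MAX);
        return ioctl(parent_fd, BTRFS_IOC_SUBVOL_CREATE, &args);
}

/* snapshot the subvolume open at src_fd as "name" inside parent_fd */
static int snap_create(int parent_fd, int src_fd, const char *name)
{
        struct btrfs_ioctl_vol_args args = { .fd = src_fd };

        strncpy(args.name, name, BTRFS_PATH_NAME_MAX);
        return ioctl(parent_fd, BTRFS_IOC_SNAP_CREATE, &args);
}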

=== Quotas ===

This is a huge topic in and of itself, but Christoph mentioned wanting to have
an idea of what we wanted to do with it, so I'm putting it here.  There are
really 2 things here

1) Limiting the size of subvolumes.  This is really easy for us, just create a
subvolume and at creation time set a maximum size it can grow to and not let it
go farther than that.  Nice, simple and straightforward.

2) Normal quotas, via the quota tools.  This just comes down to how do we want
to charge users, do we want to do it per subvolume, or per filesystem.  My vote
is per filesystem.  Obviously this will make it tricky with snapshots, but I
think if we're just charging the diffs between the original volume and the
snapshot to the user then that will be the easiest for people to understand,
rather than making a snapshot all of a sudden count the user's currently used
quota * 2.

=== What do we do? ===

This is where I expect to see the most discussion.  Here is what I want to do

1) Scrap the 256 inode number thing.  Instead we'll just put a flag in the inode
to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
that way.  This unfortunately will be an incompatible format change, but the
sooner we get this addressed the easier it will be in the long run.  Obviously
when I say format change I mean via 

kernel BUG at fs/btrfs/inode.c:806

2010-12-01 Thread Johannes Hirte
On one of my machines with btrfs I got this bug:

entry offset 29085974528, bytes 4096, bitmap no
entry offset 29162995712, bytes 20480, bitmap yes
entry offset 29171744768, bytes 4096, bitmap no
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
block group 29834084352 has 1073741824 bytes, 1072648192 used 0 pinned 0 
reserved
entry offset 29834084352, bytes 376832, bitmap yes
entry offset 29890392064, bytes 4096, bitmap no
entry offset 29895069696, bytes 4096, bitmap no
entry offset 29896048640, bytes 4096, bitmap no
entry offset 29896364032, bytes 4096, bitmap no
entry offset 29896482816, bytes 4096, bitmap no
entry offset 29905817600, bytes 4096, bitmap no
entry offset 29906878464, bytes 4096, bitmap no
entry offset 29908029440, bytes 4096, bitmap no
entry offset 29908418560, bytes 4096, bitmap no
entry offset 29910061056, bytes 4096, bitmap no
entry offset 29911105536, bytes 4096, bitmap no
entry offset 29912371200, bytes 4096, bitmap no
entry offset 29912748032, bytes 4096, bitmap no
entry offset 29914660864, bytes 4096, bitmap no
entry offset 29914755072, bytes 4096, bitmap no
entry offset 29915865088, bytes 4096, bitmap no
entry offset 29915914240, bytes 4096, bitmap no
entry offset 29916409856, bytes 4096, bitmap no
entry offset 29916471296, bytes 4096, bitmap no
entry offset 29924597760, bytes 4096, bitmap no
entry offset 29931642880, bytes 4096, bitmap no
entry offset 29931925504, bytes 4096, bitmap no
entry offset 29932732416, bytes 4096, bitmap no
entry offset 29933383680, bytes 4096, bitmap no
entry offset 29933412352, bytes 4096, bitmap no
entry offset 29933596672, bytes 4096, bitmap no
entry offset 29935316992, bytes 4096, bitmap no
entry offset 29938610176, bytes 4096, bitmap no
entry offset 29939154944, bytes 4096, bitmap no
entry offset 29944033280, bytes 4096, bitmap no
entry offset 29946318848, bytes 4096, bitmap no
entry offset 29964181504, bytes 4096, bitmap no
entry offset 29964828672, bytes 4096, bitmap no
entry offset 29966233600, bytes 4096, bitmap no
entry offset 29968302080, bytes 98304, bitmap yes
entry offset 29983170560, bytes 4096, bitmap no
entry offset 29984059392, bytes 4096, bitmap no
entry offset 29992976384, bytes 4096, bitmap no
entry offset 30008422400, bytes 4096, bitmap no
entry offset 30025895936, bytes 4096, bitmap no
entry offset 30034280448, bytes 4096, bitmap no
entry offset 30055174144, bytes 4096, bitmap no
entry offset 30067208192, bytes 4096, bitmap no
entry offset 30094012416, bytes 4096, bitmap no
entry offset 30098358272, bytes 4096, bitmap no
entry offset 30098722816, bytes 4096, bitmap no
entry offset 30102491136, bytes 4096, bitmap no
entry offset 30102519808, bytes 143360, bitmap yes
entry offset 30103207936, bytes 4096, bitmap no
entry offset 30103601152, bytes 4096, bitmap no
entry offset 30105415680, bytes 4096, bitmap no
entry offset 30112169984, bytes 4096, bitmap no
entry offset 30139326464, bytes 4096, bitmap no
entry offset 30173143040, bytes 4096, bitmap no
entry offset 30176014336, bytes 4096, bitmap no
entry offset 30202048512, bytes 4096, bitmap no
entry offset 30229487616, bytes 4096, bitmap no
entry offset 30230700032, bytes 4096, bitmap no
entry offset 30230777856, bytes 4096, bitmap no
entry offset 30232813568, bytes 4096, bitmap no
entry offset 30235348992, bytes 4096, bitmap no
entry offset 30236737536, bytes 49152, bitmap yes
entry offset 30241488896, bytes 4096, bitmap no
entry offset 30252662784, bytes 4096, bitmap no
entry offset 30370955264, bytes 49152, bitmap yes
entry offset 30425870336, bytes 4096, bitmap no
entry offset 30505172992, bytes 61440, bitmap yes
entry offset 30507831296, bytes 4096, bitmap no
entry offset 30639390720, bytes 8192, bitmap yes
entry offset 30760058880, bytes 4096, bitmap no
entry offset 30773608448, bytes 45056, bitmap yes
block group has cluster?: no
3 blocks of free space at or bigger than bytes is
block group 30907826176 has 536870912 bytes, 533860352 used 0 pinned 0 reserved
entry offset 30907826176, bytes 1441792, bitmap yes
entry offset 31042043904, bytes 995328, bitmap yes
entry offset 31176261632, bytes 212992, bitmap yes
entry offset 31310479360, bytes 8192, bitmap yes
block group has cluster?: no
3 blocks of free space at or bigger than bytes is
block group 31444697088 has 268435456 bytes, 266985472 used 0 pinned 0 reserved
entry offset 31444697088, bytes 1298432, bitmap yes
entry offset 31578914816, bytes 151552, bitmap yes
block group has cluster?: no
2 blocks of free space at or bigger than bytes is
block group 31713132544 has 268435456 bytes, 267300864 used 0 pinned 0 reserved
entry offset 31713132544, bytes 1093632, bitmap yes
entry offset 31847350272, bytes 40960, bitmap yes
block group has cluster?: no
1 blocks of free space at or bigger than bytes is
block group 31981568000 has 268435456 bytes, 268029952 used 0 pinned 0 reserved
entry offset 31981568000, bytes 360448, bitmap yes
entry offset 32115785728, bytes 45056, bitmap yes
block group has cluster

Fsck, parent transid verify failed

2010-12-01 Thread Tommy Jonsson
Hi folks!

Been using btrfs for quite a while now, worked great until now. 
Got power-loss on my machine and now i have the "parent transid verify
failed on X wanted X found X" problem.
So I can't get it to mount.

My btrfs is spread over sda (2tb), sdc(2tb), sdd(1tb).

Is this something that an offline fsck could fix?
If so, is the fsck util being developed?
Is there a way to mount the FS in read-only mode or something to rescue
the data?

Thanks, Tommy.




[RFC PATCH 4/4 v2] Btrfs: deal with filesystem state at mount, umount

2010-12-01 Thread liubo
Since there is now a filesystem state, we should deal with it carefully at
mount, umount and remount.

- At mount, the FS state should be checked for errors; if any are found,
  running btrfsck is recommended.
- At umount, the FS state should be saved to disk for consistency.

Signed-off-by: Liu Bo 
---
 fs/btrfs/disk-io.c |   47 ++-
 1 files changed, 46 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b40dfe4..663d360 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -43,6 +43,8 @@
 static struct extent_io_ops btree_extent_io_ops;
 static void end_workqueue_fn(struct btrfs_work *work);
 static void free_fs_root(struct btrfs_root *root);
+static void btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
+int read_only);
 
 /*
  * end_io_wq structs are used to do processing in task context when an IO is
@@ -1700,6 +1702,11 @@ struct btrfs_root *open_ctree(struct super_block *sb,
if (!btrfs_super_root(disk_super))
goto fail_iput;
 
+   /* check filesystem state */
+   fs_info->fs_state |= btrfs_super_flags(disk_super);
+
+   btrfs_check_super_valid(fs_info, sb->s_flags & MS_RDONLY);
+
ret = btrfs_parse_options(tree_root, options);
if (ret) {
err = ret;
@@ -2405,10 +2412,17 @@ int btrfs_commit_super(struct btrfs_root *root)
up_write(&root->fs_info->cleanup_work_sem);
 
trans = btrfs_join_transaction(root, 1);
+   if (IS_ERR(trans))
+   return PTR_ERR(trans);
+
ret = btrfs_commit_transaction(trans, root);
BUG_ON(ret);
+
/* run commit again to drop the original snapshot */
trans = btrfs_join_transaction(root, 1);
+   if (IS_ERR(trans))
+   return PTR_ERR(trans);
+
btrfs_commit_transaction(trans, root);
ret = btrfs_write_and_wait_transaction(NULL, root);
BUG_ON(ret);
@@ -2426,8 +2440,28 @@ int close_ctree(struct btrfs_root *root)
smp_mb();
 
btrfs_put_block_group_cache(fs_info);
+
+   /*
+* There are two situations when btrfs flips readonly:
+*
+* 1. btrfs flips readonly somewhere else before
+* btrfs_commit_super: sb->s_flags has the MS_RDONLY flag,
+* so btrfs skips btrfs_commit_super and writes the sb
+* directly below to keep the ERROR state on disk.
+*
+* 2. btrfs flips readonly just in btrfs_commit_super:
+* in that case btrfs cannot write the sb via
+* btrfs_commit_super, but since fs_state has the
+* BTRFS_SUPER_FLAG_ERROR flag set, btrfs will write
+* the sb directly.
+*/
if (!(fs_info->sb->s_flags & MS_RDONLY)) {
-   ret =  btrfs_commit_super(root);
+   ret = btrfs_commit_super(root);
+   if (ret)
+   printk(KERN_ERR "btrfs: commit super ret %d\n", ret);
+   }
+
+   if (fs_info->fs_state & BTRFS_SUPER_FLAG_ERROR) {
+   ret = write_ctree_super(NULL, root, 0);
if (ret)
printk(KERN_ERR "btrfs: commit super ret %d\n", ret);
}
@@ -2603,6 +2637,17 @@ out:
return 0;
 }
 
+static void btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
+ int read_only)
+{
+   if (read_only)
+   return;
+
+   if (fs_info->fs_state & BTRFS_SUPER_FLAG_ERROR)
+   printk(KERN_WARNING "warning: mount fs with errors, "
+  "running btrfsck is recommended\n");
+}
+
 static struct extent_io_ops btree_extent_io_ops = {
.write_cache_pages_lock_hook = btree_lock_page_hook,
.readpage_end_io_hook = btree_readpage_end_io_hook,
-- 
1.7.1


[RFC PATCH 3/4 v2] Btrfs: add readonly support for error handle

2010-12-01 Thread liubo
This patch provides a new error handling interface for those errors that are
currently handled by BUG_ON()s.

In order to protect btrfs from panics, when it hits one of those BUG_ON errors,
the interface forces btrfs readonly and saves the FS state to disk. The
filesystem can then be umounted, although maybe with some warnings in the
kernel dmesg. After that, btrfsck is helpful to recover btrfs.

v1->v2:
move write super stuff from error handle path to unmount in order to avoid
deadlock.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.h |8 +
 fs/btrfs/super.c |   88 ++
 2 files changed, 96 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 92b5ca2..fc9b6a0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2552,6 +2552,14 @@ ssize_t btrfs_listxattr(struct dentry *dentry, char 
*buffer, size_t size);
 /* super.c */
 int btrfs_parse_options(struct btrfs_root *root, char *options);
 int btrfs_sync_fs(struct super_block *sb, int wait);
+void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function,
+unsigned int line, int errno);
+
+#define btrfs_std_error(fs_info, errno)\
+do {   \
+   if ((errno))\
+   __btrfs_std_error((fs_info), __func__, __LINE__, (errno));\
+} while (0)
 
 /* acl.c */
 #ifdef CONFIG_BTRFS_FS_POSIX_ACL
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 718b10d..07c58f9 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -54,6 +54,94 @@
 
 static const struct super_operations btrfs_super_ops;
 
+static const char *btrfs_decode_error(struct btrfs_fs_info *fs_info, int errno,
+ char nbuf[16])
+{
+   char *errstr = NULL;
+
+   switch (errno) {
+   case -EIO:
+   errstr = "IO failure";
+   break;
+   case -ENOMEM:
+   errstr = "Out of memory";
+   break;
+   case -EROFS:
+   errstr = "Readonly filesystem";
+   break;
+   default:
+   if (nbuf) {
+   if (snprintf(nbuf, 16, "error %d", -errno) >= 0)
+   errstr = nbuf;
+   }
+   break;
+   }
+
+   return errstr;
+}
+
+static void __save_error_info(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_super_block *disk_super = &fs_info->super_copy;
+
+   fs_info->fs_state = BTRFS_SUPER_FLAG_ERROR;
+   disk_super->flags |= cpu_to_le64(BTRFS_SUPER_FLAG_ERROR);
+
+   mutex_lock(&fs_info->trans_mutex);
+   memcpy(&fs_info->super_for_commit, disk_super,
+  sizeof(fs_info->super_for_commit));
+   mutex_unlock(&fs_info->trans_mutex);
+}
+
+/* NOTE:
+ * We move the write_super stuff to umount in order to avoid a
+ * deadlock, since umount holds all the locks.
+ */
+static void save_error_info(struct btrfs_fs_info *fs_info)
+{
+   __save_error_info(fs_info);
+}
+
+/* btrfs handles errors by forcing the filesystem readonly */
+static void btrfs_handle_error(struct btrfs_fs_info *fs_info)
+{
+   struct super_block *sb = fs_info->sb;
+
+   if (sb->s_flags & MS_RDONLY)
+   return;
+
+   if (fs_info->fs_state & BTRFS_SUPER_FLAG_ERROR) {
+   sb->s_flags |= MS_RDONLY;
+   printk(KERN_INFO "btrfs is forced readonly\n");
+   }
+}
+
+/*
+ * __btrfs_std_error decodes expected errors from the caller and
+ * invokes the appropriate error response.
+ */
+void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function,
+unsigned int line, int errno)
+{
+   struct super_block *sb = fs_info->sb;
+   char nbuf[16];
+   const char *errstr;
+
+   /*
+* Special case: if the error is EROFS, and we're already
+* under MS_RDONLY, then it is safe here.
+*/
+   if (errno == -EROFS && (sb->s_flags & MS_RDONLY))
+   return;
+
+   errstr = btrfs_decode_error(fs_info, errno, nbuf);
+   printk(KERN_CRIT "BTRFS error (device %s) in %s:%d: %s\n",
+   sb->s_id, function, line, errstr);
+   save_error_info(fs_info);
+
+   btrfs_handle_error(fs_info);
+}
+
 static void btrfs_put_super(struct super_block *sb)
 {
struct btrfs_root *root = btrfs_sb(sb);
-- 
1.7.1


[RFC PATCH 1/4 v2] Btrfs: add filesystem state for error handle

2010-12-01 Thread liubo
Add a filesystem state and flags to tell whether the filesystem is
valid or insane.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.h |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8db9234..92b5ca2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -294,6 +294,14 @@ static inline unsigned long btrfs_chunk_item_size(int 
num_stripes)
 #define BTRFS_FSID_SIZE 16
 #define BTRFS_HEADER_FLAG_WRITTEN  (1ULL << 0)
 #define BTRFS_HEADER_FLAG_RELOC(1ULL << 1)
+
+/*
+ * File system states
+ */
+
+/* Errors detected */
+#define BTRFS_SUPER_FLAG_ERROR (1ULL << 2)
+
 #define BTRFS_SUPER_FLAG_SEEDING   (1ULL << 32)
 #define BTRFS_SUPER_FLAG_METADUMP  (1ULL << 33)
 
@@ -1050,6 +1058,9 @@ struct btrfs_fs_info {
unsigned metadata_ratio;
 
void *bdev_holder;
+
+   /* filesystem state */
+   u64 fs_state;
 };
 
 /*
-- 
1.7.1



[RFC PATCH 2/4 v2] Btrfs: avoid transaction stuff when readonly

2010-12-01 Thread liubo
When the filesystem is readonly, avoid transaction stuff by checking MS_RDONLY
at start transaction time.

Signed-off-by: Liu Bo 
---
 fs/btrfs/transaction.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1fffbc0..14a597d 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -181,6 +181,9 @@ static struct btrfs_trans_handle *start_transaction(struct 
btrfs_root *root,
struct btrfs_trans_handle *h;
struct btrfs_transaction *cur_trans;
int ret;
+
+   if (root->fs_info->sb->s_flags & MS_RDONLY)
+   return ERR_PTR(-EROFS);
 again:
h = kmem_cache_alloc(btrfs_trans_handle_cachep, GFP_NOFS);
if (!h)
-- 
1.7.1
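
With this change, every start_transaction() caller has to be prepared for an
ERR_PTR return instead of assuming a valid handle. A sketch of what a
converted call site would look like (the surrounding function is invented for
illustration; it is not a hunk from this patchset):

static int example_update_something(struct btrfs_root *root)
{
        struct btrfs_trans_handle *trans;

        trans = btrfs_start_transaction(root, 1);
        if (IS_ERR(trans))
                return PTR_ERR(trans);  /* -EROFS once the fs is readonly */

        /* ... do the work under the transaction ... */

        return btrfs_end_transaction(trans, root);
}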


[RFC PATCH 0/4 v2] Btrfs: Add readonly support to replace BUG_ON phrase

2010-12-01 Thread liubo
Btrfs has a number of BUG_ON()s, which may lead btrfs to unpleasant panics.
Meanwhile, they are very ugly and should be handled more appropriately.

There are mainly two ways to deal with these BUG_ON()s.

1. For those errors which can be handled well by callers, we just return the
error number to the callers.

2. For the others, we can force the filesystem readonly when it hits errors,
which is what this patchset does. With BUG_ON() replaced by the interface
provided in this patchset, we will get error information via dmesg. Since
btrfs is now readonly, we can save our data safely and umount it, and then a
btrfsck is recommended (see the sketch below).
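
An illustrative before/after for way 2 (the call site is invented; only the
btrfs_std_error() interface comes from this patchset):

/* before: any error panics the box */
ret = btrfs_do_something(trans, root);
BUG_ON(ret);

/* after: log the error, flip the fs readonly, keep running */
ret = btrfs_do_something(trans, root);
if (ret)
        btrfs_std_error(root->fs_info, ret);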

In these ways, we can protect our filesystem from panics caused by those
BUG_ON()s.

We still need an incompat flag to make old kernels happy.

v1->v2:
- in order to avoid deadlock thing, move write super stuff from error handle
  path to umount time.
- remove BTRFS_SUPER_FLAG_VALID, just use BTRFS_SUPER_FLAG_ERROR to make it
  simple.
- add MS_RDONLY check at start of a transaction instead of commit transaction.

---
 fs/btrfs/ctree.h   |   19 ++
 fs/btrfs/disk-io.c |   47 +-
 fs/btrfs/super.c   |   88 
 fs/btrfs/transaction.c |3 ++
 4 files changed, 156 insertions(+), 1 deletions(-)


Re: [PATCH] Btrfs: dynamically remove unused block groups

2010-12-01 Thread Josef Bacik
On Tue, Nov 30, 2010 at 09:53:41PM -0700, Anthony Roberts wrote:
> Hello,
>
> What happens in the event the filesystem has mostly been cleared out,  
> but there's a few things left? For example, several of the chunks might  
> be at very low usage, but not zero. Would the user be able to defragment  
> the filesystem to cause these chunks to be consolidated?
>

Yeah thats what balance is for.  Thanks,

Josef


[RFC PATCH 3/4] btrfs: reduce the times of mmap() in fill_inode_item()

2010-12-01 Thread Miao Xie
With the old code, we must map the page every time we want to set a member
variable of the inode item, which is inefficient. Now we just map it once, up
front. This way, we can improve the performance of file creation and deletion
by ~2%.

Signed-off-by: Miao Xie 
---
 fs/btrfs/inode.c |   15 +++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 46b9d1a..2f0c742 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2539,6 +2539,16 @@ static void fill_inode_item(struct btrfs_trans_handle 
*trans,
struct btrfs_inode_item *item,
struct inode *inode)
 {
+   int unmap_on_exit;
+
+   unmap_on_exit = (leaf->map_token == NULL);
+   if (unmap_on_exit)
+   map_extent_buffer(leaf, (unsigned long)item,
+ sizeof(*item), &leaf->map_token,
+ &leaf->kaddr, &leaf->map_start,
+ &leaf->map_len, KM_USER1);
+
+
btrfs_set_inode_uid(leaf, item, inode->i_uid);
btrfs_set_inode_gid(leaf, item, inode->i_gid);
btrfs_set_inode_size(leaf, item, BTRFS_I(inode)->disk_i_size);
@@ -2567,6 +2577,11 @@ static void fill_inode_item(struct btrfs_trans_handle 
*trans,
btrfs_set_inode_rdev(leaf, item, inode->i_rdev);
btrfs_set_inode_flags(leaf, item, BTRFS_I(inode)->flags);
btrfs_set_inode_block_group(leaf, item, BTRFS_I(inode)->block_group);
+
+   if (unmap_on_exit && leaf->map_token) {
+   unmap_extent_buffer(leaf, leaf->map_token, KM_USER1);
+   leaf->map_token = NULL;
+   }
 }
 
 /*
-- 
1.7.0.1


[RFC PATCH 4/4] btrfs: implement delayed dir index insertion and deletion

2010-12-01 Thread Miao Xie
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implements delayed directory name
index insertion and deletion.

Implementation:
- Every transaction has two rb-trees: one is used to manage the directory name
  indexes which are going to be inserted into the b+ tree, and the other is
  used to manage the directory name indexes which are going to be deleted from
  the b+ tree.
- When we want to insert a directory name index into the b+ tree, we just add
  the information into the inserting rb-tree.
  And when the number of directory name indexes touches the upper limit (the
  max number of directory name indexes that can be stored in a leaf), we start
  the inserting manipulation and insert those directory name indexes into the
  b+ tree.
- When we want to delete a directory name index from the b+ tree, we search it
  in the inserting rb-tree first. If we find it, we just drop it. If not, we
  add its key into the deleting rb-tree (see the sketch after this list).
  Similar to the inserting rb-tree, when the number of directory name indexes
  touches the upper limit, we start the deleting manipulation and delete those
  directory name indexes from the b+ tree.
- If we want to commit the transaction, we do the inserting/deleting
  manipulation and insert/delete all the directory name indexes which are in
  the rb-trees into/from the b+ tree.
- When we want to read directory entries, we first do the deleting manipulation
  and delete all the directory name indexes of the files/directories it
  contains from the b+ tree. Then we read the directory entries in the b+ tree,
  and at the end we read the directory entries which are in the inserting
  rb-tree.
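
A hedged sketch of that deletion decision; the structure and helper names are
invented for illustration and are not the patch's actual code:

static int delayed_delete_dir_index(struct btrfs_transaction *cur_trans,
                                    struct btrfs_key *key)
{
        struct delayed_item *item;

        /* still only queued for insertion? then it never hit the b+ tree */
        item = lookup_in_inserting_tree(cur_trans, key);  /* hypothetical */
        if (item) {
                rb_erase(&item->rb_node, &cur_trans->inserting_root);
                kfree(item);
                return 0;                       /* nothing to do on disk */
        }

        /* already on disk: remember the key in the deleting rb-tree */
        record_in_deleting_tree(cur_trans, key);          /* hypothetical */

        /* once a leaf's worth of deletions piles up, flush them in one batch */
        if (deleting_tree_count(cur_trans) >= max_indexes_per_leaf())
                run_delayed_deletions(cur_trans);

        return 0;
}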

I did a quick test with the benchmark tool[1] and found we can improve the
performance of file creation by ~9%, and file deletion by ~13%.

Before applying this patch:
Create files:
Total files: 5
Total time: 1.188547
Average time: 0.24
Delete files:
Total files: 5
Total time: 1.662012
Average time: 0.33

After applying this patch:
Create files:
Total files: 5
Total time: 1.083526
Average time: 0.22
Delete files:
Total files: 5
Total time: 1.439360
Average time: 0.29

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Signed-off-by: Miao Xie 
---
 fs/btrfs/Makefile|2 +-
 fs/btrfs/btrfs_inode.h   |2 +
 fs/btrfs/ctree.c |   13 +-
 fs/btrfs/ctree.h |   15 +-
 fs/btrfs/delayed-dir-index.c |  790 ++
 fs/btrfs/delayed-dir-index.h |   92 +
 fs/btrfs/dir-item.c  |   24 +-
 fs/btrfs/extent-tree.c   |   21 ++
 fs/btrfs/inode.c |   80 +++--
 fs/btrfs/transaction.c   |9 +
 fs/btrfs/transaction.h   |2 +
 11 files changed, 995 insertions(+), 55 deletions(-)
 create mode 100644 fs/btrfs/delayed-dir-index.c
 create mode 100644 fs/btrfs/delayed-dir-index.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index a35eb36..1f7696a 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -7,4 +7,4 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
root-tree.o dir-item.o \
   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
   export.o tree-log.o acl.o free-space-cache.o zlib.o \
-  compression.o delayed-ref.o relocation.o
+  compression.o delayed-ref.o relocation.o delayed-dir-index.o
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 6ad63f1..3d03a17 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -162,6 +162,8 @@ struct btrfs_inode {
struct inode vfs_inode;
 };
 
+extern unsigned char btrfs_filetype_table[];
+
 static inline struct btrfs_inode *BTRFS_I(struct inode *inode)
 {
return container_of(inode, struct btrfs_inode, vfs_inode);
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 9ac1715..08f4339 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -38,10 +38,6 @@ static int balance_node_right(struct btrfs_trans_handle 
*trans,
  struct extent_buffer *src_buf);
 static int del_ptr(struct btrfs_trans_handle *trans, struct btrfs_root *root,
   struct btrfs_path *path, int level, int slot);
-static int setup_items_for_insert(struct btrfs_trans_handle *trans,
-   struct btrfs_root *root, struct btrfs_path *path,
-   struct btrfs_key *cpu_key, u32 *data_size,
-   u32 total_data, u32 total_size, int nr);
 
 
 struct btrfs_path *btrfs_alloc_path(void)
@@ -3680,11 +3676,10 @@ out:
  * to save stack depth by doing the bulk of the work in a function

[GIT PULL] [RFC PATCH 0/4] btrfs: Implement delayed directory name index insertion and deletion

2010-12-01 Thread Miao Xie
Compared with Ext3/4, the performance of file creation and deletion on btrfs is
very poor. The reason is that btrfs must do a lot of b+ tree insertions, such
as inode item, directory name item, directory name index and so on.

If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implements delayed directory name
index insertion and deletion.

Besides that, we found we must map the page every time we want to set a member
variable of the inode item, which is inefficient. We now do it just once, to
reduce the number of mmap() calls. This way, we can also improve the
performance of file creation and deletion.

I did a quick test with the benchmark tool[1] and found we can improve the
performance of file creation by ~11%, and file deletion by ~14%.

Before applying this patch:
Create files:
Total files: 5
Total time: 1.188547
Average time: 0.24
Delete files:
Total files: 5
Total time: 1.662012
Average time: 0.33

After applying this patch:
Create files:
Total files: 5
Total time: 1.057432
Average time: 0.21
Delete files:
Total files: 5
Total time: 1.422851
Average time: 0.28

You can also try out the patchset by pulling:
git://repo.or.cz/linux-btrfs-devel.git perf-improve

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

---
 fs/btrfs/Makefile|2 +-
 fs/btrfs/btrfs_inode.h   |2 +
 fs/btrfs/ctree.c |   13 +-
 fs/btrfs/ctree.h |   21 +-
 fs/btrfs/delayed-dir-index.c |  790 ++
 fs/btrfs/delayed-dir-index.h |   92 +
 fs/btrfs/dir-item.c  |   61 +++-
 fs/btrfs/extent-tree.c   |   21 ++
 fs/btrfs/inode.c |  189 +++---
 fs/btrfs/transaction.c   |9 +
 fs/btrfs/transaction.h   |2 +
 11 files changed, 1117 insertions(+), 85 deletions(-)


[RFC PATCH 1/4] btrfs: introduce a function btrfs_insert_dir_index_item()

2010-12-01 Thread Miao Xie
Restructure btrfs_insert_dir_item() and introduce a function,
btrfs_insert_dir_index_item(), to insert a dir index item.

Signed-off-by: Miao Xie 
---
 fs/btrfs/ctree.h|6 
 fs/btrfs/dir-item.c |   69 +++---
 2 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index af52f6d..5c44cf4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2313,6 +2313,12 @@ int btrfs_find_orphan_roots(struct btrfs_root *tree_root);
 int btrfs_set_root_node(struct btrfs_root_item *item,
struct extent_buffer *node);
 /* dir-item.c */
+int btrfs_insert_dir_index_item(struct btrfs_trans_handle *trans,
+   struct btrfs_root *root,
+   struct btrfs_key *key,
+   struct btrfs_path *path,
+   struct btrfs_dir_item *dir_item,
+   int data_len);
 int btrfs_insert_dir_item(struct btrfs_trans_handle *trans,
  struct btrfs_root *root, const char *name,
  int name_len, u64 dir,
diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index f0cad5a..d2d23b6 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -117,6 +117,44 @@ int btrfs_insert_xattr_item(struct btrfs_trans_handle *trans,
 }
 
 /*
+ * btrfs_insert_dir_index_item - insert a dir index item into the b-tree
+ * @trans: pointer to the transaction handle
+ * @root:  the btrfs root to insert the dir index item into
+ * @key:   key of the dir index item
+ * @path:  pointer to the b-tree path
+ * @dir_item:  pointer to the prebuilt dir item (the name follows it)
+ * @data_len:  length of the data
+ *
+ * Return value:
+ * 0  - succeeded
+ * <0 - error happened
+ */
+int btrfs_insert_dir_index_item(struct btrfs_trans_handle *trans,
+   struct btrfs_root *root,
+   struct btrfs_key *key,
+   struct btrfs_path *path,
+   struct btrfs_dir_item *dir_item,
+   int data_len)
+{
+   struct extent_buffer *leaf;
+   struct btrfs_dir_item *dir_item_ptr;
+
+   dir_item_ptr = insert_with_overflow(trans, root, path, key, data_len,
+   (char *)(dir_item + 1),
+   le16_to_cpu(dir_item->name_len));
+   if (IS_ERR(dir_item_ptr))
+   return PTR_ERR(dir_item_ptr);
+
+   leaf = path->nodes[0];
+   write_extent_buffer(leaf, dir_item, (unsigned long)dir_item_ptr,
+   data_len);
+   btrfs_mark_buffer_dirty(leaf);
+
+   return 0;
+}
+
+/*
  * insert a directory item in the tree, doing all the magic for
  * both indexes. 'dir' indicates which objectid to insert it into,
  * 'location' is the key to stuff into the directory item, 'type' is the
@@ -174,24 +212,25 @@ second_insert:
}
btrfs_release_path(root, path);
 
-   btrfs_set_key_type(&key, BTRFS_DIR_INDEX_KEY);
-   key.offset = index;
-   dir_item = insert_with_overflow(trans, root, path, &key, data_size,
-   name, name_len);
-   if (IS_ERR(dir_item)) {
-   ret2 = PTR_ERR(dir_item);
+   dir_item = kmalloc(sizeof(*dir_item) + name_len, GFP_NOFS);
+   if (!dir_item) {
+   ret2 = -ENOMEM;
goto out;
}
-   leaf = path->nodes[0];
+
+   btrfs_set_key_type(&key, BTRFS_DIR_INDEX_KEY);
+   key.offset = index;
+
btrfs_cpu_key_to_disk(&disk_key, location);
-   btrfs_set_dir_item_key(leaf, dir_item, &disk_key);
-   btrfs_set_dir_type(leaf, dir_item, type);
-   btrfs_set_dir_data_len(leaf, dir_item, 0);
-   btrfs_set_dir_name_len(leaf, dir_item, name_len);
-   btrfs_set_dir_transid(leaf, dir_item, trans->transid);
-   name_ptr = (unsigned long)(dir_item + 1);
-   write_extent_buffer(leaf, name, name_ptr, name_len);
-   btrfs_mark_buffer_dirty(leaf);
+   dir_item->location = disk_key;
+   dir_item->transid = cpu_to_le64(trans->transid);
+   dir_item->data_len = 0;
+   dir_item->name_len = cpu_to_le16(name_len);
+   dir_item->type = type;
+   memcpy((char *)(dir_item + 1), name, name_len);
+   ret2 = btrfs_insert_dir_index_item(trans, root, &key, path, dir_item,
+  sizeof(*dir_item) + name_len);
+   kfree(dir_item);
 out:
btrfs_free_path(path);
if (ret)
-- 
1.7.0.1
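
To make the new helper's contract concrete, the caller-side pattern condenses to
the following (taken from the second hunk above; dir_ino, index, location, name,
name_len and type stand in for the caller's values):

struct btrfs_key key;
struct btrfs_dir_item *di;
int data_len = sizeof(*di) + name_len;
int ret;

di = kmalloc(data_len, GFP_NOFS);
if (!di)
	return -ENOMEM;

key.objectid = dir_ino;				/* the directory's inode number */
btrfs_set_key_type(&key, BTRFS_DIR_INDEX_KEY);	/* index lookup, not name lookup */
key.offset = index;				/* per-directory sequence number */

btrfs_cpu_key_to_disk(&di->location, location);	/* what the entry points to */
di->transid = cpu_to_le64(trans->transid);
di->data_len = 0;
di->name_len = cpu_to_le16(name_len);
di->type = type;				/* BTRFS_FT_* file type */
memcpy((char *)(di + 1), name, name_len);	/* the name follows the item */

ret = btrfs_insert_dir_index_item(trans, root, &key, path, di, data_len);
kfree(di);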


[RFC PATCH 2/4] btrfs: restructure btrfs_real_readdir()

2010-12-01 Thread Miao Xie
- restructure btrfs_real_readdir(); the split will be used later.
- add a missing out-of-memory check for btrfs_alloc_path()
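
(Presumably the point of the split is to let a later patch in the series feed the
delayed, not-yet-inserted index items through the same filldir path: the new
btrfs_readdir() wrapper keeps the "." and ".." handling and the path allocation,
while btrfs_real_readdir() becomes a pure tree walker with a three-way return
code.)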

Signed-off-by: Miao Xie 
---
 fs/btrfs/inode.c |   98 +++--
 1 files changed, 65 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0f34cae..46b9d1a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4158,16 +4158,21 @@ static unsigned char btrfs_filetype_table[] = {
DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
 };
 
+/*
+ * Return value:
+ * 0  - Reached end of directory/root in the ctree.
+ * 1  - buffer is full
+ * <0 - error happened
+ */
 static int btrfs_real_readdir(struct file *filp, void *dirent,
- filldir_t filldir)
+ filldir_t filldir, struct btrfs_root *root,
+ int key_type, struct btrfs_path *path)
 {
struct inode *inode = filp->f_dentry->d_inode;
-   struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_item *item;
struct btrfs_dir_item *di;
struct btrfs_key key;
struct btrfs_key found_key;
-   struct btrfs_path *path;
int ret;
u32 nritems;
struct extent_buffer *leaf;
@@ -4178,36 +4183,10 @@ static int btrfs_real_readdir(struct file *filp, void *dirent,
u32 di_cur;
u32 di_total;
u32 di_len;
-   int key_type = BTRFS_DIR_INDEX_KEY;
char tmp_name[32];
char *name_ptr;
int name_len;
 
-   /* FIXME, use a real flag for deciding about the key type */
-   if (root->fs_info->tree_root == root)
-   key_type = BTRFS_DIR_ITEM_KEY;
-
-   /* special case for "." */
-   if (filp->f_pos == 0) {
-   over = filldir(dirent, ".", 1,
-  1, inode->i_ino,
-  DT_DIR);
-   if (over)
-   return 0;
-   filp->f_pos = 1;
-   }
-   /* special case for .., just use the back ref */
-   if (filp->f_pos == 1) {
-   u64 pino = parent_ino(filp->f_path.dentry);
-   over = filldir(dirent, "..", 2,
-  2, pino, DT_DIR);
-   if (over)
-   return 0;
-   filp->f_pos = 2;
-   }
-   path = btrfs_alloc_path();
-   path->reada = 2;
-
btrfs_set_key_type(&key, key_type);
key.offset = filp->f_pos;
key.objectid = inode->i_ino;
@@ -4224,7 +4203,9 @@ static int btrfs_real_readdir(struct file *filp, void *dirent,
if (advance || slot >= nritems) {
if (slot >= nritems - 1) {
ret = btrfs_next_leaf(root, path);
-   if (ret)
+   if (ret < 0)
+   goto err;
+   else if (ret > 0)
break;
leaf = path->nodes[0];
nritems = btrfs_header_nritems(leaf);
@@ -4287,8 +4268,10 @@ skip:
if (name_ptr != tmp_name)
kfree(name_ptr);
 
-   if (over)
-   goto nopos;
+   if (over) {
+   ret = 1;
+   goto err;
+   }
di_len = btrfs_dir_name_len(leaf, di) +
 btrfs_dir_data_len(leaf, di) + sizeof(*di);
di_cur += di_len;
@@ -4296,6 +4279,55 @@ skip:
}
}
 
+   ret = 0;
+err:
+   btrfs_release_path(root, path);
+   return ret;
+}
+
+static int btrfs_readdir(struct file *filp, void *dirent, filldir_t filldir)
+{
+   struct inode *inode = filp->f_dentry->d_inode;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_path *path;
+   int key_type = BTRFS_DIR_INDEX_KEY;
+   int ret;
+   int over = 0;
+
+   /* FIXME, use a real flag for deciding about the key type */
+   if (root->fs_info->tree_root == root)
+   key_type = BTRFS_DIR_ITEM_KEY;
+
+   /* special case for "." */
+   if (filp->f_pos == 0) {
+   over = filldir(dirent, ".", 1,
+  1, inode->i_ino,
+  DT_DIR);
+   if (over)
+   return 0;
+   filp->f_pos = 1;
+   }
+   /* special case for .., just use the back ref */
+   if (filp->f_pos == 1) {
+   u64 pino = parent_ino(filp->f_path.dentry);
+   over = filldir(dirent, "..", 2,
+  2, pino, DT_DIR);
+   if (over)
+   return 0;
+   filp->f_pos = 2;
+   }
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;