Re: [PATCH 2/2 v2] Btrfs: Per file/directory controls for COW and compression

2011-04-04 Thread Konstantinos Skarlatos

Hello,
I would like to ask about the status of this feature/patch: has it been
accepted into the btrfs code, and how can I use it?


I am interested in enabling compression in one specific
folder (force-compress would be ideal) of a large btrfs volume, and
disabling it for the rest.
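For reference, the knob this patch wires up is the generic inode-flags ioctl, which is what `chattr +c` drives under the hood. A minimal userspace sketch (an assumption-laden illustration, not btrfs-progs code: it needs a kernel carrying this patch, and the ioctl simply fails on filesystems that don't honour the flag):

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_COMPR_FL */

/* Set or clear the per-inode compression flag via the generic
 * FS_IOC_SETFLAGS ioctl.  Returns 0 on success, -1 on any failure
 * (e.g. the path doesn't exist, or the filesystem doesn't support
 * inode flags). */
int set_compress(const char *path, int enable)
{
	int fd, flags, ret;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
		close(fd);
		return -1;
	}
	if (enable)
		flags |= FS_COMPR_FL;	/* the flag chattr spells '+c' */
	else
		flags &= ~FS_COMPR_FL;
	ret = ioctl(fd, FS_IOC_SETFLAGS, &flags);
	close(fd);
	return ret < 0 ? -1 : 0;
}
```

Calling `set_compress("/mnt/data", 1)` on a directory should, per the patch description, make new files created under it inherit the compression attribute; `lsattr` shows the result. (Which flag bit covers the nocow side from userspace depends on the kernel headers of the day, so that part is left out here.)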



On 21/3/2011 10:57 AM, liubo wrote:

Data compression and data cow are controlled across the entire FS by mount
options right now; ioctls are needed to set this on a per-file or
per-directory basis.  This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.

According to Chris's comment, there should be just one true compression
method (probably LZO) stored in the super block.  However, we will wait
until that method is stable enough before adopting it into the super block.
So I list it as a long-term goal, and just store it in RAM today.

After applying this patch, we can use the generic FS_IOC_SETFLAGS ioctl to
control a file's or directory's datacow and compression attributes.

NOTE:
  - The compression type is selected by these rules:
    If we mount btrfs with a compress option (i.e. zlib/lzo), that type is used.
    Otherwise, we'll use the default compress type (zlib today).

v1->v2:
Rebase the patch with the latest btrfs.

Signed-off-by: Liu Bo <liubo2...@cn.fujitsu.com>
---
  fs/btrfs/ctree.h   |1 +
  fs/btrfs/disk-io.c |6 ++
  fs/btrfs/inode.c   |   32 
  fs/btrfs/ioctl.c   |   41 +
  4 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8b4b9d1..b77d1a5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1283,6 +1283,7 @@ struct btrfs_root {
  #define BTRFS_INODE_NODUMP    (1 << 8)
  #define BTRFS_INODE_NOATIME   (1 << 9)
  #define BTRFS_INODE_DIRSYNC   (1 << 10)
+#define BTRFS_INODE_COMPRESS   (1 << 11)

  /* some macros to generate set/get funcs for the struct fields.  This
   * assumes there is a lefoo_to_cpu for every type, so lets make a simple
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3e1ea3e..a894c12 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1762,6 +1762,12 @@ struct btrfs_root *open_ctree(struct super_block *sb,

btrfs_check_super_valid(fs_info, sb->s_flags & MS_RDONLY);

+   /*
+* In the long term, we'll store the compression type in the super
+* block, and it'll be used for per file compression control.
+*/
+   fs_info->compress_type = BTRFS_COMPRESS_ZLIB;
+
ret = btrfs_parse_options(tree_root, options);
if (ret) {
err = ret;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index db67821..e687bb9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -381,7 +381,8 @@ again:
 */
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS) &&
(btrfs_test_opt(root, COMPRESS) ||
-(BTRFS_I(inode)->force_compress))) {
+(BTRFS_I(inode)->force_compress) ||
+(BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))) {
WARN_ON(pages);
pages = kzalloc(sizeof(struct page *) * nr_pages, GFP_NOFS);

@@ -1253,7 +1254,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
ret = run_delalloc_nocow(inode, locked_page, start, end,
 page_started, 0, nr_written);
else if (!btrfs_test_opt(root, COMPRESS) &&
-!(BTRFS_I(inode)->force_compress))
+!(BTRFS_I(inode)->force_compress) &&
+!(BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))
ret = cow_file_range(inode, locked_page, start, end,
  page_started, nr_written, 1);
else
@@ -4581,8 +4583,6 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
location-offset = 0;
btrfs_set_key_type(location, BTRFS_INODE_ITEM_KEY);

-   btrfs_inherit_iflags(inode, dir);
-
if ((mode & S_IFREG)) {
if (btrfs_test_opt(root, NODATASUM))
BTRFS_I(inode)-flags |= BTRFS_INODE_NODATASUM;
@@ -4590,6 +4590,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
BTRFS_I(inode)-flags |= BTRFS_INODE_NODATACOW;
}

+   btrfs_inherit_iflags(inode, dir);
+
insert_inode_hash(inode);
inode_tree_add(inode);
return inode;
@@ -6803,6 +6805,26 @@ static int btrfs_getattr(struct vfsmount *mnt,
return 0;
  }

+/*
+ * If a file is moved, it will inherit the cow and compression flags of the new
+ * directory.
+ */
+static void fixup_inode_flags(struct inode *dir, struct inode *inode)
+{
+   struct btrfs_inode *b_dir = BTRFS_I(dir);
+   struct btrfs_inode *b_inode = BTRFS_I(inode);
+
+   if (b_dir->flags & BTRFS_INODE_NODATACOW)
+       b_inode->flags |= BTRFS_INODE_NODATACOW;
+  

Odd rebalancing behavior

2011-04-04 Thread Michel Alexandre Salim
I have an external 4-disk enclosure, connected through USB 2.0 (my
laptop does not have a USB 3.0 connector, and the eSATA connector
somehow does not work); it initially had a 2-disk btrfs soft-RAID1 file
system (both data and metadata are RAID1).

I recently added two more disks and did a rebalance. To my surprise it
went past the point where all four disks have the same amount of disk
usage, and went all the way to the original disks being empty, and the
new disks having all the data!

Label: 'media.store'  uuid: 4cfd3551-aa85-4399-b872-9238ddb14c97
Total devices 4 FS bytes used 1.22TB
devid3 size 1.82TB used 1.24TB path /dev/sdb
devid4 size 1.82TB used 1.24TB path /dev/sdc
devid2 size 1.82TB used 8.00MB path /dev/sde
devid1 size 1.82TB used 12.00MB path /dev/sdd

Is this to be expected? Would another rebalance fix it, or should I
force-stop it by shutting down when the disk usage is roughly balanced?

This is on Fedora 15 pre-release, x86_64, fully updated, kernel
2.6.38.2-9 and btrfs-progs 0.19-13

Thanks,

-- 
Michel Alexandre Salim
GPG key ID: 78884778

()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs balancing start - and stop?

2011-04-04 Thread Stephane Chazelas
2011-04-03 21:35:00 +0200, Helmut Hullen:
 Hello, Stephane,
 
 You wrote on 03.04.11:
 
  balancing about 2 TByte needed about 20 hours.
 
 [...]
 
  Hugo has explained the limits of looking at
 
  dmesg | grep relocating
 
  or (more simply) at the last lines of dmesg for the
  relocating lines. But what do these lines tell us now? What is the
  (pessimistic) estimate when you extrapolate the data?
 
 [...]
 
  4.7 more days to go. And I reckon it will have written about 9
  TB to disk by that time (which is the total size of the volume,
  though only 3.8TB are occupied).
 
 Yes - that's the pessimistic estimate. As Hugo has explained, it can
 finish faster - just look at the data again tomorrow.
[...]

That may be an optimistic estimation actually, as there hasn't
been much progress in the last 34 hours:

# dmesg | awk -F '[][ ]+' '/reloc/ && n++%5==0 {x=(n-$7)/($2-t)/1048576; printf "%s\t%s\t%.2f\t%*s\n", $2/3600, $7, x, x/3, ""; t=$2; n=$7}' | tr ' ' '*' | tail -40
125.629 4170039951360   11.93   ***
125.641 4166818725888   70.99   ***
125.699 4157155049472   43.87   **
125.753 4144270147584   63.34   *
125.773 4137827696640   84.98   
125.786 4134606471168   64.39   *
125.823 4124942794752   70.09   ***
125.87  4112057892864   71.66   ***
125.887 4105615441920   100.60  *
125.898 4102394216448   81.26   ***
125.935 4092730540032   69.06   ***
126.33  4085751218176   4.69*
131.904 4072597880832   0.63
132.082 4059712978944   19.20   **
132.12  4053270528000   45.52   ***
132.138 4050049302528   45.60   ***
132.225 4040385626112   29.68   *
132.267 4027500724224   81.17   ***
132.283 4021058273280   106.31  ***
132.29  4017837047808   110.42  
132.316 4008173371392   100.54  *
132.358 3995288469504   81.18   ***
132.475 3988846018560   14.62   
132.514 3985624793088   21.55   ***
132.611 3975961116672   26.40   
132.663 3963076214784   65.31   *
132.678 3956633763840   120.11  
132.685 3956365328384   10.26   ***
137.701 3949922877440   0.34
137.709 3946701651968   106.54  ***
137.744 3937037975552   72.10   
137.889 3927105863680   18.18   **
137.901 3926837428224   5.85*
141.555 3926300557312   0.04
141.93  3925226815488   0.76
151.227 3924421509120   0.02
151.491 3924153073664   0.27
151.712 3923616202752   0.64
165.301 3922542460928   0.02
174.346 3921737154560   0.02

At this rate (third field expressed in MiB/s), it could take
months to complete.

iostat still reports writes at about 5MiB/s though. Note that
this system is not doing anything else at all.

There definitely seems to be scope for optimisation in the
balancing I'd say.
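The arithmetic in the awk one-liner above can be sketched in Python for clarity (the log format is an assumption based on the output shown, roughly `[<secs>] btrfs: relocating block group <offset> ...`; the every-5th-line sampling is omitted):

```python
import re

# Matches a (hypothetical) dmesg relocation line such as
# "[ 125.629] btrfs: relocating block group 4170039951360 flags 1"
_RELOC = re.compile(r'\[\s*([\d.]+)\]\s.*relocating block group (\d+)')

def relocation_rates(lines):
    """Yield (hours, offset, MiB_per_s) for each consecutive pair of
    relocation lines, mirroring x=(n-$7)/($2-t)/1048576 in the awk."""
    prev = None
    for line in lines:
        m = _RELOC.search(line)
        if not m:
            continue
        t, off = float(m.group(1)), int(m.group(2))
        if prev is not None:
            pt, poff = prev
            # relocation walks downward through the address space, so
            # previous offset minus current offset is the number of
            # bytes processed in (t - pt) seconds
            yield (t / 3600, off, (poff - off) / (t - pt) / 1048576)
        prev = (t, off)
```

Fed the dmesg excerpt above, this reproduces the third column (MiB/s) that the bar chart is drawn from.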

-- 
Stephane


Re: btrfs subvolume snapshot syntax too smart

2011-04-04 Thread Freddie Cash
On Mon, Apr 4, 2011 at 12:47 PM, Goffredo Baroncelli <kreij...@libero.it> wrote:
 On 04/04/2011 09:09 PM, krz...@gmail.com wrote:
 I understand btrfs's intent, but the same command run twice should not give
 different results. This really makes snapshot automation hard.


 root@sv12 [/ssd]# btrfs subvolume snapshot /ssd/sub1 /ssd/5
 Create a snapshot of '/ssd/sub1' in '/ssd/5'
 root@sv12 [/ssd]# btrfs subvolume snapshot /ssd/sub1 /ssd/5
 Create a snapshot of '/ssd/sub1' in '/ssd/5/sub1'
 root@sv12 [/ssd]# btrfs subvolume snapshot /ssd/sub1 /ssd/5
 Create a snapshot of '/ssd/sub1' in '/ssd/5/sub1'
 ERROR: cannot snapshot '/ssd/sub1'

 The same is true for cp:

 # cp -rf /ssd/sub1 /ssd/5       -> copies sub1 as 5
 # cp -rf /ssd/sub1 /ssd/5       -> copies sub1 into 5

 However you are right. It could be fixed easily by adding a switch like
 --script, which forces the last part of the destination to be handled as
 the name of the subvolume, raising an error if it already exists.

 Is subvolume snapshot the only command which suffers from this kind of
 problem?

Isn't this a situation where supporting a trailing / would help?

For example, a "/" at the end means "put the snapshot into the
folder".  Thus "btrfs subvolume snapshot /ssd/sub1 /ssd/5/" would
create a sub1 snapshot inside the 5/ folder.  Running it a second
time would error out since /ssd/5/sub1/ already exists.  And if the 5/
folder doesn't exist, it would error out.

And without the "/" at the end, it means "name the snapshot".  Thus
"btrfs subvolume snapshot /ssd/sub1 /ssd/5" would create a snapshot
named /ssd/5.  Running the command again would error out due to the
snapshot already existing.  The snapshot is created if /ssd/5 doesn't
exist, and it errors out if a 5/ folder already exists.

Or, something along those lines.  Similar to how other apps work
with/without a trailing /.
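The proposed rule is easy to state precisely. A sketch of the destination-resolution step (a hypothetical helper, not btrfs-progs code):

```python
import os

def snapshot_target(source, dest):
    """Resolve the snapshot path under the trailing-slash proposal:
    a trailing '/' means "create a snapshot named after the source
    inside dest"; no trailing '/' means "create the snapshot at
    exactly dest" (erroring out if it already exists)."""
    if dest.endswith('/'):
        return os.path.join(dest.rstrip('/'),
                            os.path.basename(source.rstrip('/')))
    return dest
```

With this rule, `snapshot_target('/ssd/sub1', '/ssd/5/')` resolves to `/ssd/5/sub1`, while `snapshot_target('/ssd/sub1', '/ssd/5')` resolves to `/ssd/5` every time, so repeated invocations behave identically.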

-- 
Freddie Cash
fjwc...@gmail.com


Re: [PATCH] Btrfs: fix free space cache when there are pinned extents and clusters V2

2011-04-04 Thread Mitch Harder
On Fri, Apr 1, 2011 at 9:55 AM, Josef Bacik <jo...@redhat.com> wrote:
 I noticed a huge problem with the free space cache that was presenting as an
 early ENOSPC.  Turns out when writing the free space cache out I forgot to take
 into account pinned extents and more importantly clusters.  This would result in
 us leaking free space every time we unmounted the filesystem and remounted it.  I
 fix this by making sure to check and see if the current block group has a
 cluster and writing out any entries that are in the cluster to the cache, as
 well as writing any pinned extents we currently have to the cache since those
 will be available for us to use the next time the fs mounts.  This patch also
 adds a check to the end of load_free_space_cache to make sure we got the right
 amount of free space cache, and if not make sure to clear the cache and re-cache
 the old fashioned way.  Thanks,

 Signed-off-by: Josef Bacik <jo...@redhat.com>
 ---
 V1->V2:
 - use block_group->free_space instead of
  btrfs_block_group_free_space(block_group)

  fs/btrfs/free-space-cache.c |   82 --
  1 files changed, 78 insertions(+), 4 deletions(-)

 diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
 index f03ef97..74bc432 100644
 --- a/fs/btrfs/free-space-cache.c
 +++ b/fs/btrfs/free-space-cache.c
 @@ -24,6 +24,7 @@
  #include "free-space-cache.h"
  #include "transaction.h"
  #include "disk-io.h"
 +#include "extent_io.h"

  #define BITS_PER_BITMAP                (PAGE_CACHE_SIZE * 8)
  #define MAX_CACHE_BYTES_PER_GIG        (32 * 1024)
 @@ -222,6 +223,7 @@ int load_free_space_cache(struct btrfs_fs_info *fs_info,
        u64 num_entries;
        u64 num_bitmaps;
        u64 generation;
 +       u64 used = btrfs_block_group_used(&block_group->item);
        u32 cur_crc = ~(u32)0;
        pgoff_t index = 0;
        unsigned long first_page_offset;
 @@ -467,6 +469,17 @@ next:
                index++;
        }

 +       spin_lock(&block_group->tree_lock);
 +       if (block_group->free_space != (block_group->key.offset - used -
 +                                       block_group->bytes_super)) {
 +               spin_unlock(&block_group->tree_lock);
 +               printk(KERN_ERR "block group %llu has an wrong amount of free "
 +                      "space\n", block_group->key.objectid);
 +               ret = 0;
 +               goto free_cache;
 +       }
 +       spin_unlock(&block_group->tree_lock);
 +
        ret = 1;
  out:
        kfree(checksums);
 @@ -495,8 +508,11 @@ int btrfs_write_out_cache(struct btrfs_root *root,
        struct list_head *pos, *n;
        struct page *page;
        struct extent_state *cached_state = NULL;
 +       struct btrfs_free_cluster *cluster = NULL;
 +       struct extent_io_tree *unpin = NULL;
        struct list_head bitmap_list;
        struct btrfs_key key;
 +       u64 start, end, len;
        u64 bytes = 0;
        u32 *crc, *checksums;
        pgoff_t index = 0, last_index = 0;
 @@ -505,6 +521,7 @@ int btrfs_write_out_cache(struct btrfs_root *root,
        int entries = 0;
        int bitmaps = 0;
        int ret = 0;
 +       bool next_page = false;

        root = root->fs_info->tree_root;

 @@ -551,6 +568,18 @@ int btrfs_write_out_cache(struct btrfs_root *root,
         */
        first_page_offset = (sizeof(u32) * num_checksums) + sizeof(u64);

 +       /* Get the cluster for this block_group if it exists */
 +       if (!list_empty(&block_group->cluster_list))
 +               cluster = list_entry(block_group->cluster_list.next,
 +                                    struct btrfs_free_cluster,
 +                                    block_group_list);
 +
 +       /*
 +        * We shouldn't have switched the pinned extents yet so this is the
 +        * right one
 +        */
 +       unpin = root->fs_info->pinned_extents;
 +
        /*
         * Lock all pages first so we can lock the extent safely.
         *
 @@ -580,6 +609,12 @@ int btrfs_write_out_cache(struct btrfs_root *root,
        lock_extent_bits(&BTRFS_I(inode)->io_tree, 0, i_size_read(inode) - 1,
                         0, &cached_state, GFP_NOFS);

 +       /*
 +        * When searching for pinned extents, we need to start at our start
 +        * offset.
 +        */
 +       start = block_group-key.objectid;
 +
        /* Write out the extent entries */
        do {
                struct btrfs_free_space_entry *entry;
 @@ -587,6 +622,8 @@ int btrfs_write_out_cache(struct btrfs_root *root,
                unsigned long offset = 0;
                unsigned long start_offset = 0;

 +               next_page = false;
 +
                if (index == 0) {
                        start_offset = first_page_offset;
                        offset = start_offset;
 @@ -598,7 +635,7 @@ int btrfs_write_out_cache(struct btrfs_root *root,
                entry = addr + start_offset;

                memset(addr, 0, PAGE_CACHE_SIZE);
 -               while (1) {
 +               while (node && 

bug report

2011-04-04 Thread Larry D'Anna
So I made a filesystem image:

 $ dd if=/dev/zero of=root_fs bs=1024 count=$(expr 1024 \* 1024)
 $ mkfs.btrfs root_fs 

Then I put some Debian on it (my kernel is 2.6.35-27-generic #48-Ubuntu):

 $ mkdir root 
 $ mount -o loop root_fs root 
 $ debootstrap sid root 
 $ umount root

Then I run UML (2.6.35-1um-0ubuntu1):

 $ linux single eth0=tuntap,tap0,fe:fd:f0:00:00:01

and then try to apt-get some stuff, and the result is this:

btrfs csum failed ino 17498 off 2412544 csum 491052325 private 446722121
btrfs csum failed ino 17498 off 2416640 csum 2077462867 private 906054605
btrfs csum failed ino 17498 off 2420736 csum 263316283 private 2215839539
btrfs csum failed ino 17498 off 2424832 csum 4177088190 private 2414263107
btrfs csum failed ino 17498 off 2428928 csum 4028205539 private 3560605623
btrfs csum failed ino 17498 off 2433024 csum 1724529595 private 200634979
btrfs csum failed ino 17498 off 2437120 csum 4038631380 private 2927872002
btrfs csum failed ino 17498 off 2441216 csum 2616837020 private 729736037
btrfs csum failed ino 17498 off 2498560 csum 2566472073 private 3417075259
btrfs csum failed ino 17498 off 2502656 csum 2566472073 private 1410567947


 $ find / -mount -inum 17498  
 /var/cache/apt/srcpkgcache.bin

I've gone through this twice now, so it's repeatable at least.  I know 2.6.35 is
kinda old, but was this kind of thing to be expected back then?


  --larry


Re: [PATCH] Btrfs: fix subvolume mount by name problem when default mount subvolume is set

2011-04-04 Thread Chris Mason
Excerpts from Zhong, Xin's message of 2011-03-31 03:59:22 -0400:
 We create two subvolumes (meego_root and meego_home) in
 the btrfs root directory, and set meego_root as the default
 mount subvolume. After we remount btrfs, meego_root is mounted
 at the top directory by default. Then when we try to mount
 meego_home (subvol=meego_home) to a subdirectory, it fails.
 The problem is that when the default mount subvolume is set to
 meego_root, we search for meego_home inside it and cannot find it.
 So the solution is to search for meego_home in the btrfs root
 directory instead when subvol=meego_home is given.

I think this one is difficult because if they have set the default
subvolume they might have done so because the original default has the
result of a busted upgrade or something in it.

So, I think the subvol= should be relative to the default.  Would it
work for you to add a new mount option to specify the subvol id to
search for subvol=?

-chris