Re: [PATCH 2/2 v2] Btrfs: Per file/directory controls for COW and compression
Hello, I would like to ask about the status of this feature/patch: has it been accepted into the btrfs code, and how can I use it? I am interested in enabling compression in a specific folder (force-compress would be ideal) of a large btrfs volume, and disabling it for the rest.

On 21/3/2011 10:57 AM, liubo wrote:

Data compression and data COW are controlled across the entire FS by mount options right now. ioctls are needed to set this on a per-file or per-directory basis. This has been proposed previously, but VFS developers wanted us to use generic ioctls rather than btrfs-specific ones.

According to Chris's comment, there should be just one true compression method (probably LZO) stored in the super. However, before that, we will wait for that one method to be stable enough to be adopted into the super. So I list it as a long-term goal, and just store it in RAM today.

After applying this patch, we can use the generic FS_IOC_SETFLAGS ioctl to control a file's or directory's datacow and compression attributes.

NOTE:
- The compression type is selected by the following rule: if we mount btrfs with a compress option, i.e. zlib/lzo, that is the type; otherwise, we use the default compress type (zlib today).

v1->v2: Rebase the patch against the latest btrfs.

Signed-off-by: Liu Bo <liubo2...@cn.fujitsu.com>
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/disk-io.c |  6 ++
 fs/btrfs/inode.c   | 32
 fs/btrfs/ioctl.c   | 41 +
 4 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8b4b9d1..b77d1a5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1283,6 +1283,7 @@ struct btrfs_root {
 #define BTRFS_INODE_NODUMP		(1 << 8)
 #define BTRFS_INODE_NOATIME		(1 << 9)
 #define BTRFS_INODE_DIRSYNC		(1 << 10)
+#define BTRFS_INODE_COMPRESS		(1 << 11)

 /* some macros to generate set/get funcs for the struct fields.  This
  * assumes there is a lefoo_to_cpu for every type, so lets make a simple
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3e1ea3e..a894c12 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1762,6 +1762,12 @@ struct btrfs_root *open_ctree(struct super_block *sb,

 	btrfs_check_super_valid(fs_info, sb->s_flags & MS_RDONLY);

+	/*
+	 * In the long term, we'll store the compression type in the super
+	 * block, and it'll be used for per file compression control.
+	 */
+	fs_info->compress_type = BTRFS_COMPRESS_ZLIB;
+
 	ret = btrfs_parse_options(tree_root, options);
 	if (ret) {
 		err = ret;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index db67821..e687bb9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -381,7 +381,8 @@ again:
 	 */
 	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS) &&
 	    (btrfs_test_opt(root, COMPRESS) ||
-	     (BTRFS_I(inode)->force_compress))) {
+	     (BTRFS_I(inode)->force_compress) ||
+	     (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))) {
 		WARN_ON(pages);
 		pages = kzalloc(sizeof(struct page *) * nr_pages, GFP_NOFS);
@@ -1253,7 +1254,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
 	else if (!btrfs_test_opt(root, COMPRESS) &&
-		 !(BTRFS_I(inode)->force_compress))
+		 !(BTRFS_I(inode)->force_compress) &&
+		 !(BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))
 		ret = cow_file_range(inode, locked_page, start, end,
 				     page_started, nr_written, 1);
 	else
@@ -4581,8 +4583,6 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 	location->offset = 0;
 	btrfs_set_key_type(location, BTRFS_INODE_ITEM_KEY);

-	btrfs_inherit_iflags(inode, dir);
-
 	if ((mode & S_IFREG)) {
 		if (btrfs_test_opt(root, NODATASUM))
 			BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM;
@@ -4590,6 +4590,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 			BTRFS_I(inode)->flags |= BTRFS_INODE_NODATACOW;
 	}

+	btrfs_inherit_iflags(inode, dir);
+
 	insert_inode_hash(inode);
 	inode_tree_add(inode);

 	return inode;
@@ -6803,6 +6805,26 @@ static int btrfs_getattr(struct vfsmount *mnt,
 	return 0;
 }

+/*
+ * If a file is moved, it will inherit the cow and compression flags of the new
+ * directory.
+ */
+static void fixup_inode_flags(struct inode *dir, struct inode *inode)
+{
+	struct btrfs_inode *b_dir = BTRFS_I(dir);
+	struct btrfs_inode *b_inode = BTRFS_I(inode);
+
+	if (b_dir->flags & BTRFS_INODE_NODATACOW)
+		b_inode->flags |= BTRFS_INODE_NODATACOW;
+
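Once a kernel with this patch is running, the selection rule in the NOTE above reduces to a small lookup. A minimal sketch in shell (`pick_compress_type` is a hypothetical helper that mirrors, not calls, the kernel logic):

```shell
# Mirror of the patch's selection rule: if the filesystem was mounted with
# compress=zlib or compress=lzo, that type wins; otherwise fall back to the
# default stored in fs_info->compress_type (zlib today).
pick_compress_type() {
    case "$1" in
        *compress=lzo*)  echo lzo ;;
        *compress=zlib*) echo zlib ;;
        *)               echo zlib ;;   # default
    esac
}

pick_compress_type "rw,compress=lzo"    # → lzo
pick_compress_type "rw,noatime"         # → zlib
```

From userspace, the expected way to flip the new flag would be `chattr +c file` (chattr issues FS_IOC_SETFLAGS under the hood), and `chattr +C` for nodatacow, though both require a kernel carrying this patch and a btrfs mount.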
Odd rebalancing behavior
I have an external 4-disk enclosure connected through USB 2.0 (my laptop does not have a USB 3.0 connector, and the eSATA connector somehow does not work); it initially held a 2-disk btrfs soft-RAID1 filesystem (both data and metadata are RAID1). I recently added two more disks and ran a rebalance. To my surprise, it went past the point where all four disks had the same amount of disk usage, and went all the way to the original disks being empty and the new disks holding all the data!

Label: 'media.store'  uuid: 4cfd3551-aa85-4399-b872-9238ddb14c97
	Total devices 4 FS bytes used 1.22TB
	devid    3 size 1.82TB used 1.24TB path /dev/sdb
	devid    4 size 1.82TB used 1.24TB path /dev/sdc
	devid    2 size 1.82TB used 8.00MB path /dev/sde
	devid    1 size 1.82TB used 12.00MB path /dev/sdd

Is this to be expected? Would another rebalance fix it, or should I force-stop it by shutting down when the disk usage is roughly balanced?

This is on Fedora 15 pre-release, x86_64, fully updated, kernel 2.6.38.2-9 and btrfs-progs 0.19-13.

Thanks,

-- 
Michel Alexandre Salim
GPG key ID: 78884778

() ascii ribbon campaign - against html e-mail
/\ www.asciiribbon.org - against proprietary attachments

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
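For what it's worth, "balanced" here has a predictable target: with RAID1 every byte of the 1.22 TB of data exists twice, so an even spread across the four 1.82 TB disks would be roughly 0.61 TB per device. The arithmetic, as a quick sketch (ignoring metadata overhead and chunk granularity):

```shell
# Expected per-device usage after an even RAID1 balance (rough estimate).
fs_bytes_used_tb=1.22   # "FS bytes used" from 'btrfs filesystem show'
raid1_copies=2
devices=4
awk -v u="$fs_bytes_used_tb" -v c="$raid1_copies" -v d="$devices" \
    'BEGIN { printf "%.2f TB per device\n", u * c / d }'
# → 0.61 TB per device
```

The `used 1.24TB` on each of the two full disks is consistent with all of the ~2.44 TB of raw RAID1 data having migrated onto just those two devices.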
Re: btrfs balancing start - and stop?
2011-04-03 21:35:00 +0200, Helmut Hullen:

Hello, Stephane, you wrote on 03.04.11:

balancing about 2 TByte needed about 20 hours. [...]

Hugo has explained the limits of regarding dmesg | grep relocating or (more simply) the last lines of dmesg, looking for the "relocating" lines. But: what do these lines tell now? What is the (pessimistic) estimation when you extrapolate the data? [...]

4.7 more days to go. And I reckon it will have written about 9 TB to disk by that time (which is the total size of the volume, though only 3.8 TB are occupied).

Yes - that's the pessimistic estimation. As Hugo has explained, it can finish faster - just look at the data again tomorrow. [...]

That may actually be an optimistic estimation, as there hasn't been much progress in the last 34 hours:

# dmesg | awk -F '[][ ]+' '/reloc/ && c++%5==0 {x=(n-$7)/($2-t)/1048576; printf "%s\t%s\t%.2f\t%*s\n", $2/3600, $7, x, x/3, ""; t=$2; n=$7}' | tr ' ' '*' | tail -40

125.629  4170039951360  11.93   ***
125.641  4166818725888  70.99   ***
125.699  4157155049472  43.87   **
125.753  4144270147584  63.34   *
125.773  4137827696640  84.98
125.786  4134606471168  64.39   *
125.823  4124942794752  70.09   ***
125.87   4112057892864  71.66   ***
125.887  4105615441920  100.60  *
125.898  4102394216448  81.26   ***
125.935  4092730540032  69.06   ***
126.33   4085751218176  4.69    *
131.904  4072597880832  0.63
132.082  4059712978944  19.20   **
132.12   4053270528000  45.52   ***
132.138  4050049302528  45.60   ***
132.225  4040385626112  29.68   *
132.267  4027500724224  81.17   ***
132.283  4021058273280  106.31  ***
132.29   4017837047808  110.42
132.316  4008173371392  100.54  *
132.358  3995288469504  81.18   ***
132.475  3988846018560  14.62
132.514  3985624793088  21.55   ***
132.611  3975961116672  26.40
132.663  3963076214784  65.31   *
132.678  3956633763840  120.11
132.685  3956365328384  10.26   ***
137.701  3949922877440  0.34
137.709  3946701651968  106.54  ***
137.744  3937037975552  72.10
137.889  3927105863680  18.18   **
137.901  3926837428224  5.85    *
141.555  3926300557312  0.04
141.93   3925226815488  0.76
151.227  3924421509120  0.02
151.491  3924153073664  0.27
151.712  3923616202752  0.64
165.301  3922542460928  0.02
174.346  3921737154560  0.02

At this rate (the third field, expressed in MiB/s), it could take months to complete. iostat still reports writes at about 5 MiB/s, though. Note that this system is not doing anything else at all. There definitely seems to be scope for optimisation in the balancing, I'd say.

-- 
Stephane
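The one-liner above is hard to read once the quoting is mangled; its core is just "delta of the relocating byte counter divided by elapsed time". A simplified sketch with two made-up sample lines (real input comes from `dmesg | grep relocating`):

```shell
# Throughput between successive "relocating block group" lines, in MiB/s.
# The two sample lines below are invented for illustration.
printf '%s\n' \
  '[100.000] btrfs: relocating block group 5000000000 flags 1' \
  '[200.000] btrfs: relocating block group 4895142400 flags 1' |
awk -F '[][ ]+' '/relocating/ {
    if (t)                        # need two samples before a delta exists
        printf "%.2f MiB/s\n", (prev - $7) / ($2 - t) / 1048576
    t = $2; prev = $7             # $2 = timestamp (s), $7 = block group bytenr
}'
# → 1.00 MiB/s
```

The bytenr decreases as the balance walks downward through the block groups, which is why the delta is `prev - $7` rather than the other way around.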
Re: btrfs subvolume snapshot syntax too smart
On Mon, Apr 4, 2011 at 12:47 PM, Goffredo Baroncelli <kreij...@libero.it> wrote:

On 04/04/2011 09:09 PM, krz...@gmail.com wrote:

I understand btrfs's intent, but the same command run twice should not give different results. This really makes snapshot automation hard:

root@sv12 [/ssd]# btrfs subvolume snapshot /ssd/sub1 /ssd/5
Create a snapshot of '/ssd/sub1' in '/ssd/5'
root@sv12 [/ssd]# btrfs subvolume snapshot /ssd/sub1 /ssd/5
Create a snapshot of '/ssd/sub1' in '/ssd/5/sub1'
root@sv12 [/ssd]# btrfs subvolume snapshot /ssd/sub1 /ssd/5
Create a snapshot of '/ssd/sub1' in '/ssd/5/sub1'
ERROR: cannot snapshot '/ssd/sub1'

The same is true for cp:

# cp -rf /ssd/sub1 /ssd/5    <- copies sub1 as 5
# cp -rf /ssd/sub1 /ssd/5    <- copies sub1 into 5

However, you are right. It could be fixed easily by adding a switch like --script, which forces the last part of the destination to be handled as the name of the subvolume, raising an error if it already exists. Is subvolume snapshot the only command which suffers from this kind of problem?

Isn't this a situation where supporting a trailing / would help?

For example, with the / at the end meaning "put the snapshot into the folder": btrfs subvolume snapshot /ssd/sub1 /ssd/5/ would create a sub1 snapshot inside the 5/ folder. Running it a second time would error out since /ssd/5/sub1/ already exists. And if the 5/ folder doesn't exist, it would error out.

And without the / at the end meaning "name the snapshot": btrfs subvolume snapshot /ssd/sub1 /ssd/5 would create a snapshot named /ssd/5. Running the command again would error out due to the snapshot already existing. And it errors out if the 5/ folder already exists.

Or something along those lines, similar to how other apps work with/without a trailing /.
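The proposed trailing-slash rule is easy to prototype. A rough sketch in shell (`snapshot_dest` is a hypothetical helper, not part of btrfs-progs, and it only computes the target path rather than taking a snapshot):

```shell
# Proposed semantics:
#   "dst/" -> place the snapshot *inside* dst, error if dst/<name> exists
#   "dst"  -> the snapshot *is* dst, error if dst exists
snapshot_dest() {
    src_name=$(basename "$1")
    dst=$2
    case "$dst" in
        */) target="${dst%/}/$src_name" ;;   # trailing slash: nest inside
        *)  target=$dst ;;                   # no slash: dst is the name
    esac
    if [ -e "$target" ]; then
        echo "ERROR: '$target' exists" >&2
        return 1
    fi
    echo "$target"
}

snapshot_dest /ssd/sub1 /ssd/5/   # prints /ssd/5/sub1 (if nothing is in the way)
snapshot_dest /ssd/sub1 /ssd/5    # prints /ssd/5 (if nothing is in the way)
```

Either way, repeating the same command yields the same result or a hard error, never a silently different target, which is the property automation needs.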
-- 
Freddie Cash
fjwc...@gmail.com
Re: [PATCH] Btrfs: fix free space cache when there are pinned extents and clusters V2
On Fri, Apr 1, 2011 at 9:55 AM, Josef Bacik <jo...@redhat.com> wrote:

I noticed a huge problem with the free space cache that was presenting as an early ENOSPC. It turns out that when writing the free space cache out, I forgot to take into account pinned extents and, more importantly, clusters. This would result in us leaking free space every time we unmounted the filesystem and remounted it.

I fix this by making sure to check and see if the current block group has a cluster, and writing out any entries that are in the cluster to the cache, as well as writing any pinned extents we currently have to the cache, since those will be available for us to use the next time the fs mounts. This patch also adds a check to the end of load_free_space_cache to make sure we got the right amount of free space cache, and if not, to clear the cache and re-cache the old-fashioned way. Thanks,

Signed-off-by: Josef Bacik <jo...@redhat.com>
---
V1->V2:
- use block_group->free_space instead of btrfs_block_group_free_space(block_group)

 fs/btrfs/free-space-cache.c | 82 --
 1 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index f03ef97..74bc432 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -24,6 +24,7 @@
 #include "free-space-cache.h"
 #include "transaction.h"
 #include "disk-io.h"
+#include "extent_io.h"

 #define BITS_PER_BITMAP		(PAGE_CACHE_SIZE * 8)
 #define MAX_CACHE_BYTES_PER_GIG	(32 * 1024)
@@ -222,6 +223,7 @@ int load_free_space_cache(struct btrfs_fs_info *fs_info,
 	u64 num_entries;
 	u64 num_bitmaps;
 	u64 generation;
+	u64 used = btrfs_block_group_used(&block_group->item);
 	u32 cur_crc = ~(u32)0;
 	pgoff_t index = 0;
 	unsigned long first_page_offset;
@@ -467,6 +469,17 @@ next:
 		index++;
 	}

+	spin_lock(&block_group->tree_lock);
+	if (block_group->free_space != (block_group->key.offset - used -
+					block_group->bytes_super)) {
+		spin_unlock(&block_group->tree_lock);
+		printk(KERN_ERR "block group %llu has an wrong amount of free "
+		       "space\n", block_group->key.objectid);
+		ret = 0;
+		goto free_cache;
+	}
+	spin_unlock(&block_group->tree_lock);
+
 	ret = 1;
 out:
 	kfree(checksums);
@@ -495,8 +508,11 @@ int btrfs_write_out_cache(struct btrfs_root *root,
 	struct list_head *pos, *n;
 	struct page *page;
 	struct extent_state *cached_state = NULL;
+	struct btrfs_free_cluster *cluster = NULL;
+	struct extent_io_tree *unpin = NULL;
 	struct list_head bitmap_list;
 	struct btrfs_key key;
+	u64 start, end, len;
 	u64 bytes = 0;
 	u32 *crc, *checksums;
 	pgoff_t index = 0, last_index = 0;
@@ -505,6 +521,7 @@ int btrfs_write_out_cache(struct btrfs_root *root,
 	int entries = 0;
 	int bitmaps = 0;
 	int ret = 0;
+	bool next_page = false;

 	root = root->fs_info->tree_root;
@@ -551,6 +568,18 @@ int btrfs_write_out_cache(struct btrfs_root *root,
 	 */
 	first_page_offset = (sizeof(u32) * num_checksums) + sizeof(u64);

+	/* Get the cluster for this block_group if it exists */
+	if (!list_empty(&block_group->cluster_list))
+		cluster = list_entry(block_group->cluster_list.next,
+				     struct btrfs_free_cluster,
+				     block_group_list);
+
+	/*
+	 * We shouldn't have switched the pinned extents yet so this is the
+	 * right one
+	 */
+	unpin = root->fs_info->pinned_extents;
+
 	/*
 	 * Lock all pages first so we can lock the extent safely.
 	 *
@@ -580,6 +609,12 @@ int btrfs_write_out_cache(struct btrfs_root *root,
 	lock_extent_bits(&BTRFS_I(inode)->io_tree, 0, i_size_read(inode) - 1,
 			 0, &cached_state, GFP_NOFS);

+	/*
+	 * When searching for pinned extents, we need to start at our start
+	 * offset.
+	 */
+	start = block_group->key.objectid;
+
 	/* Write out the extent entries */
 	do {
 		struct btrfs_free_space_entry *entry;
@@ -587,6 +622,8 @@ int btrfs_write_out_cache(struct btrfs_root *root,
 		unsigned long offset = 0;
 		unsigned long start_offset = 0;

+		next_page = false;
+
 		if (index == 0) {
 			start_offset = first_page_offset;
 			offset = start_offset;
@@ -598,7 +635,7 @@ int btrfs_write_out_cache(struct btrfs_root *root,
 		entry = addr + start_offset;
 		memset(addr, 0, PAGE_CACHE_SIZE);
-		while (1) {
+		while (node
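The consistency check the patch adds to load_free_space_cache is plain arithmetic: the cached free space must equal the block group's size (key.offset) minus the allocated bytes minus the superblock-reserved bytes, or the cache is discarded and rebuilt. A sketch of that invariant with invented numbers:

```shell
# free_space must equal size - used - bytes_super, or the on-disk cache is
# stale (e.g. it missed pinned extents or cluster entries at write-out time).
# All values below are made up for illustration.
size=$((1024 * 1024 * 1024))      # 1 GiB block group (key.offset)
used=$((700 * 1024 * 1024))       # 700 MiB allocated
bytes_super=$((4 * 1024 * 1024))  # reserved for superblock copies
expected_free=$((size - used - bytes_super))
echo "$expected_free"             # → 335544320; cached free_space must match
```

Before this patch, the leaked pinned/cluster space made the cached value fall short of this expected figure after a remount, which is exactly what the new check catches.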
bug report
So I made a filesystem image:

$ dd if=/dev/zero of=root_fs bs=1024 count=$(expr 1024 \* 1024)
$ mkfs.btrfs root_fs

Then I put some Debian on it (my kernel is 2.6.35-27-generic #48-Ubuntu):

$ mkdir root
$ mount -o loop root_fs root
$ debootstrap sid root
$ umount root

Then I run UML (2.6.35-1um-0ubuntu1):

$ linux single eth0=tuntap,tap0,fe:fd:f0:00:00:01

and then try to apt-get some stuff, and the result is this:

btrfs csum failed ino 17498 off 2412544 csum 491052325 private 446722121
btrfs csum failed ino 17498 off 2416640 csum 2077462867 private 906054605
btrfs csum failed ino 17498 off 2420736 csum 263316283 private 2215839539
btrfs csum failed ino 17498 off 2424832 csum 4177088190 private 2414263107
btrfs csum failed ino 17498 off 2428928 csum 4028205539 private 3560605623
btrfs csum failed ino 17498 off 2433024 csum 1724529595 private 200634979
btrfs csum failed ino 17498 off 2437120 csum 4038631380 private 2927872002
btrfs csum failed ino 17498 off 2441216 csum 2616837020 private 729736037
btrfs csum failed ino 17498 off 2498560 csum 2566472073 private 3417075259
btrfs csum failed ino 17498 off 2502656 csum 2566472073 private 1410567947

$ find / -mount -inum 17498
/var/cache/apt/srcpkgcache.bin

I've gone through this twice now, so it's repeatable at least. I know 2.6.35 is kinda old, but was this kind of thing to be expected back then?

--larry
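The lookup at the end of the report generalizes: stat(1) reports a file's inode number, and find's -inum (with -mount, or its synonym -xdev, to stay on one filesystem) maps an inode number from kernel logs back to a path. A self-contained sketch on a scratch file:

```shell
# Map an inode number back to a path, as done for ino 17498 in the report.
d=$(mktemp -d)
touch "$d/victim"
ino=$(stat -c %i "$d/victim")   # inode number, like the "ino" in the csum log
find "$d" -xdev -inum "$ino"    # prints the path of the scratch file
rm -r "$d"
```

Inode numbers are only unique per filesystem, which is why restricting the search with -mount/-xdev matters on a multi-mount system.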
Re: [PATCH] Btrfs: fix subvolume mount by name problem when default mount subvolume is set
Excerpts from Zhong, Xin's message of 2011-03-31 03:59:22 -0400:

We create two subvolumes (meego_root and meego_home) in the btrfs root directory, and set meego_root as the default mount subvolume. After we remount the btrfs, meego_root is mounted to the top directory by default. Then, when we try to mount meego_home (subvol=meego_home) to a subdirectory, it fails. The problem is that when the default mount subvolume is set to meego_root, we search for meego_home inside it and cannot find it. So the solution is to search for meego_home in the btrfs root directory instead when subvol=meego_home is given.

I think this one is difficult, because if they have set the default subvolume, they might have done so because the original default has the result of a busted upgrade or something in it. So I think the subvol= should be relative to the default. Would it work for you to add a new mount option to specify the subvol id to search for subvol=?

-chris