Re: Moving top level to a subvolume
On Fri, Jun 8, 2012 at 2:40 PM, Arne Jansen sensi...@gmx.net wrote:
> On 06/08/2012 09:24 PM, Matthew Hawn wrote:
>> I just converted my root filesystem to btrfs with btrfs-convert. However, since I am running Ubuntu, I would like to have the same subvolume structure as a default install. How do I move the top-level subvolume (where all my files currently are) to another subvolume?
>
> Just snapshot the root subvol and continue working in the snapshot.

... yeah but that solution totally sucks when you:

a) have a lot of data
b) need to do this via script
c) ???

... because in a), data will be *copied* the slow way, and in b) you leave a bunch of junk laying around in the old root that will rot unless you `rm -rf` it ... and idk about you, but issuing what is very near to that command on someone else's machine -- via script -- makes me REALLY uneasy ;-)

i have asked this exact question at least 4 times specifically, and referenced it probably 8-10, in the last 3 years or more. i needed it then. i still need it now. but since i never got an answer up/down or around, i gave up and told people to `rm -rf` themselves ...

http://markmail.org/message/7hj5ioqrztkeerqv

... that's from May of 2010, but i don't think it's the first.

so, would it be possible to implement this, or could someone kindly (and briefly!) explain why it cannot be done?

1. people install stuff to the top-level
2. top-level is unmanageable
3. problem

in my case i wrote an initramfs hook that implemented rollback functionality, but there was no way for me to cleanly -- and safely -- rotate the user's setup to one that DOES NOT have user items in the top-level volume.

--
C Anthony
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Moving top level to a subvolume
On 13.06.2012 09:04, C Anthony Risinger wrote:
> On Fri, Jun 8, 2012 at 2:40 PM, Arne Jansen sensi...@gmx.net wrote:
>> On 06/08/2012 09:24 PM, Matthew Hawn wrote:
>>> I just converted my root filesystem to btrfs with btrfs-convert. However, since I am running Ubuntu, I would like to have the same subvolume structure as a default install. How do I move the top-level subvolume (where all my files currently are) to another subvolume?
>>
>> Just snapshot the root subvol and continue working in the snapshot.
>
> ... yeah but that solution totally sucks when you:
>
> a) have a lot of data
> b) need to do this via script
> c) ???
>
> ... because in a), data will be *copied* the slow way, and in b) you leave a bunch of junk laying around in the old root that will rot unless you `rm -rf` it ... and idk about you, but issuing what is very near to that command on someone else's machine -- via script -- makes me REALLY uneasy ;-)

well, don't put data in the top level in the first place. Yes, you have to remove the content of the subvol / by rm -rf, but I don't really see the problem with it. What I don't understand is why you think data will be copied.

> i have asked this exact question at least 4 times specifically, and referenced it probably 8-10, in the last 3 years or more. i needed it then. i still need it now. but since i never got an answer up/down or around, i gave up and told people to `rm -rf` themselves ...
>
> http://markmail.org/message/7hj5ioqrztkeerqv
>
> ... that's from May of 2010, but i don't think it's the first.
>
> so, would it be possible to implement this, or could someone kindly (and briefly!) explain why it cannot be done?

The default subvol ('/') has the special number 5 and is expected to always be around. All other subvols get numbers starting with 256. Creating a new 5 and internally renumbering the old 5 isn't easy, because each tree block has an owner recorded in it. Also, all backreferences have the root number in them.
If you have to touch each tree block, you may as well choose the snapshot/rm -rf approach.

> 1. people install stuff to the top-level
> 2. top-level is unmanageable
> 3. problem
>
> in my case i wrote an initramfs hook that implemented rollback functionality, but there was no way for me to cleanly -- and safely -- rotate the user's setup to one that DOES NOT have user items in the top-level volume.

Can't you instead add code to the installer that warns a user if he wants to install into the default subvol? Or you could hack mkfs.btrfs to always create an additional subvol. Even making / read-only except for creating mountpoints could be possible. Just some random ideas...

-Arne
Re: Moving top level to a subvolume
Fajar A. Nugraha posted on Wed, 13 Jun 2012 08:49:47 +0700 as excerpted:

> As for lose their filesystems, are there recent ones that uses one of the three distros above, and is purely btrfs fault? The ones I can remember (from the post to this list) were broken on earlier kernels, or caused by bad disks.

I tried btrfs during the 3.4 cycle for a bit, and didn't lose the whole filesystem, but definitely found it not up to my usual standard of robustness, the one set by my previous (and now current again) filesystem, Chris's former project, reiserfs.

My system's old and has a bit of a problem with overheating in the Phoenix summer, so it has been suffering SATA resets (not the disk, the SATA chipset most likely, and/or issues with the graphics overheating, since I'm using an AMD 8xxx chipset with AGPGART split between IOMMU for storage I/O and graphics) and full system freezes.

Not only did I have way more stuff disappearing or being zeroed out than on reiserfs (in default data=ordered mode), but in one case I had a segment disappear out of the middle of a file, and in another, I had firefox's crash-resume-file /content/ show up as what SHOULD have been an entirely unrelated configuration file.

Naturally I had backups to restore from, and if it weren't for the freezes, it would likely have been fine, but it's exactly this sort of corner case that filesystems need to be able to deal with. What bothered me wasn't losing or zeroing out the last few seconds of work, which has well documented explanations, but having random segments of files that I hadn't changed in some time disappear (whether the app was rewriting them with the same data is another question), and having one file's content show up under an entirely unrelated name. I thought that's the sort of thing btrfs checksums were supposed to detect and effectively zero out, but...

I decided that's /too/ experimental for me ATM, especially with not-quite-stable hardware (it's worth noting that I survived bad memory and the related crashes on reiserfs without /that/ sort of damage, at least not since data=ordered mode!), so I am back on reiserfs for now, anyway. I'll likely try again next year sometime...

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: Btrfs and data nocow per inode basis
On 06/13/2012 05:10 AM, Ted Ts'o wrote:
> On Tue, Jun 12, 2012 at 04:44:23PM -0400, Chris Mason wrote:
>> On Tue, Jun 12, 2012 at 01:15:27PM -0600, Ted Ts'o wrote:
>>> It appears the NOCOW_FL flag is currently a no-op in the 3.2 kernel?
>>
>> It's not a noop, but it is only setting the NODATACOW flag. It needs to set the nodatasum flag as well, just like the mount -o nodatacow mount option does. I'll fix this up on the kernel side, thanks Ted.

ohh, that's my fault...sorry.

> Here's the final patch to e2fsprogs that will be going into 1.42.4:

This commit lacks the related usage update, I'll send a patch for it :)

thanks,
liubo

> commit 5a23c93aeb65d61892a47f8f27bffad38f4759ea
> Author: Theodore Ts'o ty...@mit.edu
> Date:   Tue Jun 12 17:09:39 2012 -0400
>
>     lsattr, chattr: add support for btrfs's No_COW flag
>
>     Signed-off-by: Theodore Ts'o ty...@mit.edu
>
> diff --git a/lib/e2p/pf.c b/lib/e2p/pf.c
> index f03193c..e2f8ce5 100644
> --- a/lib/e2p/pf.c
> +++ b/lib/e2p/pf.c
> @@ -49,6 +49,7 @@ static struct flags_name flags_array[] = {
>  	{ EXT2_TOPDIR_FL, "T", "Top_of_Directory_Hierarchies" },
>  	{ EXT4_EXTENTS_FL, "e", "Extents" },
>  	{ EXT4_HUGE_FILE_FL, "h", "Huge_file" },
> +	{ FS_NOCOW_FL, "C", "No_COW" },
>  	{ 0, NULL, NULL }
>  };
>
> diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
> index f46a1a9..fb3f7cc 100644
> --- a/lib/ext2fs/ext2_fs.h
> +++ b/lib/ext2fs/ext2_fs.h
> @@ -301,6 +301,7 @@ struct ext2_dx_countlimit {
>  #define EXT4_EXTENTS_FL			0x00080000 /* Inode uses extents */
>  #define EXT4_EA_INODE_FL		0x00200000 /* Inode used for large EA */
>  /* EXT4_EOFBLOCKS_FL 0x00400000 was here */
> +#define FS_NOCOW_FL			0x00800000 /* Do not cow file */
>  #define EXT4_SNAPFILE_FL		0x01000000 /* Inode is a snapshot */
>  #define EXT4_SNAPFILE_DELETED_FL	0x04000000 /* Snapshot is being deleted */
>  #define EXT4_SNAPFILE_SHRUNK_FL		0x08000000 /* Snapshot shrink has completed */
> diff --git a/misc/chattr.1.in b/misc/chattr.1.in
> index 92f6d70..5a57d2c 100644
> --- a/misc/chattr.1.in
> +++ b/misc/chattr.1.in
> @@ -64,6 +64,15 @@ this file compresses data before storing them on the disk.
>  Note: please make sure to read the bugs and limitations section at the
>  end of this document.
>  .PP
> +A file with the 'C' attribute set will not be subject to copy-on-write
> +updates. This flag is only supported on file systems which perform
> +copy-on-write. (Note: For btrfs, the 'C' flag should be only
> +set on new or empty files. If it is set on a file which already has
> +data blocks, it is undefined when the blocks assigned to the file will
> +be fully stable. If the 'C' flag is set on a directory, it will have no
> +effect on the directory, but new files created in that directory will
> +the No_COW attribute.)
> +.PP
>  When a directory with the `D' attribute set is modified,
>  the changes are written synchronously on the disk; this is equivalent to
>  the `dirsync' mount option applied to a subset of the files.
> @@ -159,8 +168,7 @@ maintained by Theodore Ts'o ty...@alum.mit.edu.
>  .SH BUGS AND LIMITATIONS
>  The `c', 's', and `u' attributes are not honored
>  by the ext2 and ext3 filesystems as implemented in the current mainline
> -Linux kernels. These attributes may be implemented
> -in future versions of the ext2 and ext3 filesystems.
> +Linux kernels.
>  .PP
>  The `j' option is only useful if the filesystem is mounted as ext3.
>  .PP
> diff --git a/misc/chattr.c b/misc/chattr.c
> index 8a2d61f..141ea6e 100644
> --- a/misc/chattr.c
> +++ b/misc/chattr.c
> @@ -107,6 +107,7 @@ static const struct flags_char flags_array[] = {
>  	{ EXT2_UNRM_FL, 'u' },
>  	{ EXT2_NOTAIL_FL, 't' },
>  	{ EXT2_TOPDIR_FL, 'T' },
> +	{ FS_NOCOW_FL, 'C' },
>  	{ 0, 0 }
>  };
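The man page text in the patch says the 'C' flag should be set on new or empty files. As a rough illustration of what `chattr +C` does underneath, here is a minimal Python sketch that creates an empty file and sets the flag before any data is written. The ioctl request numbers are the values used on 64-bit Linux and are an assumption for other ABIs; `create_nocow_file` and `with_nocow` are hypothetical helper names, not part of e2fsprogs.

```python
import fcntl
import os
import struct

# Assumed values: FS_IOC_GETFLAGS/SETFLAGS as encoded on 64-bit Linux.
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL = 0x00800000  # the 'C' attribute added by the patch


def with_nocow(flags):
    """Return the inode flag word with the No_COW bit set."""
    return flags | FS_NOCOW_FL


def has_nocow(flags):
    """True if the No_COW bit is set in the inode flag word."""
    return bool(flags & FS_NOCOW_FL)


def create_nocow_file(path):
    """Create a new, empty file and set 'C' on it before writing data.

    Raises OSError if the file exists or the filesystem rejects the flag.
    """
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    try:
        buf = bytearray(8)  # room for a long; the kernel fills in an int
        fcntl.ioctl(fd, FS_IOC_GETFLAGS, buf)
        flags = struct.unpack_from("i", buf)[0]
        struct.pack_into("i", buf, 0, with_nocow(flags))
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, bytes(buf))
    finally:
        os.close(fd)
```

Per the same man page text, setting the flag on a directory instead would make new files created there inherit it, which is usually simpler than flagging each file.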
[PATCH] E2fsprogs: add missing usage for No_COW
Add the missing usage for No_COW since we've supported No_COW flag.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 misc/chattr.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/misc/chattr.c b/misc/chattr.c
index 141ea6e..24254cc 100644
--- a/misc/chattr.c
+++ b/misc/chattr.c
@@ -83,7 +83,7 @@ static unsigned long sf;
 static void usage(void)
 {
 	fprintf(stderr,
-		_("Usage: %s [-RVf] [-+=AacDdeijsSu] [-v version] files...\n"),
+		_("Usage: %s [-RVf] [-+=AacDdeijsSuC] [-v version] files...\n"),
 		program_name);
 	exit(1);
 }
--
1.6.5.2
Re: [PATCH] E2fsprogs: add missing usage for No_COW
On Wed, 13 Jun 2012 15:47:13 +0800 Liu Bo liubo2...@cn.fujitsu.com wrote:

> Add the missing usage for No_COW since we've supported No_COW flag.
>
> Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
> ---
>  misc/chattr.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/misc/chattr.c b/misc/chattr.c
> index 141ea6e..24254cc 100644
> --- a/misc/chattr.c
> +++ b/misc/chattr.c
> @@ -83,7 +83,7 @@ static unsigned long sf;
>  static void usage(void)
>  {
>  	fprintf(stderr,
> -		_("Usage: %s [-RVf] [-+=AacDdeijsSu] [-v version] files...\n"),
> +		_("Usage: %s [-RVf] [-+=AacDdeijsSuC] [-v version] files...\n"),
>  		program_name);
>  	exit(1);
>  }

These were sorted alphabetically, so the better way would be to use AaCcDdeijsSu

--
With respect,
Roman

~~~
Stallman had a printer, with code he could not see.
So he began to tinker, and set the software free.
RE: Bug in btrfs-debug-tree for two or more devices.
-----Original Message-----
From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Hugo Mills
Sent: Wednesday, June 13, 2012 1:37 AM
To: Santosh Hosamani
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Bug in btrfs-debug-tree for two or more devices.

On Tue, Jun 12, 2012 at 06:53:00AM +0000, Santosh Hosamani wrote:
> Hi btrfs folks,
>
> I am working on the btrfs filesystem and how it manages free space. I found out that btrfs maintains a chunk tree which records the physical location of the chunks and stripes of the filesystem. Btrfs-debug-tree also gives the information on the chunk tree.
>
> I created btrfs on a single device and on two devices. I have attached the output of running btrfs-debug-tree on both.
>
> For a single device, the sum of all the lengths in the chunks adds up to the total used bytes, which is the expected behavior. But for two devices, the sum of all lengths in the chunks does not add up to the total bytes. Am I missing something?

Without actually seeing the details of your technique and expectations, I shall make a guess that you're not accounting for the double-counting of RAID-1 metadata. In other words, you will find that all of the metadata device extents (or chunks) will appear twice -- once on each device.

Actually, this isn't quite right either -- what you really need to do is look at the RAID-1, RAID-10 and DUP bits in the chunk flags, add up all of those chunks, divide by two, and then add in the remaining (RAID-0 and single) chunks. That total should then add up to the total value of allocated space that you get from the output of btrfs fi df.

chunk tree
leaf 20971520 items 8 free space 3023 generation 4 owner 3
fs uuid 23f86d1e-038a-4f5b-b87c-2ba78018135c
chunk uuid db672366-6801-4f83-99ef-2087a60bb394
	item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 3897 itemsize 98
		dev item devid 1 total_bytes 3221225472 bytes used 673579008
	item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 3799 itemsize 98
		dev item devid 2 total_bytes 3221225472 bytes used 652607488
	item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 0) itemoff 3719 itemsize 80
		chunk length 4194304 owner 2 type 2 num_stripes 1
			stripe 0 devid 1 offset 0
	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 4194304) itemoff 3639 itemsize 80
		chunk length 8388608 owner 2 type 4 num_stripes 1
			stripe 0 devid 1 offset 4194304
	item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 12582912) itemoff 3559 itemsize 80
		chunk length 8388608 owner 2 type 1 num_stripes 1
			stripe 0 devid 1 offset 12582912
	item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520) itemoff 3447 itemsize 112
		chunk length 8388608 owner 2 type 18 num_stripes 2
			stripe 0 devid 2 offset 1048576
			stripe 1 devid 1 offset 20971520
	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 29360128) itemoff 3335 itemsize 112
		chunk length 322109440 owner 2 type 20 num_stripes 2
			stripe 0 devid 2 offset 9437184
			stripe 1 devid 1 offset 29360128
	item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 351469568) itemoff 3223 itemsize 112
		chunk length 644218880 owner 2 type 9 num_stripes 2
			stripe 0 devid 2 offset 331546624
			stripe 1 devid 1 offset 351469568

The chunk tree will tell me where the physical stripes are, right, irrespective of the RAID type? Correct me if I am wrong. If not, how will you know which blocks are occupied and which blocks are free?

Basically, what I want to do is get the used blocks of all the devices, create a bitmap from them, and zero out all the free blocks. I should then not have overwritten any used blocks, and I should be able to mount the filesystem without any error. How do I achieve that?

> Also I notice that for the second device the superblock location 0x10000 is not considered as used.
>
> I would be really grateful if you folks can answer my query.
>
> I have run these tests on SLES11-sp2-x86
> Kernel 3.0.13.0.27-default

This is pretty old, but shouldn't affect the results. It will cause reliability problems if you try running it seriously.

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- There's a Martian war machine outside -- they want to talk ---
to you about a cure for the common cold.
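The accounting discussed in this message can be checked against the two-device dump itself. The following Python sketch is an illustration only, not part of btrfs-progs: it copies the six CHUNK_ITEMs by hand, decodes the type field using the block-group flag bits of the btrfs on-disk format, and derives the physical extents each device holds. The per-profile stripe sizes in `stripe_length` are the usual layout rules and are stated here as an assumption; `device_extents` and `bytes_used` are hypothetical helper names.

```python
# Block-group flag bits from the btrfs on-disk format.
DATA, SYSTEM, METADATA = 1, 2, 4
RAID0, RAID1, DUP, RAID10 = 8, 16, 32, 64

# (logical length, type flags, [(devid, physical offset), ...]),
# copied by hand from the CHUNK_ITEMs in the two-device dump.
CHUNKS = [
    (4194304, 2, [(1, 0)]),
    (8388608, 4, [(1, 4194304)]),
    (8388608, 1, [(1, 12582912)]),
    (8388608, 18, [(2, 1048576), (1, 20971520)]),
    (322109440, 20, [(2, 9437184), (1, 29360128)]),
    (644218880, 9, [(2, 331546624), (1, 351469568)]),
]


def stripe_length(length, flags, num_stripes):
    """Bytes a single stripe occupies on its device (assumed rules)."""
    if flags & RAID0:
        return length // num_stripes          # data split across stripes
    if flags & RAID10:
        return length // (num_stripes // 2)   # striped over mirror pairs
    return length                             # single, DUP, RAID1: full copy


def device_extents(chunks):
    """Map devid -> sorted list of (physical offset, size) in use."""
    extents = {}
    for length, flags, stripes in chunks:
        size = stripe_length(length, flags, len(stripes))
        for devid, offset in stripes:
            extents.setdefault(devid, []).append((offset, size))
    return {devid: sorted(ext) for devid, ext in extents.items()}


def bytes_used(chunks):
    """Map devid -> total bytes covered by chunk stripes."""
    return {devid: sum(size for _, size in ext)
            for devid, ext in device_extents(chunks).items()}
```

With this data, `bytes_used(CHUNKS)` gives 673579008 for devid 1 and 652607488 for devid 2, the same figures as the two DEV_ITEMs report, so the chunk tree does account for every allocated byte once the RAID profile is taken into account. Note that the first extent on devid 2 starts at offset 1048576, so the superblock on that device is not covered by any stripe; a free-space bitmap built this way would presumably still need to preserve the superblock copies separately.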
RE: Bug in btrfs-debug-tree for two or more devices.
-----Original Message-----
From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Randy Barlow
Sent: Tuesday, June 12, 2012 8:28 PM
To: linux-btrfs@vger.kernel.org
Subject: Re: Bug in btrfs-debug-tree for two or more devices.

On Tuesday, June 12, 2012 06:53:00 AM Santosh Hosamani wrote:
> Kernel 3.0.13.0.27-default

This kernel is very old for btrfs. Can you try with at least Linux 3.4?

I have installed the 3.4.2 kernel but I am still facing the same issue. Maybe my understanding of how to calculate the used blocks is wrong. If someone could help me understand, it would be great.

--
R
[PATCH v2] E2fsprogs: add missing usage for No_COW
Add the missing usage for No_COW since we've supported No_COW flag.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
v1->v2: sort options alphabetically, thanks to Roman Mamedov.

 misc/chattr.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/misc/chattr.c b/misc/chattr.c
index 141ea6e..24254cc 100644
--- a/misc/chattr.c
+++ b/misc/chattr.c
@@ -83,7 +83,7 @@ static unsigned long sf;
 static void usage(void)
 {
 	fprintf(stderr,
-		_("Usage: %s [-RVf] [-+=AacDdeijsSu] [-v version] files...\n"),
+		_("Usage: %s [-RVf] [-+=AaCcDdeijsSu] [-v version] files...\n"),
 		program_name);
 	exit(1);
 }
--
1.6.5.2
Re: Moving top level to a subvolume
On Wed, Jun 13, 2012 at 2:21 AM, Arne Jansen sensi...@gmx.net wrote:
> On 13.06.2012 09:04, C Anthony Risinger wrote:
>> a) have a lot of data
>> b) need to do this via script
>> c) ???
>>
>> ... because in a), data will be *copied* the slow way, and in b) you leave a bunch of junk laying around in the old root that will rot unless you `rm -rf` it ...
>
> [...]
>
> What I don't understand is why you think data will be copied.

at one point i tried to create a new subvol and `mv` files there, and it took quite some time to complete (cross-link-device-what-have-you?), but maybe things changed ... will try it out.

> [...]
>
>> so, would it be possible to implement this, or could someone kindly (and briefly!) explain why it cannot be done?
>
> The default subvol ('/') has the special number 5 and is expected to always be around. All other subvols get numbers starting with 256. Creating a new 5 and internally renumbering the old 5 isn't easy, because each tree block has an owner recorded in it. Also, all backreferences have the root number in them. If you have to touch each tree block, you may as well choose the snapshot/rm -rf approach.

ok this makes sense thanks, the last sentence especially ... top-level volume is different. it's identical to other subvols in 99% of ways save one-gotcha-little-1%. couldn't we shield ourselves a little better?

>> 1. people install stuff to the top-level
>> 2. top-level is unmanageable
>> 3. problem
>
> [...]
>
> Can't you instead add code to the installer that warns a user if he wants to install into the default subvol? Just some random ideas...

i would like to see #5 cut off from natural access: accessible by an _explicit_ manual mount only, cannot be made default, and cannot be removed. maybe btrfs manages a proxy/facade subvol, say, #10, settable by `--flag-origin` or `{insert-here}` option -- a symlink to subvol? if, at absolutely any time or for whatever reason, the #10 pointer should not exist, immediately snapshot #5 and update.

#5 -> #10 -> #256+ ?

... this might allow the root to be replaced. default is set to the #10 proxy volume when the FS is initialized.

> [...]
>
> Or you could hack mkfs.btrfs to always create an additional subvol. Even making / read-only except for creating mountpoints could be possible.

^ yeah, this sounds like exactly what i'm thinking, differing mainly on who does the work... i just want a guaranteed way of replacing the logical root, at #10. the physical root at #5 is more-or-less indestructible and off limits, and never available except as a template.

... i am new to postgresql, but their template0/template1 feels related to solving problems like this.

--
C Anthony
Re: Moving top level to a subvolume
On Wed, Jun 13, 2012 at 4:44 PM, C Anthony Risinger anth...@xtfx.me wrote:
> On Wed, Jun 13, 2012 at 2:21 AM, Arne Jansen sensi...@gmx.net wrote:
>> On 13.06.2012 09:04, C Anthony Risinger wrote:
>>> ... because in a), data will be *copied* the slow way
>>
>> What I don't understand is why you think data will be copied.
>
> at one point i tried to create a new subvol and `mv` files there, and it took quite some time to complete (cross-link-device-what-have-you?), but maybe things changed ... will try it out.

IIRC it hasn't. Not in upstream anyway. Some distros (e.g. opensuse) carry their own patch which allows cross-subvolume links (cp --reflink ...).

But it shouldn't matter anyway, since you can SNAPSHOT the old subvol (even root subvol), instead of creating a new subvol. Which means nothing needs to be copied. You'd still have to do rm manually though.

--
Fajar
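The snapshot-then-remove approach recommended in this thread could be scripted along these lines. This is only a sketch under stated assumptions: it builds the command list without executing anything, `plan_root_migration` is a hypothetical helper, the mountpoint is assumed to be the top-level subvolume mounted somewhere like /mnt, and the `@` name follows the Ubuntu layout mentioned by the original poster. Listing each top-level entry explicitly, rather than using a glob, is one way to make the scripted `rm -rf` less frightening, since the plan can be reviewed before it runs.

```python
import shlex


def plan_root_migration(mountpoint, top_level_entries, new_subvol="@"):
    """Return shell command strings (not executed) that snapshot the
    top-level subvolume and then remove the old top-level entries.

    mountpoint        -- where the top-level subvolume is mounted (assumed)
    top_level_entries -- names found directly under the mountpoint
    new_subvol        -- name of the snapshot that becomes the new root
    """
    target = f"{mountpoint}/{new_subvol}"
    commands = [["btrfs", "subvolume", "snapshot", mountpoint, target]]
    for name in sorted(top_level_entries):
        if name == new_subvol:
            continue  # never delete the snapshot that was just created
        # --one-file-system is a GNU rm option; the intent here is to
        # avoid descending into other mounts or nested subvolumes.
        commands.append(
            ["rm", "-rf", "--one-file-system", "--", f"{mountpoint}/{name}"]
        )
    return [shlex.join(cmd) for cmd in commands]
```

For example, `plan_root_migration("/mnt", ["home", "etc", "@"])` yields the snapshot command followed by one `rm` line each for /mnt/etc and /mnt/home. Making the new subvolume the default (`btrfs subvolume set-default`) is left out because it needs the subvolume ID, which is only known after the snapshot exists.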
[PATCH 0/4 v2][RFC] apply rwlock for extent state
This patchset addresses one of the project ideas, RBtree lock contention:

Btrfs uses a number of rbtrees to index in-memory data structures. Some of these are dominated by reads, and the lock contention from searching them is showing up in profiles. We need to look into an RCU and sequence counter combination to allow lockless reads.

The goal is to use RCU, but we take that as a long-term goal; until we find a mature RCU structure for lockless reads, we use rwlock instead.

So what we need to do is to make the code RCU friendly, and the idea mainly comes from Chris Mason:

Quoted:
I think the extent_state code can be much more RCU friendly if we separate the operations on the tree from operations on the individual state. In general, we can gain a lot of performance if we are able to reduce the write locks taken at endio time. Especially for reads, these are critical.

The patchset is also available in:
	git://github.com/liubogithub/btrfs-work.git rwlock-for-extent-state

I've run through xfstests, and no bugs have shown up so far.

I made a simple test to show the difference on my box:

$ cat 6_FIO/fio-4thread-4M-sync-read
[global]
group_reporting
thread
numjobs=4
bs=4M
rw=read
sync=0
ioengine=sync
directory=/mnt/btrfs/

[READ]
filename=foobar
size=4000M

*results:*
                         w/o patch    w patch
READ bandwidth(aggrb)    849MB/s      971MB/s

MORE TESTS ARE WELCOME!

v1->v2: drop changes on invalidatepage() and rebase to the latest btrfs upstream.

Liu Bo (4):
  Btrfs: use radix tree for checksum
  Btrfs: merge adjacent states as much as possible
  Btrfs: use large extent range for read and its endio
  Btrfs: apply rwlock for extent state

 fs/btrfs/extent_io.c |  712 +++---
 fs/btrfs/extent_io.h |    5 +-
 fs/btrfs/inode.c     |    7 +-
 3 files changed, 568 insertions(+), 156 deletions(-)
[PATCH 2/4] Btrfs: merge adjacent states as much as possible
In order to reduce write locks, we do merge_state as much as possible.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/extent_io.c |   47 +++
 1 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2923ede..081fe13 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -276,29 +276,36 @@ static void merge_state(struct extent_io_tree *tree,
 	if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY))
 		return;
 
-	other_node = rb_prev(&state->rb_node);
-	if (other_node) {
+	while (1) {
+		other_node = rb_prev(&state->rb_node);
+		if (!other_node)
+			break;
 		other = rb_entry(other_node, struct extent_state, rb_node);
-		if (other->end == state->start - 1 &&
-		    other->state == state->state) {
-			merge_cb(tree, state, other);
-			state->start = other->start;
-			other->tree = NULL;
-			rb_erase(&other->rb_node, &tree->state);
-			free_extent_state(other);
-		}
+		if (other->end != state->start - 1 ||
+		    other->state != state->state)
+			break;
+
+		merge_cb(tree, state, other);
+		state->start = other->start;
+		other->tree = NULL;
+		rb_erase(&other->rb_node, &tree->state);
+		free_extent_state(other);
 	}
-	other_node = rb_next(&state->rb_node);
-	if (other_node) {
+
+	while (1) {
+		other_node = rb_next(&state->rb_node);
+		if (!other_node)
+			break;
 		other = rb_entry(other_node, struct extent_state, rb_node);
-		if (other->start == state->end + 1 &&
-		    other->state == state->state) {
-			merge_cb(tree, state, other);
-			state->end = other->end;
-			other->tree = NULL;
-			rb_erase(&other->rb_node, &tree->state);
-			free_extent_state(other);
-		}
+		if (other->start != state->end + 1 ||
+		    other->state != state->state)
+			break;
+
+		merge_cb(tree, state, other);
+		state->end = other->end;
+		other->tree = NULL;
+		rb_erase(&other->rb_node, &tree->state);
+		free_extent_state(other);
 	}
 }
--
1.6.5.2
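To make the effect of this patch concrete, here is a small Python model of the looping merge. It is a sketch, not the kernel code: a sorted list of `[start, end, bits]` entries stands in for the rbtree of extent states, list neighbours stand in for rb_prev()/rb_next(), and the early return for EXTENT_IOBITS/EXTENT_BOUNDARY as well as the merge_cb() hook are left out.

```python
def merge_state(states, i):
    """Merge states[i] with every adjacent neighbour that has the same bits.

    states -- sorted, non-overlapping list of [start, end, bits] entries
    i      -- index of the state to merge around
    Returns the index of the merged state after neighbours are removed.
    """
    cur = states[i]

    # Walk backwards, absorbing predecessors while they touch and match.
    while i > 0:
        prev = states[i - 1]
        if prev[1] != cur[0] - 1 or prev[2] != cur[2]:
            break
        cur[0] = prev[0]
        del states[i - 1]
        i -= 1

    # Walk forwards, absorbing successors while they touch and match.
    while i + 1 < len(states):
        nxt = states[i + 1]
        if nxt[0] != cur[1] + 1 or nxt[2] != cur[2]:
            break
        cur[1] = nxt[1]
        del states[i + 1]

    return i
```

Before the patch, each side was checked only once, so a run of several adjacent 4K states with identical bits could survive a merge; the loops collapse the whole run into one state.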
[PATCH 1/4] Btrfs: use radix tree for checksum
We used to issue a checksum to an extent state of 4K range for read endio, but now we want to use larger range for performance optimization, so instead we create a radix tree for checksum, where an item stands for checksum of 4K data.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/extent_io.c |   84 --
 fs/btrfs/extent_io.h |    2 +
 fs/btrfs/inode.c     |    7 +---
 3 files changed, 23 insertions(+), 70 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2c8f7b2..2923ede 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -117,10 +117,12 @@ void extent_io_tree_init(struct extent_io_tree *tree,
 {
 	tree->state = RB_ROOT;
 	INIT_RADIX_TREE(&tree->buffer, GFP_ATOMIC);
+	INIT_RADIX_TREE(&tree->csum, GFP_ATOMIC);
 	tree->ops = NULL;
 	tree->dirty_bytes = 0;
 	spin_lock_init(&tree->lock);
 	spin_lock_init(&tree->buffer_lock);
+	spin_lock_init(&tree->csum_lock);
 	tree->mapping = mapping;
 }
 
@@ -703,15 +705,6 @@ static void cache_state(struct extent_state *state,
 	}
 }
 
-static void uncache_state(struct extent_state **cached_ptr)
-{
-	if (cached_ptr && (*cached_ptr)) {
-		struct extent_state *state = *cached_ptr;
-		*cached_ptr = NULL;
-		free_extent_state(state);
-	}
-}
-
 /*
  * set some bits on a range in the tree. This may require allocations or
  * sleeping, so the gfp mask is used to indicate what is allowed.
@@ -1666,56 +1659,32 @@ out:
  */
 int set_state_private(struct extent_io_tree *tree, u64 start, u64 private)
 {
-	struct rb_node *node;
-	struct extent_state *state;
 	int ret = 0;
 
-	spin_lock(&tree->lock);
-	/*
-	 * this search will find all the extents that end after
-	 * our range starts.
-	 */
-	node = tree_search(tree, start);
-	if (!node) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state = rb_entry(node, struct extent_state, rb_node);
-	if (state->start != start) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state->private = private;
-out:
-	spin_unlock(&tree->lock);
+	spin_lock(&tree->csum_lock);
+	ret = radix_tree_insert(&tree->csum, (unsigned long)start,
+				(void *)((unsigned long)private << 1));
+	BUG_ON(ret);
+	spin_unlock(&tree->csum_lock);
 	return ret;
 }
 
 int get_state_private(struct extent_io_tree *tree, u64 start, u64 *private)
 {
-	struct rb_node *node;
-	struct extent_state *state;
-	int ret = 0;
+	void **slot = NULL;
 
-	spin_lock(&tree->lock);
-	/*
-	 * this search will find all the extents that end after
-	 * our range starts.
-	 */
-	node = tree_search(tree, start);
-	if (!node) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state = rb_entry(node, struct extent_state, rb_node);
-	if (state->start != start) {
-		ret = -ENOENT;
-		goto out;
+	spin_lock(&tree->csum_lock);
+	slot = radix_tree_lookup_slot(&tree->csum, (unsigned long)start);
+	if (!slot) {
+		spin_unlock(&tree->csum_lock);
+		return -ENOENT;
 	}
-	*private = state->private;
-out:
-	spin_unlock(&tree->lock);
-	return ret;
+	*private = (u64)(*slot) >> 1;
+
+	radix_tree_delete(&tree->csum, (unsigned long)start);
+	spin_unlock(&tree->csum_lock);
+
+	return 0;
 }
 
 /*
@@ -2294,7 +2263,6 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 	do {
 		struct page *page = bvec->bv_page;
 		struct extent_state *cached = NULL;
-		struct extent_state *state;
 
 		pr_debug("end_bio_extent_readpage: bi_vcnt=%d, idx=%d, err=%d, "
 			 "mirror=%ld\n", bio->bi_vcnt, bio->bi_idx, err,
@@ -2313,21 +2281,10 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 		if (++bvec <= bvec_end)
 			prefetchw(&bvec->bv_page->flags);
 
-		spin_lock(&tree->lock);
-		state = find_first_extent_bit_state(tree, start, EXTENT_LOCKED);
-		if (state && state->start == start) {
-			/*
-			 * take a reference on the state, unlock will drop
-			 * the ref
-			 */
-			cache_state(state, &cached);
-		}
-		spin_unlock(&tree->lock);
-
 		mirror = (int)(unsigned long)bio->bi_bdev;
 		if (uptodate && tree->ops && tree->ops->readpage_end_io_hook) {
 			ret = tree->ops->readpage_end_io_hook(page, start, end,
-							      state, mirror);
+							      NULL, mirror);
[PATCH 3/4] Btrfs: use large extent range for read and its endio
we use larger extent state range for both readpages and read endio, so that we can lock or unlock less and avoid most of split ops, then we'll reduce write locks taken at endio time. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/extent_io.c | 201 +- 1 files changed, 182 insertions(+), 19 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 081fe13..bb66e3c 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2258,18 +2258,26 @@ static void end_bio_extent_readpage(struct bio *bio, int err) struct bio_vec *bvec_end = bio-bi_io_vec + bio-bi_vcnt - 1; struct bio_vec *bvec = bio-bi_io_vec; struct extent_io_tree *tree; + struct extent_state *cached = NULL; u64 start; u64 end; int whole_page; int mirror; int ret; + u64 up_start, up_end, un_start, un_end; + int up_first, un_first; + int for_uptodate[bio-bi_vcnt]; + int i = 0; + + up_start = un_start = (u64)-1; + up_end = un_end = 0; + up_first = un_first = 1; if (err) uptodate = 0; do { struct page *page = bvec-bv_page; - struct extent_state *cached = NULL; pr_debug(end_bio_extent_readpage: bi_vcnt=%d, idx=%d, err=%d, mirror=%ld\n, bio-bi_vcnt, bio-bi_idx, err, @@ -2280,11 +2288,6 @@ static void end_bio_extent_readpage(struct bio *bio, int err) bvec-bv_offset; end = start + bvec-bv_len - 1; - if (bvec-bv_offset == 0 bvec-bv_len == PAGE_CACHE_SIZE) - whole_page = 1; - else - whole_page = 0; - if (++bvec = bvec_end) prefetchw(bvec-bv_page-flags); @@ -2337,14 +2340,71 @@ static void end_bio_extent_readpage(struct bio *bio, int err) } } + if (uptodate) + for_uptodate[i++] = 1; + else + for_uptodate[i++] = 0; + if (uptodate tree-track_uptodate) { - set_extent_uptodate(tree, start, end, cached, - GFP_ATOMIC); + if (up_first) { + up_start = start; + up_end = end; + up_first = 0; + } else { + if (up_start == end + 1) { + up_start = start; + } else if (up_end == start - 1) { + up_end = end; + } else { + set_extent_uptodate( + tree, up_start, up_end, + cached, GFP_ATOMIC); + up_start 
= start; + up_end = end; + } + } } - unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC); + + if (un_first) { + un_start = start; + un_end = end; + un_first = 0; + } else { + if (un_start == end + 1) { + un_start = start; + } else if (un_end == start - 1) { + un_end = end; + } else { + unlock_extent_cached(tree, un_start, un_end, + &cached, GFP_ATOMIC); + un_start = start; + un_end = end; + } + } + } while (bvec <= bvec_end); + + cached = NULL; + if (up_start < up_end) + set_extent_uptodate(tree, up_start, up_end, &cached, + GFP_ATOMIC); + if (un_start < un_end) + unlock_extent_cached(tree, un_start, un_end, &cached, + GFP_ATOMIC); + + i = 0; + bvec = bio->bi_io_vec; + do { + struct page *page = bvec->bv_page; + + tree = &BTRFS_I(page->mapping->host)->io_tree; + + if (bvec->bv_offset == 0 && bvec->bv_len == PAGE_CACHE_SIZE) + whole_page = 1; + else + whole_page = 0; if (whole_page) { - if (uptodate) { + if (for_uptodate[i++]) { SetPageUptodate(page); } else {
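Editor's sketch of the coalescing idea in the patch above: instead of making one set_extent_uptodate() or unlock_extent_cached() call per bio_vec, neighbouring [start, end] ranges are merged and a single call is made per contiguous run. What follows is a userspace illustration only, not the kernel code. struct range and coalesce_ranges() are hypothetical names, and the "flush" step simply records the run at the point where the kernel would make its call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Closed byte range [start, end]; hypothetical stand-in for the patch's start/end pairs. */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Merge the input ranges into maximal contiguous runs, in arrival order.
 * A range that touches the pending run on either side extends it; a gap
 * flushes the pending run to out[] and starts a new one.
 * Returns the number of runs written; out[] must have room for n entries.
 */
static size_t coalesce_ranges(const struct range *in, size_t n,
			      struct range *out)
{
	struct range cur;
	size_t nout = 0;
	size_t i;

	if (n == 0)
		return 0;

	cur = in[0];
	for (i = 1; i < n; i++) {
		assert(in[i].start <= in[i].end);

		if (in[i].end != UINT64_MAX && in[i].end + 1 == cur.start) {
			/* new range ends right before the pending run */
			cur.start = in[i].start;
		} else if (cur.end != UINT64_MAX && cur.end + 1 == in[i].start) {
			/* new range starts right after the pending run */
			cur.end = in[i].end;
		} else {
			/* gap: this is where one unlock/uptodate call would go */
			out[nout++] = cur;
			cur = in[i];
		}
	}
	out[nout++] = cur;	/* final flush, like the code after the loop */
	return nout;
}
```

With four page-sized ranges where the first two and the last two are adjacent, this yields two runs, so two calls instead of four. The UINT64_MAX guards are an addition of this sketch to avoid wrap-around when computing end + 1.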
[PATCH 4/4] Btrfs: apply rwlock for extent state
We used to protect both the extent state tree and an individual state's state with tree->lock, but this can be an obstacle to lockless reads. So we separate them here: o tree->lock protects the tree o state->lock protects the state. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/extent_io.c | 380 -- fs/btrfs/extent_io.h |3 +- 2 files changed, 336 insertions(+), 47 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index bb66e3c..4c6b743 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -27,7 +27,7 @@ static struct kmem_cache *extent_buffer_cache; static LIST_HEAD(buffers); static LIST_HEAD(states); -#define LEAK_DEBUG 0 +#define LEAK_DEBUG 1 #if LEAK_DEBUG static DEFINE_SPINLOCK(leak_lock); #endif @@ -120,7 +120,7 @@ void extent_io_tree_init(struct extent_io_tree *tree, INIT_RADIX_TREE(&tree->csum, GFP_ATOMIC); tree->ops = NULL; tree->dirty_bytes = 0; - spin_lock_init(&tree->lock); + rwlock_init(&tree->lock); spin_lock_init(&tree->buffer_lock); spin_lock_init(&tree->csum_lock); tree->mapping = mapping; @@ -146,6 +146,7 @@ static struct extent_state *alloc_extent_state(gfp_t mask) #endif atomic_set(&state->refs, 1); init_waitqueue_head(&state->wq); + spin_lock_init(&state->lock); trace_alloc_extent_state(state, mask, _RET_IP_); return state; } @@ -281,6 +282,7 @@ static void merge_state(struct extent_io_tree *tree, if (!other_node) break; other = rb_entry(other_node, struct extent_state, rb_node); + /* FIXME: need other->lock? */ if (other->end != state->start - 1 || other->state != state->state) break; @@ -297,6 +299,7 @@ static void merge_state(struct extent_io_tree *tree, if (!other_node) break; other = rb_entry(other_node, struct extent_state, rb_node); + /* FIXME: need other->lock? 
*/ if (other->start != state->end + 1 || other->state != state->state) break; @@ -364,7 +367,10 @@ static int insert_state(struct extent_io_tree *tree, return -EEXIST; } state->tree = tree; + + spin_lock(&state->lock); merge_state(tree, state); + spin_unlock(&state->lock); return 0; } @@ -410,6 +416,23 @@ static int split_state(struct extent_io_tree *tree, struct extent_state *orig, return 0; } +static struct extent_state * +alloc_extent_state_atomic(struct extent_state *prealloc) +{ + if (!prealloc) + prealloc = alloc_extent_state(GFP_ATOMIC); + + return prealloc; +} + +enum extent_lock_type { + EXTENT_READ = 0, + EXTENT_WRITE = 1, + EXTENT_RLOCKED = 2, + EXTENT_WLOCKED = 3, + EXTENT_LAST = 4, +}; + static struct extent_state *next_state(struct extent_state *state) { struct rb_node *next = rb_next(&state->rb_node); @@ -426,13 +449,17 @@ static struct extent_state *next_state(struct extent_state *state) * If no bits are set on the state struct after clearing things, the * struct is freed and removed from the tree */ -static struct extent_state *clear_state_bit(struct extent_io_tree *tree, - struct extent_state *state, - int *bits, int wake) +static int __clear_state_bit(struct extent_io_tree *tree, + struct extent_state *state, + int *bits, int wake, int check) { - struct extent_state *next; int bits_to_clear = *bits & ~EXTENT_CTLBITS; + if (check) { + if ((state->state & ~bits_to_clear) == 0) + return 1; + } + if ((bits_to_clear & EXTENT_DIRTY) && (state->state & EXTENT_DIRTY)) { u64 range = state->end - state->start + 1; WARN_ON(range > tree->dirty_bytes); @@ -442,7 +469,17 @@ static struct extent_state *clear_state_bit(struct extent_io_tree *tree, state->state &= ~bits_to_clear; if (wake) wake_up(&state->wq); + return 0; +} + +static struct extent_state * +try_free_or_merge_state(struct extent_io_tree *tree, struct extent_state *state) +{ + struct extent_state *next = NULL; + + BUG_ON(!spin_is_locked(&state->lock)); if (state->state == 0) { + spin_unlock(&state->lock); next = next_state(state); if 
(state->tree) { rb_erase(&state->rb_node, &tree->state); @@ -453,18 +490,17 @@ static struct extent_state *clear_state_bit(struct extent_io_tree *tree, } } else { merge_state(tree, state); + spin_unlock(&state->lock); next = next_state(state);
Re: Massive metadata size increase after upgrade from 3.2.18 to 3.4.1
Did you try a balance? (There is also a balance option to pick only the least utilized metadata chunks.) In the long run, once you understand your files and their sizes, tuning with the mount option metadata_ratio might help. I am not sure how the metadata allocation expanded to 84.38GB, though. Was there any major delete operation on the filesystem? Thanks, Anand
On 13/06/12 01:38, Calvin Walton wrote: On Sat, 2012-06-09 at 01:38 +0600, Roman Mamedov wrote: Hello,
Before the upgrade (on 3.2.18):
Metadata, DUP: total=9.38GB, used=5.94GB
After the FS has been mounted once with 3.4.1:
Data: total=3.44TB, used=2.67TB
System, DUP: total=8.00MB, used=412.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=84.38GB, used=5.94GB
Where did my 75 GB of free space just go?
Btrfs tries to keep a certain ratio of allocated data space to allocated metadata space at all times, in order to ensure that there is always some free metadata space available. In 3.3 (I believe, but haven't actually checked...) this ratio was increased, since people were still complaining about btrfs reporting out of space errors too soon. On a filesystem containing (a relatively small number of) large files, it probably over-allocates the metadata space, which is what you're seeing. I'm not sure if the ratio is tunable. But better to have a bit of unused metadata space than to get 'out of space' errors once you've filled your disk and you're trying to delete some files!
Re: [PATCH 1/2] Btrfs: use rcu to protect device-name V2
On Wed, 13 Jun 2012 00:35:26 +0200, David Sterba wrote: On Tue, Jun 12, 2012 at 03:50:41PM -0400, Josef Bacik wrote: +++ b/fs/btrfs/check-integrity.c @@ -93,6 +93,7 @@ #include print-tree.h #include locking.h #include check-integrity.h +#include rcu-string.h #define BTRFSIC_BLOCK_HASHTABLE_SIZE 0x1 #define BTRFSIC_BLOCK_LINK_HASHTABLE_SIZE 0x1 @@ -843,13 +844,14 @@ static int btrfsic_process_superblock_dev_mirror( superblock_tmp-never_written = 0; superblock_tmp-mirror_num = 1 + superblock_mirror_num; if (state-print_mask BTRFSIC_PRINT_MASK_SUPERBLOCK_WRITE) -printk(KERN_INFO New initial S-block (bdev %p, %s) -@%llu (%s/%llu/%d)\n, - superblock_bdev, device-name, - (unsigned long long)dev_bytenr, - dev_state-name, - (unsigned long long)dev_bytenr, - superblock_mirror_num); +printk_in_rcu(KERN_INFO New initial S-block (bdev %p, can you please add the 'btrfs: ' prefixes? Please no additional btrfs prefix in the check-integrity printk lines that are enabled with the print_mask option. If they are enabled, then for btrfs debugging, and then the context is known. And you get thousands of these lines... + %s) @%llu (%s/%llu/%d)\n, + superblock_bdev, + rcu_str_deref(device-name), + (unsigned long long)dev_bytenr, + dev_state-name, + (unsigned long long)dev_bytenr, + superblock_mirror_num); list_add(superblock_tmp-all_blocks_node, state-all_blocks_list); btrfsic_block_hashtable_add(superblock_tmp, -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Computing size of snapshots approximatly
Hi, we are using several LVM volumes with btrfs on a server. We want to keep nightly snapshots for a few days as an alternative to backups. Now I want to get the size of the snapshots in detail. To that end I played with btrfs subvolume find-new $snapshot $gen-id. I know that this is quite complicated and not implemented, so I am trying to go my own way: Assume there are two snapshots of one subvolume, snap1 and snap2. Get the find-new information for each snapshot with $gen-id=1 and save it to separate files. A diff of these files shows the changes between snap1 and snap2, right? OK. There are three operations on a filesystem, I think: 1. copy a file on the filesystem 2. change a file on the filesystem 3. delete a file on the filesystem. Am I right to assume that operations 1 and 2 do not change the size of a snapshot much, while a delete increases the size of a snapshot by the size of the deleted files? If so, it would be enough for me to get the files deleted between two snapshots and their sizes. But is there another way to get this information besides btrfs subvolume find-new? Perhaps it makes sense to use an ioctl for it? What about the upcoming send/receive feature? Are there any hints? Many thanks in advance. Jan
Re: [PATCH 1/2] Btrfs: use rcu to protect device-name V2
On Wed, Jun 13, 2012 at 12:35:26AM +0200, David Sterba wrote: On Tue, Jun 12, 2012 at 03:50:41PM -0400, Josef Bacik wrote: +++ b/fs/btrfs/check-integrity.c @@ -93,6 +93,7 @@ #include print-tree.h #include locking.h #include check-integrity.h +#include rcu-string.h #define BTRFSIC_BLOCK_HASHTABLE_SIZE 0x1 #define BTRFSIC_BLOCK_LINK_HASHTABLE_SIZE 0x1 @@ -843,13 +844,14 @@ static int btrfsic_process_superblock_dev_mirror( superblock_tmp-never_written = 0; superblock_tmp-mirror_num = 1 + superblock_mirror_num; if (state-print_mask BTRFSIC_PRINT_MASK_SUPERBLOCK_WRITE) - printk(KERN_INFO New initial S-block (bdev %p, %s) - @%llu (%s/%llu/%d)\n, - superblock_bdev, device-name, - (unsigned long long)dev_bytenr, - dev_state-name, - (unsigned long long)dev_bytenr, - superblock_mirror_num); + printk_in_rcu(KERN_INFO New initial S-block (bdev %p, can you please add the 'btrfs: ' prefixes? No, I'm not changing the output of print statements in this patch, I'll leave that up to the Strato guys. + %s) @%llu (%s/%llu/%d)\n, +superblock_bdev, +rcu_str_deref(device-name), +(unsigned long long)dev_bytenr, +dev_state-name, +(unsigned long long)dev_bytenr, +superblock_mirror_num); list_add(superblock_tmp-all_blocks_node, state-all_blocks_list); btrfsic_block_hashtable_add(superblock_tmp, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index e39a3b9..7d658f2 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -44,6 +44,7 @@ #include free-space-cache.h #include inode-map.h #include check-integrity.h +#include rcu-string.h static struct extent_io_ops btree_extent_io_ops; static void end_workqueue_fn(struct btrfs_work *work); @@ -2575,8 +2576,9 @@ static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate) struct btrfs_device *device = (struct btrfs_device *) bh-b_private; - printk_ratelimited(KERN_WARNING lost page write due to - I/O error on %s\n, device-name); + printk_in_rcu(KERN_WARNING lost page write due to here + I/O error on %s\n, + 
rcu_str_deref(device-name)); /* note, we dont' set_buffer_write_io_error because we have * our own ways of dealing with the IO errors */ diff --git a/fs/btrfs/rcu-string.h b/fs/btrfs/rcu-string.h new file mode 100644 index 000..2fbb56b --- /dev/null +++ b/fs/btrfs/rcu-string.h @@ -0,0 +1,56 @@ +/* + * Copyright (C) 2012 Red Hat. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + */ + +struct rcu_string { + struct rcu_head rcu; + char str[0]; +}; + +static inline struct rcu_string *rcu_string_strdup(const char *src, gfp_t mask) +{ + size_t len = strlen(src); + struct rcu_string *ret = kzalloc(sizeof(struct rcu_string) + +(len * sizeof(char)), mask); len + 1 ? or is the devname not null-terminated? Oh hey strlen doesn't include the null how about that. I will fix, thanks. + if (!ret) + return ret; + strncpy(ret-str, src, len); + return ret; +} + +static inline void rcu_string_free(struct rcu_string *str) +{ + if (str) + kfree_rcu(str, rcu); +} + +#define printk_in_rcu(fmt, ...) do { \ + rcu_read_lock();\ + printk(fmt, ##__VA_ARGS__); \ drop the ## see http://gcc.gnu.org/onlinedocs/cpp/Variadic-Macros.html #define eprintf(...) fprintf (stderr, __VA_ARGS__) Well everybody else in the kernel does it that way, but if this works I'll change it. + rcu_read_unlock(); \ +} while (0)
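Editor's sketch of the allocation size raised in the review above: strlen() does not count the terminating NUL, so a structure ending in a character array needs len + 1 bytes for the string. This is a userspace illustration under stated assumptions, not the kernel helper. struct name_string and name_string_dup() are hypothetical names, calloc() stands in for kzalloc(), and the rcu_head member is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for a string stored at the tail of a struct. */
struct name_string {
	size_t len;	/* length without the terminating NUL */
	char str[];	/* C99 flexible array member */
};

/*
 * Duplicate src into a freshly allocated name_string.
 * Returns NULL on allocation failure; the caller frees the result with free().
 */
static struct name_string *name_string_dup(const char *src)
{
	struct name_string *ret;
	size_t len;

	assert(src != NULL);

	len = strlen(src);				/* excludes the NUL */
	ret = calloc(1, sizeof(*ret) + len + 1);	/* +1 keeps room for it */
	if (!ret)
		return NULL;

	ret->len = len;
	memcpy(ret->str, src, len);	/* str[len] is already 0 from calloc */
	return ret;
}
```

Allocating only len bytes, as in the version under review, would leave no room for the terminator, so a later %s print could read past the end of the buffer.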
Re: Computing size of snapshots approximatly
On Wed, Jun 13, 2012 at 02:15:33PM +0200, Jan-Hendrik Palic wrote: Hi, we using on a server several lvm volumes with btrfs. We want to use nightly build snapshots for some days as an alternative to backups. Now I want to get the size of the snapshots in detail. There are basically two figures you can get for each snapshot. These values may differ wildly. Which one do you want? (A) The first, larger, value is the total computed size of the files in the subvolume. This is what du returns. (B) The second, smaller, value is the amount of space that would be freed by deleting the subvolume. (Alternatively, this is the amount of data in the subvolume which is not shared with some other subvolume). It is currently a difficult process to work out this value in general, but the qgroups patch set will track this information automatically, and expose an API that will allow you to retrieve it. The qgroups patches aren't complete yet. Therefore I played with btrfs subvolume find-new $snapshot $gen-id. And I know, that this is quite complicated and not implemented. Therefore I try to go my own way: Now assume there are two snapshots of one subvolume, snap1 and snap2. Further get the find-new informations of these snapshots with $gen-id=1 and save them into different files. A diff of these files shows the changes between snap1 and snap2, right? Ok. There are three operations on a filesystem, I think, 1. copy a file on the filesystem 2. change a file on the filesystem 3. delete a file on the filesystem Am I right to assume, that operation 1 and 2 are not change much the size of a snapshot and the delete operation let increase the size of a snapshot in the size of the deleted files? It depends on which measure of the two above you're trying to use, and whether the subvolume (and file) you're modifying still has extents shared with some other subvolume. 1. Copying a file (without --reflink) will increase both the (A) and the (B) size of the snapshot. 
Copying a file with --reflink will increase (A) and leave (B) much the same. 2. Changing a file will, obviously, cause (A) to change by the difference between the old file and the new. If that file shares no extents with anything else, then (B) will also change by that amount. Otherwise, if it shares extents with anything else (another subvolume, or a reflink copy), then (B) will increase by the amount of data modified. 3. Deleting a file will reduce (A) by the size of the file. (B) will reduce by the size of non-shared extents owned by that file. Note that btrfs sub find-new will not allow you to track file deletions. If it is so, it would be enough for me to get the deletions of files between two snapshots and their size. But is there another way to get these informations beside btrfs subvolume find-new? Perhaps it makes sense to use ioctl for it? What about the send/receive feature, which is upcoming? Are there any hints? Wait for qgroups to land, because that actually does it the right way, and will avoid you having to track all kinds of awkward (and hard-to-find) corner cases. Hugo. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Summoning his Cosmic Powers, and glowing slightly --- from his toes...
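Editor's sketch of the two figures (A) and (B) described above, assuming a simplified model in which each extent reachable from a subvolume carries a count of how many subvolumes or reflink copies reference it. This is not how btrfs or qgroups compute the numbers. struct extent_ref, struct snap_size and snapshot_size() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one extent reachable from a subvolume. */
struct extent_ref {
	uint64_t bytes;		/* size of the extent */
	unsigned int refs;	/* number of referencing subvolumes, including this one */
};

struct snap_size {
	uint64_t referenced;	/* (A): everything reachable from the subvolume */
	uint64_t exclusive;	/* (B): space freed if the subvolume were deleted */
};

/* Sum all extents for (A), and only the unshared ones for (B). */
static struct snap_size snapshot_size(const struct extent_ref *ext, size_t n)
{
	struct snap_size sz = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		assert(ext[i].refs >= 1);

		sz.referenced += ext[i].bytes;
		if (ext[i].refs == 1)
			sz.exclusive += ext[i].bytes;
	}
	return sz;
}
```

In this model a freshly taken snapshot has every extent shared, so (B) starts near zero and grows only as the origin subvolume is modified or the shared files are deleted there.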
Re: [PATCH 1/2] Btrfs: use rcu to protect device-name V2
On Wed, 13 Jun 2012 09:14:27 -0400, Josef Bacik wrote: On Wed, Jun 13, 2012 at 12:35:26AM +0200, David Sterba wrote: On Tue, Jun 12, 2012 at 03:50:41PM -0400, Josef Bacik wrote: @@ -4694,8 +4716,11 @@ int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info) key.offset = device-devid; ret = btrfs_search_slot(NULL, dev_root, key, path, 0, 0); if (ret) { - printk(KERN_WARNING btrfs: no dev_stats entry found for device %s (devid %llu) (OK on first mount after mkfs)\n, - device-name, (unsigned long long)device-devid); + printk_in_rcu(KERN_WARNING btrfs: no dev_stats entry + found for device %s (devid %llu) (OK on + first mount after mkfs)\n, breaking printk strings hurts when grepping for a message + rcu_str_deref(device-name), + (unsigned long long)device-devid); __btrfs_reset_dev_stats(device); device-dev_stats_valid = 1; btrfs_release_path(path); @@ -4747,8 +4772,9 @@ static int update_dev_stat_item(struct btrfs_trans_handle *trans, BUG_ON(!path); ret = btrfs_search_slot(trans, dev_root, key, path, -1, 1); if (ret 0) { - printk(KERN_WARNING btrfs: error %d while searching for dev_stats item for device %s!\n, - ret, device-name); + printk_in_rcu(KERN_WARNING btrfs: error %d while searching + for dev_stats item for device %s!\n, ret, and here as well + rcu_str_deref(device-name)); goto out; } @@ -4757,8 +4783,9 @@ static int update_dev_stat_item(struct btrfs_trans_handle *trans, /* need to delete old one and insert a new one */ ret = btrfs_del_item(trans, dev_root, path); if (ret != 0) { - printk(KERN_WARNING btrfs: delete too small dev_stats item for device %s failed %d!\n, - device-name, ret); + printk_in_rcu(KERN_WARNING btrfs: delete too small + dev_stats item for device %s failed %d!\n, here + rcu_str_deref(device-name), ret); goto out; } ret = 1; @@ -4770,8 +4797,9 @@ static int update_dev_stat_item(struct btrfs_trans_handle *trans, ret = btrfs_insert_empty_item(trans, dev_root, path, key, sizeof(*ptr)); if (ret 0) { - printk(KERN_WARNING btrfs: insert dev_stats 
item for device %s failed %d!\n, - device->name, ret); + printk_in_rcu(KERN_WARNING btrfs: insert dev_stats + item for device %s failed %d!\n, here + rcu_str_deref(device->name), ret); goto out; } } mostly minor things, but please fix them. I'm breaking them for the 80 char limit, it happens for all long messages, we're all used to it. I'll fix up the other things. Thanks, Josef The last sentence of chapter 2 of Documentation/CodingStyle is quite unambiguous. Here is the full quote of that chapter: Chapter 2: Breaking long lines and strings Coding style is all about readability and maintainability using commonly available tools. The limit on the length of lines is 80 columns and this is a strongly preferred limit. Statements longer than 80 columns will be broken into sensible chunks, unless exceeding 80 columns significantly increases readability and does not hide information. Descendants are always substantially shorter than the parent and are placed substantially to the right. The same applies to function headers with a long argument list. However, never break user-visible strings such as printk messages, because that breaks the ability to grep for them.
Re: Computing size of snapshots approximatly
Hi Hugo, hi all, On 13.06.2012 15:27, Hugo Mills wrote: On Wed, Jun 13, 2012 at 02:15:33PM +0200, Jan-Hendrik Palic wrote: Hi, we using on a server several lvm volumes with btrfs. We want to use nightly build snapshots for some days as an alternative to backups. Now I want to get the size of the snapshots in detail. There are basically two figures you can get for each snapshot. These values may differ wildly. Which one do you want? (A) The first, larger, value is the total computed size of the files in the subvolume. This is what du returns. (B) The second, smaller, value is the amount of space that would be freed by deleting the subvolume. (Alternatively, this is the amount of data in the subvolume which is not shared with some other subvolume). It is currently a difficult process to work out this value in general, but the qgroups patch set will track this information automatically, and expose an API that will allow you to retrieve it. The qgroups patches aren't complete yet. Sorry, that I forgot to mention that. I want the size which I will get, if I delete a snapshot. The next assumption I forgot, sorry, was, that the snapshot are not changing. The user only get readonly access to the snapshots. [...] There are three operations on a filesystem, I think, 1. copy a file on the filesystem 2. change a file on the filesystem 3. delete a file on the filesystem Am I right to assume, that operation 1 and 2 are not change much the size of a snapshot and the delete operation let increase the size of a snapshot in the size of the deleted files? It depends on which measure of the two above you're trying to use, and whether the subvolume (and file) you're modifying still has extents shared with some other subvolume. Sure, and honestly, this is the point, where the complexity is exploding for me. ,-) 1. Copying a file (without --reflink) will increase both the (A) and the (B) size of the snapshot. Copying a file with --reflink will increase (A) and leave (B) much the same. 
Yep. 2. Changing a file will, obviously, cause (A) to change by the difference between the old file and the new. If that file shares no extents with anything else, then (B) will also change by that amount. Otherwise, if it shares extents with anything else (another subvolume, or a reflink copy), then (B) will increase by the amount of data modified. Yep. 3. Deleting a file will reduce (A) by the size of the file. (B) will reduce by the size of non-shared extents owned by that file. Yep. I think I have got the idea now. Thanks for the explanation. Note that btrfs sub find-new will not allow you to track file deletions. Yep, I got this too. But you can get them indirectly with a diff. Say you have a subvolume with a file_A on it. Taking a snapshot snap_A of this subvolume shows that file in the btrfs sub find-new output. Now delete file_A on this subvolume and take a new snapshot, call it snap_B. The btrfs sub find-new output no longer shows it, right? So a diff of the two outputs, from snap_A to snap_B, gives you the deleted file. It is a crude way, but I think it works. If it is so, it would be enough for me to get the deletions of files between two snapshots and their size. But is there another way to get these informations beside btrfs subvolume find-new? Perhaps it makes sense to use ioctl for it? What about the send/receive feature, which is upcoming? Are there any hints? Wait for qgroups to land, because that actually does it the right way, and will avoid you having to track all kinds of awkward (and hard-to-find) corner cases. Thanks for the hint, I will have a look at that. Best regards, Jan
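Editor's sketch of the diff idea described above: given the file names listed for an older snapshot and for a newer one, the names present only in the older list are the candidates for deleted files. This assumes the two lists have already been extracted, sorted and de-duplicated; whether btrfs sub find-new output can actually be reduced to such lists is the poster's assumption and is not verified here. names_removed() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Write to gone[] every name that appears in before[] but not in after[].
 * Both inputs must be sorted in strcmp() order and contain no duplicates.
 * Returns the number of names written; gone[] must hold n_before entries.
 */
static size_t names_removed(const char *const *before, size_t n_before,
			    const char *const *after, size_t n_after,
			    const char **gone)
{
	size_t i = 0, j = 0, n = 0;

	assert(gone != NULL || n_before == 0);

	while (i < n_before) {
		/* once after[] is exhausted, everything left in before[] is gone */
		int c = (j < n_after) ? strcmp(before[i], after[j]) : -1;

		if (c < 0) {
			gone[n++] = before[i++];	/* only in the older list */
		} else if (c > 0) {
			j++;				/* only in the newer list */
		} else {
			i++;				/* present in both */
			j++;
		}
	}
	return n;
}
```

This only reports names, not sizes, and a renamed file would show up as a deletion, so it is at best a rough approximation of what qgroups is meant to track properly.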
Re: ceph-on-btrfs inline-cow regression fix for 3.4.3
On Tue, Jun 12, 2012 at 09:46:26PM -0600, Alexandre Oliva wrote: Hi, Greg, There's a btrfs regression in 3.4 that's causing a lot of grief to ceph-on-btrfs users like myself. This small and nice patch cures it. It's in Linus' master already. I've been running it on top of 3.4.2, and it would be very convenient for me if this could be in 3.4.3. Ack, this can definitely go to 3.4-stable. Thanks Alexandre. -chris
Re: [PATCH 1/4] Btrfs: use radix tree for checksum
int set_state_private(struct extent_io_tree *tree, u64 start, u64 private) { [...] + ret = radix_tree_insert(&tree->csum, (unsigned long)start, + (void *)((unsigned long)private << 1)); Will this fail for 64bit files on 32bit hosts? + BUG_ON(ret); I wonder if we can patch BUG_ON() to break the build if its only argument is ret. - z
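Editor's sketch of the concern behind the question above: on a host where unsigned long is 32 bits wide, casting a u64 file offset to unsigned long keeps only the low 32 bits, so two different offsets can map to the same radix tree index. The snippet models the 32-bit case with uint32_t so it behaves the same on any machine. ulong32, key_on_32bit() and keys_collide_on_32bit() are hypothetical names, and nothing here is taken from the patch beyond the cast itself.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for `unsigned long` on a 32-bit host. */
typedef uint32_t ulong32;

/* Model of `(unsigned long)start` on a 32-bit host: the high 32 bits are dropped. */
static ulong32 key_on_32bit(uint64_t start)
{
	return (ulong32)start;
}

/* Nonzero if two distinct 64-bit offsets would use the same 32-bit index. */
static int keys_collide_on_32bit(uint64_t a, uint64_t b)
{
	return a != b && key_on_32bit(a) == key_on_32bit(b);
}
```

For example, offsets 0x1000 and 0x100001000 (4 GiB apart) give the same index in this model, which is the kind of collision a file larger than 4 GiB could trigger.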
Re: Moving top level to a subvolume
On 06/13/2012 09:21 AM, Arne Jansen wrote: On 13.06.2012 09:04, C Anthony Risinger wrote: On Fri, Jun 8, 2012 at 2:40 PM, Arne Jansen sensi...@gmx.net wrote: On 06/08/2012 09:24 PM, Matthew Hawn wrote: I just converted my root filesystem to btrfs with btrfs-convert. However, since I am running Ubuntu, I would like to have the same subvolume structure as a default install,. How do I move the top-level subvolume (where all my files currently are) to another subvolume? Just snapshot the root subvol and continue working in the snapshot. ... yeah but that solution totally sucks when you: a) have a lot of data b) need to do this via script c) ??? ... because in a), data will *copied* the slow way, and in b) you leave a bunch of junk laying around in the old root that will rot unless you `rm -rf` it ... and idk about you, but issuing what is very near to that command on someone else's machine -- via script -- makes me REALLY uneasy ;-) well, don't put data in the top level in the first place. Yes, you have to remove the content of the subvol / by rm -rf, but I don't really see the problem with it. It is slow. You have to change a lot of metadata (each shared metadata block have to be unshared, and then one copy will be deleted ). What I don't understand is why you think data will be copied. i have asked this exact question at least 4 times specifically, and referenced it probably 8-10, in the last 3 years or more. i needed it then. i still need it now. but since i never got an answer up/down or around, i gave up and told people to `rm -rf`themselves ... http://markmail.org/message/7hj5ioqrztkeerqv ... that's from May of 2010, but i don't think it's the first. so, would it possible to implement this, or could someone kindly (and briefly!) explain why it cannot be done? The default subvol ('/') has the special number 5 and is expected to always be around. All other subvols get numbers starting with 256. 
Creating a new 5 and internally renumbering the old 5 isn't easy, because each tree block has an owner recorded in it. Also, all backreferences have the root number in them. If you have to touch each tree block, you can as well choose the snapshot/rm -rf approach. I don't know the internals of btrfs very well. Do you think that it is possible to move/swap the root subvolume? [...] Or you could hack mkfs.btrfs to always create an additional subvol. That one could be the default, so nobody should complain. Even making / readonly except for creating mountpoints could be possible. Just some random ideas... -Arne
Re: Leaving Oracle
On Sun, Jun 10, 2012 at 12:01:28PM -0600, David Pottage wrote: On 07/06/12 02:04, Chris Mason wrote: Hello everyone, Oracle has been a fantastic place to work, and I really appreciate their support for my projects. But, I've decided to take a new position at Fusion-io. I will start the new job on Monday, June 11. Congratulations. Fusion-io really believes in open source, and I'm excited to help them shape the future of high performance storage. Are you sure about that? I installed one of their IO Drive SSD cards in one of my employer's servers, and while the driver source code was supplied, the licence was definitely not open source. (See http://www.fusionio.com/legal/eula/) 4.1 General Restrictions. [...] you will not, and will not permit or authorize third parties to: (a) reproduce, modify, translate, enhance, decompile, disassemble, reverse engineer, or create derivative works of the Software; Hi everyone, Circling back around to this, now that I'm up and running again. Most of your storage is hidden behind some kind of closed source firmware. With Fusion-io, you get a closed driver, and that has its own long standing debates that won't get resolved here. Fusion-io has a strong track record of contributing to Linux, and I'm sure we'll keep hiring more developers that are well known in the community. Of course, Btrfs is a GPL project, and all the future work in Btrfs is going to stay GPL. The great thing about Fusion-io is they are very actively trying to engage higher parts of the storage stack to take advantage of the hardware. Since these features need to be in upstream filesystems, we'll have to hammer out nice generic apis to take advantage of them. (This is my favorite kind of we that really means Jens Axboe) Anyone who wants to support a backend for the apis is welcome to do so, and I'm sure they will change over time as we all figure out what works best. Long story short, yes, I am sure that Fusion-io cares about open source. 
Oracle too, since a few people misread that line as a dig at Oracle. -chris
Re: KVM on top of BTRFS
Hi, you can't beat the benchmarks that Serge Hallyn did, really thorough! http://s3hh.wordpress.com/2012/05/02/first-round-of-kvm-performance-tests/ Regards //Ernst 2012/6/12 steamraven steamra...@yahoo.com: Seems a little unfair on btrfs to just look at absolutes in this context. Prior reports said that the fs ground to a halt, it isn't doing that by any stretch. I agree. What I am mostly looking for is the best setup for using KVM snapshots: KVM qcow2 on top of something like ext4, or raw on top of btrfs. I haven't let any of these installs complete and used it as intended. So that's what I intend to do next; after all one doesn't install every day. I am going to try to benchmark a couple of variations and flags: qcow2 on ext4 (noatime); raw on btrfs (defaults); raw on btrfs (noatime,space_cache); raw on btrfs (noatime,nospace_cache); raw on btrfs (noatime,nodatacow). Any other options that might be good to try?
Re: KVM on top of BTRFS
Ernst Sjöstrand ernstp at gmail.com writes:
> Hi,
>
> you can't beat the benchmarks that Serge Hallyn did, really thorough!
> http://s3hh.wordpress.com/2012/05/02/first-round-of-kvm-performance-tests/

They do seem very thorough. Unfortunately, they are kvm on top of ext4 and he was mainly checking caching parameters and storage formats. I am looking at BTRFS options and comparing them against qcow2 on ext4.

Matt
Re: [PATCH 1/4] Btrfs: use radix tree for checksum
On 06/14/2012 12:07 AM, Zach Brown wrote:
> > int set_state_private(struct extent_io_tree *tree, u64 start, u64 private)
> > {
> [...]
> > +	ret = radix_tree_insert(&tree->csum, (unsigned long)start,
> > +				(void *)((unsigned long)private << 1));
>
> Will this fail for 64bit files on 32bit hosts?

In theory it will fail, but crc32c returns a u32, so private will originally be a u32, and it'd be ok on 32bit hosts.

> > +	BUG_ON(ret);
>
> I wonder if we can patch BUG_ON() to break the build if its only argument is ret.

why?

thanks,
liubo

> - z
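For readers following along, one way to read the question above is that it concerns the `(unsigned long)start` cast used as the tree index. The sketch below is a userspace illustration only, under the assumption that `unsigned long` is 32 bits wide on the host in question; `ulong32_t`, `key_on_32bit` and `keys_collide` are made-up names for this example and are not part of the patch or of the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration: stand-in for `unsigned long` on a 32-bit host. */
typedef uint32_t ulong32_t;

/* Mimics a cast like `(unsigned long)start` when unsigned long is 32 bits wide. */
static ulong32_t key_on_32bit(uint64_t start)
{
	return (ulong32_t)start;	/* the upper 32 bits are discarded */
}

/* Returns 1 if two distinct 64-bit offsets end up with the same 32-bit index. */
static int keys_collide(uint64_t a, uint64_t b)
{
	assert(a != b);			/* only meaningful for distinct offsets */
	return key_on_32bit(a) == key_on_32bit(b);
}
```

With these helpers, offsets 0x1000 and 0x100001000 (4 GiB apart) map to the same index, while 0x1000 and 0x2000 do not. Whether that matters for the patch depends on how the tree is used, which this sketch does not model.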
Re: [PATCH] E2fsprogs: add missing usage for No_COW
On Wed, Jun 13, 2012 at 03:47:13PM +0800, Liu Bo wrote:
> Add the missing usage for No_COW since we've supported No_COW flag.
>
> Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com

Applied, thanks.

- Ted
Re: [PATCH 2/2] Btrfs: implement -show_devname V2
On Tue, 12 Jun 2012 15:50:42 -0400, Josef Bacik wrote:
> Because btrfs can remove the device that was mounted we need to have a ->show_devname so that in this case we can print out some other device in the file system to /proc/mounts. So if there are multiple devices in a btrfs file system we will just print the device with the lowest devid that we can find. This will make everything consistent and deal with device removal properly. The drawback is if you mount with a device that is higher than the lowest devid it won't show up as the mounted device in /proc/mounts, but this is a small price to pay. This was inspired by Miao Xie's patch. Thanks,
>
> Signed-off-by: Josef Bacik jo...@redhat.com

Reviewed-by: Miao Xie mi...@cn.fujitsu.com

> ---
> V1->V2: Dropped the mounted tracking stuff since it doesn't work right if you mount the same thing twice
>
>  fs/btrfs/super.c |   33 +++++++++++++++++++++++++++++++++
>  1 files changed, 33 insertions(+), 0 deletions(-)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 85cef50..0874dba 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -54,6 +54,7 @@
>  #include "version.h"
>  #include "export.h"
>  #include "compression.h"
> +#include "rcu-string.h"
>
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/btrfs.h>
> @@ -1472,12 +1473,44 @@ static int btrfs_unfreeze(struct super_block *sb)
>  	return 0;
>  }
>
> +static int btrfs_show_devname(struct seq_file *m, struct dentry *root)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(root->d_sb);
> +	struct btrfs_fs_devices *cur_devices;
> +	struct btrfs_device *dev, *first_dev = NULL;
> +	struct list_head *head;
> +	struct rcu_string *name;
> +
> +	mutex_lock(&fs_info->fs_devices->device_list_mutex);
> +	cur_devices = fs_info->fs_devices;
> +	while (cur_devices) {
> +		head = &cur_devices->devices;
> +		list_for_each_entry(dev, head, dev_list) {
> +			if (!first_dev || dev->devid < first_dev->devid)
> +				first_dev = dev;
> +		}
> +		cur_devices = cur_devices->seed;
> +	}
> +
> +	if (first_dev) {
> +		rcu_read_lock();
> +		name = rcu_dereference(first_dev->name);
> +		seq_escape(m, name->str, " \t\n\\");
> +		rcu_read_unlock();
> +	} else {
> +		WARN_ON(1);
> +	}
> +	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> +	return 0;
> +}
> +
>  static const struct super_operations btrfs_super_ops = {
>  	.drop_inode	= btrfs_drop_inode,
>  	.evict_inode	= btrfs_evict_inode,
>  	.put_super	= btrfs_put_super,
>  	.sync_fs	= btrfs_sync_fs,
>  	.show_options	= btrfs_show_options,
> +	.show_devname	= btrfs_show_devname,
>  	.write_inode	= btrfs_write_inode,
>  	.alloc_inode	= btrfs_alloc_inode,
>  	.destroy_inode	= btrfs_destroy_inode,
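The selection rule described in the commit message (report the device with the lowest devid, also looking through any chained seed device sets) can be sketched in plain userspace C. This is only an illustration of the rule: `demo_device`, `demo_fs_devices` and `lowest_devid_device` are hypothetical stand-ins rather than kernel types, and the mutex and RCU handling from the patch is deliberately not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a device entry; not the kernel's struct btrfs_device. */
struct demo_device {
	uint64_t devid;
	const char *name;
	struct demo_device *next;	/* next device in the same set */
};

/* Hypothetical stand-in for a device set that may chain to a seed set. */
struct demo_fs_devices {
	struct demo_device *devices;
	struct demo_fs_devices *seed;
};

/*
 * Walk every set in the seed chain and return the device with the
 * lowest devid, or NULL if no device is found at all.
 */
static const struct demo_device *
lowest_devid_device(const struct demo_fs_devices *cur)
{
	const struct demo_device *first = NULL;

	for (; cur; cur = cur->seed) {
		const struct demo_device *dev;

		for (dev = cur->devices; dev; dev = dev->next) {
			if (!first || dev->devid < first->devid)
				first = dev;
		}
	}
	return first;
}
```

The NULL return corresponds to the case the patch flags with WARN_ON(1): a mounted filesystem with no device to report.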
Re: btrfs: filenames collide with snapshot/subvolume names
Γιώργος (Giorgos?) reports:
> Namely, being inside a snapshot directory, I can't create a file/directory with the name of the snapshot directory. For example, inside /mnt/aSnap, I can't create a file named 'aSnap', so I'm filing this bug report.

It seems that the snapshot directory is partially created before the snapshot is taken, so that the snapshot directory half-exists (can be looked up, but doesn't appear in listings) inside the snapshot itself.

This doesn't seem to be the recommended way to organise subvolumes, but it seems like it should at least result in a coherent filesystem within each subvolume.

Ben.

> Below follows full reproduction of this behavior:
>
> aris tmp # dd if=/dev/zero of=FILE bs=4k seek=`echo 5*1024*1024 | bc` count=1
> 1+0 records in
> 1+0 records out
> 4096 bytes (4.1 kB) copied, 1.8695e-05 s, 219 MB/s
> aris tmp # losetup /dev/loop0 FILE
> aris tmp # losetup -a
> /dev/loop0: [fe01]:263872 (/tmp/FILE)
> aris tmp # mkfs.btrfs -Ltest /dev/loop0
>
> WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
> WARNING! - see http://btrfs.wiki.kernel.org before using
>
> fs created label test on /dev/loop0
>         nodesize 4096 leafsize 4096 sectorsize 4096 size 20.00GB
> Btrfs Btrfs v0.19
> aris tmp # mount /dev/loop0 /mnt/
> aris tmp # cd /mnt
> aris mnt # ls -la
> total 8
> dr-xr-xr-x  1 root root    0 Mar  8 12:07 .
> drwxr-xr-x 24 root root 4096 Mar  8 11:41 ..
> aris mnt # mkdir dir1
> aris mnt # mkdir dir2
> aris mnt # mkdir dir3
> aris mnt # l
> total 0
> drwxr-xr-x 1 root root 0 Mar  8 12:08 dir1
> drwxr-xr-x 1 root root 0 Mar  8 12:08 dir2
> drwxr-xr-x 1 root root 0 Mar  8 12:08 dir3
> aris mnt # btrfs subvolume snapshot /mnt/ /mnt/aSnap
> Create a snapshot of '/mnt/' in '/mnt/aSnap'
> aris mnt # cd /mnt/aSnap/
> aris aSnap # ls -la
> total 8
> dr-xr-xr-x 1 root root 34 Mar  8 12:08 .
> dr-xr-xr-x 1 root root 34 Mar  8 12:08 ..
> drwxr-xr-x 1 root root  0 Mar  8 12:08 dir1
> drwxr-xr-x 1 root root  0 Mar  8 12:08 dir2
> drwxr-xr-x 1 root root  0 Mar  8 12:08 dir3
> aris aSnap # date > aSnap
> bash: aSnap: Is a directory

-- 
Ben Hutchings
Computers are not intelligent. They only think they are.