Re: BTRFS as image store for KVM?
On Mon, 05 Oct 2015 11:43:18 +0300 Erkki Seppala wrote:
> Lionel Bouton writes:
>
> > 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
> > you need 2 processes to read from 2 devices at once) and I've never seen
> > anyone arguing that the current md code is unstable.
>
> This indeed seems to be the case on my MD RAID1 HDD.

The key difference is that MD is smart enough to distribute reads to the least loaded devices in a RAID (i.e. two processes reading from a two-HDD RAID1 will ~always use different HDDs, no matter what their PIDs are).

Another example of the differences: in my limited testing I saw Btrfs submit writes sequentially on RAID1. According to 'iostat', there were long periods of only one drive writing, and then only the other one. As a result, the RAID's write performance ended up being lower than that of a single drive.

--
With respect,
Roman
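[Editorial note: the md-style "least loaded device" read policy described above can be sketched in a few lines. `pending` is a hypothetical per-mirror count of outstanding requests; the real md read_balance code also considers head position and sequential access, so this is a toy model only.]

```python
def pick_least_loaded(pending):
    """Send the read to the mirror with the fewest outstanding requests.

    A toy model of md RAID1-style read balancing; `pending` is a
    hypothetical list of in-flight request counts, one per mirror.
    """
    return min(range(len(pending)), key=lambda i: pending[i])

# Two processes reading at once end up on different HDDs,
# regardless of their PIDs:
assert pick_least_loaded([0, 0]) == 0   # first reader takes mirror 0
assert pick_least_loaded([1, 0]) == 1   # second reader takes the idle mirror
```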
Re: [PATCH 16/23] Btrfs: device path change must be logged
Hi David,

Kindly note: this is accepted and sent out as part of the patches which are depending on each other.

Thanks, Anand

On 10/01/2015 09:47 PM, Anand Jain wrote:
> On 10/01/2015 09:40 PM, David Sterba wrote:
>> On Fri, Aug 14, 2015 at 06:33:01PM +0800, Anand Jain wrote:
>>> From the issue diagnosability point of view, log if the device path
>>> is changed.
>>>
>>> Signed-off-by: Anand Jain
>>> ---
>>>  fs/btrfs/volumes.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index ebf37a9..dcb10fa 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -595,6 +595,10 @@ static noinline int device_list_add(const char *path,
>>>  			return -EEXIST;
>>>  		}
>>>
>>> +		printk_in_rcu(KERN_INFO \
>>> +			"BTRFS: device fsid %pU devid %llu old path %s new path %s\n",
>>> +			disk_super->fsid, devid, rcu_str_deref(device->name), path);
>>
>> I don't think that the message should be put into device_list_add. Its
>> only callsite in btrfs_scan_one_device prints some device info messages,
>> so it would be better to extend that and use the return value if possible.
>
> Right, will fix it.
>
> Thanks, Anand

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: optimize check for stale device
Optimize the check for stale devices so that it runs only when a device is added or changed. If there is no update to the device, there is no need to call btrfs_free_stale_device().

Signed-off-by: Anand Jain
---
 fs/btrfs/volumes.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b8b1171..ec1fcfa 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -620,7 +620,8 @@ static noinline int device_list_add(const char *path,
 	 * if there is new btrfs on an already registered device,
 	 * then remove the stale device entry.
 	 */
-	btrfs_free_stale_device(device);
+	if (ret > 0)
+		btrfs_free_stale_device(device);
 
 	*fs_devices_ret = fs_devices;
-- 
2.4.1
Re: BTRFS as image store for KVM?
Rich Freeman posted on Sun, 04 Oct 2015 08:21:53 -0400 as excerpted:

> On Sun, Oct 4, 2015 at 8:03 AM, Lionel Bouton wrote:
>>
>> This focus on single reader RAID1 performance surprises me.
>>
>> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
>> you need 2 processes to read from 2 devices at once) and I've never
>> seen anyone arguing that the current md code is unstable.

I'm not a coder and could be wrong, but AFAIK, md/raid1 either works per thread (and thus should multiplex I/O across raid1 devices in the single-process-multi-thread case), or handles multiple AIO requests in parallel, if not both. (If I'm laboring under a severe misconception, and I could be, please do correct me -- I'd rather be publicly corrected, and have just that, my world-view corrected to align with reality, than be wrong, publicly or privately, and never know it, thus never correcting it! =:^)

IOW, the primary case where I believe md/raid1 does single-device serial access is where the single process is doing just that: serialized-single-request-sleep-until-the-data's-ready. Otherwise read requests are spread among the available spindles. =:^) But...

> Perhaps, but with btrfs it wouldn't be hard to get 1000 processes
> reading from a raid1 in btrfs and have every single request directed to
> the same disk with the other disk remaining completely idle. I believe
> the algorithm is just whether the pid is even or odd, and doesn't take
> into account disk activity at all, let alone disk performance or
> anything more sophisticated than that.
>
> I'm sure md does a better job than that.

Exactly. Even/odd PID scheduling is great for testing, since it's simple enough to load either side exclusively or both sides exactly evenly, but it's absolutely horrible for multi-task use, since a worst-case single-device bottleneck is all too easy to achieve by accident, and even a pure-random distribution is going to favor one side or the other to some extent most of the time.
Even worse, due to the most-remaining-free-space chunk allocation algorithm and pair-mirroring only, no matter the number of devices: try to use 3+ devices of differing sizes, and until the space available on the largest pair drops to that of the others, that largest pair will get all the allocations.

Consider a bunch of quarter-TiB devices in raid1, with a pair of 2 TiB devices as well. The quarter-TiB devices will remain idle until the pair of 2 TiB devices reaches 1.75 TiB full, thus equalizing the space available on each compared to the other devices on the filesystem. Of course, that means reads, too, are going to be tied to only those two devices for anything in that first 1.75 TiB of data, and if all those reads are from even, or all from odd, PIDs, it's only going to be ONE of... perhaps 10 devices! Possibly hundreds of read threads bottlenecking on a single device of ten, while the other 9/10 of the filesystem-array remains entirely idle! =:^(

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
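[Editorial note: the even/odd PID read policy discussed in this thread is easy to model. This sketch assumes the mirror is chosen simply as `pid % 2`, per the discussion above, and shows how a batch of same-parity worker processes all land on one device while the other sits idle.]

```python
def pick_mirror(pid):
    # btrfs raid1 read policy at the time: mirror index = pid % 2
    return pid % 2

# 1000 worker processes that happen to have even PIDs
# (e.g. forked in pairs) all read from mirror 0:
workers = range(2000, 4000, 2)
loads = [0, 0]
for pid in workers:
    loads[pick_mirror(pid)] += 1

assert loads == [1000, 0]   # one device saturated, the other idle
```

With a purely random PID mix the split is merely uneven; with a correlated PID mix it degenerates to a single-device bottleneck, which is the worst case Duncan describes.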
Re: BTRFS as image store for KVM?
Lionel Bouton writes:

> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
> you need 2 processes to read from 2 devices at once) and I've never seen
> anyone arguing that the current md code is unstable.

This indeed seems to be the case on my MD RAID1 HDD. But on an MD SSD RAID10 it does use all four devices (observed by running dd on the md raid device and inspecting iostat at the same time). So MD RAID1 not doing this on HDDs seems to be limited to devices that don't perform well in random access scenarios, as you mentioned.

In practical terms I seem to be getting about 1.1 GB/s from the 4 SSDs with 'dd', whereas I get ~650 MB/s when I dd from the two fastest components of the MD device at the same time. As I get 330 MB/s from two of the SSDs and 150 MB/s from the other two, it seems the concurrent RAID10 IO is scaling linearly. (In fact, maybe I should look into why two of the devices are getting lower speeds overall - they used to be fast.)

I didn't calculate how large the linearly transferred chunks would need to be to overcome the seek latency. Probably quite large.

-- 
_ / __// /__ __ http://www.modeemi.fi/~flux/\ \ / /_ / // // /\ \/ /\ / /_/ /_/ \___/ /_/\_\@modeemi.fi \/
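[Editorial note: Erkki's back-of-the-envelope question — how large the linearly transferred chunks must be before dispatching part of a stream to the other HDD mirror beats the seek it costs there — can be answered roughly: a chunk's transfer time must at least exceed the seek latency. The numbers below (150 MB/s sequential, ~8 ms average seek) are illustrative assumptions, not measurements from this thread.]

```python
def break_even_chunk_bytes(seq_bytes_per_s, seek_s):
    # A chunk is worth dispatching to the other mirror only once its
    # transfer time alone exceeds the seek it triggers on that mirror.
    # transfer_time = chunk / seq_bytes_per_s  >  seek_s
    return seq_bytes_per_s * seek_s

chunk = break_even_chunk_bytes(150e6, 8e-3)
# ~1.2 MB just to break even; several MB per chunk to win clearly,
# which supports the "probably quite large" guess above.
assert abs(chunk - 1.2e6) < 1.0
```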
[PATCH V2 3/3] Btrfs: device path change must be logged
From a diagnosability point of view, log if the device path is changed.

Signed-off-by: Anand Jain
---
V2: Accepts David's review comment and adds a new return value 2 for device_list_add(); based on that, the caller function indicates whether the path was overwritten (thanks David).

 fs/btrfs/volumes.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e76cad3..b8b1171 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -504,6 +504,7 @@ void btrfs_free_stale_device(struct btrfs_device *cur_dev)
  *
  * Returns:
  *	1 - first time device is seen
+ *	2 - device already known but now is overwritten with new path
  *	0 - device already known
  *	< 0 - error
  */
@@ -603,6 +604,7 @@ static noinline int device_list_add(const char *path,
 			fs_devices->missing_devices--;
 			device->missing = 0;
 		}
+		ret = 2;
 	}
 
 	/*
@@ -1035,8 +1035,9 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
 	ret = device_list_add(path, disk_super, devid, fs_devices_ret);
 	if (ret > 0) {
-		printk(KERN_INFO "BTRFS: device fsid %pU devid %llu transid %llu %s\n",
-			disk_super->fsid, devid, transid, path);
+		printk(KERN_INFO "BTRFS: device fsid %pU devid %llu transid %llu %s %s\n",
+			disk_super->fsid, devid, transid, path,
+			ret == 2 ? "(overwritten)" : "");
 		ret = 0;
 	}
 	if (!ret && fs_devices_ret)
-- 
2.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] Btrfs: create a helper function to read the disk super
A part of the code from btrfs_scan_one_device() is moved to a new function, btrfs_read_disk_super(), so that the former function looks cleaner; the code that makes sure the label is null-terminated is moved there as well. Further, there is an opportunity to merge various duplicated read-disk-super code. An earlier attempt at this highlighted that there were some issues, which is why there are multiple versions; however, it was not clear what the issue was. So until that is worked out, we can keep this in a separate function.

Signed-off-by: Anand Jain
---
 fs/btrfs/volumes.c | 87 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 52 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1b2e40f..8f1d175 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -941,6 +941,56 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 	return ret;
 }
 
+void btrfs_release_disk_super(struct page *page)
+{
+	kunmap(page);
+	page_cache_release(page);
+}
+
+int btrfs_read_disk_super(struct block_device *bdev, u64 bytenr,
+		struct page **page, struct btrfs_super_block **disk_super)
+{
+	void *p;
+	pgoff_t index;
+
+	/* make sure our super fits in the device */
+	if (bytenr + PAGE_CACHE_SIZE >= i_size_read(bdev->bd_inode))
+		return 1;
+
+	/* make sure our super fits in the page */
+	if (sizeof(**disk_super) > PAGE_CACHE_SIZE)
+		return 1;
+
+	/* make sure our super doesn't straddle pages on disk */
+	index = bytenr >> PAGE_CACHE_SHIFT;
+	if ((bytenr + sizeof(**disk_super) - 1) >> PAGE_CACHE_SHIFT != index)
+		return 1;
+
+	/* pull in the page with our super */
+	*page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
+				    index, GFP_NOFS);
+
+	if (IS_ERR_OR_NULL(*page))
+		return 1;
+
+	p = kmap(*page);
+
+	/* align our pointer to the offset of the super block */
+	*disk_super = p + (bytenr & ~PAGE_CACHE_MASK);
+
+	if (btrfs_super_bytenr(*disk_super) != bytenr ||
+	    btrfs_super_magic(*disk_super) != BTRFS_MAGIC) {
+		btrfs_release_disk_super(*page);
+		return 1;
+	}
+
+	if ((*disk_super)->label[0] &&
+	    (*disk_super)->label[BTRFS_LABEL_SIZE - 1])
+		(*disk_super)->label[BTRFS_LABEL_SIZE - 1] = '\0';
+
+	return 0;
+}
+
 /*
  * Look for a btrfs signature on a device. This may be called out of the mount path
  * and we are not allowed to call set_blocksize during the scan. The superblock
@@ -952,13 +1002,11 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
 	struct btrfs_super_block *disk_super;
 	struct block_device *bdev;
 	struct page *page;
-	void *p;
 	int ret = -EINVAL;
 	u64 devid;
 	u64 transid;
 	u64 total_devices;
 	u64 bytenr;
-	pgoff_t index;
 
 	/*
 	 * we would like to check all the supers, but that would make
@@ -971,41 +1019,14 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
 	mutex_lock(&uuid_mutex);
 
 	bdev = blkdev_get_by_path(path, flags, holder);
-
 	if (IS_ERR(bdev)) {
 		ret = PTR_ERR(bdev);
 		goto error;
 	}
 
-	/* make sure our super fits in the device */
-	if (bytenr + PAGE_CACHE_SIZE >= i_size_read(bdev->bd_inode))
+	if (btrfs_read_disk_super(bdev, bytenr, &page, &disk_super))
 		goto error_bdev_put;
 
-	/* make sure our super fits in the page */
-	if (sizeof(*disk_super) > PAGE_CACHE_SIZE)
-		goto error_bdev_put;
-
-	/* make sure our super doesn't straddle pages on disk */
-	index = bytenr >> PAGE_CACHE_SHIFT;
-	if ((bytenr + sizeof(*disk_super) - 1) >> PAGE_CACHE_SHIFT != index)
-		goto error_bdev_put;
-
-	/* pull in the page with our super */
-	page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
-				   index, GFP_NOFS);
-
-	if (IS_ERR_OR_NULL(page))
-		goto error_bdev_put;
-
-	p = kmap(page);
-
-	/* align our pointer to the offset of the super block */
-	disk_super = p + (bytenr & ~PAGE_CACHE_MASK);
-
-	if (btrfs_super_bytenr(disk_super) != bytenr ||
-	    btrfs_super_magic(disk_super) != BTRFS_MAGIC)
-		goto error_unmap;
-
 	devid = btrfs_stack_device_id(&disk_super->dev_item);
 	transid = btrfs_super_generation(disk_super);
 	total_devices = btrfs_super_num_devices(disk_super);
 
@@ -1013,8 +1034,6 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
 	ret = device_list_add(path, disk_super, devid, fs_devices_ret);
 	if (ret > 0) {
 		if (disk_super->label[0]) {
-			if (disk_super->label[BTRFS_LABEL_SIZE - 1]) -
[PATCH V2 2/3] btrfs: maintain consistency in logging to help debugging
The optional label may or may not be set, or it might be set at some later time. However, when debugging by searching through the kernel logs, scripts need the logs to be consistent, so log search keywords shouldn't depend on optional variables; the fsid is better.

Signed-off-by: Anand Jain
---
V2: commit message corrected

 fs/btrfs/volumes.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8f1d175..e76cad3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1033,13 +1033,8 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
 	ret = device_list_add(path, disk_super, devid, fs_devices_ret);
 	if (ret > 0) {
-		if (disk_super->label[0]) {
-			printk(KERN_INFO "BTRFS: device label %s ", disk_super->label);
-		} else {
-			printk(KERN_INFO "BTRFS: device fsid %pU ", disk_super->fsid);
-		}
-
-		printk(KERN_CONT "devid %llu transid %llu %s\n", devid, transid, path);
+		printk(KERN_INFO "BTRFS: device fsid %pU devid %llu transid %llu %s\n",
+			disk_super->fsid, devid, transid, path);
 		ret = 0;
 	}
 	if (!ret && fs_devices_ret)
-- 
2.4.1
[PATCH] Btrfs: add missing brelse when superblock checksum fails
Looks like an oversight: call brelse() when the checksum fails. Further down the code, in the non-error path, we do call brelse(), which is why we don't see brelse() in the goto error... paths.

Signed-off-by: Anand Jain
---
 fs/btrfs/disk-io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7191b32..99d0804 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2665,6 +2665,7 @@ int open_ctree(struct super_block *sb,
 	if (btrfs_check_super_csum(bh->b_data)) {
 		printk(KERN_ERR "BTRFS: superblock checksum mismatch\n");
 		err = -EINVAL;
+		brelse(bh);
 		goto fail_alloc;
 	}
-- 
2.4.1
[PATCH v2 2/2] btrfs: qgroup: Don't copy extent buffer to do qgroup rescan
Ancient qgroup code calls memcpy() on an extent buffer and uses the copy for leaf iteration. As an extent buffer contains locks and pointers to pages, it is never sane to do such a copy. The following bug may be caused by this insane operation:

[92098.841309] general protection fault: [#1] SMP
[92098.841338] Modules linked in: ...
[92098.841814] CPU: 1 PID: 24655 Comm: kworker/u4:12 Not tainted 4.3.0-rc1 #1
[92098.841868] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper [btrfs]
[92098.842261] Call Trace:
[92098.842277] [] ? read_extent_buffer+0xb8/0x110 [btrfs]
[92098.842304] [] ? btrfs_find_all_roots+0x60/0x70 [btrfs]
[92098.842329] [] btrfs_qgroup_rescan_worker+0x28d/0x5a0 [btrfs]

Here btrfs_qgroup_rescan_worker+0x28d is btrfs_disk_key_to_cpu(), called when reading a key from the memcpied extent_buffer.

This patch reads the whole leaf into memory and uses the newly introduced stack functions to do the qgroup rescan.

Reported-by: Stephane Lesimple
Signed-off-by: Qu Wenruo
---
v2: Follow the parameter change in the previous patch.
---
 fs/btrfs/qgroup.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index e9ace09..6a83a40 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2183,11 +2183,11 @@ void assert_qgroups_uptodate(struct btrfs_trans_handle *trans)
  */
 static int
 qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, struct btrfs_path *path,
-		   struct btrfs_trans_handle *trans,
-		   struct extent_buffer *scratch_leaf)
+		   struct btrfs_trans_handle *trans, char *stack_leaf)
 {
 	struct btrfs_key found;
 	struct ulist *roots = NULL;
+	struct btrfs_header *header;
 	struct seq_list tree_mod_seq_elem = SEQ_LIST_INIT(tree_mod_seq_elem);
 	u64 num_bytes;
 	int slot;
@@ -2224,13 +2224,15 @@ qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, struct btrfs_path *path,
 	fs_info->qgroup_rescan_progress.objectid = found.objectid + 1;
 
 	btrfs_get_tree_mod_seq(fs_info, &tree_mod_seq_elem);
-	memcpy(scratch_leaf, path->nodes[0], sizeof(*scratch_leaf));
+	read_extent_buffer(path->nodes[0], stack_leaf, 0,
+			   fs_info->extent_root->nodesize);
+	header = (struct btrfs_header *)stack_leaf;
 	slot = path->slots[0];
 	btrfs_release_path(path);
 	mutex_unlock(&fs_info->qgroup_rescan_lock);
 
-	for (; slot < btrfs_header_nritems(scratch_leaf); ++slot) {
-		btrfs_item_key_to_cpu(scratch_leaf, &found, slot);
+	for (; slot < btrfs_stack_header_nritems(header); ++slot) {
+		btrfs_stack_item_key_to_cpu(header, &found, slot);
 		if (found.type != BTRFS_EXTENT_ITEM_KEY &&
 		    found.type != BTRFS_METADATA_ITEM_KEY)
 			continue;
@@ -2261,15 +2263,15 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
 					     qgroup_rescan_work);
 	struct btrfs_path *path;
 	struct btrfs_trans_handle *trans = NULL;
-	struct extent_buffer *scratch_leaf = NULL;
+	char *stack_leaf = NULL;
 	int err = -ENOMEM;
 	int ret = 0;
 
 	path = btrfs_alloc_path();
 	if (!path)
 		goto out;
-	scratch_leaf = kmalloc(sizeof(*scratch_leaf), GFP_NOFS);
-	if (!scratch_leaf)
+	stack_leaf = kmalloc(fs_info->extent_root->nodesize, GFP_NOFS);
+	if (!stack_leaf)
 		goto out;
 
 	err = 0;
@@ -2283,7 +2285,7 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
 			err = -EINTR;
 		} else {
 			err = qgroup_rescan_leaf(fs_info, path, trans,
-						 scratch_leaf);
+						 stack_leaf);
 		}
 		if (err > 0)
 			btrfs_commit_transaction(trans, fs_info->fs_root);
@@ -2292,7 +2294,7 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
 	}
 
 out:
-	kfree(scratch_leaf);
+	kfree(stack_leaf);
 	btrfs_free_path(path);
 
 	mutex_lock(&fs_info->qgroup_rescan_lock);
-- 
2.5.3
[PATCH v2 1/2] btrfs: Add support to do stack item key operation
The normal btrfs_item_key_to_cpu() needs an extent buffer to work, and there is no stack version to handle an in-memory leaf.

Add a btrfs_stack_item_key_to_cpu() function for such an operation, which provides the basis for the later qgroup fix.

Signed-off-by: Qu Wenruo
---
v2: Change the char * parameter to struct btrfs_header *, as a leaf always starts with a header.
---
 fs/btrfs/ctree.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..4864dfe 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2683,6 +2683,17 @@ static inline void btrfs_item_key(struct extent_buffer *eb,
 	read_eb_member(eb, item, struct btrfs_item, key, disk_key);
 }
 
+static inline void btrfs_stack_item_key(struct btrfs_header *stack_leaf,
+					struct btrfs_disk_key *disk_key,
+					int nr)
+{
+	unsigned long item_offset = btrfs_item_nr_offset(nr);
+	struct btrfs_item *item;
+
+	item = (struct btrfs_item *)(stack_leaf + item_offset);
+	memcpy(disk_key, &item->key, sizeof(*disk_key));
+}
+
 static inline void btrfs_set_item_key(struct extent_buffer *eb,
 				      struct btrfs_disk_key *disk_key, int nr)
 {
@@ -2785,6 +2796,15 @@ static inline void btrfs_item_key_to_cpu(struct extent_buffer *eb,
 	btrfs_disk_key_to_cpu(key, &disk_key);
 }
 
+static inline void btrfs_stack_item_key_to_cpu(struct btrfs_header *stack_leaf,
+					       struct btrfs_key *key,
+					       int nr)
+{
+	struct btrfs_disk_key disk_key;
+	btrfs_stack_item_key(stack_leaf, &disk_key, nr);
+	btrfs_disk_key_to_cpu(key, &disk_key);
+}
+
 static inline void btrfs_dir_item_key_to_cpu(struct extent_buffer *eb,
 					     struct btrfs_dir_item *item,
 					     struct btrfs_key *key)
-- 
2.5.3
Re: [PATCH v2] fstests: generic: Check if a bull fallocate will change extent number
On Fri, Oct 02, 2015 at 04:35:53PM +0800, Qu Wenruo wrote:
> Hi Dave, I updated the patch and moved it to btrfs.
>
> But I still have some questions about the fallocate behavior.
>
> Just as in the new btrfs test case, I changed the fallocate range, not
> to cover the last part, to make the problem more obvious:
>
> Btrfs will truncate beyond EOF even when that's *not covered* by the
> fallocate range.
>
> It's OK for a fs to modify the extent layout during fallocate, but
> is it OK to modify extent layout completely *out* of the fallocate
> range?

It's beyond EOF, so the filesystem can make whatever changes to allocated blocks in that space whenever it likes, because userspace cannot access it.

In XFS, we don't remove unwritten extents beyond EOF if they were placed there by fallocate, except via explicit truncate() or fallocate() operations (e.g. hole punch) from userspace that manipulate that extent range beyond EOF.

What other filesystems do with blocks beyond EOF in any given operation is up to the filesystem, really. If btrfs wants to truncate away all extents beyond EOF when punching a hole that spans EOF, then you can do that. It might not be what the user expects, but blocks beyond EOF don't fall under posix_fallocate() because it explicitly states:

"If the size of the file is less than offset+len, then the file is increased to this size; otherwise the file size is left unchanged."

which means it can't allocate blocks beyond EOF.

Cheers, Dave.
-- 
Dave Chinner
da...@fromorbit.com
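[Editorial note: the posix_fallocate() behaviour Dave quotes is easy to observe from userspace. The sketch below uses the plain POSIX call (no FALLOC_FL_KEEP_SIZE), so when offset+len exceeds the current size, the file size grows to offset+len — exactly the clause quoted above. Extents beyond EOF, as discussed in the thread, require the Linux-specific fallocate(2) with FALLOC_FL_KEEP_SIZE instead.]

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello")              # file size is now 5 bytes
    os.posix_fallocate(fd, 0, 4096)     # offset + len (4096) > size...
    size = os.fstat(fd).st_size         # ...so the size is increased
finally:
    os.close(fd)
    os.unlink(path)

assert size == 4096
```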
Re: BTRFS as image store for KVM?
On 2015-10-05 07:16, Lionel Bouton wrote:
> Hi,
>
> Le 04/10/2015 14:03, Lionel Bouton a écrit :
>> [...]
>> This focus on single reader RAID1 performance surprises me.
>>
>> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
>> you need 2 processes to read from 2 devices at once) and I've never seen
>> anyone arguing that the current md code is unstable.
>
> To better illustrate my point: according to Phoronix tests, BTRFS RAID-1
> is even faster than md RAID1 most of the time.
> http://www.phoronix.com/scan.php?page=article=btrfs_raid_mdadm=1
> The only case where md RAID1 was noticeably faster is sequential reads
> with FIO libaio.

Part of this is because BTRFS's built-in raid functionality is designed for COW workloads, whereas mdraid isn't. On top of that, I would be willing to bet that they were using the dup profile for metadata when testing mdraid (which is the default when using a single disk), which isn't a fair comparison because it stores more data in that case than the BTRFS raid. If you do the same thing with ZFS, I'd expect that you would see similar results (although probably with a bigger difference between ZFS and mdraid).

A much better comparison (which _will_ sway in mdraid's favor) would be running XFS or ext4 (or even JFS) on top of mdraid, as that is the regular usage of mdraid. Furthermore, there is no sane reason to be running BTRFS on top of a single mdraid device, so this was even less of a reasonable comparison. It is worth noting, however, that using BTRFS raid1 on top of two md RAID0 devices is significantly faster than BTRFS RAID10.

> So if you base your analysis on Phoronix tests when serving large files
> to a few clients maybe md could perform better. In all other cases BTRFS
> RAID1 seems to be a better place to start if you want performance.
>
> According to the bad performance -> unstable logic, md would then be the
> less stable RAID1 implementation which doesn't make sense to me.

No reasonable person should be basing their analysis on results from someone else's testing without replicating said results themselves, especially when those results are based on benchmarks and not real workloads. This goes double for Phoronix, as they are essentially paid to make whatever the newest technology on Linux is look good.

> I'm not even saying that BTRFS performs better than md for most
> real-world scenarios (these are only benchmarks), but that arguing that
> BTRFS is not stable because it has performance issues still doesn't make
> sense to me. Even synthetic benchmarks aren't enough to find the best
> fit for real-world scenarios, so you could always find a very
> restrictive situation where any filesystem, RAID implementation, or
> volume manager could look bad, even the most robust ones.
>
> Of course if BTRFS RAID1 was always slower than md RAID1 the logic might
> make more sense. But clearly there were design decisions and performance
> tuning in BTRFS that led to better or similar performance in several
> scenarios; if the remaining scenarios don't get attention it may be
> because they represent a niche (at least from the point of view of the
> developers), not a lack of polishing.

Like I said above, a large part of this is probably because BTRFS raid was designed for COW workloads, and mdraid wasn't.
Re: Issues with unmountable BTRFS raid1 filesystem
On Mon, Oct 05, 2015 at 08:30:17AM -0400, Austin S Hemmelgarn wrote:
> I've been having issues recently with a relatively simple setup
> using a two device BTRFS raid1 on top of two two-device md RAID0's,
> and every time I've rebooted since starting to use this
> particular filesystem, I've found it unable to mount and had to
> recreate it from scratch. This is more of an inconvenience than
> anything else (while I don't have backups of it, all the data is
> trivial to recreate (in fact, so trivial that doing backups would be
> more effort than just recreating the data by hand)), but it's still
> something that I would like to try to fix.
>
> First off, general info:
> Kernel version: 4.2.1-local+ (4.2.1 with minor modifications,
> sources can be found here: https://github.com/ferroin/linux)
> Btrfs-progs version: 4.2
>
> I would post output from btrfs fi show, but that's spouting
> obviously wrong data (it's saying I'm using only 127MB with 2GB of
> allocations on each 'disk'; I had been storing approximately 4-6GB
> of actual data on the filesystem).
>
> This particular filesystem is composed of BTRFS raid1 across two
> LVM-managed DM/MD RAID0 devices, each of which spans 2 physical hard
> drives. I have a couple of other filesystems with the exact same
> configuration that have never displayed this issue.
>
> When I run 'btrfs check' on the filesystem when it refuses to mount,
> I get a number of lines like the following:
> bad metadata [, ) crossing stripe boundary
>
> followed eventually by:
> Errors found in extent allocation tree or chunk allocation

I _think_ this is a bug in mkfs from 4.2.0, fixed in later releases of the btrfs-progs.

Hugo.

> As is typical of a failed mount, dmesg shows a 'failed to read the
> system array on ' followed by 'open_ctree failed'.
>
> I doubt that this is a hardware issue because:
> 1. The memory is brand new, and I ran a 48-hour burn-in test that
> showed no errors.
> 2. A failing storage controller, PSU, or CPU would be manifesting
> with many more issues than just this.
> 3. A disk failure would mean that two different disks, from
> different manufacturing lots, are encountering errors on exactly the
> same LBAs at exactly the same time, which while possible is
> astronomically unlikely for disks bigger than a few hundred
> gigabytes (the disks in question are 1TB each).

-- 
Hugo Mills | Jazz is the sort of music where no-one plays
hugo@... carfax.org.uk | anything the same way once.
http://carfax.org.uk/ | PGP: E2AB1DE4 |
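[Editorial note: the check behind the "bad metadata ... crossing stripe boundary" message is essentially an alignment test: a metadata extent must not span a 64KiB stripe. A sketch of that test follows, assuming BTRFS_STRIPE_LEN = 64KiB as in btrfs-progs; the mkfs bug Hugo mentions placed tree blocks so that this test fired.]

```python
BTRFS_STRIPE_LEN = 64 * 1024  # stripe length assumed by the check

def crosses_stripe_boundary(start, length):
    # True if [start, start+length) does not fit within one 64KiB stripe,
    # i.e. its first and last bytes fall in different stripes.
    first_stripe = start // BTRFS_STRIPE_LEN
    last_stripe = (start + length - 1) // BTRFS_STRIPE_LEN
    return first_stripe != last_stripe

assert not crosses_stripe_boundary(0, 16 * 1024)
assert crosses_stripe_boundary(60 * 1024, 16 * 1024)  # 60K..76K spans two stripes
```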
Re: BTRFS as image store for KVM?
On Mon, Oct 5, 2015 at 7:16 AM, Lionel Bouton wrote:
> According to the bad performance -> unstable logic, md would then be the
> less stable RAID1 implementation which doesn't make sense to me.

The argument wasn't that bad performance meant that something was unstable. The argument was that a lack of significant performance optimization meant that the developers considered it unstable and not worth investing time in optimizing.

So, the question isn't whether btrfs is or isn't faster than something else. The question is whether it is or isn't faster than it could be if it were properly optimized. That is, how does btrfs perform today against btrfs from 20 years from now, which obviously cannot be benchmarked today.

That said, I'm not really convinced that the developers haven't fixed this because they feel that it would need to be redone later after major refactoring. I think it is more likely that there are just very few developers working on btrfs, and load balancing on raid just doesn't rank high on their list of interests or possibly expertise. If any are being paid to work on btrfs, then most likely their employers don't care too much about it either.

I did find the phoronix results interesting, though. The whole driver for "layer violation" is that with knowledge of the filesystem you can better optimize what you do/don't read and write, and that may be showing here.

-- 
Rich
Btrfs progs release 4.2.2
Hi,

here's a bugfix release of btrfs-progs, fixes picked from the current devel branch.

* fi label: use fallback if the label ioctl is not available
* convert: check nodesize constraints against commandline features (-O)
* scrub: report status 'running' until all devices are finished
* device scanning might crash in some scenarios
* fi usage: print summary for non-root users

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

Anand Jain (2):
      btrfs-progs: provide fail safe for BTRFS_IOC_GET_FSLABEL ioctl
      btrfs-progs: fix is_block_device() return checks

David Sterba (3):
      btrfs-progs: make sure that is_block_device will return only 0/1
      btrfs-progs: fix double free during scanning
      btrfs-progs: fix error checking in load_device_info

Qu Wenruo (1):
      btrfs-progs: utils: Check nodesize against features

Silvio Fricke (2):
      btrfs-progs: tasks info->id is a pthread_t and should not set to -1
      btrfs-progs: check: fix memset range

Zhao Lei (2):
      btrfs-progs: Increase running state's priority in stat output
      btrfs-progs: Show detail error message when write sb failed in write_dev_supers()
Re: BTRFS as image store for KVM?
Hi, On 04/10/2015 14:03, Lionel Bouton wrote: > [...] > This focus on single-reader RAID1 performance surprises me. > > 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked > you need 2 processes to read from 2 devices at once) and I've never seen > anyone arguing that the current md code is unstable. To better illustrate my point: according to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1 most of the time. http://www.phoronix.com/scan.php?page=article=btrfs_raid_mdadm=1 The only case where md RAID1 was noticeably faster is sequential reads with FIO libaio. So if you base your analysis on Phoronix tests, maybe md could perform better when serving large files to a few clients. In all other cases BTRFS RAID1 seems to be a better place to start if you want performance. According to the bad performance -> unstable logic, md would then be the least stable RAID1 implementation, which doesn't make sense to me. I'm not even saying that BTRFS performs better than md for most real-world scenarios (these are only benchmarks), but arguing that BTRFS is not stable because it has performance issues still doesn't make sense to me. Even synthetic benchmarks aren't enough to find the best fit for real-world scenarios, so you can always find a very restrictive situation where any filesystem, RAID implementation, or volume manager could look bad, even the most robust ones. Of course, if BTRFS RAID1 were always slower than md RAID1 the logic might make more sense. But clearly there were design decisions and performance tuning in BTRFS that led to better or similar performance in several scenarios; if the remaining scenarios don't get attention, it may be because they represent a niche (at least from the point of view of the developers), not a lack of polishing. 
Best regards, Lionel
Issues with unmountable BTRFS raid1 filesystem
I've been having issues recently with a relatively simple setup using a two-device BTRFS raid1 on top of two two-device md RAID0s, and every time I've rebooted since I started trying to use this particular filesystem, I've found it unable to mount and had to recreate it from scratch. This is more of an inconvenience than anything else (while I don't have backups of it, all the data is trivial to recreate; in fact, so trivial that doing backups would be more effort than just recreating the data by hand), but it's still something that I would like to try and fix. First off, general info:

Kernel version: 4.2.1-local+ (4.2.1 with minor modifications, sources can be found here: https://github.com/ferroin/linux)
Btrfs-progs version: 4.2

I would post output from btrfs fi show, but that's spouting obviously wrong data (it's saying I'm using only 127MB with 2GB of allocations on each 'disk'; I had been storing approximately 4-6GB of actual data on the filesystem). This particular filesystem is composed of BTRFS raid1 across two LVM-managed DM/MD RAID0 devices, each of which spans 2 physical hard drives. I have a couple of other filesystems with the exact same configuration that have never displayed this issue. When I run 'btrfs check' on the filesystem when it refuses to mount, I get a number of lines like the following:

bad metadata [, ) crossing stripe boundary

followed eventually by:

Errors found in extent allocation tree or chunk allocation

As is typical of a failed mount, dmesg shows a 'failed to read the system array on ' followed by 'open_ctree failed'. I doubt that this is a hardware issue because:
1. Memory is brand new, and I ran a 48-hour burn-in test that showed no errors.
2. A failing storage controller, PSU, or CPU would be manifesting with many more issues than just this.
3. A disk failure would mean that two different disks, from different manufacturing lots, are encountering errors on exactly the same LBAs at exactly the same time, which while possible is astronomically unlikely for disks bigger than a few hundred gigabytes (the disks in question are 1TB each).
Re: BTRFS as image store for KVM?
On Mon, 5 Oct 2015 13:16:03 +0200 Lionel Bouton wrote: > To better illustrate my point. > > According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1 > most of the time. > > http://www.phoronix.com/scan.php?page=article=btrfs_raid_mdadm=1 > > The only case where md RAID1 was noticeably faster is sequential reads > with FIO libaio. > > So if you base your analysis on Phoronix tests [Oops. Actually sent to the list too this time.] FYI... 1) It's worth noting that while I personally think Phoronix has more merit than it's sometimes given credit for, it does have a rather bad rep in kernel circles, due to how its results are (claimed to be) misused in support of points they don't actually support at all, once you know how to read the results taking into account the configuration used. As such, Phoronix is about the last reference you want to be using if trying to support points with kernel folks, because rightly or wrongly, a lot of them will see that and simply shut down right there, having already decided based on previous experience that there's little use arguing with somebody quoting Phoronix, since they invariably don't know how to read the results based on what was /actually/ tested given the independent variable and the test configuration. Tho I personally actually do find quite some use for various Phoronix benchmark articles, reading them with the testing context in mind. But I definitely wouldn't be pulling them out to demonstrate a point to kernel folks, unless it was near the end of a list of references making a similar point, after I'd demonstrated my ability to keep the testing context in mind when looking at their results, because I know based on the history, quoting Phoronix in support of something simply isn't likely to get me anywhere with kernel folks. As for the specific test you referenced... 2) At least the URL you pointed at was benchmarks for Intel SSDs, not spinning rust. 
The issues and bottlenecks in the case of good SSDs are so entirely different than for spinning rust that it's an entirely different debate. Among other things, good SATA-based SSDs are, and have been for a while now, so fast that if the tests really are I/O bound, SATA-bus speed tends to be the bottleneck, not device speed (thru SATA 3.0, 600 MB/s, anyway; SATA 3.2 aka SATA Express and M.2, 1969 MB/s, are fast enough to often put the bottleneck on the device, again). However, in many cases, because good SSDs and modern buses are so fast, the bottleneck actually ends up being CPU, once again. So in the case of good, reasonably current SSDs, CPU is likely to be the bottleneck. Tho these aren't as current as might be expected... The actual devices tested here are SATA 2, 300 MB/s bus speed, and are rather dated given the Dec 2014 article date, as they're only rated 205 MB/s read, 45 MB/s write, so only read is anything near bus speed, while bus speed itself is only SATA 2, 300 MB/s. Given that, it's likely that write speed is device-bound, and while raid isn't likely to slow it down despite the multi-time-writing because it /is/ device-bound, it's unlikely to be faster than single-device writing, either. But this thread was addressing read speed. Read speed is much closer to the bus speed, and depending on the application, particularly for raid, may well be CPU-bound. Where it's CPU-bound, because the device and bus speeds are relatively high, the multiple devices of RAID aren't likely to be of much benefit at all. Meanwhile, what was the actual configuration on the devices themselves? Here, we see that in both cases it was actually btrfs: btrfs with defaults as installed (in single-device mode, if reading between the lines) on top of md/raid of the tested level for the md/raid side, and native btrfs raid of the tested level on the native btrfs raid side. But... 
there's already so much that isn't known -- he says defaults where not stated, but that's still ambiguous in some cases. For instance, he does specifically state that in native mode, btrfs detects the ssds and activates ssd mode, but that it doesn't do so when installed on the md/raid. So we know for sure that he took the detected ssd (or not) defaults there. But... we do NOT know... does btrfs native raid level mean for both data and metadata, or only for (presumably) data, leaving metadata at the defaults (which is raid1 for multi-device), or perhaps the reverse, tested level metadata, defaults for data (which AFAIK is single mode for multi-device)? And in single-device btrfs on top of md/raid mode, with the md/raid at the tested level, we already know that it didn't detect ssd and enable ssd mode, and he didn't enable it manually, but what we /don't/ know for sure is how it was installed at mkfs.btrfs time and whether that might have detected ssd. If mkfs.btrfs detected ssd, it would have created single-mode metadata by default, otherwise dup-mode metadata, unless specifically told otherwise.
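For anyone wanting to rule this kind of ambiguity out of their own testing, the profiles and the ssd behaviour can be pinned explicitly instead of being left to detection. A sketch with placeholder device names -- these are not the commands used in the article, just an assumption about how one would remove the unknowns listed above:

```shell
# Pin the btrfs profiles explicitly rather than relying on
# mkfs.btrfs rotational/ssd detection (devices are placeholders).
mkfs.btrfs -d raid1 -m raid1 /dev/sdX /dev/sdY   # raid1 for data AND metadata
mkfs.btrfs -m dup /dev/md0                       # single device on md/raid, dup metadata
mount -o ssd /dev/sdX /mnt                       # force ssd mode on...
mount -o nossd /dev/sdX /mnt                     # ...or force it off
```

With the profiles and mount options stated up front, a benchmark report no longer depends on what detection happened to decide at mkfs or mount time.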
Re: [PATCH 4/4] btrfs-progs: change -t option for subvolume list to print a simple space-separated table (making it machine-readable)
On 2015-10-04 16:34, Goffredo Baroncelli wrote: > On 2015-10-04 05:37, Duncan wrote: >> Goffredo Baroncelli posted on Sat, 03 Oct 2015 19:41:33 +0200 as >> excerpted: >> >>> On 2015-10-03 12:09, Axel Burri wrote: On 2015-10-03 11:56, Goffredo Baroncelli wrote: > On 2015-10-02 18:41, a...@tty0.ch wrote: >> Old implementation used tabs "\t", and tried to work around problems >> by guessing the amount of tabs needed (e.g. "\t\t" after "top level", with >> buggy output as soon as empty uuids are printed). This will never >> work correctly, as tab width is a user-defined setting in the >> terminal. > > > Why not use string_table() and table_*() functions ? string_table(), as well as all table functions by nature, needs to know the maximum size of all cells in a row before printing, and therefore buffers all the output before printing. It would eat up a lot of memory for large tables (it is not unusual to have 1000+ subvolumes in btrfs if you make heavy use of snapshotting). Furthermore, it would slow things down by not printing the output linewise. >>> >>> >>> Assuming 200 bytes per row (== subvolume) x 1000 subvolumes = 200kB... I >>> don't think that this could be a problem, neither in terms of memory used >>> nor in terms of speed: if you have 1000+ subvolumes the most time->>> consuming activity is traversing the filesystem looking at the >>> snapshots... >> >> Perhaps unfortunately, scaling to millions of snapshots/subvolumes really >> *is* expected by some people. You'd be surprised at the number of folks >> that set up automated per-minute snapshotting with no automated thinning, >> and expect to be able to keep several years' worth of snapshots, and then >> wonder why btrfs maintenance commands such as balance take weeks/months... > [...] >> Obviously btrfs doesn't scale to that level now, and if it did, someone >> making the mistake of trying to get a listing of millions of snapshots >> would very likely change their mind before even hitting 10%... 
>> But that's why actually processing line-by-line is important, so they'll >> actually /see/ what happened and ctrl-C it, instead of the program >> aborting as it runs into (for example) the 32-bit user/kernel memory >> barrier, without printing anything useful... > Please Duncan, read the code. > *TODAY* "btrfs list" does a scan of all subvolumes, storing the information in > memory! > This is the most time-consuming activity (think of traversing a filesystem with > millions of snapshots). > *Then* "btrfs list" prints the info. > So you are already blocked at the screen until all subvolumes are read. And I > repeat: (re)processing the information requires less time than reading the > information from the disk. > [...] A quick look at the code shows me that Goffredo is right here, as __list_subvol_search() always fetches ALL data from BTRFS_IOC_TREE_SEARCH, putting it into an rbtree for later processing (assembling full paths, sorting). While there is certainly room for improvement here (assuming that BTRFS_IOC_TREE_SEARCH returns objectids in sorted order, it would definitely be possible to produce line-by-line output), the code looks pretty elegant the way it is. I still don't think it is wise to bloat things further just for printing nice tables. My impression is that "btrfs subvolume list" is human-readable enough without the '-t' flag, while the output with the '-t' flag is much more machine-readable-friendly, and thus should have the highest possible performance. e.g.: btrfs sub list -t / | (read th; while read $th ; do echo $gen; done) btrfs sub list -t | column -t Again, this is just my opinion, being a "unix-purist". Maybe a good compromise would be to use a single "\t" instead of " " as column delimiter.
[PATCH] Btrfs: fix qgroup sanity tests
With my changes to allow us to find old roots when resolving indirect refs, I introduced a regression in the sanity tests. Since we don't really care to go down into the fs roots, we just need the old behavior of returning ENOENT for dummy roots in the sanity tests. In the future, if we want to get fancy, we can populate the test fs trees with the references as well. Thanks,

Signed-off-by: Josef Bacik
---
 fs/btrfs/backref.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 5de66e9..905d697 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -339,6 +339,13 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
 		goto out;
 	}
 
+#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
+	if (unlikely(test_bit(BTRFS_ROOT_DUMMY_ROOT, &root->state))) {
+		srcu_read_unlock(&fs_info->subvol_srcu, index);
+		ret = -ENOENT;
+		goto out;
+	}
+#endif
 	if (path->search_commit_root)
 		root_level = btrfs_header_level(root->commit_root);
 	else if (time_seq == (u64)-1)
-- 
1.8.3.1
Re: [PATCH 4/4] btrfs-progs: change -t option for subvolume list to print a simple space-separated table (making it machine-readable)
On 2015-10-05 17:42, Goffredo Baroncelli wrote: > Hi Axel, > > On 2015-10-05 17:04, Axel Burri wrote: > [...] >> I still don't think it is wise to bloat things further just for printing >> nice tables. My impression is that "btrfs subvolume list" is >> human-readable enough without the '-t' flag, while the output with the '-t' >> flag is much more machine-readable-friendly, and thus should have the >> highest possible performance. e.g.: > > I disagree, the "-t" mode is for humans; the non '-t' mode would be for > machines (if you take the space as separator and the first word as key, with > the exception of the path, which starts from the "path " word and extends to the > end of line); unfortunately "top level" is an exception. Anyway btrfs is not > very machine-friendly (sic). That's what my patches are all about in the first place, to make it more machine-friendly :)

> Compare the two outputs below:
>
> $ sudo btrfs sub list -a /
> ID 257 gen 44309 top level 5 path /debian
> ID 289 gen 17415 top level 257 path debian/var/lib/machines
> ID 298 gen 35434 top level 5 path /debian-i32
> ID 299 gen 43961 top level 5 path /boot
> ID 300 gen 39452 top level 5 path /debian-jessie

"grep-friendly"

> $ sudo btrfs sub list -at /
> ID    gen     top level  path
> --    ---     ---------  ----
> 257   44310   5          /debian
> 289   17415   257        debian/var/lib/machines
> 298   35434   5          /debian-i32
> 299   43961   5          /boot
> 300   39452   5          /debian-jessie
>
> I think that is easy to declare the latter more "human" friendly.

Well yes, but in some ways the latter is also more machine-friendly, as there is less data to pipe and a whitespace-separated list is much easier to parse (of course the same applies to your table implementation below). Problems here are:
- "top level" (whitespace, addressed in [PATCH 4/4])
- timestamp (-s) "YYYY-MM-DD HH:MM:SS" (whitespace and no timezone, addressed in [PATCH 3/4])
- path can contain whitespace (not _that_ important as it's always printed last). 
> BTW with the use of the table_* functions the output becomes:
>
> $ sudo ./btrfs sub list -at /
> ID   gen    top level  path
> ===  =====  =========  ====
> 257  44311  5          /debian
> 289  17415  257        debian/var/lib/machines
> 298  35434  5          /debian-i32
> 299  43961  5          /boot
> 300  39452  5          /debian-jessie
>
> $ sudo ./btrfs sub list -aptcguqR /
> ID   gen    cgen   parent  top level  parent_uuid  received_uuid  uuid                                  path
> ===  =====  =====  ======  =========  ===========  =============  ====================================  ====
> 257  44313  7      5       5          -            -              840c86cf-e78b-d54a-ab38-2858812d      /debian
> 289  17415  17415  257     257        -            -              8b857250-3a3e-754d-810e-57342bbb2f56  debian/var/lib/machines
> 298  35434  35399  5       5          -            -              1f38049b-b153-d741-b903-d2de6fd7b3fd  /debian-i32
> 299  43961  35512  5       5          -            -              f9d52b6b-a6d1-8c45-a6cd-ddb68cf58062  /boot
> 300  39452  37744  5       5          -            -              026e44bd-66d4-e341-9a14-0124acf79793  /debian-jessie
>
> I right-aligned each field with the exception of the path (which is left-aligned).
> I already prepared the patch which converts "btrfs sub list -t" to use the
> table_* functions. If you want I can send it to you to extend/replace your
> patch #4.

Looks promising, although I'd still not recommend it because of the increased memory/cpu requirements.
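For what it's worth, the whitespace caveats discussed in this thread (the two-word "top level" field, spaces in paths) can already be worked around when parsing the current non-'-t' output, because the path is always printed last: everything after the literal " path " marker is the path. A minimal sketch in plain shell/awk, using values from the quoted example plus one hypothetical path containing a space:

```shell
# Sample 'btrfs sub list' output; the first two lines are from the quoted
# example, the third uses a hypothetical path containing a space.
list='ID 257 gen 44309 top level 5 path /debian
ID 289 gen 17415 top level 257 path debian/var/lib/machines
ID 300 gen 39452 top level 5 path /with space/subvol'

# $2 is the subvolume ID; stripping everything up to the last " path "
# marker leaves the path itself, spaces included (this breaks only in
# the pathological case of a path containing " path ").
result=$(printf '%s\n' "$list" |
  awk '{ id = $2; sub(/.* path /, ""); print id ":" $0 }')
printf '%s\n' "$result"   # e.g. first line: 257:/debian
```

This is only a parsing sketch against today's output format, not a substitute for the machine-friendly '-t' output the patches aim for.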
Re: [PATCH 4/4] btrfs-progs: change -t option for subvolume list to print a simple space-separated table (making it machine-readable)
Hi Axel, On 2015-10-05 17:04, Axel Burri wrote: [...] > > A quick look at the code shows me that Goffredo is right here, as > __list_subvol_search() always fetches ALL data from > BTRFS_IOC_TREE_SEARCH, putting it into an rbtree for later processing > (assembling full paths, sorting). > > While there is certainly room for improvement here (assuming that > BTRFS_IOC_TREE_SEARCH returns objectids in sorted order, it would > definitely be possible to produce line-by-line output), the code looks > pretty elegant the way it is. > > I still don't think it is wise to bloat things further just for printing > nice tables. My impression is that "btrfs subvolume list" is > human-readable enough without the '-t' flag, while the output with the '-t' > flag is much more machine-readable-friendly, and thus should have the > highest possible performance. e.g.: I disagree, the "-t" mode is for humans; the non '-t' mode would be for machines (if you take the space as separator and the first word as key, with the exception of the path, which starts from the "path " word and extends to the end of line); unfortunately "top level" is an exception. Anyway btrfs is not very machine-friendly (sic). Compare the two outputs below:

$ sudo btrfs sub list -a /
ID 257 gen 44309 top level 5 path /debian
ID 289 gen 17415 top level 257 path debian/var/lib/machines
ID 298 gen 35434 top level 5 path /debian-i32
ID 299 gen 43961 top level 5 path /boot
ID 300 gen 39452 top level 5 path /debian-jessie

$ sudo btrfs sub list -at /
ID    gen     top level  path
--    ---     ---------  ----
257   44310   5          /debian
289   17415   257        debian/var/lib/machines
298   35434   5          /debian-i32
299   43961   5          /boot
300   39452   5          /debian-jessie

I think it is easy to declare the latter more "human"-friendly. 
BTW with the use of the table_* functions the output becomes:

$ sudo ./btrfs sub list -at /
ID   gen    top level  path
===  =====  =========  ====
257  44311  5          /debian
289  17415  257        debian/var/lib/machines
298  35434  5          /debian-i32
299  43961  5          /boot
300  39452  5          /debian-jessie

$ sudo ./btrfs sub list -aptcguqR /
ID   gen    cgen   parent  top level  parent_uuid  received_uuid  uuid                                  path
===  =====  =====  ======  =========  ===========  =============  ====================================  ====
257  44313  7      5       5          -            -              840c86cf-e78b-d54a-ab38-2858812d      /debian
289  17415  17415  257     257        -            -              8b857250-3a3e-754d-810e-57342bbb2f56  debian/var/lib/machines
298  35434  35399  5       5          -            -              1f38049b-b153-d741-b903-d2de6fd7b3fd  /debian-i32
299  43961  35512  5       5          -            -              f9d52b6b-a6d1-8c45-a6cd-ddb68cf58062  /boot
300  39452  37744  5       5          -            -              026e44bd-66d4-e341-9a14-0124acf79793  /debian-jessie

I right-aligned each field with the exception of the path (which is left-aligned). I already prepared the patch which converts "btrfs sub list -t" to use the table_* functions. If you want I can send it to you to extend/replace your patch #4. BR G.Baroncelli > > btrfs sub list -t / | (read th; while read $th ; do echo $gen; done) > btrfs sub list -t | column -t > > Again, this is just my opinion, being a "unix-purist". Maybe a good > compromise would be to use a single "\t" instead of " " as column delimiter. -- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: BTRFS as image store for KVM?
On 2015-10-05 10:04, Duncan wrote: On Mon, 5 Oct 2015 13:16:03 +0200 Lionel Bouton wrote: To better illustrate my point. According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1 most of the time. http://www.phoronix.com/scan.php?page=article=btrfs_raid_mdadm=1 The only case where md RAID1 was noticeably faster is sequential reads with FIO libaio. So if you base your analysis on Phoronix tests [...snip...] Hmm... I think I've begun to see the kernel folks' point about people quoting Phoronix in support of their points, when it's really not apropos at all. Yes, I do still consider Phoronix reports in context to contain useful information, at some level. However, one really must be aware of what was actually tested in order to understand what the results actually mean, and unfortunately, it seems most people quoting it, including here, really can't properly do so in context, and thus end up using it in support of points that simply are not supported by the given evidence in the Phoronix articles people are attempting to use. Even aside from the obvious past issues with Phoronix reports, people forget that they are a news organization (regardless of what they claim, they _are_ a news organization), and as such their employees are not paid to verify existing results; they're paid to make impactful articles that grab people's attention (I'd be willing to bet that this story started in response to the people who pointed out correctly that XFS or ext4 on top of mdraid beats the pants off of BTRFS performance-wise, and (incorrectly) assumed that this meant that mdraid was better than BTRFS raid). This, when combined with almost no evidence in many cases of actual statistical analysis, really hurts their credibility (at least, for me it does). The other issue is that so many people tout benchmarks as the pinnacle of testing, when they really aren't. 
Benchmarks are by definition synthetic workloads, and as such only the _really_ good ones (of which there aren't many) give you more than a very basic idea of what performance differences you can expect with a given workload. On top of that, people just accept results without trying to reproduce them themselves (kernel folks tend to be much better about this than many other people, though). A truly sane person, looking to determine the best configuration for a given workload, will:
1. Look at a wide variety of sources to determine what configurations he should even be testing. (The author of the linked article obviously didn't do this, or just didn't care; the defaults on btrfs are unsuitable for a significant number of cases, including usage on top of mdraid.)
2. Using this information, run established benchmarks similar to his use-case to further narrow down the test candidates.
3. Write his own benchmark to simulate to the greatest degree possible the actual workload he expects to run, and then use that for testing the final candidates.
4. Gather some reasonable number of samples with the above-mentioned benchmark, and use _real_ statistical analysis to determine what he should be using.
To put this in further perspective, most people just do step one, assume that other people know what they're talking about, and don't do any further testing; and there are other people who just do step two and then claim their results are infallible.
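The statistical analysis called for in step 4 doesn't need anything heavyweight; even mean and sample standard deviation across repeated runs is enough to see whether two configurations differ beyond run-to-run noise. A minimal sketch in shell/awk, with hypothetical timing samples:

```shell
# Hypothetical timing samples (e.g. seconds per benchmark run), one per line.
samples='11.2
10.8
11.5
10.9
11.1'

# Mean and sample standard deviation: the bare minimum needed to judge
# whether a measured difference between two setups exceeds the noise.
stats=$(printf '%s\n' "$samples" | awk '
  { sum += $1; sumsq += $1 * $1; n++ }
  END {
    mean = sum / n
    sd = sqrt((sumsq - sum * sum / n) / (n - 1))
    printf "mean=%.2f sd=%.2f n=%d\n", mean, sd, n
  }')
echo "$stats"   # mean=11.10 sd=0.27 n=5
```

If the difference between two configurations' means is within a couple of standard deviations of this spread, a single benchmark run tells you essentially nothing.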