Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list
2011/12/12 Alexandre Oliva <ol...@lsd.ic.unicamp.br>:
> On Dec 7, 2011, Christian Brunner <c...@muc.de> wrote:
>> With this patch applied I get much higher write-io values than
>> without it. Some of the other patches help to reduce the effect,
>> but it's still significant.
>>
>> iostat on an unpatched node is giving me:
>>
>> Device:  rrqm/s  wrqm/s    r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
>> sda      105.90    0.37  15.42  14.48 2657.33  560.13   107.61     1.89  62.75   6.26  18.71
>>
>> while on a node with this patch it's
>>
>> sda      128.20    0.97  11.10  57.15 3376.80  552.80    57.58    20.58 296.33   4.16  28.36
>>
>> Also interesting is the fact that the average request size on the
>> patched node is much smaller.
>
> That's probably expected for writes, as bitmaps are expected to be
> more fragmented, even if used only for metadata (or are you on SSD?)

It's a traditional hardware RAID5 with spinning disks.

I would accept this if the writes would start right after the mount,
but in this case it takes a few hours until the writes increase.
That's why I'm almost certain that something is still wrong.

> Bitmaps are just a different in-memory (and on-disk-cache, if
> enabled) representation of free space that can be far more compact:
> one bit per disk block, rather than an extent list entry. They're
> interchangeable otherwise; it's just that searching bitmaps for a
> free block (bit) is somewhat more expensive than taking the next
> entry from a list, but you don't want to use up too much memory with
> long lists of e.g. single-block free extents.

Thanks for the explanation! I'll try to insert some debugging code
once my test server is ready.

Christian
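To picture the two representations Alexandre describes: both are carried
by the same free-space entry, which in kernels of this era looks roughly
like this (simplified from fs/btrfs/free-space-cache.h; the comments are
mine, not the kernel's):

struct btrfs_free_space {
	struct rb_node offset_index;	/* entry in the per-block-group rbtree */
	u64 offset;			/* start of the free region */
	u64 bytes;			/* amount of free space described */
	unsigned long *bitmap;		/* NULL: one contiguous free extent
					 * [offset, offset + bytes); non-NULL:
					 * one bit per block over a fixed-size
					 * window -- compact, but a free block
					 * must be found by scanning bits */
	struct list_head list;		/* cluster-list linkage -- the field
					 * left uninitialized before this patch */
};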
Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list
2011/12/7 Christian Brunner <c...@muc.de>:
> 2011/12/1 Christian Brunner <c...@muc.de>:
>> 2011/12/1 Alexandre Oliva <ol...@lsd.ic.unicamp.br>:
>>> On Nov 29, 2011, Christian Brunner <c...@muc.de> wrote:
>>>> When I'm doing heavy reading in our ceph cluster, the load and
>>>> wait-io on the patched servers is higher than on the unpatched
>>>> ones.
>>>
>>> That's unexpected.
>>
>> In the meantime I know that it's not related to the reads.
>>
>>> I suppose I could wave my hands while explaining that you're
>>> getting higher data throughput, so it's natural that it would take
>>> up more resources, but that explanation doesn't satisfy me. I
>>> suppose allocation might have got slightly more CPU intensive in
>>> some cases, as we now use bitmaps where before we'd only use the
>>> cheaper-to-allocate extents. But that's unsatisfying as well.
>>
>> I must admit that I do not completely understand the difference
>> between bitmaps and extents. From what I see on my servers, I can
>> tell that the degradation over time is gone. (Rebooting the servers
>> every day is no longer needed. This is a real plus.) But the
>> performance compared to a freshly booted, unpatched server is much
>> slower with my ceph workload.
>>
>> I wonder if it would make sense to initialize the list field only
>> when the cluster setup fails? This would avoid the fallback to the
>> much slower unclustered allocation and would give us the
>> cheaper-to-allocate extents.
>
> I've now tried various combinations of your patches and I can really
> nail it down to this one line. With this patch applied I get much
> higher write-io values than without it. Some of the other patches
> help to reduce the effect, but it's still significant.
>
> iostat on an unpatched node is giving me:
>
> Device:  rrqm/s  wrqm/s    r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
> sda      105.90    0.37  15.42  14.48 2657.33  560.13   107.61     1.89  62.75   6.26  18.71
>
> while on a node with this patch it's
>
> sda      128.20    0.97  11.10  57.15 3376.80  552.80    57.58    20.58 296.33   4.16  28.36
>
> Also interesting is the fact that the average request size on the
> patched node is much smaller.
>
> Josef was telling me that this could be related to the number of
> bitmaps we write out, but I've no idea how to trace this.
>
> I would be very happy if someone could give me a hint on what to do
> next, as this is one of the last remaining issues with our ceph
> cluster.

This is still bugging me, and I just remembered something that might be
helpful. (I hope this is not misleading...)

Back in 2.6.38 we were running ceph without btrfs performance
degradation. I found a thread on the list where similar problems were
reported:

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg10346.html

In that thread someone bisected the issue to:

From 4e69b598f6cfb0940b75abf7e179d6020e94ad1e Mon Sep 17 00:00:00 2001
From: Josef Bacik <jo...@redhat.com>
Date: Mon, 21 Mar 2011 10:11:24 -0400
Subject: [PATCH] Btrfs: cleanup how we setup free space clusters

In this commit the bitmaps handling was changed, so I just thought that
this may be related. I'm still hoping that someone with a deeper
understanding of btrfs could take a look at this.

Thanks,
Christian
Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list
2011/12/1 Christian Brunner <c...@muc.de>:
> 2011/12/1 Alexandre Oliva <ol...@lsd.ic.unicamp.br>:
>> On Nov 29, 2011, Christian Brunner <c...@muc.de> wrote:
>>> When I'm doing heavy reading in our ceph cluster, the load and
>>> wait-io on the patched servers is higher than on the unpatched ones.
>>
>> That's unexpected.
>
> In the meantime I know that it's not related to the reads.
>
>> I suppose I could wave my hands while explaining that you're getting
>> higher data throughput, so it's natural that it would take up more
>> resources, but that explanation doesn't satisfy me. I suppose
>> allocation might have got slightly more CPU intensive in some cases,
>> as we now use bitmaps where before we'd only use the
>> cheaper-to-allocate extents. But that's unsatisfying as well.
>
> I must admit that I do not completely understand the difference
> between bitmaps and extents. From what I see on my servers, I can tell
> that the degradation over time is gone. (Rebooting the servers every
> day is no longer needed. This is a real plus.) But the performance
> compared to a freshly booted, unpatched server is much slower with my
> ceph workload.
>
> I wonder if it would make sense to initialize the list field only when
> the cluster setup fails? This would avoid the fallback to the much
> slower unclustered allocation and would give us the
> cheaper-to-allocate extents.

I've now tried various combinations of your patches and I can really
nail it down to this one line. With this patch applied I get much higher
write-io values than without it. Some of the other patches help to
reduce the effect, but it's still significant.

iostat on an unpatched node is giving me:

Device:  rrqm/s  wrqm/s    r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda      105.90    0.37  15.42  14.48 2657.33  560.13   107.61     1.89  62.75   6.26  18.71

while on a node with this patch it's

sda      128.20    0.97  11.10  57.15 3376.80  552.80    57.58    20.58 296.33   4.16  28.36

Also interesting is the fact that the average request size on the
patched node is much smaller.

Josef was telling me that this could be related to the number of bitmaps
we write out, but I've no idea how to trace this.

I would be very happy if someone could give me a hint on what to do
next, as this is one of the last remaining issues with our ceph cluster.

Thanks,
Christian
Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list
On Nov 29, 2011, Christian Brunner <c...@muc.de> wrote:
> When I'm doing heavy reading in our ceph cluster, the load and wait-io
> on the patched servers is higher than on the unpatched ones.

That's unexpected.

> This seems to be coming from btrfs-endio-1, a kernel thread that has
> not caught my attention on unpatched systems yet.

I suppose I could wave my hands while explaining that you're getting
higher data throughput, so it's natural that it would take up more
resources, but that explanation doesn't satisfy me. I suppose allocation
might have got slightly more CPU intensive in some cases, as we now use
bitmaps where before we'd only use the cheaper-to-allocate extents. But
that's unsatisfying as well.

> Do you have any idea what's going on here?

Sorry, not really.

> (Please note that the filesystem is still unmodified - metadata
> overhead is large.)

Speaking of metadata overhead, I found out that the bitmap-enabling
patch is not enough for a metadata balance to get rid of excess metadata
block groups. I had to apply patch #16 to get that working again. It
sort of makes sense: without patch 16, too often will we get to the end
of the list of metadata block groups and advance from LOOP_FIND_IDEAL to
LOOP_CACHING_WAIT (skipping NOWAIT after we've cached free space for all
block groups), and if we get to the end of that loop as well (how? I
couldn't quite figure out, but it only seems to happen under high
contention) we'll advance to LOOP_ALLOC_CHUNK and end up unnecessarily
allocating a new chunk. Patch 16 makes sure we don't jump ahead during
LOOP_CACHING_WAIT, so we won't get new chunks unless they can really
help us keep the system going.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/      FSF Latin America board member
Free Software Evangelist           Red Hat Brazil Compiler Engineer
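For readers following along: the retry ladder Alexandre refers to lives
in find_free_extent(). In kernels of this generation the stages were
defined roughly as below (names from fs/btrfs/extent-tree.c of that era;
the comments are my paraphrase, not the original ones):

/* Sketch of the find_free_extent() retry stages */
#define LOOP_FIND_IDEAL		0  /* only the block group hinted by the
				    * previous allocation */
#define LOOP_CACHING_NOWAIT	1  /* any group whose free space is
				    * already cached */
#define LOOP_CACHING_WAIT	2  /* wait for in-flight free-space
				    * caching, then retry */
#define LOOP_ALLOC_CHUNK	3  /* still nothing: allocate a
				    * brand-new chunk */
#define LOOP_NO_EMPTY_SIZE	4  /* last resort: retry without the
				    * cluster/empty-size constraints */

Each failed pass over all block groups bumps the allocator to the next
stage, which is why jumping ahead from LOOP_CACHING_WAIT (the skip that
patch 16 prevents) can land in LOOP_ALLOC_CHUNK and allocate chunks that
aren't actually needed.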
Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list
2011/11/28 Alexandre Oliva <ol...@lsd.ic.unicamp.br>:
> We're failing to create clusters with bitmaps because
> setup_cluster_no_bitmap checks that the list is empty before inserting
> the bitmap entry in the list for setup_cluster_bitmap, but the list
> field is only initialized when it is restored from the on-disk free
> space cache, or when it is written out to disk.
>
> Besides a potential race condition due to the multiple use of the list
> field, filesystem performance severely degrades over time: as we use
> up all non-bitmap free extents, the try-to-set-up-cluster dance is
> done at every metadata block allocation. For every block group, we
> fail to set up a cluster, and after failing on them all up to twice,
> we fall back to the much slower unclustered allocation.

This matches exactly what I've been observing in our ceph cluster. I've
now installed your patches (1-11) on two servers. The cluster setup
problem seems to be gone - a big thanks for that!

However, another thing is causing me some headache: when I'm doing heavy
reading in our ceph cluster, the load and wait-io on the patched servers
is higher than on the unpatched ones.

Dstat from an unpatched server:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  1   6  83   8   0   1|  22M  348k| 336k   93M|   0     0 |8445  3715
  1   5  87   7   0   1|  12M 1808k| 214k   65M|   0     0 |5461  1710
  1   3  85  10   0   0|  11M  640k| 313k   49M|   0     0 |5919  2853
  1   6  84   9   0   1|  12M  608k| 358k   69M|   0     0 |7406  3645
  1   7  78  13   0   1|  15M 5344k| 348k  105M|   0     0 |9765  4403
  1   7  80  10   0   1|  22M 1368k| 358k   89M|   0     0 |8036  3202
  1   9  72  16   0   1|  22M 2424k| 646k  137M|   0     0 | 12k  5527

Dstat from a patched server:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  1   2  61  35   0   0|2500k 2736k| 141k   34M|   0     0 |4415  1603
  1   4  48  47   0   1|  10M 3924k| 353k   61M|   0     0 |6871  3771
  1   5  55  38   0   1|  10M 1728k| 385k   92M|   0     0 |8030  2617
  2   8  69  20   0   1|  18M 1384k| 435k  130M|   0     0 | 10k  4493
  1   5  85   8   0   1|7664k   84k| 287k   97M|   0     0 |6231  1357
  1   3  91   5   0   0|  10M  144k| 194k   44M|   0     0 |3807  1081
  1   7  66  25   0   1|  20M 1248k| 404k  101M|   0     0 |8676  3632
  0   3  38  58   0   0|8104k 2660k| 176k   40M|   0     0 |4841  2093

This seems to be coming from btrfs-endio-1, a kernel thread that has not
caught my attention on unpatched systems yet.

I did some tracing on that process with ftrace, and I can see that the
time is wasted in end_bio_extent_readpage(). In a single call to
end_bio_extent_readpage() the functions unlock_extent_cached(),
unlock_page() and btrfs_readpage_end_io_hook() are invoked 128 times
(each).

Do you have any idea what's going on here? (Please note that the
filesystem is still unmodified - metadata overhead is large.)

Thanks,
Christian
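For context, 128 invocations per call is what one would expect if each
completed bio carries 128 pages, since the read-completion handler does
its unlock and verification work once per page. A minimal sketch of that
pattern (hypothetical code; the real end_bio_extent_readpage() in
fs/btrfs/extent_io.c also handles errors and extent-range bookkeeping):

#include <linux/bio.h>
#include <linux/pagemap.h>

/* Hypothetical read-completion handler showing the per-page loop;
 * the signature matches the bio_end_io_t of this kernel generation.
 * Error handling is omitted in this sketch. */
static void read_endio_sketch(struct bio *bio, int err)
{
	int i;

	/* one iteration per page the bio carries -- 128 calls to the
	 * per-page helpers just means a bio of 128 pages completed */
	for (i = 0; i < bio->bi_vcnt; i++) {
		struct page *page = bio->bi_io_vec[i].bv_page;

		/* the real handler verifies checksums here
		 * (btrfs_readpage_end_io_hook) and drops the extent
		 * range lock (unlock_extent_cached) for this page */
		SetPageUptodate(page);	/* mark the read data valid */
		unlock_page(page);	/* wake readers blocked on the page */
	}
	bio_put(bio);			/* drop the bio's last reference */
}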
[PATCH 02/20] Btrfs: initialize new bitmaps' list
We're failing to create clusters with bitmaps because
setup_cluster_no_bitmap checks that the list is empty before inserting
the bitmap entry in the list for setup_cluster_bitmap, but the list
field is only initialized when it is restored from the on-disk free
space cache, or when it is written out to disk.

Besides a potential race condition due to the multiple use of the list
field, filesystem performance severely degrades over time: as we use up
all non-bitmap free extents, the try-to-set-up-cluster dance is done at
every metadata block allocation. For every block group, we fail to set
up a cluster, and after failing on them all up to twice, we fall back
to the much slower unclustered allocation.

To make matters worse, before the unclustered allocation, we try to
create new block groups until we reach the 1% threshold, which
introduces additional bitmaps and thus block groups that we'll iterate
over at each metadata block request.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/free-space-cache.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 6e5b7e4..ff179b1 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1470,6 +1470,7 @@ static void add_new_bitmap(struct btrfs_free_space_ctl *ctl,
 {
 	info->offset = offset_to_bitmap(ctl, offset);
 	info->bytes = 0;
+	INIT_LIST_HEAD(&info->list);
 	link_free_space(ctl, info);
 	ctl->total_bitmaps++;
-- 
1.7.4.4
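A note on why this single INIT_LIST_HEAD() is enough to fix the cluster
setup: list_empty() merely checks head->next == head, so on a list_head
that was never initialized (for example, freshly zeroed by the
allocator, so next == NULL) it reports "not empty", and the entry looks
as if it were already on a cluster list. A hypothetical demonstration of
that behavior (illustration only, not btrfs code):

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/types.h>

struct entry_like {		/* stand-in for struct btrfs_free_space */
	struct list_head list;
};

static bool list_init_demo(void)
{
	bool before, after;
	struct entry_like *e = kzalloc(sizeof(*e), GFP_NOFS);

	if (!e)
		return false;
	before = list_empty(&e->list);	/* false: next is NULL, not &e->list */
	INIT_LIST_HEAD(&e->list);	/* now next == prev == &e->list */
	after = list_empty(&e->list);	/* true, as a fresh entry should be */
	kfree(e);
	return !before && after;	/* initialization flips the answer */
}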