Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-12-12 Thread Christian Brunner
2011/12/12 Alexandre Oliva ol...@lsd.ic.unicamp.br:
 On Dec  7, 2011, Christian Brunner c...@muc.de wrote:

 With this patch applied I get much higher write-io values than without
 it. Some of the other patches help to reduce the effect, but it's
 still significant.

 iostat on an unpatched node is giving me:

 Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
 sda             105.90     0.37   15.42   14.48  2657.33   560.13   107.61     1.89   62.75   6.26  18.71

 while on a node with this patch it's
 sda             128.20     0.97   11.10   57.15  3376.80   552.80    57.58    20.58  296.33   4.16  28.36


 Also interesting is the fact that the average request size on the
 patched node is much smaller.

 That's probably expected for writes, as bitmaps are expected to be more
 fragmented, even if used only for metadata (or are you on SSD?)


It's a traditional hardware RAID5 with spinning disks. I would accept
this if the writes started right after the mount, but in this case it
takes a few hours until the writes increase. That's why I'm almost
certain that something is still wrong.

 Bitmaps are just a different in-memory (and on-disk-cache, if enabled)
 representation of free space that can be far more compact: one bit per
 disk block, rather than an extent list entry.  They're interchangeable
 otherwise; it's just that searching bitmaps for a free block (bit) is
 somewhat more expensive than taking the next entry from a list, but you
 don't want to use up too much memory with long lists of
 e.g. single-block free extents.
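
To make sure I've got the trade-off right, here is a rough userspace
sketch of the two representations. This is purely illustrative plain C
with made-up types and helpers (free_extent, free_bitmap and the two
alloc functions are not btrfs code), and the 128 MiB window per bitmap
is just an assumed example: an extent entry hands out its next block in
O(1), while a bitmap has to be scanned bit by bit.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE   4096ULL
#define BITMAP_BITS  32768              /* one window of 32768 blocks (128 MiB at 4 KiB) */

/* Extent-style entry: one record per contiguous free range. */
struct free_extent {
	uint64_t offset;                /* start of the free range, in bytes */
	uint64_t bytes;                 /* length of the free range */
};

/* Bitmap-style entry: one bit per block within a fixed window. */
struct free_bitmap {
	uint64_t offset;                /* start of the window */
	unsigned long bits[BITMAP_BITS / (8 * sizeof(unsigned long))];
};

/* Taking a block from an extent entry is O(1): just shrink the range. */
static uint64_t alloc_from_extent(struct free_extent *e)
{
	uint64_t block = e->offset;
	e->offset += BLOCK_SIZE;
	e->bytes  -= BLOCK_SIZE;
	return block;
}

/* Finding a free block in a bitmap means scanning for a set bit. */
static uint64_t alloc_from_bitmap(struct free_bitmap *b)
{
	for (unsigned int i = 0; i < BITMAP_BITS; i++) {
		unsigned long *word = &b->bits[i / (8 * sizeof(unsigned long))];
		unsigned long mask = 1UL << (i % (8 * sizeof(unsigned long)));
		if (*word & mask) {
			*word &= ~mask;         /* mark the block used */
			return b->offset + (uint64_t)i * BLOCK_SIZE;
		}
	}
	return (uint64_t)-1;                    /* window exhausted */
}

int main(void)
{
	struct free_extent e = { .offset = 0, .bytes = 16 * BLOCK_SIZE };
	struct free_bitmap b = { .offset = 128 * 1024 * 1024ULL };

	memset(b.bits, 0xff, sizeof(b.bits));   /* mark every block in the window free */
	printf("from extent: %llu\n", (unsigned long long)alloc_from_extent(&e));
	printf("from bitmap: %llu\n", (unsigned long long)alloc_from_bitmap(&b));
	return 0;
}

The memory argument goes the other way: the bitmap above costs a fixed
4 KiB no matter how fragmented the window is, whereas tracking the same
space as thousands of single-block extent entries would need a separate
record per fragment.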

Thanks for the explanation! I'll try to insert some debugging code once
my test server is ready.

Christian


Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-12-09 Thread Christian Brunner
2011/12/7 Christian Brunner c...@muc.de:
 2011/12/1 Christian Brunner c...@muc.de:
 2011/12/1 Alexandre Oliva ol...@lsd.ic.unicamp.br:
 On Nov 29, 2011, Christian Brunner c...@muc.de wrote:

 When I'm doing heavy reading in our ceph cluster, the load and wait-io
 on the patched servers is higher than on the unpatched ones.

 That's unexpected.

 In the meantime I've learned that it's not related to the reads.

 I suppose I could wave my hands while explaining that you're getting
 higher data throughput, so it's natural that it would take up more
 resources, but that explanation doesn't satisfy me.  I suppose
 allocation might have got slightly more CPU intensive in some cases, as
 we now use bitmaps where before we'd only use the cheaper-to-allocate
 extents.  But that's unsatisfying as well.

 I must admit that I do not completely understand the difference
 between bitmaps and extents.

 From what I see on my servers, I can tell that the degradation over
 time is gone. (Rebooting the servers every day is no longer needed.
 This is a real plus.) But the performance compared to a freshly
 booted, unpatched server is much worse with my ceph workload.

 I wonder if it would make sense to initialize the list field only
 when the cluster setup fails. This would avoid the fallback to the
 much slower unclustered allocation and would give us the
 cheaper-to-allocate extents.

 I've now tried various combinations of your patches and I can really
 nail it down to this one line.

 With this patch applied I get much higher write-io values than without
 it. Some of the other patches help to reduce the effect, but it's
 still significant.

 iostat on an unpatched node is giving me:

 Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
 sda             105.90     0.37   15.42   14.48  2657.33   560.13   107.61     1.89   62.75   6.26  18.71

 while on a node with this patch it's
 sda             128.20     0.97   11.10   57.15  3376.80   552.80    57.58    20.58  296.33   4.16  28.36


 Also interesting is the fact that the average request size on the
 patched node is much smaller.

 Josef was telling me that this could be related to the number of
 bitmaps we write out, but I've no idea how to trace this.

 I would be very happy if someone could give me a hint on what to do
 next, as this is one of the last remaining issues with our ceph
 cluster.

This is still bugging me and I just remembered something that might be
helpful. I also hope that this is not misleading...

Back in 2.6.38 we were running ceph without btrfs performance
degradation. I found a thread on the list where similar problems were
reported:

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg10346.html

In that thread someone bisected the issue to

From 4e69b598f6cfb0940b75abf7e179d6020e94ad1e Mon Sep 17 00:00:00 2001
From: Josef Bacik jo...@redhat.com
Date: Mon, 21 Mar 2011 10:11:24 -0400
Subject: [PATCH] Btrfs: cleanup how we setup free space clusters

In this commit the bitmap handling was changed, so I thought that this
may be related.

I'm still hoping that someone with a deeper understanding of btrfs
could take a look at this.

Thanks,
Christian


Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-12-07 Thread Christian Brunner
2011/12/1 Christian Brunner c...@muc.de:
 2011/12/1 Alexandre Oliva ol...@lsd.ic.unicamp.br:
 On Nov 29, 2011, Christian Brunner c...@muc.de wrote:

 When I'm doing heavy reading in our ceph cluster, the load and wait-io
 on the patched servers is higher than on the unpatched ones.

 That's unexpected.

In the meantime I've learned that it's not related to the reads.

 I suppose I could wave my hands while explaining that you're getting
 higher data throughput, so it's natural that it would take up more
 resources, but that explanation doesn't satisfy me.  I suppose
 allocation might have got slightly more CPU intensive in some cases, as
 we now use bitmaps where before we'd only use the cheaper-to-allocate
 extents.  But that's unsatisfying as well.

 I must admit that I do not completely understand the difference
 between bitmaps and extents.

 From what I see on my servers, I can tell that the degradation over
 time is gone. (Rebooting the servers every day is no longer needed.
 This is a real plus.) But the performance compared to a freshly
 booted, unpatched server is much worse with my ceph workload.

 I wonder if it would make sense to initialize the list field only
 when the cluster setup fails. This would avoid the fallback to the
 much slower unclustered allocation and would give us the
 cheaper-to-allocate extents.

I've now tried various combinations of your patches and I can really
nail it down to this one line.

With this patch applied I get much higher write-io values than without
it. Some of the other patches help to reduce the effect, but it's
still significant.

iostat on an unpatched node is giving me:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda             105.90     0.37   15.42   14.48  2657.33   560.13   107.61     1.89   62.75   6.26  18.71

while on a node with this patch it's
sda             128.20     0.97   11.10   57.15  3376.80   552.80    57.58    20.58  296.33   4.16  28.36


Also interesting is the fact that the average request size on the
patched node is much smaller.

Josef was telling me that this could be related to the number of
bitmaps we write out, but I've no idea how to trace this.

I would be very happy if someone could give me a hint on what to do
next, as this is one of the last remaining issues with our ceph
cluster.

Thanks,
Christian


Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-11-30 Thread Alexandre Oliva
On Nov 29, 2011, Christian Brunner c...@muc.de wrote:

 When I'm doing heavy reading in our ceph cluster, the load and wait-io
 on the patched servers is higher than on the unpatched ones.

That's unexpected.

 This seems to be coming from btrfs-endio-1, a kernel thread that has
 not caught my attention on unpatched systems yet.

I suppose I could wave my hands while explaining that you're getting
higher data throughput, so it's natural that it would take up more
resources, but that explanation doesn't satisfy me.  I suppose
allocation might have got slightly more CPU intensive in some cases, as
we now use bitmaps where before we'd only use the cheaper-to-allocate
extents.  But that's unsatisfying as well.

 Do you have any idea what's going on here?

Sorry, not really.

 (Please note that the filesystem is still unmodified - metadata
 overhead is large).

Speaking of metadata overhead, I found out that the bitmap-enabling
patch is not enough for a metadata balance to get rid of excess metadata
block groups.  I had to apply patch #16 to get it again.  It sort of
makes sense: without patch 16, too often will we get to the end of the
list of metadata block groups and advance from LOOP_FIND_IDEAL to
LOOP_CACHING_WAIT (skipping NOWAIT after we've cached free space for all
block groups), and if we get to the end of that loop as well (how?  I
couldn't quite figure out, but it only seems to happen under high
contention) we'll advance to LOOP_ALLOC_CHUNK and end up unnecessarily
allocating a new chunk.

Patch 16 makes sure we don't jump ahead during LOOP_CACHING_WAIT, so we
won't get new chunks unless they can really help us keep the system
going.
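
In pseudo-C, the staged retry I'm describing looks roughly like this.
It is a from-memory sketch of find_free_extent()'s loop stages, not the
actual code: struct alloc_request and try_all_block_groups() are
invented stand-ins, and LOOP_CACHING_NOWAIT / LOOP_NO_EMPTY_SIZE are
the neighbouring stages as I recall them from that era's extent-tree.c.

#include <errno.h>
#include <stdbool.h>

/* Invented stand-in for the allocator's state; not a btrfs type. */
struct alloc_request {
	unsigned long long num_bytes;
	int passes_until_success;       /* lets the sketch terminate */
};

enum find_loop {
	LOOP_FIND_IDEAL,        /* only block groups whose free-space cache is loaded */
	LOOP_CACHING_NOWAIT,    /* start caching block groups, but don't wait */
	LOOP_CACHING_WAIT,      /* wait for free-space caching to complete */
	LOOP_ALLOC_CHUNK,       /* allocate a brand-new chunk (new block group) */
	LOOP_NO_EMPTY_SIZE,     /* last resort: retry without the cluster's empty_size */
};

/* Invented helper: one pass over every block group at the given stage. */
static bool try_all_block_groups(struct alloc_request *req, enum find_loop loop)
{
	(void)loop;
	return --req->passes_until_success <= 0;
}

/* Heavily simplified sketch of the staged retry in find_free_extent(). */
static int find_free_extent_sketch(struct alloc_request *req)
{
	enum find_loop loop = LOOP_FIND_IDEAL;

	for (;;) {
		if (try_all_block_groups(req, loop))
			return 0;               /* found space at this stage */

		/*
		 * Falling off the end of LOOP_CACHING_WAIT advances us to
		 * LOOP_ALLOC_CHUNK, i.e. we grow the filesystem with block
		 * groups we may not actually need; patch 16 is meant to
		 * prevent that by not jumping ahead while caching could
		 * still help.
		 */
		if (loop == LOOP_NO_EMPTY_SIZE)
			return -ENOSPC;         /* every stage failed */
		loop++;
	}
}

int main(void)
{
	struct alloc_request req = { .num_bytes = 4096, .passes_until_success = 3 };
	return find_free_extent_sketch(&req);
}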

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer


Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-11-29 Thread Christian Brunner
2011/11/28 Alexandre Oliva ol...@lsd.ic.unicamp.br:
 We're failing to create clusters with bitmaps because
 setup_cluster_no_bitmap checks that the list is empty before inserting
 the bitmap entry in the list for setup_cluster_bitmap, but the list
 field is only initialized when it is restored from the on-disk free
 space cache, or when it is written out to disk.

 Besides a potential race condition due to the multiple use of the list
 field, filesystem performance severely degrades over time: as we use
 up all non-bitmap free extents, the try-to-set-up-cluster dance is
 done at every metadata block allocation.  For every block group, we
 fail to set up a cluster, and after failing on them all up to twice,
 we fall back to the much slower unclustered allocation.

This matches exactly what I've been observing in our ceph cluster.
I've now installed your patches (1-11) on two servers.
The cluster setup problem seems to be gone. - A big thanks for that!

However, another thing is causing me some headache:

When I'm doing heavy reading in our ceph cluster, the load and wait-io
on the patched servers is higher than on the unpatched ones.

Dstat from an unpatched server:

---total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  1   6  83   8   0   1|  22M  348k| 336k   93M|   0 0 |8445  3715
  1   5  87   7   0   1|  12M 1808k| 214k   65M|   0 0 |5461  1710
  1   3  85  10   0   0|  11M  640k| 313k   49M|   0 0 |5919  2853
  1   6  84   9   0   1|  12M  608k| 358k   69M|   0 0 |7406  3645
  1   7  78  13   0   1|  15M 5344k| 348k  105M|   0 0 |9765  4403
  1   7  80  10   0   1|  22M 1368k| 358k   89M|   0 0 |8036  3202
  1   9  72  16   0   1|  22M 2424k| 646k  137M|   0 0 |  12k 5527

Dstat from a patched server:

---total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  1   2  61  35   0   0|2500k 2736k| 141k   34M|   0 0 |4415  1603
  1   4  48  47   0   1|  10M 3924k| 353k   61M|   0 0 |6871  3771
  1   5  55  38   0   1|  10M 1728k| 385k   92M|   0 0 |8030  2617
  2   8  69  20   0   1|  18M 1384k| 435k  130M|   0 0 |  10k 4493
  1   5  85   8   0   1|7664k   84k| 287k   97M|   0 0 |6231  1357
  1   3  91   5   0   0|  10M  144k| 194k   44M|   0 0 |3807  1081
  1   7  66  25   0   1|  20M 1248k| 404k  101M|   0 0 |8676  3632
  0   3  38  58   0   0|8104k 2660k| 176k   40M|   0 0 |4841  2093


This seems to be coming from btrfs-endio-1, a kernel thread that has
not caught my attention on unpatched systems yet.

I did some tracing on that process with ftrace and I can see that the
time is wasted in end_bio_extent_readpage(). In a single call to
end_bio_extent_readpage(), the functions unlock_extent_cached(),
unlock_page() and btrfs_readpage_end_io_hook() are each invoked 128
times.

Do you have any idea what's going on here?

(Please note that the filesystem is still unmodified - metadata
overhead is large).

Thanks,
Christian


[PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-11-28 Thread Alexandre Oliva
We're failing to create clusters with bitmaps because
setup_cluster_no_bitmap checks that the list is empty before inserting
the bitmap entry in the list for setup_cluster_bitmap, but the list
field is only initialized when it is restored from the on-disk free
space cache, or when it is written out to disk.

Besides a potential race condition due to the multiple use of the list
field, filesystem performance severely degrades over time: as we use
up all non-bitmap free extents, the try-to-set-up-cluster dance is
done at every metadata block allocation.  For every block group, we
fail to set up a cluster, and after failing on them all up to twice,
we fall back to the much slower unclustered allocation.

To make matters worse, before the unclustered allocation, we try to
create new block groups until we reach the 1% threshold, which
introduces additional bitmaps and thus block groups that we'll iterate
over at each metadata block request.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/free-space-cache.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 6e5b7e4..ff179b1 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1470,6 +1470,7 @@ static void add_new_bitmap(struct btrfs_free_space_ctl *ctl,
 {
 	info->offset = offset_to_bitmap(ctl, offset);
 	info->bytes = 0;
+	INIT_LIST_HEAD(&info->list);
 	link_free_space(ctl, info);
 	ctl->total_bitmaps++;
 
-- 
1.7.4.4
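
For reference, the effect of the missing INIT_LIST_HEAD() can be shown
with a few lines of standalone C. This is only a sketch using userspace
stand-ins for the kernel's list helpers, not btrfs code: list_empty()
considers a list head empty when it points at itself, so an entry whose
list field was never initialized (zero-filled or stale) does not look
empty, and the list_empty() check in setup_cluster_no_bitmap() described
above would then never queue the bitmap entry for setup_cluster_bitmap().

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-ins for the kernel's list primitives. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;      /* an empty list points at itself */
	head->prev = head;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

int main(void)
{
	struct list_head uninitialized, initialized;

	/* A zero-filled list field: next == NULL, so not "empty". */
	memset(&uninitialized, 0, sizeof(uninitialized));
	printf("zero-filled  list_empty(): %d\n", list_empty(&uninitialized));

	/* After INIT_LIST_HEAD(), as the patch adds in add_new_bitmap(). */
	INIT_LIST_HEAD(&initialized);
	printf("initialized  list_empty(): %d\n", list_empty(&initialized));
	return 0;
}

With the field left at whatever the allocator handed out, list_empty()
essentially never reports true, so bitmap entries are never handed to
setup_cluster_bitmap() and block groups whose free space lives only in
bitmaps cannot contribute to a cluster, which matches the degradation
described in the commit message.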
