Re: [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
- Original Message -
I'll see if I can track down why the rgrplvb option isn't performing as well. I suspect the matter goes back to my first comment above: namely, that the slowdown goes back to the slowness of page cache lookup for the buffers of the rgrps we are using (not rejected ones).

Hi,

I did verify that the vast majority of time is spent doing page cache lookups, in function gfs2_getbuf(). The time spent in gfs2_meta_read() and gfs2_meta_wait() is minimal--almost nil--presumably because the page is already in cache. The rgrplvb option isn't improving this because it still calls gfs2_rgrp_bh_get(), which still does all the page cache lookups for the rgrp we need. The rgrplvb option should, in theory, save us a lot of time when searching for a suitable rgrp, especially when there are multiple nodes doing allocations.

I can't think of any reason why my patch would be incompatible with rgrplvb, but I'd like to ask Ben Marzinski to scrutinize my patch carefully and see if he can find any flaws in the design.

Regards,

Bob Peterson
Red Hat File Systems
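The cost being discussed can be illustrated with a toy user-space model (illustrative only; `lookup_buffer` and the counters are hypothetical stand-ins, not kernel API): releasing all 33 rgrp buffers after every allocation means 33 fresh cache lookups per operation, while retaining the pointers, as the patch does, costs one batch of lookups in total.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the measured cost: 33 bitmap buffers per rgrp.
   All names here are illustrative, not kernel API. */
#define NBUFS 33

static unsigned long lookups; /* counts simulated gfs2_getbuf()-style calls */

static int *lookup_buffer(int cache[], size_t i)
{
	lookups++;		/* each call models one page cache lookup */
	return &cache[i];
}

/* Strategy A: brelse after every op - re-look-up all buffers each time. */
static unsigned long run_release_each_op(unsigned ops)
{
	int cache[NBUFS] = {0};
	unsigned op;
	size_t i;

	lookups = 0;
	for (op = 0; op < ops; op++)
		for (i = 0; i < NBUFS; i++)
			(void)lookup_buffer(cache, i);
	return lookups;
}

/* Strategy B (the patch's approach): look buffers up once, keep pointers. */
static unsigned long run_retain(unsigned ops)
{
	int cache[NBUFS] = {0};
	int *held[NBUFS];
	unsigned op;
	size_t i;

	lookups = 0;
	for (i = 0; i < NBUFS; i++)
		held[i] = lookup_buffer(cache, i);
	for (op = 0; op < ops; op++)
		for (i = 0; i < NBUFS; i++)
			(void)*held[i];	/* use retained pointer, no lookup */
	return lookups;
}
```

Over 1000 allocations the release-each-time strategy performs 33,000 lookups against 33 for the retaining strategy, which matches the observation that gfs2_getbuf() dominates the profile even when no I/O is done.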
Re: [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
Hi,

On 15/06/15 15:43, Bob Peterson wrote:
- Original Message -
I'm assuming that these figures are bandwidth rather than times, since that appears to show that the patch makes quite a large difference. However, the reclen is rather small. In the 32-byte case, that's 128 writes for each new block that's being allocated; unless of course that is 32k? Steve.

Hi,

To do this test, I'm executing this command:

numactl --cpunodebind=0 --membind=0 /home/bob/iozone/iozone3_429/src/current/iozone -az -f /mnt/gfs2/iozone-gfs2 -n 2048m -g 2048m -y 32k -q 1m -e -i 0 -+n /home/bob/iozone.out

According to iozone -h, specifying -y this way is 32K, not 32 bytes. The -q is the maximum write size, in KB, so -q 1m is 1MB writes. The -g is the maximum file size, in KB unless suffixed, so -g 2048m is a 2GB file. So the test varies the write size from 32K to 1MB, adjusting the number of writes to bring the file to 2GB.

Regards,

Bob Peterson
Red Hat File Systems

Ok, that makes sense to me. I guess that there is not a lot of searching for the rgrps going on, which is why we are not seeing a big gain in the rgrplvb case. If the fs was nearly full, perhaps we'd see more of a difference in that case. Either way, provided the rgrplvb code can continue to work, that is really the important thing, so I think we should be ok with this. I'd still very much like to see what can be done to reduce the page look-up cost more directly though, since that seems to be the real issue here,

Steve.
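The arithmetic behind the record-length question can be sketched quickly (a sanity check of the discussion above, assuming iozone's convention that sizes are in KB unless given an 'm' suffix):

```c
#include <assert.h>

/* With -g 2048m the file is 2 GB (2048 * 1024 KB). The number of writes
 * per pass is simply file size divided by record length. A 32 KB reclen
 * gives 65536 writes; the "128 writes per 4K block" ratio Steve asked
 * about would only arise with 32-byte records, which is not what -y 32k
 * requests. */
static unsigned writes_per_pass(unsigned file_kb, unsigned reclen_kb)
{
	return file_kb / reclen_kb;
}
```

For example, writes_per_pass(2048 * 1024, 32) yields 65536, and at the maximum 1MB record size (reclen 1024 KB) the pass takes 2048 writes.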
Re: [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
Hi,

On 12/06/15 20:50, Bob Peterson wrote:
- Original Message -
Hi,

On 09/06/15 15:45, Bob Peterson wrote:
- Original Message -
Hi,

On 05/06/15 15:49, Bob Peterson wrote:
Hi, This patch allows the block allocation code to retain the buffers for the resource groups so they don't need to be re-read from buffer cache with every request. This is a performance improvement that's especially noticeable when resource groups are very large. For example, with 2GB resource groups and 4K blocks, there can be 33 blocks for every resource group. This patch allows those 33 buffers to be kept around and not read in and thrown away with every operation. The buffers are released when the resource group is either synced or invalidated.

The blocks should be cached between operations, so this should only be resulting in a skip of the look-up of the cached block, and no changes to the actual I/O. Does that mean that grab_cache_page() is slow, I wonder? Or is this an issue of going around the retry loop due to lack of memory at some stage? How does this interact with the rgrplvb support? I'd guess that with that turned on, this is no longer an issue, because we'd only read in the blocks for the rgrps that we are actually going to use? Steve.

Hi,

If you compare the two vmstat outputs in bugzilla #1154782, you'll see no significant difference in memory usage or CPU usage. So I assume the page lookup is the slow part; not because it's such a slow thing, but because it's done 33 times per read-reference-invalidate (33 pages to look up per rgrp).

Regards,

Bob Peterson
Red Hat File Systems

That's true; however, as I understand the problem here, the issue is not reading in the blocks for the rgrp that is eventually selected for use, but the reading in of those blocks for the rgrps that we reject, for whatever reason (full, or congested, or whatever).
So with rgrplvb enabled, we don't then read those rgrps in off disk at all in most cases - so I was wondering whether that solves the problem without needing this change? Ideally I'd like to make the rgrplvb setting the default, since it is much more efficient. The question is how we can do that and still remain backward compatible? Not an easy one to answer :(

Also, if the page lookup is the slow thing, then we should look at using pagevec_lookup() to get the pages in chunks rather than doing it individually (and indeed, multiple times per page, in the case of block size less than page size). We know that the blocks will always be contiguous on disk, so we should be able to send down large I/Os, rather than relying on the block stack to merge them as we do at the moment, which should be a further improvement too, Steve.

Hi,

The rgrplvb mount option only helps if the file system is using lock_dlm. For lock_nolock, it's still just as slow because lock_nolock has no knowledge of LVBs. Now, granted, that's an unusual case because GFS2 is normally used with lock_dlm.

That sounds like a bug... it should work in the same way, even with lock_nolock.

I like the idea of making rgrplvb the default mount option, and I don't see a problem doing that.

The issue is that it will not be backwards compatible. We really need a way to at least easily detect if someone has a mixed cluster (some nodes with rgrplvb enabled, and some without); otherwise it might be confusing when we get odd reports of allocations failing, even when there appears to be free space. We need to treat what we put into LVBs in the same way as we treat the on-disk format in terms of backwards compatibility.

I think the rgrplvb option should be compatible with this patch, but I'll set up a test environment in order to test that they work together harmoniously. I also like the idea of using a pagevec for reading in multiple pages for the rgrps, but that's another improvement for another day.
If there's not a bugzilla record open for that, perhaps we should open one.

Regards,

Bob Peterson
Red Hat File Systems

If we have rgrplvb, then we may not need this patch, since we will not be looking up the rgrp's blocks as often. So we should see the benefit just by turning that on, I think... at least it would be good to see whether there is any performance difference there. In cases where we have nodes competing for the rgrps, the blocks will not be cached anyway, so we will gain no benefit from this patch, since we'll have to read the blocks anyway; hence my thought that speeding up the lookup is the way to go, since it will give the benefit for more different cases - both when the rgrp's blocks are cached and uncached, Steve.
Re: [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
- Original Message -
Hi,

On 09/06/15 15:45, Bob Peterson wrote:
- Original Message -
Hi,

On 05/06/15 15:49, Bob Peterson wrote:
Hi, This patch allows the block allocation code to retain the buffers for the resource groups so they don't need to be re-read from buffer cache with every request. This is a performance improvement that's especially noticeable when resource groups are very large. For example, with 2GB resource groups and 4K blocks, there can be 33 blocks for every resource group. This patch allows those 33 buffers to be kept around and not read in and thrown away with every operation. The buffers are released when the resource group is either synced or invalidated.

The blocks should be cached between operations, so this should only be resulting in a skip of the look-up of the cached block, and no changes to the actual I/O. Does that mean that grab_cache_page() is slow, I wonder? Or is this an issue of going around the retry loop due to lack of memory at some stage? How does this interact with the rgrplvb support? I'd guess that with that turned on, this is no longer an issue, because we'd only read in the blocks for the rgrps that we are actually going to use? Steve.

Hi,

If you compare the two vmstat outputs in bugzilla #1154782, you'll see no significant difference in memory usage or CPU usage. So I assume the page lookup is the slow part; not because it's such a slow thing, but because it's done 33 times per read-reference-invalidate (33 pages to look up per rgrp).

Regards,

Bob Peterson
Red Hat File Systems

That's true; however, as I understand the problem here, the issue is not reading in the blocks for the rgrp that is eventually selected for use, but the reading in of those blocks for the rgrps that we reject, for whatever reason (full, or congested, or whatever). So with rgrplvb enabled, we don't then read those rgrps in off disk at all in most cases - so I was wondering whether that solves the problem without needing this change?
Ideally I'd like to make the rgrplvb setting the default, since it is much more efficient. The question is how we can do that and still remain backward compatible? Not an easy one to answer :(

Also, if the page lookup is the slow thing, then we should look at using pagevec_lookup() to get the pages in chunks rather than doing it individually (and indeed, multiple times per page, in the case of block size less than page size). We know that the blocks will always be contiguous on disk, so we should be able to send down large I/Os, rather than relying on the block stack to merge them as we do at the moment, which should be a further improvement too, Steve.

Hi,

The rgrplvb mount option only helps if the file system is using lock_dlm. For lock_nolock, it's still just as slow because lock_nolock has no knowledge of LVBs. Now, granted, that's an unusual case because GFS2 is normally used with lock_dlm.

I like the idea of making rgrplvb the default mount option, and I don't see a problem doing that. I think the rgrplvb option should be compatible with this patch, but I'll set up a test environment in order to test that they work together harmoniously. I also like the idea of using a pagevec for reading in multiple pages for the rgrps, but that's another improvement for another day. If there's not a bugzilla record open for that, perhaps we should open one.

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
Hi,

On 09/06/15 15:45, Bob Peterson wrote:
- Original Message -
Hi,

On 05/06/15 15:49, Bob Peterson wrote:
Hi, This patch allows the block allocation code to retain the buffers for the resource groups so they don't need to be re-read from buffer cache with every request. This is a performance improvement that's especially noticeable when resource groups are very large. For example, with 2GB resource groups and 4K blocks, there can be 33 blocks for every resource group. This patch allows those 33 buffers to be kept around and not read in and thrown away with every operation. The buffers are released when the resource group is either synced or invalidated.

The blocks should be cached between operations, so this should only be resulting in a skip of the look-up of the cached block, and no changes to the actual I/O. Does that mean that grab_cache_page() is slow, I wonder? Or is this an issue of going around the retry loop due to lack of memory at some stage? How does this interact with the rgrplvb support? I'd guess that with that turned on, this is no longer an issue, because we'd only read in the blocks for the rgrps that we are actually going to use? Steve.

Hi,

If you compare the two vmstat outputs in bugzilla #1154782, you'll see no significant difference in memory usage or CPU usage. So I assume the page lookup is the slow part; not because it's such a slow thing, but because it's done 33 times per read-reference-invalidate (33 pages to look up per rgrp).

Regards,

Bob Peterson
Red Hat File Systems

That's true; however, as I understand the problem here, the issue is not reading in the blocks for the rgrp that is eventually selected for use, but the reading in of those blocks for the rgrps that we reject, for whatever reason (full, or congested, or whatever). So with rgrplvb enabled, we don't then read those rgrps in off disk at all in most cases - so I was wondering whether that solves the problem without needing this change?
Ideally I'd like to make the rgrplvb setting the default, since it is much more efficient. The question is how we can do that and still remain backward compatible? Not an easy one to answer :(

Also, if the page lookup is the slow thing, then we should look at using pagevec_lookup() to get the pages in chunks rather than doing it individually (and indeed, multiple times per page, in the case of block size less than page size). We know that the blocks will always be contiguous on disk, so we should be able to send down large I/Os, rather than relying on the block stack to merge them as we do at the moment, which should be a further improvement too,

Steve.
Re: [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
- Original Message -
Hi,

On 05/06/15 15:49, Bob Peterson wrote:
Hi, This patch allows the block allocation code to retain the buffers for the resource groups so they don't need to be re-read from buffer cache with every request. This is a performance improvement that's especially noticeable when resource groups are very large. For example, with 2GB resource groups and 4K blocks, there can be 33 blocks for every resource group. This patch allows those 33 buffers to be kept around and not read in and thrown away with every operation. The buffers are released when the resource group is either synced or invalidated.

The blocks should be cached between operations, so this should only be resulting in a skip of the look-up of the cached block, and no changes to the actual I/O. Does that mean that grab_cache_page() is slow, I wonder? Or is this an issue of going around the retry loop due to lack of memory at some stage? How does this interact with the rgrplvb support? I'd guess that with that turned on, this is no longer an issue, because we'd only read in the blocks for the rgrps that we are actually going to use? Steve.

Hi,

If you compare the two vmstat outputs in bugzilla #1154782, you'll see no significant difference in memory usage or CPU usage. So I assume the page lookup is the slow part; not because it's such a slow thing, but because it's done 33 times per read-reference-invalidate (33 pages to look up per rgrp).

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
Hi,

On 05/06/15 15:49, Bob Peterson wrote:
Hi, This patch allows the block allocation code to retain the buffers for the resource groups so they don't need to be re-read from buffer cache with every request. This is a performance improvement that's especially noticeable when resource groups are very large. For example, with 2GB resource groups and 4K blocks, there can be 33 blocks for every resource group. This patch allows those 33 buffers to be kept around and not read in and thrown away with every operation. The buffers are released when the resource group is either synced or invalidated.

The blocks should be cached between operations, so this should only be resulting in a skip of the look-up of the cached block, and no changes to the actual I/O. Does that mean that grab_cache_page() is slow, I wonder? Or is this an issue of going around the retry loop due to lack of memory at some stage?

How does this interact with the rgrplvb support? I'd guess that with that turned on, this is no longer an issue, because we'd only read in the blocks for the rgrps that we are actually going to use?

Steve.
Signed-off-by: Bob Peterson rpete...@redhat.com
---
 fs/gfs2/glops.c | 14 +++---
 fs/gfs2/rgrp.c  | 23 +++
 fs/gfs2/rgrp.h  |  1 +
 3 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index fe91951..c23377f 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -144,6 +144,12 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
 	struct gfs2_rgrpd *rgd;
 	int error;
 
+	spin_lock(&gl->gl_spin);
+	rgd = gl->gl_object;
+	if (rgd)
+		gfs2_rgrp_brelse(rgd);
+	spin_unlock(&gl->gl_spin);
+
 	if (!test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
 		return;
 	GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);
@@ -175,15 +181,17 @@ static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
 {
 	struct gfs2_sbd *sdp = gl->gl_sbd;
 	struct address_space *mapping = &sdp->sd_aspace;
+	struct gfs2_rgrpd *rgd = gl->gl_object;
+
+	if (rgd)
+		gfs2_rgrp_brelse(rgd);
 
 	WARN_ON_ONCE(!(flags & DIO_METADATA));
 	gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count));
 	truncate_inode_pages_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
 
-	if (gl->gl_object) {
-		struct gfs2_rgrpd *rgd = (struct gfs2_rgrpd *)gl->gl_object;
+	if (rgd)
 		rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
-	}
 }
 
 /**
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index cd53d6e..c6c6232 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1244,14 +1244,13 @@ int gfs2_rgrp_go_lock(struct gfs2_holder *gh)
 }
 
 /**
- * gfs2_rgrp_go_unlock - Release RG bitmaps read in with gfs2_rgrp_bh_get()
- * @gh: The glock holder for the resource group
+ * gfs2_rgrp_brelse - Release RG bitmaps read in with gfs2_rgrp_bh_get()
+ * @rgd: The resource group
  *
  */
 
-void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
+void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd)
 {
-	struct gfs2_rgrpd *rgd = gh->gh_gl->gl_object;
 	int x, length = rgd->rd_length;
 
 	for (x = 0; x < length; x++) {
@@ -1264,6 +1263,22 @@ void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
 
 }
 
+/**
+ * gfs2_rgrp_go_unlock - Unlock a rgrp glock
+ * @gh: The glock holder for the resource group
+ *
+ */
+
+void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
+{
+	struct gfs2_rgrpd *rgd = gh->gh_gl->gl_object;
+	int demote_requested = test_bit(GLF_DEMOTE, &gh->gh_gl->gl_flags) |
+		test_bit(GLF_PENDING_DEMOTE, &gh->gh_gl->gl_flags);
+
+	if (rgd && demote_requested)
+		gfs2_rgrp_brelse(rgd);
+}
+
 int gfs2_rgrp_send_discards(struct gfs2_sbd *sdp, u64 offset,
 			    struct buffer_head *bh,
 			    const struct gfs2_bitmap *bi, unsigned minlen, u64 *ptrimmed)
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index 68972ec..c0ab33f 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -36,6 +36,7 @@ extern void gfs2_clear_rgrpd(struct gfs2_sbd *sdp);
 extern int gfs2_rindex_update(struct gfs2_sbd *sdp);
 extern void gfs2_free_clones(struct gfs2_rgrpd *rgd);
 extern int gfs2_rgrp_go_lock(struct gfs2_holder *gh);
+extern void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd);
 extern void gfs2_rgrp_go_unlock(struct gfs2_holder *gh);
 extern struct gfs2_alloc *gfs2_alloc_get(struct gfs2_inode *ip);
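The key behavioral change in gfs2_rgrp_go_unlock() is that buffers are now released only when another node has asked for the glock. A user-space sketch of that decision (the flag names mirror the kernel's GLF_DEMOTE / GLF_PENDING_DEMOTE, but `should_release_buffers` is a hypothetical model, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the patch's go_unlock decision: retain rgrp buffers across
 * unlocks, and release them only when a demote of the glock has been
 * requested (i.e. another node wants the lock, so our cached buffers
 * are about to be invalidated anyway). */
enum { GLF_DEMOTE = 1 << 0, GLF_PENDING_DEMOTE = 1 << 1 };

static bool should_release_buffers(unsigned glock_flags, bool have_rgd)
{
	bool demote_requested =
		(glock_flags & (GLF_DEMOTE | GLF_PENDING_DEMOTE)) != 0;

	return have_rgd && demote_requested;
}
```

On a single node (or an uncontended rgrp) no demote is ever requested, so the buffers stay held and the 33 page cache lookups per allocation disappear; under contention the behavior degrades gracefully to the old release-every-time pattern.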
[Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
Hi,

This patch allows the block allocation code to retain the buffers for the resource groups so they don't need to be re-read from buffer cache with every request. This is a performance improvement that's especially noticeable when resource groups are very large. For example, with 2GB resource groups and 4K blocks, there can be 33 blocks for every resource group. This patch allows those 33 buffers to be kept around and not read in and thrown away with every operation. The buffers are released when the resource group is either synced or invalidated.

Signed-off-by: Bob Peterson rpete...@redhat.com
---
 fs/gfs2/glops.c | 14 +++---
 fs/gfs2/rgrp.c  | 23 +++
 fs/gfs2/rgrp.h  |  1 +
 3 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index fe91951..c23377f 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -144,6 +144,12 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
 	struct gfs2_rgrpd *rgd;
 	int error;
 
+	spin_lock(&gl->gl_spin);
+	rgd = gl->gl_object;
+	if (rgd)
+		gfs2_rgrp_brelse(rgd);
+	spin_unlock(&gl->gl_spin);
+
 	if (!test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
 		return;
 	GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);
@@ -175,15 +181,17 @@ static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
 {
 	struct gfs2_sbd *sdp = gl->gl_sbd;
 	struct address_space *mapping = &sdp->sd_aspace;
+	struct gfs2_rgrpd *rgd = gl->gl_object;
+
+	if (rgd)
+		gfs2_rgrp_brelse(rgd);
 
 	WARN_ON_ONCE(!(flags & DIO_METADATA));
 	gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count));
 	truncate_inode_pages_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
 
-	if (gl->gl_object) {
-		struct gfs2_rgrpd *rgd = (struct gfs2_rgrpd *)gl->gl_object;
+	if (rgd)
 		rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
-	}
 }
 
 /**
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index cd53d6e..c6c6232 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1244,14 +1244,13 @@ int gfs2_rgrp_go_lock(struct gfs2_holder *gh)
 }
 
 /**
- * gfs2_rgrp_go_unlock - Release RG bitmaps read in with gfs2_rgrp_bh_get()
- * @gh: The glock holder for the resource group
+ * gfs2_rgrp_brelse - Release RG bitmaps read in with gfs2_rgrp_bh_get()
+ * @rgd: The resource group
  *
  */
 
-void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
+void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd)
 {
-	struct gfs2_rgrpd *rgd = gh->gh_gl->gl_object;
 	int x, length = rgd->rd_length;
 
 	for (x = 0; x < length; x++) {
@@ -1264,6 +1263,22 @@ void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
 
 }
 
+/**
+ * gfs2_rgrp_go_unlock - Unlock a rgrp glock
+ * @gh: The glock holder for the resource group
+ *
+ */
+
+void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
+{
+	struct gfs2_rgrpd *rgd = gh->gh_gl->gl_object;
+	int demote_requested = test_bit(GLF_DEMOTE, &gh->gh_gl->gl_flags) |
+		test_bit(GLF_PENDING_DEMOTE, &gh->gh_gl->gl_flags);
+
+	if (rgd && demote_requested)
+		gfs2_rgrp_brelse(rgd);
+}
+
 int gfs2_rgrp_send_discards(struct gfs2_sbd *sdp, u64 offset,
 			    struct buffer_head *bh,
 			    const struct gfs2_bitmap *bi, unsigned minlen, u64 *ptrimmed)
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index 68972ec..c0ab33f 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -36,6 +36,7 @@ extern void gfs2_clear_rgrpd(struct gfs2_sbd *sdp);
 extern int gfs2_rindex_update(struct gfs2_sbd *sdp);
 extern void gfs2_free_clones(struct gfs2_rgrpd *rgd);
 extern int gfs2_rgrp_go_lock(struct gfs2_holder *gh);
+extern void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd);
 extern void gfs2_rgrp_go_unlock(struct gfs2_holder *gh);
 extern struct gfs2_alloc *gfs2_alloc_get(struct gfs2_inode *ip);
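The "33 blocks per resource group" figure from the patch description can be roughly reproduced (a sketch: GFS2 tracks each filesystem block with 2 bitmap bits, and each bitmap block carries a metadata header; the uniform 24-byte header used here is an approximation, since the first block's rgrp header is actually larger):

```c
#include <assert.h>

/* Estimate how many bitmap blocks a resource group needs.
 * 2 bits per tracked block -> blocks/4 bytes of bitmap, spread over
 * blocks that each lose ~24 bytes to a metadata header. This is an
 * approximation of the on-disk layout, not the kernel's exact math. */
static unsigned rgrp_bitmap_blocks(unsigned long long rgrp_bytes,
				   unsigned block_size)
{
	unsigned long long blocks = rgrp_bytes / block_size; /* blocks per rgrp */
	unsigned long long bitmap_bytes = blocks / 4;	     /* 2 bits per block */
	unsigned usable = block_size - 24;		     /* minus header */

	return (unsigned)((bitmap_bytes + usable - 1) / usable);
}
```

For a 2GB resource group with 4K blocks: 524288 blocks need 131072 bytes of bitmap, which at ~4072 usable bytes per block rounds up to 33 blocks - the number of buffer_heads the patch keeps around instead of releasing on every allocation.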