[Cluster-devel] [GFS2 PATCH 3/4] GFS2: Only increase rs_sizehint
If an application does a sequence of (1) big write, (2) little write,
we don't necessarily want to reset the size hint based on the smaller
size. The fact that the application did any big writes implies it may
do more, and therefore we should try to allocate bigger block
reservations, even if the last few writes were small. This patch
therefore changes function gfs2_size_hint so that the size hint can
only grow; it cannot shrink. This is especially important where there
are multiple writers.
---
 fs/gfs2/file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 2976019..5c7a9c1 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -337,7 +337,8 @@ static void gfs2_size_hint(struct file *filep, loff_t offset, size_t size)
 	size_t blks = (size + sdp->sd_sb.sb_bsize - 1) >> sdp->sd_sb.sb_bsize_shift;
 	int hint = min_t(size_t, INT_MAX, blks);
 
-	atomic_set(&ip->i_res->rs_sizehint, hint);
+	if (hint > atomic_read(&ip->i_res->rs_sizehint))
+		atomic_set(&ip->i_res->rs_sizehint, hint);
 }
 
 /**
-- 
1.9.3
[Cluster-devel] [GFS2 PATCH 2/4] GFS2: Make block reservations more persistent
Before this patch, whenever a struct file (opened to allow writes) was
closed, the multi-block reservation structure associated with the
inode was deleted. That's a problem, especially when there are
multiple writers: applications that do open-write-close will suffer
greater levels of fragmentation and need to redo work to perform
write operations. This patch removes the reservation deletion from the
file close code, so that reservations persist until the inode is
deleted.
---
 fs/gfs2/file.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 7f4ed3d..2976019 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -616,15 +616,8 @@ static int gfs2_open(struct inode *inode, struct file *file)
 
 static int gfs2_release(struct inode *inode, struct file *file)
 {
-	struct gfs2_inode *ip = GFS2_I(inode);
-
 	kfree(file->private_data);
 	file->private_data = NULL;
-
-	if (!(file->f_mode & FMODE_WRITE))
-		return 0;
-
-	gfs2_rs_delete(ip, &inode->i_writecount);
 	return 0;
 }
 
-- 
1.9.3
[Cluster-devel] [GFS2 PATCH 0/4] Patches to reduce GFS2 fragmentation
Hi,

On October 8, I posted a GFS2 patch that greatly reduced inter-node
contention for resource group glocks. The patch was called:

GFS2: Set of distributed preferences for rgrps

It implemented a new scheme whereby each node in a cluster tries to
keep to itself for allocations; this is not unlike GFS1, which
achieved the same end with a different scheme. Although the patch
sped up GFS2 performance in general, it also caused more file
fragmentation, because each node tended to focus on a smaller subset
of resource groups.

Here are run times and file fragmentation extent counts for my
favorite customer application, using a STOCK RHEL7 kernel (no
patches):

Run times:
Run 1 time: 2hr 40min 33sec
Run 2 time: 2hr 39min 52sec
Run 3 time: 2hr 39min 31sec
Run 4 time: 2hr 33min 57sec
Run 5 time: 2hr 41min 6sec

Total file extents (file fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 744708
EXTENT COUNT FOR OUTPUT FILES = 749868
EXTENT COUNT FOR OUTPUT FILES = 721862
EXTENT COUNT FOR OUTPUT FILES = 635301
EXTENT COUNT FOR OUTPUT FILES = 689263

Both the times and the fragmentation level are bad. If I add just the
first patch, GFS2: Set of distributed preferences for rgrps, the
performance improves but the fragmentation gets worse (I only did
three iterations this time):

Run times:
Run 1 time: 2hr 2min 47sec
Run 2 time: 2hr 8min 37sec
Run 3 time: 2hr 10min 0sec

Total file extents (file fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 1011217
EXTENT COUNT FOR OUTPUT FILES = 1025973
EXTENT COUNT FOR OUTPUT FILES = 1070163

So the patch improved performance by 25 percent, but file
fragmentation is 30 percent worse. Some of this is undoubtedly hidden
by the SAN array's buffering, which covers our multitude of sins, but
not every customer will have a SAN of this quality. So it's important
to reduce the fragmentation as well, so that the patch doesn't help
some customers while hurting others. Toward this end, I devised three
relatively simple patches that greatly reduce file fragmentation.
With all four patches, the numbers are as follows:

Run times:
Run 1 time: 2hr 5min 46sec
Run 2 time: 2hr 10min 15sec
Run 3 time: 2hr 8min 4sec
Run 4 time: 2hr 9min 27sec
Run 5 time: 2hr 6min 15sec

Total file extents (file fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 330276
EXTENT COUNT FOR OUTPUT FILES = 358939
EXTENT COUNT FOR OUTPUT FILES = 375374
EXTENT COUNT FOR OUTPUT FILES = 383071
EXTENT COUNT FOR OUTPUT FILES = 369269

As you can see, with this combination of four patches, the run times
are good, and so are the file fragmentation levels. Fragmentation is
about twice as good as the stock kernel, and significantly better
(almost three times better) than with the first patch alone.

This patch set includes all four patches.

Bob Peterson (4):
  GFS2: Set of distributed preferences for rgrps
  GFS2: Make block reservations more persistent
  GFS2: Only increase rs_sizehint
  GFS2: If we use up our block reservation, request more next time

 fs/gfs2/file.c       | 10 ++--------
 fs/gfs2/incore.h     |  2 ++
 fs/gfs2/lock_dlm.c   |  2 ++
 fs/gfs2/ops_fstype.c |  1 +
 fs/gfs2/rgrp.c       | 69 ++++++++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 71 insertions(+), 13 deletions(-)

-- 
1.9.3
[Cluster-devel] [GFS2 PATCH 1/4] GFS2: Set of distributed preferences for rgrps
This patch tries to use the journal numbers to evenly distribute which
node prefers which resource groups for block allocations. This is to
help performance.
---
 fs/gfs2/incore.h     |  2 ++
 fs/gfs2/lock_dlm.c   |  2 ++
 fs/gfs2/ops_fstype.c |  1 +
 fs/gfs2/rgrp.c       | 66 +++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 39e7e99..618d20a 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -97,6 +97,7 @@ struct gfs2_rgrpd {
 #define GFS2_RDF_CHECK		0x10000000 /* check for unlinked inodes */
 #define GFS2_RDF_UPTODATE	0x20000000 /* rg is up to date */
 #define GFS2_RDF_ERROR		0x40000000 /* error in rg */
+#define GFS2_RDF_PREFERRED	0x80000000 /* This rgrp is preferred */
 #define GFS2_RDF_MASK		0xf0000000 /* mask for internal flags */
 	spinlock_t rd_rsspin;           /* protects reservation related vars */
 	struct rb_root rd_rstree;       /* multi-block reservation tree */
@@ -808,6 +809,7 @@ struct gfs2_sbd {
 
 	char sd_table_name[GFS2_FSNAME_LEN];
 	char sd_proto_name[GFS2_FSNAME_LEN];
+	int sd_nodes;
 
 	/* Debugging crud */
 
 	unsigned long sd_last_warning;
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 641383a..5aeb03a 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1113,6 +1113,8 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
 	struct gfs2_sbd *sdp = arg;
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 
+	BUG_ON(num_slots == 0);
+	sdp->sd_nodes = num_slots;
 	/* ensure the ls jid arrays are large enough */
 	set_recover_size(sdp, slots, num_slots);
 
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index d3eae24..bf3193f 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -134,6 +134,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
 	atomic_set(&sdp->sd_log_freeze, 0);
 	atomic_set(&sdp->sd_frozen_root, 0);
 	init_waitqueue_head(&sdp->sd_frozen_root_wait);
+	sdp->sd_nodes = 1;
 
 	return sdp;
 }
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7474c41..50cdba2 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -936,7 +936,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
 		rgd->rd_gl->gl_vm.start = rgd->rd_addr * bsize;
 		rgd->rd_gl->gl_vm.end = rgd->rd_gl->gl_vm.start + (rgd->rd_length * bsize) - 1;
 		rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
-		rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
+		rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
 		if (rgd->rd_data > sdp->sd_max_rg_data)
 			sdp->sd_max_rg_data = rgd->rd_data;
 		spin_lock(&sdp->sd_rindex_spin);
@@ -955,6 +955,36 @@ fail:
 }
 
 /**
+ * set_rgrp_preferences - Run all the rgrps, selecting some we prefer to use
+ * @sdp: the GFS2 superblock
+ *
+ * The purpose of this function is to select a subset of the resource groups
+ * and mark them as PREFERRED. We do it in such a way that each node prefers
+ * to use a unique set of rgrps to minimize glock contention.
+ */
+static void set_rgrp_preferences(struct gfs2_sbd *sdp)
+{
+	struct gfs2_rgrpd *rgd, *first;
+	int i;
+
+	/* Skip an initial number of rgrps, based on this node's journal ID.
+	   That should start each node out on its own set. */
+	rgd = gfs2_rgrpd_get_first(sdp);
+	for (i = 0; i < sdp->sd_lockstruct.ls_jid; i++)
+		rgd = gfs2_rgrpd_get_next(rgd);
+	first = rgd;
+
+	do {
+		rgd->rd_flags |= GFS2_RDF_PREFERRED;
+		for (i = 0; i < sdp->sd_nodes; i++) {
+			rgd = gfs2_rgrpd_get_next(rgd);
+			if (rgd == first)
+				break;
+		}
+	} while (rgd != first);
+}
+
+/**
  * gfs2_ri_update - Pull in a new resource index from the disk
  * @ip: pointer to the rindex inode
  *
@@ -973,6 +1003,8 @@ static int gfs2_ri_update(struct gfs2_inode *ip)
 	if (error < 0)
 		return error;
 
+	set_rgrp_preferences(sdp);
+
 	sdp->sd_rindex_uptodate = 1;
 	return 0;
 }
@@ -1891,6 +1923,25 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, const struct gfs2_rgrpd *b
 }
 
 /**
+ * fast_to_acquire - determine if a resource group will be fast to acquire
+ *
+ * If this is one of our preferred rgrps, it should be quicker to acquire,
+ * because we tried to set ourselves up as dlm lock master.
+ */
+static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
+{
+	struct gfs2_glock *gl = rgd->rd_gl;
+
+	if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
+	    !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
+	    !test_bit(GLF_DEMOTE, &gl->gl_flags))
+		return 1;
+	if (rgd->rd_flags & GFS2_RDF_PREFERRED)
+		return 1;
+	return 0;
+}
+