[Cluster-devel] [GFS2 PATCH 3/4] GFS2: Only increase rs_sizehint

2014-10-20 Thread Bob Peterson
If an application does a sequence of (1) big write, (2) little write,
we don't necessarily want to reset the size hint based on the smaller
size. The fact that the application did any big writes implies it may
do more of them, so we should keep trying to allocate bigger block
reservations, even if the last few writes were small. This patch
therefore changes function gfs2_size_hint so that the size hint can
only grow; it cannot shrink. This is especially important where there
are multiple writers.
---
 fs/gfs2/file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 2976019..5c7a9c1 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -337,7 +337,8 @@ static void gfs2_size_hint(struct file *filep, loff_t offset, size_t size)
 	size_t blks = (size + sdp->sd_sb.sb_bsize - 1) >> sdp->sd_sb.sb_bsize_shift;
 	int hint = min_t(size_t, INT_MAX, blks);
 
-	atomic_set(&ip->i_res->rs_sizehint, hint);
+	if (hint > atomic_read(&ip->i_res->rs_sizehint))
+		atomic_set(&ip->i_res->rs_sizehint, hint);
 }
 
 /**
-- 
1.9.3
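
A note on the hint update above: the read-then-set pair is not atomic, so
two racing writers could in principle still lower the hint. That race is
harmless because rs_sizehint is purely advisory, but the grow-only semantics
could be made exact with a compare-and-swap loop. Below is a minimal
user-space sketch of that alternative using C11 atomics; it is only an
illustration of the semantics, not what the patch does:

#include <stdatomic.h>
#include <stdio.h>

/* User-space model of rs_sizehint: raise the hint to 'hint' only if it
 * is larger than the current value. The CAS loop makes the maximum
 * exact even with concurrent callers. */
static void size_hint_grow(atomic_int *sizehint, int hint)
{
	int cur = atomic_load(sizehint);

	while (hint > cur &&
	       !atomic_compare_exchange_weak(sizehint, &cur, hint))
		;	/* failed CAS refreshed 'cur'; retry */
}

int main(void)
{
	atomic_int sizehint = 0;

	size_hint_grow(&sizehint, 256);	/* big write grows the hint */
	size_hint_grow(&sizehint, 8);	/* small write leaves it alone */
	printf("rs_sizehint = %d\n", atomic_load(&sizehint));	/* prints 256 */
	return 0;
}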



[Cluster-devel] [GFS2 PATCH 2/4] GFS2: Make block reservations more persistent

2014-10-20 Thread Bob Peterson
Before this patch, whenever a struct file (opened to allow writes) was
closed, the multi-block reservation structure associated with the inode
was deleted. That's a problem, especially when there are multiple writers.
Applications that do open-write-close sequences suffer greater levels
of fragmentation and must redo the reservation work on each subsequent
write. This patch removes the reservation delete from the file close
code so that reservations persist until the inode itself is deleted.
---
 fs/gfs2/file.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 7f4ed3d..2976019 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -616,15 +616,8 @@ static int gfs2_open(struct inode *inode, struct file *file)
 
 static int gfs2_release(struct inode *inode, struct file *file)
 {
-	struct gfs2_inode *ip = GFS2_I(inode);
-
 	kfree(file->private_data);
 	file->private_data = NULL;
-
-	if (!(file->f_mode & FMODE_WRITE))
-		return 0;
-
-	gfs2_rs_delete(ip, &inode->i_writecount);
 	return 0;
 }
 
-- 
1.9.3
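
To make the failure mode concrete: the workload this helps is the classic
open-write-close loop. Below is a minimal user-space sketch of that pattern
(the path is hypothetical). Before this patch, every close() dropped the
inode's multi-block reservation, so each pass had to build a new reservation
and tended to allocate far from the previous blocks; with the patch, the
reservation survives across the close:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char buf[512] = { 0 };
	int i;

	/* Append a little data, closing and reopening each time. */
	for (i = 0; i < 1000; i++) {
		int fd = open("/mnt/gfs2/outfile",
			      O_WRONLY | O_APPEND | O_CREAT, 0644);
		if (fd < 0)
			return 1;
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			close(fd);
			return 1;
		}
		close(fd);	/* formerly triggered gfs2_rs_delete() */
	}
	return 0;
}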



[Cluster-devel] [GFS2 PATCH 0/4] Patches to reduce GFS2 fragmentation

2014-10-20 Thread Bob Peterson
Hi,

On October 8, I posted a GFS2 patch that greatly reduced inter-node
contention for resource group glocks. The patch was called
"GFS2: Set of distributed preferences for rgrps". It implemented a
new scheme whereby each node in a cluster tries to keep to its own set
of resource groups for allocations. This is not unlike GFS1, which
accomplished the same thing with a different scheme.

Although the patch sped up GFS2 performance in general, it also caused
more file fragmentation, because each node tended to focus on a smaller
subset of resource groups.

Here are run times and file fragmentation extent counts for my favorite
customer application, using a STOCK RHEL7 kernel (no patches):

Run times:
Run 1 time: 2hr 40min 33sec
Run 2 time: 2hr 39min 52sec
Run 3 time: 2hr 39min 31sec
Run 4 time: 2hr 33min 57sec
Run 5 time: 2hr 41min 6sec

Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES =  744708
EXTENT COUNT FOR OUTPUT FILES =  749868
EXTENT COUNT FOR OUTPUT FILES =  721862
EXTENT COUNT FOR OUTPUT FILES =  635301
EXTENT COUNT FOR OUTPUT FILES =  689263

The times are bad and the fragmentation level is also bad. If I add
just the first patch, "GFS2: Set of distributed preferences for rgrps",
you can see that the performance improves, but the fragmentation gets
worse (I only did three iterations this time):

Run times:
Run 1 time: 2hr 2min 47sec
Run 2 time: 2hr 8min 37sec
Run 3 time: 2hr 10min 0sec

Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES =  1011217
EXTENT COUNT FOR OUTPUT FILES =  1025973
EXTENT COUNT FOR OUTPUT FILES =  1070163

So the patch improved performance by about 25 percent, but file fragmentation
is about 30 percent worse. Some of the performance gain is undoubtedly due to
the SAN array's buffering hiding our multitude of sins, but not every customer
will have a SAN of that quality. So it's important to reduce the fragmentation
as well, so that the patch doesn't help some people while hurting others.

Toward this end, I devised three relatively simple patches that greatly
reduce file fragmentation. With all four patches, the numbers are as follows:

Run times:
Run 1 time: 2hr 5min 46sec
Run 2 time: 2hr 10min 15sec
Run 3 time: 2hr 8min 4sec
Run 4 time: 2hr 9min 27sec
Run 5 time: 2hr 6min 15sec

Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES =  330276
EXTENT COUNT FOR OUTPUT FILES =  358939
EXTENT COUNT FOR OUTPUT FILES =  375374
EXTENT COUNT FOR OUTPUT FILES =  383071
EXTENT COUNT FOR OUTPUT FILES =  369269

As you can see, with this combination of four patches, the run times are
good, and so are the file fragmentation levels. The extent counts are
roughly half those of the stock kernel, and almost a third of those
produced with the first patch alone.

This patch set includes all four patches.

Bob Peterson (4):
  GFS2: Set of distributed preferences for rgrps
  GFS2: Make block reservations more persistent
  GFS2: Only increase rs_sizehint
  GFS2: If we use up our block reservation, request more next time

 fs/gfs2/file.c   | 10 ++--
 fs/gfs2/incore.h |  2 ++
 fs/gfs2/lock_dlm.c   |  2 ++
 fs/gfs2/ops_fstype.c |  1 +
 fs/gfs2/rgrp.c   | 69 
 5 files changed, 71 insertions(+), 13 deletions(-)

-- 
1.9.3



[Cluster-devel] [GFS2 PATCH 1/4] GFS2: Set of distributed preferences for rgrps

2014-10-20 Thread Bob Peterson
This patch uses each node's journal ID to evenly distribute which
resource groups that node prefers for block allocations. This is to
help performance.
---
 fs/gfs2/incore.h |  2 ++
 fs/gfs2/lock_dlm.c   |  2 ++
 fs/gfs2/ops_fstype.c |  1 +
 fs/gfs2/rgrp.c   | 66 
 4 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 39e7e99..618d20a 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -97,6 +97,7 @@ struct gfs2_rgrpd {
 #define GFS2_RDF_CHECK		0x10000000 /* check for unlinked inodes */
 #define GFS2_RDF_UPTODATE	0x20000000 /* rg is up to date */
 #define GFS2_RDF_ERROR		0x40000000 /* error in rg */
+#define GFS2_RDF_PREFERRED	0x80000000 /* This rgrp is preferred */
 #define GFS2_RDF_MASK		0xf0000000 /* mask for internal flags */
 	spinlock_t rd_rsspin;		/* protects reservation related vars */
 	struct rb_root rd_rstree;	/* multi-block reservation tree */
@@ -808,6 +809,7 @@ struct gfs2_sbd {
 	char sd_table_name[GFS2_FSNAME_LEN];
 	char sd_proto_name[GFS2_FSNAME_LEN];
 
+	int sd_nodes;
 	/* Debugging crud */
 
 	unsigned long sd_last_warning;
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 641383a..5aeb03a 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1113,6 +1113,8 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
 	struct gfs2_sbd *sdp = arg;
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 
+	BUG_ON(num_slots == 0);
+	sdp->sd_nodes = num_slots;
 	/* ensure the ls jid arrays are large enough */
 	set_recover_size(sdp, slots, num_slots);
 
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index d3eae24..bf3193f 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -134,6 +134,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
 	atomic_set(&sdp->sd_log_freeze, 0);
 	atomic_set(&sdp->sd_frozen_root, 0);
 	init_waitqueue_head(&sdp->sd_frozen_root_wait);
+	sdp->sd_nodes = 1;
 
return sdp;
 }
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7474c41..50cdba2 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -936,7 +936,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
 	rgd->rd_gl->gl_vm.start = rgd->rd_addr * bsize;
 	rgd->rd_gl->gl_vm.end = rgd->rd_gl->gl_vm.start + (rgd->rd_length * bsize) - 1;
 	rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
-	rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
+	rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
 	if (rgd->rd_data > sdp->sd_max_rg_data)
 		sdp->sd_max_rg_data = rgd->rd_data;
 	spin_lock(&sdp->sd_rindex_spin);
@@ -955,6 +955,36 @@ fail:
 }
 
 /**
+ * set_rgrp_preferences - Run all the rgrps, selecting some we prefer to use
+ * @sdp: the GFS2 superblock
+ *
+ * The purpose of this function is to select a subset of the resource groups
+ * and mark them as PREFERRED. We do it in such a way that each node prefers
+ * to use a unique set of rgrps to minimize glock contention.
+ */
+static void set_rgrp_preferences(struct gfs2_sbd *sdp)
+{
+	struct gfs2_rgrpd *rgd, *first;
+	int i;
+
+	/* Skip an initial number of rgrps, based on this node's journal ID.
+	   That should start each node out on its own set. */
+	rgd = gfs2_rgrpd_get_first(sdp);
+	for (i = 0; i < sdp->sd_lockstruct.ls_jid; i++)
+		rgd = gfs2_rgrpd_get_next(rgd);
+	first = rgd;
+
+	do {
+		rgd->rd_flags |= GFS2_RDF_PREFERRED;
+		for (i = 0; i < sdp->sd_nodes; i++) {
+			rgd = gfs2_rgrpd_get_next(rgd);
+			if (rgd == first)
+				break;
+		}
+	} while (rgd != first);
+}
+
+/**
  * gfs2_ri_update - Pull in a new resource index from the disk
  * @ip: pointer to the rindex inode
  *
@@ -973,6 +1003,8 @@ static int gfs2_ri_update(struct gfs2_inode *ip)
 	if (error < 0)
 		return error;
 
+	set_rgrp_preferences(sdp);
+
 	sdp->sd_rindex_uptodate = 1;
 	return 0;
 }
@@ -1891,6 +1923,25 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, const struct gfs2_rgrpd *b
 }
 
 /**
+ * fast_to_acquire - determine if a resource group will be fast to acquire
+ *
+ * If this is one of our preferred rgrps, it should be quicker to acquire,
+ * because we tried to set ourselves up as dlm lock master.
+ */
+static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
+{
+	struct gfs2_glock *gl = rgd->rd_gl;
+
+	if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
+	    !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
+	    !test_bit(GLF_DEMOTE, &gl->gl_flags))
+		return 1;
+	if (rgd->rd_flags & GFS2_RDF_PREFERRED)
+		return 1;
+	return 0;
+}
+
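
To see how the striding in set_rgrp_preferences spreads the PREFERRED rgrps
across the cluster, here is a small user-space model (all names hypothetical):
rgrps are just indices in rindex order, each node starts at the rgrp matching
its journal ID, and then strides by the node count. The model wraps with
modulo arithmetic for simplicity, which the kernel's rgrp list walk does not
do literally. With 3 nodes and 9 rgrps it prints disjoint sets (jid 0: 0 3 6,
jid 1: 1 4 7, jid 2: 2 5 8); note that when the rgrp count is not a multiple
of the node count, the wrap-around makes the nodes' sets overlap:

#include <stdio.h>

/* Mimic the walk in set_rgrp_preferences(): start at this node's
 * journal ID, then advance by the node count, wrapping until we get
 * back to the starting rgrp. */
static void print_preferred(int jid, int nnodes, int nrgrps)
{
	int first = jid % nrgrps;
	int rgrp = first;

	printf("jid %d prefers:", jid);
	do {
		printf(" %d", rgrp);
		rgrp = (rgrp + nnodes) % nrgrps;
	} while (rgrp != first);
	printf("\n");
}

int main(void)
{
	int nnodes = 3, nrgrps = 9, jid;

	for (jid = 0; jid < nnodes; jid++)
		print_preferred(jid, nnodes, nrgrps);
	return 0;
}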