[Cluster-devel] [GFS2 PATCH] GFS2: Re-add a call to log_flush_wait when flushing the journal

2014-03-12 Thread Bob Peterson
Hi,

Upstream commit 34cc178 changed a line of code from calling function
log_flush_commit to calling log_write_header. This had the effect of
eliminating a call to function log_flush_wait. That causes the journal
to skip over log headers, which results in multiple wrap points,
which itself leads to infinite loops in journal replay, both in the
kernel code and fsck.gfs2 code. This patch re-adds that call.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index edbd461..4a14d50 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -702,6 +702,7 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl)
	gfs2_log_flush_bio(sdp, WRITE);
 
	if (sdp->sd_log_head != sdp->sd_log_flush_head) {
+		log_flush_wait(sdp);
		log_write_header(sdp, 0);
	} else if (sdp->sd_log_tail != current_tail(sdp) && !sdp->sd_log_idle){
		atomic_dec(&sdp->sd_log_blks_free); /* Adjust for unreserved buffer */



[Cluster-devel] [GFS2 PATCH] [TRY #2] GFS2: inline function gfs2_set_mode

2014-03-19 Thread Bob Peterson
Hi,

Here is a revised patch based on Steve's feedback:

This patch eliminates function gfs2_set_mode which was only called in
one place, and always returned 0.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
diff --git a/fs/gfs2/acl.c b/fs/gfs2/acl.c
index 394dc55..3088e2a 100644
--- a/fs/gfs2/acl.c
+++ b/fs/gfs2/acl.c
@@ -64,18 +64,6 @@ struct posix_acl *gfs2_get_acl(struct inode *inode, int type)
return acl;
 }
 
-static int gfs2_set_mode(struct inode *inode, umode_t mode)
-{
-   int error = 0;
-
-	if (mode != inode->i_mode) {
-		inode->i_mode = mode;
-   mark_inode_dirty(inode);
-   }
-
-   return error;
-}
-
 int gfs2_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 {
int error;
@@ -98,9 +86,10 @@ int gfs2_set_acl(struct inode *inode, struct posix_acl *acl, int type)
if (error == 0)
acl = NULL;
 
-   error = gfs2_set_mode(inode, mode);
-   if (error)
-   return error;
+		if (mode != inode->i_mode) {
+			inode->i_mode = mode;
+   mark_inode_dirty(inode);
+   }
}
 
if (acl) {



[Cluster-devel] [GFS2 PATCH] [TRY #2] GFS2: Prevent recovery before the local journal is set

2014-05-02 Thread Bob Peterson
Hi,

This patch uses a completion to prevent dlm's recovery process from
referencing and trying to recover a journal before a journal has been
opened.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index bdf70c1..04c062c 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -730,6 +730,8 @@ struct gfs2_sbd {
struct gfs2_holder sd_sc_gh;
struct gfs2_holder sd_qc_gh;
 
+   struct completion sd_journal_ready;
+
/* Daemon stuff */
 
struct task_struct *sd_logd_process;
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 22f9540..4535156 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -94,6 +94,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
	INIT_LIST_HEAD(&sdp->sd_jindex_list);
	spin_lock_init(&sdp->sd_jindex_spin);
	mutex_init(&sdp->sd_jindex_mutex);
+	init_completion(&sdp->sd_journal_ready);
 
	INIT_LIST_HEAD(&sdp->sd_quota_list);
	mutex_init(&sdp->sd_quota_mutex);
@@ -676,6 +677,7 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
 
	if (sdp->sd_args.ar_spectator) {
		sdp->sd_jdesc = gfs2_jdesc_find(sdp, 0);
+		complete_all(&sdp->sd_journal_ready);
		atomic_set(&sdp->sd_log_blks_free, sdp->sd_jdesc->jd_blocks);
		atomic_set(&sdp->sd_log_thresh1, 2*sdp->sd_jdesc->jd_blocks/5);
		atomic_set(&sdp->sd_log_thresh2, 4*sdp->sd_jdesc->jd_blocks/5);
@@ -686,10 +688,13 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
		fs_err(sdp, "there are only %u journals (0 - %u)\n",
		       gfs2_jindex_size(sdp),
		       gfs2_jindex_size(sdp) - 1);
+		complete_all(&sdp->sd_journal_ready);
goto fail_jindex;
}
	sdp->sd_jdesc = gfs2_jdesc_find(sdp, sdp->sd_lockstruct.ls_jid);
 
+	complete_all(&sdp->sd_journal_ready);
+
	error = gfs2_glock_nq_num(sdp, sdp->sd_lockstruct.ls_jid,
				  &gfs2_journal_glops,
				  LM_ST_EXCLUSIVE, LM_FLAG_NOEXP,
diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
index de25d55..02afa82 100644
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -407,6 +407,9 @@ int gfs2_recover_set(struct gfs2_sbd *sdp, unsigned jid)
struct gfs2_jdesc *jd;
int rv;
 
+   /* Wait for our primary journal to be initialized */
+	wait_for_completion(&sdp->sd_journal_ready);
+
	spin_lock(&sdp->sd_jindex_spin);
	rv = -EBUSY;
	if (sdp->sd_jdesc->jd_jid == jid)



Re: [Cluster-devel] [PATCH 2/5] gfs2-utils: Expressly expunge 'expert mode'

2014-05-07 Thread Bob Peterson
- Original Message -
 mkfs.gfs2 and gfs2_jadd used struct gfs2_sbd.expert to put them into an
 undocumented 'expert mode' but didn't really do anything useful in that
 mode. Remove that field and other 'expert' references.
 

In the past, we've used mkfs.gfs2 expert mode to create certain test
conditions, like very small resource groups. I don't think Nate uses
this in any of his tests, but it might be worth asking.
I've used it to try to recreate certain failing scenarios, but I haven't
needed it in years. Are we sure we want to get rid of it?

Regards,

Bob Peterson
Red Hat File Systems



[Cluster-devel] [gfs2-utils patch] gfs2_edit: Print block types with log descriptors

2014-06-11 Thread Bob Peterson
Hi,

This patch makes gfs2_edit's journal print function print the metadata
block type alongside the block number in its log descriptor output.
For example, in the past, when a log descriptor was printed, it looked
like this:
0x8200 (j+ 163): Log descriptor, type 300 (Metadata) len:504, data1: 503 (503)
0x70eb772          0x70df987          0x70dc7a1          0x70eb574
0x70dc5a3          0x70eb376          0x70dc3a5          0x70eb178
0x70dc1a7          0x70eaf7a          0x70dbfa9          0x70ead7c
With this patch, the output is more useful:
0x8200 (j+ 163): Log descriptor, type 300 (Metadata) len:504, data1: 503 (503)
0x70eb772 in       0x70df987 rb       0x70dc7a1 in       0x70eb574 in
0x70dc5a3 in       0x70eb376 in       0x70dc3a5 in       0x70eb178 in
0x70dc1a7 in       0x70eaf7a in       0x70dbfa9 in       0x70ead7c in
In this case, "in" indicates indirect blocks, and "rb" indicates an rgrp bitmap.
This is useful for getting a bigger picture of what that journal descriptor
contains and represents.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/gfs2/edit/hexedit.h b/gfs2/edit/hexedit.h
index 541684a..f23c581 100644
--- a/gfs2/edit/hexedit.h
+++ b/gfs2/edit/hexedit.h
@@ -27,6 +27,7 @@ enum dsp_mode { HEX_MODE = 0, GFS2_MODE = 1, EXTENDED_MODE = 2, INIT_MODE = 3 };
 #define RGLIST_DUMMY_BLOCK -2
 #define JOURNALS_DUMMY_BLOCK -3
 
+extern const char *mtypes[];
 extern struct gfs2_sb sb;
 extern uint64_t block;
 extern int blockhist;
diff --git a/gfs2/edit/journal.c b/gfs2/edit/journal.c
index bb56649..a72a044 100644
--- a/gfs2/edit/journal.c
+++ b/gfs2/edit/journal.c
@@ -202,6 +202,7 @@ static int print_ld_blks(const uint64_t *b, const char *end, int start_line,
 {
int bcount = 0, found_tblk = 0, found_bblk = 0;
static char str[256];
+   struct gfs2_buffer_head *j_bmap_bh;
 
if (tblk_off)
*tblk_off = 0;
@@ -218,8 +219,17 @@ static int print_ld_blks(const uint64_t *b, const char 
*end, int start_line,
}
bcount++;
if (prnt) {
-			sprintf(str, "0x%llx",
-				(unsigned long long)be64_to_cpu(*b));
+			if (is_meta_ld) {
+				j_bmap_bh = bread(&sbd, abs_block +
+						  bcount);
+				sprintf(str, "0x%llx %2s",
+					(unsigned long long)be64_to_cpu(*b),
+					mtypes[lgfs2_get_block_type(j_bmap_bh)]);
+				brelse(j_bmap_bh);
+			} else {
+				sprintf(str, "0x%llx",
+					(unsigned long long)be64_to_cpu(*b));
+			}
			print_gfs2("%-18.18s ", str);
}
		if (!found_tblk && tblk_off)
@@ -237,7 +247,6 @@ static int print_ld_blks(const uint64_t *b, const char *end, int start_line,
int type, bmap = 0;
uint64_t o;
struct gfs2_buffer_head *save_bh;
-   struct gfs2_buffer_head *j_bmap_bh;
 
found_bblk = 1;
print_gfs2(-);



[Cluster-devel] [gfs2-utils patch] gfs2_edit: print LB (log descriptor continuation blocks) for GFS2

2014-06-11 Thread Bob Peterson
Hi,

This patch allows gfs2_edit to print metadata blocks of type LB
(log descriptor continuation blocks) for GFS2. Prior to this, only
GFS1's continuation blocks would print.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/gfs2/edit/journal.c b/gfs2/edit/journal.c
index e012bc3..bb56649 100644
--- a/gfs2/edit/journal.c
+++ b/gfs2/edit/journal.c
@@ -582,12 +582,15 @@ void dump_journal(const char *journal, int tblk)
[UNMOUNTED] : );
}
eol(0);
-	} else if (sbd.gfs1 && ld_blocks > 0) {
-		print_gfs2("0x%llx (j+%4llx): GFS log descriptor "
+	} else if ((ld_blocks > 0) &&
+		   (sbd.gfs1 || block_type == GFS2_METATYPE_LB)) {
+		print_gfs2("0x%llx (j+%4llx): Log descriptor "
			   "continuation block", abs_block, jb);
eol(0);
print_gfs2();
-   ld_blocks -= print_ld_blks((uint64_t *)dummy_bh.b_data,
+		ld_blocks -= print_ld_blks((uint64_t *)dummy_bh.b_data +
+					   (sbd.gfs1 ? 0 :
+					    sizeof(struct gfs2_meta_header)),
					   (dummy_bh.b_data + sbd.bsize), start_line,
   tblk, tblk_off, 0, rgd,



[Cluster-devel] [GFS2 PATCH] GFS2: Only wait for demote when last holder is dequeued

2014-06-20 Thread Bob Peterson
Hi,

Function gfs2_glock_dq_wait is supposed to dequeue a glock and then
wait for the lock to be demoted. The problem is, if this is a shared
lock, its demote will depend on the other holders, which means you
might end up waiting forever because the other process is blocked.
This problem is especially apparent when dealing with nested flocks.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index aec7f73..22ce2f0 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1128,7 +1128,9 @@ void gfs2_glock_dq_wait(struct gfs2_holder *gh)
	struct gfs2_glock *gl = gh->gh_gl;
gfs2_glock_dq(gh);
might_sleep();
-	wait_on_bit(&gl->gl_flags, GLF_DEMOTE, gfs2_glock_demote_wait, TASK_UNINTERRUPTIBLE);
+	if (!find_first_holder(gl))
+		wait_on_bit(&gl->gl_flags, GLF_DEMOTE, gfs2_glock_demote_wait,
+   TASK_UNINTERRUPTIBLE);
 }
 
 /**



[Cluster-devel] [GFS2 PATCH] GFS2: Allow caching of glocks for flock

2014-06-26 Thread Bob Peterson
Hi,

This patch removes the GLF_NOCACHE flag from the glocks associated with
flocks. There should be no good reason not to cache glocks for flocks:
they only force the glock to be demoted before they can be reacquired,
which can slow down performance and even cause glock hangs, especially
in cases where the flocks are held in Shared (SH) mode.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 4fc3a30..0740a57 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -981,7 +981,7 @@ static int do_flock(struct file *file, int cmd, struct file_lock *fl)
int error = 0;
 
	state = (fl->fl_type == F_WRLCK) ? LM_ST_EXCLUSIVE : LM_ST_SHARED;
-   flags = (IS_SETLKW(cmd) ? 0 : LM_FLAG_TRY) | GL_EXACT | GL_NOCACHE;
+   flags = (IS_SETLKW(cmd) ? 0 : LM_FLAG_TRY) | GL_EXACT;
 
	mutex_lock(&fp->f_fl_mutex);
 



Re: [Cluster-devel] [RFC PATCH] dlm: Remove unused conf from lm_grant

2014-07-01 Thread Bob Peterson
- Original Message -
 On Tue, Jul 01, 2014 at 10:43:13AM -0400, Jeff Layton wrote:
  On Tue, 01 Jul 2014 06:20:10 -0700
  Joe Perches j...@perches.com wrote:
  
   While doing a bit of adding argument names to fs.h,
   I looked at lm_grant and it seems the 2nd argument
   is always NULL.
   
   How about removing it?
   
   This doesn't apply as it depends on some other patches
   but it should be clear enough...
   
  
  ACK on the general idea from my standpoint. Anything that simplifies
  the file locking interfaces is a good thing, particularly the deferred
  locking code.
 
 Fine with me.  I'd be happy to remove all the deferred locking code from
 dlm; it never really worked.
 
 Dave

Hi,

GFS2 uses deferred locks, at the very least in its direct_io path
(gfs2_direct_IO in aops.c). So AFAIK we can't remove THAT without a certain
amount of pain. Steve is on vacation / holiday this week, but he will
be back on Thursday and Friday (which is a holiday).

I'm all for getting rid of useless parameters, and I've done so on
several occasions in GFS2.

Regards,

Bob Peterson
Red Hat File Systems



[Cluster-devel] [gfs2-utils PATCH] fsck.gfs2: File read-ahead

2014-07-10 Thread Bob Peterson
Hi,

This patch introduces file read-ahead to pass1.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
 gfs2/fsck/metawalk.c | 65 ++--
 gfs2/fsck/metawalk.h |  1 +
 gfs2/fsck/pass1.c|  1 +
 3 files changed, 65 insertions(+), 2 deletions(-)

diff --git a/gfs2/fsck/metawalk.c b/gfs2/fsck/metawalk.c
index 659af4e..8da17c6 100644
--- a/gfs2/fsck/metawalk.c
+++ b/gfs2/fsck/metawalk.c
@@ -1184,6 +1184,59 @@ static void free_metalist(struct gfs2_inode *ip, osi_list_t *mlp)
}
 }
 
+static void file_ra(struct gfs2_inode *ip, struct gfs2_buffer_head *bh,
+   int head_size, int maxptrs, int h)
+{
+	struct gfs2_sbd *sdp = ip->i_sbd;
+   uint64_t *p, sblock = 0, block;
+   int extlen = 0;
+
+	if (h + 2 == ip->i_di.di_height) {
+		p = (uint64_t *)(bh->b_data + head_size);
+		if (*p && *(p + 1)) {
+   sblock = be64_to_cpu(*p);
+   p++;
+   block = be64_to_cpu(*p);
+   extlen = block - sblock;
+			if (extlen > 1 && extlen <= maxptrs) {
+				posix_fadvise(sdp->device_fd,
+					      sblock * sdp->bsize,
+					      (extlen + 1) * sdp->bsize,
+ POSIX_FADV_WILLNEED);
+   return;
+   }
+   }
+   extlen = 0;
+   }
+	for (p = (uint64_t *)(bh->b_data + head_size);
+	     p < (uint64_t *)(bh->b_data + sdp->bsize); p++) {
+   if (*p) {
+   if (!sblock) {
+   sblock = be64_to_cpu(*p);
+   extlen = 1;
+   continue;
+   }
+   block = be64_to_cpu(*p);
+   if (block == sblock + extlen) {
+   extlen++;
+   continue;
+   }
+   }
+		if (extlen && sblock) {
+			if (extlen > 1)
+				extlen--;
+			posix_fadvise(sdp->device_fd, sblock * sdp->bsize,
+				      extlen * sdp->bsize,
+ POSIX_FADV_WILLNEED);
+   extlen = 0;
+   p--;
+   }
+   }
+   if (extlen)
+		posix_fadvise(sdp->device_fd, sblock * sdp->bsize,
+			      extlen * sdp->bsize, POSIX_FADV_WILLNEED);
+}
+
 /**
  * build_and_check_metalist - check a bunch of indirect blocks
  *This includes hash table blocks for directories
@@ -1204,6 +1257,7 @@ static int build_and_check_metalist(struct gfs2_inode *ip, osi_list_t *mlp,
int h, head_size, iblk_type;
uint64_t *ptr, block;
int error, was_duplicate, is_valid;
+   int maxptrs;
 
	osi_list_add(&metabh->b_altlist, &mlp[0]);
 
@@ -1225,13 +1279,18 @@ static int build_and_check_metalist(struct gfs2_inode *ip, osi_list_t *mlp,
iblk_type = GFS2_METATYPE_JD;
else
iblk_type = GFS2_METATYPE_IN;
-		if (ip->i_sbd->gfs1)
+		if (ip->i_sbd->gfs1) {
head_size = sizeof(struct gfs_indirect);
-   else
+			maxptrs = (ip->i_sbd->bsize - head_size) /
+   sizeof(uint64_t);
+   } else {
head_size = sizeof(struct gfs2_meta_header);
+			maxptrs = ip->i_sbd->sd_inptrs;
+   }
} else {
iblk_type = GFS2_METATYPE_DI;
head_size = sizeof(struct gfs2_dinode);
+		maxptrs = ip->i_sbd->sd_diptrs;
}
prev_list = mlp[h - 1];
cur_list = mlp[h];
@@ -1246,6 +1305,8 @@ static int build_and_check_metalist(struct gfs2_inode *ip, osi_list_t *mlp,
continue;
}
 
+		if (pass->readahead)
+   file_ra(ip, bh, head_size, maxptrs, h);
/* Now check the metadata itself */
		for (ptr = (uint64_t *)(bh->b_data + head_size);
		     (char *)ptr < (bh->b_data + ip->i_sbd->bsize);
diff --git a/gfs2/fsck/metawalk.h b/gfs2/fsck/metawalk.h
index 5e30bfe..a4e0676 100644
--- a/gfs2/fsck/metawalk.h
+++ b/gfs2/fsck/metawalk.h
@@ -94,6 +94,7 @@ enum meta_check_rc {
 struct metawalk_fxns {
void *private;
int invalid_meta_is_fatal;
+	int readahead;

Re: [Cluster-devel] [gfs2-utils PATCH] fsck.gfs2: time each of the passes

2014-07-11 Thread Bob Peterson
- Original Message -
 Hi,
 
 On 11/07/14 18:41, Bob Peterson wrote:
  - Original Message -
  (snip)
  Use timersub() here perhaps?
 
  Otherwise looks good,
 
  Steve.
  Hi Steve,
 
  Thanks for the suggestion. How about this version?
 Yes, that looks better. There is probably a nicer way to do the
 conversion to string too... a quick google points at using a time_t to
 contain tv_secs, converting to tm and then appending the tv_usecs after.
 Should be a bit cleaner than doing it manually,
 
 Steve.

Hi,

I could implement your suggestion like this. I could also use strftime,
but it's ugly as sin, so I'm reluctant to do so. What do you think?
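
For reference, here is a rough sketch of what the strftime() route might look
like. This is illustration only; it is not part of the patch below, and it
assumes the elapsed time was already computed with timersub() the same way:

/* Sketch only: format the elapsed pass time with strftime() instead of
 * printing the tm fields by hand. Assumes "diff" was filled in by
 * timersub(&after, &before, &diff) as in the patch below. */
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

static void print_pass_time(const char *name, const struct timeval *diff)
{
	time_t secs = diff->tv_sec;
	struct tm tm_run;
	char buf[16];

	gmtime_r(&secs, &tm_run);
	strftime(buf, sizeof(buf), "%Hh%Mm%S", &tm_run);
	printf("%s completed in %s.%03lds\n", name, buf,
	       (long)(diff->tv_usec / 1000));
}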

Regards,

Bob Peterson
---
diff --git a/gfs2/fsck/main.c b/gfs2/fsck/main.c
index b4b1a03..ad42b0d 100644
--- a/gfs2/fsck/main.c
+++ b/gfs2/fsck/main.c
@@ -11,6 +11,8 @@
 #include <signal.h>
 #include <libintl.h>
 #include <locale.h>
+#include <sys/time.h>
+#include <time.h>
 #define _(String) gettext(String)
 
 #include "copyright.cf"
@@ -244,11 +246,15 @@ static const struct fsck_pass passes[] = {
 static int fsck_pass(const struct fsck_pass *p, struct gfs2_sbd *sdp)
 {
int ret;
+   struct  timeval before, after, diff;
+   time_t runtime;
+   struct tm *run_tm;
 
if (fsck_abort)
return FSCK_CANCELED;
	pass = p->name;
	log_notice( _("Starting %s\n"), p->name);
+	gettimeofday(&before, 0);
	ret = p->f(sdp);
if (ret)
exit(ret);
@@ -257,7 +263,16 @@ static int fsck_pass(const struct fsck_pass *p, struct gfs2_sbd *sdp)
		log_notice( _("%s interrupted   \n"), p->name);
return FSCK_CANCELED;
}
-	log_notice( _("%s complete  \n"), p->name);
+	gettimeofday(&after, 0);
+	timersub(&after, &before, &diff);
+	runtime = (time_t)diff.tv_sec;
+	run_tm = gmtime(&runtime);
+	log_notice( _("%s completed in "), p->name);
+	if (run_tm->tm_hour)
+		log_notice("%dh", run_tm->tm_hour);
+	if (run_tm->tm_min)
+		log_notice("%dm", run_tm->tm_min);
+	log_notice("%d.%03lds  \n", run_tm->tm_sec, diff.tv_usec / 1000);
return 0;
 }
 



[Cluster-devel] [GFS2 PATCH] GFS2: Change maxlen variables to size_t

2014-08-06 Thread Bob Peterson
Hi,

This patch changes some variables (especially maxlen in function
gfs2_block_map) from unsigned int to size_t. We need 64-bit arithmetic
for very large files (e.g. 1PB) where the variables otherwise get
shifted to all 0's.
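
As a rough illustration of the truncation (hypothetical numbers, not taken
from the patch): with 4KiB blocks (i_blkbits == 12), a 1PiB b_size shifted
right by 12 gives 2^38 blocks, which does not fit in a 32-bit unsigned int
and collapses to 0 on assignment:

#include <stdio.h>
#include <stddef.h>

int main(void)
{
	size_t b_size = 1ULL << 50;                  /* 1 PiB request */
	unsigned int blkbits = 12;                   /* 4 KiB blocks */
	unsigned int maxlen32 = b_size >> blkbits;   /* keeps low 32 bits: 0 */
	size_t maxlen64 = b_size >> blkbits;         /* 2^38 blocks */

	printf("32-bit maxlen = %u, 64-bit maxlen = %zu\n", maxlen32, maxlen64);
	return 0;
}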

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index e6ee5b6..f0b945a 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -359,7 +359,7 @@ static inline void release_metapath(struct metapath *mp)
  * Returns: The length of the extent (minimum of one block)
  */
 
-static inline unsigned int gfs2_extent_length(void *start, unsigned int len, __be64 *ptr, unsigned limit, int *eob)
+static inline unsigned int gfs2_extent_length(void *start, unsigned int len, __be64 *ptr, size_t limit, int *eob)
 {
const __be64 *end = (start + len);
const __be64 *first = ptr;
@@ -449,7 +449,7 @@ static int gfs2_bmap_alloc(struct inode *inode, const sector_t lblock,
   struct buffer_head *bh_map, struct metapath *mp,
   const unsigned int sheight,
   const unsigned int height,
-  const unsigned int maxlen)
+  const size_t maxlen)
 {
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_sbd *sdp = GFS2_SB(inode);
@@ -483,7 +483,8 @@ static int gfs2_bmap_alloc(struct inode *inode, const sector_t lblock,
} else {
/* Need to allocate indirect blocks */
		ptrs_per_blk = height > 1 ? sdp->sd_inptrs : sdp->sd_diptrs;
-		dblks = min(maxlen, ptrs_per_blk - mp->mp_list[end_of_metadata]);
+		dblks = min(maxlen, (size_t)(ptrs_per_blk -
+					     mp->mp_list[end_of_metadata]));
		if (height == ip->i_height) {
/* Writing into existing tree, extend tree down */
iblks = height - sheight;
@@ -605,7 +606,7 @@ int gfs2_block_map(struct inode *inode, sector_t lblock,
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_sbd *sdp = GFS2_SB(inode);
	unsigned int bsize = sdp->sd_sb.sb_bsize;
-	const unsigned int maxlen = bh_map->b_size >> inode->i_blkbits;
+	const size_t maxlen = bh_map->b_size >> inode->i_blkbits;
	const u64 *arr = sdp->sd_heightsize;
__be64 *ptr;
u64 size;



[Cluster-devel] [GFS2 PATCH] GFS2: Request demote when a try flock fails

2014-08-20 Thread Bob Peterson
Hi,

This patch changes the flock code so that it uses the TRY_1CB flag
instead of the TRY flag on the first attempt. That forces any holding
nodes to issue a dlm callback, which requests a demote of the glock.
Then, if the try failed, it sleeps a small amount of time for the
demote to occur. Then it tries again, for an increasing amount of time.
Subsequent attempts to gain the try lock don't use _1CB so that
only one callback is issued.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 26b3f95..7f4ed3d 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -26,6 +26,7 @@
 #include <linux/dlm.h>
 #include <linux/dlm_plock.h>
 #include <linux/aio.h>
+#include <linux/delay.h>
 
 #include "gfs2.h"
 #include "incore.h"
@@ -979,9 +980,10 @@ static int do_flock(struct file *file, int cmd, struct file_lock *fl)
unsigned int state;
int flags;
int error = 0;
+   int sleeptime;
 
	state = (fl->fl_type == F_WRLCK) ? LM_ST_EXCLUSIVE : LM_ST_SHARED;
-   flags = (IS_SETLKW(cmd) ? 0 : LM_FLAG_TRY) | GL_EXACT;
+   flags = (IS_SETLKW(cmd) ? 0 : LM_FLAG_TRY_1CB) | GL_EXACT;
 
	mutex_lock(&fp->f_fl_mutex);
 
@@ -1001,7 +1003,14 @@ static int do_flock(struct file *file, int cmd, struct file_lock *fl)
gfs2_holder_init(gl, state, flags, fl_gh);
gfs2_glock_put(gl);
}
-   error = gfs2_glock_nq(fl_gh);
+	for (sleeptime = 1; sleeptime <= 4; sleeptime <<= 1) {
+   error = gfs2_glock_nq(fl_gh);
+   if (error != GLR_TRYFAILED)
+   break;
+		fl_gh->gh_flags = LM_FLAG_TRY | GL_EXACT;
+		fl_gh->gh_error = 0;
+   msleep(sleeptime);
+   }
if (error) {
gfs2_holder_uninit(fl_gh);
if (error == GLR_TRYFAILED)
@@ -1024,7 +1033,7 @@ static void do_unflock(struct file *file, struct file_lock *fl)
	mutex_lock(&fp->f_fl_mutex);
flock_lock_file_wait(file, fl);
	if (fl_gh->gh_gl) {
-   gfs2_glock_dq_wait(fl_gh);
+   gfs2_glock_dq(fl_gh);
gfs2_holder_uninit(fl_gh);
}
	mutex_unlock(&fp->f_fl_mutex);



Re: [Cluster-devel] [PATCH 00/19] gfs2-utils: Introduce extent allocation and speed up journal creation

2014-09-02 Thread Bob Peterson
- Original Message -
 One thing to note is that, with these patches, the root and master inodes are
 no longer the first objects in the first resource group. The master inode is
 written in the first free block after the journals and then the other metafs
 structures are placed. The root directory inode is then finally created. This
 is not a format change but it may cause some confusion after years of
 expecting
 the root and master inodes to be at certain addresses so I thought it worth
 mentioning.

Hi,

I know that in fsck.gfs2, in initialize.c, it plays some games trying to find
and repair damaged system dinodes. For example, it looks for a missing master
directory by looking for no_formal_ino == 2. So I'd be very cautious
and check to make sure these repairs still work properly. In the past, I've
done a for loop, wiping out the first X blocks of the file system, running
fsck.gfs2, and seeing if it can properly repair it.

Another concern is gfs2_convert. I don't know if it makes any assumptions
about the master directory, but it's much less likely. I think it just
assumes the file system is healthy. But fsck.gfs2 is a concern.

Regards,

Bob Peterson
Red Hat File Systems



[Cluster-devel] [GFS2 PATCH] GFS2: call d_add for a NULL entry when an inode lookup fails

2014-09-09 Thread Bob Peterson
Hi,

This patch was originally written by Benjamin Coddington (credit where
credit is due).

GFS2's inode lookup wasn't hashing the dentry when an inode wasn't found.
That caused the dentry/inode to get an early cleanup attempt through
dentry_kill when a create op cleaned up the filehandle. That caused the
entry to hang around until unmount time, which means if the file was
created then deleted, its space and inode were not reclaimed until then.
This problem was introduced by gfs2's new atomic open code. The patch
simply adds the d_add (as other file systems do) when inode lookups fail.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index e62e594..9317ddc 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -840,8 +840,10 @@ static struct dentry *__gfs2_lookup(struct inode *dir, struct dentry *dentry,
int error;
 
	inode = gfs2_lookupi(dir, &dentry->d_name, 0);
-   if (!inode)
+   if (inode == NULL) {
+   d_add(dentry, NULL);
return NULL;
+   }
if (IS_ERR(inode))
return ERR_CAST(inode);
 



[Cluster-devel] [GFS2 PATCH] GFS2: Make rename not save dirent location

2014-09-29 Thread Bob Peterson
Hi,

This patch fixes a regression in the patch "GFS2: Remember directory
insert point", commit 2b47dad866d04f14c328f888ba5406057b8c7d33.
The problem had to do with the rename function: The function found
space for the new dirent, and remembered that location. But then the
old dirent was removed, which often moved the eligible location for
the renamed dirent. Putting the new dirent at the saved location
caused file system corruption.

This patch adds a new save_loc variable to struct gfs2_diradd.
If 1, the dirent location is saved. If 0, the dirent location is not
saved and the buffer_head is released as per previous behavior.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 1a349f9..5d4261f 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -2100,8 +2100,13 @@ int gfs2_diradd_alloc_required(struct inode *inode, const struct qstr *name,
}
if (IS_ERR(dent))
return PTR_ERR(dent);
-	da->bh = bh;
-	da->dent = dent;
+
+	if (da->save_loc) {
+		da->bh = bh;
+		da->dent = dent;
+   } else {
+   brelse(bh);
+   }
return 0;
 }
 
diff --git a/fs/gfs2/dir.h b/fs/gfs2/dir.h
index 126c65d..e1b309c 100644
--- a/fs/gfs2/dir.h
+++ b/fs/gfs2/dir.h
@@ -23,6 +23,7 @@ struct gfs2_diradd {
unsigned nr_blocks;
struct gfs2_dirent *dent;
struct buffer_head *bh;
+   int save_loc;
 };
 
 extern struct inode *gfs2_dir_search(struct inode *dir,
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 9516f5c..fcf42ea 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -600,7 +600,7 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
int error, free_vfs_inode = 0;
u32 aflags = 0;
unsigned blocks = 1;
-   struct gfs2_diradd da = { .bh = NULL, };
+   struct gfs2_diradd da = { .bh = NULL, .save_loc = 1, };
 
	if (!name->len || name->len > GFS2_FNAMESIZE)
return -ENAMETOOLONG;
@@ -900,7 +900,7 @@ static int gfs2_link(struct dentry *old_dentry, struct inode *dir,
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_holder ghs[2];
struct buffer_head *dibh;
-   struct gfs2_diradd da = { .bh = NULL, };
+   struct gfs2_diradd da = { .bh = NULL, .save_loc = 1, };
int error;
 
	if (S_ISDIR(inode->i_mode))
@@ -1338,7 +1338,7 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
struct gfs2_rgrpd *nrgd;
unsigned int num_gh;
int dir_rename = 0;
-   struct gfs2_diradd da = { .nr_blocks = 0, };
+   struct gfs2_diradd da = { .nr_blocks = 0, .save_loc = 0, };
unsigned int x;
int error;
 



[Cluster-devel] [DLM PATCH] DLM: Don't wait for resource library lookups if NOLOOKUP is specified

2014-10-01 Thread Bob Peterson
Hi,

This patch adds a new lock flag, DLM_LKF_NOLOOKUP, which instructs DLM
to refrain from sending lookup requests in cases where the lock library
node is not the current node. This is similar to the DLM_LKF_NOQUEUE
flag, except it fails locks that would require a lookup, with -EAGAIN.

This is not just about saving a network operation. It allows callers
like GFS2 to master locks for which they are the directory node. Each
node can then prefer local locks, especially in the case of GFS2
selecting resource groups for block allocations (implemented with a
separate patch). This mastering of local locks distributes the locks
between the nodes (at least until nodes enter or leave the cluster),
which tends to make each node keep to itself when doing allocations.
Thus, dlm communications are kept to a minimum, which results in
significantly faster block allocations.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
 fs/dlm/lock.c | 16 ++--
 include/uapi/linux/dlmconstants.h |  7 +++
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index 83f3d55..f1e5b04 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -222,6 +222,11 @@ static inline int can_be_queued(struct dlm_lkb *lkb)
	return !(lkb->lkb_exflags & DLM_LKF_NOQUEUE);
 }
 
+static inline int can_be_looked_up(struct dlm_lkb *lkb)
+{
+	return !(lkb->lkb_exflags & DLM_LKF_NOLOOKUP);
+}
+
 static inline int force_blocking_asts(struct dlm_lkb *lkb)
 {
	return (lkb->lkb_exflags & DLM_LKF_NOQUEUEBAST);
@@ -2745,6 +2750,11 @@ static int set_master(struct dlm_rsb *r, struct dlm_lkb *lkb)
return 0;
}
 
+   if (!can_be_looked_up(lkb)) {
+   queue_cast(r, lkb, -EAGAIN);
+   return -EAGAIN;
+   }
+
wait_pending_remove(r);
 
	r->res_first_lkid = lkb->lkb_id;
@@ -2828,7 +2838,8 @@ static int set_lock_args(int mode, struct dlm_lksb *lksb, uint32_t flags,
	if (flags & DLM_LKF_CONVDEADLK && !(flags & DLM_LKF_CONVERT))
goto out;
 
-	if (flags & DLM_LKF_CONVDEADLK && flags & DLM_LKF_NOQUEUE)
+	if (flags & DLM_LKF_CONVDEADLK && (flags & (DLM_LKF_NOQUEUE |
+   DLM_LKF_NOLOOKUP)))
goto out;
 
	if (flags & DLM_LKF_EXPEDITE && flags & DLM_LKF_CONVERT)
@@ -2837,7 +2848,8 @@ static int set_lock_args(int mode, struct dlm_lksb *lksb, uint32_t flags,
	if (flags & DLM_LKF_EXPEDITE && flags & DLM_LKF_QUECVT)
goto out;
 
-	if (flags & DLM_LKF_EXPEDITE && flags & DLM_LKF_NOQUEUE)
+	if (flags & DLM_LKF_EXPEDITE && (flags & (DLM_LKF_NOQUEUE |
+ DLM_LKF_NOLOOKUP)))
goto out;
 
	if (flags & DLM_LKF_EXPEDITE && mode != DLM_LOCK_NL)
diff --git a/include/uapi/linux/dlmconstants.h b/include/uapi/linux/dlmconstants.h
index 47bf08d..4b9ba15 100644
--- a/include/uapi/linux/dlmconstants.h
+++ b/include/uapi/linux/dlmconstants.h
@@ -131,6 +131,12 @@
  * Unlock the lock even if it is converting or waiting or has sublocks.
  * Only really for use by the userland device.c code.
  *
+ * DLM_LKF_NOLOOKUP
+ *
+ * Don't take any network time/bandwidth to do directory owner lookups.
+ * This is a lock for which we only care whether it's completely under
+ * local jurisdiction.
+ *
  */
 
 #define DLM_LKF_NOQUEUE		0x00000001
@@ -152,6 +158,7 @@
 #define DLM_LKF_ALTCW		0x00010000
 #define DLM_LKF_FORCEUNLOCK	0x00020000
 #define DLM_LKF_TIMEOUT		0x00040000
+#define DLM_LKF_NOLOOKUP	0x00080000
 
 /*
  * Some return codes that are not in errno.h



Re: [Cluster-devel] [DLM PATCH] DLM: Don't wait for resource library lookups if NOLOOKUP is specified

2014-10-03 Thread Bob Peterson
- Original Message -
 Hi,
 
 I think we need to do some more investigation here... how long do the
 lookups take? If the issue is just to create a list of perferred rgrps
 for each node, then there are various ways in which we might do that.
 That is not to say that this isn't a good way to do it, but I think we
 should try to understand the timings here first and make sure that we
 are solving the right problem,
 
 Steve.

Hi,

Contrary to my previous findings (which I think I may have screwed up),
I've done further investigation and found that the DLM lookups aren't the
issue at all. Also, making the DLM master the lookup node doesn't seem to
make much difference either. So I'm scrapping this dlm patch.

My latest set of patches now includes one that evenly distributes a
preferred set of rgrps to each node. Unlike Dave's original algorithm in
GFS1, this is a round-robin scheme that makes every node prefer every
Nth rgrp, where N is the number of nodes. This seems to be just as fast
if not faster than messing around with DLM.

Hopefully I'll be posting more patches in the near future. I'm currently
running some more tests regarding minimum reservations, and I'll post
patches depending on those results.
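
For illustration only, the round-robin idea boils down to something like the
following stand-alone sketch (hypothetical names, not the actual GFS2 code):

#include <stdbool.h>

/* The node holding journal "jid" prefers rgrps jid, jid + N, jid + 2N, ...
 * where N is the number of nodes in the cluster. */
static bool rgrp_is_preferred(unsigned int rgrp_index, unsigned int jid,
			      unsigned int num_nodes)
{
	return (rgrp_index % num_nodes) == (jid % num_nodes);
}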

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [DLM PATCH] DLM: Don't wait for resource library lookups if NOLOOKUP is specified

2014-10-03 Thread Bob Peterson
- Original Message -
 Can we just use NOQUEUE?  It tells you that there's a lock conflict, which
 tells you to move along and try another if you don't want to contend.  If
 you cache acquired locks and reuse them, then it doesn't matter if the
 master node is remote or local.

The problem with NOQUEUE is that it seems to depend on circumstances.
If each node mounts the file system at a separate time, and as part of that
process, they do a NOQUEUE lock on every rgrp, they all are granted the lock.
With this scheme, they're evenly divided between the nodes.

It doesn't matter anyway, because I'm scrapping the DLM patch in favor of
a scheme like the one you pointed out in GFS1 below. See my other email for
more details.

 If lookups are a problem in general, there is the nodir lockspace mode,
 which replaces the resource directory lookup system with a static mapping
 of resources to master nodes.
 
(snip)
 Back in 2002 I solved what sounds like the same problem in gfs(1).  It
 allowed all nodes to allocate blocks independent of each other, without
 constant locking.  You can see the solution here:
 
 https://git.fedorahosted.org/cgit/cluster.git/tree/gfs-kernel/src/gfs/rgrp.c?h=RHEL4

Regards,

Bob Peterson
Red Hat File Systems



[Cluster-devel] [GFS2 PATCH] GFS2: Change minimum reservation size to 64 blocks

2014-10-04 Thread Bob Peterson
Hi,

I've been working on the problems of both slow performance and excessive
file fragmentation, especially with regard to the RHEL7 kernel. I've
created some patches (still forthcoming) to improve these things. For a
long time, I thought the fragmentation issue was related to GFS2 not
using its reservations efficiently. I wrote a patch that revises the
calculations in function gfs2_write_calc_reserv, but it didn't seem to
help. I also thought it might be releasing the reservations too early,
but that turned out not to be the case. I've done extensive performance
testing and determined that, even under the best circumstances (IOW by
minimizing both inter and intra-node resource group contention), file
fragmentation can be improved significantly by doubling the minimum
block reservation size from 32 to 64 blocks. To test this, I did
several two+ hour tests (a real-life application) that use lots of
files in GFS2. After each test completed, I totaled all the file
extents (fragments) using filefrag.

Here are the results of five two-hour runs performed with new, revised
calculations, and today's 32-block reservations:

EXTENT COUNT FOR OUTPUT FILES =  444352
EXTENT COUNT FOR OUTPUT FILES =  468425
EXTENT COUNT FOR OUTPUT FILES =  472464
EXTENT COUNT FOR OUTPUT FILES =  482224
EXTENT COUNT FOR OUTPUT FILES =  481174

Here are the results of five runs performed with the old calculations
and 64-block reservations:

EXTENT COUNT FOR OUTPUT FILES =  365743
EXTENT COUNT FOR OUTPUT FILES =  398628
EXTENT COUNT FOR OUTPUT FILES =  404614
EXTENT COUNT FOR OUTPUT FILES =  401355
EXTENT COUNT FOR OUTPUT FILES =  384599

As you can see, by doubling the minimum reservation size, the files
have about 20 percent fewer extents.

This patch, therefore, doubles the minimum block reservations.

Incidentally, if we don't take measures to minimize resource group
contention (both inter and intra-node) the results are much worse.
Here is the same test, done on a stock RHEL7 kernel:

EXTENT COUNT FOR OUTPUT FILES =  826314
EXTENT COUNT FOR OUTPUT FILES =  791406
EXTENT COUNT FOR OUTPUT FILES =  735760

Patch description:

The minimum block reservation size was 32 blocks. This patch doubles
it to 64 blocks.
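
(For reference, assuming GFS2_NBBY is the usual four blocks per bitmap byte,
i.e. two bits per block: 8 bytes x 4 = 32 blocks for the old minimum, and
16 bytes x 4 = 64 blocks for the new RGRP_RSRV_MINBYTES value.)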

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index 5d8f085..e2058a7 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -14,11 +14,9 @@
 #include <linux/uaccess.h>
 
 /* Since each block in the file system is represented by two bits in the
- * bitmap, one 64-bit word in the bitmap will represent 32 blocks.
- * By reserving 32 blocks at a time, we can optimize / shortcut how we search
- * through the bitmaps by looking a word at a time.
+ * bitmap, each 64-bit word in the bitmap represents 32 blocks.
  */
-#define RGRP_RSRV_MINBYTES 8
+#define RGRP_RSRV_MINBYTES 16 /* Reserve 64 blocks */
 #define RGRP_RSRV_MINBLKS ((u32)(RGRP_RSRV_MINBYTES * GFS2_NBBY))
 
 struct gfs2_rgrpd;



[Cluster-devel] [GFS2 PATCH] GFS2: Block reservation doubling scheme

2014-10-08 Thread Bob Peterson
Hi,

Steve Whitehouse wrote:
| I'd much prefer to see an algorithm that is adaptive, rather than simply
| bumping up the default here. We do need to be able to cope with cases
| where the files are much smaller, and without adaptive sizing, we might
| land up creating small holes in the allocations, which would then cause
| problems for later allocations,

I took this to heart and came up with a new design. The idea is not unlike
the block reservation doubling schemes in other file systems. The results
are fantastic; much better than those gained by hard coding 64 blocks.
This is the best level of fragmentation I've ever achieved with this app:

EXTENT COUNT FOR OUTPUT FILES =  310103
EXTENT COUNT FOR OUTPUT FILES =  343990
EXTENT COUNT FOR OUTPUT FILES =  332818
EXTENT COUNT FOR OUTPUT FILES =  336852
EXTENT COUNT FOR OUTPUT FILES =  334820

Compare these results to counts without the patch:

EXTENT COUNT FOR OUTPUT FILES =  951813
EXTENT COUNT FOR OUTPUT FILES =  966978
EXTENT COUNT FOR OUTPUT FILES =  1065481

The only down side I see is that it makes the gfs2 inode structure bigger.
I also thought about changing the minimum reservation size to 16 blocks
rather than 32, since it's now adaptive, but before I did that, I'd have
to run some performance tests. What do you think?

Regards,

Bob Peterson
Red Hat File Systems

Patch text:

This patch introduces a new block reservation doubling scheme. If we
get to the end of a multi-block reservation, we probably did not
reserve enough blocks. So we double the size of the reservation for
next time. If we can't find any rgrps that match, we go back to the
default 32 blocks.

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
 fs/gfs2/incore.h | 1 +
 fs/gfs2/main.c   | 2 ++
 fs/gfs2/rgrp.c   | 7 ++-
 fs/gfs2/rgrp.h   | 9 ++---
 4 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 39e7e99..f98fa37 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -398,6 +398,7 @@ struct gfs2_inode {
u32 i_diskflags;
u8 i_height;
u8 i_depth;
+   u32 i_rsrv_minblks; /* minimum blocks per reservation */
 };
 
 /*
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index 82b6ac8..2be2f98 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -29,6 +29,7 @@
 #include "glock.h"
 #include "quota.h"
 #include "recovery.h"
+#include "rgrp.h"
 #include "dir.h"
 
 struct workqueue_struct *gfs2_control_wq;
@@ -42,6 +43,7 @@ static void gfs2_init_inode_once(void *foo)
	INIT_LIST_HEAD(&ip->i_trunc_list);
	ip->i_res = NULL;
	ip->i_hash_cache = NULL;
+	ip->i_rsrv_minblks = RGRP_RSRV_MINBLKS;
 }
 
 static void gfs2_init_glock_once(void *foo)
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7474c41..986c33f 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1465,7 +1465,7 @@ static void rg_mblk_search(struct gfs2_rgrpd *rgd, struct gfs2_inode *ip,
extlen = 1;
else {
		extlen = max_t(u32, atomic_read(&rs->rs_sizehint), ap->target);
-		extlen = clamp(extlen, RGRP_RSRV_MINBLKS, free_blocks);
+		extlen = clamp(extlen, ip->i_rsrv_minblks, free_blocks);
}
	if ((rgd->rd_free_clone < rgd->rd_reserved) || (free_blocks < extlen))
return;
@@ -2000,6 +2000,7 @@ next_rgrp:
 * then this checks for some less likely conditions before
 * trying again.
 */
+	ip->i_rsrv_minblks = RGRP_RSRV_MINBLKS;
	loops++;
	/* Check that fs hasn't grown if writing to rindex */
	if (ip == GFS2_I(sdp->sd_rindex) && !sdp->sd_rindex_uptodate) {
@@ -2195,6 +2196,10 @@ static void gfs2_adjust_reservation(struct gfs2_inode *ip,
trace_gfs2_rs(rs, TRACE_RS_CLAIM);
	if (rs->rs_free && !ret)
goto out;
+   /* We used up our block reservation, so double the
+  minimum reservation size for the next write. */
+		if (ip->i_rsrv_minblks < RGRP_RSRV_MAXBLKS)
+			ip->i_rsrv_minblks <<= 1;
}
__rs_deltree(rs);
}
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index 5d8f085..d081cac 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -13,13 +13,8 @@
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 
-/* Since each block in the file system is represented by two bits in the
- * bitmap, one 64-bit word in the bitmap will represent 32 blocks.
- * By reserving 32 blocks at a time, we can optimize / shortcut how we search
- * through the bitmaps by looking a word at a time.
- */
-#define RGRP_RSRV_MINBYTES 8
-#define RGRP_RSRV_MINBLKS ((u32)(RGRP_RSRV_MINBYTES * GFS2_NBBY))
+#define RGRP_RSRV_MINBLKS 32
+#define RGRP_RSRV_MAXBLKS 512
 
 struct gfs2_rgrpd;
 struct gfs2_sbd;



[Cluster-devel] [GFS2 PATCH] GFS2: Set of distributed preferences for rgrps

2014-10-08 Thread Bob Peterson
Hi,

This patch uses journal numbers to evenly distribute which node
prefers which resource groups for block allocations. This is to
help performance. The idea is to make each node in the cluster
prefer to use a certain subset of resource groups for its block
allocations. Other nodes use a different subset, thus minimizing
inter-node DLM communications.

GFS1 has a similar scheme, but with GFS1, each node starts out
using an initial section of the file system and works their way
forward. While this works for the most part, simultaneous writes
by different nodes can cause the heads to bounce on some devices.
Also, nodes tend to stray from their initial locations. With this
patch, the preferred resource groups are assigned in a round-robin
fashion.

Tests prove that this patch can be a major performance boost for
some applications. For example, using one particular application,
I posted these run times before applying this patch:

Run 1 time: 2hr 49min 39sec
Run 2 time: 2hr 57min 59sec
Run 3 time: 3hr 1min 10sec

With the patch applied the times are improved by almost a third:

Run 1 time: 2hr 7min 31sec
Run 2 time: 2hr 5min 49sec
Run 3 time: 2hr 5min 1sec
Run 4 time: 2hr 4min 35sec
Run 5 time: 2hr 5min 55sec

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index f98fa37..c9f4f4c 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -97,6 +97,7 @@ struct gfs2_rgrpd {
 #define GFS2_RDF_CHECK		0x10000000 /* check for unlinked inodes */
 #define GFS2_RDF_UPTODATE	0x20000000 /* rg is up to date */
 #define GFS2_RDF_ERROR		0x40000000 /* error in rg */
+#define GFS2_RDF_PREFERRED	0x80000000 /* This rgrp is preferred */
 #define GFS2_RDF_MASK		0xf0000000 /* mask for internal flags */
spinlock_t rd_rsspin;   /* protects reservation related vars */
struct rb_root rd_rstree;   /* multi-block reservation tree */
@@ -809,6 +810,7 @@ struct gfs2_sbd {
char sd_table_name[GFS2_FSNAME_LEN];
char sd_proto_name[GFS2_FSNAME_LEN];
 
+   int sd_nodes;
/* Debugging crud */
 
unsigned long sd_last_warning;
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 641383a..5aeb03a 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1113,6 +1113,8 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
	struct gfs2_sbd *sdp = arg;
	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 
+   BUG_ON(num_slots == 0);
+	sdp->sd_nodes = num_slots;
/* ensure the ls jid arrays are large enough */
set_recover_size(sdp, slots, num_slots);
 
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index d3eae24..bf3193f 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -134,6 +134,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
	atomic_set(&sdp->sd_log_freeze, 0);
	atomic_set(&sdp->sd_frozen_root, 0);
	init_waitqueue_head(&sdp->sd_frozen_root_wait);
+	sdp->sd_nodes = 1;
 
return sdp;
 }
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 986c33f..bd8bddc 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -936,7 +936,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
	rgd->rd_gl->gl_vm.start = rgd->rd_addr * bsize;
	rgd->rd_gl->gl_vm.end = rgd->rd_gl->gl_vm.start + (rgd->rd_length * bsize) - 1;
	rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
-	rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
+	rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
	if (rgd->rd_data > sdp->sd_max_rg_data)
		sdp->sd_max_rg_data = rgd->rd_data;
	spin_lock(&sdp->sd_rindex_spin);
@@ -955,6 +955,36 @@ fail:
 }
 
 /**
+ * set_rgrp_preferences - Run all the rgrps, selecting some we prefer to use
+ * @sdp: the GFS2 superblock
+ *
+ * The purpose of this function is to select a subset of the resource groups
+ * and mark them as PREFERRED. We do it in such a way that each node prefers
+ * to use a unique set of rgrps to minimize glock contention.
+ */
+static void set_rgrp_preferences(struct gfs2_sbd *sdp)
+{
+   struct gfs2_rgrpd *rgd, *first;
+   int i;
+
+   /* Skip an initial number of rgrps, based on this node's journal ID.
+  That should start each node out on its own set. */
+   rgd = gfs2_rgrpd_get_first(sdp);
+	for (i = 0; i < sdp->sd_lockstruct.ls_jid; i++)
+   rgd = gfs2_rgrpd_get_next(rgd);
+   first = rgd;
+
+   do {
+		rgd->rd_flags |= GFS2_RDF_PREFERRED;
+		for (i = 0; i < sdp->sd_nodes; i++) {
+   rgd = gfs2_rgrpd_get_next(rgd);
+   if (rgd == first)
+   break;
+   }
+   } while (rgd != first);
+}
+
+/**
  * gfs2_ri_update - Pull in a new resource index from the disk
  * @ip: pointer to the rindex inode
  *
@@ -973,6 +1003,8

Re: [Cluster-devel] [GFS2 PATCH] GFS2: Block reservation doubling scheme

2014-10-09 Thread Bob Peterson
- Original Message -
 - Original Message -
  This patch introduces a new block reservation doubling scheme. If we
 
 Maybe I sent this patch out prematurely. Instead of doubling the
 reservation, maybe I should experiment with making it grow additively.
 IOW, instead of 32->64->128->256->512, I should use:
 32->64->96->128->160->192->224->etc...
 I know other file systems using doubling schemes, but I'm concerned
 about it being too aggressive.

I tried an additive reservations algorithm. I basically changed the
previous patch from doubling the reservation to adding 32 blocks.
In other words, I replaced:

+	ip->i_rsrv_minblks <<= 1;
with this:
+	ip->i_rsrv_minblks += RGRP_RSRV_MINBLKS;
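
For reference, a stand-alone sketch of how the two growth curves diverge,
using the 32-block start and 512-block cap from the patch (illustration only,
not GFS2 code):

#include <stdio.h>

int main(void)
{
	unsigned int doubling = 32, additive = 32;
	int step;

	for (step = 1; step <= 8; step++) {
		if (doubling < 512)
			doubling <<= 1;   /* 64, 128, 256, 512, 512, ... */
		additive += 32;           /* 64, 96, 128, 160, 192, ... */
		printf("use #%d exhausted: doubling=%u additive=%u\n",
		       step, doubling, additive);
	}
	return 0;
}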

The results were not as good, but still very impressive, and maybe
acceptable:

Reservation doubling scheme:
EXTENT COUNT FOR OUTPUT FILES =  310103
EXTENT COUNT FOR OUTPUT FILES =  343990
EXTENT COUNT FOR OUTPUT FILES =  332818
EXTENT COUNT FOR OUTPUT FILES =  336852
EXTENT COUNT FOR OUTPUT FILES =  334820

Reservation additive scheme (32 blocks):
EXTENT COUNT FOR OUTPUT FILES =  322406
EXTENT COUNT FOR OUTPUT FILES =  341665
EXTENT COUNT FOR OUTPUT FILES =  341769
EXTENT COUNT FOR OUTPUT FILES =  348676
EXTENT COUNT FOR OUTPUT FILES =  348079

So I'm looking for opinions:
(a) Stick with the original reservation doubling patch, or
(b) Go with the additive version.
(c) Any other ideas?

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [GFS2 PATCH] GFS2: Block reservation doubling scheme

2014-10-14 Thread Bob Peterson
 are necessary. That way,
   applications that do a long sequence of: large-appends followed by
   small rewrites don't get their append size hint whacked by the
   small rewrite. This didn't help the customer application I'm testing,
   but it did help some of the other benchmarks I ran yesterday.
5. I don't like the idea of adjusting the reservation at fsync time.
   Similarly, I don't like the idea of adjusting the reservation at
   file close time. I think it makes the most sense to keep the info
   associated with the inode as we do today. My next iteration will
   hopefully not add any fields to the inode.
6. I like the idea of adjusting the reservation for non-linear writes,
   such as backwards writes, but I may have to do more testing. For
   example, if I do multiple writes to a file at: 2MB, 1MB, 500KB, etc.,
   is it better to reserve X blocks which will be assigned in reverse
   order? Or is it better to just reserve them as needed and have them
   more scattered but possibly more linear? Maybe testing will show.
7. In regards to storing the context information in the struct file:
   It depends on what information. Today, there is only one reservation
   structure, and reservation size per inode, whereas there can be many
   struct files for many writers to the inode. The question of whether
   a reservation is adequate is not so much about "will this reservation
   be adequate for this writer?". Rather, it's about "will this
   reservation be adequate for our most demanding writer?" All the
   rewriters in the world shouldn't affect the outcome of a single
   aggressive appender, for example.

To answer your question: I'd wager that yes, there are multiple writers
to at least some of the files, but I'm not sure how extensive it is.
The workload seems to have a good variety of linear and non-linear writes
as well. At least now I'm starting to use multiple benchmarks for my
tests.

Regards,

Bob Peterson
Red Hat File Systems




[Cluster-devel] [GFS2 PATCH] GFS2: Speed up fiemap function by skipping holes

2014-10-14 Thread Bob Peterson
Hi,

This patch detects the new holesize bit in block_map requests. If a
hole is found during fiemap, it figures out the size of the hole
based on the current metapath information. Since the metapath only
represents a section of the file, it can only extrapolate to a
certain size based on the current metapath buffers. Therefore,
fiemap may call blockmap several times to get the hole size.
The hole size is determined by a new function.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index f0b945a..edd7ed6 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -587,6 +587,62 @@ static int gfs2_bmap_alloc(struct inode *inode, const sector_t lblock,
 }
 
 /**
+ * hole_size - figure out the size of a hole
+ * @ip: The inode
+ * @lblock: The logical starting block number
+ * @mp: The metapath
+ *
+ * Returns: The hole size in bytes
+ *
+ */
+static u64 hole_size(struct inode *inode, sector_t lblock,
+struct metapath *mp)
+{
+   struct gfs2_inode *ip = GFS2_I(inode);
+   struct gfs2_sbd *sdp = GFS2_SB(inode);
+	unsigned int end_of_metadata = ip->i_height - 1;
+   u64 factor = 1;
+   int hgt = end_of_metadata;
+   u64 holesz = 0, holestep;
+   const __be64 *first, *end, *ptr;
+   const struct buffer_head *bh;
+   u64 isize = i_size_read(inode);
+   int zeroptrs;
+   struct metapath mp_eof;
+
+   /* Get a metapath to the very last byte */
+	find_metapath(sdp, (isize - 1) >> inode->i_blkbits, &mp_eof,
+		      ip->i_height);
+	for (hgt = end_of_metadata; hgt >= 0; hgt--) {
+		bh = mp->mp_bh[hgt];
+   if (bh) {
+   zeroptrs = 0;
+   first = metapointer(hgt, mp);
+			end = (const __be64 *)(bh->b_data + bh->b_size);
+
+			for (ptr = first; ptr < end; ptr++) {
+   if (*ptr)
+   break;
+   else
+   zeroptrs++;
+   }
+   } else {
+			zeroptrs = sdp->sd_inptrs;
+   }
+   holestep = min(factor * zeroptrs,
+  isize - (lblock + (zeroptrs * holesz)));
+   holesz += holestep;
+		if (lblock + holesz >= isize)
+			return holesz << inode->i_blkbits;
+
+		factor *= sdp->sd_inptrs;
+		if (hgt && (mp->mp_list[hgt - 1] < mp_eof.mp_list[hgt - 1]))
+			(mp->mp_list[hgt - 1])++;
+   }
+	return holesz << inode->i_blkbits;
+}
+
+/**
  * gfs2_block_map - Map a block from an inode to a disk block
  * @inode: The inode
  * @lblock: The logical block number
@@ -645,11 +701,17 @@ int gfs2_block_map(struct inode *inode, sector_t lblock,
	ret = lookup_metapath(ip, &mp);
	if (ret < 0)
		goto out;
-	if (ret != ip->i_height)
+	if (ret != ip->i_height) {
+		if (test_clear_buffer_holesize(bh_map))
+			bh_map->b_size = hole_size(inode, lblock, &mp);
goto do_alloc;
+   }
	ptr = metapointer(ip->i_height - 1, &mp);
-	if (*ptr == 0)
+	if (*ptr == 0) {
+		if (test_clear_buffer_holesize(bh_map))
+			bh_map->b_size = hole_size(inode, lblock, &mp);
		goto do_alloc;
+	}
	map_bh(bh_map, inode->i_sb, be64_to_cpu(*ptr));
	bh = mp.mp_bh[ip->i_height - 1];
	len = gfs2_extent_length(bh->b_data, bh->b_size, ptr, maxlen, &eob);



[Cluster-devel] [GFS2 PATCH 3/4] GFS2: Only increase rs_sizehint

2014-10-20 Thread Bob Peterson
If an application does a sequence of (1) big write, (2) little write
we don't necessarily want to reset the size hint based on the smaller
size. The fact that they did any big writes implies they may do more,
and therefore we should try to allocate bigger block reservations, even
if the last few were small writes. Therefore this patch changes function
gfs2_size_hint so that the size hint can only grow; it cannot shrink.
This is especially important where there are multiple writers.
---
 fs/gfs2/file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 2976019..5c7a9c1 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -337,7 +337,8 @@ static void gfs2_size_hint(struct file *filep, loff_t offset, size_t size)
	size_t blks = (size + sdp->sd_sb.sb_bsize - 1) >> sdp->sd_sb.sb_bsize_shift;
int hint = min_t(size_t, INT_MAX, blks);
 
-	atomic_set(&ip->i_res->rs_sizehint, hint);
+	if (hint > atomic_read(&ip->i_res->rs_sizehint))
+		atomic_set(&ip->i_res->rs_sizehint, hint);
 }
 
 /**
-- 
1.9.3



[Cluster-devel] [GFS2 PATCH 2/4] GFS2: Make block reservations more persistent

2014-10-20 Thread Bob Peterson
Before this patch, whenever a struct file (opened to allow writes) was
closed, the multi-block reservation structure associated with the inode
was deleted. That's a problem, especially when there are multiple writers.
Applications that do open-write-close will suffer from greater levels
of fragmentation and need to re-do work to perform write operations.
This patch removes the reservation delete from the file close code so
that they're more persistent until the inode is deleted.
---
 fs/gfs2/file.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 7f4ed3d..2976019 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -616,15 +616,8 @@ static int gfs2_open(struct inode *inode, struct file *file)
 
 static int gfs2_release(struct inode *inode, struct file *file)
 {
-   struct gfs2_inode *ip = GFS2_I(inode);
-
	kfree(file->private_data);
	file->private_data = NULL;
-
-	if (!(file->f_mode & FMODE_WRITE))
-		return 0;
-
-	gfs2_rs_delete(ip, &inode->i_writecount);
return 0;
 }
 
-- 
1.9.3



[Cluster-devel] [GFS2 PATCH 0/4] Patches to reduce GFS2 fragmentation

2014-10-20 Thread Bob Peterson
Hi,

On October 8, I posted a GFS2 patch that greatly reduced inter-node
contention for resource group glocks. The patch was called:
GFS2: Set of distributed preferences for rgrps. It implemented a
new scheme whereby each node in a cluster tries to keep to itself
for allocations. This is not unlike GFS1, which has a different scheme.

Although the patch sped up GFS2 performance in general, it also caused
more file fragmentation, because each node tended to focus on a smaller
subset of resource groups.

Here are run times and file fragmentation extent counts for my favorite
customer application, using a STOCK RHEL7 kernel (no patches):

Run times:
Run 1 time: 2hr 40min 33sec
Run 2 time: 2hr 39min 52sec
Run 3 time: 2hr 39min 31sec
Run 4 time: 2hr 33min 57sec
Run 5 time: 2hr 41min 6sec

Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES =  744708
EXTENT COUNT FOR OUTPUT FILES =  749868
EXTENT COUNT FOR OUTPUT FILES =  721862
EXTENT COUNT FOR OUTPUT FILES =  635301
EXTENT COUNT FOR OUTPUT FILES =  689263

The times are bad and the fragmentation level is also bad. If I add
just the first patch, GFS2: Set of distributed preferences for rgrps
you can see that the performance improves, but the fragmentation is
worse (I only did three iterations this time):

Run times:
Run 1 time: 2hr 2min 47sec
Run 2 time: 2hr 8min 37sec
Run 3 time: 2hr 10min 0sec

Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES =  1011217
EXTENT COUNT FOR OUTPUT FILES =  1025973
EXTENT COUNT FOR OUTPUT FILES =  1070163

So the patch improved performance by 25 percent, but file fragmentation is
30 percent worse. Some of this is undoubtedly due to the SAN array buffering,
hiding our multitude of sins. But not every customer will have this quality
of SAN. So it's important to reduce the fragmentation as well, so that the
patch doesn't end up helping some people while hurting others.

Toward this end, I devised three relatively simple patches that greatly
reduce file fragmentation. With all four patches, the numbers are as follows:

Run times:
Run 1 time: 2hr 5min 46sec
Run 2 time: 2hr 10min 15sec
Run 3 time: 2hr 8min 4sec
Run 4 time: 2hr 9min 27sec
Run 5 time: 2hr 6min 15sec

Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES =  330276
EXTENT COUNT FOR OUTPUT FILES =  358939
EXTENT COUNT FOR OUTPUT FILES =  375374
EXTENT COUNT FOR OUTPUT FILES =  383071
EXTENT COUNT FOR OUTPUT FILES =  369269

As you can see, with this combination of four patches, the run times are
good as well as the file fragmentation levels. The file fragmentation is
about twice as good as the stock kernel, and significantly better (almost
three times better) than with the first patch alone.

This patch set includes all four patches.

Bob Peterson (4):
  GFS2: Set of distributed preferences for rgrps
  GFS2: Make block reservations more persistent
  GFS2: Only increase rs_sizehint
  GFS2: If we use up our block reservation, request more next time

 fs/gfs2/file.c   | 10 ++--
 fs/gfs2/incore.h |  2 ++
 fs/gfs2/lock_dlm.c   |  2 ++
 fs/gfs2/ops_fstype.c |  1 +
 fs/gfs2/rgrp.c   | 69 
 5 files changed, 71 insertions(+), 13 deletions(-)

-- 
1.9.3



[Cluster-devel] [GFS2 PATCH 1/4] GFS2: Set of distributed preferences for rgrps

2014-10-20 Thread Bob Peterson
This patch tries to use the journal numbers to evenly distribute
which node prefers which resource group for block allocations. This
is to help performance.
---
 fs/gfs2/incore.h |  2 ++
 fs/gfs2/lock_dlm.c   |  2 ++
 fs/gfs2/ops_fstype.c |  1 +
 fs/gfs2/rgrp.c   | 66 
 4 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 39e7e99..618d20a 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -97,6 +97,7 @@ struct gfs2_rgrpd {
 #define GFS2_RDF_CHECK 0x1000 /* check for unlinked inodes */
 #define GFS2_RDF_UPTODATE  0x2000 /* rg is up to date */
 #define GFS2_RDF_ERROR 0x4000 /* error in rg */
+#define GFS2_RDF_PREFERRED 0x8000 /* This rgrp is preferred */
 #define GFS2_RDF_MASK  0xf000 /* mask for internal flags */
spinlock_t rd_rsspin;   /* protects reservation related vars */
struct rb_root rd_rstree;   /* multi-block reservation tree */
@@ -808,6 +809,7 @@ struct gfs2_sbd {
char sd_table_name[GFS2_FSNAME_LEN];
char sd_proto_name[GFS2_FSNAME_LEN];
 
+   int sd_nodes;
/* Debugging crud */
 
unsigned long sd_last_warning;
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 641383a..5aeb03a 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1113,6 +1113,8 @@ static void gdlm_recover_done(void *arg, struct dlm_slot 
*slots, int num_slots,
struct gfs2_sbd *sdp = arg;
struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 
+   BUG_ON(num_slots == 0);
+   sdp->sd_nodes = num_slots;
/* ensure the ls jid arrays are large enough */
set_recover_size(sdp, slots, num_slots);
 
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index d3eae24..bf3193f 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -134,6 +134,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
atomic_set(&sdp->sd_log_freeze, 0);
atomic_set(&sdp->sd_frozen_root, 0);
init_waitqueue_head(&sdp->sd_frozen_root_wait);
+   sdp->sd_nodes = 1;
 
return sdp;
 }
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7474c41..50cdba2 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -936,7 +936,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
rgd->rd_gl->gl_vm.start = rgd->rd_addr * bsize;
rgd->rd_gl->gl_vm.end = rgd->rd_gl->gl_vm.start + (rgd->rd_length * 
bsize) - 1;
rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
-   rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
+   rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
if (rgd->rd_data > sdp->sd_max_rg_data)
sdp->sd_max_rg_data = rgd->rd_data;
spin_lock(&sdp->sd_rindex_spin);
@@ -955,6 +955,36 @@ fail:
 }
 
 /**
+ * set_rgrp_preferences - Run all the rgrps, selecting some we prefer to use
+ * @sdp: the GFS2 superblock
+ *
+ * The purpose of this function is to select a subset of the resource groups
+ * and mark them as PREFERRED. We do it in such a way that each node prefers
+ * to use a unique set of rgrps to minimize glock contention.
+ */
+static void set_rgrp_preferences(struct gfs2_sbd *sdp)
+{
+   struct gfs2_rgrpd *rgd, *first;
+   int i;
+
+   /* Skip an initial number of rgrps, based on this node's journal ID.
+  That should start each node out on its own set. */
+   rgd = gfs2_rgrpd_get_first(sdp);
+   for (i = 0; i < sdp->sd_lockstruct.ls_jid; i++)
+   rgd = gfs2_rgrpd_get_next(rgd);
+   first = rgd;
+
+   do {
+   rgd->rd_flags |= GFS2_RDF_PREFERRED;
+   for (i = 0; i < sdp->sd_nodes; i++) {
+   rgd = gfs2_rgrpd_get_next(rgd);
+   if (rgd == first)
+   break;
+   }
+   } while (rgd != first);
+}
+
+/**
  * gfs2_ri_update - Pull in a new resource index from the disk
  * @ip: pointer to the rindex inode
  *
@@ -973,6 +1003,8 @@ static int gfs2_ri_update(struct gfs2_inode *ip)
if (error  0)
return error;
 
+   set_rgrp_preferences(sdp);
+
sdp-sd_rindex_uptodate = 1;
return 0;
 }
@@ -1891,6 +1923,25 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, 
const struct gfs2_rgrpd *b
 }
 
 /**
+ * fast_to_acquire - determine if a resource group will be fast to acquire
+ *
+ * If this is one of our preferred rgrps, it should be quicker to acquire,
+ * because we tried to set ourselves up as dlm lock master.
+ */
+static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
+{
+   struct gfs2_glock *gl = rgd->rd_gl;
+
+   if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
+   !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
+   !test_bit(GLF_DEMOTE, &gl->gl_flags))
+   return 1;
+   if (rgd->rd_flags & GFS2_RDF_PREFERRED)
+   return 1;
+   return 0;
+}
+

[Cluster-devel] [PATCH][try6] VFS: new want_holesize and got_holesize buffer_head flags for fiemap

2014-10-21 Thread Bob Peterson
Hi,

This is my sixth rework of this patch. The problem with the previous
version is that the underlying file system's block_map function
may set the buffer_head's b_state to 0, which may be misinterpreted
by fiemap as meaning it has returned the hole size in b_size.
This version implements a suggestion from Steve Whitehouse: it
improves on the design by (unfortunately) using two flags: a flag to
tell block_map to return hole size if possible, and another flag to
tell fiemap that the hole size has been returned. This way there is
no possibility of a misunderstanding.

The problem:
If you do a fiemap operation on a very large sparse file, it can take
an extremely long amount of time (we're talking days here) because
function __generic_block_fiemap does a block-for-block search when it
encounters a hole.

The solution:
Allow the underlying file system to return the hole size so that
function __generic_block_fiemap can quickly skip the hole. I have a
companion patch to GFS2 that takes advantage of the new flags to
speed up fiemap on sparse GFS2 files. Other file systems can do the
same as they see fit. For GFS2, the time it takes to skip a 1PB hole
in a sparse file goes from days to milliseconds.
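To put rough numbers on that (assuming GFS2's default 4KB block size): a 1PB
hole spans 2^50 / 2^12 = 2^38, or roughly 275 billion blocks, so probing it
block-by-block means hundreds of billions of get_block calls, whereas skipping
by the returned hole size takes only a handful of metadata walks.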

Patch description:
This patch changes function __generic_block_fiemap so that it sets a new
buffer_want_holesize bit. The new bit signals to the underlying file system
to return a hole size from its block_map function (if possible) in the
event that a hole is encountered at the requested block. If the block_map
function encounters a hole, and sets buffer_got_holesize, fiemap takes the
returned b_size to be the size of the hole, in bytes. It then skips the
hole and moves to the next block. This may be repeated several times
in a row, especially for large holes, due to possible limitations of the
fs-specific block_map function. This is still much faster than trying
each block individually when large holes are encountered. If the
block_map function does not set buffer_got_holesize, the request for
holesize has been ignored, and it falls back to today's method of doing a
block-by-block search for the next valid block.
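For illustration, here is a rough sketch (not part of this patch;
myfs_lookup_dblock() and myfs_hole_bytes() are invented placeholders for a
filesystem's own metadata lookup) of how an underlying file system's
get_block function might honor the two new bits:

	static int myfs_get_block(struct inode *inode, sector_t lblock,
				  struct buffer_head *bh_map, int create)
	{
		sector_t dblock = myfs_lookup_dblock(inode, lblock);

		if (dblock == 0) {	/* lblock falls in a hole */
			if (buffer_want_holesize(bh_map)) {
				/* return the hole size, in bytes, via b_size */
				bh_map->b_size = myfs_hole_bytes(inode, lblock);
				set_buffer_got_holesize(bh_map);
			}
			return 0;  /* leaving bh_map unmapped still means "hole" */
		}
		map_bh(bh_map, inode->i_sb, dblock);
		return 0;
	}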

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
 fs/ioctl.c  | 7 ++-
 include/linux/buffer_head.h | 4 
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 8ac3fad..9c07a40 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -291,13 +291,18 @@ int __generic_block_fiemap(struct inode *inode,
memset(&map_bh, 0, sizeof(struct buffer_head));
map_bh.b_size = len;
 
+   set_buffer_want_holesize(&map_bh);
ret = get_block(inode, start_blk, &map_bh, 0);
if (ret)
break;
 
/* HOLE */
if (!buffer_mapped(&map_bh)) {
-   start_blk++;
+   if (buffer_got_holesize(&map_bh))
+   start_blk += logical_to_blk(inode,
+   map_bh.b_size);
+   else
+   start_blk++;
 
/*
 * We want to handle the case where there is an
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 324329c..fc8e47c 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -37,6 +37,8 @@ enum bh_state_bits {
BH_Meta,/* Buffer contains metadata */
BH_Prio,/* Buffer should be submitted with REQ_PRIO */
BH_Defer_Completion, /* Defer AIO completion to workqueue */
+   BH_Want_Holesize, /* Return hole size if possible */
+   BH_Got_Holesize, /* Hole size has been returned */
 
BH_PrivateStart,/* not a state bit, but the first bit available
 * for private allocation by other entities
@@ -128,6 +130,8 @@ BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Unwritten, unwritten)
 BUFFER_FNS(Meta, meta)
+BUFFER_FNS(Want_Holesize, want_holesize)
+BUFFER_FNS(Got_Holesize, got_holesize)
 BUFFER_FNS(Prio, prio)
 BUFFER_FNS(Defer_Completion, defer_completion)
 
-- 
1.9.3



[Cluster-devel] [PATCH] GFS2: Speed up fiemap function by skipping holes

2014-10-21 Thread Bob Peterson
Hi,

This is the GFS2 companion patch to the one I previously posted for fiemap.

Patch description:
This patch detects the new want_holesize bit in block_map requests.
If a hole is found during fiemap, it calculates the size of the hole
based on the current metapath information, then it sets the new
buffer_got_holesize bit and returns the hole size in b_size.
Since the metapath only represents a section of the file, it can
only extrapolate to a certain size based on the current metapath
buffers. Therefore, fiemap may call blockmap several times to get
the hole size. The hole size is determined by a new function.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
 fs/gfs2/bmap.c | 70 --
 1 file changed, 68 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index f0b945a..450ea17 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -587,6 +587,62 @@ static int gfs2_bmap_alloc(struct inode *inode, const 
sector_t lblock,
 }
 
 /**
+ * hole_size - figure out the size of a hole
+ * @ip: The inode
+ * @lblock: The logical starting block number
+ * @mp: The metapath
+ *
+ * Returns: The hole size in bytes
+ *
+ */
+static u64 hole_size(struct inode *inode, sector_t lblock,
+struct metapath *mp)
+{
+   struct gfs2_inode *ip = GFS2_I(inode);
+   struct gfs2_sbd *sdp = GFS2_SB(inode);
+   unsigned int end_of_metadata = ip->i_height - 1;
+   u64 factor = 1;
+   int hgt = end_of_metadata;
+   u64 holesz = 0, holestep;
+   const __be64 *first, *end, *ptr;
+   const struct buffer_head *bh;
+   u64 isize = i_size_read(inode);
+   int zeroptrs;
+   struct metapath mp_eof;
+
+   /* Get a metapath to the very last byte */
+   find_metapath(sdp, (isize - 1) >> inode->i_blkbits, &mp_eof,
+ ip->i_height);
+   for (hgt = end_of_metadata; hgt >= 0; hgt--) {
+   bh = mp->mp_bh[hgt];
+   if (bh) {
+   zeroptrs = 0;
+   first = metapointer(hgt, mp);
+   end = (const __be64 *)(bh->b_data + bh->b_size);
+
+   for (ptr = first; ptr < end; ptr++) {
+   if (*ptr)
+   break;
+   else
+   zeroptrs++;
+   }
+   } else {
+   zeroptrs = sdp->sd_inptrs;
+   }
+   holestep = min(factor * zeroptrs,
+  isize - (lblock + (zeroptrs * holesz)));
+   holesz += holestep;
+   if (lblock + holesz >= isize)
+   return holesz << inode->i_blkbits;
+
+   factor *= sdp->sd_inptrs;
+   if (hgt && (mp->mp_list[hgt - 1] < mp_eof.mp_list[hgt - 1]))
+   (mp->mp_list[hgt - 1])++;
+   }
+   return holesz << inode->i_blkbits;
+}
+
+/**
  * gfs2_block_map - Map a block from an inode to a disk block
  * @inode: The inode
  * @lblock: The logical block number
@@ -645,11 +701,21 @@ int gfs2_block_map(struct inode *inode, sector_t lblock,
ret = lookup_metapath(ip, &mp);
if (ret < 0)
goto out;
-   if (ret != ip->i_height)
+   if (ret != ip->i_height) {
+   if (buffer_want_holesize(bh_map)) {
+   bh_map->b_size = hole_size(inode, lblock, &mp);
+   set_buffer_got_holesize(bh_map);
+   }
goto do_alloc;
+   }
ptr = metapointer(ip->i_height - 1, &mp);
-   if (*ptr == 0)
+   if (*ptr == 0) {
+   if (buffer_want_holesize(bh_map)) {
+   bh_map->b_size = hole_size(inode, lblock, &mp);
+   set_buffer_got_holesize(bh_map);
+   }
goto do_alloc;
+   }
map_bh(bh_map, inode->i_sb, be64_to_cpu(*ptr));
bh = mp.mp_bh[ip->i_height - 1];
len = gfs2_extent_length(bh->b_data, bh->b_size, ptr, maxlen, &eob);
-- 
1.9.3



Re: [Cluster-devel] [PATCH][try6] VFS: new want_holesize and got_holesize buffer_head flags for fiemap

2014-10-22 Thread Bob Peterson
- Original Message -
 This looks like a big indicator that get_blocks is the wrong
 interface for fiemap.

Hi Christoph,

Yes, I thought about that.
One of my early prototypes had a separate function used by fiemap.
Function __generic_block_fiemap would call get_block() which
returned an indication of a hole as it does today. When it saw
the hole, fiemap called a new function get_hole_size() that was
passed in like get_block. The problem is: it's grossly inefficient,
since the new function get_hole_size() has to redo most of the work
that get_block just did (at least in the case of GFS2). (Which in the
case of a 1PB sparse file is non-trivial, since it involves several
levels of metadata indirection). Combining it with get_block made it
much more efficient.

Making a separate get_block_map_fiemap() function just seems like an
exercise in redundancy.

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH][try6] VFS: new want_holesize and got_holesize buffer_head flags for fiemap

2014-10-23 Thread Bob Peterson
- Original Message -
 On Wed, Oct 22, 2014 at 08:28:53AM -0400, Bob Peterson wrote:
  Yes, I thought about that.
  One of my early prototypes had a separate function used by fiemap.
  Function __generic_block_fiemap would call get_block() which
  returned an indication of a hole as it does today. When it saw
  the hole, fiemap called a new function get_hole_size() that was
  passed in like get_block. The problem is: it's grossly inefficient,
  since the new function get_hole_size() has to redo most of the work
  that get_block just did (at least in the case of GFS2). (Which in the
  case of a 1PB sparse file is non-trivial, since it involves several
  levels of metadata indirection). Combining it with get_block made it
  much more efficient.
  
  Making a separate get_block_map_fiemap() function just seems like an
  exercise in redundancy.
 
 I was thinking of replacing get_blocks entirely.  We're not actually
 using a buffer_head in fiemap, so the interface seems somewhat awkward.
 If it used something like the iomap interface proposed by Dave long
 time ago we'd have a much saner interface that for example XFS could use
 as well.

Hi Christoph. Can you send a link to the thread regarding Dave's iomap
proposal? I don't recall it offhand, so I don't know what it was or
why it was never implemented. I assume you mean Dave Chinner. Maybe it's
time to revisit the concept as a long-term solution.

In the meantime, do you otherwise object to this patch as a short-term
solution?

Regards,

Bob Peterson
Red Hat File Systems



[Cluster-devel] [GFS2 PATCH 1/3] GFS2: Set of distributed preferences for rgrps

2014-10-24 Thread Bob Peterson
This patch tries to use the journal numbers to evenly distribute
which node prefers which resource group for block allocations. This
is to help performance.
---
 fs/gfs2/incore.h |  1 +
 fs/gfs2/rgrp.c   | 66 +++-
 2 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 39e7e99..1b89918 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -97,6 +97,7 @@ struct gfs2_rgrpd {
 #define GFS2_RDF_CHECK 0x1000 /* check for unlinked inodes */
 #define GFS2_RDF_UPTODATE  0x2000 /* rg is up to date */
 #define GFS2_RDF_ERROR 0x4000 /* error in rg */
+#define GFS2_RDF_PREFERRED 0x8000 /* This rgrp is preferred */
 #define GFS2_RDF_MASK  0xf000 /* mask for internal flags */
spinlock_t rd_rsspin;   /* protects reservation related vars */
struct rb_root rd_rstree;   /* multi-block reservation tree */
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7474c41..f65e56b 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -936,7 +936,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
rgd-rd_gl-gl_vm.start = rgd-rd_addr * bsize;
rgd-rd_gl-gl_vm.end = rgd-rd_gl-gl_vm.start + (rgd-rd_length * 
bsize) - 1;
rgd-rd_rgl = (struct gfs2_rgrp_lvb *)rgd-rd_gl-gl_lksb.sb_lvbptr;
-   rgd-rd_flags = ~GFS2_RDF_UPTODATE;
+   rgd-rd_flags = ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
if (rgd-rd_data  sdp-sd_max_rg_data)
sdp-sd_max_rg_data = rgd-rd_data;
spin_lock(sdp-sd_rindex_spin);
@@ -955,6 +955,36 @@ fail:
 }
 
 /**
+ * set_rgrp_preferences - Run all the rgrps, selecting some we prefer to use
+ * @sdp: the GFS2 superblock
+ *
+ * The purpose of this function is to select a subset of the resource groups
+ * and mark them as PREFERRED. We do it in such a way that each node prefers
+ * to use a unique set of rgrps to minimize glock contention.
+ */
+static void set_rgrp_preferences(struct gfs2_sbd *sdp)
+{
+   struct gfs2_rgrpd *rgd, *first;
+   int i;
+
+   /* Skip an initial number of rgrps, based on this node's journal ID.
+  That should start each node out on its own set. */
+   rgd = gfs2_rgrpd_get_first(sdp);
+   for (i = 0; i < sdp->sd_lockstruct.ls_jid; i++)
+   rgd = gfs2_rgrpd_get_next(rgd);
+   first = rgd;
+
+   do {
+   rgd->rd_flags |= GFS2_RDF_PREFERRED;
+   for (i = 0; i < sdp->sd_journals; i++) {
+   rgd = gfs2_rgrpd_get_next(rgd);
+   if (rgd == first)
+   break;
+   }
+   } while (rgd != first);
+}
+
+/**
  * gfs2_ri_update - Pull in a new resource index from the disk
  * @ip: pointer to the rindex inode
  *
@@ -973,6 +1003,8 @@ static int gfs2_ri_update(struct gfs2_inode *ip)
if (error  0)
return error;
 
+   set_rgrp_preferences(sdp);
+
sdp-sd_rindex_uptodate = 1;
return 0;
 }
@@ -1891,6 +1923,25 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, 
const struct gfs2_rgrpd *b
 }
 
 /**
+ * fast_to_acquire - determine if a resource group will be fast to acquire
+ *
+ * If this is one of our preferred rgrps, it should be quicker to acquire,
+ * because we tried to set ourselves up as dlm lock master.
+ */
+static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
+{
+   struct gfs2_glock *gl = rgd->rd_gl;
+
+   if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
+   !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
+   !test_bit(GLF_DEMOTE, &gl->gl_flags))
+   return 1;
+   if (rgd->rd_flags & GFS2_RDF_PREFERRED)
+   return 1;
+   return 0;
+}
+
+/**
  * gfs2_inplace_reserve - Reserve space in the filesystem
  * @ip: the inode to reserve space for
  * @ap: the allocation parameters
@@ -1932,10 +1983,15 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, const 
struct gfs2_alloc_parms *a
rg_locked = 0;
if (skip && skip--)
goto next_rgrp;
-   if (!gfs2_rs_active(rs) && (loops < 2) &&
-    gfs2_rgrp_used_recently(rs, 1000) &&
-    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
-   goto next_rgrp;
+   if (!gfs2_rs_active(rs)) {
+   if (loops == 0 &&
+   !fast_to_acquire(rs->rs_rbm.rgd))
+   goto next_rgrp;
+   if ((loops < 3) &&
+   gfs2_rgrp_used_recently(rs, 1000) &&
+   gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+   goto next_rgrp;
+   }
error = gfs2_glock_nq_init(rs-rs_rbm.rgd-rd_gl,
 

[Cluster-devel] [GFS2 PATCH 3/3] GFS2: If we use up our block reservation, request more next time

2014-10-24 Thread Bob Peterson
If we run out of blocks for a given multi-block allocation, we obviously
did not reserve enough. We should reserve more blocks for the next
reservation to reduce fragmentation. This patch increases the size hint
for reservations when they run out.
---
 fs/gfs2/rgrp.c | 3 +++
 fs/gfs2/rgrp.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index f65e56b..d4b9d93 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -2251,6 +2251,9 @@ static void gfs2_adjust_reservation(struct gfs2_inode *ip,
trace_gfs2_rs(rs, TRACE_RS_CLAIM);
if (rs->rs_free && !ret)
goto out;
+   /* We used up our block reservation, so we should
+  reserve more blocks next time. */
+   atomic_add(RGRP_RSRV_ADDBLKS, &rs->rs_sizehint);
}
__rs_deltree(rs);
}
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index 5d8f085..b104f4a 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -20,6 +20,7 @@
  */
 #define RGRP_RSRV_MINBYTES 8
 #define RGRP_RSRV_MINBLKS ((u32)(RGRP_RSRV_MINBYTES * GFS2_NBBY))
+#define RGRP_RSRV_ADDBLKS 64
 
 struct gfs2_rgrpd;
 struct gfs2_sbd;
-- 
1.9.3



[Cluster-devel] [GFS2 PATCH 0/3] Patches to reduce GFS2 fragmentation

2014-10-24 Thread Bob Peterson
Hi,

This is a revised version of my patches to improve GFS2 performance and
reduce GFS2 fragmentation. As suggested by Steve Whitehouse, the first patch
has been modified to base its distribution of resource groups on the number
of journals.

Also as suggested, the patch that kept file close from deleting rgrp
reservations has been removed from the set, leaving only three patches.
The removal of that patch caused fragmentation to climb 15 percent, but it's
still much (41 percent) better than the original stock kernels. I agree that
use cases like untar would make for much worse fragmentation problems if
that patch were to remain.

I've re-tested with these three patches and found that performance and
fragmentation are still good. Here are numbers:

STOCK RHEL7 kernel (none of these patches):

Run times:
Run 1 time: 2hr 40min 33sec
Run 2 time: 2hr 39min 52sec
Run 3 time: 2hr 39min 31sec
Run 4 time: 2hr 33min 57sec
Run 5 time: 2hr 41min 6sec

Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES =  744708
EXTENT COUNT FOR OUTPUT FILES =  749868
EXTENT COUNT FOR OUTPUT FILES =  721862
EXTENT COUNT FOR OUTPUT FILES =  635301
EXTENT COUNT FOR OUTPUT FILES =  689263 (Average 708200)

The times are bad and the fragmentation level is also bad. If I add
just the first patch, GFS2: Set of distributed preferences for rgrps
the performance improves, but the fragmentation is worse (I only did three
iterations this time):

Run times:
Run 1 time: 2hr 2min 47sec
Run 2 time: 2hr 8min 37sec
Run 3 time: 2hr 10min 0sec

Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES =  1011217
EXTENT COUNT FOR OUTPUT FILES =  1025973
EXTENT COUNT FOR OUTPUT FILES =  1070163

With these three patches, both performance and fragmentation are good:

Run 1 time: 2hr 9min 49sec
Run 2 time: 2hr 10min 15sec
Run 3 time: 2hr 8min 35sec
Run 4 time: 2hr 15min 15sec
Run 5 time: 2hr 7min 4sec
Run 6 time: 2hr 8min 8sec
Run 7 time: 2hr 7min 54sec

EXTENT COUNT FOR OUTPUT FILES =  406366
EXTENT COUNT FOR OUTPUT FILES =  432978
EXTENT COUNT FOR OUTPUT FILES =  428736
EXTENT COUNT FOR OUTPUT FILES =  419736
EXTENT COUNT FOR OUTPUT FILES =  419040
EXTENT COUNT FOR OUTPUT FILES =  422774
EXTENT COUNT FOR OUTPUT FILES =  409281 (Average 419844)

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
Bob Peterson (3):
  GFS2: Set of distributed preferences for rgrps
  GFS2: Only increase rs_sizehint
  GFS2: If we use up our block reservation, request more next time

 fs/gfs2/file.c   |  3 ++-
 fs/gfs2/incore.h |  1 +
 fs/gfs2/rgrp.c   | 69 
 fs/gfs2/rgrp.h   |  1 +
 4 files changed, 68 insertions(+), 6 deletions(-)

-- 
1.9.3



[Cluster-devel] [GFS2 PATCH 2/3] GFS2: Only increase rs_sizehint

2014-10-24 Thread Bob Peterson
If an application does a sequence of (1) big write, (2) little write
we don't necessarily want to reset the size hint based on the smaller
size. The fact that they did any big writes implies they may do more,
and therefore we should try to allocate bigger block reservations, even
if the last few were small writes. Therefore this patch changes function
gfs2_size_hint so that the size hint can only grow; it cannot shrink.
This is especially important where there are multiple writers.
---
 fs/gfs2/file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 7f4ed3d..f64084b 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -337,7 +337,8 @@ static void gfs2_size_hint(struct file *filep, loff_t 
offset, size_t size)
size_t blks = (size + sdp->sd_sb.sb_bsize - 1) >>
sdp->sd_sb.sb_bsize_shift;
int hint = min_t(size_t, INT_MAX, blks);
 
-   atomic_set(&ip->i_res->rs_sizehint, hint);
+   if (hint > atomic_read(&ip->i_res->rs_sizehint))
+   atomic_set(&ip->i_res->rs_sizehint, hint);
 }
 
 /**
-- 
1.9.3



Re: [Cluster-devel] [GFS2 PATCH 1/3] GFS2: Set of distributed preferences for rgrps

2014-10-27 Thread Bob Peterson
- Original Message -
  +   if ((loops < 3) &&
  +   gfs2_rgrp_used_recently(rs, 1000) &&
  +   gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
  +   goto next_rgrp;
  +   }
 This makes no sense, we end the loop when loops == 3 so that these
 conditions will be applied in every case which is not what we want. We
 must always end up doing a search of every rgrp in the worst case, in
 order that if there is some space left somewhere, we will eventually
 find it.
 
 Definitely better wrt figuring out which rgrps to prefer, but I'm not
 yet convinced about this logic. The whole point of the congestion logic
 is to figure out ahead of time, whether it will take a long time to
 access that rgrp, so it seems that this is not quite right, otherwise
 there should be no need to bypass it like this. The fast_to_acquire
 logic should at least by merged into the rgrp_congested logic, possibly
 by just reducing the threshold at which congestion is measured.
 
 It might be useful to introduce a tracepoint for when we reject and rgrp
 during allocation, with a reason as to why it was rejected, so that it
 is easier to see whats going on here,
 
 Steve.

Hi,

Sigh. You're right: Good catch. The problem is that I've done more than 30
attempts at trying to get this right in my git tree, each of which has
up to 15 patches for various things. Somewhere around iteration 20, I
dropped an important change. My intent was always to add another layer
of rgrp criteria, so loops would be 4 rather than 3. I had done this
with a different patch, but it got dropped by accident. The 3 original
layers are as follows:

loop 0: Reject rgrps that are likely congested (based on past history)
and rgrps where we just found congestion.
Only accept rgrps for which we can get a multi-block reservation.
loop 1: Reject rgrps that are likely congested (based on past history)
and rgrps where we just found congestion. Accept rgrps that have
enough free space, even if we can't get a reservation.
loop 2: Don't ever reject rgrps because we're out of ideal conditions.

The new scheme was intended to add a new layer 0 which only accepts rgrps
within a preferred subset of rgrps. In other words:

loop 0: Reject rgrps that aren't in our preferred subset of rgrps.
Reject rgrps that are likely congested (based on past history)
and rgrps where we just found congestion.
Only accept rgrps for which we can get a multi-block reservation.
loop 1: Reject rgrps that are likely congested (based on past history)
and rgrps where we just found congestion.
Only accept rgrps for which we can get a multi-block reservation.
loop 2: Reject rgrps that are likely congested (based on past history)
and rgrps where we just found congestion. Accept any rgrp that has
enough free space, even if we can't get a reservation.
loop 3: Don't ever reject rgrps because we're out of ideal conditions.

But is 4 loops too many? I could combine 0 and 1, and in fact, today's code
accidentally does just that. The mistake was probably that I had been
experimenting with 3 versus 4 layers and had switched them back and forth
a few times for various tests.
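For clarity, the intended four-layer criteria inside gfs2_inplace_reserve
would look roughly like this (a sketch of the scheme described above, not a
tested patch; it assumes the surrounding retry loop is allowed to make four
passes, loops 0 through 3, instead of three):

	if (!gfs2_rs_active(rs)) {
		/* loop 0: only rgrps that should be fast to acquire
		   (preferred, or whose glock we already hold locally) */
		if (loops == 0 && !fast_to_acquire(rs->rs_rbm.rgd))
			goto next_rgrp;
		/* loops 0-2: skip rgrps that are likely congested */
		if (loops < 3 &&
		    gfs2_rgrp_used_recently(rs, 1000) &&
		    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
			goto next_rgrp;
		/* loop 3: never reject an rgrp just for being non-ideal */
	}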

Regards,

Bob Peterson
Red Hat File Systems



[Cluster-devel] [GFS2 PATCH 0/3] Patches to reduce GFS2 fragmentation

2014-10-29 Thread Bob Peterson
Hi,

This is a revised version of my patches to improve GFS2 performance and
reduce GFS2 fragmentation. As suggested by Steve Whitehouse, the first patch
has been modified to base its distribution of resource groups on the number
of journals. It also corrects the criteria on the final loop, changing
loops < 3 back to loops < 2 as it should be. Also, I re-tested to make
sure nothing had regressed. I updated the timing numbers below.

Also as suggested, the patch that kept file close from deleting rgrp
reservations has been removed from the set, leaving only three patches.
The removal of that patch caused fragmentation to climb 15 percent, but it's
still much (41 percent) better than the original stock kernels. I agree that
use cases like untar would make for much worse fragmentation problems if
that patch were to remain.

I've re-tested with these three patches and found that performance and
fragmentation are still good. Here are numbers:

STOCK RHEL7 kernel (none of these patches):

Run times:
Run 1 time: 2hr 40min 33sec
Run 2 time: 2hr 39min 52sec
Run 3 time: 2hr 39min 31sec
Run 4 time: 2hr 33min 57sec
Run 5 time: 2hr 41min 6sec

Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES =  744708
EXTENT COUNT FOR OUTPUT FILES =  749868
EXTENT COUNT FOR OUTPUT FILES =  721862
EXTENT COUNT FOR OUTPUT FILES =  635301
EXTENT COUNT FOR OUTPUT FILES =  689263

The times are bad and the fragmentation level is also bad. If I add
just the first patch, GFS2: Set of distributed preferences for rgrps
the performance improves, but the fragmentation is worse (I only did three
iterations this time):

Run times:
Run 1 time: 2hr 2min 47sec
Run 2 time: 2hr 8min 37sec
Run 3 time: 2hr 10min 0sec

Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES =  1011217
EXTENT COUNT FOR OUTPUT FILES =  1025973
EXTENT COUNT FOR OUTPUT FILES =  1070163

With these three patches, both performance and fragmentation are good:

Run 1 time: 2hr 11min 11sec
Run 2 time: 2hr 8min 57sec
Run 3 time: 2hr 10min 43sec
Run 4 time: 2hr 12min 50sec
Run 5 time: 2hr 15min 32sec
Run 6 time: 2hr 10min 58sec

EXTENT COUNT FOR OUTPUT FILES =  424329
EXTENT COUNT FOR OUTPUT FILES =  430663
EXTENT COUNT FOR OUTPUT FILES =  430186
EXTENT COUNT FOR OUTPUT FILES =  421635
EXTENT COUNT FOR OUTPUT FILES =  416706
EXTENT COUNT FOR OUTPUT FILES =  418887

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
Bob Peterson (3):
  GFS2: Set of distributed preferences for rgrps
  GFS2: Only increase rs_sizehint
  GFS2: If we use up our block reservation, request more next time

 fs/gfs2/file.c   |  3 ++-
 fs/gfs2/incore.h |  1 +
 fs/gfs2/rgrp.c   | 69 
 fs/gfs2/rgrp.h   |  1 +
 4 files changed, 68 insertions(+), 6 deletions(-)

-- 
1.9.3



[Cluster-devel] [GFS2 PATCH 3/3] GFS2: If we use up our block reservation, request more next time

2014-10-29 Thread Bob Peterson
If we run out of blocks for a given multi-block allocation, we obviously
did not reserve enough. We should reserve more blocks for the next
reservation to reduce fragmentation. This patch increases the size hint
for reservations when they run out.
---
 fs/gfs2/rgrp.c | 3 +++
 fs/gfs2/rgrp.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index f4e4a0c..9150207 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -2251,6 +2251,9 @@ static void gfs2_adjust_reservation(struct gfs2_inode *ip,
trace_gfs2_rs(rs, TRACE_RS_CLAIM);
if (rs->rs_free && !ret)
goto out;
+   /* We used up our block reservation, so we should
+  reserve more blocks next time. */
+   atomic_add(RGRP_RSRV_ADDBLKS, &rs->rs_sizehint);
}
__rs_deltree(rs);
}
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index 5d8f085..b104f4a 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -20,6 +20,7 @@
  */
 #define RGRP_RSRV_MINBYTES 8
 #define RGRP_RSRV_MINBLKS ((u32)(RGRP_RSRV_MINBYTES * GFS2_NBBY))
+#define RGRP_RSRV_ADDBLKS 64
 
 struct gfs2_rgrpd;
 struct gfs2_sbd;
-- 
1.9.3



[Cluster-devel] [GFS2 PATCH 1/3] GFS2: Set of distributed preferences for rgrps

2014-10-29 Thread Bob Peterson
This patch tries to use the journal numbers to evenly distribute
which node prefers which resource group for block allocations. This
is to help performance.
---
 fs/gfs2/incore.h |  1 +
 fs/gfs2/rgrp.c   | 66 +++-
 2 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 39e7e99..1b89918 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -97,6 +97,7 @@ struct gfs2_rgrpd {
 #define GFS2_RDF_CHECK 0x1000 /* check for unlinked inodes */
 #define GFS2_RDF_UPTODATE  0x2000 /* rg is up to date */
 #define GFS2_RDF_ERROR 0x4000 /* error in rg */
+#define GFS2_RDF_PREFERRED 0x8000 /* This rgrp is preferred */
 #define GFS2_RDF_MASK  0xf000 /* mask for internal flags */
spinlock_t rd_rsspin;   /* protects reservation related vars */
struct rb_root rd_rstree;   /* multi-block reservation tree */
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7474c41..f4e4a0c 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -936,7 +936,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
rgd-rd_gl-gl_vm.start = rgd-rd_addr * bsize;
rgd-rd_gl-gl_vm.end = rgd-rd_gl-gl_vm.start + (rgd-rd_length * 
bsize) - 1;
rgd-rd_rgl = (struct gfs2_rgrp_lvb *)rgd-rd_gl-gl_lksb.sb_lvbptr;
-   rgd-rd_flags = ~GFS2_RDF_UPTODATE;
+   rgd-rd_flags = ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
if (rgd-rd_data  sdp-sd_max_rg_data)
sdp-sd_max_rg_data = rgd-rd_data;
spin_lock(sdp-sd_rindex_spin);
@@ -955,6 +955,36 @@ fail:
 }
 
 /**
+ * set_rgrp_preferences - Run all the rgrps, selecting some we prefer to use
+ * @sdp: the GFS2 superblock
+ *
+ * The purpose of this function is to select a subset of the resource groups
+ * and mark them as PREFERRED. We do it in such a way that each node prefers
+ * to use a unique set of rgrps to minimize glock contention.
+ */
+static void set_rgrp_preferences(struct gfs2_sbd *sdp)
+{
+   struct gfs2_rgrpd *rgd, *first;
+   int i;
+
+   /* Skip an initial number of rgrps, based on this node's journal ID.
+  That should start each node out on its own set. */
+   rgd = gfs2_rgrpd_get_first(sdp);
+   for (i = 0; i < sdp->sd_lockstruct.ls_jid; i++)
+   rgd = gfs2_rgrpd_get_next(rgd);
+   first = rgd;
+
+   do {
+   rgd->rd_flags |= GFS2_RDF_PREFERRED;
+   for (i = 0; i < sdp->sd_journals; i++) {
+   rgd = gfs2_rgrpd_get_next(rgd);
+   if (rgd == first)
+   break;
+   }
+   } while (rgd != first);
+}
+
+/**
  * gfs2_ri_update - Pull in a new resource index from the disk
  * @ip: pointer to the rindex inode
  *
@@ -973,6 +1003,8 @@ static int gfs2_ri_update(struct gfs2_inode *ip)
if (error  0)
return error;
 
+   set_rgrp_preferences(sdp);
+
sdp-sd_rindex_uptodate = 1;
return 0;
 }
@@ -1891,6 +1923,25 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, 
const struct gfs2_rgrpd *b
 }
 
 /**
+ * fast_to_acquire - determine if a resource group will be fast to acquire
+ *
+ * If this is one of our preferred rgrps, it should be quicker to acquire,
+ * because we tried to set ourselves up as dlm lock master.
+ */
+static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
+{
+   struct gfs2_glock *gl = rgd->rd_gl;
+
+   if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
+   !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
+   !test_bit(GLF_DEMOTE, &gl->gl_flags))
+   return 1;
+   if (rgd->rd_flags & GFS2_RDF_PREFERRED)
+   return 1;
+   return 0;
+}
+
+/**
  * gfs2_inplace_reserve - Reserve space in the filesystem
  * @ip: the inode to reserve space for
  * @ap: the allocation parameters
@@ -1932,10 +1983,15 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, const 
struct gfs2_alloc_parms *a
rg_locked = 0;
if (skip && skip--)
goto next_rgrp;
-   if (!gfs2_rs_active(rs) && (loops < 2) &&
-    gfs2_rgrp_used_recently(rs, 1000) &&
-    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
-   goto next_rgrp;
+   if (!gfs2_rs_active(rs)) {
+   if (loops == 0 &&
+   !fast_to_acquire(rs->rs_rbm.rgd))
+   goto next_rgrp;
+   if ((loops < 2) &&
+   gfs2_rgrp_used_recently(rs, 1000) &&
+   gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+   goto next_rgrp;
+   }
error = gfs2_glock_nq_init(rs-rs_rbm.rgd-rd_gl,
 

[Cluster-devel] [GFS2 PATCH 2/3] GFS2: Only increase rs_sizehint

2014-10-29 Thread Bob Peterson
If an application does a sequence of (1) big write, (2) little write
we don't necessarily want to reset the size hint based on the smaller
size. The fact that they did any big writes implies they may do more,
and therefore we should try to allocate bigger block reservations, even
if the last few were small writes. Therefore this patch changes function
gfs2_size_hint so that the size hint can only grow; it cannot shrink.
This is especially important where there are multiple writers.
---
 fs/gfs2/file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 80dd44d..5ebe568 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -337,7 +337,8 @@ static void gfs2_size_hint(struct file *filep, loff_t 
offset, size_t size)
size_t blks = (size + sdp->sd_sb.sb_bsize - 1) >>
sdp->sd_sb.sb_bsize_shift;
int hint = min_t(size_t, INT_MAX, blks);
 
-   atomic_set(&ip->i_res->rs_sizehint, hint);
+   if (hint > atomic_read(&ip->i_res->rs_sizehint))
+   atomic_set(&ip->i_res->rs_sizehint, hint);
 }
 
 /**
-- 
1.9.3



[Cluster-devel] [gfs2-utils PATCH] fsck.gfs2: Detect and correct corrupt journals

2014-11-13 Thread Bob Peterson
Hi,

This patch changes fsck.gfs2 so that it can detect corruption in the
journal sequence numbering scheme. If the corruption is within a
certain tolerance (currently 10 sequencing errors), the journal sequence
numbers are fixed and the journal is replayed as normal. If there are more
errors than can be tolerated, the problem is detected and the journal
may be cleared.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
 gfs2/fsck/fs_recovery.c | 54 +++--
 1 file changed, 43 insertions(+), 11 deletions(-)

diff --git a/gfs2/fsck/fs_recovery.c b/gfs2/fsck/fs_recovery.c
index bc4210d..5aa455e 100644
--- a/gfs2/fsck/fs_recovery.c
+++ b/gfs2/fsck/fs_recovery.c
@@ -17,6 +17,7 @@
 #include util.h
 
 #define JOURNAL_NAME_SIZE 16
+#define JOURNAL_SEQ_TOLERANCE 10
 
 unsigned int sd_found_jblocks = 0, sd_replayed_jblocks = 0;
 unsigned int sd_found_metablocks = 0, sd_replayed_metablocks = 0;
@@ -138,10 +139,15 @@ static int buf_lo_scan_elements(struct gfs2_inode *ip, 
unsigned int start,
check_magic = ((struct gfs2_meta_header *)
   (bh_ip->b_data))->mh_magic;
check_magic = be32_to_cpu(check_magic);
-   if (check_magic != GFS2_MAGIC)
+   if (check_magic != GFS2_MAGIC) {
+   log_err(_(Journal corruption detected at block #
+ %lld (0x%llx) for journal+0x%x.\n),
+   (unsigned long long)blkno, (unsigned long 
long)blkno,
+   start);
error = -EIO;
-   else
+   } else {
bmodified(bh_ip);
+   }
 
brelse(bh_log);
brelse(bh_ip);
@@ -313,8 +319,11 @@ static int foreach_descriptor(struct gfs2_inode *ip, 
unsigned int start,
brelse(bh);
continue;
}
-   if (error == 1)
+   if (error == 1) {
+   log_err(_(Journal corruption detected at 
+ journal+0x%x.\n), start);
error = -EIO;
+   }
bmodified(bh);
brelse(bh);
return error;
@@ -354,10 +363,13 @@ static int foreach_descriptor(struct gfs2_inode *ip, 
unsigned int start,
 }
 
 /**
- * fix_journal_seq_no - Fix log header sequencing problems
+ * check_journal_seq_no - Check and Fix log header sequencing problems
  * @ip: the journal incore inode
+ * @fix: if 1, fix the sequence numbers, otherwise just report the problem
+ *
+ * Returns: The number of sequencing errors (hopefully none).
  */
-static int fix_journal_seq_no(struct gfs2_inode *ip)
+static int check_journal_seq_no(struct gfs2_inode *ip, int fix)
 {
int error = 0, wrapped = 0;
uint32_t jd_blocks = ip->i_di.di_size / ip->i_sbd->sd_sb.sb_bsize;
@@ -368,6 +380,7 @@ static int fix_journal_seq_no(struct gfs2_inode *ip)
uint64_t dblock;
uint32_t extlen;
struct gfs2_buffer_head *bh;
+   int seq_errors = 0;
 
memset(&lh, 0, sizeof(lh));
for (blk = 0; blk < jd_blocks; blk++) {
@@ -389,6 +402,10 @@ static int fix_journal_seq_no(struct gfs2_inode *ip)
prev_seq = lh.lh_sequence;
continue;
}
+   if (!fix) {
+   seq_errors++;
+   continue;
+   }
log_err( _(Journal block %u (0x%x): sequence no. 0x%llx 
   out of order.\n), blk, blk, lh.lh_sequence);
log_info( _(Low: 0x%llx, High: 0x%llx, Prev: 0x%llx\n),
@@ -404,7 +421,9 @@ static int fix_journal_seq_no(struct gfs2_inode *ip)
gfs2_log_header_out(lh, bh);
brelse(bh);
}
-   return 0;
+   if (seq_errors && fix)
+   log_err(_(%d sequence errors fixed.\n), seq_errors);
+   return seq_errors;
 }
 
 /**
@@ -456,6 +475,15 @@ static int gfs2_recover_journal(struct gfs2_inode *ip, int 
j, int preen,
 
osi_list_init(sd_revoke_list);
error = gfs2_find_jhead(ip, head);
+   if (!error) {
+   error = check_journal_seq_no(ip, 0);
+   if (error > JOURNAL_SEQ_TOLERANCE) {
+   log_err( _(Journal #%d (\journal%d\) has %d 
+  sequencing errors; tolerance is %d.\n),
+j+1, j, error, JOURNAL_SEQ_TOLERANCE);
+   goto out;
+   }
+   }
if (error) {
if (opts.no) {
log_err( _(Journal #%d (\journal%d\) is 
corrupt\n),j+1, j);
@@ -481,7 +509,7 @@ static int gfs2_recover_journal(struct gfs2_inode *ip, int 
j, int

Re: [Cluster-devel] [PATCH 1/2] fsck.gfs2: Improve reporting of pass timings

2014-11-14 Thread Bob Peterson
- Original Message -
 The days value is not currently reported in the pass timings but the
 hours value is still reported modulo 24.  Drop the use of gmtime(3) as
 it's more appropriate for calendar time operations than elapsed time,
 and add a simple duration reporting function which matches the existing
 output format.
 
 Signed-off-by: Andrew Price anpr...@redhat.com
 ---

ACK

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH 2/2] mkfs.gfs2: Revert default resource group size

2014-11-14 Thread Bob Peterson
- Original Message -
 Choose a more performant resource group size than the maximum by default
 until gfs2 is more tuned for the larger number of bitmap blocks required
 for 2GB resource groups.
 
 Signed-off-by: Andrew Price anpr...@redhat.com
 ---
  gfs2/mkfs/main_mkfs.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/gfs2/mkfs/main_mkfs.c b/gfs2/mkfs/main_mkfs.c
 index b6353dd..2590ec8 100644
 --- a/gfs2/mkfs/main_mkfs.c
 +++ b/gfs2/mkfs/main_mkfs.c
 @@ -147,7 +147,7 @@ static void opts_init(struct mkfs_opts *opts)
   opts-bsize = GFS2_DEFAULT_BSIZE;
   opts-jsize = GFS2_DEFAULT_JSIZE;
   opts-qcsize = GFS2_DEFAULT_QCSIZE;
 - opts-rgsize = GFS2_MAX_RGSIZE;
 + opts-rgsize = GFS2_DEFAULT_RGSIZE;
   opts-lockproto = lock_dlm;
   opts-locktable = ;
   opts-confirm = 1;
 --
 1.9.3
 

ACK

Bob Peterson
Red Hat File Systems



[Cluster-devel] [GFS2 PATCH] GFS2: Simplify directory filling, hashing and unhashing functions

2014-11-19 Thread Bob Peterson
Hi,

This patch implements a new cookie encoding scheme for encoding, decoding
and filling directory entries. This new scheme prevents duplicate cookies,
so it eliminates the need to sort the entries to prevent duplicates.
Thus, it simplifies the code greatly.
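As an illustration of the cookie layout (a sketch only, assuming the 8-bit
leaf-extension / 17-bit hash-table-index / 7-bit in-leaf-offset split implied
by the shift amounts in the new macros in the patch below), the 32-bit
cookie packs and unpacks like this:

	/* Hypothetical helpers, not part of the patch. */
	static u32 cookie_pack(u32 leaf_ext, u32 hti, u32 off)
	{
		return ((leaf_ext & 0xff) << 24) |
		       ((hti & 0x1ffff) << 7) |
		       (off & 0x7f);
	}

	static void cookie_unpack(u32 cookie, u32 *leaf_ext, u32 *hti, u32 *off)
	{
		*leaf_ext = (cookie >> 24) & 0xff;
		*hti = (cookie >> 7) & 0x1ffff;
		*off = cookie & 0x7f;
	}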

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
 fs/gfs2/dir.c | 165 +++---
 1 file changed, 66 insertions(+), 99 deletions(-)

diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 5d4261f..039479f 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -58,7 +58,6 @@
 #include linux/slab.h
 #include linux/spinlock.h
 #include linux/buffer_head.h
-#include linux/sort.h
 #include linux/gfs2_ondisk.h
 #include linux/crc32.h
 #include linux/vmalloc.h
@@ -80,8 +79,48 @@
 
 #define MAX_RA_BLOCKS 32 /* max read-ahead blocks */
 
-#define gfs2_disk_hash2offset(h) (((u64)(h)) >> 1)
-#define gfs2_dir_offset2hash(p) ((u32)(((u64)(p)) << 1))
+/**
+ * Notes about the directory hash table:
+ *
+ * Hash table index is hash >> (32 - dip->i_depth). Max directory depth is 17.
+ * 32 - 17 = 15. The hash value is 32-bits, so 0x1  15 = 0x2.
+ * So the maximum index to the hash table is 0x1.
+ *
+ * At the same time, pages are at most 4K, and GFS2 dirents are 96 bytes.
+ * According to other comments, 99 is the maximum number of entries that can
+ * fit in a single leaf block. However, if you do the math: Take one 4K block,
+ * subtract out the leaf header, so 0x1000 - 0x68 = 0xf98 bytes available.
+ * The minimum dirent, including a 1-character file name is 0x30 bytes long.
+ * So 0xf98 / 0x30 = 0x53, which is 83 decimal. So the theoretical limit is
+ * 83 index entries per leaf block.
+ *
+ * So we can represent the index number with 17 bits, plus the index in 7.
+ * Therefore, we can represent the offset to any dirent by a 32-bit number:
+ * 18-bits for hash table index, 7 for page index, 8 for leaf block extension
+ * 
+ *0 1
+ *01234567 89abcdef 01234567 89abcdef
+ *   iooo
+ *leafblk  Hash table index   Offset
+ * Leaf blk:  (0xff)
+ * HTI Max Value:    1(0x1)
+ * Pg offset Max: 111 (0x7f)
+ **/
+
+#define gfs2_dir_offset2pge(p) (((u32)p >> 24) & 0xff) /* leaf extension blk */
+#define gfs2_dir_pge2offset(p) (p << 24)
+
+#define gfs2_dir_offset2hti(p) (((u32)p >> 7) & 0x01) /* hash tbl index */
+#define gfs2_dir_hti2offset(p) (p << 7)
+
+#define gfs2_dir_offset2pgo(p) ((u32)p & 0x7f)   /* leaf blk page offset */
+#define gfs2_dir_pgo2offset(p) (p)
+
+static inline u32 dirloc2offset(u32 page, u32 hti, u32 off)
+{
+   return (gfs2_dir_pge2offset(page) | gfs2_dir_hti2offset(hti) |
+   gfs2_dir_pgo2offset(off));
+}
 
 struct qstr gfs2_qdot __read_mostly;
 struct qstr gfs2_qdotdot __read_mostly;
@@ -1176,48 +1215,6 @@ out_kfree:
 }
 
 /**
- * compare_dents - compare directory entries by hash value
- * @a: first dent
- * @b: second dent
- *
- * When comparing the hash entries of @a to @b:
- *   gt: returns 1
- *   lt: returns -1
- *   eq: returns 0
- */
-
-static int compare_dents(const void *a, const void *b)
-{
-   const struct gfs2_dirent *dent_a, *dent_b;
-   u32 hash_a, hash_b;
-   int ret = 0;
-
-   dent_a = *(const struct gfs2_dirent **)a;
-   hash_a = be32_to_cpu(dent_a->de_hash);
-
-   dent_b = *(const struct gfs2_dirent **)b;
-   hash_b = be32_to_cpu(dent_b->de_hash);
-
-   if (hash_a > hash_b)
-   ret = 1;
-   else if (hash_a < hash_b)
-   ret = -1;
-   else {
-   unsigned int len_a = be16_to_cpu(dent_a->de_name_len);
-   unsigned int len_b = be16_to_cpu(dent_b->de_name_len);
-
-   if (len_a > len_b)
-   ret = 1;
-   else if (len_a < len_b)
-   ret = -1;
-   else
-   ret = memcmp(dent_a + 1, dent_b + 1, len_a);
-   }
-
-   return ret;
-}
-
-/**
  * do_filldir_main - read out directory entries
  * @dip: The GFS2 inode
  * @ctx: what to feed the entries to
@@ -1225,11 +1222,6 @@ static int compare_dents(const void *a, const void *b)
  * @entries: the number of entries in darr
  * @copied: pointer to int that's non-zero if a entry has been copied out
  *
- * Jump through some hoops to make sure that if there are hash collsions,
- * they are read out at the beginning of a buffer.  We want to minimize
- * the possibility that they will fall into different readdir buffers or
- * that someone will want to seek to that location.
- *
  * Returns: errno, 0 if the actor tells you to stop
  */
 
@@ -1237,42 +1229,15 @@ static int do_filldir_main(struct gfs2_inode *dip, 
struct dir_context *ctx,
   const struct gfs2_dirent **darr, u32 entries

Re: [Cluster-devel] [PATCH 1/3] gfs2: bugger off early if O_CREAT open finds a directory

2014-11-19 Thread Bob Peterson
- Original Message -
 Signed-off-by: Al Viro v...@zeniv.linux.org.uk
 ---
  fs/gfs2/inode.c |5 +
  1 file changed, 5 insertions(+)
 
 diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
 index c4ed823..310e248 100644
 --- a/fs/gfs2/inode.c
 +++ b/fs/gfs2/inode.c
 @@ -624,6 +624,11 @@ static int gfs2_create_inode(struct inode *dir, struct
 dentry *dentry,
   inode = gfs2_dir_search(dir, &dentry->d_name, !S_ISREG(mode) || excl);
   error = PTR_ERR(inode);
   if (!IS_ERR(inode)) {
 + if (S_ISDIR(inode->i_mode)) {
 + iput(inode);
 + inode = ERR_PTR(-EISDIR);
 + goto fail_gunlock;
 + }
   d = d_splice_alias(inode, dentry);
   error = PTR_ERR(d);
   if (IS_ERR(d)) {
 --
 1.7.10.4
 
 

Hm. Seems wrong that it should return 0 if the dirent exists (mkdir of a
directory that already exists) but it looks like it already behaves that way.
So I guess so. It may warrant further investigation.

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH 1/3] gfs2: bugger off early if O_CREAT open finds a directory

2014-11-19 Thread Bob Peterson
- Original Message -
 On Wed, Nov 19, 2014 at 03:33:29PM -0500, Bob Peterson wrote:
  - Original Message -
   Signed-off-by: Al Viro v...@zeniv.linux.org.uk
   ---
fs/gfs2/inode.c |5 +
1 file changed, 5 insertions(+)
   
   diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
   index c4ed823..310e248 100644
   --- a/fs/gfs2/inode.c
   +++ b/fs/gfs2/inode.c
   @@ -624,6 +624,11 @@ static int gfs2_create_inode(struct inode *dir,
   struct
   dentry *dentry,
 inode = gfs2_dir_search(dir, dentry-d_name, !S_ISREG(mode) || excl);
 error = PTR_ERR(inode);
 if (!IS_ERR(inode)) {
   + if (S_ISDIR(inode-i_mode)) {
   + iput(inode);
   + inode = ERR_PTR(-EISDIR);
   + goto fail_gunlock;
   + }
 d = d_splice_alias(inode, dentry);
 error = PTR_ERR(d);
 if (IS_ERR(d)) {
 
  Hm. Seems wrong that it should return 0 if the dirent exists (mkdir of a
  directory that already exists) but it looks like it already behaves that
  way.
  So I guess so. It may warrant further investigation.
 
 It doesn't.  Note that !S_ISREG(mode) || excl in there - *anything* other
 than ->create() will treat existing entries with the same name as EEXIST.
 The only case when we can possibly get into that if (!IS_ERR(inode)) is
 ->create() hitting an existing file.  And with this change it narrows to
 ->create() hitting an existing non-directory.  That allows the next patch
 to use d_instantiate() instead of d_splice_alias() - for non-directories
 it's the same thing, since dentry is already hashed here and we don't need
 to avoid multiple aliases.  Which kills one of the two places in the
 tree where d_splice_alias() is called for an already hashed dentry (another
 is d_add_ci() and that call also goes down - see vfs.git#for-next for that
 one).
 
Okay, thanks for clearing that up.

ACK

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] gfs2: use kvfree() instead of open-coding it

2014-11-20 Thread Bob Peterson
- Original Message -
 Signed-off-by: Al Viro v...@zeniv.linux.org.uk
 ---
  fs/gfs2/dir.c   |   40 
  fs/gfs2/quota.c |9 ++---
  2 files changed, 10 insertions(+), 39 deletions(-)
 
 diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
 index 5d4261f..c247fed 100644
 --- a/fs/gfs2/dir.c
 +++ b/fs/gfs2/dir.c
 @@ -365,22 +365,15 @@ static __be64 *gfs2_dir_get_hash_table(struct
 gfs2_inode *ip)
  
   ret = gfs2_dir_read_data(ip, hc, hsize);
   if (ret  0) {
 - if (is_vmalloc_addr(hc))
 - vfree(hc);
 - else
 - kfree(hc);
 + kvfree(hc);
   return ERR_PTR(ret);
   }
  
   spin_lock(inode-i_lock);
 - if (ip-i_hash_cache) {
 - if (is_vmalloc_addr(hc))
 - vfree(hc);
 - else
 - kfree(hc);
 - } else {
 + if (ip-i_hash_cache)
 + kvfree(hc);
 + else
   ip-i_hash_cache = hc;
 - }
   spin_unlock(inode-i_lock);
  
   return ip-i_hash_cache;
 @@ -396,10 +389,7 @@ void gfs2_dir_hash_inval(struct gfs2_inode *ip)
  {
   __be64 *hc = ip-i_hash_cache;
   ip-i_hash_cache = NULL;
 - if (is_vmalloc_addr(hc))
 - vfree(hc);
 - else
 - kfree(hc);
 + kvfree(hc);
  }
  
  static inline int gfs2_dirent_sentinel(const struct gfs2_dirent *dent)
 @@ -1168,10 +1158,7 @@ fail:
   gfs2_dinode_out(dip, dibh-b_data);
   brelse(dibh);
  out_kfree:
 - if (is_vmalloc_addr(hc2))
 - vfree(hc2);
 - else
 - kfree(hc2);
 + kvfree(hc2);
   return error;
  }
  
 @@ -1302,14 +1289,6 @@ static void *gfs2_alloc_sort_buffer(unsigned size)
   return ptr;
  }
  
 -static void gfs2_free_sort_buffer(void *ptr)
 -{
 - if (is_vmalloc_addr(ptr))
 - vfree(ptr);
 - else
 - kfree(ptr);
 -}
 -
  static int gfs2_dir_read_leaf(struct inode *inode, struct dir_context *ctx,
 int *copied, unsigned *depth,
 u64 leaf_no)
 @@ -1393,7 +1372,7 @@ static int gfs2_dir_read_leaf(struct inode *inode,
 struct dir_context *ctx,
  out_free:
   for(i = 0; i  leaf; i++)
   brelse(larr[i]);
 - gfs2_free_sort_buffer(larr);
 + kvfree(larr);
  out:
   return error;
  }
 @@ -2004,10 +1983,7 @@ out_rlist:
   gfs2_rlist_free(rlist);
   gfs2_quota_unhold(dip);
  out:
 - if (is_vmalloc_addr(ht))
 - vfree(ht);
 - else
 - kfree(ht);
 + kvfree(ht);
   return error;
  }
  
 diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
 index 64b29f7..c8b148b 100644
 --- a/fs/gfs2/quota.c
 +++ b/fs/gfs2/quota.c
 @@ -1360,13 +1360,8 @@ void gfs2_quota_cleanup(struct gfs2_sbd *sdp)
  
   gfs2_assert_warn(sdp, !atomic_read(sdp-sd_quota_count));
  
 - if (sdp-sd_quota_bitmap) {
 - if (is_vmalloc_addr(sdp-sd_quota_bitmap))
 - vfree(sdp-sd_quota_bitmap);
 - else
 - kfree(sdp-sd_quota_bitmap);
 - sdp-sd_quota_bitmap = NULL;
 - }
 + kvfree(sdp-sd_quota_bitmap);
 + sdp-sd_quota_bitmap = NULL;
  }
  
  static void quotad_error(struct gfs2_sbd *sdp, const char *msg, int error)
 --
 1.7.10.4
 
 

ACK

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] gfs2: gfs2_dir_get_hash_table(): avoiding deferred vfree() is easy here...

2014-11-20 Thread Bob Peterson
- Original Message -
 vfree() is allowed under spinlock these days, but it's cheaper when
 it doesn't step into deferred case and here it's very easy to avoid.
 
 Signed-off-by: Al Viro v...@zeniv.linux.org.uk
 ---
  fs/gfs2/dir.c |7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)
 
 diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
 index c247fed..c5a34f0 100644
 --- a/fs/gfs2/dir.c
 +++ b/fs/gfs2/dir.c
 @@ -370,11 +370,12 @@ static __be64 *gfs2_dir_get_hash_table(struct
 gfs2_inode *ip)
   }
  
   spin_lock(&inode->i_lock);
  - if (ip->i_hash_cache)
  - kvfree(hc);
  - else
  + if (likely(!ip->i_hash_cache)) {
   ip->i_hash_cache = hc;
  + hc = NULL;
  + }
   spin_unlock(&inode->i_lock);
  + kvfree(hc);
 
   return ip->i_hash_cache;
  }
 --
 1.7.10.4
 
 

ACK

Bob Peterson
Red Hat File Systems



[Cluster-devel] [GFS2 PATCH] GFS2: Write the log descriptor entries in the correct order

2014-11-26 Thread Bob Peterson
Hi,

I recently discovered GFS2 was writing the log descriptor items in
the journal in reverse order. For example, if I do a bunch of file
creates, the result was a log header that looked something like this:

0x298 (j+ 242): Log descriptor, type 300 (Metadata) len:504, data1: 503
   0xa6e9ce di0xa6e9a7 di0xa6e97b di0xa6e942 di
   0xa6e911 di0xa6e8d2 di0xa6e898 di0xa6e86e di
   0xa6e840 di0xa6e809 di0xa6e7d7 di0xa6e77e di
   0xa6e747 di0xa6e721 di0xa6e6f8 di0xa6e6ca di
   0xa6e680 di0xa6e63d di0xa6e619 di0xa6e5f4 di
   0xa6e5d1 di0xa6e598 di0xa6e520 di0xa6e4f4 di
   0xa6e4c1 di0xa6e489 di0xa6e45c di0xa6e427 di

As you can see, the blocks are reverse-sorted. This patch changes the
order so that they're written in sorted order. Log headers written with
the patch look more like this:

0x36e (j+ 318): Log descriptor, type 300 (Metadata) len:504, data1: 503
   0x4d5a80 di   0x4d5aa0 lf   0x4d5aa1 di   0x4d5aa2 di
   0x4d5aa3 di   0x4d5aa4 di   0x4d5aa5 di   0x4d5aa6 di
   0x4d5aa7 di   0x4d5aa8 di   0x4d5aa9 di   0x4d5aaa di
   0x4d5aab di   0x4d5aac di   0x4d5aad di   0x4d5aae di
   0x4d5aaf di   0x4d5ab0 di   0x4d5ab1 di   0x4d5ab2 di
   0x4d5ab3 di   0x4d5ab4 di   0x4d5ab5 di   0x4d5ab6 di
   0x4d5ab7 di   0x4d5ab8 di   0x4d5ab9 di   0x4d5aba di

The blocks following the log descriptor are written sequentially, so the
sort order won't affect performance at all, but it makes the journal
easier to decipher for debugging purposes.

Patch description:

This patch changes the order in which the blocks are processed by function
gfs2_before_commit. Before, it processed them in the wrong order, so the
blocks were laid out and written in reverse. This patch makes it process
them in the correct (sorted) order.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 2c1ae86..d644470 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -444,7 +444,7 @@ static void gfs2_before_commit(struct gfs2_sbd *sdp, 
unsigned int limit,
ptr = (__be64 *)(ld + 1);
 
n = 0;
-   list_for_each_entry_continue(bd1, blist, bd_list) {
+   list_for_each_entry_continue_reverse(bd1, blist, bd_list) {
*ptr++ = cpu_to_be64(bd1-bd_bh-b_blocknr);
if (is_databuf) {
gfs2_check_magic(bd1-bd_bh);
@@ -459,7 +459,7 @@ static void gfs2_before_commit(struct gfs2_sbd *sdp, 
unsigned int limit,
gfs2_log_lock(sdp);
 
n = 0;
-   list_for_each_entry_continue(bd2, blist, bd_list) {
+   list_for_each_entry_continue_reverse(bd2, blist, bd_list) {
get_bh(bd2-bd_bh);
gfs2_log_unlock(sdp);
lock_buffer(bd2-bd_bh);



Re: [Cluster-devel] [GFS2 PATCH] GFS2: Write the log descriptor entries in the correct order

2014-11-26 Thread Bob Peterson
- Original Message -
  As you can see, the blocks are reverse-sorted. This patch changes the
  order so that they're written in sorted order. Log headers written with
  the patch look more like this:

 Does that mean that this patch got the ordering the wrong way around then?
 
 https://git.kernel.org/cgit/linux/kernel/git/steve/gfs2-3.0-nmw.git/commit/fs/gfs2/lops.c?id=7f63257da1aeea9b2ee68f469b0f9f4a39e5dff8
 
 Steve.

No. It means I missed Ben's patch. Or maybe I've lost my mind. Sigh.

Bob



Re: [Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: Change block_map to match bitmap

2015-01-26 Thread Bob Peterson
- Original Message -
 Hi Bob,
 
 On 22/01/15 20:41, Bob Peterson wrote:
  Hi,
 
  This patch changes the old block_map structure for fsck.gfs2 to the
  simpler bitmap structure so that we have a 1:1 correspondence. This
  was done to reduce memory requirements of fsck.gfs2.
 
 I'm curious as to whether we're losing any useful information here. I
 see an extra bread() for BLKST_USED blocks in check_data, is that the
 main performance compromise, and how severe is it?
 
 Cheers,
 Andy

Hi,

That's exactly where things get a bit sticky. It used to keep track of
the various GFS2_METATYPE_* block types. For the most part, dinodes were
straightforward because we only needed to distinguish between directories
and non-directories. Free space is pretty straightforward too, with only
a few corner cases where things were marked a bad type. Unlinked also
wasn't a big problem. The problem comes with blocks marked Data.

In GFS1, data blocks were data blocks. In GFS2, a data block could be
data or metadata. In some cases we don't care because the height will
tell us if it's real data. But things get sticky when we need to
distinguish between different types of non-dinode metadata in fsck.
For example, what if a dinode has indirect blocks that are not
really indirect blocks, but say, an extended attribute block? The old
fsck could distinguish between them pretty easily because of their
unique types. Going by just the bitmap alone is ambiguous, so sometimes
we need to know more details, especially in cases where we have a block
that's referenced multiple times: it's much more likely that one of the
references is for the wrong type, and in that case, we do the extra
read. This shouldn't impact performance except in the case of duplicate
references, which are rare, and in those cases the extra read needs to be
done anyway.
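
For illustration, here is a minimal sketch (not the actual fsck.gfs2 code) of
the kind of extra read described above, assuming libgfs2's bread()/brelse()
helpers and the standard gfs2_meta_header layout; the function name is
hypothetical and the real disambiguation logic in pass1b is more involved:

static int metadata_type_of(struct gfs2_sbd *sdp, uint64_t blkno)
{
	struct gfs2_buffer_head *bh;
	struct gfs2_meta_header mh;
	int type;

	bh = bread(sdp, blkno);                 /* the extra read */
	memcpy(&mh, bh->b_data, sizeof(mh));
	if (be32_to_cpu(mh.mh_magic) != GFS2_MAGIC)
		type = -1;                      /* not metadata at all */
	else
		type = be32_to_cpu(mh.mh_type); /* e.g. GFS2_METATYPE_IN vs.
						   GFS2_METATYPE_EA */
	brelse(bh);
	return type;
}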

There are probably cases where fsck is now making assumptions about the
previously processed block type for indirect blocks, so in that
respect, the information is lost. However, I've tried to minimize the
risk, and rigorous testing has shown that it works in many very difficult
situations. You can tell my testing has been good, since I've posted
so many bug fix patches recently that I've uncovered. :)

In that respect, this new fsck is better than the previous.

Still, there may very well be cases where the lost information causes
an unforeseen problem. I'm not sure we can do anything about that but
test it thoroughly.

Maybe I'll try to dummy up some potential problems (like the one I gave
above with indirect versus extended attribute) to see how well it does.
I suppose I could try to dummy up every permutation. That is likely to
uncover even more problems, most of which already exist in today's fsck.

Regards,

Bob Peterson
Red Hat File Systems



[Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: Fix journal sequence number reporting problem

2015-01-23 Thread Bob Peterson
Hi,

This bug was spotted by Coverity. fsck.gfs2 should print a line reporting
the number of journal sequence errors fixed, but the code was written
improperly, so the message was never displayed. This patch fixes the code
so that the message is printed properly.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/gfs2/fsck/fs_recovery.c b/gfs2/fsck/fs_recovery.c
index abb8187..9be5a95 100644
--- a/gfs2/fsck/fs_recovery.c
+++ b/gfs2/fsck/fs_recovery.c
@@ -436,10 +436,9 @@ static int check_journal_seq_no(struct gfs2_inode *ip, int 
fix)
  (unsigned long long)lowest_seq,
  (unsigned long long)highest_seq,
  (unsigned long long)prev_seq);
-   if (!fix) {
-   seq_errors++;
+   seq_errors++;
+   if (!fix)
continue;
-   }
highest_seq++;
lh.lh_sequence = highest_seq;
prev_seq = lh.lh_sequence;
@@ -449,8 +448,10 @@ static int check_journal_seq_no(struct gfs2_inode *ip, int 
fix)
gfs2_log_header_out(lh, bh);
brelse(bh);
}
-   if (seq_errors  fix)
+   if (seq_errors  fix) {
log_err(_(%d sequence errors fixed.\n), seq_errors);
+   seq_errors = 0;
+   }
return seq_errors;
 }
 



[Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: Fix a use-after-free in pass2

2015-01-15 Thread Bob Peterson
Hi,

This patch fixes a rare code path in pass2 where I spotted a
use-after-free.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/gfs2/fsck/pass2.c b/gfs2/fsck/pass2.c
index 27b7336..4ea322a 100644
--- a/gfs2/fsck/pass2.c
+++ b/gfs2/fsck/pass2.c
@@ -1928,13 +1928,14 @@ int pass2(struct gfs2_sbd *sdp)
ip = fsck_load_inode(sdp, dirblk);
cur_blks = ip-i_di.di_blocks;
error = check_metatree(ip, pass2_fxns);
-   fsck_inode_put(ip);
if (error  0) {
stack;
+   fsck_inode_put(ip);
return error;
}
if (ip-i_di.di_blocks != cur_blks)
reprocess_inode(ip, current);
+   fsck_inode_put(ip);
}
error = check_dir(sdp, dirblk, pass2_fxns);
if (skip_this_pass || fsck_abort) /* if asked to skip the rest 
*/



[Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: Reprocess nodes if anything changed

2015-01-15 Thread Bob Peterson
Hi,

Before this patch, fsck.gfs2 called reprocess_inode in several
places, especially when the number of blocks had changed from the
original. That works in almost all cases. However, there's a corner
case where a block may be deleted, due to errors (for example, a
bad directory leaf), and another block takes its place. In that
case the count of the number of blocks is still the same, but the
inode should be reprocessed anyway, because the deleted blocks
and replacement blocks may be at different locations or of
different types. A hash table block may be data when it's freed,
but if the block is reused, it may be repurposed as a leaf block.
That may confuse subsequent processing. Another reason to call
reprocess_inode is when the goal block changes to a value outside
the resource group, due to block allocations for repairs.
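
The patch below uses a pair of small helpers, astate_save() and
astate_changed(), whose definitions are not shown in these hunks. Here is a
rough sketch of what such helpers could look like; the field names and the
exact set of values snapshotted are assumptions, not the actual
implementation:

struct alloc_state {
	uint64_t as_blocks;     /* di_blocks when the snapshot was taken */
	uint64_t as_meta_goal;  /* allocation goal block (assumed field) */
};

static void astate_save(const struct gfs2_inode *ip, struct alloc_state *as)
{
	as->as_blocks = ip->i_di.di_blocks;
	as->as_meta_goal = ip->i_di.di_goal_meta;
}

static int astate_changed(const struct gfs2_inode *ip,
			  const struct alloc_state *as)
{
	/* Reprocess if anything allocation-related changed, not just the
	   block count. */
	return as->as_blocks != ip->i_di.di_blocks ||
	       as->as_meta_goal != ip->i_di.di_goal_meta;
}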

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/gfs2/fsck/lost_n_found.c b/gfs2/fsck/lost_n_found.c
index 9672c7a..b2e35e2 100644
--- a/gfs2/fsck/lost_n_found.c
+++ b/gfs2/fsck/lost_n_found.c
@@ -174,7 +174,6 @@ void make_sure_lf_exists(struct gfs2_inode *ip)
 int add_inode_to_lf(struct gfs2_inode *ip){
char tmp_name[256];
__be32 inode_type;
-   uint64_t lf_blocks;
struct gfs2_sbd *sdp = ip-i_sbd;
int err = 0;
uint32_t mode;
@@ -184,7 +183,6 @@ int add_inode_to_lf(struct gfs2_inode *ip){
log_err( _(Trying to add lost+found to itself...skipping));
return 0;
}
-   lf_blocks = lf_dip-i_di.di_blocks;
 
if (sdp-gfs1)
mode = gfs_to_gfs2_mode(ip);
@@ -242,10 +240,6 @@ int add_inode_to_lf(struct gfs2_inode *ip){
 tmp_name, strerror(errno));
exit(FSCK_ERROR);
}
-   /* If the lf directory had new blocks added we have to mark them
-  properly in the bitmap so they're not freed. */
-   if (lf_dip-i_di.di_blocks != lf_blocks)
-   reprocess_inode(lf_dip, lost+found);
 
/* This inode is linked from lost+found */
incr_link_count(ip-i_di.di_num, lf_dip, _(from lost+found));
diff --git a/gfs2/fsck/metawalk.c b/gfs2/fsck/metawalk.c
index 16faafb..217bb07 100644
--- a/gfs2/fsck/metawalk.c
+++ b/gfs2/fsck/metawalk.c
@@ -1693,11 +1693,11 @@ int check_dir(struct gfs2_sbd *sdp, uint64_t block, 
struct metawalk_fxns *pass)
 {
struct gfs2_inode *ip;
int error = 0;
-   uint64_t cur_blks;
+   struct alloc_state as;
 
ip = fsck_load_inode(sdp, block);
 
-   cur_blks = ip-i_di.di_blocks;
+   astate_save(ip, as);
 
if (ip-i_di.di_flags  GFS2_DIF_EXHASH)
error = check_leaf_blks(ip, pass);
@@ -1707,7 +1707,7 @@ int check_dir(struct gfs2_sbd *sdp, uint64_t block, 
struct metawalk_fxns *pass)
if (error  0)
stack;
 
-   if (ip-i_di.di_blocks != cur_blks)
+   if (astate_changed(ip, as))
reprocess_inode(ip, _(Current));
 
fsck_inode_put(ip); /* does a brelse */
diff --git a/gfs2/fsck/pass2.c b/gfs2/fsck/pass2.c
index 4ea322a..74bcc26 100644
--- a/gfs2/fsck/pass2.c
+++ b/gfs2/fsck/pass2.c
@@ -1723,7 +1723,8 @@ build_it:
 static int check_system_dir(struct gfs2_inode *sysinode, const char *dirname,
 int builder(struct gfs2_sbd *sdp))
 {
-   uint64_t iblock = 0, cur_blks;
+   uint64_t iblock = 0;
+   struct alloc_state as;
struct dir_status ds = {0};
int error = 0;
 
@@ -1740,14 +1741,14 @@ static int check_system_dir(struct gfs2_inode 
*sysinode, const char *dirname,
 
pass2_fxns.private = (void *) ds;
if (ds.q == gfs2_bad_block) {
-   cur_blks = sysinode-i_di.di_blocks;
+   astate_save(sysinode, as);
/* First check that the directory's metatree is valid */
error = check_metatree(sysinode, pass2_fxns);
if (error  0) {
stack;
return error;
}
-   if (sysinode-i_di.di_blocks != cur_blks)
+   if (astate_changed(sysinode, as))
reprocess_inode(sysinode, _(System inode));
}
error = check_dir(sysinode-i_sbd, iblock, pass2_fxns);
@@ -1768,7 +1769,7 @@ static int check_system_dir(struct gfs2_inode *sysinode, 
const char *dirname,
if (!ds.dotdir) {
log_err( _(No '.' entry found for %s directory.\n), dirname);
if (query( _(Is it okay to add '.' entry? (y/n) ))) {
-   cur_blks = sysinode-i_di.di_blocks;
+   astate_save(sysinode, as);
log_warn( _(Adding '.' entry\n));
error = dir_add(sysinode, ., 1, 
(sysinode-i_di.di_num),
(sysinode-i_sbd-gfs1 ? GFS_FILE_DIR : 
DT_DIR));
@@ -1777,7 +1778,7 @@ static int check_system_dir(struct

[Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: fix double-free bug

2015-01-15 Thread Bob Peterson
Hi,

This patch fixes a double-free bug that can occur when errors are found
in an inode and the inode is later freed. The problem is that function
free_metalist was using osi_list_del when removing buffer_heads from the
inode in memory. It should be using osi_list_del_init so that the list
head is reinitialized. Otherwise the list head is left with an unknown
value which gets interpreted later on as !list_empty(), so it tries to
free the list a second time.
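
To make the distinction concrete, here is a rough illustration of the two
list operations, modeled on the kernel-style list macros that osi_list
mirrors; these are not the literal osi_list.h definitions:

static inline void example_list_del(osi_list_t *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	/* entry->next and entry->prev still point into the old list, so a
	   later "is this entry still queued?" check can wrongly say yes,
	   which is what led to the second free. */
}

static inline void example_list_del_init(osi_list_t *entry)
{
	example_list_del(entry);
	entry->next = entry;	/* re-initialize: the entry is now empty */
	entry->prev = entry;
}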

This also fixes a minor issue with function rangecheck_jmeta, which
frees the buffer_head when it's done. The patch sets the returned *bh
value back to NULL to ensure it's not reused. This is just a precaution
I spotted while debugging, and it's worth doing.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/gfs2/fsck/fs_recovery.c b/gfs2/fsck/fs_recovery.c
index 3d8fe98..095d118 100644
--- a/gfs2/fsck/fs_recovery.c
+++ b/gfs2/fsck/fs_recovery.c
@@ -631,6 +631,7 @@ static int rangecheck_jmeta(struct gfs2_inode *ip, uint64_t 
block,
 (unsigned long long)block,
 (unsigned long long)block);
brelse(*bh);
+   *bh = NULL;
return meta_skip_further;
}
}
diff --git a/gfs2/fsck/metawalk.c b/gfs2/fsck/metawalk.c
index 1875b24..16faafb 100644
--- a/gfs2/fsck/metawalk.c
+++ b/gfs2/fsck/metawalk.c
@@ -1203,7 +1203,7 @@ static void free_metalist(struct gfs2_inode *ip, 
osi_list_t *mlp)
nbh = osi_list_entry(list-next,
 struct gfs2_buffer_head, 
b_altlist);
if (nbh == ip-i_bh)
-   osi_list_del(nbh-b_altlist);
+   osi_list_del_init(nbh-b_altlist);
else
brelse(nbh);
}



[Cluster-devel] [fsck.gfs2 patch] fsck.gfs2: Rebuild system files if they don't have the SYS bit set

2015-01-19 Thread Bob Peterson
Hi,

This patch checks for the GFS2_DIF_SYSTEM bit on system dinodes. If the
bit is not set, the dinode is assumed to be corrupt and rebuilt.
If the jindex is rebuilt, it needs to be re-read.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/gfs2/fsck/pass1.c b/gfs2/fsck/pass1.c
index 2841b8c..4348683 100644
--- a/gfs2/fsck/pass1.c
+++ b/gfs2/fsck/pass1.c
@@ -27,6 +27,7 @@
 #include util.h
 #include link.h
 #include metawalk.h
+#include fs_recovery.h
 
 struct special_blocks gfs1_rindex_blks;
 
@@ -1241,7 +1242,8 @@ static int check_system_inode(struct gfs2_sbd *sdp,
  struct gfs2_inode **sysinode,
  const char *filename,
  int builder(struct gfs2_sbd *sdp),
- enum gfs2_mark_block mark)
+ enum gfs2_mark_block mark,
+ struct gfs2_inode *sysdir, int needs_sysbit)
 {
uint64_t iblock = 0;
struct dir_status ds = {0};
@@ -1282,6 +1284,24 @@ static int check_system_inode(struct gfs2_sbd *sdp,
if (mark == gfs2_inode_dir)
dirtree_insert((*sysinode)-i_di.di_num);
}
+   /* Make sure it's marked as a system file/directory */
+   if (needs_sysbit 
+   !((*sysinode)-i_di.di_flags  GFS2_DIF_SYSTEM)) {
+   log_err( _(System inode %s is missing the 'system' 
+  flag. It should be rebuilt.\n), filename);
+   if (sysdir  query(_(Delete the corrupt %s system 
+ inode? (y/n) ), filename)) {
+   inode_put(sysinode);
+   gfs2_dirent_del(sysdir, filename,
+   strlen(filename));
+   /* Set the blockmap (but not bitmap) back to
+  'free' so that it gets checked like any
+  normal dinode. */
+   gfs2_blockmap_set(bl, iblock, gfs2_block_free);
+   log_err( _(Removed system inode \%s\.\n),
+filename);
+   }
+   }
} else
log_info( _(System inode for '%s' is corrupt or missing.\n),
  filename);
@@ -1290,18 +1310,22 @@ static int check_system_inode(struct gfs2_sbd *sdp,
   lost+found then, but we *need* our system inodes before we can
   do any of that. */
if (!(*sysinode) || ds.q != mark) {
-   log_err( _(Invalid or missing %s system inode (should be %d, 
-  is %d).\n), filename, mark, ds.q);
+   log_err( _(Invalid or missing %s system inode (should be 
+  '%s', is '%s').\n), filename,
+block_type_string(mark),
+block_type_string(ds.q));
if (query(_(Create new %s system inode? (y/n) ), filename)) {
log_err( _(Rebuilding system file \%s\\n),
 filename);
error = builder(sdp);
if (error) {
-   log_err( _(Error trying to rebuild system 
-  file %s: Cannot continue\n),
+   log_err( _(Error rebuilding system 
+  inode %s: Cannot continue\n),
 filename);
return error;
}
+   if (*sysinode == sdp-md.jiinode)
+   ji_update(sdp);
fsck_blockmap_set(*sysinode,
  (*sysinode)-i_di.di_num.no_addr,
  filename, mark);
@@ -1367,7 +1391,8 @@ static int check_system_inodes(struct gfs2_sbd *sdp)
  sdp-master_dir-i_di.di_num.no_addr,
  master, gfs2_inode_dir);
if (check_system_inode(sdp, sdp-master_dir, master,
-  build_master, gfs2_inode_dir)) {
+  build_master, gfs2_inode_dir,
+  NULL, 1)) {
stack;
return -1;
}
@@ -1377,39 +1402,41 @@ static int check_system_inodes(struct gfs2_sbd *sdp)
fsck_blockmap_set(sdp-md.rooti, sdp-md.rooti-i_di.di_num.no_addr,
  root, gfs2_inode_dir);
if (check_system_inode(sdp, sdp-md.rooti, root, build_root,
-  gfs2_inode_dir

[Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: Check the integrity of the journal index

2015-01-19 Thread Bob Peterson
Hi,

This patch checks the jindex system directory to make sure the entries
all start with "journal" and so on. If not, the jindex is deleted
and rebuilt. As part of this patch, I moved where we read in the rindex
file and rgrps to an earlier point in time, before the journals are
replayed. This allows us to remove a dummied-up rgrp kludge in the code.
However, if the replayed journal block is part of an rgrp, we need to
refresh the rgrp based on the values rewritten from the journal.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/gfs2/fsck/fs_recovery.c b/gfs2/fsck/fs_recovery.c
index 095d118..4eaba1e 100644
--- a/gfs2/fsck/fs_recovery.c
+++ b/gfs2/fsck/fs_recovery.c
@@ -96,6 +96,30 @@ void gfs2_revoke_clean(struct gfs2_sbd *sdp)
}
 }
 
+static void refresh_rgrp(struct gfs2_sbd *sdp, struct rgrp_tree *rgd,
+struct gfs2_buffer_head *bh, uint64_t blkno)
+{
+   int i;
+
+   log_debug(_(Block is part of rgrp 0x%llx; refreshing the rgrp.\n),
+ (unsigned long long)rgd-ri.ri_addr);
+   for (i = 0; i  rgd-ri.ri_length; i++) {
+   if (rgd-bits[i].bi_bh-b_blocknr != blkno)
+   continue;
+
+   memcpy(rgd-bits[i].bi_bh-b_data, bh-b_data, sdp-bsize);
+   bmodified(rgd-bits[i].bi_bh);
+   if (i == 0) { /* this is the rgrp itself */
+   if (sdp-gfs1)
+   gfs_rgrp_in((struct gfs_rgrp *)rgd-rg,
+   rgd-bits[0].bi_bh);
+   else
+   gfs2_rgrp_in(rgd-rg, rgd-bits[0].bi_bh);
+   }
+   break;
+   }
+}
+
 static int buf_lo_scan_elements(struct gfs2_inode *ip, unsigned int start,
struct gfs2_log_descriptor *ld, __be64 *ptr,
int pass)
@@ -105,6 +129,7 @@ static int buf_lo_scan_elements(struct gfs2_inode *ip, 
unsigned int start,
struct gfs2_buffer_head *bh_log, *bh_ip;
uint64_t blkno;
int error = 0;
+   struct rgrp_tree *rgd;
 
if (pass != 1 || be32_to_cpu(ld-ld_type) != GFS2_LOG_DESC_METADATA)
return 0;
@@ -147,6 +172,9 @@ static int buf_lo_scan_elements(struct gfs2_inode *ip, 
unsigned int start,
error = -EIO;
} else {
bmodified(bh_ip);
+   rgd = gfs2_blk2rgrpd(sdp, blkno);
+   if (rgd  blkno  rgd-ri.ri_data0)
+   refresh_rgrp(sdp, rgd, bh_ip, blkno);
}
 
brelse(bh_log);
@@ -676,28 +704,8 @@ int replay_journals(struct gfs2_sbd *sdp, int preen, int 
force_check,
 
for(i = 0; i  sdp-md.journals; i++) {
if (sdp-md.journal[i]) {
-   struct rgrp_tree rgd;
-   struct gfs2_bitmap bits;
-
-   /* The real rgrp tree hasn't been built at this point,
-* so we need to dummy one up that covers the whole
-* file system so basic functions in check_metatree
-* don't segfault. */
-   rgd.start = sdp-sb_addr + 1;
-   rgd.length = 1;
-   bits.bi_bh = NULL;
-   bits.bi_start = 0;
-   bits.bi_len = sdp-fssize / GFS2_NBBY;
-   rgd.bits = bits;
-   rgd.ri.ri_addr = sdp-sb_addr + 1;
-   rgd.ri.ri_length = 1;
-   rgd.ri.ri_data0 = sdp-sb_addr + 2;
-   rgd.ri.ri_data = sdp-fssize - (sdp-sb_addr + 2);
-
-   sdp-rgtree.osi_node = (struct osi_node *)rgd;
error = check_metatree(sdp-md.journal[i],
   rangecheck_journal);
-   sdp-rgtree.osi_node = NULL;
if (error)
/* Don't use fsck_inode_put here because it's a
   system file and we need to dismantle it. */
@@ -707,8 +715,7 @@ int replay_journals(struct gfs2_sbd *sdp, int preen, int 
force_check,
if (!sdp-md.journal[i]) {
log_err(_(File system journal \journal%d\ is 
  missing or corrupt: pass1 will try to 
- recreate it.\n),
-   i);
+ recreate it.\n), i);
continue;
}
if (!error) {
diff --git a/gfs2/fsck/initialize.c b/gfs2/fsck/initialize.c
index 4e52262..043917c 100644
--- a/gfs2/fsck/initialize.c
+++ b/gfs2/fsck/initialize.c
@@ -142,7 +142,7 @@ static int set_block_ranges(struct gfs2_sbd *sdp)
uint64_t rmin = 0

[Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: rgrp block count reform

2015-01-19 Thread Bob Peterson
Hi,

In the past, pass5 has kept track of different counts for the different
metadata types, but the counts did not correspond to the type in the
bitmap. This patch changes pass5 so that it makes a little more sense.
There should be a straightforward blockmap-to-bitmap conversion, and
the counts should be based on the bitmap type. For example, before the
patch, the rgrp's inode count would be kept in count[1], but in the
bitmap, dinodes are of type 3. So this patch reworks that system so
that the count of dinodes is kept in count[3] and so forth, and it
uses a simple blockmap-to-bitmap to figure out the index to count.
This breaks down for GFS1 which keeps a few extra numbers in the rgrp,
however this patch compensates for that. This patch is being done
as a precursor to another patch that reduces fsck's memory requirements
by using a blockmap that's 1:1 with the bitmaps rather than today's
blockmap that's 4:1.
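
The patch relies on a simple blockmap-to-bitmap conversion (referenced in a
hunk below but not shown in full). A hypothetical sketch of such a mapping,
built from the blockmap marks visible in the removed functions; the actual
table in the patch may differ:

static int blockmap_to_bitmap(enum gfs2_mark_block mark, int gfs1)
{
	switch (mark) {
	case gfs2_block_free:
	case gfs2_meta_inval:
	case gfs2_inode_invalid:
		return GFS2_BLKST_FREE;		/* counted in count[0] */
	case gfs2_block_used:
		return GFS2_BLKST_USED;		/* counted in count[1] */
	case gfs2_inode_dir:
	case gfs2_inode_file:
	case gfs2_inode_lnk:
	case gfs2_inode_device:
	case gfs2_inode_fifo:
	case gfs2_inode_sock:
		return GFS2_BLKST_DINODE;	/* counted in count[3] */
	case gfs2_indir_blk:
	case gfs2_leaf_blk:
	case gfs2_meta_eattr:
	case gfs2_jdata:
		/* gfs1 marks non-dinode metadata like dinodes in its bitmap */
		return gfs1 ? GFS2_BLKST_DINODE : GFS2_BLKST_USED;
	case gfs2_freemeta:
		return gfs1 ? GFS2_BLKST_UNLINKED : -1;
	default:
		return -1;
	}
}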

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/gfs2/fsck/pass5.c b/gfs2/fsck/pass5.c
index b2e8adf..2b8536d 100644
--- a/gfs2/fsck/pass5.c
+++ b/gfs2/fsck/pass5.c
@@ -12,94 +12,6 @@
 #include fsck.h
 #include util.h
 
-static int gfs1_convert_mark(uint8_t q, uint32_t *count)
-{
-   switch(q) {
-
-   case gfs2_meta_inval:
-   case gfs2_inode_invalid:
-   /* Convert invalid metadata to free blocks */
-   case gfs2_block_free:
-   count[0]++;
-   return GFS2_BLKST_FREE;
-
-   case gfs2_block_used:
-   count[2]++;
-   return GFS2_BLKST_USED;
-
-   case gfs2_inode_dir:
-   case gfs2_inode_file:
-   case gfs2_inode_lnk:
-   case gfs2_inode_device:
-   case gfs2_inode_fifo:
-   case gfs2_inode_sock:
-   count[1]++;
-   return GFS2_BLKST_DINODE;
-
-   case gfs2_indir_blk:
-   case gfs2_leaf_blk:
-   /*case gfs2_meta_rgrp:*/
-   case gfs2_jdata: /* gfs1 jdata blocks count as metadata and gfs1
-   metadata is marked the same as gfs2 inode in the
-   bitmap. */
-   case gfs2_meta_eattr:
-   count[3]++;
-   return GFS2_BLKST_DINODE;
-
-   case gfs2_freemeta:
-   count[4]++;
-   return GFS2_BLKST_UNLINKED;
-
-   default:
-   log_err( _(Invalid block type %d found\n), q);
-   }
-   return -1;
-}
-
-static int gfs2_convert_mark(uint8_t q, uint32_t *count)
-{
-   switch(q) {
-
-   case gfs2_meta_inval:
-   case gfs2_inode_invalid:
-   /* Convert invalid metadata to free blocks */
-   case gfs2_block_free:
-   count[0]++;
-   return GFS2_BLKST_FREE;
-
-   case gfs2_block_used:
-   count[2]++;
-   return GFS2_BLKST_USED;
-
-   case gfs2_inode_dir:
-   case gfs2_inode_file:
-   case gfs2_inode_lnk:
-   case gfs2_inode_device:
-   case gfs2_jdata: /* gfs1 jdata blocks count as metadata and gfs1
-   metadata is marked the same as gfs2 inode in the
-   bitmap. */
-   case gfs2_inode_fifo:
-   case gfs2_inode_sock:
-   count[1]++;
-   return GFS2_BLKST_DINODE;
-
-   case gfs2_indir_blk:
-   case gfs2_leaf_blk:
-   case gfs2_meta_eattr:
-   count[2]++;
-   return GFS2_BLKST_USED;
-
-   case gfs2_freemeta:
-   log_err( _(Invalid freemeta type %d found\n), q);
-   count[4]++;
-   return -1;
-
-   default:
-   log_err( _(Invalid block type %d found\n), q);
-   }
-   return -1;
-}
-
 static int check_block_status(struct gfs2_sbd *sdp, char *buffer,
  unsigned int buflen, uint64_t *rg_block,
  uint64_t rg_data, uint32_t *count)
@@ -107,7 +19,6 @@ static int check_block_status(struct gfs2_sbd *sdp, char 
*buffer,
unsigned char *byte, *end;
unsigned int bit;
unsigned char rg_status;
-   int block_status;
uint8_t q;
uint64_t block;
 
@@ -123,26 +34,33 @@ static int check_block_status(struct gfs2_sbd *sdp, char 
*buffer,
if (skip_this_pass || fsck_abort) /* if asked to skip the rest 
*/
return 0;
 
-   q = block_type(block);
-
-   if (sdp-gfs1)
-   block_status = gfs1_convert_mark(q, count);
-   else
-   block_status = gfs2_convert_mark(q, count);
-
-   if (block_status  0) {
-   log_err( _(Invalid status for block %llu (0x%llx).\n),
-(unsigned long long)block,
-(unsigned long long)block);
-   return block_status;
+   q = blockmap_to_bitmap(block_type(block), sdp-gfs1

[Cluster-devel] [GFS2 PATCH] GFS2: Eliminate a nonsense goto

2015-01-22 Thread Bob Peterson
Hi,

This patch just removes a goto that did nothing.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 9054002..73c72253 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -543,10 +543,7 @@ static int link_dinode(struct gfs2_inode *dip, const 
struct qstr *name,
}
 
error = gfs2_dir_add(dip-i_inode, name, ip, da);
-   if (error)
-   goto fail_end_trans;
 
-fail_end_trans:
gfs2_trans_end(sdp);
 fail_ipreserv:
gfs2_inplace_release(dip);



Re: [Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: Change block_map to match bitmap

2015-02-12 Thread Bob Peterson
- Original Message -
 - Original Message -
  Hi Bob,
  
  On 22/01/15 20:41, Bob Peterson wrote:
   Hi,
  
   This patch changes the old block_map structure for fsck.gfs2 to the
   simpler bitmap structure so that we have a 1:1 correspondence. This
   was done to reduce memory requirements of fsck.gfs2.
  
  I'm curious as to whether we're losing any useful information here. I
  see an extra bread() for BLKST_USED blocks in check_data, is that the
  main performance compromise, and how severe is it?
  
  Cheers,
  Andy
 
 Hi,
 
 That's exactly where things get a bit sticky. It used to keep track of
 the various GFS2_METATYPE_* block types. For the most part, dinodes were
 straightforward because we only needed to distinguish between directories
 and non-directories. Free space is pretty straightforward too, with only
 a few corner cases where things were marked a bad type. Unlinked also
 wasn't a big problem. The problem comes with blocks marked Data.
 
 In GFS1, data blocks were data blocks. In GFS2, a data block could be
 data or metadata. In some cases we don't care because the height will
 tell us if it's real data. But things get sticky when we need to
 distinguish between different types of non-dinode metadata in fsck.
 For example, what if a dinode has indirect blocks that are not
 really indirect blocks, but say, an extended attribute block? The old
 fsck could distinguish between them pretty easily because of their
 unique types. Going by just the bitmap alone is ambiguous, so sometimes
 we need to know more details, especially in cases where we have a block
 that's referenced multiple times: it's much more likely that one of the
 references is for the wrong type, and in that case, we do the extra
 read. This shouldn't impact performance except in the case of duplicate
 references, which are rare, and in those cases the extra read needs to be
 done anyway.
 
 There are probably cases where fsck is now making assumptions about the
 previously processed block type for indirect blocks, so in that
 respect, the information is lost. However, I've tried to minimize the
 risk, and rigorous testing has shown that it works in many very difficult
 situations. You can tell my testing has been good, since I've recently
 posted so many fix patches for bugs I've uncovered. :)
 
 In that respect, this new fsck is better than the previous.
 
 Still, there may very well be cases where the lost information causes
 an unforeseen problem. I'm not sure we can do anything about that but
 test it thoroughly.
 
 Maybe I'll try to dummy up some potential problems (like the one I gave
 above with indirect versus extended attribute) to see how well it does.
 I suppose I could try to dummy up every permutation. That is likely to
 uncover even more problems, most of which already exist in today's fsck.
 
 Regards,
 
 Bob Peterson
 Red Hat File Systems

Hi,

As promised, I dummied up some file systems to intentionally have the wrong
metadata block types. I tried these scenarios:

1. File that had an extended attribute block that was really an indirect block.
2. File that had an indirect block that was really an extended attribute block.
3. Directory that had an extended attribute block that was really a leaf block.
4. Directory that had a leaf block that was really an extended attribute block.
5. File that had an extended attribute block that was really an ED block.
6. File whose indirect pointer points to a leaf block
7. Directory whose leaf block pointers all point to an indirect block

In all these scenarios, the fsck.gfs2 with my patch made the same decisions
and produced relatively the same output as the stock fsck.gfs2. So I think
it's safe to push this patch now, unless anyone has an objection or has
specific tests they'd like me to try.

Regards,

Bob Peterson
Red Hat File Systems



[Cluster-devel] [GFS2 PATCH] GFS2: Allocate reservation during write_begin if needed

2015-02-17 Thread Bob Peterson
Hi,

This patch adds a call to function gfs2_rs_alloc to make sure a
reservation structure has been allocated before attempting to
reserve blocks.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
 fs/gfs2/aops.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 805b37f..6453e23 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -675,6 +675,9 @@ static int gfs2_write_begin(struct file *file, struct 
address_space *mapping,
if (error)
goto out_unlock;
 
+   error = gfs2_rs_alloc(ip);
+   if (error)
+   goto out_qunlock;
requested = data_blocks + ind_blocks;
ap.target = requested;
error = gfs2_inplace_reserve(ip, ap);



Re: [Cluster-devel] [GFS2 PATCH] GFS2: Allocate reservation during write_begin if needed

2015-02-17 Thread Bob Peterson
- Original Message -
 Hi,
 
 
 Since we set the allocation structure when the write call begins, and it
 is not deallocated until there are no writers left with the file open,
 how does this happen?
 
 Steve.

Hi,

In a normal write, the code goes through gfs2_page_mkwrite or
gfs2_file_aio_write. In the failing scenario, it's going through
sendfile. I suppose I could patch sendfile as an alternative, but
the advantage here is that this patch will do it only if a block
allocation is needed.

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH] Add myself (Bob Peterson) as a maintainer of GFS2

2015-02-17 Thread Bob Peterson
- Original Message -
 Hi Bob,
 
 On Mon, 16 Feb 2015 09:20:56 -0500 (EST) Bob Peterson rpete...@redhat.com
 wrote:
 
  This patch adds Bob Peterson as a maintainer of the GFS2 file system.
  It also changes the development repository to a shared location rather
  than Steve Whitehouse's private location.
 
 Please also consider letting the linux-next maintainer know :-)
 
 i.e. does this mean I should change the repo/branch I fetch into
 linux-next each day and should I add you to the contacts?
 --
 Cheers,
 Stephen Rothwells...@canb.auug.org.au
 

Hi Stephen,

Yes, indeed. I was going to let you know as soon as I got things set up.
Unfortunately, I've been pulled in a lot of directions lately.

I've set up a for-next branch from which you can do your daily fetch:
pub/scm/linux/kernel/git/gfs2/linux-gfs2.git

Currently it's empty, and yes, please add me to the contacts.

Regards,

Bob Peterson
Red Hat File Systems



[Cluster-devel] [GFS2 PATCH] GFS2: Allocate reservation during splice_write

2015-02-19 Thread Bob Peterson
Hi,

This patch adds a GFS2-specific function for splice_write which
first calls function gfs2_rs_alloc to make sure a reservation
structure has been allocated before attempting to reserve blocks.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
 fs/gfs2/file.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index ec9c2d3..bd86e79 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1063,6 +1063,22 @@ static int gfs2_flock(struct file *file, int cmd, struct 
file_lock *fl)
}
 }
 
+static ssize_t gfs2_file_splice_write(struct pipe_inode_info *pipe,
+ struct file *out, loff_t *ppos,
+ size_t len, unsigned int flags)
+{
+   int error;
+   struct gfs2_inode *ip = GFS2_I(out-f_mapping-host);
+
+   error = gfs2_rs_alloc(ip);
+   if (error)
+   return (ssize_t)error;
+
+   gfs2_size_hint(out, *ppos, len);
+
+   return iter_file_splice_write(pipe, out, ppos, len, flags);
+}
+
 const struct file_operations gfs2_file_fops = {
.llseek = gfs2_llseek,
.read   = new_sync_read,
@@ -1077,7 +1093,7 @@ const struct file_operations gfs2_file_fops = {
.lock   = gfs2_lock,
.flock  = gfs2_flock,
.splice_read= generic_file_splice_read,
-   .splice_write   = iter_file_splice_write,
+   .splice_write   = gfs2_file_splice_write,
.setlease   = simple_nosetlease,
.fallocate  = gfs2_fallocate,
 };
@@ -1107,7 +1123,7 @@ const struct file_operations gfs2_file_fops_nolock = {
.release= gfs2_release,
.fsync  = gfs2_fsync,
.splice_read= generic_file_splice_read,
-   .splice_write   = iter_file_splice_write,
+   .splice_write   = gfs2_file_splice_write,
.setlease   = generic_setlease,
.fallocate  = gfs2_fallocate,
 };



[Cluster-devel] [PATCH] Add myself (Bob Peterson) as a maintainer of GFS2

2015-02-16 Thread Bob Peterson
Hi,

This patch adds Bob Peterson as a maintainer of the GFS2 file system.
It also changes the development repository to a shared location rather
than Steve Whitehouse's private location.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
 MAINTAINERS | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 249e8dd..eee1680 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4239,10 +4239,10 @@ F:  scripts/get_maintainer.pl
 
 GFS2 FILE SYSTEM
 M: Steven Whitehouse swhit...@redhat.com
+M: Bob Peterson rpete...@redhat.com
 L: cluster-devel@redhat.com
 W: http://sources.redhat.com/cluster/
-T: git 
git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes.git
-T: git git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw.git
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git
 S: Supported
 F: Documentation/filesystems/gfs2*.txt
 F: fs/gfs2/



Re: [Cluster-devel] [PATCH 4/8] gfs2: Convert to using -get_state callback

2015-02-18 Thread Bob Peterson
- Original Message -
 Convert gfs2 to use -get_state callback instead of -get_xstate.
 
 Signed-off-by: Jan Kara j...@suse.cz
 ---
  fs/gfs2/quota.c | 28 +++-
  1 file changed, 15 insertions(+), 13 deletions(-)
 
 diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
 index 3e193cb36996..c76e031ccbb4 100644
 --- a/fs/gfs2/quota.c
 +++ b/fs/gfs2/quota.c
 @@ -1467,32 +1467,34 @@ int gfs2_quotad(void *data)
   return 0;
  }
  
 -static int gfs2_quota_get_xstate(struct super_block *sb,
 -  struct fs_quota_stat *fqs)
 +static int gfs2_quota_get_state(struct super_block *sb, struct qc_state
 *state)
  {
   struct gfs2_sbd *sdp = sb-s_fs_info;
  
 - memset(fqs, 0, sizeof(struct fs_quota_stat));
 - fqs-qs_version = FS_QSTAT_VERSION;
 + memset(state, 0, sizeof(*state));
  
   switch (sdp-sd_args.ar_quota) {
   case GFS2_QUOTA_ON:
 - fqs-qs_flags |= (FS_QUOTA_UDQ_ENFD | FS_QUOTA_GDQ_ENFD);
 + state-s_state[USRQUOTA].flags |= QCI_LIMITS_ENFORCED;
 + state-s_state[GRPQUOTA].flags |= QCI_LIMITS_ENFORCED;
   /*FALLTHRU*/
   case GFS2_QUOTA_ACCOUNT:
 - fqs-qs_flags |= (FS_QUOTA_UDQ_ACCT | FS_QUOTA_GDQ_ACCT);
 + state-s_state[USRQUOTA].flags |= QCI_ACCT_ENABLED |
 +   QCI_SYSFILE;
 + state-s_state[GRPQUOTA].flags |= QCI_ACCT_ENABLED |
 +   QCI_SYSFILE;
   break;
   case GFS2_QUOTA_OFF:
   break;
   }
 -
   if (sdp-sd_quota_inode) {
 - fqs-qs_uquota.qfs_ino = GFS2_I(sdp-sd_quota_inode)-i_no_addr;
 - fqs-qs_uquota.qfs_nblks = sdp-sd_quota_inode-i_blocks;
 + state-s_state[USRQUOTA].ino =
 + GFS2_I(sdp-sd_quota_inode)-i_no_addr;
 + state-s_state[USRQUOTA].blocks = sdp-sd_quota_inode-i_blocks;
   }
 - fqs-qs_uquota.qfs_nextents = 1; /* unsupported */
 - fqs-qs_gquota = fqs-qs_uquota; /* its the same inode in both cases */
 - fqs-qs_incoredqs = list_lru_count(gfs2_qd_lru);
 + state-s_state[USRQUOTA].nextents = 1;  /* unsupported */
 + state-s_state[GRPQUOTA] = state-s_state[USRQUOTA];
 + state-s_incoredqs = list_lru_count(gfs2_qd_lru);
   return 0;
  }
  
 @@ -1637,7 +1639,7 @@ out_put:
  
  const struct quotactl_ops gfs2_quotactl_ops = {
   .quota_sync = gfs2_quota_sync,
 - .get_xstate = gfs2_quota_get_xstate,
 + .get_state  = gfs2_quota_get_state,
   .get_dqblk  = gfs2_get_dqblk,
   .set_dqblk  = gfs2_set_dqblk,
  };
 --
 2.1.4
 
 

ACK

Looks right.

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH 2/3] fsck.gfs2: Simplify bad_journalname

2015-01-28 Thread Bob Peterson
- Original Message -
 Remove the need for a temporary string and strncpy call by passing the
 length of the string to printf.
 
 Signed-off-by: Andrew Price anpr...@redhat.com
 ---
  gfs2/fsck/initialize.c | 8 ++--
  1 file changed, 2 insertions(+), 6 deletions(-)
 
 diff --git a/gfs2/fsck/initialize.c b/gfs2/fsck/initialize.c
 index 1ab078b..c052205 100644
 --- a/gfs2/fsck/initialize.c
 +++ b/gfs2/fsck/initialize.c
 @@ -1513,14 +1513,10 @@ static int init_rindex(struct gfs2_sbd *sdp)
  
  static void bad_journalname(const char *filename, int len)
  {
 - char tmp_name[64];
 -
   if (len = 64)
   len = 63;
 - strncpy(tmp_name, filename, len);
 - tmp_name[len] = '\0';
 - log_debug(_(Journal index entry '%s' has an invalid filename.\n),
 -   tmp_name);
 + log_debug(_(Journal index entry '%.*s' has an invalid filename.\n),
 +   len, filename);
  }
  
  /**
 --
 1.9.3

Nice trick (strangely, I've never used that construct), but we could just as 
well get rid of the whole function altogether.

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH 3/3] gfs2-utils build: Add a configure script summary

2015-01-28 Thread Bob Peterson
- Original Message -
 Print a nicely formatted summary of some of the more interesting
 configure options. Required some tweaking of earlier configure stages
 for accuracy.
 
 Signed-off-by: Andrew Price anpr...@redhat.com
 ---

Hi,

ACK

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH 1/3] fsck.gfs2: Fix 'initializer element is not constant' build error

2015-01-28 Thread Bob Peterson
- Original Message -
 This error occurs when gfs2-utils is compiled with -std options more
 recent than gnu89:
 
 CC   fsck_gfs2-main.o
   main.c:39:38: error: initializer element is not constant
struct osi_root dup_blocks = (struct osi_root) { NULL, };
 ^
   main.c:40:35: error: initializer element is not constant
struct osi_root dirtree = (struct osi_root) { NULL, };
  ^
   main.c:41:37: error: initializer element is not constant
struct osi_root inodetree = (struct osi_root) { NULL, };
  ^
 As far as I can tell, with C89/gnu89 the use of a cast in this context
 is undefined behaviour and the later standards are more strict about it,
 hence the error. As the standards specify that members of objects with
 static storage duration are zeroed/NULLed anyway, the initializers can
 be removed to achieve the intended result.
 
 Signed-off-by: Andrew Price anpr...@redhat.com
 ---

ACK

Bob Peterson
Red Hat File Systems



[Cluster-devel] [fsck.gfs2 patch] fsck.gfs2: Revise undo processing

2015-01-09 Thread Bob Peterson
Hi,

This patch has to do with undo processing. This is where pass1 has
processed a bunch of data blocks before encountering an error. If the
error is fatal, processing is stopped immediately and the work done
to that point (up to the fatal error block) is backed out. If the
error is not fatal, processing may continue for the rest of the
indirect block it's processing. The problem occurs when it finds a
non-fatal error block, then later it decides it cannot continue. In
that case, the undo function needs to back out changes, but it
needs to finish working through the data blocks on the rest of that
indirect block. This patch allows it to finish out the rest of the
undo work properly.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/gfs2/fsck/metawalk.c b/gfs2/fsck/metawalk.c
index cd224fe..1875b24 100644
--- a/gfs2/fsck/metawalk.c
+++ b/gfs2/fsck/metawalk.c
@@ -1446,17 +1446,28 @@ static int check_data(struct gfs2_inode *ip, struct 
metawalk_fxns *pass,
   pass1. Therefore the individual check_data functions
   should do a range check. */
rc = pass-check_data(ip, metablock, block, pass-private);
-   if (!error  rc) {
-   error = rc;
+   if (rc  (!error || (rc  error))) {
log_info(\n);
if (rc  0) {
-   *error_blk = block;
+   /* A fatal error trumps a non-fatal one. */
+   if ((*error_blk == 0) || (rc  error)) {
+   log_debug(_(Fatal error on block 0x
+   %llx preempts non-fatal 
+   error on block 0x%llx\n),
+ (unsigned long long)block,
+ (unsigned long 
long)*error_blk);
+   *error_blk = block;
+   }
log_info(_(Unrecoverable ));
+   } else { /* nonfatal error */
+   if ((*error_blk) == 0)
+   *error_blk = block;
}
log_info(_(data block error %d on block %llu 
   (0x%llx).\n), rc,
 (unsigned long long)block,
 (unsigned long long)block);
+   error = rc;
}
if (rc  0)
return rc;
@@ -1467,10 +1478,11 @@ static int check_data(struct gfs2_inode *ip, struct 
metawalk_fxns *pass,
 
 static int undo_check_data(struct gfs2_inode *ip, struct metawalk_fxns *pass,
   uint64_t *ptr_start, char *ptr_end,
-  uint64_t error_blk)
+  uint64_t error_blk, int error)
 {
int rc = 0;
uint64_t block, *ptr;
+   int found_error_blk = 0;
 
/* If there isn't much pointer corruption check the pointers */
for (ptr = ptr_start ; (char *)ptr  ptr_end  !fsck_abort; ptr++) {
@@ -1480,13 +1492,25 @@ static int undo_check_data(struct gfs2_inode *ip, 
struct metawalk_fxns *pass,
if (skip_this_pass || fsck_abort)
return 1;
block =  be64_to_cpu(*ptr);
-   if (block == error_blk)
-   return 1;
+   if (block == error_blk) {
+   if (error  0) { /* A fatal error that stopped it? */
+   log_debug(_(Stopping the undo process: 
+   fatal error block 0x%llx was 
+   found.\n),
+ (unsigned long long)error_blk);
+   return 1;
+   }
+   found_error_blk = 1;
+   log_debug(_(The non-fatal error block 0x%llx was 
+   found, but undo processing will continue 
+   until the end of this metadata block.\n),
+ (unsigned long long)error_blk);
+   }
rc = pass-undo_check_data(ip, block, pass-private);
if (rc  0)
return rc;
}
-   return 0;
+   return found_error_blk;
 }
 
 static int hdr_size(struct gfs2_buffer_head *bh, int height)
@@ -1589,9 +1613,11 @@ int check_metatree(struct gfs2_inode *ip, struct 
metawalk_fxns *pass)
 undo_metalist:
if (!error)
goto out;
-   log_err( _(Error: inode %llu (0x%llx) had unrecoverable errors.\n),
+   log_err( _(Error: inode %llu (0x%llx) had unrecoverable errors

[Cluster-devel] [fsck.gfs2 patch] fsck.gfs2: Change basic dentry checks for too long of file names

2015-01-09 Thread Bob Peterson
Hi,

This is the first in a series of patches to fsck.gfs2. I've uncovered some
problems during recent testing, and these patches fix them.

This patch adds a check to the basic dentry check such that it will reject
any dirents whose file name is longer than the maximum allowed length.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/gfs2/fsck/pass2.c b/gfs2/fsck/pass2.c
index 1559d8e..bdcf77c 100644
--- a/gfs2/fsck/pass2.c
+++ b/gfs2/fsck/pass2.c
@@ -468,7 +468,8 @@ static int basic_dentry_checks(struct gfs2_inode *ip, 
struct gfs2_dirent *dent,
}
}
 
-   if (de-de_rec_len  GFS2_DIRENT_SIZE(de-de_name_len)) {
+   if (de-de_rec_len  GFS2_DIRENT_SIZE(de-de_name_len) ||
+   de-de_name_len  GFS2_FNAMESIZE) {
log_err( _(Dir entry with bad record or name length\n
\tRecord length = %u\n\tName length = %u\n),
de-de_rec_len, de-de_name_len);
@@ -476,9 +477,12 @@ static int basic_dentry_checks(struct gfs2_inode *ip, 
struct gfs2_dirent *dent,
log_err( _(Directory entry not fixed.\n));
return 0;
}
+   /* Don't be tempted to do this:
fsck_blockmap_set(ip, ip-i_di.di_num.no_addr,
  _(corrupt directory entry),
  gfs2_inode_invalid);
+   We can't free it because another dir may have a valid reference
+   to it. Just return 1 so we can delete the bad dirent. */
log_err( _(Bad directory entry deleted.\n));
return 1;
}



[Cluster-devel] [fsck.gfs2 patch] fsck.gfs2: Print out block number when pass3 finds a bad directory

2015-01-09 Thread Bob Peterson
Hi,

This patch changes pass3 so that it prints out the directory inode
number when it finds a directory containing a bad block.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/gfs2/fsck/pass3.c b/gfs2/fsck/pass3.c
index 9582b5b..33865df 100644
--- a/gfs2/fsck/pass3.c
+++ b/gfs2/fsck/pass3.c
@@ -246,7 +246,10 @@ int pass3(struct gfs2_sbd *sdp)
q = block_type(di-dinode.no_addr);
if (q == gfs2_bad_block) {
log_err( _(Found unlinked directory 
-  containing bad block\n));
+  containing bad block at block %llu
+   (0x%llx)\n),
+   (unsigned long long)di-dinode.no_addr,
+   (unsigned long long)di-dinode.no_addr);
if (query(_(Clear unlinked directory 
   with bad blocks? (y/n) ))) {
log_warn( _(inode %lld (0x%llx) is 



[Cluster-devel] [fsck.gfs2 patch] fsck.gfs2: Adjust when hash table is doubled

2015-01-09 Thread Bob Peterson
Hi,

If directory entries are found to be on the wrong leaf block, they are
moved as needed. That sometimes requires the leaf block to be split,
which, in turn, sometimes requires that the hash table be doubled. The
problem was that the functions that traverse the hash tables did not
take into account that the hash table could unexpectedly double,
throwing off all the pointers. This patch attempts to correct that by
fixing up the pointers as necessary.
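
A small worked example (illustration only, not part of the patch) of the
reference-count adjustment the code below makes when di_depth grows mid-scan:

#include <stdio.h>

int main(void)
{
	int old_depth = 7, new_depth = 8;   /* hash table doubled: 128 -> 256 ptrs */
	int ref_count = 2;                  /* references expected before doubling */

	/* Every leaf is now pointed to twice as many times from the hash
	   table, so the expected reference count scales by the same power
	   of two; lindex is shifted the same way. */
	ref_count <<= (new_depth - old_depth);
	printf("adjusted ref_count = %d (hash table now %d pointers)\n",
	       ref_count, 1 << new_depth);
	return 0;
}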

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/gfs2/fsck/metawalk.c b/gfs2/fsck/metawalk.c
index 9aa9fb8..cd224fe 100644
--- a/gfs2/fsck/metawalk.c
+++ b/gfs2/fsck/metawalk.c
@@ -461,7 +461,7 @@ static int check_entries(struct gfs2_inode *ip, struct 
gfs2_buffer_head *bh,
} else {
error = pass-check_dentry(ip, dent, prev, bh,
   filename, count,
-  lindex,
+  lindex,
   pass-private);
if (error  0) {
stack;
@@ -504,6 +504,7 @@ int check_leaf(struct gfs2_inode *ip, int lindex, struct 
metawalk_fxns *pass,
uint32_t count = 0;
struct gfs2_sbd *sdp = ip-i_sbd;
const char *msg;
+   int di_depth = ip-i_di.di_depth;
 
/* Make sure the block number is in range. */
if (!valid_block(ip-i_sbd, *leaf_no)) {
@@ -604,6 +605,16 @@ int check_leaf(struct gfs2_inode *ip, int lindex, struct 
metawalk_fxns *pass,
}
}
 out:
+   if (di_depth  ip-i_di.di_depth) {
+   log_debug(_(Depth of directory %lld (0x%llx) changed from 
+   %d to %d; adjusting ref_count from %d to %d\n),
+ (unsigned long long)ip-i_di.di_num.no_addr,
+ (unsigned long long)ip-i_di.di_num.no_addr,
+ di_depth, ip-i_di.di_depth,
+ *ref_count,
+ (*ref_count)  (ip-i_di.di_depth - di_depth));
+   (*ref_count) = (ip-i_di.di_depth - di_depth);
+   }
brelse(lbh);
return 0;
 
@@ -618,6 +629,16 @@ bad_leaf:
return fix;
 
}
+   if (di_depth  ip-i_di.di_depth) {
+   log_debug(_(Depth of directory %lld (0x%llx) changed from 
+   %d to %d. Adjusting ref_count from %d to %d\n),
+ (unsigned long long)ip-i_di.di_num.no_addr,
+ (unsigned long long)ip-i_di.di_num.no_addr,
+ di_depth, ip-i_di.di_depth,
+ *ref_count,
+ (*ref_count)  (ip-i_di.di_depth - di_depth));
+   (*ref_count) = (ip-i_di.di_depth - di_depth);
+   }
return 1;
 }
 
@@ -773,8 +794,13 @@ static int check_leaf_blks(struct gfs2_inode *ip, struct 
metawalk_fxns *pass)
}
error = check_leaf(ip, lindex, pass, leaf_no, leaf,
   ref_count);
-   if (ref_count != orig_ref_count)
+   if (ref_count != orig_ref_count) {
+   log_debug(_(Ref count of leaf 0x%llx 
+   changed from %d to %d.\n),
+ (unsigned long long)leaf_no,
+ orig_ref_count, ref_count);
tbl_valid = 0;
+   }
if (!leaf.lf_next || error)
break;
leaf_no = leaf.lf_next;
@@ -787,6 +813,8 @@ static int check_leaf_blks(struct gfs2_inode *ip, struct 
metawalk_fxns *pass)
  (unsigned long long)ip-i_di.di_num.no_addr,
  orig_di_depth, ip-i_di.di_depth);
tbl_valid = 0;
+   lindex = (ip-i_di.di_depth - orig_di_depth);
+   hsize = (1  ip-i_di.di_depth);
}
if (orig_di_height != ip-i_di.di_height) {
log_debug(_(Height of 0x%llx changed from %d to 
@@ -1663,7 +1691,7 @@ int check_dir(struct gfs2_sbd *sdp, uint64_t block, 
struct metawalk_fxns *pass)
 static int remove_dentry(struct gfs2_inode *ip, struct gfs2_dirent *dent,
 struct gfs2_dirent *prev_de,
 struct gfs2_buffer_head *bh,
-char *filename, uint32_t *count, int lindex,
+char *filename, uint32_t *count, int *lindex,
 void *private)
 {
/* the metawalk_fxn's private field must be set

[Cluster-devel] [fsck.gfs2 patch] fsck.gfs2: remove duplicate designation during undo

2015-01-09 Thread Bob Peterson
Hi,

This patch fixes a problem whereby fsck.gfs2's pass1 would perform this
sequence of events:
1. Metadata block X is identified as being referenced from dinode D1.
2. Metadata block X is identified as being referenced from another dinode, D2,
   which makes it a duplicate reference, but so far, no serious errors were
   found for that dinode.
3. Dinode D2 is found later to be irreparably damaged, and needs to be removed.
When D2 is deleted, the duplicate reference from D2 is removed and block X is
not freed because D1 still references it. However, it's still marked as a
duplicate and requires processing in pass1b.

Later, pass1b resolves the duplicate and determines there is really only one
reference remaining, so it makes the correct decision. However, it should not
be necessary. The undo functions should remove the duplicate reference if
(and only if) the only reference was from D2. Note, though, that if the
corruption is found later in the cycle (after undo is possible) the duplicate
reference MUST remain and be resolved by pass1b.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/gfs2/fsck/pass1.c b/gfs2/fsck/pass1.c
index a4ba04c..73b054c 100644
--- a/gfs2/fsck/pass1.c
+++ b/gfs2/fsck/pass1.c
@@ -364,6 +364,11 @@ static int undo_reference(struct gfs2_inode *ip, uint64_t 
block, int meta,
  from another inode; not freeing.\n),
(unsigned long long)block,
(unsigned long long)block);
+   if (dt-refs == 1) {
+   log_err(_(This was the only duplicate 
+ reference so far; removing it.\n));
+   dup_delete(dt);
+   }
return 1;
}
}
@@ -1055,8 +1060,13 @@ static int rangecheck_block(struct gfs2_inode *ip, 
uint64_t block,
  (unsigned long long)ip-i_di.di_num.no_addr);
if ((*bad_pointers) = BAD_POINTER_TOLERANCE)
return meta_is_good;
-   else
+   else {
+   log_debug(_(Inode 0x%llx bad pointer tolerance 
+   exceeded: block 0x%llx.\n),
+ (unsigned long long)ip-i_di.di_num.no_addr,
+ (unsigned long long)block);
return meta_error; /* Exits check_metatree quicker */
+   }
}
return meta_is_good;
 }



Re: [Cluster-devel] 3.18.5 kernel panic: fs/gfs2/acl.c:76

2015-02-09 Thread Bob Peterson
- Original Message -
 On 06/02/15 23:50, Andreas Gruenbacher wrote:
  Andrew,
 
  3.18.5 kernel crashing on acl deletion:
 
  null pointer dereference in fs/gfs2/acl.c:76
 
  this bug seems to exist since commit 2646a1f6 from October 2009.
 
 The if-statement originates in 2646a1f6 but the bug was introduced by
 the deletion of a NULL check in e01580bf9e which was in December 2013.
 
  fix we're using currently:
 
  ---
fs/gfs2/acl.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
 
  diff --git a/fs/gfs2/acl.c b/fs/gfs2/acl.c
  index 3088e2a..8339754 100644
  --- a/fs/gfs2/acl.c
  +++ b/fs/gfs2/acl.c
  @@ -73,7 +73,7 @@ int gfs2_set_acl(struct inode *inode, struct posix_acl
  *acl, int type)
 
   BUG_ON(name == NULL);
 
  -   if (acl-a_count  GFS2_ACL_MAX_ENTRIES(GFS2_SB(inode)))
  +   if ((acl)  (acl-a_count 
  GFS2_ACL_MAX_ENTRIES(GFS2_SB(inode
   return -E2BIG;
 
   if (type == ACL_TYPE_ACCESS) {
 
  Except for the extra parentheses this seems correct, thank you.
 
 Agreed. Good catch.
 
 Thanks,
 Andy
Hi,

Christoph's patch, which introduced the problem, was never ported to RHEL7, so
let's just treat this as an upstream bug.

Andreas: I think maybe you should post your acl patch separately.

Andrew Elble:
I don't think we even need a bugzilla for this one. Do you want to just
post your latest patch (with fewer parentheses) to cluster-devel@redhat.com
so Steve Whitehouse can pick it up in the GFS2 nmw git tree? Then you can
get the credit.

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [GFS2 PATCH 0/4] fallocate quota fixes

2015-02-12 Thread Bob Peterson
- Original Message -
 These patches are related to bz1174295 where fallocate could
 exceed quota. I'm posting these for early feedback as these
 patches are only compile-tested so far.
 
 patch 1 - This is the patch that actually addresses the quota
 exceed issue. Quota checks were not being performed
 against the blocks about to be allocated.
 
 patch 2 - Adds new variants of quota check functions that
 return the number of allowed blocks if quotas are
 violated by the number of requested blocks
 
 patch 3 - Adds a new variant of gfs2_inplace_reserve that
 returns the max number of available blocks if the
 function returns -ENOSPC due to unavailability of
 the requested number of blocks.
 
 patch 4 - Allows fallocate to take advantage of patches 2 and
 3 to efficiently max out quotas or fill up the fs
 instead of returning -EDQUOT/-ENOSPC and leaving some
 available blocks unallocated.
 
 Abhi Das (4):
   gfs2: check quota for blocks we're about to allocate
   gfs2: add new quota check functions
   gfs2: add new function gfs2_inpl_rsrv_ret_max_avl
   gfs2: allow fallocate to max out quotas/fs efficiently
 
  fs/gfs2/aops.c  |  6 +++---
  fs/gfs2/bmap.c  |  2 +-
  fs/gfs2/file.c  | 30 ++
  fs/gfs2/inode.c | 14 --
  fs/gfs2/quota.c | 26 +-
  fs/gfs2/quota.h | 20 +---
  fs/gfs2/rgrp.c  | 13 +++--
  fs/gfs2/rgrp.h  | 10 +-
  fs/gfs2/xattr.c |  2 +-
  9 files changed, 89 insertions(+), 34 deletions(-)
 
 --
 1.8.1.4

Hi,

ACK. Looks okay to me.

Regards,

Bob Peterson
Red Hat File Systems



[Cluster-devel] [GFS2 PATCH] GFS2: Eliminate __gfs2_glock_remove_from_lru

2015-01-05 Thread Bob Peterson
Hi,

Since the only caller of function __gfs2_glock_remove_from_lru locks the
same spin_lock as gfs2_glock_remove_from_lru, the functions can be combined.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index a23524a..aeb7bc9 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -173,19 +173,14 @@ void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
spin_unlock(lru_lock);
 }
 
-static void __gfs2_glock_remove_from_lru(struct gfs2_glock *gl)
+static void gfs2_glock_remove_from_lru(struct gfs2_glock *gl)
 {
+   spin_lock(lru_lock);
if (!list_empty(gl-gl_lru)) {
list_del_init(gl-gl_lru);
atomic_dec(lru_count);
clear_bit(GLF_LRU, gl-gl_flags);
}
-}
-
-static void gfs2_glock_remove_from_lru(struct gfs2_glock *gl)
-{
-   spin_lock(lru_lock);
-   __gfs2_glock_remove_from_lru(gl);
spin_unlock(lru_lock);
 }
 
@@ -205,9 +200,7 @@ void gfs2_glock_put(struct gfs2_glock *gl)
 
lockref_mark_dead(gl-gl_lockref);
 
-   spin_lock(lru_lock);
-   __gfs2_glock_remove_from_lru(gl);
-   spin_unlock(lru_lock);
+   gfs2_glock_remove_from_lru(gl);
spin_unlock(gl-gl_lockref.lock);
spin_lock_bucket(gl-gl_hash);
hlist_bl_del_rcu(gl-gl_list);



Re: [Cluster-devel] [GFS2 0/3] fallocate and quota fixes

2015-03-18 Thread Bob Peterson
- Original Message -
 This is a revised version of the patches required to properly fix the
 fallocate quota issue described in bz1174295
 
 patch1: This patch supplies gfs2_quota_check() with the number of blocks
   the caller intends to allocate in the current operation, resulting
   in a more accurate quota check.
 
 patch2: gfs2_quota_check() and gfs2_inplace_reserve() return the number
   of blocks available subject to quota limits and rgrp size
   respectively. These functions don't return error if atleast
   ap.min_target (if set) blocks are allocatable.
 
 patch3: The fallocate() function uses the features of patch2 to determine
   how many total blocks are available for allocation and uses them
   right away, instead of guessing/retrying inefficiently.
 
 The behavior of quota enforcement is altered by this patchset.
 i.   Quotas are exceeded (a warning message is also issued to syslog) before
  the actual block usage exceeds the imposed limit. In fact, the actual
  usage can never exceed the limit. Whenever it is determined that the
  completion of an operation will cause a quota to exceed its limit, such
  an operation is aborted with -EDQUOT and a 'quota exceeded' message is
  dispatched. Note: When min_target is set and the allowed blocks are >=
  min_target, we don't issue an error. It is assumed that the caller will
  only allocate the allowed blocks.
 ii.  The gfs2_write_calc_reserv()/calc_max_reserv() functions are used to
  map between available blocks and the data bytes that can be written
  using them. Typically, for large files, some blocks are used up for
  metadata and only the remaining blocks can be used for data. Example:
  To write only a handful of bytes that would easily fit in one block, we
  might have to allocate an extra bunch of intermediate metadata blocks.
  If we had only 1 block left in our allotted quota, this operation would
  likely fail.
 
 The functions mentioned in ii. are not very efficient. They always compute
 the worst case number of extra blocks required and it is often the case that
 not all those extra blocks are used. We need to find a better algorithm to
 get a tighter estimate on the blocks needed for a given number of bytes.
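
To illustrate the worst-case estimate mentioned above, here is a small standalone
sketch. The geometry constants and the function are invented for the example; they
are not GFS2's real on-disk values or its gfs2_write_calc_reserv() code, but they
show why even a tiny write can be charged several metadata blocks:

#include <stdio.h>

/* Example geometry only; not GFS2's real on-disk values. */
#define EX_BLOCK_SIZE    4096UL  /* bytes per block */
#define EX_PTRS_PER_BLK   509UL  /* pointers per indirect block */
#define EX_TREE_HEIGHT      3    /* assumed metadata tree height */

/* Worst-case blocks charged for writing 'len' bytes: the data blocks,
 * a possible new path of indirect blocks, and enough indirect blocks
 * at each level to map all of the new data. */
static unsigned long ex_worst_case_blocks(unsigned long len)
{
	unsigned long data = (len + EX_BLOCK_SIZE - 1) / EX_BLOCK_SIZE;
	unsigned long ind = EX_TREE_HEIGHT - 1;
	unsigned long level = data;

	while (level > EX_PTRS_PER_BLK) {
		level = (level + EX_PTRS_PER_BLK - 1) / EX_PTRS_PER_BLK;
		ind += level;
	}
	return data + ind;
}

int main(void)
{
	/* A 100-byte write may still be charged 3 blocks, so a quota with
	 * only 1 block left would reject it even though the data fits. */
	printf("100 bytes -> %lu blocks reserved\n", ex_worst_case_blocks(100));
	printf("1 GiB     -> %lu blocks reserved\n", ex_worst_case_blocks(1UL << 30));
	return 0;
}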
 
 I've run some basic tests and things seem to be holding up. The failing case
 in bz1174295 is fixed using this patchset. I'll do test build and pass it on
 to Nate to test with.
 
 Abhi Das (3):
   gfs2: perform quota checks against allocation parameters
   gfs2: allow quota_check and inplace_reserve to return available blocks
   gfs2: allow fallocate to max out quotas/fs efficiently
 
  fs/gfs2/aops.c   |  6 ++---
  fs/gfs2/bmap.c   |  2 +-
  fs/gfs2/file.c   | 81
  
  fs/gfs2/incore.h |  4 ++-
  fs/gfs2/inode.c  | 18 +++--
  fs/gfs2/quota.c  | 54 +++--
  fs/gfs2/quota.h  |  8 +++---
  fs/gfs2/rgrp.c   | 20 ++
  fs/gfs2/rgrp.h   |  3 ++-
  fs/gfs2/xattr.c  |  2 +-
  10 files changed, 133 insertions(+), 65 deletions(-)
 
 --
 1.8.1.4

Hi,

ACK to all three.
Now pushed to the for-next branch of the linux-gfs2.git git tree.

Regards,

Bob Peterson



Re: [Cluster-devel] [PATCH 1/1] gfs2: incorrect check for debugfs returns

2015-03-30 Thread Bob Peterson
- Original Message -
 debugfs_create_dir and debugfs_create_file may return -ENODEV when debugfs
 is not configured, so the return value should be checked against ERROR_VALUE
 as well, otherwise the later dereference of the dentry pointer would crash
 the kernel.
 
 Signed-off-by: Chengyu Song cson...@gatech.edu
 ---
  fs/gfs2/glock.c | 47 ---
  1 file changed, 28 insertions(+), 19 deletions(-)

Hi,

I've now added this patch to the for-next branch of the linux-gfs2.git
tree. Thanks.

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH] GFS2: Improve readability of gfs2_alloc_inode

2015-03-03 Thread Bob Peterson
- Original Message -
 Returning &ip->i_inode when ip is NULL is safe as i_inode is the first
 member in struct gfs2_inode, but that's not immediately obvious.
 Reorganize gfs2_alloc_inode to avoid any doubt.
 
 Signed-off-by: Andrew Price anpr...@redhat.com
 ---
 
 Re-sending with a more appropriate commit log based on Andreas' comments.
 
  fs/gfs2/super.c | 13 +++--
  1 file changed, 7 insertions(+), 6 deletions(-)
 
 diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
 index 1666382..37c59ee 100644
 --- a/fs/gfs2/super.c
 +++ b/fs/gfs2/super.c
 @@ -1628,12 +1628,13 @@ static struct inode *gfs2_alloc_inode(struct super_block *sb)
  	struct gfs2_inode *ip;
  
  	ip = kmem_cache_alloc(gfs2_inode_cachep, GFP_KERNEL);
 -	if (ip) {
 -		ip->i_flags = 0;
 -		ip->i_gl = NULL;
 -		ip->i_rgd = NULL;
 -		ip->i_res = NULL;
 -	}
 +	if (!ip)
 +		return NULL;
 +
 +	ip->i_flags = 0;
 +	ip->i_gl = NULL;
 +	ip->i_rgd = NULL;
 +	ip->i_res = NULL;
  	return &ip->i_inode;
  }
  
 --
 1.9.3

ACK,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH RHEL6] libgfs2: Use a matching context mount option in mount_gfs2_meta

2015-02-27 Thread Bob Peterson
- Original Message -
 On a system with SELinux enabled, if a gfs2 file system is mounted with
 a context= option, the tools gfs2_quota, gfs2_tool, gfs2_grow and
 gfs2_jadd will fail with Device or resource busy. This is due to
 SELinux failing the mount due to a mismatched context (SELinux: mount
 invalid.  Same superblock, different security settings).
 
 In order to work around this, parse the context option of the gfs2 mount
 point in is_pathname_mounted() and use it in mount_gfs2_meta().
 
 Resolves: rhbz#1121693
 
 Signed-off-by: Andrew Price anpr...@redhat.com
 ---
  gfs2/libgfs2/libgfs2.h |  1 +
  gfs2/libgfs2/misc.c| 21 -
  2 files changed, 21 insertions(+), 1 deletion(-)
 
 diff --git a/gfs2/libgfs2/libgfs2.h b/gfs2/libgfs2/libgfs2.h
 index 9c20f11..25286d1 100644
 --- a/gfs2/libgfs2/libgfs2.h
 +++ b/gfs2/libgfs2/libgfs2.h
 @@ -217,6 +217,7 @@ struct gfs2_sbd {
  
  	int device_fd;
  	int path_fd;
 +	char *secontext;
  
  	uint64_t sb_addr;
  
 diff --git a/gfs2/libgfs2/misc.c b/gfs2/libgfs2/misc.c
 index 8e0ca6f..5ef4a2a 100644
 --- a/gfs2/libgfs2/misc.c
 +++ b/gfs2/libgfs2/misc.c
 @@ -100,6 +100,24 @@ int compute_constants(struct gfs2_sbd *sdp)
  	return 0;
  }
  
 +/**
 + * Returns a duplicate of the 'context' mount option, or NULL if not found.
 + */
 +static char *copy_context_opt(struct mntent *mnt)
 +{
 +	char *ctx, *end;
 +
 +	ctx = hasmntopt(mnt, "context");
 +	if (ctx == NULL)
 +		return NULL;
 +
 +	end = strchr(ctx, ',');
 +	if (end == NULL)
 +		return NULL;
 +
 +	return strndup(ctx, end - ctx);
 +}
 +
  int is_pathname_mounted(struct gfs2_sbd *sdp, int *ro_mount)
  {
  	FILE *fp;
 @@ -161,6 +179,7 @@ int is_pathname_mounted(struct gfs2_sbd *sdp, int *ro_mount)
  		return 0;
  	if (hasmntopt(mnt, MNTOPT_RO))
  		*ro_mount = 1;
 +	sdp->secontext = copy_context_opt(mnt);
  	return 1; /* mounted */
  }
  
 @@ -319,7 +338,7 @@ int mount_gfs2_meta(struct gfs2_sbd *sdp)
  	sigaction(SIGCONT, &sa, NULL);
  	sigaction(SIGUSR1, &sa, NULL);
  	sigaction(SIGUSR2, &sa, NULL);
 -	ret = mount(sdp->path_name, sdp->metafs_path, "gfs2meta", 0, NULL);
 +	ret = mount(sdp->path_name, sdp->metafs_path, "gfs2meta", 0, sdp->secontext);
  	if (ret) {
  		rmdir(sdp->metafs_path);
  		return -1;
 --
 1.9.3
 
 
Hi,

Is this a memory leak (albeit a small one) or did I miss something?
I don't see where the memory allocated by strndup is ever freed.
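
If it really is never freed, the usual fix would be to release it on the teardown
path. A rough userspace sketch of that pattern (the ex_sbd struct and
ex_release_sbd() are invented names for illustration, not the libgfs2 code):

#include <stdlib.h>
#include <string.h>

struct ex_sbd {
	char *secontext;    /* strndup()'d "context" mount option, or NULL */
};

/* Hypothetical teardown hook; free(NULL) is a no-op, so no check needed. */
static void ex_release_sbd(struct ex_sbd *sdp)
{
	free(sdp->secontext);
	sdp->secontext = NULL;
}

int main(void)
{
	struct ex_sbd sdp = { .secontext = strndup("context=system_u,ro", 7) };

	/* ... the metadata mount would use sdp.secontext here ... */
	ex_release_sbd(&sdp);
	return 0;
}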

Regards,

Bob Peterson
Red Hat File Systems



[Cluster-devel] Announcing a new GFS2 development git repository and branch

2015-02-23 Thread Bob Peterson
Hi,

I'm going to be helping Steve Whitehouse maintain the upstream GFS2 kernel git
tree. To make this possible, we had to switch the repository to a different
location that's shared between us. The new GFS2 upstream tree is located here:

https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/

As per the kernel.org standard, we're planning to use branch for-next to
keep recent upstream patches queued for the next merge window. This replaces
the old master branch in the -nmw development repo. So linux-gfs2's master
branch will track Linus's tree, and for-next will be for queuing new patches.

To clone the new tree, do something like this:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git

There are currently three patches pushed to the new for-next branch:

0108044 Add myself (Bob Peterson) as a maintainer of GFS2 (Bob Peterson)
9717c00 GFS2: gfs2_set_acl(): Cache no acl as well  (Andreas Gruenbacher)
76514ae GFS2: Allocate reservation during splice_write(Bob Peterson)

Stephen Rothwell: If you haven't already done so, please change your nightly
script to pull patches for linux-next from the for-next branch of this tree.

I've never been a kernel maintainer before, so expect a few blunders, and let
me know if (and how) I can do better with the process.

Regards,

Bob Peterson



Re: [Cluster-devel] [PATCH] GFS2: gfs2_set_acl(): Cache no acl as well

2015-02-23 Thread Bob Peterson
Hi,

Now in the for-next tree of the linux-gfs2.git repo.

Regards,

Bob Peterson

- Original Message -
 When removing a default acl or setting an access acl that is entirely
 represented in the file mode, we end up with acl == NULL in gfs2_set_acl().
 In
 that case, bring gfs2 in line with other file systems and cache the NULL acl
 with set_cached_acl() instead of invalidating the cache with
 forget_cached_acl().
 
 Signed-off-by: Andreas Gruenbacher agrue...@redhat.com
 ---
  fs/gfs2/acl.c | 6 +-
  1 file changed, 1 insertion(+), 5 deletions(-)
 
 diff --git a/fs/gfs2/acl.c b/fs/gfs2/acl.c
 index 7b31430..1be3b06 100644
 --- a/fs/gfs2/acl.c
 +++ b/fs/gfs2/acl.c
 @@ -110,11 +110,7 @@ int gfs2_set_acl(struct inode *inode, struct posix_acl *acl, int type)
  	error = __gfs2_xattr_set(inode, name, data, len, 0, GFS2_EATYPE_SYS);
  	if (error)
  		goto out;
 -
 -	if (acl)
 -		set_cached_acl(inode, type, acl);
 -	else
 -		forget_cached_acl(inode, type);
 +	set_cached_acl(inode, type, acl);
  out:
  	kfree(data);
  	return error;
 --
 2.1.0
 
 



Re: [Cluster-devel] [GFS2 0/3] fallocate and quota fixes

2015-02-25 Thread Bob Peterson
- Original Message -
 This is a revised version of the patches required to properly fix the
 fallocate quota issue described in bz1174295
 
 patch1: This patch supplies gfs2_quota_check() with the number of blocks
   the caller intends to allocate in the current operation, resulting
   in a more accurate quota check.
 
 patch2: gfs2_quota_check() and gfs2_inplace_reserve() return the number
   of blocks available subject to quota limits and rgrp size
   respectively. The difference from previous versions of this patch
   is that we return the available blocks even on success.
 
 patch3: The fallocate() function uses the features of patch2 to determine
   how many total blocks are available for allocation and uses them
   right away, instead of guessing inefficiently.
 
 The behavior of quota enforcement is altered by this patchset.
 i.   Quotas are exceeded (a warning message is also issued to syslog) before
  the actual block usage exceeds the imposed limit. In fact, the actual
  usage can never exceed the limit. Whenever it is determined that the
  completion of an operation will cause a quota to exceed its limit, such
  an operation is aborted with -EDQUOT and a 'quota exceeded' message is
  dispatched.
 ii.  The gfs2_write_calc_reserv()/calc_max_reserv() functions are used to
  map between available blocks and the data bytes that can be written
  using them. Typically, for large files, some blocks are used up for
  metadata and only the remaining blocks can be used for data. Example:
  To write only a handful of bytes that would easily fit in one block, we
  might have to allocate an extra bunch of intermediate metadata blocks.
  If we had only 1 block left in our allotted quota, this operation would
  likely fail.
 
 The functions mentioned in ii. are not very efficient. They always compute
 the worst case number of extra blocks required and it is often the case that
 not all those extra blocks are used. We need to find a better algorithm to
 get a tighter estimate on the blocks needed for a given number of bytes.
 
 I've run some basic tests and things seem to be holding up. The failing case
 in bz1174295 is fixed using this patchset. I'll do test build and pass it on
 to Nate to test with.
 
 Abhi Das (3):
   gfs2: perform quota checks against allocation parameters
   gfs2: allow quota_check and inplace_reserve to return available blocks
   gfs2: allow fallocate to max out quotas/fs efficiently
 
  fs/gfs2/aops.c   |  6 ++---
  fs/gfs2/bmap.c   |  2 +-
  fs/gfs2/file.c   | 75
  ++--
  fs/gfs2/incore.h |  3 ++-
  fs/gfs2/inode.c  | 18 --
  fs/gfs2/quota.c  | 40 ++
  fs/gfs2/quota.h  |  8 +++---
  fs/gfs2/rgrp.c   | 12 +++--
  fs/gfs2/rgrp.h   |  6 +++--
  fs/gfs2/xattr.c  |  2 +-
  10 files changed, 117 insertions(+), 55 deletions(-)
 
 --
 1.8.1.4

Looks good to me.
ACK to the series.

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] linux-next: Tree for Apr 20 (gfs2)

2015-04-24 Thread Bob Peterson
- Original Message -
 On 04/19/15 22:45, Stephen Rothwell wrote:
  Hi all,
  
  Please do not add any v4.2 material to your linux-next included trees
  until after v4.1-rc1 is released.
  
  Changes since 20150415:
  
 
 on i386:
 
 ERROR: __divdi3 [fs/gfs2/gfs2.ko] undefined!
 
 
 
 --
 ~Randy

Hi,

My apologies. I've pushed an addendum patch that fixes the problem to
the for-next branch of the linux-gfs2.git git tree.

Regards,

Bob Peterson



Re: [Cluster-devel] [PATCH] mkfs.gfs2: Allow longer cluster names

2015-04-22 Thread Bob Peterson
- Original Message -
 Increase the enforced limit for cluster name to 32 bytes and file
 system name to 30 bytes for mkfs.gfs2 (was previously 16 + 16
 bytes).
 
 Also increased this limit in tunegfs2 when labelling gfs2 file
 systems.
 
 Updated the information in the man pages along with adding a new test
 case for mkfs.gfs2 to validate the increased cluster/file system
 name support.
 
 Signed-off-by: Paul Evans pev...@redhat.com
 ---
  gfs2/man/mkfs.gfs2.8  | 4 ++--
  gfs2/mkfs/main_mkfs.c | 4 ++--
  gfs2/tune/super.c | 2 +-
  tests/mkfs.at | 8 
  4 files changed, 13 insertions(+), 5 deletions(-)
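
For illustration, a length check along the lines described above might look like the
sketch below. The constants and function names are assumptions for the example and
do not come from the mkfs.gfs2 patch itself:

#include <stdio.h>
#include <string.h>

/* Limits from the patch description; macro and function names are invented. */
#define EX_CLUSTER_NAME_MAX 32
#define EX_FSNAME_MAX       30

static int ex_check_locktable(const char *locktable)
{
	const char *colon = strchr(locktable, ':');

	if (colon == NULL) {
		fprintf(stderr, "lock table must be clustername:fsname\n");
		return -1;
	}
	if ((size_t)(colon - locktable) > EX_CLUSTER_NAME_MAX) {
		fprintf(stderr, "cluster name too long (max %d bytes)\n",
			EX_CLUSTER_NAME_MAX);
		return -1;
	}
	if (strlen(colon + 1) > EX_FSNAME_MAX) {
		fprintf(stderr, "file system name too long (max %d bytes)\n",
			EX_FSNAME_MAX);
		return -1;
	}
	return 0;
}

int main(void)
{
	printf("%d\n", ex_check_locktable("mycluster:myfs"));   /* accepted: 0 */
	printf("%d\n", ex_check_locktable("a_cluster_name_well_over_thirty_two_bytes:myfs"));   /* rejected: -1 */
	return 0;
}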

Hi,

ACK

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH] GFS2: mark the journal idle to fix ro mounts

2015-04-24 Thread Bob Peterson
- Original Message -
 When gfs2 was mounted read-only and then unmounted, it was writing a
 header block to the journal in the syncing gfs2_log_flush() call from
 kill_sb(). This is because the journal was not being marked as idle
 until the first log header was written out, and on a read-only mount
 there never was a log header written out. Since the journal was not
 marked idle, gfs2_log_flush() was writing out a header block to make
 sure it was empty during the sync.  Not only did this cause IO to a
 read-only filesystem, but the journalling isn't completely initialized
 on read-only mounts, and so gfs2 was writing out the wrong sequence
 number in the log header.
 
 Now, the journal is marked idle on mount, and gfs2_log_flush() won't
 write out anything until there starts being transactions to flush.
 
 Signed-off-by: Benjamin Marzinski bmarz...@redhat.com

Hi,

ACK
Makes sense to me.

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH] GFS2: mark the journal idle to fix ro mounts

2015-05-01 Thread Bob Peterson
- Original Message -
 When gfs2 was mounted read-only and then unmounted, it was writing a
 header block to the journal in the syncing gfs2_log_flush() call from
 kill_sb(). This is because the journal was not being marked as idle
 until the first log header was written out, and on a read-only mount
 there never was a log header written out. Since the journal was not
 marked idle, gfs2_log_flush() was writing out a header block to make
 sure it was empty during the sync.  Not only did this cause IO to a
 read-only filesystem, but the journalling isn't completely initialized
 on read-only mounts, and so gfs2 was writing out the wrong sequence
 number in the log header.
 
 Now, the journal is marked idle on mount, and gfs2_log_flush() won't
 write out anything until there starts being transactions to flush.
 
 Signed-off-by: Benjamin Marzinski bmarz...@redhat.com
 ---
  fs/gfs2/ops_fstype.c | 1 +
  1 file changed, 1 insertion(+)
 
 diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
 index 8633ad3..fd984f6 100644
 --- a/fs/gfs2/ops_fstype.c
 +++ b/fs/gfs2/ops_fstype.c
 @@ -757,6 +757,7 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
   }
   }
  
 +	sdp->sd_log_idle = 1;
  	set_bit(SDF_JOURNAL_CHECKED, &sdp->sd_flags);
  	gfs2_glock_dq_uninit(&ji_gh);
   jindex = 0;
 --
 1.8.3.1

Hi,

Now applied to the for-next branch of the linux-gfs2 tree:
https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-nextid=086cc672e1cb600b9c17688a4aa44560db858c03

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH 02/12] trivial: GFS2: inode.c: indent with TABs, not spaces

2015-05-01 Thread Bob Peterson
- Original Message -
 Follow the same style used for the other functions in the same file.
 
 Signed-off-by: Antonio Ospite a...@ao2.it
 Cc: Steven Whitehouse swhit...@redhat.com
 Cc: Bob Peterson rpete...@redhat.com
 Cc: cluster-devel@redhat.com
 ---
  fs/gfs2/inode.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
 index 1b3ca7a..efd6bfb 100644
 --- a/fs/gfs2/inode.c
 +++ b/fs/gfs2/inode.c
 @@ -1227,8 +1227,8 @@ static int gfs2_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
   */
  
  static int gfs2_atomic_open(struct inode *dir, struct dentry *dentry,
 -                            struct file *file, unsigned flags,
 -                            umode_t mode, int *opened)
 +			    struct file *file, unsigned flags,
 +			    umode_t mode, int *opened)
  {
  	struct dentry *d;
  	bool excl = !!(flags & O_EXCL);
 --
 2.1.4

Hi,

Thanks. This is now applied to the for-next branch of the linux-gfs2 tree:
https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-nextid=86fbca4923f956dae31247e68dc73ffdfd6e5cb0

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH v2] GFS2: add support for rename2 and RENAME_EXCHANGE

2015-05-05 Thread Bob Peterson
- Original Message -
 gfs2 now uses the rename2 directory iop, and supports the
 RENAME_EXCHANGE flag (as well as RENAME_NOREPLACE, which the vfs
 takes care of).
 
 Signed-off-by: Benjamin Marzinski bmarz...@redhat.com

Hi,

Thanks. V2 patch is now in the upstream tree:
https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-nextid=a63b7bbc2175901d79fa36ba734499655c077f0d

Regards,

Bob Peterson
Red Hat File Systems




Re: [Cluster-devel] [GFS2] gfs2: handle NULL rgd in set_rgrp_preferences

2015-05-05 Thread Bob Peterson
- Original Message -
 The function set_rgrp_preferences() does not handle the (rarely
 returned) NULL value from gfs2_rgrpd_get_next() and this patch
 fixes that.
 
 The fs image in question is only 150MB in size which allows for
 only 1 rgrp to be created. The in-memory rb tree has only 1 node
 and when gfs2_rgrpd_get_next() is called on this sole rgrp, it
 returns NULL. (Default behavior is to wrap around the rb tree and
 return the first node to give the illusion of a circular linked
 list. In the case of only 1 rgrp, we can't have
 gfs2_rgrpd_get_next() return the same rgrp (first, last, next all
 point to the same rgrp)... that would cause unintended consequences
 and infinite loops.)
 
 Resolves: rhbz#1211663
 Signed-off-by: Abhi Das a...@redhat.com
 ---
  fs/gfs2/rgrp.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
 index cb27065..900e515 100644
 --- a/fs/gfs2/rgrp.c
 +++ b/fs/gfs2/rgrp.c
 @@ -978,10 +978,10 @@ static void set_rgrp_preferences(struct gfs2_sbd *sdp)
  		rgd->rd_flags |= GFS2_RDF_PREFERRED;
  		for (i = 0; i < sdp->sd_journals; i++) {
  			rgd = gfs2_rgrpd_get_next(rgd);
 -			if (rgd == first)
 +			if (!rgd || rgd == first)
  				break;
  		}
 -	} while (rgd != first);
 +	} while (rgd && rgd != first);
  }
  
  /**
 --
 1.8.1.4
 
 

Hi,

Thanks. Now in the for-next branch of the linux-gfs2 git tree.
FYI: I removed Resolves: for upstream.

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH v2] GFS2: make sure S_NOSEC flag isn't overwritten

2015-05-05 Thread Bob Peterson
- Original Message -
 At the end of gfs2_set_inode_flags() inode->i_flags is set to flags, so
 we should be modifying flags instead of inode->i_flags so that the
 change isn't overwritten.
 
 Signed-off-by: Benjamin Marzinski bmarz...@redhat.com
 ---
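
The diff body is not quoted above, but the pitfall being fixed is the classic one of
writing through the inode while a local copy is about to overwrite it. A small
standalone demonstration, with invented flag values and a fake inode struct rather
than the kernel's:

#include <stdio.h>

/* Invented flag values and struct, just to demonstrate the pattern. */
#define EX_S_NOSEC     (1u << 0)
#define EX_S_IMMUTABLE (1u << 1)

struct ex_inode {
	unsigned int i_flags;
};

/* Buggy pattern: the write to inode->i_flags in the middle of the
 * function is lost when the local 'flags' copy is assigned back. */
static void ex_set_flags_buggy(struct ex_inode *inode, int nosec)
{
	unsigned int flags = inode->i_flags & ~EX_S_IMMUTABLE;

	if (nosec)
		inode->i_flags |= EX_S_NOSEC;   /* overwritten below */
	inode->i_flags = flags;
}

/* Fixed pattern: modify the local copy instead. */
static void ex_set_flags_fixed(struct ex_inode *inode, int nosec)
{
	unsigned int flags = inode->i_flags & ~EX_S_IMMUTABLE;

	if (nosec)
		flags |= EX_S_NOSEC;
	inode->i_flags = flags;
}

int main(void)
{
	struct ex_inode a = { 0 }, b = { 0 };

	ex_set_flags_buggy(&a, 1);
	ex_set_flags_fixed(&b, 1);
	printf("buggy: %#x, fixed: %#x\n", a.i_flags, b.i_flags);   /* 0 vs 0x1 */
	return 0;
}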

Hi,

The v2 version of the patch is now in the for-next branch of the linux-gfs2 
tree:
https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-nextid=01e64ee40ad741037352d1d6202eaa432f833eb4

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH V3 linux-next] gfs2: convert simple_str to kstr

2015-05-05 Thread Bob Peterson
- Original Message -
 -Remove obsolete simple_str functions.
 -Return error code when kstr failed.
 -This patch also calls functions corresponding to destination type.
 
 Thanks to Alexey Dobriyan for suggesting improvements in
 block_store() and wdack_store()
 
 Signed-off-by: Fabian Frederick f...@skynet.be

Hi,

Thanks. The v3 version of the patch is now in the for-next branch of the 
linux-gfs2 tree:
https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-nextid=e50ead480fac63ede9e0b656cd29c1820f7af9de

Regards,

Bob Peterson
Red Hat File Systems



[Cluster-devel] GFS2: Pull request (merge window)

2015-04-14 Thread Bob Peterson
Hi,

Please consider pulling the following changes,

Bob Peterson


GFS2: merge window

Here is a list of patches we've accumulated for GFS2 for the current upstream
merge window. Most of the patches fix GFS2 quotas, which were not properly
enforced. There's another that adds me as a GFS2 co-maintainer, and a
couple patches that fix a kernel panic doing splice_write on GFS2 as well
as a few correctness patches.



The following changes since commit 28666d6dc3feca2b1983e6011df383299d8b6b64:

  Add myself (Bob Peterson) as a maintainer of GFS2 (2015-02-24 08:55:30 -0600)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git 
tags/gfs2-merge-window

for you to fetch changes up to 30133177957dca9a3e2a37b720f891d3225a92a1:

  gfs2: fix quota refresh race in do_glock() (2015-04-08 09:31:18 -0500)


GFS2: merge window

Here is a list of patches we've accumulated for GFS2 for the current upstream
merge window. Most of the patches fix GFS2 quotas, which were not properly
enforced. There's another that adds me as a GFS2 co-maintainer, and a
couple patches that fix a kernel panic doing splice_write on GFS2 as well
as a few correctness patches.


Abhi Das (4):
  gfs2: perform quota checks against allocation parameters
  gfs2: allow quota_check and inplace_reserve to return available blocks
  gfs2: allow fallocate to max out quotas/fs efficiently
  gfs2: fix quota refresh race in do_glock()

Andreas Gruenbacher (1):
  GFS2: gfs2_set_acl(): Cache no acl as well

Bob Peterson (2):
  GFS2: Allocate reservation during splice_write
  GFS2: Move gfs2_file_splice_write outside of #ifdef

Chengyu Song (1):
  gfs2: incorrect check for debugfs returns

 fs/gfs2/acl.c|   6 +---
 fs/gfs2/aops.c   |   6 ++--
 fs/gfs2/bmap.c   |   2 +-
 fs/gfs2/file.c   | 101 ++-
 fs/gfs2/glock.c  |  47 +++---
 fs/gfs2/incore.h |   4 ++-
 fs/gfs2/inode.c  |  18 +-
 fs/gfs2/quota.c  |  62 --
 fs/gfs2/quota.h  |   8 +++--
 fs/gfs2/rgrp.c   |  20 ---
 fs/gfs2/rgrp.h   |   3 +-
 fs/gfs2/xattr.c  |   2 +-
 12 files changed, 184 insertions(+), 95 deletions(-)



Re: [Cluster-devel] GFS2: Pull request (merge window)

2015-04-15 Thread Bob Peterson
- Original Message -
 On Tue, Apr 14, 2015 at 10:47 AM, Bob Peterson rpete...@redhat.com wrote:
 
   12 files changed, 184 insertions(+), 95 deletions(-)
 
 Oh, and this was incorrect. You had apparently limited the statistics
 to the fs/gfs2 directory, and thus missed the changes to the
 MAINTAINERS file. Don't do that - include all the changes. It's what I
 see and compare against when I actually do the pull anyway.
 
   Linus

Duly noted, and sorry for the confusion.

Bob Peterson



[Cluster-devel] GFS2: Patches pulled, linux-gfs2.git rebased

2015-04-15 Thread Bob Peterson
Hi,

Linus has pulled the latest set of patches from the for-next branch of
the linux-gfs2.git tree, so I rebased the tree based on Linus's master.
So the current tree is back to having no unmerged GFS2 patches.

Regards,

Bob Peterson
Red Hat File Systems



[Cluster-devel] [GFS2 PATCH] GFS2: Use average srttb value in congestion calculations

2015-04-15 Thread Bob Peterson
Hi,

This patch changes function gfs2_rgrp_congested so that it uses an
average srttb (smoothed round trip time for blocking rgrp glocks)
rather than the CPU-specific value. If we use the CPU-specific value
it can incorrectly report no contention when there really is contention
due to the glock processing occurring on a different CPU.

Signed-off-by: Bob Peterson rpete...@redhat.com
---
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 6af2396..f39eedc 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1850,14 +1850,19 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
 	const struct gfs2_sbd *sdp = gl->gl_sbd;
 	struct gfs2_lkstats *st;
 	s64 r_dcount, l_dcount;
-	s64 r_srttb, l_srttb;
+	s64 l_srttb, a_srttb = 0;
 	s64 srttb_diff;
 	s64 sqr_diff;
 	s64 var;
+	int cpu;
 
 	preempt_disable();
+	for_each_present_cpu(cpu) {
+		st = &per_cpu_ptr(sdp->sd_lkstats, cpu)->lkstats[LM_TYPE_RGRP];
+		a_srttb += st->stats[GFS2_LKS_SRTTB];
+	}
 	st = &this_cpu_ptr(sdp->sd_lkstats)->lkstats[LM_TYPE_RGRP];
-	r_srttb = st->stats[GFS2_LKS_SRTTB];
+	a_srttb /= num_present_cpus();
 	r_dcount = st->stats[GFS2_LKS_DCOUNT];
 	var = st->stats[GFS2_LKS_SRTTVARB] +
 	      gl->gl_stats.stats[GFS2_LKS_SRTTVARB];
@@ -1866,10 +1871,10 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
 	l_srttb = gl->gl_stats.stats[GFS2_LKS_SRTTB];
 	l_dcount = gl->gl_stats.stats[GFS2_LKS_DCOUNT];
 
-	if ((l_dcount < 1) || (r_dcount < 1) || (r_srttb == 0))
+	if ((l_dcount < 1) || (r_dcount < 1) || (a_srttb == 0))
 		return false;
 
-	srttb_diff = r_srttb - l_srttb;
+	srttb_diff = a_srttb - l_srttb;
 	sqr_diff = srttb_diff * srttb_diff;
 
 	var *= 2;


