[Cluster-devel] [GFS2 PATCH] GFS2: Re-add a call to log_flush_wait when flushing the journal
Hi,

Upstream commit 34cc178 changed a line of code from calling function
log_flush_commit to calling log_write_header. This had the effect of
eliminating a call to function log_flush_wait. That causes the journal to
skip over log headers, which results in multiple wrap points, which itself
leads to infinite loops in journal replay, both in the kernel code and the
fsck.gfs2 code. This patch re-adds that call.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index edbd461..4a14d50 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -702,6 +702,7 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl)
 	gfs2_log_flush_bio(sdp, WRITE);
 
 	if (sdp->sd_log_head != sdp->sd_log_flush_head) {
+		log_flush_wait(sdp);
 		log_write_header(sdp, 0);
 	} else if (sdp->sd_log_tail != current_tail(sdp) && !sdp->sd_log_idle) {
 		atomic_dec(&sdp->sd_log_blks_free); /* Adjust for unreserved buffer */
[Cluster-devel] [GFS2 PATCH] [TRY #2] GFS2: inline function gfs2_set_mode
Hi,

Here is a revised patch based on Steve's feedback:

This patch eliminates function gfs2_set_mode, which was only called in one
place and always returned 0.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
diff --git a/fs/gfs2/acl.c b/fs/gfs2/acl.c
index 394dc55..3088e2a 100644
--- a/fs/gfs2/acl.c
+++ b/fs/gfs2/acl.c
@@ -64,18 +64,6 @@ struct posix_acl *gfs2_get_acl(struct inode *inode, int type)
 	return acl;
 }
 
-static int gfs2_set_mode(struct inode *inode, umode_t mode)
-{
-	int error = 0;
-
-	if (mode != inode->i_mode) {
-		inode->i_mode = mode;
-		mark_inode_dirty(inode);
-	}
-
-	return error;
-}
-
 int gfs2_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 {
 	int error;
@@ -98,9 +86,10 @@ int gfs2_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 		if (error == 0)
 			acl = NULL;
 
-		error = gfs2_set_mode(inode, mode);
-		if (error)
-			return error;
+		if (mode != inode->i_mode) {
+			inode->i_mode = mode;
+			mark_inode_dirty(inode);
+		}
 	}
 
 	if (acl) {
[Cluster-devel] [GFS2 PATCH] [TRY #2] GFS2: Prevent recovery before the local journal is set
Hi,

This patch uses a completion to prevent dlm's recovery process from
referencing, and trying to recover, a journal before the journal has been
opened.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index bdf70c1..04c062c 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -730,6 +730,8 @@ struct gfs2_sbd {
 	struct gfs2_holder sd_sc_gh;
 	struct gfs2_holder sd_qc_gh;
 
+	struct completion sd_journal_ready;
+
 	/* Daemon stuff */
 
 	struct task_struct *sd_logd_process;
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 22f9540..4535156 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -94,6 +94,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
 	INIT_LIST_HEAD(&sdp->sd_jindex_list);
 	spin_lock_init(&sdp->sd_jindex_spin);
 	mutex_init(&sdp->sd_jindex_mutex);
+	init_completion(&sdp->sd_journal_ready);
 
 	INIT_LIST_HEAD(&sdp->sd_quota_list);
 	mutex_init(&sdp->sd_quota_mutex);
@@ -676,6 +677,7 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
 
 	if (sdp->sd_args.ar_spectator) {
 		sdp->sd_jdesc = gfs2_jdesc_find(sdp, 0);
+		complete_all(&sdp->sd_journal_ready);
 		atomic_set(&sdp->sd_log_blks_free, sdp->sd_jdesc->jd_blocks);
 		atomic_set(&sdp->sd_log_thresh1, 2*sdp->sd_jdesc->jd_blocks/5);
 		atomic_set(&sdp->sd_log_thresh2, 4*sdp->sd_jdesc->jd_blocks/5);
@@ -686,10 +688,13 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
 			fs_err(sdp, "there are only %u journals (0 - %u)\n",
 			       gfs2_jindex_size(sdp), gfs2_jindex_size(sdp) - 1);
+			complete_all(&sdp->sd_journal_ready);
 			goto fail_jindex;
 		}
 		sdp->sd_jdesc = gfs2_jdesc_find(sdp, sdp->sd_lockstruct.ls_jid);
+		complete_all(&sdp->sd_journal_ready);
 
 		error = gfs2_glock_nq_num(sdp, sdp->sd_lockstruct.ls_jid,
 					  gfs2_journal_glops, LM_ST_EXCLUSIVE,
 					  LM_FLAG_NOEXP,
diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
index de25d55..02afa82 100644
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -407,6 +407,9 @@ int gfs2_recover_set(struct gfs2_sbd *sdp, unsigned jid)
 	struct gfs2_jdesc *jd;
 	int rv;
 
+	/* Wait for our primary journal to be initialized */
+	wait_for_completion(&sdp->sd_journal_ready);
+
 	spin_lock(&sdp->sd_jindex_spin);
 	rv = -EBUSY;
 	if (sdp->sd_jdesc->jd_jid == jid)
Re: [Cluster-devel] [PATCH 2/5] gfs2-utils: Expressly expunge 'expert mode'
----- Original Message -----
> mkfs.gfs2 and gfs2_jadd used struct gfs2_sbd.expert to put them into an
> undocumented 'expert mode' but didn't really do anything useful in that
> mode. Remove that field and other 'expert' references.

In the past, we've used mkfs.gfs2 expert mode to create certain test
conditions, like very small resource groups. I don't think Nate uses this
in any of his tests, but it might be worth asking. I've used it to try to
recreate certain failing scenarios, but I haven't needed it in years. Are
we sure we want to get rid of it?

Regards,

Bob Peterson
Red Hat File Systems
[Cluster-devel] [gfs2-utils patch] gfs2_edit: Print block types with log descriptors
Hi,

This patch makes gfs2_edit's journal print function print the metadata
block type alongside the block number in its log descriptor output.

For example, in the past, when a log descriptor was printed, it looked
like this:

0x8200 (j+ 163): Log descriptor, type 300 (Metadata) len:504, data1: 503 (503)
0x70eb772          0x70df987          0x70dc7a1          0x70eb574
0x70dc5a3          0x70eb376          0x70dc3a5          0x70eb178
0x70dc1a7          0x70eaf7a          0x70dbfa9          0x70ead7c

With this patch, the output is more useful:

0x8200 (j+ 163): Log descriptor, type 300 (Metadata) len:504, data1: 503 (503)
0x70eb772 in       0x70df987 rb       0x70dc7a1 in       0x70eb574 in
0x70dc5a3 in       0x70eb376 in       0x70dc3a5 in       0x70eb178 in
0x70dc1a7 in       0x70eaf7a in       0x70dbfa9 in       0x70ead7c in

In this case, "in" indicates indirect blocks, and "rb" indicates an rgrp
bitmap. This is useful for getting a bigger picture of what that journal
descriptor contains and represents.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
diff --git a/gfs2/edit/hexedit.h b/gfs2/edit/hexedit.h
index 541684a..f23c581 100644
--- a/gfs2/edit/hexedit.h
+++ b/gfs2/edit/hexedit.h
@@ -27,6 +27,7 @@ enum dsp_mode { HEX_MODE = 0, GFS2_MODE = 1, EXTENDED_MODE = 2, INIT_MODE = 3 };
 #define RGLIST_DUMMY_BLOCK -2
 #define JOURNALS_DUMMY_BLOCK -3
 
+extern const char *mtypes[];
 extern struct gfs2_sb sb;
 extern uint64_t block;
 extern int blockhist;
diff --git a/gfs2/edit/journal.c b/gfs2/edit/journal.c
index bb56649..a72a044 100644
--- a/gfs2/edit/journal.c
+++ b/gfs2/edit/journal.c
@@ -202,6 +202,7 @@ static int print_ld_blks(const uint64_t *b, const char *end, int start_line,
 {
 	int bcount = 0, found_tblk = 0, found_bblk = 0;
 	static char str[256];
+	struct gfs2_buffer_head *j_bmap_bh;
 
 	if (tblk_off)
 		*tblk_off = 0;
@@ -218,8 +219,17 @@ static int print_ld_blks(const uint64_t *b, const char *end, int start_line,
 		}
 		bcount++;
 		if (prnt) {
-			sprintf(str, "0x%llx",
-				(unsigned long long)be64_to_cpu(*b));
+			if (is_meta_ld) {
+				j_bmap_bh = bread(&sbd, abs_block +
+						  bcount);
+				sprintf(str, "0x%llx %2s",
+					(unsigned long long)be64_to_cpu(*b),
+					mtypes[lgfs2_get_block_type(j_bmap_bh)]);
+				brelse(j_bmap_bh);
+			} else {
+				sprintf(str, "0x%llx",
+					(unsigned long long)be64_to_cpu(*b));
+			}
 			print_gfs2("%-18.18s ", str);
 		}
 		if (!found_tblk && tblk_off)
@@ -237,7 +247,6 @@ static int print_ld_blks(const uint64_t *b, const char *end, int start_line,
 			int type, bmap = 0;
 			uint64_t o;
 			struct gfs2_buffer_head *save_bh;
-			struct gfs2_buffer_head *j_bmap_bh;
 
 			found_bblk = 1;
 			print_gfs2("-");
[Cluster-devel] [gfs2-utils patch] gfs2_edit: print LB (log descriptor continuation blocks) for GFS2
Hi,

This patch allows gfs2_edit to print metadata blocks of type LB (log
descriptor continuation blocks) for GFS2. Prior to this, only GFS1's
continuation blocks would print.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
diff --git a/gfs2/edit/journal.c b/gfs2/edit/journal.c
index e012bc3..bb56649 100644
--- a/gfs2/edit/journal.c
+++ b/gfs2/edit/journal.c
@@ -582,12 +582,15 @@ void dump_journal(const char *journal, int tblk)
 				   "[UNMOUNTED]" : "");
 		}
 		eol(0);
-	} else if (sbd.gfs1 && ld_blocks > 0) {
-		print_gfs2("0x%llx (j+%4llx): GFS log descriptor"
+	} else if ((ld_blocks > 0) &&
+		   (sbd.gfs1 || block_type == GFS2_METATYPE_LB)) {
+		print_gfs2("0x%llx (j+%4llx): Log descriptor"
 			   " continuation block", abs_block, jb);
 		eol(0);
 		print_gfs2("");
-		ld_blocks -= print_ld_blks((uint64_t *)dummy_bh.b_data,
+		ld_blocks -= print_ld_blks((uint64_t *)(dummy_bh.b_data +
+					   (sbd.gfs1 ? 0 :
+					    sizeof(struct gfs2_meta_header))),
 					   (dummy_bh.b_data + sbd.bsize),
 					   start_line, tblk, tblk_off, 0, rgd,
[Cluster-devel] [GFS2 PATCH] GFS2: Only wait for demote when last holder is dequeued
Hi,

Function gfs2_glock_dq_wait is supposed to dequeue a glock and then wait
for the lock to be demoted. The problem is, if this is a shared lock, its
demote will depend on the other holders, which means you might end up
waiting forever because the other process is blocked. This problem is
especially apparent when dealing with nested flocks.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index aec7f73..22ce2f0 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1128,7 +1128,9 @@ void gfs2_glock_dq_wait(struct gfs2_holder *gh)
 	struct gfs2_glock *gl = gh->gh_gl;
 	gfs2_glock_dq(gh);
 	might_sleep();
-	wait_on_bit(&gl->gl_flags, GLF_DEMOTE, gfs2_glock_demote_wait, TASK_UNINTERRUPTIBLE);
+	if (!find_first_holder(gl))
+		wait_on_bit(&gl->gl_flags, GLF_DEMOTE, gfs2_glock_demote_wait,
+			    TASK_UNINTERRUPTIBLE);
 }
 
 /**
[Cluster-devel] [GFS2 PATCH] GFS2: Allow caching of glocks for flock
Hi,

This patch removes the GL_NOCACHE flag from the glocks associated with
flocks. There should be no good reason not to cache glocks for flocks:
they only force the glock to be demoted before they can be reacquired,
which can slow down performance and even cause glock hangs, especially in
cases where the flocks are held in Shared (SH) mode.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 4fc3a30..0740a57 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -981,7 +981,7 @@ static int do_flock(struct file *file, int cmd, struct file_lock *fl)
 	int error = 0;
 
 	state = (fl->fl_type == F_WRLCK) ? LM_ST_EXCLUSIVE : LM_ST_SHARED;
-	flags = (IS_SETLKW(cmd) ? 0 : LM_FLAG_TRY) | GL_EXACT | GL_NOCACHE;
+	flags = (IS_SETLKW(cmd) ? 0 : LM_FLAG_TRY) | GL_EXACT;
 
 	mutex_lock(&fp->f_fl_mutex);
Re: [Cluster-devel] [RFC PATCH] dlm: Remove unused conf from lm_grant
----- Original Message -----
> On Tue, Jul 01, 2014 at 10:43:13AM -0400, Jeff Layton wrote:
> > On Tue, 01 Jul 2014 06:20:10 -0700
> > Joe Perches <j...@perches.com> wrote:
> > > While doing a bit of adding argument names to fs.h, I looked at
> > > lm_grant and it seems the 2nd argument is always NULL.
> > > How about removing it?
> > > This doesn't apply as it depends on some other patches but
> > > it should be clear enough...
> >
> > ACK on the general idea from my standpoint. Anything that simplifies
> > the file locking interfaces is a good thing, particularly the deferred
> > locking code.
>
> Fine with me. I'd be happy to remove all the deferred locking code from
> dlm; it never really worked.
>
> Dave

Hi,

GFS2 uses deferred locks, at the very least in its direct_io path
(gfs2_direct_IO in aops.c). So AFAIK we can't remove THAT without a
certain amount of pain. Steve is on vacation/holiday this week, but he
will be back on Thursday and Friday (which is a holiday). I'm all for
getting rid of useless parameters, and I've done so on several occasions
in GFS2.

Regards,

Bob Peterson
Red Hat File Systems
[Cluster-devel] [gfs2-utils PATCH] fsck.gfs2: File read-ahead
Hi,

This patch introduces file read-ahead to pass1.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
 gfs2/fsck/metawalk.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 gfs2/fsck/metawalk.h |  1 +
 gfs2/fsck/pass1.c    |  1 +
 3 files changed, 65 insertions(+), 2 deletions(-)

diff --git a/gfs2/fsck/metawalk.c b/gfs2/fsck/metawalk.c
index 659af4e..8da17c6 100644
--- a/gfs2/fsck/metawalk.c
+++ b/gfs2/fsck/metawalk.c
@@ -1184,6 +1184,59 @@ static void free_metalist(struct gfs2_inode *ip, osi_list_t *mlp)
 	}
 }
 
+static void file_ra(struct gfs2_inode *ip, struct gfs2_buffer_head *bh,
+		    int head_size, int maxptrs, int h)
+{
+	struct gfs2_sbd *sdp = ip->i_sbd;
+	uint64_t *p, sblock = 0, block;
+	int extlen = 0;
+
+	if (h + 2 == ip->i_di.di_height) {
+		p = (uint64_t *)(bh->b_data + head_size);
+		if (*p && *(p + 1)) {
+			sblock = be64_to_cpu(*p);
+			p++;
+			block = be64_to_cpu(*p);
+			extlen = block - sblock;
+			if (extlen > 1 && extlen <= maxptrs) {
+				posix_fadvise(sdp->device_fd,
+					      sblock * sdp->bsize,
+					      (extlen + 1) * sdp->bsize,
+					      POSIX_FADV_WILLNEED);
+				return;
+			}
+		}
+		extlen = 0;
+	}
+	for (p = (uint64_t *)(bh->b_data + head_size);
+	     p < (uint64_t *)(bh->b_data + sdp->bsize); p++) {
+		if (*p) {
+			if (!sblock) {
+				sblock = be64_to_cpu(*p);
+				extlen = 1;
+				continue;
+			}
+			block = be64_to_cpu(*p);
+			if (block == sblock + extlen) {
+				extlen++;
+				continue;
+			}
+		}
+		if (extlen && sblock) {
+			if (extlen > 1)
+				extlen--;
+			posix_fadvise(sdp->device_fd, sblock * sdp->bsize,
+				      extlen * sdp->bsize,
+				      POSIX_FADV_WILLNEED);
+			extlen = 0;
+			p--;
+		}
+	}
+	if (extlen)
+		posix_fadvise(sdp->device_fd, sblock * sdp->bsize,
+			      extlen * sdp->bsize, POSIX_FADV_WILLNEED);
+}
+
 /**
  * build_and_check_metalist - check a bunch of indirect blocks
  *                            This includes hash table blocks for directories
@@ -1204,6 +1257,7 @@ static int build_and_check_metalist(struct gfs2_inode *ip, osi_list_t *mlp,
 	int h, head_size, iblk_type;
 	uint64_t *ptr, block;
 	int error, was_duplicate, is_valid;
+	int maxptrs;
 
 	osi_list_add(&metabh->b_altlist, &mlp[0]);
@@ -1225,13 +1279,18 @@ static int build_and_check_metalist(struct gfs2_inode *ip, osi_list_t *mlp,
 				iblk_type = GFS2_METATYPE_JD;
 			else
 				iblk_type = GFS2_METATYPE_IN;
-			if (ip->i_sbd->gfs1)
+			if (ip->i_sbd->gfs1) {
 				head_size = sizeof(struct gfs_indirect);
-			else
+				maxptrs = (ip->i_sbd->bsize - head_size) /
+					sizeof(uint64_t);
+			} else {
 				head_size = sizeof(struct gfs2_meta_header);
+				maxptrs = ip->i_sbd->sd_inptrs;
+			}
 		} else {
 			iblk_type = GFS2_METATYPE_DI;
 			head_size = sizeof(struct gfs2_dinode);
+			maxptrs = ip->i_sbd->sd_diptrs;
 		}
 		prev_list = &mlp[h - 1];
 		cur_list = &mlp[h];
@@ -1246,6 +1305,8 @@ static int build_and_check_metalist(struct gfs2_inode *ip, osi_list_t *mlp,
 				continue;
 			}
 
+			if (pass->readahead)
+				file_ra(ip, bh, head_size, maxptrs, h);
 			/* Now check the metadata itself */
 			for (ptr = (uint64_t *)(bh->b_data + head_size);
 			     (char *)ptr < (bh->b_data + ip->i_sbd->bsize);
diff --git a/gfs2/fsck/metawalk.h b/gfs2/fsck/metawalk.h
index 5e30bfe..a4e0676 100644
--- a/gfs2/fsck/metawalk.h
+++ b/gfs2/fsck/metawalk.h
@@ -94,6 +94,7 @@ enum meta_check_rc {
 struct metawalk_fxns {
 	void *private;
 	int invalid_meta_is_fatal;
+	int readahead;
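The heart of file_ra() above is coalescing runs of contiguous block pointers so that each run can be covered by a single POSIX_FADV_WILLNEED call. The grouping step can be sketched outside C; the Python below mirrors that extent-merging logic (function name and structure are illustrative only, not part of the patch):

```python
def coalesce_extents(pointers):
    """Group nonzero block pointers into (start, length) extents.

    Contiguous runs are merged so each run would need only one
    readahead (fadvise) call, mirroring file_ra()'s merging loop.
    A zero pointer means "no block here" and breaks the current run.
    """
    extents = []
    start = None
    length = 0
    for blk in pointers:
        if blk:
            if start is None:
                start, length = blk, 1      # begin a new run
                continue
            if blk == start + length:       # extends the current run
                length += 1
                continue
        if start is not None:               # run broken: emit what we have
            extents.append((start, length))
            start, length = (blk, 1) if blk else (None, 0)
    if start is not None:
        extents.append((start, length))
    return extents

print(coalesce_extents([10, 11, 12, 0, 20, 21, 0, 0, 30]))
# [(10, 3), (20, 2), (30, 1)]
```

With the sample pointer list, three fadvise calls would replace nine single-block reads, which is where pass1's speedup comes from.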
Re: [Cluster-devel] [gfs2-utils PATCH] fsck.gfs2: time each of the passes
----- Original Message -----
> Hi,
>
> On 11/07/14 18:41, Bob Peterson wrote:
> > ----- Original Message -----
> (snip)
> > > Use timersub() here perhaps? Otherwise looks good,
> > >
> > > Steve.
> >
> > Hi Steve,
> >
> > Thanks for the suggestion. How about this version?
>
> Yes, that looks better. There is probably a nicer way to do the
> conversion to string too... a quick google points at using a time_t to
> contain tv_secs, converting to tm and then appending the tv_usecs after.
> Should be a bit cleaner than doing it manually,
>
> Steve.

Hi,

I could implement your suggestion like this. I could also use strftime,
but it's ugly as sin, so I'm reluctant to do so. What do you think?

Regards,

Bob Peterson
---
diff --git a/gfs2/fsck/main.c b/gfs2/fsck/main.c
index b4b1a03..ad42b0d 100644
--- a/gfs2/fsck/main.c
+++ b/gfs2/fsck/main.c
@@ -11,6 +11,8 @@
 #include <signal.h>
 #include <libintl.h>
 #include <locale.h>
+#include <sys/time.h>
+#include <time.h>
 #define _(String) gettext(String)
 
 #include "copyright.cf"
@@ -244,11 +246,15 @@ static const struct fsck_pass passes[] = {
 static int fsck_pass(const struct fsck_pass *p, struct gfs2_sbd *sdp)
 {
 	int ret;
+	struct timeval before, after, diff;
+	time_t runtime;
+	struct tm *run_tm;
 
 	if (fsck_abort)
 		return FSCK_CANCELED;
 	pass = p->name;
 	log_notice( _("Starting %s\n"), p->name);
+	gettimeofday(&before, 0);
 	ret = p->f(sdp);
 	if (ret)
 		exit(ret);
@@ -257,7 +263,16 @@ static int fsck_pass(const struct fsck_pass *p, struct gfs2_sbd *sdp)
 		log_notice( _("%s interrupted\n"), p->name);
 		return FSCK_CANCELED;
 	}
-	log_notice( _("%s complete\n"), p->name);
+	gettimeofday(&after, 0);
+	timersub(&after, &before, &diff);
+	runtime = (time_t)diff.tv_sec;
+	run_tm = gmtime(&runtime);
+	log_notice( _("%s completed in "), p->name);
+	if (run_tm->tm_hour)
+		log_notice("%dh", run_tm->tm_hour);
+	if (run_tm->tm_min)
+		log_notice("%dm", run_tm->tm_min);
+	log_notice("%d.%03lds\n", run_tm->tm_sec, diff.tv_usec / 1000);
 
 	return 0;
 }
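The formatting convention in the patch (print hours and minutes only when nonzero, then seconds with a millisecond fraction) is easy to check in isolation. Here is a small Python equivalent of that logic, assuming runtimes under a day (which the patch's use of gmtime() also assumes); the helper name is mine, not from the patch:

```python
def format_runtime(seconds, microseconds):
    """Render an elapsed time the way the fsck pass log message does:
    "1h2m3.045s", omitting the hour and minute fields when they are zero."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    out = ""
    if hours:
        out += "%dh" % hours
    if minutes:
        out += "%dm" % minutes
    # milliseconds come from the timersub() remainder, zero-padded to 3 digits
    out += "%d.%03ds" % (secs, microseconds // 1000)
    return out

print(format_runtime(3723, 45000))  # 1h2m3.045s
print(format_runtime(59, 7000))     # 59.007s
```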
[Cluster-devel] [GFS2 PATCH] GFS2: Change maxlen variables to size_t
Hi,

This patch changes some variables (especially maxlen in function
gfs2_block_map) from unsigned int to size_t. We need 64-bit arithmetic for
very large files (e.g. 1PB), where the variables otherwise get shifted to
all 0's.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index e6ee5b6..f0b945a 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -359,7 +359,7 @@ static inline void release_metapath(struct metapath *mp)
  * Returns: The length of the extent (minimum of one block)
  */
 
-static inline unsigned int gfs2_extent_length(void *start, unsigned int len, __be64 *ptr, unsigned limit, int *eob)
+static inline unsigned int gfs2_extent_length(void *start, unsigned int len, __be64 *ptr, size_t limit, int *eob)
 {
 	const __be64 *end = (start + len);
 	const __be64 *first = ptr;
@@ -449,7 +449,7 @@ static int gfs2_bmap_alloc(struct inode *inode, const sector_t lblock,
 			   struct buffer_head *bh_map, struct metapath *mp,
 			   const unsigned int sheight,
 			   const unsigned int height,
-			   const unsigned int maxlen)
+			   const size_t maxlen)
 {
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
@@ -483,7 +483,8 @@ static int gfs2_bmap_alloc(struct inode *inode, const sector_t lblock,
 	} else {
 		/* Need to allocate indirect blocks */
 		ptrs_per_blk = height > 1 ? sdp->sd_inptrs : sdp->sd_diptrs;
-		dblks = min(maxlen, ptrs_per_blk - mp->mp_list[end_of_metadata]);
+		dblks = min(maxlen, (size_t)(ptrs_per_blk -
+					     mp->mp_list[end_of_metadata]));
 		if (height == ip->i_height) {
 			/* Writing into existing tree, extend tree down */
 			iblks = height - sheight;
@@ -605,7 +606,7 @@ int gfs2_block_map(struct inode *inode, sector_t lblock,
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	unsigned int bsize = sdp->sd_sb.sb_bsize;
-	const unsigned int maxlen = bh_map->b_size >> inode->i_blkbits;
+	const size_t maxlen = bh_map->b_size >> inode->i_blkbits;
 	const u64 *arr = sdp->sd_heightsize;
 	__be64 *ptr;
 	u64 size;
[Cluster-devel] [GFS2 PATCH] GFS2: Request demote when a try flock fails
Hi,

This patch changes the flock code so that it uses the TRY_1CB flag instead
of the TRY flag on the first attempt. That forces any holding nodes to
issue a dlm callback, which requests a demote of the glock. Then, if the
try failed, it sleeps a small amount of time for the demote to occur, and
tries again, sleeping for increasing amounts of time between attempts.
Subsequent attempts to gain the try lock don't use _1CB, so that only one
callback is issued.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 26b3f95..7f4ed3d 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -26,6 +26,7 @@
 #include <linux/dlm.h>
 #include <linux/dlm_plock.h>
 #include <linux/aio.h>
+#include <linux/delay.h>
 
 #include "gfs2.h"
 #include "incore.h"
@@ -979,9 +980,10 @@ static int do_flock(struct file *file, int cmd, struct file_lock *fl)
 	unsigned int state;
 	int flags;
 	int error = 0;
+	int sleeptime;
 
 	state = (fl->fl_type == F_WRLCK) ? LM_ST_EXCLUSIVE : LM_ST_SHARED;
-	flags = (IS_SETLKW(cmd) ? 0 : LM_FLAG_TRY) | GL_EXACT;
+	flags = (IS_SETLKW(cmd) ? 0 : LM_FLAG_TRY_1CB) | GL_EXACT;
 
 	mutex_lock(&fp->f_fl_mutex);
 
@@ -1001,7 +1003,14 @@ static int do_flock(struct file *file, int cmd, struct file_lock *fl)
 		gfs2_holder_init(gl, state, flags, fl_gh);
 		gfs2_glock_put(gl);
 	}
-	error = gfs2_glock_nq(fl_gh);
+	for (sleeptime = 1; sleeptime <= 4; sleeptime <<= 1) {
+		error = gfs2_glock_nq(fl_gh);
+		if (error != GLR_TRYFAILED)
+			break;
+		fl_gh->gh_flags = LM_FLAG_TRY | GL_EXACT;
+		fl_gh->gh_error = 0;
+		msleep(sleeptime);
+	}
 	if (error) {
 		gfs2_holder_uninit(fl_gh);
 		if (error == GLR_TRYFAILED)
@@ -1024,7 +1033,7 @@ static void do_unflock(struct file *file, struct file_lock *fl)
 	mutex_lock(&fp->f_fl_mutex);
 	flock_lock_file_wait(file, fl);
 	if (fl_gh->gh_gl) {
-		gfs2_glock_dq_wait(fl_gh);
+		gfs2_glock_dq(fl_gh);
 		gfs2_holder_uninit(fl_gh);
 	}
 	mutex_unlock(&fp->f_fl_mutex);
Re: [Cluster-devel] [PATCH 00/19] gfs2-utils: Introduce extent allocation and speed up journal creation
----- Original Message -----
> One thing to note is that, with these patches, the root and master inodes
> are no longer the first objects in the first resource group. The master
> inode is written in the first free block after the journals and then the
> other metafs structures are placed. The root directory inode is then
> finally created. This is not a format change but it may cause some
> confusion after years of expecting the root and master inodes to be at
> certain addresses, so I thought it worth mentioning.

Hi,

I know that fsck.gfs2, in initialize.c, plays some games trying to find
and repair damaged system dinodes. For example, it looks for a missing
master directory by looking for no_formal_ino==2. So I'd be very cautious
and check to make sure these repairs still work properly. In the past,
I've done a for loop, wiping out the first X blocks of the file system,
running fsck.gfs2, and seeing if it can properly repair it.

Another concern is gfs2_convert. I don't know if it makes any assumptions
about the master directory, but that's much less likely; I think it just
assumes the file system is healthy. But fsck.gfs2 is a concern.

Regards,

Bob Peterson
Red Hat File Systems
[Cluster-devel] [GFS2 PATCH] GFS2: call d_add for a NULL entry when an inode lookup fails
Hi,

This patch was originally written by Benjamin Coddington (credit where
credit is due). GFS2's inode lookup wasn't hashing the dentry when an
inode wasn't found. That caused the dentry/inode to get an early cleanup
attempt through dentry_kill when a create op cleaned up the filehandle,
and the entry then hung around until unmount time. That means if the file
was created then deleted, its space and inode were not reclaimed until
then. This problem was introduced by GFS2's new atomic open code. The
patch simply adds the d_add (as other file systems do) when inode lookups
fail.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index e62e594..9317ddc 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -840,8 +840,10 @@ static struct dentry *__gfs2_lookup(struct inode *dir, struct dentry *dentry,
 	int error;
 
 	inode = gfs2_lookupi(dir, &dentry->d_name, 0);
-	if (!inode)
+	if (inode == NULL) {
+		d_add(dentry, NULL);
 		return NULL;
+	}
 
 	if (IS_ERR(inode))
 		return ERR_CAST(inode);
[Cluster-devel] [GFS2 PATCH] GFS2: Make rename not save dirent location
Hi,

This patch fixes a regression in the patch "GFS2: Remember directory
insert point", commit 2b47dad866d04f14c328f888ba5406057b8c7d33. The
problem had to do with the rename function: The function found space for
the new dirent and remembered that location. But then the old dirent was
removed, which often moved the eligible location for the renamed dirent.
Putting the new dirent at the saved location caused file system
corruption.

This patch adds a new save_loc variable to struct gfs2_diradd. If 1, the
dirent location is saved. If 0, the dirent location is not saved and the
buffer_head is released, as per previous behavior.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 1a349f9..5d4261f 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -2100,8 +2100,13 @@ int gfs2_diradd_alloc_required(struct inode *inode, const struct qstr *name,
 	}
 	if (IS_ERR(dent))
 		return PTR_ERR(dent);
-	da->bh = bh;
-	da->dent = dent;
+
+	if (da->save_loc) {
+		da->bh = bh;
+		da->dent = dent;
+	} else {
+		brelse(bh);
+	}
 	return 0;
 }
 
diff --git a/fs/gfs2/dir.h b/fs/gfs2/dir.h
index 126c65d..e1b309c 100644
--- a/fs/gfs2/dir.h
+++ b/fs/gfs2/dir.h
@@ -23,6 +23,7 @@ struct gfs2_diradd {
 	unsigned nr_blocks;
 	struct gfs2_dirent *dent;
 	struct buffer_head *bh;
+	int save_loc;
 };
 
 extern struct inode *gfs2_dir_search(struct inode *dir,
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 9516f5c..fcf42ea 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -600,7 +600,7 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
 	int error, free_vfs_inode = 0;
 	u32 aflags = 0;
 	unsigned blocks = 1;
-	struct gfs2_diradd da = { .bh = NULL, };
+	struct gfs2_diradd da = { .bh = NULL, .save_loc = 1, };
 
 	if (!name->len || name->len > GFS2_FNAMESIZE)
 		return -ENAMETOOLONG;
@@ -900,7 +900,7 @@ static int gfs2_link(struct dentry *old_dentry, struct inode *dir,
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_holder ghs[2];
 	struct buffer_head *dibh;
-	struct gfs2_diradd da = { .bh = NULL, };
+	struct gfs2_diradd da = { .bh = NULL, .save_loc = 1, };
 	int error;
 
 	if (S_ISDIR(inode->i_mode))
@@ -1338,7 +1338,7 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
 	struct gfs2_rgrpd *nrgd;
 	unsigned int num_gh;
 	int dir_rename = 0;
-	struct gfs2_diradd da = { .nr_blocks = 0, };
+	struct gfs2_diradd da = { .nr_blocks = 0, .save_loc = 0, };
 	unsigned int x;
 	int error;
[Cluster-devel] [DLM PATCH] DLM: Don't wait for resource library lookups if NOLOOKUP is specified
Hi,

This patch adds a new lock flag, DLM_LKF_NOLOOKUP, which instructs DLM to
refrain from sending lookup requests in cases where the lock directory
node is not the current node. This is similar to the DLM_LKF_NOQUEUE flag,
except it fails locks that would require a lookup with -EAGAIN.

This is not just about saving a network operation. It allows callers like
GFS2 to master locks for which they are the directory node. Each node can
then prefer local locks, especially in the case of GFS2 selecting resource
groups for block allocations (implemented with a separate patch). This
mastering of local locks distributes the locks between the nodes (at least
until nodes enter or leave the cluster), which tends to make each node
keep to itself when doing allocations. Thus, dlm communications are kept
to a minimum, which results in significantly faster block allocations.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
 fs/dlm/lock.c                     | 16 ++++++++++++++--
 include/uapi/linux/dlmconstants.h |  7 +++++++
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index 83f3d55..f1e5b04 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -222,6 +222,11 @@ static inline int can_be_queued(struct dlm_lkb *lkb)
 	return !(lkb->lkb_exflags & DLM_LKF_NOQUEUE);
 }
 
+static inline int can_be_looked_up(struct dlm_lkb *lkb)
+{
+	return !(lkb->lkb_exflags & DLM_LKF_NOLOOKUP);
+}
+
 static inline int force_blocking_asts(struct dlm_lkb *lkb)
 {
 	return (lkb->lkb_exflags & DLM_LKF_NOQUEUEBAST);
@@ -2745,6 +2750,11 @@ static int set_master(struct dlm_rsb *r, struct dlm_lkb *lkb)
 		return 0;
 	}
 
+	if (!can_be_looked_up(lkb)) {
+		queue_cast(r, lkb, -EAGAIN);
+		return -EAGAIN;
+	}
+
 	wait_pending_remove(r);
 
 	r->res_first_lkid = lkb->lkb_id;
@@ -2828,7 +2838,8 @@ static int set_lock_args(int mode, struct dlm_lksb *lksb, uint32_t flags,
 	if (flags & DLM_LKF_CONVDEADLK && !(flags & DLM_LKF_CONVERT))
 		goto out;
 
-	if (flags & DLM_LKF_CONVDEADLK && flags & DLM_LKF_NOQUEUE)
+	if (flags & DLM_LKF_CONVDEADLK && (flags & (DLM_LKF_NOQUEUE |
+						    DLM_LKF_NOLOOKUP)))
 		goto out;
 
 	if (flags & DLM_LKF_EXPEDITE && flags & DLM_LKF_CONVERT)
@@ -2837,7 +2848,8 @@ static int set_lock_args(int mode, struct dlm_lksb *lksb, uint32_t flags,
 	if (flags & DLM_LKF_EXPEDITE && flags & DLM_LKF_QUECVT)
 		goto out;
 
-	if (flags & DLM_LKF_EXPEDITE && flags & DLM_LKF_NOQUEUE)
+	if (flags & DLM_LKF_EXPEDITE && (flags & (DLM_LKF_NOQUEUE |
+						  DLM_LKF_NOLOOKUP)))
 		goto out;
 
 	if (flags & DLM_LKF_EXPEDITE && mode != DLM_LOCK_NL)
diff --git a/include/uapi/linux/dlmconstants.h b/include/uapi/linux/dlmconstants.h
index 47bf08d..4b9ba15 100644
--- a/include/uapi/linux/dlmconstants.h
+++ b/include/uapi/linux/dlmconstants.h
@@ -131,6 +131,12 @@
  * Unlock the lock even if it is converting or waiting or has sublocks.
  * Only really for use by the userland device.c code.
  *
+ * DLM_LKF_NOLOOKUP
+ *
+ * Don't take any network time/bandwidth to do directory owner lookups.
+ * This is a lock for which we only care whether it's completely under
+ * local jurisdiction.
+ *
  */
 
 #define DLM_LKF_NOQUEUE		0x00000001
@@ -152,6 +158,7 @@
 #define DLM_LKF_ALTCW		0x00004000
 #define DLM_LKF_FORCEUNLOCK	0x00020000
 #define DLM_LKF_TIMEOUT		0x00040000
+#define DLM_LKF_NOLOOKUP	0x00080000
 
 /*
  * Some return codes that are not in errno.h
Re: [Cluster-devel] [DLM PATCH] DLM: Don't wait for resource library lookups if NOLOOKUP is specified
----- Original Message -----
> Hi,
>
> I think we need to do some more investigation here... how long do the
> lookups take? If the issue is just to create a list of preferred rgrps
> for each node, then there are various ways in which we might do that.
> That is not to say that this isn't a good way to do it, but I think we
> should try to understand the timings here first and make sure that we
> are solving the right problem,
>
> Steve.

Hi,

Contrary to my previous findings (which I think I may have screwed up),
I've done further investigation and found that the DLM lookups aren't the
issue at all. Also, making the DLM master the lookup node doesn't seem to
make much difference either. So I'm scrapping this dlm patch.

My latest set of patches now includes one that evenly distributes a
preferred set of rgrps to each node. Unlike Dave's original algorithm in
GFS1, this is a round-robin scheme that makes every node prefer every Nth
rgrp, where N is the number of nodes. This seems to be just as fast, if
not faster, than messing around with DLM.

Hopefully I'll be posting more patches in the near future. I'm currently
running some more tests regarding minimum reservations, and I'll post
patches depending on those results.

Regards,

Bob Peterson
Red Hat File Systems
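The round-robin scheme described above (every node prefers every Nth rgrp, where N is the number of nodes) can be sketched in a few lines; the helper below is illustrative only and uses made-up names, not code from the forthcoming patches:

```python
def preferred_rgrps(node_index, num_nodes, num_rgrps):
    """Round-robin rgrp preference: node i prefers resource group r
    whenever r mod N == i, for N cluster nodes.  The per-node sets are
    disjoint and together cover every rgrp, so nodes tend not to contend
    for the same resource groups when allocating blocks."""
    return [rg for rg in range(num_rgrps) if rg % num_nodes == node_index]

# With 3 nodes and 10 rgrps, each node gets an interleaved, disjoint set:
for node in range(3):
    print(node, preferred_rgrps(node, 3, 10))
```

A node falls back to non-preferred rgrps only when its own set is full or congested, which is why the scheme only degrades when nodes enter or leave the cluster and N changes.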
Re: [Cluster-devel] [DLM PATCH] DLM: Don't wait for resource library lookups if NOLOOKUP is specified
----- Original Message -----
> Can we just use NOQUEUE? It tells you that there's a lock conflict,
> which tells you to move along and try another if you don't want to
> contend. If you cache acquired locks and reuse them, then it doesn't
> matter if the master node is remote or local.

The problem with NOQUEUE is that it seems to depend on circumstances. If
each node mounts the file system at a separate time and, as part of that
process, does a NOQUEUE lock on every rgrp, they all are granted the lock.
With this scheme, they're evenly divided between the nodes.

It doesn't matter anyway, because I'm scrapping the DLM patch in favor of
a scheme like the one you pointed out in GFS1 below. See my other email
for more details.

> If lookups are a problem in general, there is the nodir lockspace mode,
> which replaces the resource directory lookup system with a static
> mapping of resources to master nodes.
(snip)
> Back in 2002 I solved what sounds like the same problem in gfs(1). It
> allowed all nodes to allocate blocks independent of each other, without
> constant locking. You can see the solution here:
> https://git.fedorahosted.org/cgit/cluster.git/tree/gfs-kernel/src/gfs/rgrp.c?h=RHEL4

Regards,

Bob Peterson
Red Hat File Systems
[Cluster-devel] [GFS2 PATCH] GFS2: Change minimum reservation size to 64 blocks
Hi,

I've been working on the problems of both slow performance and excessive file fragmentation, especially with regard to the RHEL7 kernel, and I've created some patches (still forthcoming) to improve these things.

For a long time, I thought the fragmentation issue was related to GFS2 not using its reservations efficiently. I wrote a patch that revises the calculations in function gfs2_write_calc_reserv, but it didn't seem to help. I also thought it might be releasing the reservations too early, but that turned out not to be the case.

I've done extensive performance testing and determined that, even under the best circumstances (IOW, by minimizing both inter- and intra-node resource group contention), file fragmentation can be improved significantly by doubling the minimum block reservation size from 32 to 64 blocks.

To test this, I did several two-plus-hour runs of a real-life application that uses lots of files on GFS2. After each test completed, I totaled all the file extents (fragments) using filefrag.

Here are the results of five two-hour runs performed with the new, revised calculations and today's 32-block reservations:

EXTENT COUNT FOR OUTPUT FILES =  444352
EXTENT COUNT FOR OUTPUT FILES =  468425
EXTENT COUNT FOR OUTPUT FILES =  472464
EXTENT COUNT FOR OUTPUT FILES =  482224
EXTENT COUNT FOR OUTPUT FILES =  481174

Here are the results of five runs performed with the old calculations and 64-block reservations:

EXTENT COUNT FOR OUTPUT FILES =  365743
EXTENT COUNT FOR OUTPUT FILES =  398628
EXTENT COUNT FOR OUTPUT FILES =  404614
EXTENT COUNT FOR OUTPUT FILES =  401355
EXTENT COUNT FOR OUTPUT FILES =  384599

As you can see, by doubling the minimum reservation size, the files have about 20 percent fewer extents. This patch therefore doubles the minimum block reservation. Incidentally, if we don't take measures to minimize resource group contention (both inter- and intra-node), the results are much worse.
Here is the same test, done on a stock RHEL7 kernel:

EXTENT COUNT FOR OUTPUT FILES =  826314
EXTENT COUNT FOR OUTPUT FILES =  791406
EXTENT COUNT FOR OUTPUT FILES =  735760

Patch description:

The minimum block reservation size was 32 blocks. This patch doubles it to 64 blocks.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index 5d8f085..e2058a7 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -14,11 +14,9 @@
 #include <linux/uaccess.h>
 
 /* Since each block in the file system is represented by two bits in the
- * bitmap, one 64-bit word in the bitmap will represent 32 blocks.
- * By reserving 32 blocks at a time, we can optimize / shortcut how we search
- * through the bitmaps by looking a word at a time.
+ * bitmap, each 64-bit word in the bitmap represents 32 blocks.
  */
-#define RGRP_RSRV_MINBYTES 8
+#define RGRP_RSRV_MINBYTES 16 /* Reserve 64 blocks */
 #define RGRP_RSRV_MINBLKS ((u32)(RGRP_RSRV_MINBYTES * GFS2_NBBY))
 
 struct gfs2_rgrpd;
[Cluster-devel] [GFS2 PATCH] GFS2: Block reservation doubling scheme
Hi,

Steve Whitehouse wrote:
| I'd much prefer to see an algorithm that is adaptive, rather than simply
| bumping up the default here. We do need to be able to cope with cases
| where the files are much smaller, and without adaptive sizing, we might
| land up creating small holes in the allocations, which would then cause
| problems for later allocations,

I took this to heart and came up with a new design. The idea is not unlike the block reservation doubling schemes in other file systems. The results are fantastic; much better than those gained by hard-coding 64 blocks. This is the best level of fragmentation I've ever achieved with this app:

EXTENT COUNT FOR OUTPUT FILES =  310103
EXTENT COUNT FOR OUTPUT FILES =  343990
EXTENT COUNT FOR OUTPUT FILES =  332818
EXTENT COUNT FOR OUTPUT FILES =  336852
EXTENT COUNT FOR OUTPUT FILES =  334820

Compare these results to counts without the patch:

EXTENT COUNT FOR OUTPUT FILES =  951813
EXTENT COUNT FOR OUTPUT FILES =  966978
EXTENT COUNT FOR OUTPUT FILES = 1065481

The only downside I see is that it makes the gfs2 inode structure bigger. I also thought about changing the minimum reservation size to 16 blocks rather than 32, since it's now adaptive, but before I do that, I'd have to run some performance tests. What do you think?

Regards,

Bob Peterson
Red Hat File Systems

Patch text:

This patch introduces a new block reservation doubling scheme. If we get to the end of a multi-block reservation, we probably did not reserve enough blocks, so we double the size of the reservation for next time. If we can't find any rgrps that match, we go back to the default 32 blocks.
Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
 fs/gfs2/incore.h | 1 +
 fs/gfs2/main.c   | 2 ++
 fs/gfs2/rgrp.c   | 7 ++-
 fs/gfs2/rgrp.h   | 9 ++---
 4 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 39e7e99..f98fa37 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -398,6 +398,7 @@ struct gfs2_inode {
 	u32 i_diskflags;
 	u8 i_height;
 	u8 i_depth;
+	u32 i_rsrv_minblks; /* minimum blocks per reservation */
 };
 
 /*
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index 82b6ac8..2be2f98 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -29,6 +29,7 @@
 #include "glock.h"
 #include "quota.h"
 #include "recovery.h"
+#include "rgrp.h"
 #include "dir.h"
 
 struct workqueue_struct *gfs2_control_wq;
@@ -42,6 +43,7 @@ static void gfs2_init_inode_once(void *foo)
 	INIT_LIST_HEAD(&ip->i_trunc_list);
 	ip->i_res = NULL;
 	ip->i_hash_cache = NULL;
+	ip->i_rsrv_minblks = RGRP_RSRV_MINBLKS;
 }
 
 static void gfs2_init_glock_once(void *foo)
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7474c41..986c33f 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1465,7 +1465,7 @@ static void rg_mblk_search(struct gfs2_rgrpd *rgd, struct gfs2_inode *ip,
 		extlen = 1;
 	else {
 		extlen = max_t(u32, atomic_read(&rs->rs_sizehint), ap->target);
-		extlen = clamp(extlen, RGRP_RSRV_MINBLKS, free_blocks);
+		extlen = clamp(extlen, ip->i_rsrv_minblks, free_blocks);
 	}
 	if ((rgd->rd_free_clone < rgd->rd_reserved) || (free_blocks < extlen))
 		return;
@@ -2000,6 +2000,7 @@ next_rgrp:
 	 * then this checks for some less likely conditions before
 	 * trying again.
 	 */
+	ip->i_rsrv_minblks = RGRP_RSRV_MINBLKS;
 	loops++;
 	/* Check that fs hasn't grown if writing to rindex */
 	if (ip == GFS2_I(sdp->sd_rindex) && !sdp->sd_rindex_uptodate) {
@@ -2195,6 +2196,10 @@ static void gfs2_adjust_reservation(struct gfs2_inode *ip,
 		trace_gfs2_rs(rs, TRACE_RS_CLAIM);
 		if (rs->rs_free && !ret)
 			goto out;
+		/* We used up our block reservation, so double the
+		   minimum reservation size for the next write. */
+		if (ip->i_rsrv_minblks < RGRP_RSRV_MAXBLKS)
+			ip->i_rsrv_minblks <<= 1;
 	}
 	__rs_deltree(rs);
 }
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index 5d8f085..d081cac 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -13,13 +13,8 @@
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 
-/* Since each block in the file system is represented by two bits in the
- * bitmap, one 64-bit word in the bitmap will represent 32 blocks.
- * By reserving 32 blocks at a time, we can optimize / shortcut how we search
- * through the bitmaps by looking a word at a time.
- */
-#define RGRP_RSRV_MINBYTES 8
-#define RGRP_RSRV_MINBLKS ((u32)(RGRP_RSRV_MINBYTES * GFS2_NBBY))
+#define RGRP_RSRV_MINBLKS 32
+#define RGRP_RSRV_MAXBLKS 512
 
 struct gfs2_rgrpd;
 struct gfs2_sbd;
[Cluster-devel] [GFS2 PATCH] GFS2: Set of distributed preferences for rgrps
Hi,

This patch uses journal numbers to evenly distribute which node prefers which resource groups for block allocations. This is to help performance. The idea is to make each node in the cluster prefer a certain subset of resource groups for its block allocations, while other nodes use different subsets, thus minimizing inter-node DLM communications.

GFS1 has a similar scheme, but with GFS1, each node starts out using an initial section of the file system and works its way forward. While this works for the most part, simultaneous writes by different nodes can cause the heads to bounce on some devices, and nodes tend to stray from their initial locations. With this patch, the preferred resource groups are assigned in a round-robin fashion.

Tests prove that this patch can be a major performance boost for some applications. For example, using one particular application, I posted these run times before applying this patch:

Run 1 time: 2hr 49min 39sec
Run 2 time: 2hr 57min 59sec
Run 3 time: 3hr 1min 10sec

With the patch applied, the times are improved by almost a third:

Run 1 time: 2hr 7min 31sec
Run 2 time: 2hr 5min 49sec
Run 3 time: 2hr 5min 1sec
Run 4 time: 2hr 4min 35sec
Run 5 time: 2hr 5min 55sec

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index f98fa37..c9f4f4c 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -97,6 +97,7 @@ struct gfs2_rgrpd {
 #define GFS2_RDF_CHECK		0x1000 /* check for unlinked inodes */
 #define GFS2_RDF_UPTODATE	0x2000 /* rg is up to date */
 #define GFS2_RDF_ERROR		0x4000 /* error in rg */
+#define GFS2_RDF_PREFERRED	0x8000 /* This rgrp is preferred */
 #define GFS2_RDF_MASK		0xf000 /* mask for internal flags */
 	spinlock_t rd_rsspin;           /* protects reservation related vars */
 	struct rb_root rd_rstree;       /* multi-block reservation tree */
@@ -809,6 +810,7 @@ struct gfs2_sbd {
 
 	char sd_table_name[GFS2_FSNAME_LEN];
 	char sd_proto_name[GFS2_FSNAME_LEN];
+	int sd_nodes;
 
 	/* Debugging crud */
 
 	unsigned long sd_last_warning;
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 641383a..5aeb03a 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1113,6 +1113,8 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
 	struct gfs2_sbd *sdp = arg;
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 
+	BUG_ON(num_slots == 0);
+	sdp->sd_nodes = num_slots;
 	/* ensure the ls jid arrays are large enough */
 	set_recover_size(sdp, slots, num_slots);
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index d3eae24..bf3193f 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -134,6 +134,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
 	atomic_set(&sdp->sd_log_freeze, 0);
 	atomic_set(&sdp->sd_frozen_root, 0);
 	init_waitqueue_head(&sdp->sd_frozen_root_wait);
+	sdp->sd_nodes = 1;
 	return sdp;
 }
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 986c33f..bd8bddc 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -936,7 +936,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
 	rgd->rd_gl->gl_vm.start = rgd->rd_addr * bsize;
 	rgd->rd_gl->gl_vm.end = rgd->rd_gl->gl_vm.start + (rgd->rd_length * bsize) - 1;
 	rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
-	rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
+	rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
 	if (rgd->rd_data > sdp->sd_max_rg_data)
 		sdp->sd_max_rg_data = rgd->rd_data;
 	spin_lock(&sdp->sd_rindex_spin);
@@ -955,6 +955,36 @@ fail:
 }
 
 /**
+ * set_rgrp_preferences - Run all the rgrps, selecting some we prefer to use
+ * @sdp: the GFS2 superblock
+ *
+ * The purpose of this function is to select a subset of the resource groups
+ * and mark them as PREFERRED. We do it in such a way that each node prefers
+ * to use a unique set of rgrps to minimize glock contention.
+ */
+static void set_rgrp_preferences(struct gfs2_sbd *sdp)
+{
+	struct gfs2_rgrpd *rgd, *first;
+	int i;
+
+	/* Skip an initial number of rgrps, based on this node's journal ID.
+	   That should start each node out on its own set. */
+	rgd = gfs2_rgrpd_get_first(sdp);
+	for (i = 0; i < sdp->sd_lockstruct.ls_jid; i++)
+		rgd = gfs2_rgrpd_get_next(rgd);
+	first = rgd;
+
+	do {
+		rgd->rd_flags |= GFS2_RDF_PREFERRED;
+		for (i = 0; i < sdp->sd_nodes; i++) {
+			rgd = gfs2_rgrpd_get_next(rgd);
+			if (rgd == first)
+				break;
+		}
+	} while (rgd != first);
+}
+
+/**
  * gfs2_ri_update - Pull in a new resource index from the disk
  * @ip: pointer to the rindex inode
  *
@@ -973,6 +1003,8
Re: [Cluster-devel] [GFS2 PATCH] GFS2: Block reservation doubling scheme
----- Original Message -----
> ----- Original Message -----
> > This patch introduces a new block reservation doubling scheme. If we
>
> Maybe I sent this patch out prematurely. Instead of doubling the
> reservation, maybe I should experiment with making it grow additively.
> IOW, instead of 32, 64, 128, 256, 512, I should use:
> 32, 64, 96, 128, 160, 192, 224, etc.
> I know other file systems use doubling schemes, but I'm concerned about
> doubling being too aggressive.

I tried an additive reservations algorithm. I basically changed the previous patch from doubling the reservation to adding 32 blocks. In other words, I replaced:

+			ip->i_rsrv_minblks <<= 1;

with this:

+			ip->i_rsrv_minblks += RGRP_RSRV_MINBLKS;

The results were not as good, but still very impressive, and maybe acceptable:

Reservation doubling scheme:
EXTENT COUNT FOR OUTPUT FILES = 310103
EXTENT COUNT FOR OUTPUT FILES = 343990
EXTENT COUNT FOR OUTPUT FILES = 332818
EXTENT COUNT FOR OUTPUT FILES = 336852
EXTENT COUNT FOR OUTPUT FILES = 334820

Reservation additive scheme (32 blocks):
EXTENT COUNT FOR OUTPUT FILES = 322406
EXTENT COUNT FOR OUTPUT FILES = 341665
EXTENT COUNT FOR OUTPUT FILES = 341769
EXTENT COUNT FOR OUTPUT FILES = 348676
EXTENT COUNT FOR OUTPUT FILES = 348079

So I'm looking for opinions:
(a) Stick with the original reservation doubling patch, or
(b) Go with the additive version.
(c) Any other ideas?

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [GFS2 PATCH] GFS2: Block reservation doubling scheme
are necessary. That way, applications that do a long sequence of large appends followed by small rewrites don't get their append size hint whacked by the small rewrites. This didn't help the customer application I'm testing, but it did help some of the other benchmarks I ran yesterday.

5. I don't like the idea of adjusting the reservation at fsync time. Similarly, I don't like the idea of adjusting the reservation at file close time. I think it makes the most sense to keep the info associated with the inode, as we do today. My next iteration will hopefully not add any fields to the inode.

6. I like the idea of adjusting the reservation for non-linear writes, such as backwards writes, but I may have to do more testing. For example, if I do multiple writes to a file at 2MB, 1MB, 500KB, etc., is it better to reserve X blocks which will be assigned in reverse order? Or is it better to just reserve them as needed and have them more scattered but possibly more linear? Maybe testing will show.

7. In regards to storing the context information in the struct file: it depends on what information. Today there is only one reservation structure, and one reservation size, per inode, whereas there can be many struct files for many writers to the inode. The question of whether a reservation is adequate is not so much "will this reservation be adequate for this writer?" Rather, it's "will this reservation be adequate for our most demanding writer?" All the rewriters in the world shouldn't affect the outcome of a single aggressive appender, for example.

To answer your question: I'd wager that yes, there are multiple writers to at least some of the files, but I'm not sure how extensive it is. The workload seems to have a good variety of linear and non-linear writes as well. At least now I'm starting to use multiple benchmarks for my tests.

Regards,

Bob Peterson
Red Hat File Systems
[Cluster-devel] [GFS2 PATCH] GFS2: Speed up fiemap function by skipping holes
Hi,

This patch detects the new holesize bit in block_map requests. If a hole is found during fiemap, it figures out the size of the hole based on the current metapath information. Since the metapath only represents a section of the file, it can only extrapolate to a certain size based on the current metapath buffers. Therefore, fiemap may call blockmap several times to get the hole size. The hole size is determined by a new function.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index f0b945a..edd7ed6 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -587,6 +587,62 @@ static int gfs2_bmap_alloc(struct inode *inode, const sector_t lblock,
 }
 
 /**
+ * hole_size - figure out the size of a hole
+ * @ip: The inode
+ * @lblock: The logical starting block number
+ * @mp: The metapath
+ *
+ * Returns: The hole size in bytes
+ *
+ */
+static u64 hole_size(struct inode *inode, sector_t lblock,
+		     struct metapath *mp)
+{
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_sbd *sdp = GFS2_SB(inode);
+	unsigned int end_of_metadata = ip->i_height - 1;
+	u64 factor = 1;
+	int hgt = end_of_metadata;
+	u64 holesz = 0, holestep;
+	const __be64 *first, *end, *ptr;
+	const struct buffer_head *bh;
+	u64 isize = i_size_read(inode);
+	int zeroptrs;
+	struct metapath mp_eof;
+
+	/* Get a metapath to the very last byte */
+	find_metapath(sdp, (isize - 1) >> inode->i_blkbits, &mp_eof,
+		      ip->i_height);
+	for (hgt = end_of_metadata; hgt >= 0; hgt--) {
+		bh = mp->mp_bh[hgt];
+		if (bh) {
+			zeroptrs = 0;
+			first = metapointer(hgt, mp);
+			end = (const __be64 *)(bh->b_data + bh->b_size);
+
+			for (ptr = first; ptr < end; ptr++) {
+				if (*ptr)
+					break;
+				else
+					zeroptrs++;
+			}
+		} else {
+			zeroptrs = sdp->sd_inptrs;
+		}
+		holestep = min(factor * zeroptrs,
+			       isize - (lblock + (zeroptrs * holesz)));
+		holesz += holestep;
+		if (lblock + holesz >= isize)
+			return holesz << inode->i_blkbits;
+
+		factor *= sdp->sd_inptrs;
+		if (hgt &&
+		    (mp->mp_list[hgt - 1] < mp_eof.mp_list[hgt - 1]))
+			(mp->mp_list[hgt - 1])++;
+	}
+	return holesz << inode->i_blkbits;
+}
+
+/**
  * gfs2_block_map - Map a block from an inode to a disk block
  * @inode: The inode
  * @lblock: The logical block number
@@ -645,11 +701,17 @@ int gfs2_block_map(struct inode *inode, sector_t lblock,
 	ret = lookup_metapath(ip, &mp);
 	if (ret < 0)
 		goto out;
-	if (ret != ip->i_height)
+	if (ret != ip->i_height) {
+		if (test_clear_buffer_holesize(bh_map))
+			bh_map->b_size = hole_size(inode, lblock, &mp);
 		goto do_alloc;
+	}
 	ptr = metapointer(ip->i_height - 1, &mp);
-	if (*ptr == 0)
+	if (*ptr == 0) {
+		if (test_clear_buffer_holesize(bh_map))
+			bh_map->b_size = hole_size(inode, lblock, &mp);
 		goto do_alloc;
+	}
 	map_bh(bh_map, inode->i_sb, be64_to_cpu(*ptr));
 	bh = mp.mp_bh[ip->i_height - 1];
 	len = gfs2_extent_length(bh->b_data, bh->b_size, ptr, maxlen, &eob);
[Cluster-devel] [GFS2 PATCH 3/4] GFS2: Only increase rs_sizehint
If an application does a sequence of (1) big write, (2) little write, we don't necessarily want to reset the size hint based on the smaller size. The fact that it did any big writes implies it may do more, and therefore we should try to allocate bigger block reservations, even if the last few writes were small. Therefore, this patch changes function gfs2_size_hint so that the size hint can only grow; it cannot shrink. This is especially important where there are multiple writers.
---
 fs/gfs2/file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 2976019..5c7a9c1 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -337,7 +337,8 @@ static void gfs2_size_hint(struct file *filep, loff_t offset, size_t size)
 	size_t blks = (size + sdp->sd_sb.sb_bsize - 1) >> sdp->sd_sb.sb_bsize_shift;
 	int hint = min_t(size_t, INT_MAX, blks);
 
-	atomic_set(&ip->i_res->rs_sizehint, hint);
+	if (hint > atomic_read(&ip->i_res->rs_sizehint))
+		atomic_set(&ip->i_res->rs_sizehint, hint);
 }
 
 /**
-- 
1.9.3
[Cluster-devel] [GFS2 PATCH 2/4] GFS2: Make block reservations more persistent
Before this patch, whenever a struct file (opened to allow writes) was closed, the multi-block reservation structure associated with the inode was deleted. That's a problem, especially when there are multiple writers: applications that do open-write-close suffer greater levels of fragmentation and need to re-do work to perform write operations.

This patch removes the reservation delete from the file close code so that reservations are more persistent, lasting until the inode is deleted.
---
 fs/gfs2/file.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 7f4ed3d..2976019 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -616,15 +616,8 @@ static int gfs2_open(struct inode *inode, struct file *file)
 
 static int gfs2_release(struct inode *inode, struct file *file)
 {
-	struct gfs2_inode *ip = GFS2_I(inode);
-
 	kfree(file->private_data);
 	file->private_data = NULL;
-
-	if (!(file->f_mode & FMODE_WRITE))
-		return 0;
-
-	gfs2_rs_delete(ip, &inode->i_writecount);
 	return 0;
 }
-- 
1.9.3
[Cluster-devel] [GFS2 PATCH 0/4] Patches to reduce GFS2 fragmentation
Hi,

On October 8, I posted a GFS2 patch that greatly reduced inter-node contention for resource group glocks. The patch was called "GFS2: Set of distributed preferences for rgrps". It implemented a new scheme whereby each node in a cluster tries to keep to itself for allocations. This is not unlike GFS1, which has a different scheme.

Although the patch sped up GFS2 performance in general, it also caused more file fragmentation, because each node tended to focus on a smaller subset of resource groups.

Here are run times and file fragmentation extent counts for my favorite customer application, using a STOCK RHEL7 kernel (no patches):

Run times:
Run 1 time: 2hr 40min 33sec
Run 2 time: 2hr 39min 52sec
Run 3 time: 2hr 39min 31sec
Run 4 time: 2hr 33min 57sec
Run 5 time: 2hr 41min 6sec

Total file extents (file fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 744708
EXTENT COUNT FOR OUTPUT FILES = 749868
EXTENT COUNT FOR OUTPUT FILES = 721862
EXTENT COUNT FOR OUTPUT FILES = 635301
EXTENT COUNT FOR OUTPUT FILES = 689263

The times are bad, and the fragmentation level is also bad. If I add just the first patch, "GFS2: Set of distributed preferences for rgrps", you can see that the performance improves, but the fragmentation is worse (I only did three iterations this time):

Run times:
Run 1 time: 2hr 2min 47sec
Run 2 time: 2hr 8min 37sec
Run 3 time: 2hr 10min 0sec

Total file extents (file fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 1011217
EXTENT COUNT FOR OUTPUT FILES = 1025973
EXTENT COUNT FOR OUTPUT FILES = 1070163

So the patch improved performance by 25 percent, but file fragmentation is 30 percent worse. Some of this is undoubtedly due to the SAN array buffering, hiding our multitude of sins. But not every customer will have this quality of SAN, so it's important to reduce the fragmentation as well; otherwise the patch helps some customers while hurting others. Toward this end, I devised three relatively simple patches that greatly reduce file fragmentation.
With all four patches, the numbers are as follows:

Run times:
Run 1 time: 2hr 5min 46sec
Run 2 time: 2hr 10min 15sec
Run 3 time: 2hr 8min 4sec
Run 4 time: 2hr 9min 27sec
Run 5 time: 2hr 6min 15sec

Total file extents (file fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 330276
EXTENT COUNT FOR OUTPUT FILES = 358939
EXTENT COUNT FOR OUTPUT FILES = 375374
EXTENT COUNT FOR OUTPUT FILES = 383071
EXTENT COUNT FOR OUTPUT FILES = 369269

As you can see, with this combination of four patches, the run times are good, as are the file fragmentation levels. The file fragmentation is about twice as good as the stock kernel, and significantly better (almost three times better) than with the first patch alone. This patch set includes all four patches.

Bob Peterson (4):
  GFS2: Set of distributed preferences for rgrps
  GFS2: Make block reservations more persistent
  GFS2: Only increase rs_sizehint
  GFS2: If we use up our block reservation, request more next time

 fs/gfs2/file.c       | 10 ++--
 fs/gfs2/incore.h     |  2 ++
 fs/gfs2/lock_dlm.c   |  2 ++
 fs/gfs2/ops_fstype.c |  1 +
 fs/gfs2/rgrp.c       | 69
 5 files changed, 71 insertions(+), 13 deletions(-)
-- 
1.9.3
[Cluster-devel] [GFS2 PATCH 1/4] GFS2: Set of distributed preferences for rgrps
This patch tries to use the journal numbers to evenly distribute which node prefers which resource group for block allocations. This is to help performance.
---
 fs/gfs2/incore.h     |  2 ++
 fs/gfs2/lock_dlm.c   |  2 ++
 fs/gfs2/ops_fstype.c |  1 +
 fs/gfs2/rgrp.c       | 66
 4 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 39e7e99..618d20a 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -97,6 +97,7 @@ struct gfs2_rgrpd {
 #define GFS2_RDF_CHECK		0x1000 /* check for unlinked inodes */
 #define GFS2_RDF_UPTODATE	0x2000 /* rg is up to date */
 #define GFS2_RDF_ERROR		0x4000 /* error in rg */
+#define GFS2_RDF_PREFERRED	0x8000 /* This rgrp is preferred */
 #define GFS2_RDF_MASK		0xf000 /* mask for internal flags */
 	spinlock_t rd_rsspin;           /* protects reservation related vars */
 	struct rb_root rd_rstree;       /* multi-block reservation tree */
@@ -808,6 +809,7 @@ struct gfs2_sbd {
 
 	char sd_table_name[GFS2_FSNAME_LEN];
 	char sd_proto_name[GFS2_FSNAME_LEN];
+	int sd_nodes;
 
 	/* Debugging crud */
 
 	unsigned long sd_last_warning;
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 641383a..5aeb03a 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1113,6 +1113,8 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
 	struct gfs2_sbd *sdp = arg;
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 
+	BUG_ON(num_slots == 0);
+	sdp->sd_nodes = num_slots;
 	/* ensure the ls jid arrays are large enough */
 	set_recover_size(sdp, slots, num_slots);
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index d3eae24..bf3193f 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -134,6 +134,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
 	atomic_set(&sdp->sd_log_freeze, 0);
 	atomic_set(&sdp->sd_frozen_root, 0);
 	init_waitqueue_head(&sdp->sd_frozen_root_wait);
+	sdp->sd_nodes = 1;
 	return sdp;
 }
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7474c41..50cdba2 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -936,7 +936,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
 	rgd->rd_gl->gl_vm.start = rgd->rd_addr * bsize;
 	rgd->rd_gl->gl_vm.end = rgd->rd_gl->gl_vm.start + (rgd->rd_length * bsize) - 1;
 	rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
-	rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
+	rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
 	if (rgd->rd_data > sdp->sd_max_rg_data)
 		sdp->sd_max_rg_data = rgd->rd_data;
 	spin_lock(&sdp->sd_rindex_spin);
@@ -955,6 +955,36 @@ fail:
 }
 
 /**
+ * set_rgrp_preferences - Run all the rgrps, selecting some we prefer to use
+ * @sdp: the GFS2 superblock
+ *
+ * The purpose of this function is to select a subset of the resource groups
+ * and mark them as PREFERRED. We do it in such a way that each node prefers
+ * to use a unique set of rgrps to minimize glock contention.
+ */
+static void set_rgrp_preferences(struct gfs2_sbd *sdp)
+{
+	struct gfs2_rgrpd *rgd, *first;
+	int i;
+
+	/* Skip an initial number of rgrps, based on this node's journal ID.
+	   That should start each node out on its own set. */
+	rgd = gfs2_rgrpd_get_first(sdp);
+	for (i = 0; i < sdp->sd_lockstruct.ls_jid; i++)
+		rgd = gfs2_rgrpd_get_next(rgd);
+	first = rgd;
+
+	do {
+		rgd->rd_flags |= GFS2_RDF_PREFERRED;
+		for (i = 0; i < sdp->sd_nodes; i++) {
+			rgd = gfs2_rgrpd_get_next(rgd);
+			if (rgd == first)
+				break;
+		}
+	} while (rgd != first);
+}
+
+/**
  * gfs2_ri_update - Pull in a new resource index from the disk
  * @ip: pointer to the rindex inode
  *
@@ -973,6 +1003,8 @@ static int gfs2_ri_update(struct gfs2_inode *ip)
 	if (error < 0)
 		return error;
 
+	set_rgrp_preferences(sdp);
+
 	sdp->sd_rindex_uptodate = 1;
 	return 0;
 }
@@ -1891,6 +1923,25 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, const struct gfs2_rgrpd *b
 }
 
 /**
+ * fast_to_acquire - determine if a resource group will be fast to acquire
+ *
+ * If this is one of our preferred rgrps, it should be quicker to acquire,
+ * because we tried to set ourselves up as dlm lock master.
+ */
+static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
+{
+	struct gfs2_glock *gl = rgd->rd_gl;
+
+	if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
+	    !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
+	    !test_bit(GLF_DEMOTE, &gl->gl_flags))
+		return 1;
+	if (rgd->rd_flags & GFS2_RDF_PREFERRED)
+		return 1;
+	return 0;
+}
+
[Cluster-devel] [PATCH][try6] VFS: new want_holesize and got_holesize buffer_head flags for fiemap
Hi,

This is my sixth rework of this patch. The problem with the previous version was that the underlying file system's block_map function may set the buffer_head's b_state to 0, which could be misinterpreted by fiemap as meaning it has returned the hole size in b_size. This version implements a suggestion from Steve Whitehouse: it improves on the design by (unfortunately) using two flags: one flag to tell block_map to return the hole size if possible, and another to tell fiemap that the hole size has been returned. This way there is no possibility of a misunderstanding.

The problem: if you do a fiemap operation on a very large sparse file, it can take an extremely long time (we're talking days here), because function __generic_block_fiemap does a block-for-block search when it encounters a hole.

The solution: allow the underlying file system to return the hole size so that function __generic_block_fiemap can quickly skip the hole.

I have a companion patch to GFS2 that takes advantage of the new flags to speed up fiemap on sparse GFS2 files. Other file systems can do the same as they see fit. For GFS2, the time it takes to skip a 1PB hole in a sparse file goes from days to milliseconds.

Patch description:

This patch changes function __generic_block_fiemap so that it sets a new buffer_want_holesize bit. The new bit signals to the underlying file system to return a hole size from its block_map function (if possible) in the event that a hole is encountered at the requested block. If the block_map function encounters a hole and sets buffer_got_holesize, fiemap takes the returned b_size to be the size of the hole, in bytes. It then skips the hole and moves to the next block. This may be repeated several times in a row, especially for large holes, due to possible limitations of the fs-specific block_map function. This is still much faster than trying each block individually when large holes are encountered.
If the block_map function does not set buffer_got_holesize, the request for hole size has been ignored, and fiemap falls back to today's method of doing a block-by-block search for the next valid block.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpete...@redhat.com>
---
 fs/ioctl.c                  | 7 ++++++-
 include/linux/buffer_head.h | 4 ++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 8ac3fad..9c07a40 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -291,13 +291,18 @@ int __generic_block_fiemap(struct inode *inode,
 
 		memset(&map_bh, 0, sizeof(struct buffer_head));
 		map_bh.b_size = len;
+		set_buffer_want_holesize(&map_bh);
 		ret = get_block(inode, start_blk, &map_bh, 0);
 		if (ret)
 			break;
 
 		/* HOLE */
 		if (!buffer_mapped(&map_bh)) {
-			start_blk++;
+			if (buffer_got_holesize(&map_bh))
+				start_blk += logical_to_blk(inode,
+							    map_bh.b_size);
+			else
+				start_blk++;
 
 			/*
 			 * We want to handle the case where there is an
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 324329c..fc8e47c 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -37,6 +37,8 @@ enum bh_state_bits {
 	BH_Meta,	/* Buffer contains metadata */
 	BH_Prio,	/* Buffer should be submitted with REQ_PRIO */
 	BH_Defer_Completion, /* Defer AIO completion to workqueue */
+	BH_Want_Holesize, /* Return hole size if possible */
+	BH_Got_Holesize, /* Hole size has been returned */
 
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
@@ -128,6 +130,8 @@ BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Unwritten, unwritten)
 BUFFER_FNS(Meta, meta)
+BUFFER_FNS(Want_Holesize, want_holesize)
+BUFFER_FNS(Got_Holesize, got_holesize)
 BUFFER_FNS(Prio, prio)
 BUFFER_FNS(Defer_Completion, defer_completion)
-- 
1.9.3
[Cluster-devel] [PATCH] GFS2: Speed up fiemap function by skipping holes
Hi,

This is the GFS2 companion patch to the one I previously posted for fiemap.

Patch description:

This patch detects the new want_holesize bit in block_map requests. If a hole is found during fiemap, it calculates the size of the hole based on the current metapath information, then it sets the new buffer_got_holesize bit and returns the hole size in b_size. Since the metapath only represents a section of the file, it can only extrapolate to a certain size based on the current metapath buffers. Therefore, fiemap may call block_map several times to get the hole size. The hole size is determined by a new function.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
 fs/gfs2/bmap.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 68 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index f0b945a..450ea17 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -587,6 +587,62 @@ static int gfs2_bmap_alloc(struct inode *inode, const sector_t lblock,
 }

 /**
+ * hole_size - figure out the size of a hole
+ * @inode: The inode
+ * @lblock: The logical starting block number
+ * @mp: The metapath
+ *
+ * Returns: The hole size in bytes
+ *
+ */
+static u64 hole_size(struct inode *inode, sector_t lblock,
+		     struct metapath *mp)
+{
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_sbd *sdp = GFS2_SB(inode);
+	unsigned int end_of_metadata = ip->i_height - 1;
+	u64 factor = 1;
+	int hgt = end_of_metadata;
+	u64 holesz = 0, holestep;
+	const __be64 *first, *end, *ptr;
+	const struct buffer_head *bh;
+	u64 isize = i_size_read(inode);
+	int zeroptrs;
+	struct metapath mp_eof;
+
+	/* Get a metapath to the very last byte */
+	find_metapath(sdp, (isize - 1) >> inode->i_blkbits, &mp_eof,
+		      ip->i_height);
+	for (hgt = end_of_metadata; hgt >= 0; hgt--) {
+		bh = mp->mp_bh[hgt];
+		if (bh) {
+			zeroptrs = 0;
+			first = metapointer(hgt, mp);
+			end = (const __be64 *)(bh->b_data + bh->b_size);
+
+			for (ptr = first; ptr < end; ptr++) {
+				if (*ptr)
+					break;
+				else
+					zeroptrs++;
+			}
+		} else {
+			zeroptrs = sdp->sd_inptrs;
+		}
+		holestep = min(factor * zeroptrs,
+			       isize - (lblock + (zeroptrs * holesz)));
+		holesz += holestep;
+		if (lblock + holesz >= isize)
+			return holesz << inode->i_blkbits;
+
+		factor *= sdp->sd_inptrs;
+		if (hgt && (mp->mp_list[hgt - 1] < mp_eof.mp_list[hgt - 1]))
+			(mp->mp_list[hgt - 1])++;
+	}
+	return holesz << inode->i_blkbits;
+}
+
+/**
  * gfs2_block_map - Map a block from an inode to a disk block
  * @inode: The inode
  * @lblock: The logical block number
@@ -645,11 +701,21 @@ int gfs2_block_map(struct inode *inode, sector_t lblock,
 	ret = lookup_metapath(ip, &mp);
 	if (ret < 0)
 		goto out;
-	if (ret != ip->i_height)
+	if (ret != ip->i_height) {
+		if (buffer_want_holesize(bh_map)) {
+			bh_map->b_size = hole_size(inode, lblock, &mp);
+			set_buffer_got_holesize(bh_map);
+		}
 		goto do_alloc;
+	}
 	ptr = metapointer(ip->i_height - 1, &mp);
-	if (*ptr == 0)
+	if (*ptr == 0) {
+		if (buffer_want_holesize(bh_map)) {
+			bh_map->b_size = hole_size(inode, lblock, &mp);
+			set_buffer_got_holesize(bh_map);
+		}
 		goto do_alloc;
+	}
 	map_bh(bh_map, inode->i_sb, be64_to_cpu(*ptr));
 	bh = mp.mp_bh[ip->i_height - 1];
 	len = gfs2_extent_length(bh->b_data, bh->b_size, ptr, maxlen, &eob);
--
1.9.3
Re: [Cluster-devel] [PATCH][try6] VFS: new want_holesize and got_holesize buffer_head flags for fiemap
- Original Message - This looks like a big indicator that get_blocks is the wrong interface for fiemap. Hi Christoph, Yes, I thought about that. One of my early prototypes had a separate function used by fiemap. Function __generic_block_fiemap would call get_block() which returned an indication of a hole as it does today. When it saw the hole, fiemap called a new function get_hole_size() that was passed in like get_block. The problem is: it's grossly inefficient, since the new function get_hole_size() has to redo most of the work that get_block just did (at least in the case of GFS2). (Which in the case of a 1PB sparse file is non-trivial, since it involves several levels of metadata indirection). Combining it with get_block made it much more efficient. Making a separate get_block_map_fiemap() function just seems like an exercise in redundancy. Regards, Bob Peterson Red Hat File Systems
Re: [Cluster-devel] [PATCH][try6] VFS: new want_holesize and got_holesize buffer_head flags for fiemap
- Original Message - On Wed, Oct 22, 2014 at 08:28:53AM -0400, Bob Peterson wrote: Yes, I thought about that. One of my early prototypes had a separate function used by fiemap. Function __generic_block_fiemap would call get_block() which returned an indication of a hole as it does today. When it saw the hole, fiemap called a new function get_hole_size() that was passed in like get_block. The problem is: it's grossly inefficient, since the new function get_hole_size() has to redo most of the work that get_block just did (at least in the case of GFS2). (Which in the case of a 1PB sparse file is non-trivial, since it involves several levels of metadata indirection). Combining it with get_block made it much more efficient. Making a separate get_block_map_fiemap() function just seems like an exercise in redundancy. I was thinking of replacing get_blocks entirely. We're not actually using a buffer_head in fiemap, so the interface seems somewhat awkward. If it used something like the iomap interface proposed by Dave a long time ago we'd have a much saner interface that for example XFS could use as well. Hi Christoph. Can you send a link to the thread regarding Dave's iomap proposal? I don't recall it offhand, so I don't know what it was or why it was never implemented. I assume you mean Dave Chinner. Maybe it's time to revisit the concept as a long-term solution. In the meantime, do you otherwise object to this patch as a short-term solution? Regards, Bob Peterson Red Hat File Systems
[Cluster-devel] [GFS2 PATCH 1/3] GFS2: Set of distributed preferences for rgrps
This patch tries to use the journal numbers to evenly distribute which node prefers which resource group for block allocations. This is to help performance.
---
 fs/gfs2/incore.h |  1 +
 fs/gfs2/rgrp.c   | 66 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 39e7e99..1b89918 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -97,6 +97,7 @@ struct gfs2_rgrpd {
 #define GFS2_RDF_CHECK		0x10000000 /* check for unlinked inodes */
 #define GFS2_RDF_UPTODATE	0x20000000 /* rg is up to date */
 #define GFS2_RDF_ERROR		0x40000000 /* error in rg */
+#define GFS2_RDF_PREFERRED	0x80000000 /* This rgrp is preferred */
 #define GFS2_RDF_MASK		0xf0000000 /* mask for internal flags */
 	spinlock_t rd_rsspin;           /* protects reservation related vars */
 	struct rb_root rd_rstree;       /* multi-block reservation tree */
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7474c41..f65e56b 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -936,7 +936,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
 		rgd->rd_gl->gl_vm.start = rgd->rd_addr * bsize;
 		rgd->rd_gl->gl_vm.end = rgd->rd_gl->gl_vm.start + (rgd->rd_length * bsize) - 1;
 		rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
-		rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
+		rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
 		if (rgd->rd_data > sdp->sd_max_rg_data)
 			sdp->sd_max_rg_data = rgd->rd_data;
 		spin_lock(&sdp->sd_rindex_spin);
@@ -955,6 +955,36 @@ fail:
 }

 /**
+ * set_rgrp_preferences - Run all the rgrps, selecting some we prefer to use
+ * @sdp: the GFS2 superblock
+ *
+ * The purpose of this function is to select a subset of the resource groups
+ * and mark them as PREFERRED. We do it in such a way that each node prefers
+ * to use a unique set of rgrps to minimize glock contention.
+ */
+static void set_rgrp_preferences(struct gfs2_sbd *sdp)
+{
+	struct gfs2_rgrpd *rgd, *first;
+	int i;
+
+	/* Skip an initial number of rgrps, based on this node's journal ID.
+	   That should start each node out on its own set. */
+	rgd = gfs2_rgrpd_get_first(sdp);
+	for (i = 0; i < sdp->sd_lockstruct.ls_jid; i++)
+		rgd = gfs2_rgrpd_get_next(rgd);
+	first = rgd;
+
+	do {
+		rgd->rd_flags |= GFS2_RDF_PREFERRED;
+		for (i = 0; i < sdp->sd_journals; i++) {
+			rgd = gfs2_rgrpd_get_next(rgd);
+			if (rgd == first)
+				break;
+		}
+	} while (rgd != first);
+}
+
+/**
  * gfs2_ri_update - Pull in a new resource index from the disk
  * @ip: pointer to the rindex inode
  *
@@ -973,6 +1003,8 @@ static int gfs2_ri_update(struct gfs2_inode *ip)
 	if (error < 0)
 		return error;

+	set_rgrp_preferences(sdp);
+
 	sdp->sd_rindex_uptodate = 1;
 	return 0;
 }
@@ -1891,6 +1923,25 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, const struct gfs2_rgrpd *b
 }

 /**
+ * fast_to_acquire - determine if a resource group will be fast to acquire
+ *
+ * If this is one of our preferred rgrps, it should be quicker to acquire,
+ * because we tried to set ourselves up as dlm lock master.
+ */
+static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
+{
+	struct gfs2_glock *gl = rgd->rd_gl;
+
+	if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
+	    !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
+	    !test_bit(GLF_DEMOTE, &gl->gl_flags))
+		return 1;
+	if (rgd->rd_flags & GFS2_RDF_PREFERRED)
+		return 1;
+	return 0;
+}
+
+/**
  * gfs2_inplace_reserve - Reserve space in the filesystem
  * @ip: the inode to reserve space for
  * @ap: the allocation parameters
@@ -1932,10 +1983,15 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, const struct gfs2_alloc_parms *a
 			rg_locked = 0;
 			if (skip && skip--)
 				goto next_rgrp;
-			if (!gfs2_rs_active(rs) && (loops < 2) &&
-			    gfs2_rgrp_used_recently(rs, 1000) &&
-			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
-				goto next_rgrp;
+			if (!gfs2_rs_active(rs)) {
+				if (loops == 0 &&
+				    !fast_to_acquire(rs->rs_rbm.rgd))
+					goto next_rgrp;
+				if ((loops < 3) &&
+				    gfs2_rgrp_used_recently(rs, 1000) &&
+				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+					goto next_rgrp;
+			}
 			error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
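The stride walk in set_rgrp_preferences can be sketched in userspace C. This is a model, not kernel code: the rgrp ring is reduced to an index that wraps with modular arithmetic (the real code walks the rgrp list with gfs2_rgrpd_get_next), and the function and array names are invented for illustration.

```c
/* Model of set_rgrp_preferences(): node with journal ID `jid` marks its
 * preferred rgrps in pref[]. Skip `jid` rgrps to find a starting point,
 * then mark every `njournals`-th rgrp until we come back around. */
static void set_prefs(int jid, int njournals, int nrgrps, int *pref)
{
	int i, cur = 0, first;

	/* Skip an initial number of rgrps based on this node's journal ID,
	 * so each node starts out on its own set. */
	for (i = 0; i < jid; i++)
		cur = (cur + 1) % nrgrps;
	first = cur;

	do {
		pref[cur] = 1;
		for (i = 0; i < njournals; i++) {
			cur = (cur + 1) % nrgrps;
			if (cur == first)
				break;
		}
	} while (cur != first);
}
```

With 12 rgrps and 3 journals, node 0 prefers rgrps {0,3,6,9}, node 1 prefers {1,4,7,10}, node 2 prefers {2,5,8,11}: the preferred sets are disjoint and together cover every rgrp, which is exactly the property the patch wants for minimizing glock contention.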
[Cluster-devel] [GFS2 PATCH 3/3] GFS2: If we use up our block reservation, request more next time
If we run out of blocks for a given multi-block allocation, we obviously did not reserve enough. We should reserve more blocks for the next reservation to reduce fragmentation. This patch increases the size hint for reservations when they run out.
---
 fs/gfs2/rgrp.c | 3 +++
 fs/gfs2/rgrp.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index f65e56b..d4b9d93 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -2251,6 +2251,9 @@ static void gfs2_adjust_reservation(struct gfs2_inode *ip,
 			trace_gfs2_rs(rs, TRACE_RS_CLAIM);
 			if (rs->rs_free && !ret)
 				goto out;
+			/* We used up our block reservation, so we should
+			   reserve more blocks next time. */
+			atomic_add(RGRP_RSRV_ADDBLKS, &rs->rs_sizehint);
 		}
 		__rs_deltree(rs);
 	}
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index 5d8f085..b104f4a 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -20,6 +20,7 @@
  */
 #define RGRP_RSRV_MINBYTES 8
 #define RGRP_RSRV_MINBLKS ((u32)(RGRP_RSRV_MINBYTES * GFS2_NBBY))
+#define RGRP_RSRV_ADDBLKS 64

 struct gfs2_rgrpd;
 struct gfs2_sbd;
--
1.9.3
[Cluster-devel] [GFS2 PATCH 0/3] Patches to reduce GFS2 fragmentation
Hi,

This is a revised version of my patches to improve GFS2 performance and reduce GFS2 fragmentation. As suggested by Steve Whitehouse, the first patch has been modified to base its distribution of resource groups on the number of journals. Also as suggested, the patch that kept file close from deleting rgrp reservations has been removed from the set, leaving only three patches. The removal of that patch caused fragmentation to climb 15 percent, but it's still much (41 percent) better than the original stock kernels. I agree that use cases like untar would make for much worse fragmentation problems if that patch were to remain. I've re-tested with these three patches and found that performance and fragmentation are still good. Here are the numbers:

STOCK RHEL7 kernel (none of these patches):

Run times:
Run 1 time: 2hr 40min 33sec
Run 2 time: 2hr 39min 52sec
Run 3 time: 2hr 39min 31sec
Run 4 time: 2hr 33min 57sec
Run 5 time: 2hr 41min 6sec

Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 744708
EXTENT COUNT FOR OUTPUT FILES = 749868
EXTENT COUNT FOR OUTPUT FILES = 721862
EXTENT COUNT FOR OUTPUT FILES = 635301
EXTENT COUNT FOR OUTPUT FILES = 689263
(Average 708200)

The times are bad and the fragmentation level is also bad.

If I add just the first patch, "GFS2: Set of distributed preferences for rgrps", the performance improves, but the fragmentation is worse (I only did three iterations this time):

Run times:
Run 1 time: 2hr 2min 47sec
Run 2 time: 2hr 8min 37sec
Run 3 time: 2hr 10min 0sec

Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 1011217
EXTENT COUNT FOR OUTPUT FILES = 1025973
EXTENT COUNT FOR OUTPUT FILES = 1070163

With all three patches, both performance and fragmentation are good:

Run 1 time: 2hr 9min 49sec
Run 2 time: 2hr 10min 15sec
Run 3 time: 2hr 8min 35sec
Run 4 time: 2hr 15min 15sec
Run 5 time: 2hr 7min 4sec
Run 6 time: 2hr 8min 8sec
Run 7 time: 2hr 7min 54sec

EXTENT COUNT FOR OUTPUT FILES = 406366
EXTENT COUNT FOR OUTPUT FILES = 432978
EXTENT COUNT FOR OUTPUT FILES = 428736
EXTENT COUNT FOR OUTPUT FILES = 419736
EXTENT COUNT FOR OUTPUT FILES = 419040
EXTENT COUNT FOR OUTPUT FILES = 422774
EXTENT COUNT FOR OUTPUT FILES = 409281
(Average 419844)

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
Bob Peterson (3):
  GFS2: Set of distributed preferences for rgrps
  GFS2: Only increase rs_sizehint
  GFS2: If we use up our block reservation, request more next time

 fs/gfs2/file.c   |  3 ++-
 fs/gfs2/incore.h |  1 +
 fs/gfs2/rgrp.c   | 69 ++++++++++++++++++++++++++++++++++++++++++++++++------
 fs/gfs2/rgrp.h   |  1 +
 4 files changed, 68 insertions(+), 6 deletions(-)
--
1.9.3
[Cluster-devel] [GFS2 PATCH 2/3] GFS2: Only increase rs_sizehint
If an application does a sequence of (1) big write, (2) little write, we don't necessarily want to reset the size hint based on the smaller size. The fact that they did any big writes implies they may do more, and therefore we should try to allocate bigger block reservations, even if the last few were small writes. Therefore this patch changes function gfs2_size_hint so that the size hint can only grow; it cannot shrink. This is especially important where there are multiple writers.
---
 fs/gfs2/file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 7f4ed3d..f64084b 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -337,7 +337,8 @@ static void gfs2_size_hint(struct file *filep, loff_t offset, size_t size)
 	size_t blks = (size + sdp->sd_sb.sb_bsize - 1) >> sdp->sd_sb.sb_bsize_shift;
 	int hint = min_t(size_t, INT_MAX, blks);

-	atomic_set(&ip->i_res->rs_sizehint, hint);
+	if (hint > atomic_read(&ip->i_res->rs_sizehint))
+		atomic_set(&ip->i_res->rs_sizehint, hint);
 }

 /**
--
1.9.3
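The ratchet behavior above is tiny but worth pinning down with a userspace sketch. The function name and plain-pointer signature are hypothetical (the kernel code uses an atomic_t and min_t against INT_MAX); only the round-up-to-blocks and only-grow logic mirror the patch.

```c
/* Convert a write size to whole blocks (rounding up) and only ratchet the
 * stored hint upward, mirroring the patched gfs2_size_hint(). */
static void size_hint_update(unsigned long *sizehint, unsigned long write_bytes,
			     unsigned long bsize)
{
	unsigned long blks = (write_bytes + bsize - 1) / bsize;

	if (blks > *sizehint)
		*sizehint = blks;
}
```

A 1MB write on a 4K-block filesystem pushes the hint to 256 blocks; a later 4K write leaves it at 256 instead of collapsing it to 1, so the next reservation stays big.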
Re: [Cluster-devel] [GFS2 PATCH 1/3] GFS2: Set of distributed preferences for rgrps
- Original Message -

+ if ((loops < 3) &&
+     gfs2_rgrp_used_recently(rs, 1000) &&
+     gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+	goto next_rgrp;
+ }

This makes no sense; we end the loop when loops == 3, so these conditions will be applied in every case, which is not what we want. We must always end up doing a search of every rgrp in the worst case, in order that if there is some space left somewhere, we will eventually find it. Definitely better wrt figuring out which rgrps to prefer, but I'm not yet convinced about this logic. The whole point of the congestion logic is to figure out ahead of time whether it will take a long time to access that rgrp, so it seems that this is not quite right; otherwise there should be no need to bypass it like this. The fast_to_acquire logic should at least be merged into the rgrp_congested logic, possibly by just reducing the threshold at which congestion is measured. It might be useful to introduce a tracepoint for when we reject an rgrp during allocation, with a reason as to why it was rejected, so that it is easier to see what's going on here.

Steve.

Hi,

Sigh. You're right: good catch. The problem is that I've done more than 30 attempts at trying to get this right in my git tree, each of which has up to 15 patches for various things. Somewhere around iteration 20, I dropped an important change. My intent was always to add another layer of rgrp criteria, so loops would be 4 rather than 3. I had done this with a different patch, but it got dropped by accident. The 3 original layers are as follows:

loop 0: Reject rgrps that are likely congested (based on past history) and rgrps where we just found congestion. Only accept rgrps for which we can get a multi-block reservation.
loop 1: Reject rgrps that are likely congested (based on past history) and rgrps where we just found congestion. Accept rgrps that have enough free space, even if we can't get a reservation.
loop 2: Don't ever reject rgrps because we're out of ideal conditions.
The new scheme was intended to add a new layer 0 which only accepts rgrps within a preferred subset of rgrps. In other words:

loop 0: Reject rgrps that aren't in our preferred subset of rgrps. Reject rgrps that are likely congested (based on past history) and rgrps where we just found congestion. Only accept rgrps for which we can get a multi-block reservation.
loop 1: Reject rgrps that are likely congested (based on past history) and rgrps where we just found congestion. Only accept rgrps for which we can get a multi-block reservation.
loop 2: Reject rgrps that are likely congested (based on past history) and rgrps where we just found congestion. Accept any rgrp that has enough free space, even if we can't get a reservation.
loop 3: Don't ever reject rgrps because we're out of ideal conditions.

But is 4 loops too many? I could combine 0 and 1, and in fact, today's code accidentally does just that. The mistake was probably that I had been experimenting with 3 versus 4 layers and had switched them back and forth a few times for various tests.

Regards,

Bob Peterson
Red Hat File Systems
[Cluster-devel] [GFS2 PATCH 0/3] Patches to reduce GFS2 fragmentation
Hi,

This is a revised version of my patches to improve GFS2 performance and reduce GFS2 fragmentation. As suggested by Steve Whitehouse, the first patch has been modified to base its distribution of resource groups on the number of journals. It also corrects the criteria on the final loop, changing loops < 3 back to loops < 2 as it should be. Also, I re-tested to make sure nothing had regressed, and updated the timing numbers below.

Also as suggested, the patch that kept file close from deleting rgrp reservations has been removed from the set, leaving only three patches. The removal of that patch caused fragmentation to climb 15 percent, but it's still much (41 percent) better than the original stock kernels. I agree that use cases like untar would make for much worse fragmentation problems if that patch were to remain. I've re-tested with these three patches and found that performance and fragmentation are still good. Here are the numbers:

STOCK RHEL7 kernel (none of these patches):

Run times:
Run 1 time: 2hr 40min 33sec
Run 2 time: 2hr 39min 52sec
Run 3 time: 2hr 39min 31sec
Run 4 time: 2hr 33min 57sec
Run 5 time: 2hr 41min 6sec

Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 744708
EXTENT COUNT FOR OUTPUT FILES = 749868
EXTENT COUNT FOR OUTPUT FILES = 721862
EXTENT COUNT FOR OUTPUT FILES = 635301
EXTENT COUNT FOR OUTPUT FILES = 689263

The times are bad and the fragmentation level is also bad.

If I add just the first patch, "GFS2: Set of distributed preferences for rgrps", the performance improves, but the fragmentation is worse (I only did three iterations this time):

Run times:
Run 1 time: 2hr 2min 47sec
Run 2 time: 2hr 8min 37sec
Run 3 time: 2hr 10min 0sec

Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 1011217
EXTENT COUNT FOR OUTPUT FILES = 1025973
EXTENT COUNT FOR OUTPUT FILES = 1070163

With all three patches, both performance and fragmentation are good:

Run 1 time: 2hr 11min 11sec
Run 2 time: 2hr 8min 57sec
Run 3 time: 2hr 10min 43sec
Run 4 time: 2hr 12min 50sec
Run 5 time: 2hr 15min 32sec
Run 6 time: 2hr 10min 58sec

EXTENT COUNT FOR OUTPUT FILES = 424329
EXTENT COUNT FOR OUTPUT FILES = 430663
EXTENT COUNT FOR OUTPUT FILES = 430186
EXTENT COUNT FOR OUTPUT FILES = 421635
EXTENT COUNT FOR OUTPUT FILES = 416706
EXTENT COUNT FOR OUTPUT FILES = 418887

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
Bob Peterson (3):
  GFS2: Set of distributed preferences for rgrps
  GFS2: Only increase rs_sizehint
  GFS2: If we use up our block reservation, request more next time

 fs/gfs2/file.c   |  3 ++-
 fs/gfs2/incore.h |  1 +
 fs/gfs2/rgrp.c   | 69 ++++++++++++++++++++++++++++++++++++++++++++++++------
 fs/gfs2/rgrp.h   |  1 +
 4 files changed, 68 insertions(+), 6 deletions(-)
--
1.9.3
[Cluster-devel] [GFS2 PATCH 3/3] GFS2: If we use up our block reservation, request more next time
If we run out of blocks for a given multi-block allocation, we obviously did not reserve enough. We should reserve more blocks for the next reservation to reduce fragmentation. This patch increases the size hint for reservations when they run out.
---
 fs/gfs2/rgrp.c | 3 +++
 fs/gfs2/rgrp.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index f4e4a0c..9150207 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -2251,6 +2251,9 @@ static void gfs2_adjust_reservation(struct gfs2_inode *ip,
 			trace_gfs2_rs(rs, TRACE_RS_CLAIM);
 			if (rs->rs_free && !ret)
 				goto out;
+			/* We used up our block reservation, so we should
+			   reserve more blocks next time. */
+			atomic_add(RGRP_RSRV_ADDBLKS, &rs->rs_sizehint);
 		}
 		__rs_deltree(rs);
 	}
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index 5d8f085..b104f4a 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -20,6 +20,7 @@
  */
 #define RGRP_RSRV_MINBYTES 8
 #define RGRP_RSRV_MINBLKS ((u32)(RGRP_RSRV_MINBYTES * GFS2_NBBY))
+#define RGRP_RSRV_ADDBLKS 64

 struct gfs2_rgrpd;
 struct gfs2_sbd;
--
1.9.3
[Cluster-devel] [GFS2 PATCH 1/3] GFS2: Set of distributed preferences for rgrps
This patch tries to use the journal numbers to evenly distribute which node prefers which resource group for block allocations. This is to help performance.
---
 fs/gfs2/incore.h |  1 +
 fs/gfs2/rgrp.c   | 66 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 39e7e99..1b89918 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -97,6 +97,7 @@ struct gfs2_rgrpd {
 #define GFS2_RDF_CHECK		0x10000000 /* check for unlinked inodes */
 #define GFS2_RDF_UPTODATE	0x20000000 /* rg is up to date */
 #define GFS2_RDF_ERROR		0x40000000 /* error in rg */
+#define GFS2_RDF_PREFERRED	0x80000000 /* This rgrp is preferred */
 #define GFS2_RDF_MASK		0xf0000000 /* mask for internal flags */
 	spinlock_t rd_rsspin;           /* protects reservation related vars */
 	struct rb_root rd_rstree;       /* multi-block reservation tree */
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7474c41..f4e4a0c 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -936,7 +936,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
 		rgd->rd_gl->gl_vm.start = rgd->rd_addr * bsize;
 		rgd->rd_gl->gl_vm.end = rgd->rd_gl->gl_vm.start + (rgd->rd_length * bsize) - 1;
 		rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
-		rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
+		rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
 		if (rgd->rd_data > sdp->sd_max_rg_data)
 			sdp->sd_max_rg_data = rgd->rd_data;
 		spin_lock(&sdp->sd_rindex_spin);
@@ -955,6 +955,36 @@ fail:
 }

 /**
+ * set_rgrp_preferences - Run all the rgrps, selecting some we prefer to use
+ * @sdp: the GFS2 superblock
+ *
+ * The purpose of this function is to select a subset of the resource groups
+ * and mark them as PREFERRED. We do it in such a way that each node prefers
+ * to use a unique set of rgrps to minimize glock contention.
+ */
+static void set_rgrp_preferences(struct gfs2_sbd *sdp)
+{
+	struct gfs2_rgrpd *rgd, *first;
+	int i;
+
+	/* Skip an initial number of rgrps, based on this node's journal ID.
+	   That should start each node out on its own set. */
+	rgd = gfs2_rgrpd_get_first(sdp);
+	for (i = 0; i < sdp->sd_lockstruct.ls_jid; i++)
+		rgd = gfs2_rgrpd_get_next(rgd);
+	first = rgd;
+
+	do {
+		rgd->rd_flags |= GFS2_RDF_PREFERRED;
+		for (i = 0; i < sdp->sd_journals; i++) {
+			rgd = gfs2_rgrpd_get_next(rgd);
+			if (rgd == first)
+				break;
+		}
+	} while (rgd != first);
+}
+
+/**
  * gfs2_ri_update - Pull in a new resource index from the disk
  * @ip: pointer to the rindex inode
  *
@@ -973,6 +1003,8 @@ static int gfs2_ri_update(struct gfs2_inode *ip)
 	if (error < 0)
 		return error;

+	set_rgrp_preferences(sdp);
+
 	sdp->sd_rindex_uptodate = 1;
 	return 0;
 }
@@ -1891,6 +1923,25 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, const struct gfs2_rgrpd *b
 }

 /**
+ * fast_to_acquire - determine if a resource group will be fast to acquire
+ *
+ * If this is one of our preferred rgrps, it should be quicker to acquire,
+ * because we tried to set ourselves up as dlm lock master.
+ */
+static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
+{
+	struct gfs2_glock *gl = rgd->rd_gl;
+
+	if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
+	    !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
+	    !test_bit(GLF_DEMOTE, &gl->gl_flags))
+		return 1;
+	if (rgd->rd_flags & GFS2_RDF_PREFERRED)
+		return 1;
+	return 0;
+}
+
+/**
  * gfs2_inplace_reserve - Reserve space in the filesystem
  * @ip: the inode to reserve space for
  * @ap: the allocation parameters
@@ -1932,10 +1983,15 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, const struct gfs2_alloc_parms *a
 			rg_locked = 0;
 			if (skip && skip--)
 				goto next_rgrp;
-			if (!gfs2_rs_active(rs) && (loops < 2) &&
-			    gfs2_rgrp_used_recently(rs, 1000) &&
-			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
-				goto next_rgrp;
+			if (!gfs2_rs_active(rs)) {
+				if (loops == 0 &&
+				    !fast_to_acquire(rs->rs_rbm.rgd))
+					goto next_rgrp;
+				if ((loops < 2) &&
+				    gfs2_rgrp_used_recently(rs, 1000) &&
+				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+					goto next_rgrp;
+			}
 			error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
[Cluster-devel] [GFS2 PATCH 2/3] GFS2: Only increase rs_sizehint
If an application does a sequence of (1) big write, (2) little write, we don't necessarily want to reset the size hint based on the smaller size. The fact that they did any big writes implies they may do more, and therefore we should try to allocate bigger block reservations, even if the last few were small writes. Therefore this patch changes function gfs2_size_hint so that the size hint can only grow; it cannot shrink. This is especially important where there are multiple writers.
---
 fs/gfs2/file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 80dd44d..5ebe568 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -337,7 +337,8 @@ static void gfs2_size_hint(struct file *filep, loff_t offset, size_t size)
 	size_t blks = (size + sdp->sd_sb.sb_bsize - 1) >> sdp->sd_sb.sb_bsize_shift;
 	int hint = min_t(size_t, INT_MAX, blks);

-	atomic_set(&ip->i_res->rs_sizehint, hint);
+	if (hint > atomic_read(&ip->i_res->rs_sizehint))
+		atomic_set(&ip->i_res->rs_sizehint, hint);
 }

 /**
--
1.9.3
[Cluster-devel] [gfs2-utils PATCH] fsck.gfs2: Detect and correct corrupt journals
Hi,

This patch changes fsck.gfs2 so that it can detect corruption in the journal sequence numbering scheme. If the corruption is within a certain tolerance (currently 10 sequencing errors), the journal sequence numbers are fixed and the journal is replayed as normal. If there are more errors than can be tolerated, the problem is detected and the journal may be cleared.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
 gfs2/fsck/fs_recovery.c | 54 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 43 insertions(+), 11 deletions(-)

diff --git a/gfs2/fsck/fs_recovery.c b/gfs2/fsck/fs_recovery.c
index bc4210d..5aa455e 100644
--- a/gfs2/fsck/fs_recovery.c
+++ b/gfs2/fsck/fs_recovery.c
@@ -17,6 +17,7 @@
 #include "util.h"

 #define JOURNAL_NAME_SIZE 16
+#define JOURNAL_SEQ_TOLERANCE 10

 unsigned int sd_found_jblocks = 0, sd_replayed_jblocks = 0;
 unsigned int sd_found_metablocks = 0, sd_replayed_metablocks = 0;
@@ -138,10 +139,15 @@ static int buf_lo_scan_elements(struct gfs2_inode *ip, unsigned int start,
 	check_magic = ((struct gfs2_meta_header *)
 		       (bh_ip->b_data))->mh_magic;
 	check_magic = be32_to_cpu(check_magic);
-	if (check_magic != GFS2_MAGIC)
+	if (check_magic != GFS2_MAGIC) {
+		log_err(_("Journal corruption detected at block #"
+			  "%lld (0x%llx) for journal+0x%x.\n"),
+			(unsigned long long)blkno, (unsigned long long)blkno,
+			start);
 		error = -EIO;
-	else
+	} else {
 		bmodified(bh_ip);
+	}

 	brelse(bh_log);
 	brelse(bh_ip);
@@ -313,8 +319,11 @@ static int foreach_descriptor(struct gfs2_inode *ip, unsigned int start,
 			brelse(bh);
 			continue;
 		}
-		if (error == 1)
+		if (error == 1) {
+			log_err(_("Journal corruption detected at "
+				  "journal+0x%x.\n"), start);
 			error = -EIO;
+		}
 		bmodified(bh);
 		brelse(bh);
 		return error;
@@ -354,10 +363,13 @@ static int foreach_descriptor(struct gfs2_inode *ip, unsigned int start,
 }

 /**
- * fix_journal_seq_no - Fix log header sequencing problems
+ * check_journal_seq_no - Check and fix log header sequencing problems
  * @ip: the journal incore inode
+ * @fix: if 1, fix the sequence numbers, otherwise just report the problem
+ *
+ * Returns: The number of sequencing errors (hopefully none).
  */
-static int fix_journal_seq_no(struct gfs2_inode *ip)
+static int check_journal_seq_no(struct gfs2_inode *ip, int fix)
 {
 	int error = 0, wrapped = 0;
 	uint32_t jd_blocks = ip->i_di.di_size / ip->i_sbd->sd_sb.sb_bsize;
@@ -368,6 +380,7 @@
 	uint64_t dblock;
 	uint32_t extlen;
 	struct gfs2_buffer_head *bh;
+	int seq_errors = 0;

 	memset(&lh, 0, sizeof(lh));
 	for (blk = 0; blk < jd_blocks; blk++) {
@@ -389,6 +402,10 @@
 			prev_seq = lh.lh_sequence;
 			continue;
 		}
+		if (!fix) {
+			seq_errors++;
+			continue;
+		}
 		log_err( _("Journal block %u (0x%x): sequence no. 0x%llx out of order.\n"), blk, blk, lh.lh_sequence);
 		log_info( _("Low: 0x%llx, High: 0x%llx, Prev: 0x%llx\n"),
@@ -404,7 +421,9 @@
 		gfs2_log_header_out(&lh, bh);
 		brelse(bh);
 	}
-	return 0;
+	if (seq_errors && fix)
+		log_err(_("%d sequence errors fixed.\n"), seq_errors);
+	return seq_errors;
 }

 /**
@@ -456,6 +475,15 @@ static int gfs2_recover_journal(struct gfs2_inode *ip, int j, int preen,
 	osi_list_init(&sd_revoke_list);

 	error = gfs2_find_jhead(ip, &head);
+	if (!error) {
+		error = check_journal_seq_no(ip, 0);
+		if (error > JOURNAL_SEQ_TOLERANCE) {
+			log_err( _("Journal #%d (\"journal%d\") has %d "
+				   "sequencing errors; tolerance is %d.\n"),
+				 j+1, j, error, JOURNAL_SEQ_TOLERANCE);
+			goto out;
+		}
+	}
 	if (error) {
 		if (opts.no) {
 			log_err( _("Journal #%d (\"journal%d\") is corrupt\n"), j+1, j);
@@ -481,7 +509,7 @@ static int gfs2_recover_journal(struct gfs2_inode *ip, int j, int
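The tolerance check can be modeled loosely in userspace C. This is a deliberate simplification of check_journal_seq_no: the real code walks on-disk log headers via gfs2_find_jhead and tracks low/high bounds, while this sketch just counts breaks in a sequence that is allowed exactly one wrap point. The function name and the one-wrap rule as written here are my own framing.

```c
#include <stdint.h>

/* Count out-of-order log-header sequence numbers. A single downward jump
 * (the journal wrap point) is legal; every other break in the +1 chain
 * counts as a sequencing error, like the errors fsck.gfs2 tolerates up to
 * JOURNAL_SEQ_TOLERANCE (10 in the patch). */
static int count_seq_errors(const uint64_t *seq, int n)
{
	int errs = 0, wrapped = 0, i;
	uint64_t prev = seq[0];

	for (i = 1; i < n; i++) {
		if (seq[i] == prev + 1) {	/* normal increment */
			prev = seq[i];
			continue;
		}
		if (!wrapped && seq[i] < prev) {	/* the one legal wrap */
			wrapped = 1;
			prev = seq[i];
			continue;
		}
		errs++;			/* anything else is corruption */
		prev = seq[i];		/* resync, as a repair would */
	}
	return errs;
}
```

A healthy journal such as {5,6,7,1,2,3} scores zero (one wrap, no gaps); {5,6,9,1,2,3} scores one for the 7-to-9 gap, well inside the tolerance of 10; a journal with multiple wrap points scores an error per extra wrap, which is exactly the multiple-wrap condition the patch is designed to catch.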
Re: [Cluster-devel] [PATCH 1/2] fsck.gfs2: Improve reporting of pass timings
- Original Message - The days value is not currently reported in the pass timings but the hours value is still reported modulo 24. Drop the use of gmtime(3) as it's more appropriate for calendar time operations than elapsed time, and add a simple duration reporting function which matches the existing output format. Signed-off-by: Andrew Price anpr...@redhat.com --- ACK Bob Peterson Red Hat File Systems
Re: [Cluster-devel] [PATCH 2/2] mkfs.gfs2: Revert default resource group size
- Original Message - Choose a more performant resource group size than the maximum by default until gfs2 is more tuned for the larger number of bitmap blocks required for 2GB resource groups. Signed-off-by: Andrew Price anpr...@redhat.com --- gfs2/mkfs/main_mkfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gfs2/mkfs/main_mkfs.c b/gfs2/mkfs/main_mkfs.c index b6353dd..2590ec8 100644 --- a/gfs2/mkfs/main_mkfs.c +++ b/gfs2/mkfs/main_mkfs.c @@ -147,7 +147,7 @@ static void opts_init(struct mkfs_opts *opts) opts-bsize = GFS2_DEFAULT_BSIZE; opts-jsize = GFS2_DEFAULT_JSIZE; opts-qcsize = GFS2_DEFAULT_QCSIZE; - opts-rgsize = GFS2_MAX_RGSIZE; + opts-rgsize = GFS2_DEFAULT_RGSIZE; opts-lockproto = lock_dlm; opts-locktable = ; opts-confirm = 1; -- 1.9.3 ACK Bob Peterson Red Hat File Systems
[Cluster-devel] [GFS2 PATCH] GFS2: Simplify directory filling, hashing and unhashing functions
Hi,

This patch implements a new cookie encoding scheme for encoding, decoding and filling directory entries. This new scheme prevents duplicate cookies, so it eliminates the need to sort the entries to prevent duplicates. Thus, it simplifies the code greatly.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
 fs/gfs2/dir.c | 165 +++---
 1 file changed, 66 insertions(+), 99 deletions(-)

diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 5d4261f..039479f 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -58,7 +58,6 @@
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/buffer_head.h>
-#include <linux/sort.h>
 #include <linux/gfs2_ondisk.h>
 #include <linux/crc32.h>
 #include <linux/vmalloc.h>
@@ -80,8 +79,48 @@

 #define MAX_RA_BLOCKS 32 /* max read-ahead blocks */

-#define gfs2_disk_hash2offset(h) (((u64)(h)) >> 1)
-#define gfs2_dir_offset2hash(p) ((u32)(((u64)(p)) << 1))
+/**
+ * Notes about the directory hash table:
+ *
+ * Hash table index is hash >> (32 - dip->i_depth). Max directory depth is 17.
+ * 32 - 17 = 15. The hash value is 32-bits, so 0x1 15 = 0x2.
+ * So the maximum index to the hash table is 0x1.
+ *
+ * At the same time, pages are at most 4K, and GFS2 dirents are 96 bytes.
+ * According to other comments, 99 is the maximum number of entries that can
+ * fit in a single leaf block. However, if you do the math: Take one 4K block,
+ * subtract out the leaf header, so 0x1000 - 0x68 = 0xf98 bytes available.
+ * The minimum dirent, including a 1-character file name, is 0x30 bytes long.
+ * So 0xf98 / 0x30 = 0x53, which is 83 decimal. So the theoretical limit is
+ * 83 index entries per leaf block.
+ *
+ * So we can represent the index number with 17 bits, plus the index in 7.
+ * Therefore, we can represent the offset to any dirent by a 32-bit number:
+ * 18-bits for hash table index, 7 for page index, 8 for leaf block extension
+ *
+ *                0                   1
+ *                01234567 89abcdef 01234567 89abcdef
+ *                iooo
+ *                leafblk  Hash table index  Offset
+ * Leaf blk:      (0xff)
+ * HTI Max Value: 1 (0x1)
+ * Pg offset Max: 111 (0x7f)
+ **/
+
+#define gfs2_dir_offset2pge(p) (((u32)p >> 24) & 0xff) /* leaf extension blk */
+#define gfs2_dir_pge2offset(p) (p << 24)
+
+#define gfs2_dir_offset2hti(p) (((u32)p >> 7) & 0x01) /* hash tbl index */
+#define gfs2_dir_hti2offset(p) (p << 7)
+
+#define gfs2_dir_offset2pgo(p) ((u32)p & 0x7f) /* leaf blk page offset */
+#define gfs2_dir_pgo2offset(p) (p)
+
+static inline u32 dirloc2offset(u32 page, u32 hti, u32 off)
+{
+	return (gfs2_dir_pge2offset(page) | gfs2_dir_hti2offset(hti) |
+		gfs2_dir_pgo2offset(off));
+}

 struct qstr gfs2_qdot __read_mostly;
 struct qstr gfs2_qdotdot __read_mostly;
@@ -1176,48 +1215,6 @@ out_kfree:
 }

 /**
- * compare_dents - compare directory entries by hash value
- * @a: first dent
- * @b: second dent
- *
- * When comparing the hash entries of @a to @b:
- *   gt: returns 1
- *   lt: returns -1
- *   eq: returns 0
- */
-
-static int compare_dents(const void *a, const void *b)
-{
-	const struct gfs2_dirent *dent_a, *dent_b;
-	u32 hash_a, hash_b;
-	int ret = 0;
-
-	dent_a = *(const struct gfs2_dirent **)a;
-	hash_a = be32_to_cpu(dent_a->de_hash);
-
-	dent_b = *(const struct gfs2_dirent **)b;
-	hash_b = be32_to_cpu(dent_b->de_hash);
-
-	if (hash_a > hash_b)
-		ret = 1;
-	else if (hash_a < hash_b)
-		ret = -1;
-	else {
-		unsigned int len_a = be16_to_cpu(dent_a->de_name_len);
-		unsigned int len_b = be16_to_cpu(dent_b->de_name_len);
-
-		if (len_a > len_b)
-			ret = 1;
-		else if (len_a < len_b)
-			ret = -1;
-		else
-			ret = memcmp(dent_a + 1, dent_b + 1, len_a);
-	}
-
-	return ret;
-}
-
-/**
  * do_filldir_main - read out directory entries
  * @dip: The GFS2 inode
  * @ctx: what to feed the entries to
@@ -1225,11 +1222,6 @@ static int compare_dents(const void *a, const void *b)
  * @entries: the number of entries in darr
  * @copied: pointer to int that's non-zero if a entry has been copied out
  *
- * Jump through some hoops to make sure that if there are hash collsions,
- * they are read out at the beginning of a buffer. We want to minimize
- * the possibility that they will fall into different readdir buffers or
- * that someone will want to seek to that location.
- *
  * Returns: errno, 0 if the actor tells you to stop
  */

@@ -1237,42 +1229,15 @@ static int do_filldir_main(struct gfs2_inode *dip, struct dir_context *ctx,
			   const struct gfs2_dirent **darr, u32 entries
Re: [Cluster-devel] [PATCH 1/3] gfs2: bugger off early if O_CREAT open finds a directory
- Original Message -
Signed-off-by: Al Viro v...@zeniv.linux.org.uk
---
 fs/gfs2/inode.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index c4ed823..310e248 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -624,6 +624,11 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
 	inode = gfs2_dir_search(dir, &dentry->d_name, !S_ISREG(mode) || excl);
 	error = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
+		if (S_ISDIR(inode->i_mode)) {
+			iput(inode);
+			inode = ERR_PTR(-EISDIR);
+			goto fail_gunlock;
+		}
 		d = d_splice_alias(inode, dentry);
 		error = PTR_ERR(d);
 		if (IS_ERR(d)) {
--
1.7.10.4

Hm. Seems wrong that it should return 0 if the dirent exists (mkdir of a directory that already exists) but it looks like it already behaves that way. So I guess so. It may warrant further investigation.

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [PATCH 1/3] gfs2: bugger off early if O_CREAT open finds a directory
- Original Message -
On Wed, Nov 19, 2014 at 03:33:29PM -0500, Bob Peterson wrote:
- Original Message -
Signed-off-by: Al Viro v...@zeniv.linux.org.uk
---
 fs/gfs2/inode.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index c4ed823..310e248 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -624,6 +624,11 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
 	inode = gfs2_dir_search(dir, &dentry->d_name, !S_ISREG(mode) || excl);
 	error = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
+		if (S_ISDIR(inode->i_mode)) {
+			iput(inode);
+			inode = ERR_PTR(-EISDIR);
+			goto fail_gunlock;
+		}
 		d = d_splice_alias(inode, dentry);
 		error = PTR_ERR(d);
 		if (IS_ERR(d)) {

Hm. Seems wrong that it should return 0 if the dirent exists (mkdir of a directory that already exists) but it looks like it already behaves that way. So I guess so. It may warrant further investigation.

It doesn't. Note that "!S_ISREG(mode) || excl" in there - *anything* other than ->create() will treat existing entries with the same name as EEXIST. The only case when we can possibly get into that "if (!IS_ERR(inode))" is ->create() hitting an existing file. And with this change it narrows to ->create() hitting an existing non-directory. That allows the next patch to use d_instantiate() instead of d_splice_alias() - for non-directories it's the same thing, since dentry is already hashed here and we don't need to avoid multiple aliases. Which kills one of the two places in the tree where d_splice_alias() is called for an already hashed dentry (another is d_add_ci() and that call also goes down - see vfs.git#for-next for that one).

Okay, thanks for clearing that up.

ACK

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] gfs2: use kvfree() instead of open-coding it
- Original Message -
Signed-off-by: Al Viro v...@zeniv.linux.org.uk
---
 fs/gfs2/dir.c   | 40 
 fs/gfs2/quota.c |  9 ++---
 2 files changed, 10 insertions(+), 39 deletions(-)

diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 5d4261f..c247fed 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -365,22 +365,15 @@ static __be64 *gfs2_dir_get_hash_table(struct gfs2_inode *ip)
 	ret = gfs2_dir_read_data(ip, hc, hsize);
 	if (ret < 0) {
-		if (is_vmalloc_addr(hc))
-			vfree(hc);
-		else
-			kfree(hc);
+		kvfree(hc);
 		return ERR_PTR(ret);
 	}

 	spin_lock(&inode->i_lock);
-	if (ip->i_hash_cache) {
-		if (is_vmalloc_addr(hc))
-			vfree(hc);
-		else
-			kfree(hc);
-	} else {
+	if (ip->i_hash_cache)
+		kvfree(hc);
+	else
 		ip->i_hash_cache = hc;
-	}
 	spin_unlock(&inode->i_lock);

 	return ip->i_hash_cache;
@@ -396,10 +389,7 @@ void gfs2_dir_hash_inval(struct gfs2_inode *ip)
 {
 	__be64 *hc = ip->i_hash_cache;
 	ip->i_hash_cache = NULL;
-	if (is_vmalloc_addr(hc))
-		vfree(hc);
-	else
-		kfree(hc);
+	kvfree(hc);
 }

 static inline int gfs2_dirent_sentinel(const struct gfs2_dirent *dent)
@@ -1168,10 +1158,7 @@ fail:
 	gfs2_dinode_out(dip, dibh->b_data);
 	brelse(dibh);
out_kfree:
-	if (is_vmalloc_addr(hc2))
-		vfree(hc2);
-	else
-		kfree(hc2);
+	kvfree(hc2);
 	return error;
 }

@@ -1302,14 +1289,6 @@ static void *gfs2_alloc_sort_buffer(unsigned size)
 	return ptr;
 }

-static void gfs2_free_sort_buffer(void *ptr)
-{
-	if (is_vmalloc_addr(ptr))
-		vfree(ptr);
-	else
-		kfree(ptr);
-}
-
 static int gfs2_dir_read_leaf(struct inode *inode, struct dir_context *ctx,
			      int *copied, unsigned *depth,
			      u64 leaf_no)
@@ -1393,7 +1372,7 @@ static int gfs2_dir_read_leaf(struct inode *inode, struct dir_context *ctx,
out_free:
 	for(i = 0; i < leaf; i++)
 		brelse(larr[i]);
-	gfs2_free_sort_buffer(larr);
+	kvfree(larr);
out:
 	return error;
 }

@@ -2004,10 +1983,7 @@ out_rlist:
 	gfs2_rlist_free(&rlist);
 	gfs2_quota_unhold(dip);
out:
-	if (is_vmalloc_addr(ht))
-		vfree(ht);
-	else
-		kfree(ht);
+	kvfree(ht);
 	return error;
 }

diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 64b29f7..c8b148b 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -1360,13 +1360,8 @@ void gfs2_quota_cleanup(struct gfs2_sbd *sdp)

 	gfs2_assert_warn(sdp, !atomic_read(&sdp->sd_quota_count));

-	if (sdp->sd_quota_bitmap) {
-		if (is_vmalloc_addr(sdp->sd_quota_bitmap))
-			vfree(sdp->sd_quota_bitmap);
-		else
-			kfree(sdp->sd_quota_bitmap);
-		sdp->sd_quota_bitmap = NULL;
-	}
+	kvfree(sdp->sd_quota_bitmap);
+	sdp->sd_quota_bitmap = NULL;
 }

 static void quotad_error(struct gfs2_sbd *sdp, const char *msg, int error)
--
1.7.10.4

ACK

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] gfs2: gfs2_dir_get_hash_table(): avoiding deferred vfree() is easy here...
- Original Message -
vfree() is allowed under spinlock these days, but it's cheaper when it doesn't step into deferred case and here it's very easy to avoid.

Signed-off-by: Al Viro v...@zeniv.linux.org.uk
---
 fs/gfs2/dir.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index c247fed..c5a34f0 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -370,11 +370,12 @@ static __be64 *gfs2_dir_get_hash_table(struct gfs2_inode *ip)
 	}

 	spin_lock(&inode->i_lock);
-	if (ip->i_hash_cache)
-		kvfree(hc);
-	else
+	if (likely(!ip->i_hash_cache)) {
 		ip->i_hash_cache = hc;
+		hc = NULL;
+	}
 	spin_unlock(&inode->i_lock);
+	kvfree(hc);

 	return ip->i_hash_cache;
 }
--
1.7.10.4

ACK

Bob Peterson
Red Hat File Systems
[Cluster-devel] [GFS2 PATCH] GFS2: Write the log descriptor entries in the correct order
Hi, I recently discovered GFS2 was writing the log descriptor items in the journal in reverse order. For example, if I do a bunch of file creates, the result was a log header that looked something like this: 0x298 (j+ 242): Log descriptor, type 300 (Metadata) len:504, data1: 503 0xa6e9ce di0xa6e9a7 di0xa6e97b di0xa6e942 di 0xa6e911 di0xa6e8d2 di0xa6e898 di0xa6e86e di 0xa6e840 di0xa6e809 di0xa6e7d7 di0xa6e77e di 0xa6e747 di0xa6e721 di0xa6e6f8 di0xa6e6ca di 0xa6e680 di0xa6e63d di0xa6e619 di0xa6e5f4 di 0xa6e5d1 di0xa6e598 di0xa6e520 di0xa6e4f4 di 0xa6e4c1 di0xa6e489 di0xa6e45c di0xa6e427 di As you can see, the blocks are reverse-sorted. This patch changes the order so that they're written in sorted order. Log headers written with the patch look more like this: 0x36e (j+ 318): Log descriptor, type 300 (Metadata) len:504, data1: 503 0x4d5a80 di0x4d5aa0 lf0x4d5aa1 di0x4d5aa2 di 0x4d5aa3 di0x4d5aa4 di0x4d5aa5 di0x4d5aa6 di 0x4d5aa7 di0x4d5aa8 di0x4d5aa9 di0x4d5aaa di 0x4d5aab di0x4d5aac di0x4d5aad di0x4d5aae di 0x4d5aaf di0x4d5ab0 di0x4d5ab1 di0x4d5ab2 di 0x4d5ab3 di0x4d5ab4 di0x4d5ab5 di0x4d5ab6 di 0x4d5ab7 di0x4d5ab8 di0x4d5ab9 di0x4d5aba di The blocks following the log descriptor are written sequentially, so the sort order won't affect performance at all, but it makes the journal easier to decipher for debugging purposes. Patch description: This patch changes the order in which the blocks are processed by function gfs2_before_commit. Before, it was processing them in the wrong order such that the blocks would be layed out and written in reverse order. This changes the order to the correct order. 
Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 2c1ae86..d644470 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -444,7 +444,7 @@ static void gfs2_before_commit(struct gfs2_sbd *sdp, unsigned int limit,
 	ptr = (__be64 *)(ld + 1);

 	n = 0;
-	list_for_each_entry_continue(bd1, blist, bd_list) {
+	list_for_each_entry_continue_reverse(bd1, blist, bd_list) {
 		*ptr++ = cpu_to_be64(bd1->bd_bh->b_blocknr);
 		if (is_databuf) {
 			gfs2_check_magic(bd1->bd_bh);
@@ -459,7 +459,7 @@ static void gfs2_before_commit(struct gfs2_sbd *sdp, unsigned int limit,
 	gfs2_log_lock(sdp);

 	n = 0;
-	list_for_each_entry_continue(bd2, blist, bd_list) {
+	list_for_each_entry_continue_reverse(bd2, blist, bd_list) {
 		get_bh(bd2->bd_bh);
 		gfs2_log_unlock(sdp);
 		lock_buffer(bd2->bd_bh);
Re: [Cluster-devel] [GFS2 PATCH] GFS2: Write the log descriptor entries in the correct order
- Original Message - As you can see, the blocks are reverse-sorted. This patch changes the order so that they're written in sorted order. Log headers written with the patch look more like this: Does that mean that this patch got the ordering the wrong way around then? https://git.kernel.org/cgit/linux/kernel/git/steve/gfs2-3.0-nmw.git/commit/fs/gfs2/lops.c?id=7f63257da1aeea9b2ee68f469b0f9f4a39e5dff8 Steve. No. It means I missed Ben's patch. Or maybe I've lost my mind. Sigh. Bob
Re: [Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: Change block_map to match bitmap
- Original Message - Hi Bob, On 22/01/15 20:41, Bob Peterson wrote: Hi, This patch changes the old block_map structure for fsck.gfs2 to the simpler bitmap structure so that we have a 1:1 correspondence. This was done to reduce memory requirements of fsck.gfs2. I'm curious as to whether we're losing any useful information here. I see an extra bread() for BLKST_USED blocks in check_data, is that the main performance compromise, and how severe is it? Cheers, Andy Hi, That's exactly where things get a bit sticky. It used to keep track of the various GFS2_METATYPE_* block types. For the most part, dinodes were straightforward because we only needed to distinguish between directories and non-directories. Free space is pretty straightforward too, with only a few corner cases where things were marked a bad type. Unlinked also wasn't a big problem. The problem comes with blocks marked Data. In GFS1, data blocks were data blocks. In GFS2, a data block could be data or metadata. In some cases we don't care because the height will tell us if it's real data. But things get sticky when we need to distinguish between different types of non-dinode metadata in fsck. For example, what if a dinode has indirect blocks that are not really indirect blocks, but say, an extended attribute block? The old fsck could distinguish between them pretty easily because of their unique types. Going by just the bitmap alone is ambiguous, so sometimes we need to know more details, especially in cases where we have a block that's referenced multiple times: it's much more likely that one of the references is for the wrong type, and in that case, we do the extra read. This shouldn't impact performance, except for duplicate references (which is rare) and needs to be done. There are probably cases where fsck is now making assumptions about the previously processed block type for indirect blocks, so in that respect, the information is lost. 
However, I've tried to minimize the risk, and rigorous testing has shown that it works in many very difficult situations. You can tell my testing has been good, since I've posted so many bug fix patches recently that I've uncovered. :) In that respect, this new fsck is better than the previous. Still, there may very well be cases where the lost information causes an unforeseen problem. I'm not sure we can do anything about that but test it thoroughly. Maybe I'll try to dummy up some potential problems (like the one I gave above with indirect versus extended attribute) to see how well it does. I suppose I could try to dummy up every permutation. That is likely to uncover even more problems, most of which already exist in today's fsck. Regards, Bob Peterson Red Hat File Systems
[Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: Fix journal sequence number reporting problem
Hi,

This bug was spotted by coverity. fsck.gfs2 should print a line reporting the number of journal sequence errors fixed, but it was coded improperly so the message was never displayed. This patch fixes the code so that the message is printed properly.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
diff --git a/gfs2/fsck/fs_recovery.c b/gfs2/fsck/fs_recovery.c
index abb8187..9be5a95 100644
--- a/gfs2/fsck/fs_recovery.c
+++ b/gfs2/fsck/fs_recovery.c
@@ -436,10 +436,9 @@ static int check_journal_seq_no(struct gfs2_inode *ip, int fix)
			 (unsigned long long)lowest_seq,
			 (unsigned long long)highest_seq,
			 (unsigned long long)prev_seq);
-		if (!fix) {
-			seq_errors++;
+		seq_errors++;
+		if (!fix)
 			continue;
-		}
 		highest_seq++;
 		lh.lh_sequence = highest_seq;
 		prev_seq = lh.lh_sequence;
@@ -449,8 +448,10 @@ static int check_journal_seq_no(struct gfs2_inode *ip, int fix)
 		gfs2_log_header_out(&lh, bh);
 		brelse(bh);
 	}
-	if (seq_errors && fix)
+	if (seq_errors && fix) {
 		log_err(_("%d sequence errors fixed.\n"), seq_errors);
+		seq_errors = 0;
+	}
 	return seq_errors;
 }
[Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: Fix a use-after-free in pass2
Hi,

This patch fixes a rare code path in pass2 that I spotted using an inode after it had been put (a use-after-free).

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
diff --git a/gfs2/fsck/pass2.c b/gfs2/fsck/pass2.c
index 27b7336..4ea322a 100644
--- a/gfs2/fsck/pass2.c
+++ b/gfs2/fsck/pass2.c
@@ -1928,13 +1928,14 @@ int pass2(struct gfs2_sbd *sdp)
 		ip = fsck_load_inode(sdp, dirblk);
 		cur_blks = ip->i_di.di_blocks;
 		error = check_metatree(ip, &pass2_fxns);
-		fsck_inode_put(&ip);
 		if (error < 0) {
 			stack;
+			fsck_inode_put(&ip);
 			return error;
 		}
 		if (ip->i_di.di_blocks != cur_blks)
 			reprocess_inode(ip, "current");
+		fsck_inode_put(&ip);
 	}
 	error = check_dir(sdp, dirblk, &pass2_fxns);
 	if (skip_this_pass || fsck_abort) /* if asked to skip the rest */
[Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: Reprocess nodes if anything changed
Hi, Before this patch, fsck.gfs2 called reprocess_inode in several places, especially when the number of blocks had changed from the original. That works in almost all cases. However, there's a corner case where a block may be deleted, due to errors (for example, a bad directory leaf), and another block takes its place. In that case the count of the number of blocks is still the same, but the inode should be reprocessed anyway, because the deleted blocks and replacement blocks may be at different locations or maybe of different types. A hash table block may be data when it's freed, but if the block is reused, it may be repurposed as a leaf block. That may confuse subsequent processing. Another reason to call reprocess_inode is when the goal block changes to a value outside the resource group, due to block allocations for repairs. Regards, Bob Peterson Red Hat File Systems Signed-off-by: Bob Peterson rpete...@redhat.com --- diff --git a/gfs2/fsck/lost_n_found.c b/gfs2/fsck/lost_n_found.c index 9672c7a..b2e35e2 100644 --- a/gfs2/fsck/lost_n_found.c +++ b/gfs2/fsck/lost_n_found.c @@ -174,7 +174,6 @@ void make_sure_lf_exists(struct gfs2_inode *ip) int add_inode_to_lf(struct gfs2_inode *ip){ char tmp_name[256]; __be32 inode_type; - uint64_t lf_blocks; struct gfs2_sbd *sdp = ip-i_sbd; int err = 0; uint32_t mode; @@ -184,7 +183,6 @@ int add_inode_to_lf(struct gfs2_inode *ip){ log_err( _(Trying to add lost+found to itself...skipping)); return 0; } - lf_blocks = lf_dip-i_di.di_blocks; if (sdp-gfs1) mode = gfs_to_gfs2_mode(ip); @@ -242,10 +240,6 @@ int add_inode_to_lf(struct gfs2_inode *ip){ tmp_name, strerror(errno)); exit(FSCK_ERROR); } - /* If the lf directory had new blocks added we have to mark them - properly in the bitmap so they're not freed. 
*/ - if (lf_dip-i_di.di_blocks != lf_blocks) - reprocess_inode(lf_dip, lost+found); /* This inode is linked from lost+found */ incr_link_count(ip-i_di.di_num, lf_dip, _(from lost+found)); diff --git a/gfs2/fsck/metawalk.c b/gfs2/fsck/metawalk.c index 16faafb..217bb07 100644 --- a/gfs2/fsck/metawalk.c +++ b/gfs2/fsck/metawalk.c @@ -1693,11 +1693,11 @@ int check_dir(struct gfs2_sbd *sdp, uint64_t block, struct metawalk_fxns *pass) { struct gfs2_inode *ip; int error = 0; - uint64_t cur_blks; + struct alloc_state as; ip = fsck_load_inode(sdp, block); - cur_blks = ip-i_di.di_blocks; + astate_save(ip, as); if (ip-i_di.di_flags GFS2_DIF_EXHASH) error = check_leaf_blks(ip, pass); @@ -1707,7 +1707,7 @@ int check_dir(struct gfs2_sbd *sdp, uint64_t block, struct metawalk_fxns *pass) if (error 0) stack; - if (ip-i_di.di_blocks != cur_blks) + if (astate_changed(ip, as)) reprocess_inode(ip, _(Current)); fsck_inode_put(ip); /* does a brelse */ diff --git a/gfs2/fsck/pass2.c b/gfs2/fsck/pass2.c index 4ea322a..74bcc26 100644 --- a/gfs2/fsck/pass2.c +++ b/gfs2/fsck/pass2.c @@ -1723,7 +1723,8 @@ build_it: static int check_system_dir(struct gfs2_inode *sysinode, const char *dirname, int builder(struct gfs2_sbd *sdp)) { - uint64_t iblock = 0, cur_blks; + uint64_t iblock = 0; + struct alloc_state as; struct dir_status ds = {0}; int error = 0; @@ -1740,14 +1741,14 @@ static int check_system_dir(struct gfs2_inode *sysinode, const char *dirname, pass2_fxns.private = (void *) ds; if (ds.q == gfs2_bad_block) { - cur_blks = sysinode-i_di.di_blocks; + astate_save(sysinode, as); /* First check that the directory's metatree is valid */ error = check_metatree(sysinode, pass2_fxns); if (error 0) { stack; return error; } - if (sysinode-i_di.di_blocks != cur_blks) + if (astate_changed(sysinode, as)) reprocess_inode(sysinode, _(System inode)); } error = check_dir(sysinode-i_sbd, iblock, pass2_fxns); @@ -1768,7 +1769,7 @@ static int check_system_dir(struct gfs2_inode *sysinode, const char *dirname, if 
(!ds.dotdir) { log_err( _(No '.' entry found for %s directory.\n), dirname); if (query( _(Is it okay to add '.' entry? (y/n) ))) { - cur_blks = sysinode-i_di.di_blocks; + astate_save(sysinode, as); log_warn( _(Adding '.' entry\n)); error = dir_add(sysinode, ., 1, (sysinode-i_di.di_num), (sysinode-i_sbd-gfs1 ? GFS_FILE_DIR : DT_DIR)); @@ -1777,7 +1778,7 @@ static int check_system_dir(struct
[Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: fix double-free bug
Hi,

This patch fixes a double-free bug that can occur when errors are found in an inode and, later, the inode is freed. The problem is that function free_metalist was using osi_list_del when removing buffer_heads from the inode in memory. It should be using osi_list_del_init so that the list head is reinitialized. Otherwise the list head is left with an unknown value, which gets interpreted later on as !list_empty(), so it tries to free the list a second time.

This also fixes a minor issue with function rangecheck_jmeta, which frees the buffer_head when it's done. The patch sets the returned *bh value back to NULL to ensure it's not reused. This is just a precaution I spotted while debugging, and it's worth doing.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
diff --git a/gfs2/fsck/fs_recovery.c b/gfs2/fsck/fs_recovery.c
index 3d8fe98..095d118 100644
--- a/gfs2/fsck/fs_recovery.c
+++ b/gfs2/fsck/fs_recovery.c
@@ -631,6 +631,7 @@ static int rangecheck_jmeta(struct gfs2_inode *ip, uint64_t block,
				  (unsigned long long)block,
				  (unsigned long long)block);
 			brelse(*bh);
+			*bh = NULL;
 			return meta_skip_further;
 		}
 	}
diff --git a/gfs2/fsck/metawalk.c b/gfs2/fsck/metawalk.c
index 1875b24..16faafb 100644
--- a/gfs2/fsck/metawalk.c
+++ b/gfs2/fsck/metawalk.c
@@ -1203,7 +1203,7 @@ static void free_metalist(struct gfs2_inode *ip, osi_list_t *mlp)
 			nbh = osi_list_entry(list->next,
					     struct gfs2_buffer_head,
					     b_altlist);
 			if (nbh == ip->i_bh)
-				osi_list_del(&nbh->b_altlist);
+				osi_list_del_init(&nbh->b_altlist);
 			else
 				brelse(nbh);
		}
[Cluster-devel] [fsck.gfs2 patch] fsck.gfs2: Rebuild system files if they don't have the SYS bit set
Hi, This patch checks for the GFS2_DIF_SYSTEM bit on system dinodes. If the bit is not set, the dinode is assumed to be corrupt and rebuilt. If the jindex is rebuilt, it needs to be re-read. Regards, Bob Peterson Red Hat File Systems Signed-off-by: Bob Peterson rpete...@redhat.com --- diff --git a/gfs2/fsck/pass1.c b/gfs2/fsck/pass1.c index 2841b8c..4348683 100644 --- a/gfs2/fsck/pass1.c +++ b/gfs2/fsck/pass1.c @@ -27,6 +27,7 @@ #include util.h #include link.h #include metawalk.h +#include fs_recovery.h struct special_blocks gfs1_rindex_blks; @@ -1241,7 +1242,8 @@ static int check_system_inode(struct gfs2_sbd *sdp, struct gfs2_inode **sysinode, const char *filename, int builder(struct gfs2_sbd *sdp), - enum gfs2_mark_block mark) + enum gfs2_mark_block mark, + struct gfs2_inode *sysdir, int needs_sysbit) { uint64_t iblock = 0; struct dir_status ds = {0}; @@ -1282,6 +1284,24 @@ static int check_system_inode(struct gfs2_sbd *sdp, if (mark == gfs2_inode_dir) dirtree_insert((*sysinode)-i_di.di_num); } + /* Make sure it's marked as a system file/directory */ + if (needs_sysbit + !((*sysinode)-i_di.di_flags GFS2_DIF_SYSTEM)) { + log_err( _(System inode %s is missing the 'system' + flag. It should be rebuilt.\n), filename); + if (sysdir query(_(Delete the corrupt %s system + inode? (y/n) ), filename)) { + inode_put(sysinode); + gfs2_dirent_del(sysdir, filename, + strlen(filename)); + /* Set the blockmap (but not bitmap) back to + 'free' so that it gets checked like any + normal dinode. */ + gfs2_blockmap_set(bl, iblock, gfs2_block_free); + log_err( _(Removed system inode \%s\.\n), +filename); + } + } } else log_info( _(System inode for '%s' is corrupt or missing.\n), filename); @@ -1290,18 +1310,22 @@ static int check_system_inode(struct gfs2_sbd *sdp, lost+found then, but we *need* our system inodes before we can do any of that. 
*/ if (!(*sysinode) || ds.q != mark) { - log_err( _(Invalid or missing %s system inode (should be %d, - is %d).\n), filename, mark, ds.q); + log_err( _(Invalid or missing %s system inode (should be + '%s', is '%s').\n), filename, +block_type_string(mark), +block_type_string(ds.q)); if (query(_(Create new %s system inode? (y/n) ), filename)) { log_err( _(Rebuilding system file \%s\\n), filename); error = builder(sdp); if (error) { - log_err( _(Error trying to rebuild system - file %s: Cannot continue\n), + log_err( _(Error rebuilding system + inode %s: Cannot continue\n), filename); return error; } + if (*sysinode == sdp-md.jiinode) + ji_update(sdp); fsck_blockmap_set(*sysinode, (*sysinode)-i_di.di_num.no_addr, filename, mark); @@ -1367,7 +1391,8 @@ static int check_system_inodes(struct gfs2_sbd *sdp) sdp-master_dir-i_di.di_num.no_addr, master, gfs2_inode_dir); if (check_system_inode(sdp, sdp-master_dir, master, - build_master, gfs2_inode_dir)) { + build_master, gfs2_inode_dir, + NULL, 1)) { stack; return -1; } @@ -1377,39 +1402,41 @@ static int check_system_inodes(struct gfs2_sbd *sdp) fsck_blockmap_set(sdp-md.rooti, sdp-md.rooti-i_di.di_num.no_addr, root, gfs2_inode_dir); if (check_system_inode(sdp, sdp-md.rooti, root, build_root, - gfs2_inode_dir
[Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: Check the integrity of the journal index
Hi, This patch checks the jindex system directory to make sure the entries all start with journal and so forth. If not, the jindex is deleted and rebuilt. As part of this patch, I moved where we read in the rindex file and rgrps to an earlier point in time, before the journals are replayed. This allows us to remove a dummied-up rgrp kludge in the code. However, if the replayed journal block is part of an rgrp, we need to refresh the rgrp based on the values rewritten from the journal. Regards, Bob Peterson Red Hat File Systems Signed-off-by: Bob Peterson rpete...@redhat.com --- diff --git a/gfs2/fsck/fs_recovery.c b/gfs2/fsck/fs_recovery.c index 095d118..4eaba1e 100644 --- a/gfs2/fsck/fs_recovery.c +++ b/gfs2/fsck/fs_recovery.c @@ -96,6 +96,30 @@ void gfs2_revoke_clean(struct gfs2_sbd *sdp) } } +static void refresh_rgrp(struct gfs2_sbd *sdp, struct rgrp_tree *rgd, +struct gfs2_buffer_head *bh, uint64_t blkno) +{ + int i; + + log_debug(_(Block is part of rgrp 0x%llx; refreshing the rgrp.\n), + (unsigned long long)rgd-ri.ri_addr); + for (i = 0; i rgd-ri.ri_length; i++) { + if (rgd-bits[i].bi_bh-b_blocknr != blkno) + continue; + + memcpy(rgd-bits[i].bi_bh-b_data, bh-b_data, sdp-bsize); + bmodified(rgd-bits[i].bi_bh); + if (i == 0) { /* this is the rgrp itself */ + if (sdp-gfs1) + gfs_rgrp_in((struct gfs_rgrp *)rgd-rg, + rgd-bits[0].bi_bh); + else + gfs2_rgrp_in(rgd-rg, rgd-bits[0].bi_bh); + } + break; + } +} + static int buf_lo_scan_elements(struct gfs2_inode *ip, unsigned int start, struct gfs2_log_descriptor *ld, __be64 *ptr, int pass) @@ -105,6 +129,7 @@ static int buf_lo_scan_elements(struct gfs2_inode *ip, unsigned int start, struct gfs2_buffer_head *bh_log, *bh_ip; uint64_t blkno; int error = 0; + struct rgrp_tree *rgd; if (pass != 1 || be32_to_cpu(ld-ld_type) != GFS2_LOG_DESC_METADATA) return 0; @@ -147,6 +172,9 @@ static int buf_lo_scan_elements(struct gfs2_inode *ip, unsigned int start, error = -EIO; } else { bmodified(bh_ip); + rgd = gfs2_blk2rgrpd(sdp, 
blkno); + if (rgd blkno rgd-ri.ri_data0) + refresh_rgrp(sdp, rgd, bh_ip, blkno); } brelse(bh_log); @@ -676,28 +704,8 @@ int replay_journals(struct gfs2_sbd *sdp, int preen, int force_check, for(i = 0; i sdp-md.journals; i++) { if (sdp-md.journal[i]) { - struct rgrp_tree rgd; - struct gfs2_bitmap bits; - - /* The real rgrp tree hasn't been built at this point, -* so we need to dummy one up that covers the whole -* file system so basic functions in check_metatree -* don't segfault. */ - rgd.start = sdp-sb_addr + 1; - rgd.length = 1; - bits.bi_bh = NULL; - bits.bi_start = 0; - bits.bi_len = sdp-fssize / GFS2_NBBY; - rgd.bits = bits; - rgd.ri.ri_addr = sdp-sb_addr + 1; - rgd.ri.ri_length = 1; - rgd.ri.ri_data0 = sdp-sb_addr + 2; - rgd.ri.ri_data = sdp-fssize - (sdp-sb_addr + 2); - - sdp-rgtree.osi_node = (struct osi_node *)rgd; error = check_metatree(sdp-md.journal[i], rangecheck_journal); - sdp-rgtree.osi_node = NULL; if (error) /* Don't use fsck_inode_put here because it's a system file and we need to dismantle it. */ @@ -707,8 +715,7 @@ int replay_journals(struct gfs2_sbd *sdp, int preen, int force_check, if (!sdp-md.journal[i]) { log_err(_(File system journal \journal%d\ is missing or corrupt: pass1 will try to - recreate it.\n), - i); + recreate it.\n), i); continue; } if (!error) { diff --git a/gfs2/fsck/initialize.c b/gfs2/fsck/initialize.c index 4e52262..043917c 100644 --- a/gfs2/fsck/initialize.c +++ b/gfs2/fsck/initialize.c @@ -142,7 +142,7 @@ static int set_block_ranges(struct gfs2_sbd *sdp) uint64_t rmin = 0
[Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: rgrp block count reform
Hi, In the past, pass5 has kept track of different counts for the different metadata types, but the counts did not correspond to the type in the bitmap. This patch changes pass5 so that it makes a little more sense. There should be a straightforward blockmap-to-bitmap conversion, and the counts should be based on the bitmap type. For example, before the patch, the rgrp's inode count would be kept in count[1], but in the bitmap, dinodes are of type 3. So this patch reworks that system so that the count of dinodes is kept in count[3] and so forth, and it uses a simple blockmap-to-bitmap to figure out the index to count. This breaks down for GFS1 which keeps a few extra numbers in the rgrp, however this patch compensates for that. This patch is being done as a precursor to another patch that reduces fsck's memory requirements by using a blockmap that's 1:1 with the bitmaps rather than today's blockmap that's 4:1. Regards, Bob Peterson Red Hat File Systems Signed-off-by: Bob Peterson rpete...@redhat.com --- diff --git a/gfs2/fsck/pass5.c b/gfs2/fsck/pass5.c index b2e8adf..2b8536d 100644 --- a/gfs2/fsck/pass5.c +++ b/gfs2/fsck/pass5.c @@ -12,94 +12,6 @@ #include fsck.h #include util.h -static int gfs1_convert_mark(uint8_t q, uint32_t *count) -{ - switch(q) { - - case gfs2_meta_inval: - case gfs2_inode_invalid: - /* Convert invalid metadata to free blocks */ - case gfs2_block_free: - count[0]++; - return GFS2_BLKST_FREE; - - case gfs2_block_used: - count[2]++; - return GFS2_BLKST_USED; - - case gfs2_inode_dir: - case gfs2_inode_file: - case gfs2_inode_lnk: - case gfs2_inode_device: - case gfs2_inode_fifo: - case gfs2_inode_sock: - count[1]++; - return GFS2_BLKST_DINODE; - - case gfs2_indir_blk: - case gfs2_leaf_blk: - /*case gfs2_meta_rgrp:*/ - case gfs2_jdata: /* gfs1 jdata blocks count as metadata and gfs1 - metadata is marked the same as gfs2 inode in the - bitmap. 
*/ - case gfs2_meta_eattr: - count[3]++; - return GFS2_BLKST_DINODE; - - case gfs2_freemeta: - count[4]++; - return GFS2_BLKST_UNLINKED; - - default: - log_err( _(Invalid block type %d found\n), q); - } - return -1; -} - -static int gfs2_convert_mark(uint8_t q, uint32_t *count) -{ - switch(q) { - - case gfs2_meta_inval: - case gfs2_inode_invalid: - /* Convert invalid metadata to free blocks */ - case gfs2_block_free: - count[0]++; - return GFS2_BLKST_FREE; - - case gfs2_block_used: - count[2]++; - return GFS2_BLKST_USED; - - case gfs2_inode_dir: - case gfs2_inode_file: - case gfs2_inode_lnk: - case gfs2_inode_device: - case gfs2_jdata: /* gfs1 jdata blocks count as metadata and gfs1 - metadata is marked the same as gfs2 inode in the - bitmap. */ - case gfs2_inode_fifo: - case gfs2_inode_sock: - count[1]++; - return GFS2_BLKST_DINODE; - - case gfs2_indir_blk: - case gfs2_leaf_blk: - case gfs2_meta_eattr: - count[2]++; - return GFS2_BLKST_USED; - - case gfs2_freemeta: - log_err( _(Invalid freemeta type %d found\n), q); - count[4]++; - return -1; - - default: - log_err( _(Invalid block type %d found\n), q); - } - return -1; -} - static int check_block_status(struct gfs2_sbd *sdp, char *buffer, unsigned int buflen, uint64_t *rg_block, uint64_t rg_data, uint32_t *count) @@ -107,7 +19,6 @@ static int check_block_status(struct gfs2_sbd *sdp, char *buffer, unsigned char *byte, *end; unsigned int bit; unsigned char rg_status; - int block_status; uint8_t q; uint64_t block; @@ -123,26 +34,33 @@ static int check_block_status(struct gfs2_sbd *sdp, char *buffer, if (skip_this_pass || fsck_abort) /* if asked to skip the rest */ return 0; - q = block_type(block); - - if (sdp-gfs1) - block_status = gfs1_convert_mark(q, count); - else - block_status = gfs2_convert_mark(q, count); - - if (block_status 0) { - log_err( _(Invalid status for block %llu (0x%llx).\n), -(unsigned long long)block, -(unsigned long long)block); - return block_status; + q = blockmap_to_bitmap(block_type(block), 
sdp-gfs1
[Cluster-devel] [GFS2 PATCH] GFS2: Eliminate a nonsense goto
Hi,

This patch just removes a goto that did nothing.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 9054002..73c72253 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -543,10 +543,7 @@ static int link_dinode(struct gfs2_inode *dip, const struct qstr *name,
 	}
 
 	error = gfs2_dir_add(&dip->i_inode, name, ip, da);
-	if (error)
-		goto fail_end_trans;
-fail_end_trans:
 	gfs2_trans_end(sdp);
 fail_ipreserv:
 	gfs2_inplace_release(dip);
Re: [Cluster-devel] [fsck.gfs2 PATCH] fsck.gfs2: Change block_map to match bitmap
- Original Message -

Hi Bob,

On 22/01/15 20:41, Bob Peterson wrote:

Hi,

This patch changes the old block_map structure for fsck.gfs2 to the simpler bitmap structure so that we have a 1:1 correspondence. This was done to reduce the memory requirements of fsck.gfs2.

I'm curious as to whether we're losing any useful information here. I see an extra bread() for BLKST_USED blocks in check_data; is that the main performance compromise, and how severe is it?

Cheers,
Andy

Hi,

That's exactly where things get a bit sticky. It used to keep track of the various GFS2_METATYPE_* block types. For the most part, dinodes were straightforward because we only needed to distinguish between directories and non-directories. Free space is pretty straightforward too, with only a few corner cases where things were marked a bad type. Unlinked also wasn't a big problem.

The problem comes with blocks marked "data". In GFS1, data blocks were data blocks. In GFS2, a data block could be data or metadata. In some cases we don't care, because the height will tell us whether it's real data. But things get sticky when we need to distinguish between different types of non-dinode metadata in fsck. For example, what if a dinode has indirect blocks that are not really indirect blocks but, say, an extended attribute block? The old fsck could distinguish between them pretty easily because of their unique types. Going by just the bitmap alone is ambiguous, so sometimes we need to know more details, especially in cases where we have a block that's referenced multiple times: it's much more likely that one of the references is of the wrong type, and in that case, we do the extra read. This shouldn't impact performance, except for duplicate references, which are rare and need the extra read anyway.

There are probably cases where fsck is now making assumptions about the previously processed block type for indirect blocks, so in that respect, the information is lost.
However, I've tried to minimize the risk, and rigorous testing has shown that it works in many very difficult situations. You can tell my testing has been good, since I've recently posted so many patches fixing the bugs I've uncovered. :) In that respect, this new fsck is better than the previous one. Still, there may very well be cases where the lost information causes an unforeseen problem. I'm not sure we can do anything about that but test it thoroughly. Maybe I'll try to dummy up some potential problems (like the one I gave above with indirect versus extended attribute blocks) to see how well it does. I suppose I could try to dummy up every permutation. That is likely to uncover even more problems, most of which already exist in today's fsck.

Regards,

Bob Peterson
Red Hat File Systems

Hi,

As promised, I dummied up some file systems to intentionally have the wrong metadata block types. I tried these scenarios:

1. File that had an extended attribute block that was really an indirect block.
2. File that had an indirect block that was really an extended attribute block.
3. Directory that had an extended attribute block that was really a leaf block.
4. Directory that had a leaf block that was really an extended attribute block.
5. File that had an extended attribute block that was really an ED block.
6. File whose indirect pointer points to a leaf block.
7. Directory whose leaf block pointers all point to an indirect block.

In all these scenarios, the fsck.gfs2 with my patch made the same decisions and produced essentially the same output as the stock fsck.gfs2. So I think it's safe to push this patch now, unless anyone has an objection or has specific tests they'd like me to try.

Regards,

Bob Peterson
Red Hat File Systems
[Cluster-devel] [GFS2 PATCH] GFS2: Allocate reservation during write_begin if needed
Hi,

This patch adds a call to function gfs2_rs_alloc to make sure a reservation structure has been allocated before attempting to reserve blocks.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
 fs/gfs2/aops.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 805b37f..6453e23 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -675,6 +675,9 @@ static int gfs2_write_begin(struct file *file, struct address_space *mapping,
 	if (error)
 		goto out_unlock;
 
+	error = gfs2_rs_alloc(ip);
+	if (error)
+		goto out_qunlock;
 	requested = data_blocks + ind_blocks;
 	ap.target = requested;
 	error = gfs2_inplace_reserve(ip, &ap);
Re: [Cluster-devel] [GFS2 PATCH] GFS2: Allocate reservation during write_begin if needed
- Original Message - Hi, Since we set the allocation structure when the write call begins, and it is not deallocated until there are no writers left with the file open, how does this happen? Steve. Hi, In a normal write, the code goes through gfs2_page_mkwrite or gfs2_file_aio_write. In the failing scenario, it's going through sendfile. I suppose I could patch sendfile as an alternative, but the advantage here is that this patch will do it only if a block allocation is needed. Regards, Bob Peterson Red Hat File Systems
Re: [Cluster-devel] [PATCH] Add myself (Bob Peterson) as a maintainer of GFS2
- Original Message -

Hi Bob,

On Mon, 16 Feb 2015 09:20:56 -0500 (EST) Bob Peterson rpete...@redhat.com wrote:

This patch adds Bob Peterson as a maintainer of the GFS2 file system. It also changes the development repository to a shared location rather than Steve Whitehouse's private location.

Please also consider letting the linux-next maintainer know :-) i.e. does this mean I should change the repo/branch I fetch into linux-next each day, and should I add you to the contacts?

--
Cheers,
Stephen Rothwell s...@canb.auug.org.au

Hi Stephen,

Yes, indeed. I was going to let you know as soon as I got things set up. Unfortunately, I've been pulled in a lot of directions lately. I've set up a for-next branch from which you can do your daily fetch:

pub/scm/linux/kernel/git/gfs2/linux-gfs2.git

Currently it's empty, and yes, please add me to the contacts.

Regards,

Bob Peterson
Red Hat File Systems
[Cluster-devel] [GFS2 PATCH] GFS2: Allocate reservation during splice_write
Hi,

This patch adds a GFS2-specific function for splice_write which first calls function gfs2_rs_alloc to make sure a reservation structure has been allocated before attempting to reserve blocks.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
 fs/gfs2/file.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index ec9c2d3..bd86e79 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1063,6 +1063,22 @@ static int gfs2_flock(struct file *file, int cmd, struct file_lock *fl)
 	}
 }
 
+static ssize_t gfs2_file_splice_write(struct pipe_inode_info *pipe,
+				      struct file *out, loff_t *ppos,
+				      size_t len, unsigned int flags)
+{
+	int error;
+	struct gfs2_inode *ip = GFS2_I(out->f_mapping->host);
+
+	error = gfs2_rs_alloc(ip);
+	if (error)
+		return (ssize_t)error;
+
+	gfs2_size_hint(out, *ppos, len);
+
+	return iter_file_splice_write(pipe, out, ppos, len, flags);
+}
+
 const struct file_operations gfs2_file_fops = {
 	.llseek		= gfs2_llseek,
 	.read		= new_sync_read,
@@ -1077,7 +1093,7 @@ const struct file_operations gfs2_file_fops = {
 	.lock		= gfs2_lock,
 	.flock		= gfs2_flock,
 	.splice_read	= generic_file_splice_read,
-	.splice_write	= iter_file_splice_write,
+	.splice_write	= gfs2_file_splice_write,
 	.setlease	= simple_nosetlease,
 	.fallocate	= gfs2_fallocate,
 };
@@ -1107,7 +1123,7 @@ const struct file_operations gfs2_file_fops_nolock = {
 	.release	= gfs2_release,
 	.fsync		= gfs2_fsync,
 	.splice_read	= generic_file_splice_read,
-	.splice_write	= iter_file_splice_write,
+	.splice_write	= gfs2_file_splice_write,
 	.setlease	= generic_setlease,
 	.fallocate	= gfs2_fallocate,
 };
[Cluster-devel] [PATCH] Add myself (Bob Peterson) as a maintainer of GFS2
Hi,

This patch adds Bob Peterson as a maintainer of the GFS2 file system. It also changes the development repository to a shared location rather than Steve Whitehouse's private location.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
 MAINTAINERS | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 249e8dd..eee1680 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4239,10 +4239,10 @@ F:	scripts/get_maintainer.pl
 
 GFS2 FILE SYSTEM
 M:	Steven Whitehouse swhit...@redhat.com
+M:	Bob Peterson rpete...@redhat.com
 L:	cluster-devel@redhat.com
 W:	http://sources.redhat.com/cluster/
-T:	git git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes.git
-T:	git git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw.git
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git
 S:	Supported
 F:	Documentation/filesystems/gfs2*.txt
 F:	fs/gfs2/
Re: [Cluster-devel] [PATCH 4/8] gfs2: Convert to using -get_state callback
- Original Message - Convert gfs2 to use -get_state callback instead of -get_xstate. Signed-off-by: Jan Kara j...@suse.cz --- fs/gfs2/quota.c | 28 +++- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c index 3e193cb36996..c76e031ccbb4 100644 --- a/fs/gfs2/quota.c +++ b/fs/gfs2/quota.c @@ -1467,32 +1467,34 @@ int gfs2_quotad(void *data) return 0; } -static int gfs2_quota_get_xstate(struct super_block *sb, - struct fs_quota_stat *fqs) +static int gfs2_quota_get_state(struct super_block *sb, struct qc_state *state) { struct gfs2_sbd *sdp = sb-s_fs_info; - memset(fqs, 0, sizeof(struct fs_quota_stat)); - fqs-qs_version = FS_QSTAT_VERSION; + memset(state, 0, sizeof(*state)); switch (sdp-sd_args.ar_quota) { case GFS2_QUOTA_ON: - fqs-qs_flags |= (FS_QUOTA_UDQ_ENFD | FS_QUOTA_GDQ_ENFD); + state-s_state[USRQUOTA].flags |= QCI_LIMITS_ENFORCED; + state-s_state[GRPQUOTA].flags |= QCI_LIMITS_ENFORCED; /*FALLTHRU*/ case GFS2_QUOTA_ACCOUNT: - fqs-qs_flags |= (FS_QUOTA_UDQ_ACCT | FS_QUOTA_GDQ_ACCT); + state-s_state[USRQUOTA].flags |= QCI_ACCT_ENABLED | + QCI_SYSFILE; + state-s_state[GRPQUOTA].flags |= QCI_ACCT_ENABLED | + QCI_SYSFILE; break; case GFS2_QUOTA_OFF: break; } - if (sdp-sd_quota_inode) { - fqs-qs_uquota.qfs_ino = GFS2_I(sdp-sd_quota_inode)-i_no_addr; - fqs-qs_uquota.qfs_nblks = sdp-sd_quota_inode-i_blocks; + state-s_state[USRQUOTA].ino = + GFS2_I(sdp-sd_quota_inode)-i_no_addr; + state-s_state[USRQUOTA].blocks = sdp-sd_quota_inode-i_blocks; } - fqs-qs_uquota.qfs_nextents = 1; /* unsupported */ - fqs-qs_gquota = fqs-qs_uquota; /* its the same inode in both cases */ - fqs-qs_incoredqs = list_lru_count(gfs2_qd_lru); + state-s_state[USRQUOTA].nextents = 1; /* unsupported */ + state-s_state[GRPQUOTA] = state-s_state[USRQUOTA]; + state-s_incoredqs = list_lru_count(gfs2_qd_lru); return 0; } @@ -1637,7 +1639,7 @@ out_put: const struct quotactl_ops gfs2_quotactl_ops = { .quota_sync = gfs2_quota_sync, - .get_xstate = 
gfs2_quota_get_xstate, + .get_state = gfs2_quota_get_state, .get_dqblk = gfs2_get_dqblk, .set_dqblk = gfs2_set_dqblk, }; -- 2.1.4 ACK Looks right. Bob Peterson Red Hat File Systems
Re: [Cluster-devel] [PATCH 2/3] fsck.gfs2: Simplify bad_journalname
- Original Message -

Remove the need for a temporary string and strncpy call by passing the length of the string to printf.

Signed-off-by: Andrew Price anpr...@redhat.com
---
 gfs2/fsck/initialize.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/gfs2/fsck/initialize.c b/gfs2/fsck/initialize.c
index 1ab078b..c052205 100644
--- a/gfs2/fsck/initialize.c
+++ b/gfs2/fsck/initialize.c
@@ -1513,14 +1513,10 @@ static int init_rindex(struct gfs2_sbd *sdp)
 
 static void bad_journalname(const char *filename, int len)
 {
-	char tmp_name[64];
-
 	if (len >= 64)
 		len = 63;
-	strncpy(tmp_name, filename, len);
-	tmp_name[len] = '\0';
-	log_debug(_("Journal index entry '%s' has an invalid filename.\n"),
-		  tmp_name);
+	log_debug(_("Journal index entry '%.*s' has an invalid filename.\n"),
+		  len, filename);
 }

/**
--
1.9.3

Nice trick (strangely, I've never used that construct), but we could just as well get rid of the whole function altogether.

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [PATCH 3/3] gfs2-utils build: Add a configure script summary
- Original Message - Print a nicely formatted summary of some of the more interesting configure options. Required some tweaking of earlier configure stages for accuracy. Signed-off-by: Andrew Price anpr...@redhat.com --- Hi, ACK Bob Peterson Red Hat File Systems
Re: [Cluster-devel] [PATCH 1/3] fsck.gfs2: Fix 'initializer element is not constant' build error
- Original Message - This error occurs when gfs2-utils is compiled with -std options more recent than gnu89: CC fsck_gfs2-main.o main.c:39:38: error: initializer element is not constant struct osi_root dup_blocks = (struct osi_root) { NULL, }; ^ main.c:40:35: error: initializer element is not constant struct osi_root dirtree = (struct osi_root) { NULL, }; ^ main.c:41:37: error: initializer element is not constant struct osi_root inodetree = (struct osi_root) { NULL, }; ^ As far as I can tell, with C89/gnu89 the use of a cast in this context is undefined behaviour and the later standards are more strict about it, hence the error. As the standards specify that members of objects with static storage duration are zeroed/NULLed anyway, the initializers can be removed to achieve the intended result. Signed-off-by: Andrew Price anpr...@redhat.com --- ACK Bob Peterson Red Hat File Systems
[Cluster-devel] [fsck.gfs2 patch] fsck.gfs2: Revise undo processing
Hi, This patch has to do with undo processing. This is where pass1 has processed a bunch of data blocks before encountering an error. If the error is fatal processing is stopped immediately and the work done to that point (up to the fatal error block) is backed out. If the error is not fatal, processing may continue for the rest of the indirect block it's processing. The problem occurs when it finds a non-fatal error block, then later it decides it cannot continue. In that case, the undo function needs to back out changes, but it needs to finish working through the data blocks on the rest of that indirect block. This patch allows it to finish out the rest of the undo work properly. Regards, Bob Peterson Red Hat File Systems Signed-off-by: Bob Peterson rpete...@redhat.com --- diff --git a/gfs2/fsck/metawalk.c b/gfs2/fsck/metawalk.c index cd224fe..1875b24 100644 --- a/gfs2/fsck/metawalk.c +++ b/gfs2/fsck/metawalk.c @@ -1446,17 +1446,28 @@ static int check_data(struct gfs2_inode *ip, struct metawalk_fxns *pass, pass1. Therefore the individual check_data functions should do a range check. */ rc = pass-check_data(ip, metablock, block, pass-private); - if (!error rc) { - error = rc; + if (rc (!error || (rc error))) { log_info(\n); if (rc 0) { - *error_blk = block; + /* A fatal error trumps a non-fatal one. 
*/ + if ((*error_blk == 0) || (rc error)) { + log_debug(_(Fatal error on block 0x + %llx preempts non-fatal + error on block 0x%llx\n), + (unsigned long long)block, + (unsigned long long)*error_blk); + *error_blk = block; + } log_info(_(Unrecoverable )); + } else { /* nonfatal error */ + if ((*error_blk) == 0) + *error_blk = block; } log_info(_(data block error %d on block %llu (0x%llx).\n), rc, (unsigned long long)block, (unsigned long long)block); + error = rc; } if (rc 0) return rc; @@ -1467,10 +1478,11 @@ static int check_data(struct gfs2_inode *ip, struct metawalk_fxns *pass, static int undo_check_data(struct gfs2_inode *ip, struct metawalk_fxns *pass, uint64_t *ptr_start, char *ptr_end, - uint64_t error_blk) + uint64_t error_blk, int error) { int rc = 0; uint64_t block, *ptr; + int found_error_blk = 0; /* If there isn't much pointer corruption check the pointers */ for (ptr = ptr_start ; (char *)ptr ptr_end !fsck_abort; ptr++) { @@ -1480,13 +1492,25 @@ static int undo_check_data(struct gfs2_inode *ip, struct metawalk_fxns *pass, if (skip_this_pass || fsck_abort) return 1; block = be64_to_cpu(*ptr); - if (block == error_blk) - return 1; + if (block == error_blk) { + if (error 0) { /* A fatal error that stopped it? 
*/ + log_debug(_(Stopping the undo process: + fatal error block 0x%llx was + found.\n), + (unsigned long long)error_blk); + return 1; + } + found_error_blk = 1; + log_debug(_(The non-fatal error block 0x%llx was + found, but undo processing will continue + until the end of this metadata block.\n), + (unsigned long long)error_blk); + } rc = pass-undo_check_data(ip, block, pass-private); if (rc 0) return rc; } - return 0; + return found_error_blk; } static int hdr_size(struct gfs2_buffer_head *bh, int height) @@ -1589,9 +1613,11 @@ int check_metatree(struct gfs2_inode *ip, struct metawalk_fxns *pass) undo_metalist: if (!error) goto out; - log_err( _(Error: inode %llu (0x%llx) had unrecoverable errors.\n), + log_err( _(Error: inode %llu (0x%llx) had unrecoverable errors
[Cluster-devel] [fsck.gfs2 patch] fsck.gfs2: Change basic dentry checks for too long of file names
Hi,

This is the first in a series of patches to fsck.gfs2. I've uncovered some problems during recent testing, and these patches fix them.

This patch adds a check to the basic dentry checks so that they reject any dirent whose file name is longer than the maximum allowed.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
diff --git a/gfs2/fsck/pass2.c b/gfs2/fsck/pass2.c
index 1559d8e..bdcf77c 100644
--- a/gfs2/fsck/pass2.c
+++ b/gfs2/fsck/pass2.c
@@ -468,7 +468,8 @@ static int basic_dentry_checks(struct gfs2_inode *ip, struct gfs2_dirent *dent,
 		}
 	}
 
-	if (de->de_rec_len < GFS2_DIRENT_SIZE(de->de_name_len)) {
+	if (de->de_rec_len < GFS2_DIRENT_SIZE(de->de_name_len) ||
+	    de->de_name_len > GFS2_FNAMESIZE) {
 		log_err( _("Dir entry with bad record or name length\n"
 			   "\tRecord length = %u\n\tName length = %u\n"),
 			 de->de_rec_len, de->de_name_len);
@@ -476,9 +477,12 @@ static int basic_dentry_checks(struct gfs2_inode *ip, struct gfs2_dirent *dent,
 		log_err( _("Directory entry not fixed.\n"));
 		return 0;
 	}
-	fsck_blockmap_set(ip, ip->i_di.di_num.no_addr,
-			  _("corrupt directory entry"), gfs2_inode_invalid);
+	/* Don't be tempted to do this:
+	fsck_blockmap_set(ip, ip->i_di.di_num.no_addr,
+			  _("corrupt directory entry"), gfs2_inode_invalid);
+	We can't free it because another dir may have a valid reference
+	to it. Just return 1 so we can delete the bad dirent. */
 	log_err( _("Bad directory entry deleted.\n"));
 	return 1;
 }
[Cluster-devel] [fsck.gfs2 patch] fsck.gfs2: Print out block number when pass3 finds a bad directory
Hi,

This patch changes pass3 so that it prints out the directory inode number when it finds a directory containing a bad block.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
diff --git a/gfs2/fsck/pass3.c b/gfs2/fsck/pass3.c
index 9582b5b..33865df 100644
--- a/gfs2/fsck/pass3.c
+++ b/gfs2/fsck/pass3.c
@@ -246,7 +246,10 @@ int pass3(struct gfs2_sbd *sdp)
 		q = block_type(di->dinode.no_addr);
 		if (q == gfs2_bad_block) {
 			log_err( _("Found unlinked directory "
-				   "containing bad block\n"));
+				   "containing bad block at block %llu "
+				   "(0x%llx)\n"),
+				 (unsigned long long)di->dinode.no_addr,
+				 (unsigned long long)di->dinode.no_addr);
 			if (query(_("Clear unlinked directory "
 				    "with bad blocks? (y/n) "))) {
 				log_warn( _("inode %lld (0x%llx) is
[Cluster-devel] [fsck.gfs2 patch] fsck.gfs2: Adjust when hash table is doubled
Hi,

If directory entries are found to be on the wrong leaf block, they are moved as needed. That sometimes requires the leaf block to be split, which, in turn, sometimes requires the hash table to be doubled. The problem was that the functions that traverse the hash tables were not taking into account the fact that the hash table could just unexpectedly double, throwing off all the pointers. This patch attempts to correct that by fixing up the pointers as necessary.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
diff --git a/gfs2/fsck/metawalk.c b/gfs2/fsck/metawalk.c
index 9aa9fb8..cd224fe 100644
--- a/gfs2/fsck/metawalk.c
+++ b/gfs2/fsck/metawalk.c
@@ -461,7 +461,7 @@ static int check_entries(struct gfs2_inode *ip, struct gfs2_buffer_head *bh,
 		} else {
 			error = pass->check_dentry(ip, dent, prev, bh,
 						   filename, count,
-						   lindex,
+						   &lindex,
 						   pass->private);
 			if (error < 0) {
 				stack;
@@ -504,6 +504,7 @@ int check_leaf(struct gfs2_inode *ip, int lindex, struct metawalk_fxns *pass,
 	uint32_t count = 0;
 	struct gfs2_sbd *sdp = ip->i_sbd;
 	const char *msg;
+	int di_depth = ip->i_di.di_depth;
 
 	/* Make sure the block number is in range. */
 	if (!valid_block(ip->i_sbd, *leaf_no)) {
@@ -604,6 +605,16 @@ int check_leaf(struct gfs2_inode *ip, int lindex, struct metawalk_fxns *pass,
 		}
 	}
 out:
+	if (di_depth < ip->i_di.di_depth) {
+		log_debug(_("Depth of directory %lld (0x%llx) changed from "
+			    "%d to %d; adjusting ref_count from %d to %d\n"),
+			  (unsigned long long)ip->i_di.di_num.no_addr,
+			  (unsigned long long)ip->i_di.di_num.no_addr,
+			  di_depth, ip->i_di.di_depth,
+			  *ref_count,
+			  (*ref_count) << (ip->i_di.di_depth - di_depth));
+		(*ref_count) <<= (ip->i_di.di_depth - di_depth);
+	}
 	brelse(lbh);
 	return 0;
 
@@ -618,6 +629,16 @@ bad_leaf:
 		return fix;
 	}
 
+	if (di_depth < ip->i_di.di_depth) {
+		log_debug(_("Depth of directory %lld (0x%llx) changed from "
+			    "%d to %d. Adjusting ref_count from %d to %d\n"),
+			  (unsigned long long)ip->i_di.di_num.no_addr,
+			  (unsigned long long)ip->i_di.di_num.no_addr,
+			  di_depth, ip->i_di.di_depth,
+			  *ref_count,
+			  (*ref_count) << (ip->i_di.di_depth - di_depth));
+		(*ref_count) <<= (ip->i_di.di_depth - di_depth);
+	}
 	return 1;
 }
 
@@ -773,8 +794,13 @@ static int check_leaf_blks(struct gfs2_inode *ip, struct metawalk_fxns *pass)
 			}
 			error = check_leaf(ip, lindex, pass, &leaf_no,
 					   &leaf, &ref_count);
-			if (ref_count != orig_ref_count)
+			if (ref_count != orig_ref_count) {
+				log_debug(_("Ref count of leaf 0x%llx "
+					    "changed from %d to %d.\n"),
+					  (unsigned long long)leaf_no,
+					  orig_ref_count, ref_count);
 				tbl_valid = 0;
+			}
 			if (!leaf.lf_next || error)
 				break;
 			leaf_no = leaf.lf_next;
@@ -787,6 +813,8 @@ static int check_leaf_blks(struct gfs2_inode *ip, struct metawalk_fxns *pass)
 			  (unsigned long long)ip->i_di.di_num.no_addr,
 			  orig_di_depth, ip->i_di.di_depth);
 		tbl_valid = 0;
+		lindex <<= (ip->i_di.di_depth - orig_di_depth);
+		hsize = (1 << ip->i_di.di_depth);
 	}
 	if (orig_di_height != ip->i_di.di_height) {
 		log_debug(_("Height of 0x%llx changed from %d to "
@@ -1663,7 +1691,7 @@ int check_dir(struct gfs2_sbd *sdp, uint64_t block, struct metawalk_fxns *pass)
 
 static int remove_dentry(struct gfs2_inode *ip, struct gfs2_dirent *dent,
 			 struct gfs2_dirent *prev_de,
 			 struct gfs2_buffer_head *bh,
-			 char *filename, uint32_t *count, int lindex,
+			 char *filename, uint32_t *count, int *lindex,
 			 void *private)
 {
 	/* the metawalk_fxn's private field must be set
[Cluster-devel] [fsck.gfs2 patch] fsck.gfs2: remove duplicate designation during undo
Hi,

This patch fixes a problem whereby fsck.gfs2's pass1 would perform this sequence of events:

1. Metadata block X is identified as being referenced from dinode D1.
2. Metadata block X is identified as being referenced from another dinode, D2, which makes it a duplicate reference, but so far, no serious errors have been found for that dinode.
3. Dinode D2 is found later to be irreparably damaged, and needs to be removed.

When D2 is deleted, the duplicate reference from D2 is removed, and block X is not freed because D1 still references it. However, it's still marked as a duplicate and requires processing in pass1b. Later, pass1b resolves the duplicate and determines there is really only one reference remaining, so it makes the correct decision. However, that extra work should not be necessary. The undo functions should remove the duplicate reference if (and only if) the only reference was from D2. Note, though, that if the corruption is found later in the cycle (after the point where undo is possible) the duplicate reference MUST remain and be resolved by pass1b.
Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
diff --git a/gfs2/fsck/pass1.c b/gfs2/fsck/pass1.c
index a4ba04c..73b054c 100644
--- a/gfs2/fsck/pass1.c
+++ b/gfs2/fsck/pass1.c
@@ -364,6 +364,11 @@ static int undo_reference(struct gfs2_inode *ip, uint64_t block, int meta,
 				  "from another inode; not freeing.\n"),
 				(unsigned long long)block,
 				(unsigned long long)block);
+			if (dt->refs == 1) {
+				log_err(_("This was the only duplicate "
+					  "reference so far; removing it.\n"));
+				dup_delete(dt);
+			}
 			return 1;
 		}
 	}
@@ -1055,8 +1060,13 @@ static int rangecheck_block(struct gfs2_inode *ip, uint64_t block,
 			  (unsigned long long)ip->i_di.di_num.no_addr);
 		if ((*bad_pointers) <= BAD_POINTER_TOLERANCE)
 			return meta_is_good;
-		else
+		else {
+			log_debug(_("Inode 0x%llx bad pointer tolerance "
+				    "exceeded: block 0x%llx.\n"),
+				  (unsigned long long)ip->i_di.di_num.no_addr,
+				  (unsigned long long)block);
 			return meta_error; /* Exits check_metatree quicker */
+		}
 	}
 	return meta_is_good;
 }
Re: [Cluster-devel] 3.18.5 kernel panic: fs/gfs2/acl.c:76
- Original Message -

On 06/02/15 23:50, Andreas Gruenbacher wrote:

Andrew, 3.18.5 kernel crashing on acl deletion: null pointer dereference in fs/gfs2/acl.c:76; this bug seems to exist since commit 2646a1f6 from October 2009.

The if-statement originates in 2646a1f6, but the bug was introduced by the deletion of a NULL check in e01580bf9e, which was in December 2013. The fix we're using currently:

---
 fs/gfs2/acl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/gfs2/acl.c b/fs/gfs2/acl.c
index 3088e2a..8339754 100644
--- a/fs/gfs2/acl.c
+++ b/fs/gfs2/acl.c
@@ -73,7 +73,7 @@ int gfs2_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 
 	BUG_ON(name == NULL);
 
-	if (acl->a_count > GFS2_ACL_MAX_ENTRIES(GFS2_SB(inode)))
+	if ((acl) && (acl->a_count > GFS2_ACL_MAX_ENTRIES(GFS2_SB(inode))))
 		return -E2BIG;
 
 	if (type == ACL_TYPE_ACCESS) {

Except for the extra parentheses this seems correct, thank you.

Agreed. Good catch.

Thanks,
Andy

Hi,

Christoph's patch, which introduced the problem, was never ported to RHEL7, so let's just treat this as an upstream bug.

Andreas: I think maybe you should post your acl patch separately.

Andrew Elble: I don't think we even need a bugzilla for this one. Do you want to just post your latest patch (with fewer parentheses) to cluster-devel@redhat.com so Steve Whitehouse can pick it up in the GFS2 nmw git tree? Then you can get the credit.

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [GFS2 PATCH 0/4] fallocate quota fixes
- Original Message - These patches are related to bz1174295 where fallocate could exceed quota. I'm posting these for early feedback as these patches are only compile-tested so far. patch 1 - This is the patch that actually addresses the quota exceed issue. Quota checks were not being performed against the blocks about to be allocated. patch 2 - Adds new variants of quota check functions that return the number of allowed blocks if quotas are violated by the number of requested blocks patch 3 - Adds a new variant of gfs2_inplace_reserve that returns the max number of available blocks if the function returns -ENOSPC due to unavailability of the requested number of blocks. patch 4 - Allows fallocate to take advantage of patches 2 and 3 to efficiently max out quotas or fill up the fs instead of returning -EDQUOT/-ENOSPC and leaving some available blocks unallocated. Abhi Das (4): gfs2: check quota for blocks we're about to allocate gfs2: add new quota check functions gfs2: add new function gfs2_inpl_rsrv_ret_max_avl gfs2: allow fallocate to max out quotas/fs efficiently fs/gfs2/aops.c | 6 +++--- fs/gfs2/bmap.c | 2 +- fs/gfs2/file.c | 30 ++ fs/gfs2/inode.c | 14 -- fs/gfs2/quota.c | 26 +- fs/gfs2/quota.h | 20 +--- fs/gfs2/rgrp.c | 13 +++-- fs/gfs2/rgrp.h | 10 +- fs/gfs2/xattr.c | 2 +- 9 files changed, 89 insertions(+), 34 deletions(-) -- 1.8.1.4 Hi, ACK. Looks okay to me. Regards, Bob Peterson Red Hat File Systems
[Cluster-devel] [GFS2 PATCH] GFS2: Eliminate __gfs2_glock_remove_from_lru
Hi,

Since the only caller of function __gfs2_glock_remove_from_lru locks the same spin_lock as gfs2_glock_remove_from_lru, the functions can be combined.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com
---
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index a23524a..aeb7bc9 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -173,19 +173,14 @@ void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
 	spin_unlock(&lru_lock);
 }
 
-static void __gfs2_glock_remove_from_lru(struct gfs2_glock *gl)
+static void gfs2_glock_remove_from_lru(struct gfs2_glock *gl)
 {
+	spin_lock(&lru_lock);
 	if (!list_empty(&gl->gl_lru)) {
 		list_del_init(&gl->gl_lru);
 		atomic_dec(&lru_count);
 		clear_bit(GLF_LRU, &gl->gl_flags);
 	}
-}
-
-static void gfs2_glock_remove_from_lru(struct gfs2_glock *gl)
-{
-	spin_lock(&lru_lock);
-	__gfs2_glock_remove_from_lru(gl);
 	spin_unlock(&lru_lock);
 }
 
@@ -205,9 +200,7 @@ void gfs2_glock_put(struct gfs2_glock *gl)
 
 	lockref_mark_dead(&gl->gl_lockref);
 
-	spin_lock(&lru_lock);
-	__gfs2_glock_remove_from_lru(gl);
-	spin_unlock(&lru_lock);
+	gfs2_glock_remove_from_lru(gl);
 	spin_unlock(&gl->gl_lockref.lock);
 	spin_lock_bucket(gl->gl_hash);
 	hlist_bl_del_rcu(&gl->gl_list);
Re: [Cluster-devel] [GFS2 0/3] fallocate and quota fixes
- Original Message -

This is a revised version of the patches required to properly fix the fallocate quota issue described in bz1174295.

patch1: This patch supplies gfs2_quota_check() with the number of blocks the caller intends to allocate in the current operation, resulting in a more accurate quota check.

patch2: gfs2_quota_check() and gfs2_inplace_reserve() return the number of blocks available subject to quota limits and rgrp size respectively. These functions don't return an error if at least ap.min_target (if set) blocks are allocatable.

patch3: The fallocate() function uses the features of patch2 to determine how many total blocks are available for allocation and uses them right away, instead of guessing/retrying inefficiently.

The behavior of quota enforcement is altered by this patchset:

i. Quotas are exceeded (a warning message is also issued to syslog) before the actual usage blocks exceed the imposed limit. In fact, the actual usage can never exceed the limit. Whenever it is determined that the completion of an operation will cause a quota to exceed its limit, such an operation is aborted with -EDQUOT and a 'quota exceeded' message is dispatched. Note: when min_target is set and the allowed blocks are >= min_target, we don't issue an error. It is assumed that the caller will only allocate the allowed blocks.

ii. The gfs2_write_calc_reserv()/calc_max_reserv() functions are used to map between available blocks and the data bytes that can be written using them. Typically, for large files, some blocks are used up for metadata and only the remaining blocks can be used for data. Example: to write only a handful of bytes that would easily fit in one block, we might have to allocate an extra bunch of intermediate metadata blocks. If we had only 1 block left in our allotted quota, this operation would likely fail.

The functions mentioned in ii. are not very efficient.
They always compute the worst case number of extra blocks required and it is often the case that not all those extra blocks are used. We need to find a better algorithm to get a tighter estimate on the blocks needed for a given number of bytes. I've run some basic tests and things seem to be holding up. The failing case in bz1174295 is fixed using this patchset. I'll do test build and pass it on to Nate to test with. Abhi Das (3): gfs2: perform quota checks against allocation parameters gfs2: allow quota_check and inplace_reserve to return available blocks gfs2: allow fallocate to max out quotas/fs efficiently fs/gfs2/aops.c | 6 ++--- fs/gfs2/bmap.c | 2 +- fs/gfs2/file.c | 81 fs/gfs2/incore.h | 4 ++- fs/gfs2/inode.c | 18 +++-- fs/gfs2/quota.c | 54 +++-- fs/gfs2/quota.h | 8 +++--- fs/gfs2/rgrp.c | 20 ++ fs/gfs2/rgrp.h | 3 ++- fs/gfs2/xattr.c | 2 +- 10 files changed, 133 insertions(+), 65 deletions(-) -- 1.8.1.4 Hi, ACK to all three. Now pushed to the for-next branch of the linux-gfs2.git git tree. Regards, Bob Peterson
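The enforcement rules in item i. above can be modeled in a few lines of plain C. This is a toy model, not GFS2 code: the function name, the simple limit/used accounting, and the error constant are all invented for illustration of the described behavior (abort with -EDQUOT before the limit is crossed, but allow a partial allocation when at least min_target blocks fit).

```c
#include <assert.h>

/* Hypothetical, simplified model of the quota check described above.
 * The caller asks for 'requested' blocks; the check fails only if
 * fewer than 'min_target' blocks fit under the limit.  On success it
 * returns how many of the requested blocks are allowed.  EDQUOT_ERR
 * is an illustrative stand-in for -EDQUOT. */
enum { EDQUOT_ERR = -122 };

static long quota_check_model(long limit, long used,
                              long requested, long min_target)
{
    long avail = limit - used;          /* blocks still under quota */

    if (avail < 0)
        avail = 0;
    if (requested <= avail)
        return requested;               /* whole allocation fits */
    if (min_target > 0 && avail >= min_target)
        return avail;                   /* partial allocation allowed */
    return EDQUOT_ERR;                  /* would exceed the quota */
}
```

Under this model, actual usage can never exceed the limit: any operation that would cross it is either trimmed down to the available blocks (when min_target permits) or refused outright.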
Re: [Cluster-devel] [PATCH 1/1] gfs2: incorrect check for debugfs returns
- Original Message -
debugfs_create_dir and debugfs_create_file may return -ENODEV when debugfs is not configured, so the return value should be checked against ERROR_VALUE as well; otherwise the later dereference of the dentry pointer would crash the kernel.

Signed-off-by: Chengyu Song cson...@gatech.edu
---
 fs/gfs2/glock.c | 47 ---
 1 file changed, 28 insertions(+), 19 deletions(-)

Hi,

I've now added this patch to the for-next branch of the linux-gfs2.git tree. Thanks.

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [PATCH] GFS2: Improve readability of gfs2_alloc_inode
- Original Message -
Returning &ip->i_inode when ip is NULL is safe as i_inode is the first member in struct gfs2_inode, but that's not immediately obvious. Reorganize gfs2_alloc_inode to avoid any doubt.

Signed-off-by: Andrew Price anpr...@redhat.com
---
Re-sending with a more appropriate commit log based on Andreas' comments.

 fs/gfs2/super.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 1666382..37c59ee 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1628,12 +1628,13 @@ static struct inode *gfs2_alloc_inode(struct super_block *sb)
 	struct gfs2_inode *ip;
 
 	ip = kmem_cache_alloc(gfs2_inode_cachep, GFP_KERNEL);
-	if (ip) {
-		ip->i_flags = 0;
-		ip->i_gl = NULL;
-		ip->i_rgd = NULL;
-		ip->i_res = NULL;
-	}
+	if (!ip)
+		return NULL;
+
+	ip->i_flags = 0;
+	ip->i_gl = NULL;
+	ip->i_rgd = NULL;
+	ip->i_res = NULL;
 	return &ip->i_inode;
 }
-- 
1.9.3

ACK,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [PATCH RHEL6] libgfs2: Use a matching context mount option in mount_gfs2_meta
- Original Message -
On a system with SELinux enabled, if a gfs2 file system is mounted with a context= option, the tools gfs2_quota, gfs2_tool, gfs2_grow and gfs2_jadd will fail with "Device or resource busy". This is due to SELinux failing the mount due to a mismatched context ("SELinux: mount invalid. Same superblock, different security settings"). In order to work around this, parse the context option of the gfs2 mount point in is_pathname_mounted() and use it in mount_gfs2_meta().

Resolves: rhbz#1121693
Signed-off-by: Andrew Price anpr...@redhat.com
---
 gfs2/libgfs2/libgfs2.h |  1 +
 gfs2/libgfs2/misc.c    | 21 -
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/gfs2/libgfs2/libgfs2.h b/gfs2/libgfs2/libgfs2.h
index 9c20f11..25286d1 100644
--- a/gfs2/libgfs2/libgfs2.h
+++ b/gfs2/libgfs2/libgfs2.h
@@ -217,6 +217,7 @@ struct gfs2_sbd {
 
 	int device_fd;
 	int path_fd;
+	char *secontext;
 
 	uint64_t sb_addr;

diff --git a/gfs2/libgfs2/misc.c b/gfs2/libgfs2/misc.c
index 8e0ca6f..5ef4a2a 100644
--- a/gfs2/libgfs2/misc.c
+++ b/gfs2/libgfs2/misc.c
@@ -100,6 +100,24 @@ int compute_constants(struct gfs2_sbd *sdp)
 	return 0;
 }
 
+/**
+ * Returns a duplicate of the 'context' mount option, or NULL if not found.
+ */
+static char *copy_context_opt(struct mntent *mnt)
+{
+	char *ctx, *end;
+
+	ctx = hasmntopt(mnt, "context");
+	if (ctx == NULL)
+		return NULL;
+
+	end = strchr(ctx, ',');
+	if (end == NULL)
+		return NULL;
+
+	return strndup(ctx, end - ctx);
+}
+
 int is_pathname_mounted(struct gfs2_sbd *sdp, int *ro_mount)
 {
 	FILE *fp;
@@ -161,6 +179,7 @@ int is_pathname_mounted(struct gfs2_sbd *sdp, int *ro_mount)
 		return 0;
 	if (hasmntopt(mnt, MNTOPT_RO))
 		*ro_mount = 1;
+	sdp->secontext = copy_context_opt(mnt);
 	return 1; /* mounted */
 }
 
@@ -319,7 +338,7 @@ int mount_gfs2_meta(struct gfs2_sbd *sdp)
 	sigaction(SIGCONT, &sa, NULL);
 	sigaction(SIGUSR1, &sa, NULL);
 	sigaction(SIGUSR2, &sa, NULL);
-	ret = mount(sdp->path_name, sdp->metafs_path, "gfs2meta", 0, NULL);
+	ret = mount(sdp->path_name, sdp->metafs_path, "gfs2meta", 0, sdp->secontext);
 	if (ret) {
 		rmdir(sdp->metafs_path);
 		return -1;
-- 
1.9.3

Hi,

Is this a memory leak (albeit a small one), or did I miss something? I don't see where the memory allocated by strndup is ever freed.

Regards,

Bob Peterson
Red Hat File Systems
[Cluster-devel] Announcing a new GFS2 development git repository and branch
Hi,

I'm going to be helping Steve Whitehouse maintain the upstream GFS2 kernel git tree. To make this possible, we had to switch the repository to a different location that's shared between us. The new GFS2 upstream tree is located here:

https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/

As per the kernel.org standard, we're planning to use branch for-next to keep recent upstream patches queued for the next merge window. This replaces the old master branch in the -nmw development repo. So linux-gfs2's master branch will track Linus's tree, and for-next will be for queuing new patches.

To clone the new tree, do something like this:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git

There are currently three patches pushed to the new for-next branch:

0108044 Add myself (Bob Peterson) as a maintainer of GFS2 (Bob Peterson)
9717c00 GFS2: gfs2_set_acl(): Cache no acl as well (Andreas Gruenbacher)
76514ae GFS2: Allocate reservation during splice_write (Bob Peterson)

Stephen Rothwell: If you haven't already done so, please change your nightly script to pull patches for linux-next from the for-next branch of this tree.

I've never been a kernel maintainer before, so expect a few blunders, and let me know if (and how) I can do better with the process.

Regards,

Bob Peterson
Re: [Cluster-devel] [PATCH] GFS2: gfs2_set_acl(): Cache no acl as well
Hi,

Now in the for-next tree of the linux-gfs2.git repo.

Regards,

Bob Peterson

- Original Message -
When removing a default acl or setting an access acl that is entirely represented in the file mode, we end up with acl == NULL in gfs2_set_acl(). In that case, bring gfs2 in line with other file systems and cache the NULL acl with set_cached_acl() instead of invalidating the cache with forget_cached_acl().

Signed-off-by: Andreas Gruenbacher agrue...@redhat.com
---
 fs/gfs2/acl.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/gfs2/acl.c b/fs/gfs2/acl.c
index 7b31430..1be3b06 100644
--- a/fs/gfs2/acl.c
+++ b/fs/gfs2/acl.c
@@ -110,11 +110,7 @@ int gfs2_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 	error = __gfs2_xattr_set(inode, name, data, len, 0, GFS2_EATYPE_SYS);
 	if (error)
 		goto out;
-
-	if (acl)
-		set_cached_acl(inode, type, acl);
-	else
-		forget_cached_acl(inode, type);
+	set_cached_acl(inode, type, acl);
 out:
 	kfree(data);
 	return error;
-- 
2.1.0
Re: [Cluster-devel] [GFS2 0/3] fallocate and quota fixes
- Original Message -
This is a revised version of the patches required to properly fix the fallocate quota issue described in bz1174295.

patch1: This patch supplies gfs2_quota_check() with the number of blocks the caller intends to allocate in the current operation, resulting in a more accurate quota check.

patch2: gfs2_quota_check() and gfs2_inplace_reserve() return the number of blocks available subject to quota limits and rgrp size, respectively. The difference from previous versions of this patch is that we return the available blocks even on success.

patch3: The fallocate() function uses the features of patch2 to determine how many total blocks are available for allocation and uses them right away, instead of guessing inefficiently.

The behavior of quota enforcement is altered by this patchset:

i. A quota is reported as exceeded (and a warning message issued to syslog) before the actual usage in blocks exceeds the imposed limit. In fact, the actual usage can never exceed the limit: whenever it is determined that the completion of an operation would cause a quota to exceed its limit, the operation is aborted with -EDQUOT and a 'quota exceeded' message is dispatched.

ii. The gfs2_write_calc_reserv()/calc_max_reserv() functions are used to map between available blocks and the data bytes that can be written using them. Typically, for large files, some blocks are used up for metadata and only the remaining blocks can be used for data. Example: to write only a handful of bytes that would easily fit in one block, we might have to allocate an extra bunch of intermediate metadata blocks. If we had only 1 block left in our allotted quota, this operation would likely fail.

The functions mentioned in ii. are not very efficient. They always compute the worst-case number of extra blocks required, and it is often the case that not all those extra blocks are used. We need to find a better algorithm to get a tighter estimate of the blocks needed for a given number of bytes.

I've run some basic tests and things seem to be holding up. The failing case in bz1174295 is fixed using this patchset. I'll do a test build and pass it on to Nate to test with.

Abhi Das (3):
  gfs2: perform quota checks against allocation parameters
  gfs2: allow quota_check and inplace_reserve to return available blocks
  gfs2: allow fallocate to max out quotas/fs efficiently

 fs/gfs2/aops.c   |  6 ++---
 fs/gfs2/bmap.c   |  2 +-
 fs/gfs2/file.c   | 75 ++--
 fs/gfs2/incore.h |  3 ++-
 fs/gfs2/inode.c  | 18 --
 fs/gfs2/quota.c  | 40 ++
 fs/gfs2/quota.h  |  8 +++---
 fs/gfs2/rgrp.c   | 12 +++--
 fs/gfs2/rgrp.h   |  6 +++--
 fs/gfs2/xattr.c  |  2 +-
 10 files changed, 117 insertions(+), 55 deletions(-)
-- 
1.8.1.4

Looks good to me. ACK to the series.

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] linux-next: Tree for Apr 20 (gfs2)
- Original Message - On 04/19/15 22:45, Stephen Rothwell wrote: Hi all, Please do not add any v4.2 material to your linux-next included trees until after v4.1-rc1 is released. Changes since 20150415: on i386: ERROR: __divdi3 [fs/gfs2/gfs2.ko] undefined! -- ~Randy Hi, My apologies. I've pushed an addendum patch that fixes the problem to the for-next branch of the linux-gfs2.git git tree. Regards, Bob Peterson
Re: [Cluster-devel] [PATCH] mkfs.gfs2: Allow longer cluster names
- Original Message -
Increase the enforced limit for cluster name to 32 bytes and file system name to 30 bytes for mkfs.gfs2 (was previously 16 + 16 bytes). Also increased this limit in tunegfs2 when labelling gfs2 file systems. Updated the information in the man pages, along with adding a new test case for mkfs.gfs2 to validate the increased cluster/file system name support.

Signed-off-by: Paul Evans pev...@redhat.com
---
 gfs2/man/mkfs.gfs2.8  | 4 ++--
 gfs2/mkfs/main_mkfs.c | 4 ++--
 gfs2/tune/super.c     | 2 +-
 tests/mkfs.at         | 8
 4 files changed, 13 insertions(+), 5 deletions(-)

Hi,

ACK

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [PATCH] GFS2: mark the journal idle to fix ro mounts
- Original Message -
When gfs2 was mounted read-only and then unmounted, it was writing a header block to the journal in the syncing gfs2_log_flush() call from kill_sb(). This is because the journal was not being marked as idle until the first log header was written out, and on a read-only mount there never was a log header written out. Since the journal was not marked idle, gfs2_log_flush() was writing out a header block to make sure it was empty during the sync. Not only did this cause IO to a read-only filesystem, but the journalling isn't completely initialized on read-only mounts, and so gfs2 was writing out the wrong sequence number in the log header. Now, the journal is marked idle on mount, and gfs2_log_flush() won't write out anything until there are transactions to flush.

Signed-off-by: Benjamin Marzinski bmarz...@redhat.com

Hi,

ACK. Makes sense to me.

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [PATCH] GFS2: mark the journal idle to fix ro mounts
- Original Message -
When gfs2 was mounted read-only and then unmounted, it was writing a header block to the journal in the syncing gfs2_log_flush() call from kill_sb(). This is because the journal was not being marked as idle until the first log header was written out, and on a read-only mount there never was a log header written out. Since the journal was not marked idle, gfs2_log_flush() was writing out a header block to make sure it was empty during the sync. Not only did this cause IO to a read-only filesystem, but the journalling isn't completely initialized on read-only mounts, and so gfs2 was writing out the wrong sequence number in the log header. Now, the journal is marked idle on mount, and gfs2_log_flush() won't write out anything until there are transactions to flush.

Signed-off-by: Benjamin Marzinski bmarz...@redhat.com
---
 fs/gfs2/ops_fstype.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 8633ad3..fd984f6 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -757,6 +757,7 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
 		}
 	}
 
+	sdp->sd_log_idle = 1;
 	set_bit(SDF_JOURNAL_CHECKED, &sdp->sd_flags);
 	gfs2_glock_dq_uninit(&ji_gh);
 	jindex = 0;
-- 
1.8.3.1

Hi,

Now applied to the for-next branch of the linux-gfs2 tree:

https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-next&id=086cc672e1cb600b9c17688a4aa44560db858c03

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [PATCH 02/12] trivial: GFS2: inode.c: indent with TABs, not spaces
- Original Message -
Follow the same style used for the other functions in the same file.

Signed-off-by: Antonio Ospite a...@ao2.it
Cc: Steven Whitehouse swhit...@redhat.com
Cc: Bob Peterson rpete...@redhat.com
Cc: cluster-devel@redhat.com
---
 fs/gfs2/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 1b3ca7a..efd6bfb 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -1227,8 +1227,8 @@ static int gfs2_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
  */
 
 static int gfs2_atomic_open(struct inode *dir, struct dentry *dentry,
-                            struct file *file, unsigned flags,
-                            umode_t mode, int *opened)
+			    struct file *file, unsigned flags,
+			    umode_t mode, int *opened)
 {
 	struct dentry *d;
 	bool excl = !!(flags & O_EXCL);
-- 
2.1.4

Hi,

Thanks. This is now applied to the for-next branch of the linux-gfs2 tree:

https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-next&id=86fbca4923f956dae31247e68dc73ffdfd6e5cb0

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [PATCH v2] GFS2: add support for rename2 and RENAME_EXCHANGE
- Original Message -
gfs2 now uses the rename2 directory iop, and supports the RENAME_EXCHANGE flag (as well as RENAME_NOREPLACE, which the vfs takes care of).

Signed-off-by: Benjamin Marzinski bmarz...@redhat.com

Hi,

Thanks. The v2 patch is now in the upstream tree:

https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-next&id=a63b7bbc2175901d79fa36ba734499655c077f0d

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [GFS2] gfs2: handle NULL rgd in set_rgrp_preferences
- Original Message -
The function set_rgrp_preferences() does not handle the (rarely returned) NULL value from gfs2_rgrpd_get_next() and this patch fixes that.

The fs image in question is only 150MB in size, which allows for only 1 rgrp to be created. The in-memory rb tree has only 1 node, and when gfs2_rgrpd_get_next() is called on this sole rgrp, it returns NULL. (Default behavior is to wrap around the rb tree and return the first node to give the illusion of a circular linked list. In the case of only 1 rgrp, we can't have gfs2_rgrpd_get_next() return the same rgrp (first, last, next all point to the same rgrp)... that would cause unintended consequences and infinite loops.)

Resolves: rhbz#1211663
Signed-off-by: Abhi Das a...@redhat.com
---
 fs/gfs2/rgrp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index cb27065..900e515 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -978,10 +978,10 @@ static void set_rgrp_preferences(struct gfs2_sbd *sdp)
 		rgd->rd_flags |= GFS2_RDF_PREFERRED;
 		for (i = 0; i < sdp->sd_journals; i++) {
 			rgd = gfs2_rgrpd_get_next(rgd);
-			if (rgd == first)
+			if (!rgd || rgd == first)
 				break;
 		}
-	} while (rgd != first);
+	} while (rgd && rgd != first);
 }
 
 /**
-- 
1.8.1.4

Hi,

Thanks. Now in the for-next branch of the linux-gfs2 git tree. FYI: I removed "Resolves:" for upstream.

Regards,

Bob Peterson
Red Hat File Systems
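The wrap-around hazard this patch guards against can be demonstrated with a self-contained model. Everything below (struct node, get_next, walk_ring) is invented for illustration; it only mimics the shape of the real code, where the "next" lookup fakes a circular list by wrapping to the first element, and a single-element case yields NULL instead of a distinct next node.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the rgrp list: a singly linked chain whose "next"
 * lookup wraps back to the head, except that a sole node has no
 * distinct next and yields NULL -- like gfs2_rgrpd_get_next() on a
 * one-rgrp filesystem. */
struct node { struct node *next; int preferred; };

static struct node *get_next(struct node *n, struct node *head)
{
    if (n->next != NULL)
        return n->next;
    return (n == head) ? NULL : head;   /* sole node: no distinct next */
}

/* Walks the ring the way set_rgrp_preferences() does, skipping ahead
 * 'journals' steps per marked element.  Without the NULL checks (the
 * fix in the patch above), the single-node case would loop forever. */
static int walk_ring(struct node *head, int journals)
{
    struct node *n = head, *first = head;
    int visited = 0;

    do {
        n->preferred = 1;
        visited++;
        for (int i = 0; i < journals; i++) {
            n = get_next(n, head);
            if (n == NULL || n == first)
                break;
        }
    } while (n != NULL && n != first);
    return visited;
}
```

With one node the walk terminates after a single visit; with several nodes it still stops once the iteration wraps back to the start.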
Re: [Cluster-devel] [PATCH v2] GFS2: make sure S_NOSEC flag isn't overwritten
- Original Message -
At the end of gfs2_set_inode_flags, inode->i_flags is set to flags, so we should be modifying flags instead of inode->i_flags so that it isn't overwritten.

Signed-off-by: Benjamin Marzinski bmarz...@redhat.com
---

Hi,

The v2 version of the patch is now in the for-next branch of the linux-gfs2 tree:

https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-next&id=01e64ee40ad741037352d1d6202eaa432f833eb4

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [PATCH V3 linux-next] gfs2: convert simple_str to kstr
- Original Message -
- Remove obsolete simple_str functions.
- Return an error code when kstr fails.
- This patch also calls functions corresponding to the destination type.

Thanks to Alexey Dobriyan for suggesting improvements in block_store() and wdack_store().

Signed-off-by: Fabian Frederick f...@skynet.be

Hi,

Thanks. The v3 version of the patch is now in the for-next branch of the linux-gfs2 tree:

https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-next&id=e50ead480fac63ede9e0b656cd29c1820f7af9de

Regards,

Bob Peterson
Red Hat File Systems
[Cluster-devel] GFS2: Pull request (merge window)
Hi,

Please consider pulling the following changes.

Bob Peterson

GFS2: merge window

Here is a list of patches we've accumulated for GFS2 for the current upstream merge window. Most of the patches fix GFS2 quotas, which were not properly enforced. There's another that adds me as a GFS2 co-maintainer, and a couple of patches that fix a kernel panic doing splice_write on GFS2, as well as a few correctness patches.

The following changes since commit 28666d6dc3feca2b1983e6011df383299d8b6b64:

  Add myself (Bob Peterson) as a maintainer of GFS2 (2015-02-24 08:55:30 -0600)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-merge-window

for you to fetch changes up to 30133177957dca9a3e2a37b720f891d3225a92a1:

  gfs2: fix quota refresh race in do_glock() (2015-04-08 09:31:18 -0500)

GFS2: merge window

Here is a list of patches we've accumulated for GFS2 for the current upstream merge window. Most of the patches fix GFS2 quotas, which were not properly enforced. There's another that adds me as a GFS2 co-maintainer, and a couple of patches that fix a kernel panic doing splice_write on GFS2, as well as a few correctness patches.

Abhi Das (4):
  gfs2: perform quota checks against allocation parameters
  gfs2: allow quota_check and inplace_reserve to return available blocks
  gfs2: allow fallocate to max out quotas/fs efficiently
  gfs2: fix quota refresh race in do_glock()

Andreas Gruenbacher (1):
  GFS2: gfs2_set_acl(): Cache no acl as well

Bob Peterson (2):
  GFS2: Allocate reservation during splice_write
  GFS2: Move gfs2_file_splice_write outside of #ifdef

Chengyu Song (1):
  gfs2: incorrect check for debugfs returns

 fs/gfs2/acl.c    |   6 +---
 fs/gfs2/aops.c   |   6 ++--
 fs/gfs2/bmap.c   |   2 +-
 fs/gfs2/file.c   | 101 ++-
 fs/gfs2/glock.c  |  47 +++---
 fs/gfs2/incore.h |   4 ++-
 fs/gfs2/inode.c  |  18 +-
 fs/gfs2/quota.c  |  62 --
 fs/gfs2/quota.h  |   8 +++--
 fs/gfs2/rgrp.c   |  20 ---
 fs/gfs2/rgrp.h   |   3 +-
 fs/gfs2/xattr.c  |   2 +-
 12 files changed, 184 insertions(+), 95 deletions(-)
Re: [Cluster-devel] GFS2: Pull request (merge window)
- Original Message - On Tue, Apr 14, 2015 at 10:47 AM, Bob Peterson rpete...@redhat.com wrote: 12 files changed, 184 insertions(+), 95 deletions(-) Oh, and this was incorrect. You had apparently limited the statistics to the fs/gfs2 directory, and thus missed the changes to the MAINTAINERS file. Don't do that - include all the changes. It's what I see and compare against when I actually do the pull anyway. Linus Duly noted, and sorry for the confusion. Bob Peterson
[Cluster-devel] GFS2: Patches pulled, linux-gfs2.git rebased
Hi, Linus has pulled the latest set of patches from the for-next branch of the linux-gfs2.git tree, so I rebased the tree based on Linus's master. So the current tree is back to having no unmerged GFS2 patches. Regards, Bob Peterson Red Hat File Systems
[Cluster-devel] [GFS2 PATCH] GFS2: Use average srttb value in congestion calculations
Hi,

This patch changes function gfs2_rgrp_congested so that it uses an average srttb (smoothed round trip time for blocking rgrp glocks) rather than the CPU-specific value. If we use the CPU-specific value, it can incorrectly report no contention when there really is contention, due to the glock processing occurring on a different CPU.

Signed-off-by: Bob Peterson rpete...@redhat.com
---
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 6af2396..f39eedc 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1850,14 +1850,19 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
 	const struct gfs2_sbd *sdp = gl->gl_sbd;
 	struct gfs2_lkstats *st;
 	s64 r_dcount, l_dcount;
-	s64 r_srttb, l_srttb;
+	s64 l_srttb, a_srttb = 0;
 	s64 srttb_diff;
 	s64 sqr_diff;
 	s64 var;
+	int cpu;
 
 	preempt_disable();
+	for_each_present_cpu(cpu) {
+		st = &per_cpu_ptr(sdp->sd_lkstats, cpu)->lkstats[LM_TYPE_RGRP];
+		a_srttb += st->stats[GFS2_LKS_SRTTB];
+	}
 	st = &this_cpu_ptr(sdp->sd_lkstats)->lkstats[LM_TYPE_RGRP];
-	r_srttb = st->stats[GFS2_LKS_SRTTB];
+	a_srttb /= num_present_cpus();
 	r_dcount = st->stats[GFS2_LKS_DCOUNT];
 	var = st->stats[GFS2_LKS_SRTTVARB] +
 	      gl->gl_stats.stats[GFS2_LKS_SRTTVARB];
@@ -1866,10 +1871,10 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
 	l_srttb = gl->gl_stats.stats[GFS2_LKS_SRTTB];
 	l_dcount = gl->gl_stats.stats[GFS2_LKS_DCOUNT];
 
-	if ((l_dcount < 1) || (r_dcount < 1) || (r_srttb == 0))
+	if ((l_dcount < 1) || (r_dcount < 1) || (a_srttb == 0))
 		return false;
 
-	srttb_diff = r_srttb - l_srttb;
+	srttb_diff = a_srttb - l_srttb;
 	sqr_diff = srttb_diff * srttb_diff;
 
 	var *= 2;
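The motivation for averaging can be seen with a toy calculation. The values, types, and function name below are made up for the sketch (the kernel keeps s64 stats in struct gfs2_lkstats and divides by num_present_cpus()): if the glock work happened on other CPUs, the local CPU's srttb sample may be zero even though the lock is busy, while the cross-CPU average still reflects the contention.

```c
#include <assert.h>

/* Toy illustration of why the patch averages the per-CPU round-trip
 * stats instead of using only the local CPU's value.  'per_cpu' is a
 * made-up array of per-CPU srttb samples; the kernel reads these via
 * per_cpu_ptr() inside for_each_present_cpu(). */
static long avg_srttb(const long *per_cpu, int ncpus)
{
    long sum = 0;

    for (int i = 0; i < ncpus; i++)
        sum += per_cpu[i];
    return sum / ncpus;         /* mirrors a_srttb /= num_present_cpus() */
}
```

With samples {0, 0, 600, 600}, the CPU running the check sees 0 (no contention), but the average of 300 reveals that other CPUs have been waiting on the rgrp glock.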