Hi,

This is a revision to the patch set I sent on 13 February 2019. These
won't make this merge window, obviously, because that's almost upon us.

This version fixes some glaring mistakes and problems in the first set.
As before, this may not be the final version, but I wanted to put it out
for review anyway. Changes from the original include:

1. I fixed some really stupid mistakes in the original patch set.
2. I found and fixed several additional problems not covered by the
   first patch set.
3. I broke up the patch "Force withdraw to replay journals and wait for
   it to finish" into more reasonably sized pieces. It's still complex,
   but not nearly as bad as the original.
4. I included some of the instrumentation I've used to detect file
   system corruption. It makes sense to include it in mainline, I think.
5. I still need to figure out what to do about Dave Teigland's
   observation regarding the patch "dlm: recover slot regardless of
   whether we still have a connection". The patch is omitted from this
   set until I figure out a reasonable course of action.

This version is much more stable. I've still been able to break it,
given enough pressure, but I think that's an additional bug. I'll
continue to chase it and will post further patches if necessary.

These patches address a bunch of problems related to journal replay
overwriting valid gfs2 metadata due to I/O errors, withdraws and such.
They seem to fix several metadata corruption problems I've been able to
reliably recreate lately with multi-node, multi-file-system recovery
tests.

Bob Peterson (15):
  gfs2: log error reform
  gfs2: Introduce concept of a pending withdraw
  gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
  gfs2: move check_journal_clean to util.c for future use
  gfs2: Allow some glocks to be used during withdraw
  gfs2: Make secondary withdrawers wait for first withdrawer
  gfs2: Don't write log headers after file system withdraw
  gfs2: Force withdraw to replay journals and wait for it to finish
  gfs2: Add verbose option to check_journal_clean
  gfs2: Check for log write errors before telling dlm to unlock
  gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty
  gfs2: If the journal isn't live ignore log flushes
  gfs2: Issue revokes more intelligently
  gfs2: Warn when a journal replay overwrites a rgrp with buffers
  gfs2: log which portion of the journal is replayed

 fs/gfs2/aops.c       |   4 +-
 fs/gfs2/file.c       |   2 +-
 fs/gfs2/glock.c      |  39 ++++++++--
 fs/gfs2/glock.h      |   1 +
 fs/gfs2/glops.c      |  82 ++++++++++++++++++++-
 fs/gfs2/incore.h     |  14 +++-
 fs/gfs2/lock_dlm.c   |  68 +++++++++++++++++
 fs/gfs2/log.c        |  94 +++++++++++-------------
 fs/gfs2/log.h        |   1 +
 fs/gfs2/lops.c       |  29 +++++++-
 fs/gfs2/meta_io.c    |   6 +-
 fs/gfs2/ops_fstype.c |  52 ++-----------
 fs/gfs2/quota.c      |   8 +-
 fs/gfs2/recovery.c   |   3 +-
 fs/gfs2/super.c      |  30 ++++----
 fs/gfs2/super.h      |   1 +
 fs/gfs2/sys.c        |   2 +-
 fs/gfs2/util.c       | 171 ++++++++++++++++++++++++++++++++++++++++++-
 fs/gfs2/util.h       |  11 +++
 19 files changed, 477 insertions(+), 141 deletions(-)

--
2.20.1