[Cluster-devel] [GFS2 v4 PATCH 14/25] gfs2: move check_journal_clean to util.c for future use

2019-05-15 Thread Bob Peterson
Before this patch function check_journal_clean was in ops_fstype.c. This patch moves it to util.c so we can make use of it elsewhere in a future patch. Signed-off-by: Bob Peterson --- fs/gfs2/ops_fstype.c | 42 - fs/gfs2/util.c | 45

[Cluster-devel] [GFS2 v4 PATCH 06/25] gfs2: simplify gfs2_freeze by removing case

2019-05-15 Thread Bob Peterson
Function gfs2_freeze had a case statement that simply checked the error code, but the break statements just made the logic hard to read. This patch simplifies the logic in favor of a simple if. Signed-off-by: Bob Peterson --- fs/gfs2/super.c | 10 ++ 1 file changed, 2 insertions(+), 8

[Cluster-devel] [GFS2 v4 PATCH 25/25] gfs2: Check for log write errors before telling dlm to unlock

2019-05-15 Thread Bob Peterson
Before this patch, function do_xmote just assumed all the writes submitted to the journal were finished and successful, and it called the go_unlock function to release the dlm lock. But if they're not, and a revoke failed to make its way to the journal, a journal replay on another node will cause

[Cluster-devel] [GFS2 v4 PATCH 21/25] gfs2: Abort gfs2_freeze if io error is seen

2019-05-15 Thread Bob Peterson
Before this patch, an io error, such as -EIO writing to the journal would cause function gfs2_freeze to go into an infinite loop, continuously retrying the freeze operation. But nothing ever clears the -EIO except unmount after withdraw, which is impossible if the freeze operation never ends

[Cluster-devel] [GFS2 v4 PATCH 19/25] gfs2: Force withdraw to replay journals and wait for it to finish

2019-05-15 Thread Bob Peterson
When a node withdraws from a file system, it often leaves its journal in an incomplete state. This is especially true when the withdraw is caused by io errors writing to the journal. Before this patch, a withdraw would try to write a "shutdown" record to the journal, tell dlm it's done with the

[Cluster-devel] [GFS2 v4 PATCH 20/25] gfs2: Add verbose option to check_journal_clean

2019-05-15 Thread Bob Peterson
Before this patch, function check_journal_clean would give messages related to journal recovery. That's fine for mount time, but when a node withdraws and forces replay that way, we don't want all those distracting and misleading messages. This patch adds a new parameter to make those messages

[Cluster-devel] [GFS2 v4 PATCH 11/25] gfs2: Only complain the first time an io error occurs in quota or log

2019-05-15 Thread Bob Peterson
Before this patch, all io errors received by the quota daemon or the logd daemon would cause a complaint message to be issued, such as: gfs2: fsid=dm-13.0: Error 10 writing to journal, jid=0 This patch changes it so that the error message is only issued the first time the error is

[Cluster-devel] [GFS2 v4 PATCH 22/25] gfs2: Check if holding freeze glock when making fs ro

2019-05-15 Thread Bob Peterson
If gfs2 withdraws the file system due to errors, such as being unable to write to the journals, we withdraw. When we withdraw, we try to change the file system to read-only by calling gfs2_make_fs_ro so that the appropriate items are synced and flushed. However, if the withdraw occurs while we're

[Cluster-devel] [GFS2 v4 PATCH 15/25] gfs2: Allow some glocks to be used during withdraw

2019-05-15 Thread Bob Peterson
Before this patch, when a file system was withdrawn, all further attempts to enqueue or promote glocks were rejected and returned -EIO. This is only important for media-backed glocks like inode and rgrp glocks. All other glocks may be safely used because there is no potential for metadata

[Cluster-devel] [GFS2 v4 PATCH 23/25] gfs2: Issue revokes more intelligently

2019-05-15 Thread Bob Peterson
Before this patch, function gfs2_write_revokes would call gfs2_ail1_empty, then traverse the sd_ail1_list looking for transactions that had bds which were no longer queued to a glock. And if it found some, it would try to issue revokes for them, up to a predetermined maximum. There were two

[Cluster-devel] [GFS2 v4 PATCH 24/25] gfs2: Prepare to withdraw as soon as an IO error occurs in log write

2019-05-15 Thread Bob Peterson
Before this patch, function gfs2_end_log_write would detect any IO errors writing to the journal and put out an appropriate message, but it never set a withdrawing condition. Eventually, the log daemon would see the error and determine it was time to withdraw, but in the meantime, other processes

[Cluster-devel] [GFS2 v4 PATCH 12/25] gfs2: Stop ail1 wait loop when withdrawn

2019-05-15 Thread Bob Peterson
Before this patch, function gfs2_log_flush could get into an infinite loop trying to clear out its ail1 list. If the file system was withdrawn (or pending withdraw) due to a problem with writing the ail1 list, it would never clear out the list, and therefore, would loop infinitely. This patch

[Cluster-devel] [GFS2 v4 PATCH 17/25] gfs2: Make secondary withdrawers wait for first withdrawer

2019-05-15 Thread Bob Peterson
Before this patch, if a process encountered an error and decided to withdraw, if another process was already in the process of withdrawing, the secondary withdraw would be silently ignored, which set it free to proceed with its processing, unlock any locks, etc. That's correct behavior if the

[Cluster-devel] [GFS2 v4 PATCH 04/25] gfs2: Warn when a journal replay overwrites a rgrp with buffers

2019-05-15 Thread Bob Peterson
This patch adds some instrumentation in gfs2's journal replay that indicates when we're about to overwrite a rgrp for which we already have a valid buffer_head. When this problem occurs, it's a situation in which this node has been granted a rgrp glock and subsequently read in buffer_heads for

[Cluster-devel] [GFS2 v4 PATCH 05/25] gfs2: Change SDF_SHUTDOWN to SDF_WITHDRAWN

2019-05-15 Thread Bob Peterson
Before this patch, the superblock flag indicating when a file system is withdrawn was called SDF_SHUTDOWN. This patch simply renames it to the more obvious SDF_WITHDRAWN. Signed-off-by: Bob Peterson --- fs/gfs2/aops.c | 4 ++-- fs/gfs2/file.c | 2 +- fs/gfs2/glock.c | 6 +++---

[Cluster-devel] [GFS2 v4 PATCH 01/25] gfs2: kthread and remount improvements

2019-05-15 Thread Bob Peterson
Before this patch, gfs2 saved the pointers to the two daemon threads (logd and quotad) in the superblock, but they were never cleared, even if the threads were stopped (e.g. on remount -o ro). That meant that certain error conditions (like a withdrawn file system) could race. For example, xfstests

[Cluster-devel] [GFS2 v4 PATCH 02/25] gfs2: eliminate tr_num_revoke_rm

2019-05-15 Thread Bob Peterson
For its journal processing, gfs2 kept track of the number of buffers added and removed on a per-transaction basis. These values are used to calculate space needed in the journal. But while these calculations make sense for the number of buffers, they make no sense for revokes. Revokes are managed

[Cluster-devel] [GFS2 v4 PATCH 07/25] gfs2: dump fsid when dumping glock problems

2019-05-15 Thread Bob Peterson
Before this patch, if a glock error was encountered, the glock with the problem was dumped. But sometimes you may have lots of file systems mounted, and that doesn't tell you which file system it was for. This patch adds a new boolean parameter fsid to the dump_glock family of functions. For

[Cluster-devel] [GFS2 v4 PATCH 00/25] gfs2: misc recovery patch collection

2019-05-15 Thread Bob Peterson
Here is version 4 of the patch set I posted on 23 April. It is revised based on bugs I found testing with xfstests. The first 8 are cleanups, the rest are bug fixes. This is a collection of patches I've been using to address the myriad of recovery problems I've found. There aren't many other

[Cluster-devel] [GFS2 v4 PATCH 08/25] gfs2: replace more printk with calls to fs_info and friends

2019-05-15 Thread Bob Peterson
This patch replaces a few leftover printk errors with calls to fs_info and similar, so that the file system having the error is properly logged. Signed-off-by: Bob Peterson --- fs/gfs2/bmap.c | 2 +- fs/gfs2/glops.c | 3 ++- fs/gfs2/rgrp.c | 27 ++- fs/gfs2/super.c |

[Cluster-devel] [GFS2 v4 PATCH 09/25] gfs2: Introduce concept of a pending withdraw

2019-05-15 Thread Bob Peterson
File system withdraws can be delayed when inconsistencies are discovered when we cannot withdraw immediately, for example, when critical spin_locks are held. But delaying the withdraw can cause gfs2 to ignore the error and keep running for a short period of time. For example, an rgrp glock may be

[Cluster-devel] [GFS2 v4 PATCH 13/25] gfs2: Ignore dlm recovery requests if gfs2 is withdrawn

2019-05-15 Thread Bob Peterson
When a node fails, user space informs dlm of the node failure, and dlm instructs gfs2 on the surviving nodes to perform journal recovery. It does this by calling various callback functions in lock_dlm.c. To mark its progress, it keeps generation numbers and recover bits in a dlm "control" lock

[Cluster-devel] [GFS2 v4 PATCH 16/25] gfs2: Don't loop forever in gfs2_freeze if withdrawn

2019-05-15 Thread Bob Peterson
Before this patch, function gfs2_freeze would loop forever if the file system trying to be frozen is withdrawn. That's because function gfs2_lock_fs_check_clean tries to enqueue the glock of the journal and the gfs2_glock returns -EIO because you can't enqueue a journaled glock after a withdraw.

[Cluster-devel] [GFS2 v4 PATCH 18/25] gfs2: Don't write log headers after file system withdraw

2019-05-15 Thread Bob Peterson
Before this patch, when a node withdrew a gfs2 file system, it wrote a (clean) unmount log header. That's wrong. You don't want to write anything to the journal once you're withdrawn because that's acknowledging that the transaction is complete and the journal is in good shape, neither of which

[Cluster-devel] [gfs2:for-next.recovery6o 11/25] fs/gfs2/lops.c:215:22: sparse: sparse: incorrect type in initializer (different base types)

2019-05-15 Thread kbuild test robot
tree: https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git for-next.recovery6o head: 1b7696df0c689a9c6610ebd1f1a5f5353c228ba5 commit: 846ea61cee4947e91765afe2df90bdd9f9913349 [11/25] gfs2: Only complain the first time an io error occurs in quota or log reproduce: #