Re: [Cluster-devel] GFS2 filesystem corruption when mkfs block size (4k) does not match blockdevice blocksize (512 bytes)

2019-03-26 Thread Andreas Gruenbacher
Hi Edwin,

On Mon, 25 Mar 2019 at 20:07, Edwin Török  wrote:
> On 20/03/2019 15:54, Edwin Török wrote:
> > Hello,
> >
> > The patch below to xfstests reliably reproduces the filesystem corruption 
> > on a 4.19 kernel.
>
> I've also reproduced this on CentOS 7.6:
> [ 2565.093062] Buffer I/O error on dev dm-5, logical block 123761, lost async 
> page write
> [ 2565.307581] GFS2: fsid=dm-5.0: fatal: invalid metadata block
> GFS2: fsid=dm-5.0:   bh = 123761 (magic number)
> GFS2: fsid=dm-5.0:   function = gfs2_meta_indirect_buffer, file = 
> fs/gfs2/meta_io.c, line = 428
> [ 2565.307641] GFS2: fsid=dm-5.0: about to withdraw this file system
> [ 2565.307706] GFS2: fsid=dm-5.0: withdrawn
> [ 2565.307737] CPU: 0 PID: 61453 Comm: ls Kdump: loaded Not tainted 
> 3.10.0-957.10.1.el7.x86_64 #1
> [ 2565.307772] Hardware name: Xen HVM domU, BIOS 4.7.6-6.3 08/24/2018
>
> See below for the script that I used to do it, 100% reproducible and takes 
> only a few minutes to reproduce.
>
> The problem here seems to be that the metadata block encountered some I/O 
> error, wrote an incomplete block
> (writes are atomic only at 512 byte level, not 4k), and later on GFS2 tried 
> to read back the now corrupt metadata block.
>
> I've also been experimenting with a patch to make GFS2 withdraw sooner, 
> before it actually tries to read back the corrupted block,
> but then of course it still fails when mounting the filesystem again, when it 
> detects the corrupt metadata block.
> I would've thought that a journal replay on mount would fix this, i.e. try to 
> write out the metadata block again and succeed,
> but I'm not convinced the journal still contains the metadata block entry 
> because the I/O error here is all asynchronous.

when the filesystem is withdrawn and unmounted, and then either
re-mounted or fsck'ed, it's pretty clear that the journal needs to be
replayed. If that doesn't happen, and it didn't at least in my local
testing, it's no surprise that partially written blocks will lead to
all kinds of problems. This is an area that Bob has spent a lot of
time looking into recently. He has found and fixed several bugs, but
things are still not working entirely correctly. I'll try to help Bob
with these patches so that we can get at least the initial fixes out
the door as soon as possible.

> While looking at this I also discovered that gfs2_write_jdata_pagevec has 
> some copy/pasted code from page-writeback.c that is missing
> https://github.com/torvalds/linux/commit/3fa750dcf29e8606e3969d13d8e188cc1c0f511d#diff-f9a62819096a044efc553c77d4f6fdbb,
>  but that doesn't seem to be used by default.
>
>
> Do you have some thoughts/suggestions on how to proceed here?
>
> diff --git i/fs/gfs2/meta_io.c w/fs/gfs2/meta_io.c
> index 868caa0..39c1030 100644
> --- i/fs/gfs2/meta_io.c
> +++ w/fs/gfs2/meta_io.c
> @@ -33,6 +33,17 @@
>  #include "util.h"
>  #include "trace_gfs2.h"
>
> +static void gfs2_async_end_io(struct buffer_head *bh, int uptodate)
> +{
> +   end_buffer_async_write(bh, uptodate);
> +   if (!uptodate) {
> +   struct address_space *mapping = bh->b_page->mapping;
> +   struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping);
> +   gfs2_io_error_bh_wd(sdp, bh);
> +   gfs2_lm_withdraw(sdp, NULL);
> +   }
> +}
> +
>  static int gfs2_aspace_writepage(struct page *page, struct writeback_control 
> *wbc)
>  {
> struct buffer_head *bh, *head;
> @@ -63,6 +74,7 @@ static int gfs2_aspace_writepage(struct page *page, struct 
> writeback_control *wb
> }
> if (test_clear_buffer_dirty(bh)) {
> mark_buffer_async_write(bh);
> +   bh->b_end_io = gfs2_async_end_io;
> } else {
> unlock_buffer(bh);
> }
>
> The script to reproduce the issue on CentOS:
> #/bin/sh
> set -eu
> SIZE=1G
> # XFSTESTS=https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/
> XFSTESTS=https://github.com/edwintorok/xfstests.git
>
> sudo yum install -y epel-release
> sudo yum install -y acl attr automake bc dump e2fsprogs fio gawk gcc indent 
> libtool make psmisc quota sed xfsdump xfsprogs libacl-devel libattr-devel 
> libaio-devel libuuid-devel xfsprogs-devel python sqlite git gfs2-utils
> test -d xfstests-dev || git clone "${XFSTESTS}" xfstests-dev && make -C 
> xfstests-dev -j8
> cd xfstests-dev
>
> SCRATCH_FILE=scratch-file-$(date +%s)
> truncate --size "${SIZE}" "${SCRATCH_FILE}"
> SCRATCH_DEV=$(losetup --show -f "${SCRATCH_FILE}")
>
> TEST_FILE=test-file-$(date +%s)
> truncate --size "${SIZE}" "${TEST_FILE}"
> TEST_DEV=$(losetup --show -f "${TEST_FILE}")
> TEST_DIR=/mnt/xfstests-test
> SCRATCH_MNT=/mnt/xfstests-scratch
> mkdir -p "${TEST_DIR}" "${SCRATCH_MNT}"
>
> cat  export SCRATCH_DEV=$SCRATCH_DEV
> export TEST_DEV=$TEST_DEV
> export TEST_DIR=$TEST_DIR
> export SCRATCH_MNT=$SCRATCH_MNT
> EOF
>
> umount ${TEST_DIR} || true
> wipefs -a -f ${TEST_DEV}
> 

Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free

2019-03-26 Thread Bob Peterson
Hi Ross,

- Original Message -
> 6. gfs2_log_flush() continues and calls revoke_lo_after_commit() and
> uses the freed glock (stack trace above).
> 
> Should evict_inode call gfs2_ail_flush() or something so that the revoke
> is written before it drops its reference to the glock?
> 
> Or is there something else that is meant to keep the glock around if the
> inode is evicted while there is a linked gfs2_bufdata sitting on some
> AIL list?
> 
> Let me know if any more specific information is needed since I now have
> a test setup that can usually reproduce this within 10 minutes.

Very interesting.

It's not unusual for glocks to outlive their inodes, but I'm not sure
what the right answer is in this case. Either the revoke should
take an additional reference to the glock, and not let go until the
revoke is written by some log flush, or else the evict needs to do the
log flush to make sure the revoke is committed. But we've had issues with
evict in the past, so we need to be careful about how we fix it.
Andreas and I will look into the best way to fix it. Thanks again for your help.

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free

2019-03-26 Thread Ross Lagerwall

On 1/31/19 5:18 PM, Andreas Gruenbacher wrote:

Hi Ross,

On Thu, 31 Jan 2019 at 11:56, Ross Lagerwall  wrote:

Each gfs2_bufdata stores a reference to a glock but the reference count
isn't incremented. This causes an occasional use-after-free of the
glock. Fix by taking a reference on the glock during allocation and
dropping it when freeing.

Found by KASAN:

BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
Write of size 4 at addr 88801aff6134 by task kworker/0:2H/20371

CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
Workqueue: glock_workqueue glock_work_func [gfs2]
Call Trace:
  dump_stack+0x71/0xab
  print_address_description+0x6a/0x270
  kasan_report+0x258/0x380
  ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
  revoke_lo_after_commit+0x8e/0xe0 [gfs2]
  gfs2_log_flush+0x511/0xa70 [gfs2]
  ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
  ? __brelse+0x48/0x50
  ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
  ? gfs2_trans_end+0x18d/0x340 [gfs2]
  gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
  ? inode_go_dump+0xe0/0xe0 [gfs2]
  ? inode_go_sync+0xe4/0x220 [gfs2]
  inode_go_sync+0xe4/0x220 [gfs2]
  do_xmote+0x12b/0x290 [gfs2]
  glock_work_func+0x6f/0x160 [gfs2]
  process_one_work+0x461/0x790
  worker_thread+0x69/0x6b0
  ? process_one_work+0x790/0x790
  kthread+0x1ae/0x1d0
  ? kthread_create_worker_on_cpu+0xc0/0xc0
  ret_from_fork+0x22/0x40


thanks for tracking this down, very interesting.

The consistency model here is that every buffer head that a struct
gfs2_bufdata object is attached to is protected by a glock. Before a
glock can be released, all the buffers under that glock have to be
flushed out and released; this is what allows another node to access
the same on-disk location without causing inconsistencies. When there
is a bufdata object that points to a glock that has already been
freed, this consistency model is broken. Taking an additional refcount
as this patch does may make the use-after-free go away, but it doesn't
fix the underlying problem. So I think we'll need a different fix
here.

Did you observe this problem in a real-world scenario, or with KASAN
only? It might be that we're looking at a small race that is unlikely
to trigger in the field. In any case, I think we need to understand
better what't actually going on.



I finally got time to look into this further. The following is what 
happens as far as I can tell though my understanding is a bit limited at 
this point:


1. A file is opened, truncated and closed which results in a metadata 
write so a gfs2_bufdata object is created and associated with the inode 
glock.


2. The gfs2_bufdata is written out and placed on an AIL list.

3. The VFS layer calls gfs2_evict_inode() and the inode is evicted which 
drops a reference on the glock. It takes case 3 (i_nlink > 0) so no log 
flush or ail_flush happens.


4. The gfs2_bufdata object is however still on one of the AIL lists and 
the next gfs2_log_flush() begins to write a revoke entry.


5. At about the same time the glock is freed. At the point of freeing, 
gl_revokes is 1 (should probably have a GLOCK_BUG_ON for this).


6. gfs2_log_flush() continues and calls revoke_lo_after_commit() and 
uses the freed glock (stack trace above).


Should evict_inode call gfs2_ail_flush() or something so that the revoke 
is written before it drops its reference to the glock?


Or is there something else that is meant to keep the glock around if the 
inode is evicted while there is a linked gfs2_bufdata sitting on some 
AIL list?


Let me know if any more specific information is needed since I now have 
a test setup that can usually reproduce this within 10 minutes.


Thanks,
--
Ross Lagerwall