Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
On Tue, 9 Apr 2019 at 17:37, Ross Lagerwall wrote:
> On 4/5/19 6:50 PM, Andreas Gruenbacher wrote:
> > Hi Ross,
> >
> > On Tue, 2 Apr 2019 at 00:59, Andreas Gruenbacher wrote:
> >> thanks for the great analysis.
> >>
> >> On Tue, 26 Mar 2019 at 20:14, Bob Peterson wrote:
> >>> - Original Message -
> >>> > 6. gfs2_log_flush() continues and calls revoke_lo_after_commit() and
> >>> > uses the freed glock (stack trace above).
> >>> >
> >>> > Should evict_inode call gfs2_ail_flush() or something so that the revoke
> >>> > is written before it drops its reference to the glock?
> >>> >
> >>> > Or is there something else that is meant to keep the glock around if the
> >>> > inode is evicted while there is a linked gfs2_bufdata sitting on some
> >>> > AIL list?
> >>> >
> >>> > Let me know if any more specific information is needed since I now have
> >>> > a test setup that can usually reproduce this within 10 minutes.
> >>>
> >>> Very interesting.
> >>>
> >>> It's not unusual for glocks to outlive their inodes, but I'm not sure
> >>> what the right answer is in this case. Either the revoke should
> >>> take an additional reference to the glock, and not let go until the
> >>> revoke is written by some log flush, or else the evict needs to do the
> >>> log flush to make sure the revoke is committed. But we've had issues with
> >>> evict in the past, so we need to be careful about how we fix it.
> >>> Andreas and I will look into the best way to fix it. Thanks again for
> >>> your help.
> >>
> >> Usually, glocks are attached to inodes or other objects. When the
> >> state of the glock changes, the go_lock, go_sync, and go_inval
> >> operations determine what happens to the attached object. In
> >> gfs2_evict_inode, we disassociate the inode from the glock, do the
> >> necessary cleanup work locally, and put the glock asynchronously when
> >> needed, though. We do that in PF_MEMALLOC context to avoid
> >> deadlocking; see commit 71c1b2136835 ("gfs2: gfs2_evict_inode: Put
> >> glocks asynchronously"). Even if we didn't do that, there could still
> >> be other glock holders like the glock state machine. There is no
> >> guarantee that the glock will go away anytime soon, or actually at
> >> all: it could get reused by another instance of the same inode. So
> >> waiting for the glock to go away first in gfs2_evict_inode definitely
> >> is not an option.
> >>
> >> This basically answers your above question: gfs2_evict_inode should
> >> definitely clean up before putting the glock. I'm just not sure how to
> >> best achieve that, yet.
> >
> > after more discussions, Bob convinced me that it makes perfect sense
> > to not write out outstanding revokes in gfs2_evict_inode. We really
> > want to delay writing revokes as long as possible so that as many of
> > the revokes as possible will go away (either because the blocks are
> > re-added to the journal and the revokes are "unrevoked", or because
> > the journal wraps around). So the right fix here is indeed taking
> > additional glock references.
> >
> > I've come up with a patch that takes one additional reference for each
> > glock instead of one reference for each bufdata object; that should
> > scale better. I'll post that and a follow-up patch separately.
> >
> > Could you please verify?
>
> I've tested the patches and no longer see the use-after-free.

Ok great, we'll line the fix up for the next merge window.

Thanks,
Andreas
Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
On 4/5/19 6:50 PM, Andreas Gruenbacher wrote:
> Hi Ross,
>
> On Tue, 2 Apr 2019 at 00:59, Andreas Gruenbacher wrote:
>> thanks for the great analysis.
>>
>> On Tue, 26 Mar 2019 at 20:14, Bob Peterson wrote:
>>> - Original Message -
>>>> 6. gfs2_log_flush() continues and calls revoke_lo_after_commit() and
>>>> uses the freed glock (stack trace above).
>>>>
>>>> Should evict_inode call gfs2_ail_flush() or something so that the revoke
>>>> is written before it drops its reference to the glock?
>>>>
>>>> Or is there something else that is meant to keep the glock around if the
>>>> inode is evicted while there is a linked gfs2_bufdata sitting on some
>>>> AIL list?
>>>>
>>>> Let me know if any more specific information is needed since I now have
>>>> a test setup that can usually reproduce this within 10 minutes.
>>>
>>> Very interesting.
>>>
>>> It's not unusual for glocks to outlive their inodes, but I'm not sure
>>> what the right answer is in this case. Either the revoke should
>>> take an additional reference to the glock, and not let go until the
>>> revoke is written by some log flush, or else the evict needs to do the
>>> log flush to make sure the revoke is committed. But we've had issues with
>>> evict in the past, so we need to be careful about how we fix it.
>>> Andreas and I will look into the best way to fix it. Thanks again for
>>> your help.
>>
>> Usually, glocks are attached to inodes or other objects. When the
>> state of the glock changes, the go_lock, go_sync, and go_inval
>> operations determine what happens to the attached object. In
>> gfs2_evict_inode, we disassociate the inode from the glock, do the
>> necessary cleanup work locally, and put the glock asynchronously when
>> needed, though. We do that in PF_MEMALLOC context to avoid
>> deadlocking; see commit 71c1b2136835 ("gfs2: gfs2_evict_inode: Put
>> glocks asynchronously"). Even if we didn't do that, there could still
>> be other glock holders like the glock state machine. There is no
>> guarantee that the glock will go away anytime soon, or actually at
>> all: it could get reused by another instance of the same inode. So
>> waiting for the glock to go away first in gfs2_evict_inode definitely
>> is not an option.
>>
>> This basically answers your above question: gfs2_evict_inode should
>> definitely clean up before putting the glock. I'm just not sure how to
>> best achieve that, yet.
>
> after more discussions, Bob convinced me that it makes perfect sense
> to not write out outstanding revokes in gfs2_evict_inode. We really
> want to delay writing revokes as long as possible so that as many of
> the revokes as possible will go away (either because the blocks are
> re-added to the journal and the revokes are "unrevoked", or because
> the journal wraps around). So the right fix here is indeed taking
> additional glock references.
>
> I've come up with a patch that takes one additional reference for each
> glock instead of one reference for each bufdata object; that should
> scale better. I'll post that and a follow-up patch separately.
>
> Could you please verify?

I've tested the patches and no longer see the use-after-free. Thanks for working on this!
--
Ross Lagerwall
Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
Hi Ross,

On Tue, 2 Apr 2019 at 00:59, Andreas Gruenbacher wrote:
> thanks for the great analysis.
>
> On Tue, 26 Mar 2019 at 20:14, Bob Peterson wrote:
> > - Original Message -
> > > 6. gfs2_log_flush() continues and calls revoke_lo_after_commit() and
> > > uses the freed glock (stack trace above).
> > >
> > > Should evict_inode call gfs2_ail_flush() or something so that the revoke
> > > is written before it drops its reference to the glock?
> > >
> > > Or is there something else that is meant to keep the glock around if the
> > > inode is evicted while there is a linked gfs2_bufdata sitting on some
> > > AIL list?
> > >
> > > Let me know if any more specific information is needed since I now have
> > > a test setup that can usually reproduce this within 10 minutes.
> >
> > Very interesting.
> >
> > It's not unusual for glocks to outlive their inodes, but I'm not sure
> > what the right answer is in this case. Either the revoke should
> > take an additional reference to the glock, and not let go until the
> > revoke is written by some log flush, or else the evict needs to do the
> > log flush to make sure the revoke is committed. But we've had issues with
> > evict in the past, so we need to be careful about how we fix it.
> > Andreas and I will look into the best way to fix it. Thanks again for your
> > help.
>
> Usually, glocks are attached to inodes or other objects. When the
> state of the glock changes, the go_lock, go_sync, and go_inval
> operations determine what happens to the attached object. In
> gfs2_evict_inode, we disassociate the inode from the glock, do the
> necessary cleanup work locally, and put the glock asynchronously when
> needed, though. We do that in PF_MEMALLOC context to avoid
> deadlocking; see commit 71c1b2136835 ("gfs2: gfs2_evict_inode: Put
> glocks asynchronously"). Even if we didn't do that, there could still
> be other glock holders like the glock state machine. There is no
> guarantee that the glock will go away anytime soon, or actually at
> all: it could get reused by another instance of the same inode. So
> waiting for the glock to go away first in gfs2_evict_inode definitely
> is not an option.
>
> This basically answers your above question: gfs2_evict_inode should
> definitely clean up before putting the glock. I'm just not sure how to
> best achieve that, yet.

after more discussions, Bob convinced me that it makes perfect sense to not write out outstanding revokes in gfs2_evict_inode. We really want to delay writing revokes as long as possible so that as many of the revokes as possible will go away (either because the blocks are re-added to the journal and the revokes are "unrevoked", or because the journal wraps around). So the right fix here is indeed taking additional glock references.

I've come up with a patch that takes one additional reference for each glock instead of one reference for each bufdata object; that should scale better. I'll post that and a follow-up patch separately.

Could you please verify?

Thanks,
Andreas
Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
Hi Ross,

thanks for the great analysis.

On Tue, 26 Mar 2019 at 20:14, Bob Peterson wrote:
> - Original Message -
> > 6. gfs2_log_flush() continues and calls revoke_lo_after_commit() and
> > uses the freed glock (stack trace above).
> >
> > Should evict_inode call gfs2_ail_flush() or something so that the revoke
> > is written before it drops its reference to the glock?
> >
> > Or is there something else that is meant to keep the glock around if the
> > inode is evicted while there is a linked gfs2_bufdata sitting on some
> > AIL list?
> >
> > Let me know if any more specific information is needed since I now have
> > a test setup that can usually reproduce this within 10 minutes.
>
> Very interesting.
>
> It's not unusual for glocks to outlive their inodes, but I'm not sure
> what the right answer is in this case. Either the revoke should
> take an additional reference to the glock, and not let go until the
> revoke is written by some log flush, or else the evict needs to do the
> log flush to make sure the revoke is committed. But we've had issues with
> evict in the past, so we need to be careful about how we fix it.
> Andreas and I will look into the best way to fix it. Thanks again for your
> help.

Usually, glocks are attached to inodes or other objects. When the state of the glock changes, the go_lock, go_sync, and go_inval operations determine what happens to the attached object. In gfs2_evict_inode, we disassociate the inode from the glock, do the necessary cleanup work locally, and put the glock asynchronously when needed, though. We do that in PF_MEMALLOC context to avoid deadlocking; see commit 71c1b2136835 ("gfs2: gfs2_evict_inode: Put glocks asynchronously"). Even if we didn't do that, there could still be other glock holders like the glock state machine. There is no guarantee that the glock will go away anytime soon, or actually at all: it could get reused by another instance of the same inode. So waiting for the glock to go away first in gfs2_evict_inode definitely is not an option.

This basically answers your above question: gfs2_evict_inode should definitely clean up before putting the glock. I'm just not sure how to best achieve that, yet.

Thanks,
Andreas
Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
Hi Ross,

- Original Message -
> 6. gfs2_log_flush() continues and calls revoke_lo_after_commit() and
> uses the freed glock (stack trace above).
>
> Should evict_inode call gfs2_ail_flush() or something so that the revoke
> is written before it drops its reference to the glock?
>
> Or is there something else that is meant to keep the glock around if the
> inode is evicted while there is a linked gfs2_bufdata sitting on some
> AIL list?
>
> Let me know if any more specific information is needed since I now have
> a test setup that can usually reproduce this within 10 minutes.

Very interesting.

It's not unusual for glocks to outlive their inodes, but I'm not sure what the right answer is in this case. Either the revoke should take an additional reference to the glock, and not let go until the revoke is written by some log flush, or else the evict needs to do the log flush to make sure the revoke is committed. But we've had issues with evict in the past, so we need to be careful about how we fix it. Andreas and I will look into the best way to fix it. Thanks again for your help.

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
On 1/31/19 5:18 PM, Andreas Gruenbacher wrote:
> Hi Ross,
>
> On Thu, 31 Jan 2019 at 11:56, Ross Lagerwall wrote:
>> Each gfs2_bufdata stores a reference to a glock but the reference count
>> isn't incremented. This causes an occasional use-after-free of the
>> glock. Fix by taking a reference on the glock during allocation and
>> dropping it when freeing.
>>
>> Found by KASAN:
>>
>> BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>> Write of size 4 at addr 88801aff6134 by task kworker/0:2H/20371
>>
>> CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
>> Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
>> Workqueue: glock_workqueue glock_work_func [gfs2]
>> Call Trace:
>>  dump_stack+0x71/0xab
>>  print_address_description+0x6a/0x270
>>  kasan_report+0x258/0x380
>>  ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>>  revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>>  gfs2_log_flush+0x511/0xa70 [gfs2]
>>  ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
>>  ? __brelse+0x48/0x50
>>  ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
>>  ? gfs2_trans_end+0x18d/0x340 [gfs2]
>>  gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
>>  ? inode_go_dump+0xe0/0xe0 [gfs2]
>>  ? inode_go_sync+0xe4/0x220 [gfs2]
>>  inode_go_sync+0xe4/0x220 [gfs2]
>>  do_xmote+0x12b/0x290 [gfs2]
>>  glock_work_func+0x6f/0x160 [gfs2]
>>  process_one_work+0x461/0x790
>>  worker_thread+0x69/0x6b0
>>  ? process_one_work+0x790/0x790
>>  kthread+0x1ae/0x1d0
>>  ? kthread_create_worker_on_cpu+0xc0/0xc0
>>  ret_from_fork+0x22/0x40
>
> thanks for tracking this down, very interesting.
>
> The consistency model here is that every buffer head that a struct
> gfs2_bufdata object is attached to is protected by a glock. Before a
> glock can be released, all the buffers under that glock have to be
> flushed out and released; this is what allows another node to access
> the same on-disk location without causing inconsistencies. When there
> is a bufdata object that points to a glock that has already been
> freed, this consistency model is broken.
>
> Taking an additional refcount as this patch does may make the
> use-after-free go away, but it doesn't fix the underlying problem. So
> I think we'll need a different fix here.
>
> Did you observe this problem in a real-world scenario, or with KASAN
> only? It might be that we're looking at a small race that is unlikely
> to trigger in the field. In any case, I think we need to understand
> better what's actually going on.

I finally got time to look into this further. The following is what happens as far as I can tell, though my understanding is a bit limited at this point:

1. A file is opened, truncated, and closed, which results in a metadata write, so a gfs2_bufdata object is created and associated with the inode glock.

2. The gfs2_bufdata is written out and placed on an AIL list.

3. The VFS layer calls gfs2_evict_inode() and the inode is evicted, which drops a reference on the glock. It takes case 3 (i_nlink > 0), so no log flush or ail_flush happens.

4. The gfs2_bufdata object is, however, still on one of the AIL lists, and the next gfs2_log_flush() begins to write a revoke entry.

5. At about the same time, the glock is freed. At the point of freeing, gl_revokes is 1 (there should probably be a GLOCK_BUG_ON for this).

6. gfs2_log_flush() continues, calls revoke_lo_after_commit(), and uses the freed glock (stack trace above).

Should evict_inode call gfs2_ail_flush() or something so that the revoke is written before it drops its reference to the glock?

Or is there something else that is meant to keep the glock around if the inode is evicted while there is a linked gfs2_bufdata sitting on some AIL list?

Let me know if any more specific information is needed since I now have a test setup that can usually reproduce this within 10 minutes.

Thanks,
--
Ross Lagerwall
Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free (Another debug patch)
Hi Ross,

- Original Message -
> Do you have any suggestions for tracking down the root cause?

Attached is the starting point for a generic "debug" kernel trace point, complete with examples of how it's used. It's always associated with a particular glock. You might find it helpful, but again, it's old, so it'll probably need to be tweaked for an upstream kernel.

Regards,

Bob Peterson
Red Hat File Systems

From 68ad040adc255be9d662fcf0d24820c9e7408187 Mon Sep 17 00:00:00 2001
From: Bob Peterson
Date: Fri, 16 Feb 2018 06:34:43 -0700
Subject: [PATCH] GFS2: Beginning of a debug trace point

This patch introduces a new debug trace point for GFS2 and uses it to
track a few fundamental things, like when the file system is switched
from RO to RW and back, and when files are switched from ordinary to
jdata and back.

Signed-off-by: Bob Peterson
---
 fs/gfs2/super.c      |  8 ++++++++
 fs/gfs2/trace_gfs2.h | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 03cebdef00bd3..23cb03624aff4 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -38,6 +38,7 @@
 #include "recovery.h"
 #include "rgrp.h"
 #include "super.h"
+#include "trace_gfs2.h"
 #include "trans.h"
 #include "util.h"
 #include "sys.h"
@@ -422,6 +423,7 @@ int gfs2_make_fs_rw(struct gfs2_sbd *sdp)
 	struct gfs2_log_header_host head;
 	int error;
 
+	trace_gfs2_debug(sdp->sd_trans_gl, 0, "gfs2_make_fs_rw start");
 	error = init_threads(sdp);
 	if (error)
 		return error;
@@ -464,6 +466,7 @@ fail_threads:
 	sdp->sd_quotad_process = NULL;
 	kthread_stop(sdp->sd_logd_process);
 	sdp->sd_logd_process = NULL;
+	trace_gfs2_debug(sdp->sd_trans_gl, 0, "gfs2_make_fs_rw finished");
 	return error;
 }
@@ -853,6 +856,7 @@ static int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
 	struct gfs2_holder t_gh;
 	int error;
 
+	trace_gfs2_debug(sdp->sd_trans_gl, 0, "gfs2_make_fs_ro start");
 	if (sdp->sd_quotad_process) {
 		kthread_stop(sdp->sd_quotad_process);
 		sdp->sd_quotad_process = NULL;
@@ -881,6 +885,7 @@ static int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
 	gfs2_glock_dq_uninit(&t_gh);
 	gfs2_quota_cleanup(sdp);
+	trace_gfs2_debug(sdp->sd_trans_gl, 0, "gfs2_make_fs_ro finished.");
 	return 0;
 }
@@ -896,6 +901,7 @@ static void gfs2_put_super(struct super_block *sb)
 	int error;
 	struct gfs2_jdesc *jd;
 
+	trace_gfs2_debug(sdp->sd_trans_gl, 0, "put_super begin");
 	/* No more recovery requests */
 	set_bit(SDF_NORECOVERY, &sdp->sd_flags);
 	smp_mb();
@@ -914,6 +920,7 @@ restart:
 	spin_unlock(&sdp->sd_jindex_spin);
 
 	if (!(sb->s_flags & MS_RDONLY)) {
+		trace_gfs2_debug(sdp->sd_trans_gl, 0, "put_super rw");
 		error = gfs2_make_fs_ro(sdp);
 		if (error)
 			gfs2_io_error(sdp);
@@ -1228,6 +1235,7 @@ static int gfs2_remount_fs(struct super_block *sb, int *flags, char *data)
 		*flags |= MS_RDONLY;
 
 	if ((sb->s_flags ^ *flags) & MS_RDONLY) {
+		trace_gfs2_debug(sdp->sd_trans_gl, 0, "remount fs");
 		if (*flags & MS_RDONLY)
 			error = gfs2_make_fs_ro(sdp);
 		else
diff --git a/fs/gfs2/trace_gfs2.h b/fs/gfs2/trace_gfs2.h
index d1de2edc27050..9095ed8479fb4 100644
--- a/fs/gfs2/trace_gfs2.h
+++ b/fs/gfs2/trace_gfs2.h
@@ -613,6 +613,38 @@ TRACE_EVENT(gfs2_rs,
 	      rs_func_name(__entry->func), (unsigned long)__entry->free)
 );
 
+/* Generic debug messages */
+TRACE_EVENT(gfs2_debug,
+
+	TP_PROTO(struct gfs2_glock *gl, int ret, const char *msg),
+
+	TP_ARGS(gl, ret, msg),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,	dev	)
+		__field(	u64,	glnum	)
+		__field(	u32,	gltype	)
+		__field(	int,	ret	)
+		__field(	int,	nrpages	)
+		__string(	msg,	msg	)
+	),
+
+	TP_fast_assign(
+		__entry->dev	= gl ? gl->gl_name.ln_sbd->sd_vfs->s_dev : 0;
+		__entry->glnum	= gl ? gl->gl_name.ln_number : 0;
+		__entry->gltype	= gl ? gl->gl_name.ln_type : 0;
+		__entry->ret	= ret;
+		__entry->nrpages = gl ? (gfs2_glock2aspace(gl) ?
+					gfs2_glock2aspace(gl)->nrpages : 0) : 0;
+		__assign_str(msg, msg);
+	),
+
+	TP_printk("%u,%u %u:%llx rc:%d pgs: %d %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->gltype,
+		  (unsigned long long)__entry->glnum, __entry->ret,
+		  __entry->nrpages, __get_str(msg))
+);
+
 #endif /* _TRACE_GFS2_H */
 
 /* This part must be outside protection */
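[Editorial note: a trace point added this way is driven through the standard tracefs interface. A hedged sketch of a typical session; it assumes the default tracefs mount point, requires root and a kernel carrying the patch, and the gfs2_debug event only exists with the patch applied.]

```shell
# Enable just the gfs2_debug event and watch it fire.
mount -t tracefs nodev /sys/kernel/tracing 2>/dev/null || true
cd /sys/kernel/tracing
echo 1 > events/gfs2/gfs2_debug/enable
echo 1 > tracing_on

# ... exercise the filesystem (mount, remount ro/rw, etc.) ...

cat trace    # per the TP_printk above, lines look roughly like:
             #   "253,3 1:1a2b rc:0 pgs: 4 put_super begin"
echo 0 > events/gfs2/gfs2_debug/enable
```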
Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
Hi Ross,

- Original Message -
> Do you have any suggestions for tracking down the root cause?

One time I had a similar problem in RHEL7 and couldn't use kernel tracing because there were millions of glocks involved; the trace was too huge and quickly swamped the biggest possible kernel trace buffer. So I ended up writing the ugly, hacky patch that's attached. Perhaps you can use it as a starting point.

The idea is: every time there's a get or a put on a glock, it saves off a 1-byte identifier of which function did the get/put. It saves it in a new 64-byte field kept for each glock, which of course makes the slab much bigger, but it was never meant to be shipped, right? Then, when the problem occurs, it dumps out the problematic glock, including the 64-byte get/put history. I would then go through it and identify the history of what went wrong.

Since this is a fairly old (2015) patch that targets an old RHEL7, it will obviously need a lot of updating to get it to work, but it might work better than kernel tracing, depending on how many glocks are involved in your test.
Regards,

Bob Peterson
Red Hat File Systems

From 01f7c8b9ef44280be90eabb44847d27db68dec32 Mon Sep 17 00:00:00 2001
From: Bob Peterson
Date: Fri, 4 Sep 2015 15:06:47 -0500
Subject: [PATCH] GFS2: Instrumentation to track down glock get/puts out of sync

---
 fs/gfs2/file.c       |   4 +-
 fs/gfs2/glock.c      | 198 ++-
 fs/gfs2/glock.h      |   6 +-
 fs/gfs2/glops.c      |   3 +
 fs/gfs2/incore.h     |   5 ++
 fs/gfs2/inode.c      |  16 ++--
 fs/gfs2/ops_fstype.c |  16 ++--
 fs/gfs2/quota.c      |   8 +--
 fs/gfs2/recovery.c   |   2 +-
 fs/gfs2/rgrp.c       |  10 +--
 fs/gfs2/super.c      |   6 +-
 fs/gfs2/sys.c        |   4 +-
 12 files changed, 207 insertions(+), 71 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index af0d418da809..11c02ff155ed 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1086,11 +1086,11 @@ static int do_flock(struct file *file, int cmd, struct file_lock *fl)
 		gfs2_holder_reinit(state, flags, fl_gh);
 	} else {
 		error = gfs2_glock_get(GFS2_SB(&ip->i_inode), ip->i_no_addr,
-				       &gfs2_flock_glops, CREATE, &gl);
+				       &gfs2_flock_glops, CREATE, &gl, 10);
 		if (error)
 			goto out;
 		gfs2_holder_init(gl, state, flags, fl_gh);
-		gfs2_glock_put(gl);
+		gfs2_glock_put(gl, 67);
 	}
 	for (sleeptime = 1; sleeptime <= 4; sleeptime <<= 1) {
 		error = gfs2_glock_nq(fl_gh);
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 1c186f9ec4d1..303fed6af984 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -98,10 +98,13 @@ void gfs2_glock_free(struct gfs2_glock *gl)
  *
  */
 
-static void gfs2_glock_hold(struct gfs2_glock *gl)
+static void gfs2_glock_hold(struct gfs2_glock *gl, int caller)
 {
 	GLOCK_BUG_ON(gl, __lockref_is_dead(&gl->gl_lockref));
 	lockref_get(&gl->gl_lockref);
+	gl->gl_getputhist[gl->gl_getputs++] = caller;
+	if (gl->gl_getputs > GETPUTHIST)
+		gl->gl_getputs = 0;
 }
 
 /**
@@ -161,11 +164,18 @@ static void gfs2_glock_remove_from_lru(struct gfs2_glock *gl)
  *
  */
 
-void gfs2_glock_put(struct gfs2_glock *gl)
+void gfs2_glock_put(struct gfs2_glock *gl, int caller)
 {
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	struct address_space *mapping = gfs2_glock2aspace(gl);
 
+	if (caller == 60 && gl->gl_getputs &&
+	    gl->gl_getputhist[gl->gl_getputs - 1] == 4)
+		gl->gl_getputs--;
+	else
+		gl->gl_getputhist[gl->gl_getputs++] = caller;
+	if (gl->gl_getputs > GETPUTHIST)
+		gl->gl_getputs = 0;
 	if (lockref_put_or_lock(&gl->gl_lockref))
 		return;
@@ -328,10 +338,17 @@ static void state_change(struct gfs2_glock *gl, unsigned int new_state)
 
 	if (held1 != held2) {
 		GLOCK_BUG_ON(gl, __lockref_is_dead(&gl->gl_lockref));
-		if (held2)
+		if (held2) {
 			gl->gl_lockref.count++;
-		else
+			gl->gl_getputhist[gl->gl_getputs++] = 1;
+			if (gl->gl_getputs > GETPUTHIST)
+				gl->gl_getputs = 0;
+		} else {
 			gl->gl_lockref.count--;
+			gl->gl_getputhist[gl->gl_getputs++] = 51;
+			if (gl->gl_getputs > GETPUTHIST)
+				gl->gl_getputs = 0;
+		}
 	}
 	if (held1 && held2 && list_empty(&gl->gl_holders))
 		clear_bit(GLF_QUEUED, &gl->gl_flags);
@@ -476,7 +493,7 @@ __acquires(&gl->gl_spin)
 		glops->go_inval(gl, target == LM_ST_DEFERRED ? 0 : DIO_METADATA);
 	clear_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags);
 
-	gfs2_glock_hold(gl);
+	gfs2_glock_hold(gl, 2);
 	if (sdp->sd_lockstruct.ls_ops->lm_lock) {
 		/* lock_dlm */
 		ret = sdp->sd_lockstruct.ls_ops->lm_lock(gl, target, lck_flags);
@@ -485,7 +502,7 @@ __acquires(&gl->gl_spin)
 			   test_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags)) {
 			finish_xmote(gl, target);
 			if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
-				gfs2_glock_put(gl);
+				gfs2_glock_put(gl, 54);
 		} else if (ret) {
 			printk(KERN_ERR "GFS2: lm_lock ret %d\n", ret);
@@ -494,7 +511,9 @@ __acquires(&gl->gl_spin)
 	} else { /* lock_nolock */
 		finish
Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
Hi Ross,

- Original Message -
(snip)
> We haven't observed any problems that can be directly attributed to this
> without KASAN, although it is hard to tell what a stray write may do. We
> have hit sporadic asserts and filesystem corruption during testing.
>
> When I added tracing, the time between freeing a glock and writing to it
> varied but could be up to hundreds of milliseconds, so I would guess that
> this could easily happen without KASAN. It is relatively easy to
> reproduce in our test environment.
>
> Do you have any suggestions for tracking down the root cause?

In the past, I've debugged problems with glock reference counting using kernel tracing and instrumentation. Unfortunately, the "glock_put" trace point only shows you when the glock ref count goes to 0; it doesn't show when or how the glock is first created, and so it doesn't show whether it's created and destroyed multiple times. That's often important to figuring these out; otherwise it's just a lot of chaos.

In the past, I've added my own temporary kernel trace point for when new glocks are created and called it "glock_new". You probably also want to modify the glock put functions, such as gfs2_glock_put and gfs2_glock_queue_put, to call a trace point so you can tell that too, and have it save off the gl_lockref reference count in the trace. Then recreate the problem with the trace running.

I attached a script I often use for these purposes. The script contains several bogus trace point references for various sets of temporary trace points I've added and deleted over the years, like a generic "debug" trace point where I can add generic messages of what's happening. So don't be surprised if you get errors about trying to cat values into non-existent debugfs files; just ignore them. The script DOES contain a trigger for a "glock_new" trace point for just this purpose.

I can try to dig out whether I still have that trace point (glock_new) and the generic debug trace point lying around somewhere in my many git repositories, but it might take longer than just writing them again from scratch. I know it pre-dates the concept of a "queued_put", so things will need to be tweaked anyway.

The script has a bunch of declares at the top for which trace points to monitor and collect. I modified it for glock_new and glock_put, but you can play with it. To run the script and collect the trace, just do this:

./gfs2trace.sh &
(recreate the problem)
rm /var/run/gfs2-tracepoints.pid

Removing that file triggers the trace script to stop tracing and save the results to a file in /tmp/ named after the machine's name (so we can keep them straight in clustered situations).

Then, of course, someone needs to analyze the resulting trace file and figure out where the count is getting off.

I hope this helps.

Regards,

Bob Peterson
Red Hat File Systems

gfs2trace.sh
Description: application/shellscript
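[Editorial note: the stop-on-pidfile-removal convention the script uses is easy to replicate. Below is a hedged, self-contained demo of just that pattern; the file names are placeholders, it uses a fake trace buffer instead of tracefs, and it is not the actual gfs2trace.sh attachment, which is not reproduced in this archive.]

```shell
#!/bin/sh
# Demo of the stop-on-pidfile-removal pattern: trace until the pid file
# is removed, then snapshot the buffer to an output file.
workdir=$(mktemp -d)
PIDFILE=$workdir/gfs2-tracepoints.pid
TRACEBUF=$workdir/trace          # stand-in for /sys/kernel/tracing/trace
OUT=$workdir/$(hostname).trace   # the real script writes under /tmp/

echo "glock_new 5:1a2b" > "$TRACEBUF"   # fake accumulated trace data
echo $$ > "$PIDFILE"

( sleep 1; rm -f "$PIDFILE" ) &  # simulates the user's "rm ...pid"

while [ -e "$PIDFILE" ]; do      # trace collection would run here
    sleep 1
done
cat "$TRACEBUF" > "$OUT"         # pid file gone: snapshot and stop
echo "trace saved to $OUT"
```

Keying the shutdown to a file removal rather than a signal makes it trivial to stop collection from any shell, or from a cluster-wide command.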
Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
On 1/31/19 5:18 PM, Andreas Gruenbacher wrote:
> Hi Ross,
>
> On Thu, 31 Jan 2019 at 11:56, Ross Lagerwall wrote:
>> Each gfs2_bufdata stores a reference to a glock but the reference count
>> isn't incremented. This causes an occasional use-after-free of the
>> glock. Fix by taking a reference on the glock during allocation and
>> dropping it when freeing.
>>
>> Found by KASAN:
>>
>> BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>> Write of size 4 at addr 88801aff6134 by task kworker/0:2H/20371
>>
>> CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
>> Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
>> Workqueue: glock_workqueue glock_work_func [gfs2]
>> Call Trace:
>>  dump_stack+0x71/0xab
>>  print_address_description+0x6a/0x270
>>  kasan_report+0x258/0x380
>>  ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>>  revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>>  gfs2_log_flush+0x511/0xa70 [gfs2]
>>  ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
>>  ? __brelse+0x48/0x50
>>  ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
>>  ? gfs2_trans_end+0x18d/0x340 [gfs2]
>>  gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
>>  ? inode_go_dump+0xe0/0xe0 [gfs2]
>>  ? inode_go_sync+0xe4/0x220 [gfs2]
>>  inode_go_sync+0xe4/0x220 [gfs2]
>>  do_xmote+0x12b/0x290 [gfs2]
>>  glock_work_func+0x6f/0x160 [gfs2]
>>  process_one_work+0x461/0x790
>>  worker_thread+0x69/0x6b0
>>  ? process_one_work+0x790/0x790
>>  kthread+0x1ae/0x1d0
>>  ? kthread_create_worker_on_cpu+0xc0/0xc0
>>  ret_from_fork+0x22/0x40
>
> thanks for tracking this down, very interesting.
>
> The consistency model here is that every buffer head that a struct
> gfs2_bufdata object is attached to is protected by a glock. Before a
> glock can be released, all the buffers under that glock have to be
> flushed out and released; this is what allows another node to access
> the same on-disk location without causing inconsistencies. When there
> is a bufdata object that points to a glock that has already been
> freed, this consistency model is broken.
>
> Taking an additional refcount as this patch does may make the
> use-after-free go away, but it doesn't fix the underlying problem. So
> I think we'll need a different fix here.

Yes, I kind of suspected that this is papering over the problem rather than fixing the root cause.

> Did you observe this problem in a real-world scenario, or with KASAN
> only? It might be that we're looking at a small race that is unlikely
> to trigger in the field. In any case, I think we need to understand
> better what's actually going on.

We haven't observed any problems that can be directly attributed to this without KASAN, although it is hard to tell what a stray write may do. We have hit sporadic asserts and filesystem corruption during testing.

When I added tracing, the time between freeing a glock and writing to it varied but could be up to hundreds of milliseconds, so I would guess that this could easily happen without KASAN. It is relatively easy to reproduce in our test environment.

Do you have any suggestions for tracking down the root cause?

Thanks,
--
Ross Lagerwall
Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
Hi Ross, On Thu, 31 Jan 2019 at 11:56, Ross Lagerwall wrote: > Each gfs2_bufdata stores a reference to a glock but the reference count > isn't incremented. This causes an occasional use-after-free of the > glock. Fix by taking a reference on the glock during allocation and > dropping it when freeing. > > Found by KASAN: > > BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2] > Write of size 4 at addr 88801aff6134 by task kworker/0:2H/20371 > > CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1 > Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010 > Workqueue: glock_workqueue glock_work_func [gfs2] > Call Trace: > dump_stack+0x71/0xab > print_address_description+0x6a/0x270 > kasan_report+0x258/0x380 > ? revoke_lo_after_commit+0x8e/0xe0 [gfs2] > revoke_lo_after_commit+0x8e/0xe0 [gfs2] > gfs2_log_flush+0x511/0xa70 [gfs2] > ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2] > ? __brelse+0x48/0x50 > ? gfs2_log_commit+0x4de/0x6e0 [gfs2] > ? gfs2_trans_end+0x18d/0x340 [gfs2] > gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2] > ? inode_go_dump+0xe0/0xe0 [gfs2] > ? inode_go_sync+0xe4/0x220 [gfs2] > inode_go_sync+0xe4/0x220 [gfs2] > do_xmote+0x12b/0x290 [gfs2] > glock_work_func+0x6f/0x160 [gfs2] > process_one_work+0x461/0x790 > worker_thread+0x69/0x6b0 > ? process_one_work+0x790/0x790 > kthread+0x1ae/0x1d0 > ? kthread_create_worker_on_cpu+0xc0/0xc0 > ret_from_fork+0x22/0x40 thanks for tracking this down, very interesting. The consistency model here is that every buffer head that a struct gfs2_bufdata object is attached to is protected by a glock. Before a glock can be released, all the buffers under that glock have to be flushed out and released; this is what allows another node to access the same on-disk location without causing inconsistencies. When there is a bufdata object that points to a glock that has already been freed, this consistency model is broken. 
Taking an additional refcount as this patch does may make the use-after-free go away, but it doesn't fix the underlying problem. So I think we'll need a different fix here.

Did you observe this problem in a real-world scenario, or with KASAN only? It might be that we're looking at a small race that is unlikely to trigger in the field. In any case, I think we need to understand better what's actually going on.

Thanks,
Andreas
Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
----- Original Message -----
> Each gfs2_bufdata stores a reference to a glock but the reference count
> isn't incremented. This causes an occasional use-after-free of the
> glock. Fix by taking a reference on the glock during allocation and
> dropping it when freeing.
>
> Found by KASAN:
>
> BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
> Write of size 4 at addr 88801aff6134 by task kworker/0:2H/20371
>
> CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
> Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
> Workqueue: glock_workqueue glock_work_func [gfs2]
> Call Trace:
>  dump_stack+0x71/0xab
>  print_address_description+0x6a/0x270
>  kasan_report+0x258/0x380
>  ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>  revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>  gfs2_log_flush+0x511/0xa70 [gfs2]
>  ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
>  ? __brelse+0x48/0x50
>  ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
>  ? gfs2_trans_end+0x18d/0x340 [gfs2]
>  gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
>  ? inode_go_dump+0xe0/0xe0 [gfs2]
>  ? inode_go_sync+0xe4/0x220 [gfs2]
>  inode_go_sync+0xe4/0x220 [gfs2]
>  do_xmote+0x12b/0x290 [gfs2]
>  glock_work_func+0x6f/0x160 [gfs2]
>  process_one_work+0x461/0x790
>  worker_thread+0x69/0x6b0
>  ? process_one_work+0x790/0x790
>  kthread+0x1ae/0x1d0
>  ? kthread_create_worker_on_cpu+0xc0/0xc0
>  ret_from_fork+0x22/0x40
>
> Allocated by task 20805:
>  kasan_kmalloc+0xa0/0xd0
>  kmem_cache_alloc+0xb5/0x1b0
>  gfs2_glock_get+0x14b/0x620 [gfs2]
>  gfs2_inode_lookup+0x20c/0x640 [gfs2]
>  gfs2_dir_search+0x150/0x180 [gfs2]
>  gfs2_lookupi+0x272/0x360 [gfs2]
>  __gfs2_lookup+0x8b/0x1d0 [gfs2]
>  gfs2_atomic_open+0x77/0x100 [gfs2]
>  path_openat+0x1454/0x1c10
>  do_filp_open+0x124/0x1d0
>  do_sys_open+0x213/0x2c0
>  do_syscall_64+0x69/0x160
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Freed by task 0:
>  __kasan_slab_free+0x130/0x180
>  kmem_cache_free+0x78/0x1e0
>  rcu_process_callbacks+0x2ad/0x6c0
>  __do_softirq+0x111/0x38c
>
> The buggy address belongs to the object at 88801aff6040
> which belongs to the cache gfs2_glock(aspace) of size 560
> The buggy address is located 244 bytes inside of
> 560-byte region [88801aff6040, 88801aff6270)
> ...
>
> Signed-off-by: Ross Lagerwall
> ---

This makes sense to me.

Regards,

Bob Peterson
Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
Hi,

On 31/01/2019 10:55, Ross Lagerwall wrote:

Each gfs2_bufdata stores a reference to a glock but the reference count
isn't incremented. This causes an occasional use-after-free of the
glock. Fix by taking a reference on the glock during allocation and
dropping it when freeing.

Another good bit of debugging. It would be nice if we can (longer term) avoid using the ref count though, since that will have some overhead, but for the time being, the correctness is the important thing,

Steve.

Found by KASAN:

BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
Write of size 4 at addr 88801aff6134 by task kworker/0:2H/20371

CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
Workqueue: glock_workqueue glock_work_func [gfs2]
Call Trace:
 dump_stack+0x71/0xab
 print_address_description+0x6a/0x270
 kasan_report+0x258/0x380
 ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
 revoke_lo_after_commit+0x8e/0xe0 [gfs2]
 gfs2_log_flush+0x511/0xa70 [gfs2]
 ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
 ? __brelse+0x48/0x50
 ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
 ? gfs2_trans_end+0x18d/0x340 [gfs2]
 gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
 ? inode_go_dump+0xe0/0xe0 [gfs2]
 ? inode_go_sync+0xe4/0x220 [gfs2]
 inode_go_sync+0xe4/0x220 [gfs2]
 do_xmote+0x12b/0x290 [gfs2]
 glock_work_func+0x6f/0x160 [gfs2]
 process_one_work+0x461/0x790
 worker_thread+0x69/0x6b0
 ? process_one_work+0x790/0x790
 kthread+0x1ae/0x1d0
 ? kthread_create_worker_on_cpu+0xc0/0xc0
 ret_from_fork+0x22/0x40

Allocated by task 20805:
 kasan_kmalloc+0xa0/0xd0
 kmem_cache_alloc+0xb5/0x1b0
 gfs2_glock_get+0x14b/0x620 [gfs2]
 gfs2_inode_lookup+0x20c/0x640 [gfs2]
 gfs2_dir_search+0x150/0x180 [gfs2]
 gfs2_lookupi+0x272/0x360 [gfs2]
 __gfs2_lookup+0x8b/0x1d0 [gfs2]
 gfs2_atomic_open+0x77/0x100 [gfs2]
 path_openat+0x1454/0x1c10
 do_filp_open+0x124/0x1d0
 do_sys_open+0x213/0x2c0
 do_syscall_64+0x69/0x160
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 0:
 __kasan_slab_free+0x130/0x180
 kmem_cache_free+0x78/0x1e0
 rcu_process_callbacks+0x2ad/0x6c0
 __do_softirq+0x111/0x38c

The buggy address belongs to the object at 88801aff6040
which belongs to the cache gfs2_glock(aspace) of size 560
The buggy address is located 244 bytes inside of
560-byte region [88801aff6040, 88801aff6270)
...

Signed-off-by: Ross Lagerwall
---
 fs/gfs2/aops.c    | 3 +--
 fs/gfs2/lops.c    | 2 +-
 fs/gfs2/meta_io.c | 2 +-
 fs/gfs2/trans.c   | 9 ++++++++-
 fs/gfs2/trans.h   | 2 ++
 5 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 05dd78f4b2b3..8c2b572a7fb1 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -868,9 +868,8 @@ int gfs2_releasepage(struct page *page, gfp_t gfp_mask)
 			gfs2_assert_warn(sdp, bd->bd_bh == bh);
 			if (!list_empty(&bd->bd_list))
 				list_del_init(&bd->bd_list);
-			bd->bd_bh = NULL;
 			bh->b_private = NULL;
-			kmem_cache_free(gfs2_bufdata_cachep, bd);
+			gfs2_free_bufdata(bd);
 		}

 		bh = bh->b_this_page;
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 94dcab655bc0..f40be71677d1 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -847,7 +847,7 @@ static void revoke_lo_after_commit(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
 		gl = bd->bd_gl;
 		atomic_dec(&gl->gl_revokes);
 		clear_bit(GLF_LFLUSH, &gl->gl_flags);
-		kmem_cache_free(gfs2_bufdata_cachep, bd);
+		gfs2_free_bufdata(bd);
 	}
 }
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index be9c0bf697fe..868caa0eb104 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -355,7 +355,7 @@ void gfs2_remove_from_journal(struct buffer_head *bh, int meta)
 			gfs2_trans_add_revoke(sdp, bd);
 		} else if (was_pinned) {
 			bh->b_private = NULL;
-			kmem_cache_free(gfs2_bufdata_cachep, bd);
+			gfs2_free_bufdata(bd);
 		}
 		spin_unlock(&sdp->sd_ail_lock);
 	}
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index cd9a94a6b5bb..423cbee8fa08 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -133,9 +133,16 @@ static struct gfs2_bufdata *gfs2_alloc_bufdata(struct gfs2_glock *gl,
 	bd->bd_gl = gl;
 	INIT_LIST_HEAD(&bd->bd_list);
 	bh->b_private = bd;
+	gfs2_glock_hold(gl);
 	return bd;
 }

+void gfs2_free_bufdata(struct gfs2_bufdata *bd)
+{
+	gfs2_glock_put(bd->bd_gl);
+	kmem_cache_free(gfs2_bufdata_cachep, bd);
+}
+
 /**
  * gfs2_trans_add_data - Add a databuf to the transaction.
  * @gl: The inode glock associated with the buffer
@@ -265,7 +272,7 @@ void gfs2_trans_add_unrevoke(struc