[Cluster-devel] [PATCH net-next v4 12/12] sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW

2019-02-01 Thread Deepa Dinamani
Add new socket timeout options that are y2038 safe.

Signed-off-by: Deepa Dinamani 
Cc: ccaul...@redhat.com
Cc: da...@davemloft.net
Cc: del...@gmx.de
Cc: pau...@samba.org
Cc: r...@linux-mips.org
Cc: r...@twiddle.net
Cc: cluster-devel@redhat.com
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-al...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
---
 arch/alpha/include/uapi/asm/socket.h  | 12 +--
 arch/mips/include/uapi/asm/socket.h   | 11 --
 arch/parisc/include/uapi/asm/socket.h | 10 --
 arch/sparc/include/uapi/asm/socket.h  | 11 --
 include/uapi/asm-generic/socket.h | 11 --
 net/core/sock.c   | 49 ---
 6 files changed, 81 insertions(+), 23 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h 
b/arch/alpha/include/uapi/asm/socket.h
index 9826d1db71d0..0d0fddb7e738 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -119,19 +119,25 @@
 #define SO_TIMESTAMPNS_NEW  64
 #define SO_TIMESTAMPING_NEW 65
 
-#if !defined(__KERNEL__)
+#define SO_RCVTIMEO_NEW 66
+#define SO_SNDTIMEO_NEW 67
 
-#defineSO_RCVTIMEO SO_RCVTIMEO_OLD
-#defineSO_SNDTIMEO SO_SNDTIMEO_OLD
+#if !defined(__KERNEL__)
 
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP   SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
 #define SO_TIMESTAMPING SO_TIMESTAMPING_OLD
+
+#define SO_RCVTIMEOSO_RCVTIMEO_OLD
+#define SO_SNDTIMEOSO_SNDTIMEO_OLD
 #else
 #define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW)
 #define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW)
 #define SO_TIMESTAMPING (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMPING_OLD : SO_TIMESTAMPING_NEW)
+
+#define SO_RCVTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_RCVTIMEO_OLD : SO_RCVTIMEO_NEW)
+#define SO_SNDTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_SNDTIMEO_OLD : SO_SNDTIMEO_NEW)
 #endif
 
 #define SCM_TIMESTAMP   SO_TIMESTAMP
diff --git a/arch/mips/include/uapi/asm/socket.h 
b/arch/mips/include/uapi/asm/socket.h
index 96cc0e907f12..eb9f33f8a8b3 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -130,18 +130,25 @@
 #define SO_TIMESTAMPNS_NEW  64
 #define SO_TIMESTAMPING_NEW 65
 
+#define SO_RCVTIMEO_NEW 66
+#define SO_SNDTIMEO_NEW 67
+
 #if !defined(__KERNEL__)
 
-#defineSO_RCVTIMEO SO_RCVTIMEO_OLD
-#defineSO_SNDTIMEO SO_SNDTIMEO_OLD
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP   SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
 #define SO_TIMESTAMPINGSO_TIMESTAMPING_OLD
+
+#define SO_RCVTIMEO SO_RCVTIMEO_OLD
+#define SO_SNDTIMEO SO_SNDTIMEO_OLD
 #else
 #define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW)
 #define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW)
 #define SO_TIMESTAMPING (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMPING_OLD : SO_TIMESTAMPING_NEW)
+
+#define SO_RCVTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_RCVTIMEO_OLD : SO_RCVTIMEO_NEW)
+#define SO_SNDTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_SNDTIMEO_OLD : SO_SNDTIMEO_NEW)
 #endif
 
 #define SCM_TIMESTAMP   SO_TIMESTAMP
diff --git a/arch/parisc/include/uapi/asm/socket.h 
b/arch/parisc/include/uapi/asm/socket.h
index 046f0cd9cce4..16e428f03526 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -111,18 +111,24 @@
 #define SO_TIMESTAMPNS_NEW  0x4039
 #define SO_TIMESTAMPING_NEW 0x403A
 
+#define SO_RCVTIMEO_NEW 0x4040
+#define SO_SNDTIMEO_NEW 0x4041
+
 #if !defined(__KERNEL__)
 
-#defineSO_RCVTIMEO SO_RCVTIMEO_OLD
-#defineSO_SNDTIMEO SO_SNDTIMEO_OLD
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP   SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
 #define SO_TIMESTAMPING SO_TIMESTAMPING_OLD
+#define SO_RCVTIMEOSO_RCVTIMEO_OLD
+#define SO_SNDTIMEOSO_SNDTIMEO_OLD
 #else
 #define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW)
 #define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW)
 #define SO_TIMESTAMPING (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMPING_OLD : SO_TIMESTAMPING_NEW)
+
+#define SO_RCVTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_RCVTIMEO_OLD : SO_RCVTIMEO_NEW)
+#define SO_SNDTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_SNDTIMEO_OLD : SO_SNDTIMEO_NEW)
 #endif
 
 #define SCM_TIMESTAMP   SO_T

[Cluster-devel] [PATCH net-next v4 11/12] socket: Rename SO_RCVTIMEO/ SO_SNDTIMEO with _OLD suffixes

2019-02-01 Thread Deepa Dinamani
SO_RCVTIMEO and SO_SNDTIMEO socket options use struct timeval
as the time format. struct timeval is not y2038 safe.
The subsequent patches in the series add support for new socket
timeout options with _NEW suffix that will use y2038 safe
data structures. Although the existing struct timeval layout
is sufficiently wide to represent timeouts, because of the way
libc will interpret time_t based on user defined flag, these
new flags provide a way of having a structure that is the same
for all architectures consistently.
Rename the existing options with _OLD suffix forms so that the
right option is enabled for userspace applications according
to the architecture and time_t definition of libc.

Signed-off-by: Deepa Dinamani 
Cc: ccaul...@redhat.com
Cc: del...@gmx.de
Cc: pau...@samba.org
Cc: r...@linux-mips.org
Cc: r...@twiddle.net
Cc: cluster-devel@redhat.com
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-al...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
---
 arch/alpha/include/uapi/asm/socket.h   | 7 +--
 arch/mips/include/uapi/asm/socket.h| 6 --
 arch/parisc/include/uapi/asm/socket.h  | 6 --
 arch/powerpc/include/uapi/asm/socket.h | 4 ++--
 arch/sparc/include/uapi/asm/socket.h   | 7 +--
 fs/dlm/lowcomms.c  | 4 ++--
 include/uapi/asm-generic/socket.h  | 6 --
 net/core/sock.c| 8 
 8 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h 
b/arch/alpha/include/uapi/asm/socket.h
index 934ea6268f1a..9826d1db71d0 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -31,8 +31,8 @@
 #define SO_RCVBUFFORCE 0x100b
 #defineSO_RCVLOWAT 0x1010
 #defineSO_SNDLOWAT 0x1011
-#defineSO_RCVTIMEO 0x1012
-#defineSO_SNDTIMEO 0x1013
+#defineSO_RCVTIMEO_OLD 0x1012
+#defineSO_SNDTIMEO_OLD 0x1013
 #define SO_ACCEPTCONN  0x1014
 #define SO_PROTOCOL0x1028
 #define SO_DOMAIN  0x1029
@@ -121,6 +121,9 @@
 
 #if !defined(__KERNEL__)
 
+#defineSO_RCVTIMEO SO_RCVTIMEO_OLD
+#defineSO_SNDTIMEO SO_SNDTIMEO_OLD
+
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP   SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
diff --git a/arch/mips/include/uapi/asm/socket.h 
b/arch/mips/include/uapi/asm/socket.h
index 110f9506d64f..96cc0e907f12 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -39,8 +39,8 @@
 #define SO_RCVBUF  0x1002  /* Receive buffer. */
 #define SO_SNDLOWAT0x1003  /* send low-water mark */
 #define SO_RCVLOWAT0x1004  /* receive low-water mark */
-#define SO_SNDTIMEO0x1005  /* send timeout */
-#define SO_RCVTIMEO0x1006  /* receive timeout */
+#define SO_SNDTIMEO_OLD0x1005  /* send timeout */
+#define SO_RCVTIMEO_OLD0x1006  /* receive timeout */
 #define SO_ACCEPTCONN  0x1009
 #define SO_PROTOCOL0x1028  /* protocol type */
 #define SO_DOMAIN  0x1029  /* domain/socket family */
@@ -132,6 +132,8 @@
 
 #if !defined(__KERNEL__)
 
+#defineSO_RCVTIMEO SO_RCVTIMEO_OLD
+#defineSO_SNDTIMEO SO_SNDTIMEO_OLD
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP   SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
diff --git a/arch/parisc/include/uapi/asm/socket.h 
b/arch/parisc/include/uapi/asm/socket.h
index bee2a9dde656..046f0cd9cce4 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -22,8 +22,8 @@
 #define SO_RCVBUFFORCE 0x100b
 #define SO_SNDLOWAT0x1003
 #define SO_RCVLOWAT0x1004
-#define SO_SNDTIMEO0x1005
-#define SO_RCVTIMEO0x1006
+#define SO_SNDTIMEO_OLD0x1005
+#define SO_RCVTIMEO_OLD0x1006
 #define SO_ERROR   0x1007
 #define SO_TYPE0x1008
 #define SO_PROTOCOL0x1028
@@ -113,6 +113,8 @@
 
 #if !defined(__KERNEL__)
 
+#defineSO_RCVTIMEO SO_RCVTIMEO_OLD
+#defineSO_SNDTIMEO SO_SNDTIMEO_OLD
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP   SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
diff --git a/arch/powerpc/include/uapi/asm/socket.h 
b/arch/powerpc/include/uapi/asm/socket.h
index 94de465e0920..12aa0c43e775 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -11,8 +11,8 @@
 
 #define SO_RCVLOWAT16
 #define SO_SNDLOWAT17
-#define SO_RCVTIMEO18
-#define SO_SNDTIMEO19
+#define SO_RCVTIMEO_OLD18
+#define SO_SNDTIMEO_OLD19
 #define SO_PASSCRED20
 #define SO_PEERCRED21
 
diff --git a/arch/sparc/include/uapi/asm/socket.h 
b/arch/sparc/include/uapi/asm/socket.h
index 2b38dda51426..342ffdc3b424 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -21,8 +21,8 @@
 #define SO_BSDCOMPAT0x0400

Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free (Another debug patch)

2019-02-01 Thread Bob Peterson
Hi Ross,

- Original Message -
> Do you have any suggestions for tracking down the root cause?

Attached is the starting point for a generic "debug" kernel trace point,
complete with examples of how it's used. It's always associated with a
particular glock. You might find it helpful, but again, it's old, so
it'll probably need to be tweaked for an upstream kernel.

Regards,

Bob Peterson
Red Hat File Systems
From 68ad040adc255be9d662fcf0d24820c9e7408187 Mon Sep 17 00:00:00 2001
From: Bob Peterson 
Date: Fri, 16 Feb 2018 06:34:43 -0700
Subject: [PATCH] GFS2: Beginning of a debug trace point

This patch introduces a new debug trace point for GFS2 and uses it
to track a few fundamental things, like when the file system is
switched from RO to RW and back, and when files are switched from
ordinary to jdata and back.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/super.c  |  8 
 fs/gfs2/trace_gfs2.h | 32 
 2 files changed, 40 insertions(+)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 03cebdef00bd3..23cb03624aff4 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -38,6 +38,7 @@
 #include "recovery.h"
 #include "rgrp.h"
 #include "super.h"
+#include "trace_gfs2.h"
 #include "trans.h"
 #include "util.h"
 #include "sys.h"
@@ -422,6 +423,7 @@ int gfs2_make_fs_rw(struct gfs2_sbd *sdp)
 	struct gfs2_log_header_host head;
 	int error;
 
+	trace_gfs2_debug(sdp->sd_trans_gl, 0, "gfs2_make_fs_rw start");
 	error = init_threads(sdp);
 	if (error)
 		return error;
@@ -464,6 +466,7 @@ fail_threads:
 	sdp->sd_quotad_process = NULL;
 	kthread_stop(sdp->sd_logd_process);
 	sdp->sd_logd_process = NULL;
+	trace_gfs2_debug(sdp->sd_trans_gl, 0, "gfs2_make_fs_rw finished");
 	return error;
 }
 
@@ -853,6 +856,7 @@ static int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
 	struct gfs2_holder t_gh;
 	int error;
 
+	trace_gfs2_debug(sdp->sd_trans_gl, 0, "gfs2_make_fs_ro start");
 	if (sdp->sd_quotad_process) {
 		kthread_stop(sdp->sd_quotad_process);
 		sdp->sd_quotad_process = NULL;
@@ -881,6 +885,7 @@ static int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
 		gfs2_glock_dq_uninit(&t_gh);
 
 	gfs2_quota_cleanup(sdp);
+	trace_gfs2_debug(sdp->sd_trans_gl, 0, "gfs2_make_fs_ro finished.");
 	return 0;
 }
 
@@ -896,6 +901,7 @@ static void gfs2_put_super(struct super_block *sb)
 	int error;
 	struct gfs2_jdesc *jd;
 
+	trace_gfs2_debug(sdp->sd_trans_gl, 0, "put_super begin");
 	/* No more recovery requests */
 	set_bit(SDF_NORECOVERY, &sdp->sd_flags);
 	smp_mb();
@@ -914,6 +920,7 @@ restart:
 	spin_unlock(&sdp->sd_jindex_spin);
 
 	if (!(sb->s_flags & MS_RDONLY)) {
+		trace_gfs2_debug(sdp->sd_trans_gl, 0, "put_super rw");
 		error = gfs2_make_fs_ro(sdp);
 		if (error)
 			gfs2_io_error(sdp);
@@ -1228,6 +1235,7 @@ static int gfs2_remount_fs(struct super_block *sb, int *flags, char *data)
 		*flags |= MS_RDONLY;
 
 	if ((sb->s_flags ^ *flags) & MS_RDONLY) {
+		trace_gfs2_debug(sdp->sd_trans_gl, 0, "remount fs");
 		if (*flags & MS_RDONLY)
 			error = gfs2_make_fs_ro(sdp);
 		else
diff --git a/fs/gfs2/trace_gfs2.h b/fs/gfs2/trace_gfs2.h
index d1de2edc27050..9095ed8479fb4 100644
--- a/fs/gfs2/trace_gfs2.h
+++ b/fs/gfs2/trace_gfs2.h
@@ -613,6 +613,38 @@ TRACE_EVENT(gfs2_rs,
 		  rs_func_name(__entry->func), (unsigned long)__entry->free)
 );
 
+/* Generic debug messages */
+TRACE_EVENT(gfs2_debug,
+
+	TP_PROTO(struct gfs2_glock *gl, int ret, const char *msg),
+
+	TP_ARGS(gl, ret, msg),
+
+	TP_STRUCT__entry(
+		__field(dev_t,  dev )
+		__field(	u64,	glnum			)
+		__field(	u32,	gltype			)
+		__field(	int,	ret			)
+		__field(	int,	nrpages			)
+		__string(	msg,	msg			)
+	),
+
+	TP_fast_assign(
+		__entry->dev	= gl ? gl->gl_name.ln_sbd->sd_vfs->s_dev : 0;
+		__entry->glnum	= gl ? gl->gl_name.ln_number : 0;
+		__entry->gltype	= gl ? gl->gl_name.ln_type : 0;
+		__entry->ret = ret;
+		__entry->nrpages = gl ? (gfs2_glock2aspace(gl) ?
+	 gfs2_glock2aspace(gl)->nrpages : 0) : 0;
+		__assign_str(msg, msg);
+	),
+
+	TP_printk("%u,%u %u:%llx rc:%d pgs: %d %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->gltype,
+		  (unsigned long long)__entry->glnum, __entry->ret,
+		  __entry->nrpages, __get_str(msg))
+);
+
 #endif /* _TRACE_GFS2_H */
 
 /* This part must be outside protection */


Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free

2019-02-01 Thread Bob Peterson
Hi Ross,

- Original Message -
> Do you have any suggestions for tracking down the root cause?

One time, when I had a similar problem in rhel7, and couldn't use
kernel tracing because there were millions of glocks involved.
The trace was too huge and quickly swamped the biggest possible
kernel trace buffer. So I ended up writing this ugly, hacky patch
that's attached. Perhaps you can use it as a starting point.

The idea is: every time there's a get or a put to a glock, it
saves off a 1-byte identifier of what function did the get/put.
It saved it in a new 64-byte field kept for each glock, which of
course meant the slab became much bigger, but it was never meant
to be shipped, right?

Then, when the problem occurred, it would dump out the problematic
glock, including the 64-byte get/put history value.
Then I would go through it and identify the history of what went
wrong.

Since this is a fairly old (2015) patch that targets an old rhel7,
it will obviously need a lot of updating to get it to work, but
it might work better than the kernel tracing, depending on how
many glocks are involved in your test.

Regards,

Bob Peterson
Red Hat File Systems
From 01f7c8b9ef44280be90eabb44847d27db68dec32 Mon Sep 17 00:00:00 2001
From: Bob Peterson 
Date: Fri, 4 Sep 2015 15:06:47 -0500
Subject: [PATCH] GFS2: Instrumentation to track down glock get/puts out of
 sync

---
 fs/gfs2/file.c   |   4 +-
 fs/gfs2/glock.c  | 198 ++-
 fs/gfs2/glock.h  |   6 +-
 fs/gfs2/glops.c  |   3 +
 fs/gfs2/incore.h |   5 ++
 fs/gfs2/inode.c  |  16 ++---
 fs/gfs2/ops_fstype.c |  16 ++---
 fs/gfs2/quota.c  |   8 +--
 fs/gfs2/recovery.c   |   2 +-
 fs/gfs2/rgrp.c   |  10 +--
 fs/gfs2/super.c  |   6 +-
 fs/gfs2/sys.c|   4 +-
 12 files changed, 207 insertions(+), 71 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index af0d418da809..11c02ff155ed 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1086,11 +1086,11 @@ static int do_flock(struct file *file, int cmd, struct file_lock *fl)
 		gfs2_holder_reinit(state, flags, fl_gh);
 	} else {
 		error = gfs2_glock_get(GFS2_SB(&ip->i_inode), ip->i_no_addr,
-   &gfs2_flock_glops, CREATE, &gl);
+   &gfs2_flock_glops, CREATE, &gl, 10);
 		if (error)
 			goto out;
 		gfs2_holder_init(gl, state, flags, fl_gh);
-		gfs2_glock_put(gl);
+		gfs2_glock_put(gl, 67);
 	}
 	for (sleeptime = 1; sleeptime <= 4; sleeptime <<= 1) {
 		error = gfs2_glock_nq(fl_gh);
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 1c186f9ec4d1..303fed6af984 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -98,10 +98,13 @@ void gfs2_glock_free(struct gfs2_glock *gl)
  *
  */
 
-static void gfs2_glock_hold(struct gfs2_glock *gl)
+static void gfs2_glock_hold(struct gfs2_glock *gl, int caller)
 {
 	GLOCK_BUG_ON(gl, __lockref_is_dead(&gl->gl_lockref));
 	lockref_get(&gl->gl_lockref);
+	gl->gl_getputhist[gl->gl_getputs++] = caller;
+	if (gl->gl_getputs > GETPUTHIST)
+		gl->gl_getputs = 0;
 }
 
 /**
@@ -161,11 +164,18 @@ static void gfs2_glock_remove_from_lru(struct gfs2_glock *gl)
  *
  */
 
-void gfs2_glock_put(struct gfs2_glock *gl)
+void gfs2_glock_put(struct gfs2_glock *gl, int caller)
 {
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	struct address_space *mapping = gfs2_glock2aspace(gl);
 
+	if (caller == 60 && gl->gl_getputs &&
+	gl->gl_getputhist[gl->gl_getputs - 1] == 4)
+		gl->gl_getputs--;
+	else
+		gl->gl_getputhist[gl->gl_getputs++] = caller;
+	if (gl->gl_getputs > GETPUTHIST)
+		gl->gl_getputs = 0;
 	if (lockref_put_or_lock(&gl->gl_lockref))
 		return;
 
@@ -328,10 +338,17 @@ static void state_change(struct gfs2_glock *gl, unsigned int new_state)
 
 	if (held1 != held2) {
 		GLOCK_BUG_ON(gl, __lockref_is_dead(&gl->gl_lockref));
-		if (held2)
+		if (held2) {
 			gl->gl_lockref.count++;
-		else
+			gl->gl_getputhist[gl->gl_getputs++] = 1;
+			if (gl->gl_getputs > GETPUTHIST)
+gl->gl_getputs = 0;
+		} else {
 			gl->gl_lockref.count--;
+			gl->gl_getputhist[gl->gl_getputs++] = 51;
+			if (gl->gl_getputs > GETPUTHIST)
+gl->gl_getputs = 0;
+		}
 	}
 	if (held1 && held2 && list_empty(&gl->gl_holders))
 		clear_bit(GLF_QUEUED, &gl->gl_flags);
@@ -476,7 +493,7 @@ __acquires(&gl->gl_spin)
 		glops->go_inval(gl, target == LM_ST_DEFERRED ? 0 : DIO_METADATA);
 	clear_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags);
 
-	gfs2_glock_hold(gl);
+	gfs2_glock_hold(gl, 2);
 	if (sdp->sd_lockstruct.ls_ops->lm_lock)	{
 		/* lock_dlm */
 		ret = sdp->sd_lockstruct.ls_ops->lm_lock(gl, target, lck_flags);
@@ -485,7 +502,7 @@ __acquires(&gl->gl_spin)
 		test_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags)) {
 			finish_xmote(gl, target);
 			if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
-gfs2_glock_put(gl);
+gfs2_glock_put(gl, 54);
 		}
 		else if (ret) {
 			printk(KERN_ERR "GFS2: lm_lock ret %d\n", ret);
@@ -494,7 +511,9 @@ __acquires(&gl->gl_spin)
 	} else { /* lock_nolock */
 		finish

Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free

2019-02-01 Thread Bob Peterson
Hi Ross,

- Original Message -
(snip)
> We haven't observed any problems that can be directly attributed to this
> without KASAN, although it is hard to tell what a stray write may do. We
> have hit sporadic asserts and filesystem corruption during testing.
> 
> When I added tracing, the time between freeing a glock and writing to it
> varied but could be up to hundreds of milliseconds so I would guess that
> this could easily happen without KASAN. It is relatively easy to
> reproduce in our test environment.
> 
> Do you have any suggestions for tracking down the root cause?

In the past, I've debugged problems with glock reference counting by
using kernel tracing and instrumentation. Unfortunately, the "glock_put"
trace point only shows you when the glock ref count goes to 0, and
doesn't show when or how the glock is first created, which, of course,
doesn't show if it's created and destroyed multiple times, and often
that's important to figuring these out, otherwise it's just a lot of chaos.

In the past, I've added my own temporary kernel trace point for when new
glocks are created, and called it "glock_new." You probably also want to
modify the glock put functions, such as gfs2_glock_put and
gfs2_glock_queue_put, to call a trace point so you can tell that too, and
have it save off the gl_lockref reference count in the trace.

Then recreate the problem with the trace running. I attached a script I
often use for these purposes. The script contains several bogus trace
point references for various sets of temporary trace points I've added
and deleted over the years, like a generic "debug" trace point where I
can add generic messages of what's happening. So don't be surprised if
you get errors about trying to cat values into non-existent debugfs files.
Just ignore them. The script DOES contain a trigger for a "glock_new"
trace point for just this purpose. I can try to dig out whether I still
have that trace point (glock_new) and the generic debug trace point
lying around somewhere in my many git repositories, but it might take
longer than just writing them again from scratch. I know it pre-dates
the concept of a "queued_put" so things will need to be tweaked anyway.

The script had a bunch of declares at the top for which trace points to
monitor and collect. I modified it for glock_new and glock_put, but
you can play with it.

To run the script and collect the trace, just do this:
./gfs2trace.sh &
(recreate the problem)
rm /var/run/gfs2-tracepoints.pid

Removing that file triggers the trace script to stop tracing and save
the results to a file in /tmp/ named after the machine's name
(so we can keep them straight in clustered situations).
Then, of course, someone needs to analyze the resulting trace file and
figure out where the count is getting off. I hope this helps.

Regards,

Bob Peterson
Red Hat File Systems


gfs2trace.sh
Description: application/shellscript


[Cluster-devel] [PATCH] Retry wait_event_interruptible in event of ERESTARTSYS

2019-02-01 Thread Mark Syms
Signed-off-by: Mark Syms 
---
 fs/dlm/lockspace.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/dlm/lockspace.c b/fs/dlm/lockspace.c
index db43b98..7aae23a 100644
--- a/fs/dlm/lockspace.c
+++ b/fs/dlm/lockspace.c
@@ -210,8 +210,10 @@ static int do_uevent(struct dlm_ls *ls, int in)
/* dlm_controld will see the uevent, do the necessary group management
   and then write to sysfs to wake us */
 
-   error = wait_event_interruptible(ls->ls_uevent_wait,
-   test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags));
+   do {
+   error = wait_event_interruptible(ls->ls_uevent_wait,
+
test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags));
+   } while (error == ERESTARTSYS);
 
log_rinfo(ls, "group event done %d %d", error, ls->ls_uevent_result);
 
-- 
1.8.3.1



[Cluster-devel] [PATCH] Retry wait_event_interruptible in event of ERESTARTSYS

2019-02-01 Thread Mark Syms
We saw an issue in a production server on a customer deployment where
DLM 4.0.7 gets "stuck" and unable to join new lockspaces.

See - https://lists.clusterlabs.org/pipermail/users/2019-January/016054.html

This was forwarded off list to David Teigland who responded thusly.

"
Hi, thanks for the debugging info.  You've spent more time looking at
this than I have, but from a first glance it seems to me that the
initial problem (there may be multiple) is that in the kernel,
lockspace.c do_event() does not sensibly handle the ERESTARTSYS error
from wait_event_interruptible().  I think do_event() should continue
waiting for a uevent result from userspace until it gets one, because
the kernel can't do anything sensible until it gets that.

Dave
"

This change does that. We have it running in automation with no problems
so far but comments welcome.

Mark Syms (1):
  Retry wait_event_interruptible in event of ERESTARTSYS

 fs/dlm/lockspace.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

-- 
1.8.3.1



Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free

2019-02-01 Thread Ross Lagerwall

On 1/31/19 5:18 PM, Andreas Gruenbacher wrote:

Hi Ross,

On Thu, 31 Jan 2019 at 11:56, Ross Lagerwall  wrote:

Each gfs2_bufdata stores a reference to a glock but the reference count
isn't incremented. This causes an occasional use-after-free of the
glock. Fix by taking a reference on the glock during allocation and
dropping it when freeing.

Found by KASAN:

BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
Write of size 4 at addr 88801aff6134 by task kworker/0:2H/20371

CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
Workqueue: glock_workqueue glock_work_func [gfs2]
Call Trace:
  dump_stack+0x71/0xab
  print_address_description+0x6a/0x270
  kasan_report+0x258/0x380
  ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
  revoke_lo_after_commit+0x8e/0xe0 [gfs2]
  gfs2_log_flush+0x511/0xa70 [gfs2]
  ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
  ? __brelse+0x48/0x50
  ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
  ? gfs2_trans_end+0x18d/0x340 [gfs2]
  gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
  ? inode_go_dump+0xe0/0xe0 [gfs2]
  ? inode_go_sync+0xe4/0x220 [gfs2]
  inode_go_sync+0xe4/0x220 [gfs2]
  do_xmote+0x12b/0x290 [gfs2]
  glock_work_func+0x6f/0x160 [gfs2]
  process_one_work+0x461/0x790
  worker_thread+0x69/0x6b0
  ? process_one_work+0x790/0x790
  kthread+0x1ae/0x1d0
  ? kthread_create_worker_on_cpu+0xc0/0xc0
  ret_from_fork+0x22/0x40


thanks for tracking this down, very interesting.

The consistency model here is that every buffer head that a struct
gfs2_bufdata object is attached to is protected by a glock. Before a
glock can be released, all the buffers under that glock have to be
flushed out and released; this is what allows another node to access
the same on-disk location without causing inconsistencies. When there
is a bufdata object that points to a glock that has already been
freed, this consistency model is broken. Taking an additional refcount
as this patch does may make the use-after-free go away, but it doesn't
fix the underlying problem. So I think we'll need a different fix
here.


Yes, I kind of suspected that this is papering over the problem rather 
than fixing the root cause.




Did you observe this problem in a real-world scenario, or with KASAN
only? It might be that we're looking at a small race that is unlikely
to trigger in the field. In any case, I think we need to understand
better what't actually going on.


We haven't observed any problems that can be directly attributed to this 
without KASAN, although it is hard to tell what a stray write may do. We 
have hit sporadic asserts and filesystem corruption during testing.


When I added tracing, the time between freeing a glock and writing to it 
varied but could be up to hundreds of milliseconds so I would guess that 
this could easily happen without KASAN. It is relatively easy to 
reproduce in our test environment.


Do you have any suggestions for tracking down the root cause?

Thanks,
--
Ross Lagerwall