Re: [PATCH] drbd: fix discard_zeroes_if_aligned regression

2019-05-28 Thread Lars Ellenberg
On Fri, May 10, 2019 at 05:36:32PM +, Eric Wheeler wrote:
> Hi Lars,
> 
> We just tried 4.19.x and this bug still exists. We applied the patch 
> which was originally submitted to this thread and it still applies cleanly 
> and seems to work for our use case. You mentioned that you had some older 
> code which zeroed out unaligned discard requests (or perhaps it was for a 
> different purpose) that you may be able to use to patch this. Could you 
> dig those up and see if we can get this solved?
> 
> It would be nice to be able to use drbd with thin backing volumes from the 
> vanilla kernel. If this has already been fixed in something newer than 
> 4.19, then please point me to the commit.

I think it was merged upstream in 5.0
f31e583aa2c2 drbd: introduce P_ZEROES (REQ_OP_WRITE_ZEROES on the "wire")

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT


[PATCH 16/17] drbd: Avoid Clang warning about pointless switch statement

2018-12-20 Thread Lars Ellenberg
From: Nathan Chancellor 

There are several warnings from Clang about no case statement matching
the constant 0:

In file included from drivers/block/drbd/drbd_receiver.c:48:
In file included from drivers/block/drbd/drbd_int.h:48:
In file included from ./include/linux/drbd_genl_api.h:54:
In file included from ./include/linux/genl_magic_struct.h:236:
./include/linux/drbd_genl.h:321:1: warning: no case matching constant
switch condition '0'
GENL_struct(DRBD_NLA_HELPER, 24, drbd_helper_info,
^~
./include/linux/genl_magic_struct.h:220:10: note: expanded from macro
'GENL_struct'
switch (0) {
^

Silence this warning by adding a 'case 0:' statement. Additionally,
adjust the alignment of the statements in the ct_assert_unique macro to
avoid a checkpatch warning.
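The trick being patched is worth seeing in isolation: the macros expand every attribute number into a `case` label of a `switch (0)`, so duplicate numbers become a compile error, but the constant condition 0 matches no label, which is what Clang complains about. A standalone sketch (attribute values invented for illustration, not the real DRBD numbers):

```c
/* Sketch of the compile-time uniqueness check: duplicate case labels in
 * one switch are a compile error, so listing all attribute numbers as
 * cases asserts they are unique. The added 'case 0:' gives the constant
 * switch condition something to match, silencing Clang's warning.
 * Attribute values here are invented for illustration. */
#include <assert.h>

enum { DRBD_ATTR_A = 1, DRBD_ATTR_B = 2, DRBD_ATTR_C = 3 };

static int ct_assert_unique_sketch(void)
{
	switch (0) {
	case DRBD_ATTR_A:	/* a second 'case 1:' here would not compile */
	case DRBD_ATTR_B:
	case DRBD_ATTR_C:
	case 0:			/* reached at runtime: switch (0) matches here */
		return 1;
	}
	return 0;
}
```

The check costs nothing at runtime; the generated code is just the `case 0:` branch.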

This solution was originally sent by Arnd Bergmann with a default case
statement: https://lore.kernel.org/patchwork/patch/756723/

Link: https://github.com/ClangBuiltLinux/linux/issues/43
Suggested-by: Lars Ellenberg 
Signed-off-by: Nathan Chancellor 
---
 include/linux/genl_magic_struct.h | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/genl_magic_struct.h 
b/include/linux/genl_magic_struct.h
index 5972e4969197..eeae59d3ceb7 100644
--- a/include/linux/genl_magic_struct.h
+++ b/include/linux/genl_magic_struct.h
@@ -191,6 +191,7 @@ static inline void ct_assert_unique_operations(void)
 {
switch (0) {
 #include GENL_MAGIC_INCLUDE_FILE
+   case 0:
;
}
 }
@@ -209,6 +210,7 @@ static inline void 
ct_assert_unique_top_level_attributes(void)
 {
switch (0) {
 #include GENL_MAGIC_INCLUDE_FILE
+   case 0:
;
}
 }
@@ -218,7 +220,8 @@ static inline void 
ct_assert_unique_top_level_attributes(void)
 static inline void ct_assert_unique_ ## s_name ## _attributes(void)\
 {  \
switch (0) {\
-   s_fields\
+   s_fields\
+   case 0: \
;   \
}   \
 }
-- 
2.17.1



[PATCH 17/17] drbd: Change drbd_request_detach_interruptible's return type to int

2018-12-20 Thread Lars Ellenberg
From: Nathan Chancellor 

Clang warns when an implicit conversion is done between enumerated
types:

drivers/block/drbd/drbd_state.c:708:8: warning: implicit conversion from
enumeration type 'enum drbd_ret_code' to different enumeration type
'enum drbd_state_rv' [-Wenum-conversion]
rv = ERR_INTR;
   ~ ^~~~

drbd_request_detach_interruptible's only call site is in the return
statement of adm_detach, which returns an int. Change the return type of
drbd_request_detach_interruptible to match, silencing Clang's warning.
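The warning is easy to reproduce outside the kernel; a minimal sketch with invented enum values (the real enums live in include/linux/drbd.h):

```c
/* Minimal reproduction of Clang's -Wenum-conversion: assigning a
 * constant of one enum type to a variable of another. The enum names
 * and values below are invented for illustration. */
#include <assert.h>

enum ret_code { ERR_INTR = 129 };
enum state_rv { SS_SUCCESS = 1 };

/* Before the patch: 'enum state_rv rv; ... rv = ERR_INTR;' draws the
 * warning. After: the only caller wants an int anyway, so the result
 * becomes a plain int and the mixed assignment is unambiguous. */
static int detach_result(int interrupted)
{
	int rv = SS_SUCCESS;

	if (interrupted)
		rv = ERR_INTR;
	return rv;
}
```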

Reported-by: Nick Desaulniers 
Reviewed-by: Nick Desaulniers 
Signed-off-by: Nathan Chancellor 
---
 drivers/block/drbd/drbd_state.c | 6 ++
 drivers/block/drbd/drbd_state.h | 3 +--
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 883155657273..2b4c0db5d867 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -688,11 +688,9 @@ request_detach(struct drbd_device *device)
CS_VERBOSE | CS_ORDERED | CS_INHIBIT_MD_IO);
 }
 
-enum drbd_state_rv
-drbd_request_detach_interruptible(struct drbd_device *device)
+int drbd_request_detach_interruptible(struct drbd_device *device)
 {
-   enum drbd_state_rv rv;
-   int ret;
+   int ret, rv;
 
drbd_suspend_io(device); /* so no-one is stuck in drbd_al_begin_io */
wait_event_interruptible(device->state_wait,
diff --git a/drivers/block/drbd/drbd_state.h b/drivers/block/drbd/drbd_state.h
index b2a390ba73a0..f87371e55e68 100644
--- a/drivers/block/drbd/drbd_state.h
+++ b/drivers/block/drbd/drbd_state.h
@@ -162,8 +162,7 @@ static inline int drbd_request_state(struct drbd_device 
*device,
 }
 
 /* for use in adm_detach() (drbd_adm_detach(), drbd_adm_down()) */
-enum drbd_state_rv
-drbd_request_detach_interruptible(struct drbd_device *device);
+int drbd_request_detach_interruptible(struct drbd_device *device);
 
 enum drbd_role conn_highest_role(struct drbd_connection *connection);
 enum drbd_role conn_highest_peer(struct drbd_connection *connection);
-- 
2.17.1



[PATCH 11/17] drbd: avoid spurious self-outdating with concurrent disconnect / down

2018-12-20 Thread Lars Ellenberg
If peers are "simultaneously" told to disconnect from each other,
either explicitly, or implicitly by taking down the resource,
with bad timing, one side may see its disconnect "fail" with
a result of "state change failed by peer", and interpret this as
"please outdate yourself".

Try to catch this by checking for current connection status,
and possibly retry as local-only state change instead.

Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 1958eb33b643..82915880c5e9 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -2711,8 +2711,10 @@ int drbd_adm_connect(struct sk_buff *skb, struct 
genl_info *info)
 
 static enum drbd_state_rv conn_try_disconnect(struct drbd_connection 
*connection, bool force)
 {
+   enum drbd_conns cstate;
enum drbd_state_rv rv;
 
+repeat:
rv = conn_request_state(connection, NS(conn, C_DISCONNECTING),
force ? CS_HARD : 0);
 
@@ -2730,6 +2732,11 @@ static enum drbd_state_rv conn_try_disconnect(struct 
drbd_connection *connection
 
break;
case SS_CW_FAILED_BY_PEER:
+   spin_lock_irq(&connection->resource->req_lock);
+   cstate = connection->cstate;
+   spin_unlock_irq(&connection->resource->req_lock);
+   if (cstate <= C_WF_CONNECTION)
+   goto repeat;
/* The peer probably wants to see us outdated. */
rv = conn_request_state(connection, NS2(conn, C_DISCONNECTING,
disk, D_OUTDATED), 0);
-- 
2.17.1



[PATCH 12/17] drbd: fix print_st_err()'s prototype to match the definition

2018-12-20 Thread Lars Ellenberg
From: Luc Van Oostenryck 

print_st_err() is defined with its 4th argument taking an
'enum drbd_state_rv' but its prototype use an int for it.

Fix this by using 'enum drbd_state_rv' in the prototype too.

Signed-off-by: Luc Van Oostenryck 
Signed-off-by: Roland Kammerer 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_state.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_state.h b/drivers/block/drbd/drbd_state.h
index ea58301d0895..b2a390ba73a0 100644
--- a/drivers/block/drbd/drbd_state.h
+++ b/drivers/block/drbd/drbd_state.h
@@ -131,7 +131,7 @@ extern enum drbd_state_rv _drbd_set_state(struct 
drbd_device *, union drbd_state
  enum chg_state_flags,
  struct completion *done);
 extern void print_st_err(struct drbd_device *, union drbd_state,
-   union drbd_state, int);
+   union drbd_state, enum drbd_state_rv);
 
 enum drbd_state_rv
 _conn_request_state(struct drbd_connection *connection, union drbd_state mask, 
union drbd_state val,
-- 
2.17.1



[PATCH 15/17] drbd: introduce P_ZEROES (REQ_OP_WRITE_ZEROES on the "wire")

2018-12-20 Thread Lars Ellenberg
And also re-enable partial-zero-out + discard aligned.

With the introduction of REQ_OP_WRITE_ZEROES,
we started to use that for both WRITE_ZEROES and DISCARDS,
hoping that WRITE_ZEROES would "do what we want",
UNMAP if possible, zero-out the rest.

The example scenario is some LVM "thin" backend.

While an un-allocated block on dm-thin reads as zeroes, on a dm-thin
with "skip_block_zeroing=true", after a partial block write allocated
that block, that same block may well map "undefined old garbage" from
the backends on LBAs that have not yet been written to.

If we cannot distinguish between zero-out and discard on the receiving
side, to avoid "undefined old garbage" to pop up randomly at later times
on supposedly zero-initialized blocks, we'd need to map all discards to
zero-out on the receiving side.  But that would potentially do a full
alloc on thinly provisioned backends, even when the expectation was to
unmap/trim/discard/de-allocate.

We need to distinguish on the protocol level, whether we need to guarantee
zeroes (and thus use zero-out, potentially doing the mentioned full-alloc),
or if we want to put the emphasis on discard, and only do a "best effort
zeroing" (by "discarding" blocks aligned to discard-granularity, and zeroing
only potential unaligned head and tail clippings to at least *try* to
avoid "false positives" in an online-verify later), hoping that someone
set skip_block_zeroing=false.

For some discussion regarding this on dm-devel, see also
https://www.mail-archive.com/dm-devel%40redhat.com/msg07965.html
https://www.redhat.com/archives/dm-devel/2018-January/msg00271.html

For backward compatibility, P_TRIM means zero-out, unless the
DRBD_FF_WZEROES feature flag is agreed upon during handshake.

To have upper layers even try to submit WRITE ZEROES requests,
we need to announce "efficient zeroout" independently.

We need to fixup max_write_zeroes_sectors after blk_queue_stack_limits():
if we can handle "zeroes" efficiently on the protocol,
we want to do that, even if our backend does not announce
max_write_zeroes_sectors itself.
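The "best effort zeroing" described above boils down to a split computation: carve the request into an unaligned head and tail, to be explicitly zeroed, and a middle aligned to the discard granularity, to be discarded. A sketch of just that split (struct and helper names invented here; the real submission logic is in drbd_issue_discard_or_zero_out()):

```c
/* Split [start, start+len) into head/middle/tail against a discard
 * granularity: head and tail are the unaligned clippings (zero-out),
 * the middle is granule-aligned (discard). Units are sectors.
 * Names are illustrative, not the kernel's. */
#include <assert.h>

struct span { unsigned long long start, len; };

static void split_for_discard(unsigned long long start, unsigned long long len,
			      unsigned long long gran, struct span *head,
			      struct span *mid, struct span *tail)
{
	unsigned long long mid_start = (start + gran - 1) / gran * gran; /* round up */
	unsigned long long mid_end = (start + len) / gran * gran;        /* round down */

	if (mid_end <= mid_start) {
		/* request covers no full granule: zero it all */
		*head = (struct span){ start, len };
		*mid = (struct span){ 0, 0 };
		*tail = (struct span){ 0, 0 };
		return;
	}
	*head = (struct span){ start, mid_start - start };
	*mid = (struct span){ mid_start, mid_end - mid_start };
	*tail = (struct span){ mid_end, start + len - mid_end };
}

/* tiny self-check: does splitting (start, len, gran) yield the three
 * expected spans? */
static int split_matches(unsigned long long start, unsigned long long len,
			 unsigned long long gran,
			 unsigned long long hs, unsigned long long hl,
			 unsigned long long ms, unsigned long long ml,
			 unsigned long long ts, unsigned long long tl)
{
	struct span h, m, t;

	split_for_discard(start, len, gran, &h, &m, &t);
	return h.start == hs && h.len == hl &&
	       m.start == ms && m.len == ml &&
	       t.start == ts && t.len == tl;
}
```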

Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_debugfs.c  |   2 +
 drivers/block/drbd/drbd_int.h  |  11 +-
 drivers/block/drbd/drbd_main.c |  11 +-
 drivers/block/drbd/drbd_nl.c   |  16 +++
 drivers/block/drbd/drbd_protocol.h |  47 
 drivers/block/drbd/drbd_receiver.c | 171 ++---
 drivers/block/drbd/drbd_req.c  |  19 ++--
 drivers/block/drbd/drbd_req.h  |   2 +
 drivers/block/drbd/drbd_worker.c   |   2 +-
 include/linux/drbd.h   |   2 +-
 10 files changed, 252 insertions(+), 31 deletions(-)

diff --git a/drivers/block/drbd/drbd_debugfs.c 
b/drivers/block/drbd/drbd_debugfs.c
index 5d5e8d6a8a56..f13b48ff5f43 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -237,6 +237,8 @@ static void seq_print_peer_request_flags(struct seq_file 
*m, struct drbd_peer_re
seq_print_rq_state_bit(m, f & EE_CALL_AL_COMPLETE_IO, &sep, "in-AL");
seq_print_rq_state_bit(m, f & EE_SEND_WRITE_ACK, &sep, "C");
seq_print_rq_state_bit(m, f & EE_MAY_SET_IN_SYNC, &sep, "set-in-sync");
+   seq_print_rq_state_bit(m, f & EE_TRIM, &sep, "trim");
+   seq_print_rq_state_bit(m, f & EE_ZEROOUT, &sep, "zero-out");
seq_print_rq_state_bit(m, f & EE_WRITE_SAME, &sep, "write-same");
seq_putc(m, '\n');
 }
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index ab718582a092..000a2f4c0e92 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -430,7 +430,11 @@ enum {
__EE_MAY_SET_IN_SYNC,
 
/* is this a TRIM aka REQ_OP_DISCARD? */
-   __EE_IS_TRIM,
+   __EE_TRIM,
+   /* explicit zero-out requested, or
+* our lower level cannot handle trim,
+* and we want to fall back to zeroout instead */
+   __EE_ZEROOUT,
 
/* In case a barrier failed,
 * we need to resubmit without the barrier flag. */
@@ -472,7 +476,8 @@ enum {
 };
 #define EE_CALL_AL_COMPLETE_IO (1<<__EE_CALL_AL_COMPLETE_IO)
 #define EE_MAY_SET_IN_SYNC (1<<__EE_MAY_SET_IN_SYNC)
-#define EE_IS_TRIM (1<<__EE_IS_TRIM)
+#define EE_TRIM(1<<__EE_TRIM)
+#define EE_ZEROOUT (1<<__EE_ZEROOUT)
 #define EE_RESUBMITTED (1<<__EE_RESUBMITTED)
 #define EE_WAS_ERROR   (1<<__EE_WAS_ERROR)
 #define EE_HAS_DIGEST  (1<<__EE_HAS_DIGEST)
@@ -1556,6 +1561,8 @@ extern void start_resync_timer_fn(struct timer_list *t);
 extern void drbd_endio_write_sec_final(struct drbd_peer_request *peer_req);
 
 /* drbd_receiver.c */
+extern int drbd_issue_discard_or_zero_out(struct drbd_device *device,
+   sector_t start, unsigned int n

[PATCH 13/17] drbd: don't retry connection if peers do not agree on "authentication" settings

2018-12-20 Thread Lars Ellenberg
emma: "Unexpected data packet AuthChallenge (0x0010)"
 ava: "expected AuthChallenge packet, received: ReportProtocol (0x000b)"
  "Authentication of peer failed, trying again."

Pattern repeats.

There is no point in retrying the handshake,
if we expect to receive an AuthChallenge,
but the peer is not even configured to expect or use a shared secret.

Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 3a0fe357b68b..02a327891568 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -5332,7 +5332,7 @@ static int drbd_do_auth(struct drbd_connection 
*connection)
if (pi.cmd != P_AUTH_CHALLENGE) {
drbd_err(connection, "expected AuthChallenge packet, received: 
%s (0x%04x)\n",
 cmdname(pi.cmd), pi.cmd);
-   rv = 0;
+   rv = -1;
goto fail;
}
 
-- 
2.17.1



[PATCH 14/17] drbd: skip spurious timeout (ping-timeo) when failing promote

2018-12-20 Thread Lars Ellenberg
If you try to promote a Secondary while connected to a Primary
and allow-two-primaries is NOT set, we will wait for "ping-timeout"
to give this node a chance to detect a dead primary,
in case the cluster manager noticed faster than we did.

But if we then are *still* connected to a Primary,
we fail (after an additional timeout of ping-timeout).

This change skips the spurious second timeout.

Most people won't really notice,
since "ping-timeout" by default is half a second.

But in some installations, ping-timeout may be 10 or 20 seconds or more,
and spuriously delaying the error return becomes annoying.
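The retry delay in this code path converts the configured ping-timeout, which DRBD stores in tenths of a second, into jiffies. A sketch of that conversion (the HZ value below is an assumption for illustration; real kernels vary):

```c
/* ping_timeo is configured in deciseconds; (ping_timeo + 1) * HZ / 10
 * converts it to jiffies, padding by one extra decisecond.
 * HZ is assumed to be 250 here purely for illustration. */
#include <assert.h>

#define HZ 250

static long ping_timeo_to_jiffies(unsigned int ping_timeo)
{
	return (long)(ping_timeo + 1) * HZ / 10;
}
```

With the default ping-timeout of 5 (half a second), this yields 150 jiffies, i.e. 0.6 s of wait per retry; with a 20 s ping-timeout the same wait becomes over 20 s, which is the annoyance being fixed.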

Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 82915880c5e9..bfe1b0062d62 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -697,14 +697,15 @@ drbd_set_role(struct drbd_device *const device, enum 
drbd_role new_role, int for
if (rv == SS_TWO_PRIMARIES) {
/* Maybe the peer is detected as dead very soon...
   retry at most once more in this case. */
-   int timeo;
-   rcu_read_lock();
-   nc = rcu_dereference(connection->net_conf);
-   timeo = nc ? (nc->ping_timeo + 1) * HZ / 10 : 1;
-   rcu_read_unlock();
-   schedule_timeout_interruptible(timeo);
-   if (try < max_tries)
+   if (try < max_tries) {
+   int timeo;
try = max_tries - 1;
+   rcu_read_lock();
+   nc = rcu_dereference(connection->net_conf);
+   timeo = nc ? (nc->ping_timeo + 1) * HZ / 10 : 1;
+   rcu_read_unlock();
+   schedule_timeout_interruptible(timeo);
+   }
continue;
}
if (rv < SS_SUCCESS) {
-- 
2.17.1



[PATCH 09/17] drbd: fix comment typos

2018-12-20 Thread Lars Ellenberg
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 2 +-
 drivers/block/drbd/drbd_state.c| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 76d74b2122d6..3a0fe357b68b 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -4428,7 +4428,7 @@ static int receive_state(struct drbd_connection 
*connection, struct packet_info
   (peer_state.disk == D_NEGOTIATING ||
os.disk == D_NEGOTIATING));
/* if we have both been inconsistent, and the peer has been
-* forced to be UpToDate with --overwrite-data */
+* forced to be UpToDate with --force */
cr |= test_bit(CONSIDER_RESYNC, &device->flags);
/* if we had been plain connected, and the admin requested to
 * start a sync by "invalidate" or "invalidate-remote" */
diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 18d53fe60d1d..883155657273 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -1124,7 +1124,7 @@ static union drbd_state sanitize_state(struct drbd_device 
*device, union drbd_st
ns.pdsk = D_UP_TO_DATE;
}
 
-   /* Implications of the connection stat on the disk states */
+   /* Implications of the connection state on the disk states */
disk_min = D_DISKLESS;
disk_max = D_UP_TO_DATE;
pdsk_min = D_INCONSISTENT;
-- 
2.17.1



[PATCH 06/17] drbd: fix confusing error message during attach

2018-12-20 Thread Lars Ellenberg
If we attach a (consistent) backing device,
which knows about a last-agreed effective size,
and that effective size is *larger* than the currently requested size,
we refused to attach with ERR_DISK_TOO_SMALL
  Failure: (111) Low.dev. smaller than requested DRBD-dev. size.
which is confusing to say the least.

This patch changes the error code in that case to ERR_IMPLICIT_SHRINK
  Failure: (170) Implicit device shrinking not allowed. See kernel log.
  additional info from kernel:
  To-be-attached device has last effective > current size, and is consistent
  ( >  sectors). Refusing to attach.

It also allows attaching with an explicit size.
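The new drbd_msg_sprintf_info() helper in this patch follows a reserve, format, shrink pattern: reserve a fixed 256-byte attribute, format into it, then trim the attribute down to the bytes actually used. The same pattern outside netlink, as a plain userspace sketch (not the kernel helper itself):

```c
/* Format into a fixed-size reservation and report how many characters
 * were actually stored (scnprintf-like semantics: on truncation, the
 * stored length is returned, not the would-be length). The caller can
 * then shrink its reservation to that length. */
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

static int format_reserved(char *buf, int reserve, const char *fmt, ...)
{
	va_list args;
	int len;

	va_start(args, fmt);
	len = vsnprintf(buf, (size_t)reserve, fmt, args);
	va_end(args);
	if (len >= reserve)		/* output was truncated */
		len = reserve - 1;	/* NUL terminator takes the last byte */
	return len;
}

/* convenience wrapper for the usage example below */
static int format_reserved_len(int reserve, const char *text)
{
	char buf[256];

	return format_reserved(buf, reserve, "%s", text);
}
```

The kernel version does the shrink by rewriting `nla_len` and calling nlmsg_trim(), so the reply message only carries the formatted text, not the full 256-byte reservation.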

Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 49 
 1 file changed, 44 insertions(+), 5 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index d02d38fd1288..e4774f720de5 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -127,6 +127,35 @@ static int drbd_msg_put_info(struct sk_buff *skb, const 
char *info)
return 0;
 }
 
+__printf(2, 3)
+static int drbd_msg_sprintf_info(struct sk_buff *skb, const char *fmt, ...)
+{
+   va_list args;
+   struct nlattr *nla, *txt;
+   int err = -EMSGSIZE;
+   int len;
+
+   nla = nla_nest_start(skb, DRBD_NLA_CFG_REPLY);
+   if (!nla)
+   return err;
+
+   txt = nla_reserve(skb, T_info_text, 256);
+   if (!txt) {
+   nla_nest_cancel(skb, nla);
+   return err;
+   }
+   va_start(args, fmt);
+   len = vscnprintf(nla_data(txt), 256, fmt, args);
+   va_end(args);
+
+   /* maybe: retry with larger reserve, if truncated */
+   txt->nla_len = nla_attr_size(len+1);
+   nlmsg_trim(skb, (char*)txt + NLA_ALIGN(txt->nla_len));
+   nla_nest_end(skb, nla);
+
+   return 0;
+}
+
 /* This would be a good candidate for a "pre_doit" hook,
  * and per-family private info->pointers.
  * But we need to stay compatible with older kernels.
@@ -1947,11 +1976,21 @@ int drbd_adm_attach(struct sk_buff *skb, struct 
genl_info *info)
}
 
/* Prevent shrinking of consistent devices ! */
-   if (drbd_md_test_flag(nbc, MDF_CONSISTENT) &&
-   drbd_new_dev_size(device, nbc, nbc->disk_conf->disk_size, 0) < 
nbc->md.la_size_sect) {
-   drbd_warn(device, "refusing to truncate a consistent device\n");
-   retcode = ERR_DISK_TOO_SMALL;
-   goto force_diskless_dec;
+   {
+   unsigned long long nsz = drbd_new_dev_size(device, nbc, 
nbc->disk_conf->disk_size, 0);
+   unsigned long long eff = nbc->md.la_size_sect;
+   if (drbd_md_test_flag(nbc, MDF_CONSISTENT) && nsz < eff) {
+   if (nsz == nbc->disk_conf->disk_size) {
+   drbd_warn(device, "truncating a consistent device 
during attach (%llu < %llu)\n", nsz, eff);
+   } else {
+   drbd_warn(device, "refusing to truncate a consistent 
device (%llu < %llu)\n", nsz, eff);
+   drbd_msg_sprintf_info(adm_ctx.reply_skb,
+   "To-be-attached device has last effective > 
current size, and is consistent\n"
+   "(%llu > %llu sectors). Refusing to attach.", 
eff, nsz);
+   retcode = ERR_IMPLICIT_SHRINK;
+   goto force_diskless_dec;
+   }
+   }
}
 
lock_all_resources();
-- 
2.17.1



[PATCH 02/17] drbd: must not use connection after kref_put(&connection->kref)

2018-12-20 Thread Lars Ellenberg
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_state.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 0813c654c893..18d53fe60d1d 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -2109,9 +2109,8 @@ static int w_after_conn_state_ch(struct drbd_work *w, int 
unused)
spin_unlock_irq(&connection->resource->req_lock);
}
}
-   kref_put(&connection->kref, drbd_destroy_connection);
-
conn_md_sync(connection);
+   kref_put(&connection->kref, drbd_destroy_connection);
 
return 0;
 }
-- 
2.17.1



[PATCH 08/17] drbd: reject attach of unsuitable uuids even if connected

2018-12-20 Thread Lars Ellenberg
Multiple failure scenario:
a) all good
   Connected Primary/Secondary UpToDate/UpToDate
b) lose disk on Primary,
   Connected Primary/Secondary Diskless/UpToDate
c) continue to write to the device,
   changes only make it to the Secondary storage.
d) lose disk on Secondary,
   Connected Primary/Secondary Diskless/Diskless
e) now try to re-attach on Primary

This would have succeeded before, even though that is clearly the
wrong data set to attach to (missing the modifications from c).
Because we only compared our "effective" and the "to-be-attached"
data generation uuid tags if (device->state.conn < C_CONNECTED).

Fix: change that constraint to (device->state.pdsk != D_UP_TO_DATE)
compare the uuids, and reject the attach.

This patch also tries to improve the reverse scenario:
first lose Secondary, then Primary disk,
then try to attach the disk on Secondary.

Before this patch, the attach on the Secondary succeeds, but since commit
drbd: disconnect, if the wrong UUIDs are attached on a connected peer
the Primary will notice unsuitable data, and drop the connection hard.

Though unfortunately at a point in time during the handshake where
we cannot easily abort the attach on the peer without more
refactoring of the handshake.

We now reject any attach to "unsuitable" uuids,
as long as we can see a Primary role,
unless we already have access to "good" data.

Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c   |  6 +++---
 drivers/block/drbd/drbd_receiver.c | 19 +++
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index e4774f720de5..4b934e543e2d 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1960,9 +1960,9 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info 
*info)
}
}
 
-   if (device->state.conn < C_CONNECTED &&
-   device->state.role == R_PRIMARY && device->ed_uuid &&
-   (device->ed_uuid & ~((u64)1)) != (nbc->md.uuid[UI_CURRENT] & 
~((u64)1))) {
+   if (device->state.pdsk != D_UP_TO_DATE && device->ed_uuid &&
+   (device->state.role == R_PRIMARY || device->state.peer == 
R_PRIMARY) &&
+(device->ed_uuid & ~((u64)1)) != (nbc->md.uuid[UI_CURRENT] & 
~((u64)1))) {
drbd_err(device, "Can only attach to data with current 
UUID=%016llX\n",
(unsigned long long)device->ed_uuid);
retcode = ERR_DATA_NOT_CURRENT;
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 85e3d846a23a..76d74b2122d6 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -4397,6 +4397,25 @@ static int receive_state(struct drbd_connection 
*connection, struct packet_info
if (peer_state.conn == C_AHEAD)
ns.conn = C_BEHIND;
 
+   /* TODO:
+* if (primary and diskless and peer uuid != effective uuid)
+* abort attach on peer;
+*
+* If this node does not have good data, was already connected, but
+* the peer did a late attach only now, trying to "negotiate" with me,
+* AND I am currently Primary, possibly frozen, with some specific
+* "effective" uuid, this should never be reached, really, because
+* we first send the uuids, then the current state.
+*
+* In this scenario, we already dropped the connection hard
+* when we received the unsuitable uuids (receive_uuids()).
+*
+* Should we want to change this, that is: not drop the connection in
+* receive_uuids() already, then we would need to add a branch here
+* that aborts the attach of "unsuitable uuids" on the peer in case
+* this node is currently Diskless Primary.
+*/
+
if (device->p_uuid && peer_state.disk >= D_NEGOTIATING &&
get_ldev_if_state(device, D_NEGOTIATING)) {
int cr; /* consider resync */
-- 
2.17.1



[PATCH 05/17] drbd: disconnect, if the wrong UUIDs are attached on a connected peer

2018-12-20 Thread Lars Ellenberg
With "on-no-data-accessible suspend-io", DRBD requires the next attach
or connect to be to the very same data generation uuid tag it lost last.

If we first lost connection to the peer,
then later lost connection to our own disk,
we would usually refuse to re-connect to the peer,
because it presents the wrong data set.

However, if the peer first connects without a disk,
and then attached its disk, we accepted that same wrong data set,
which would be "unexpected" by any user of that DRBD
and cause "undefined results" (read: very likely data corruption).

The fix is to forcefully disconnect as soon as we notice that the peer
attached to the "wrong" dataset.
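The check added below compares the data generation UUIDs with bit 0 masked off: DRBD reserves the lowest UUID bit as a flag, so two UUIDs name the same data generation iff they match with that bit cleared. As a standalone predicate (illustrative sketch of the `& ~((u64)1)` comparison in the hunk):

```c
/* DRBD reserves bit 0 of a data generation UUID as a flag bit;
 * generation identity is carried by the remaining 63 bits, so
 * comparisons mask bit 0 off on both sides. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool same_data_generation(uint64_t ed_uuid, uint64_t peer_uuid)
{
	return (ed_uuid & ~(uint64_t)1) == (peer_uuid & ~(uint64_t)1);
}
```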

Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index fbf30fe45862..0a9f3c65f70a 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -4170,7 +4170,7 @@ static int receive_uuids(struct drbd_connection 
*connection, struct packet_info
kfree(device->p_uuid);
device->p_uuid = p_uuid;
 
-   if (device->state.conn < C_CONNECTED &&
+   if ((device->state.conn < C_CONNECTED || device->state.pdsk == 
D_DISKLESS) &&
device->state.disk < D_INCONSISTENT &&
device->state.role == R_PRIMARY &&
(device->ed_uuid & ~((u64)1)) != (p_uuid[UI_CURRENT] & ~((u64)1))) {
-- 
2.17.1



[PATCH 10/17] drbd: do not block when adjusting "disk-options" while IO is frozen

2018-12-20 Thread Lars Ellenberg
"suspending" IO is overloaded.
It can mean "do not allow new requests" (obviously),
but it also may mean "must not complete pending IO",
for example while the fencing handlers do their arbitration.

When adjusting disk options, we suspend io (disallow new requests), then
wait for the activity-log to become unused (drain all IO completions),
and possibly replace it with a new activity log of different size.

If the other "suspend IO" aspect is active, pending IO completions won't
happen, and we would block forever (unkillable drbdsetup process).

Fix this by skipping the activity log adjustment if the "al-extents"
setting did not change. Also, in case it did change, fail early without
blocking if it looks like we would block forever.
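Reduced to its decision logic, the new helper has this shape (a sketch with invented parameters; the real disk_opts_check_al_size() takes the device and disk_conf and does the actual drain and resize):

```c
/* Skip the activity log resize entirely when the al-extents setting is
 * unchanged; refuse with -EBUSY instead of blocking when IO completions
 * are suspended, since pending IO would then never drain. */
#include <assert.h>
#include <errno.h>

static int adjust_al_size(unsigned int cur_extents, unsigned int new_extents,
			  int pending_io)
{
	if (cur_extents == new_extents)
		return 0;		/* nothing to do, no need to drain */
	if (pending_io)
		return -EBUSY;		/* would wait "forever": fail early */
	/* ... drain the activity log and resize it here ... */
	return 0;
}
```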

Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 37 
 1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 4b934e543e2d..1958eb33b643 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1540,6 +1540,30 @@ static void sanitize_disk_conf(struct drbd_device 
*device, struct disk_conf *dis
}
 }
 
+static int disk_opts_check_al_size(struct drbd_device *device, struct 
disk_conf *dc)
+{
+   int err = -EBUSY;
+
+   if (device->act_log &&
+   device->act_log->nr_elements == dc->al_extents)
+   return 0;
+
+   drbd_suspend_io(device);
+   /* If IO completion is currently blocked, we would likely wait
+* "forever" for the activity log to become unused. So we don't. */
+   if (atomic_read(&device->ap_bio_cnt))
+   goto out;
+
+   wait_event(device->al_wait, lc_try_lock(device->act_log));
+   drbd_al_shrink(device);
+   err = drbd_check_al_size(device, dc);
+   lc_unlock(device->act_log);
+   wake_up(&device->al_wait);
+out:
+   drbd_resume_io(device);
+   return err;
+}
+
 int drbd_adm_disk_opts(struct sk_buff *skb, struct genl_info *info)
 {
struct drbd_config_context adm_ctx;
@@ -1602,15 +1626,12 @@ int drbd_adm_disk_opts(struct sk_buff *skb, struct 
genl_info *info)
}
}
 
-   drbd_suspend_io(device);
-   wait_event(device->al_wait, lc_try_lock(device->act_log));
-   drbd_al_shrink(device);
-   err = drbd_check_al_size(device, new_disk_conf);
-   lc_unlock(device->act_log);
-   wake_up(&device->al_wait);
-   drbd_resume_io(device);
-
+   err = disk_opts_check_al_size(device, new_disk_conf);
if (err) {
+   /* Could be just "busy". Ignore?
+* Introduce dedicated error code? */
+   drbd_msg_put_info(adm_ctx.reply_skb,
+   "Try again without changing current al-extents 
setting");
retcode = ERR_NOMEM;
goto fail_unlock;
}
-- 
2.17.1



[PATCH 04/17] drbd: ignore "all zero" peer volume sizes in handshake

2018-12-20 Thread Lars Ellenberg
During handshake, if we are diskless ourselves, we used to accept any size
presented by the peer.

Which could be zero if that peer was just brought up and connected
to us without having a disk attached first, in which case both
peers would just "flip" their volume sizes.

Now, even a diskless node will ignore "zero" sizes
presented by a diskless peer.

Also a currently Diskless Primary will refuse to shrink during handshake:
it may be frozen, and waiting for a "suitable" local disk or peer to
re-appear (on-no-data-accessible suspend-io). If the peer is smaller
than what we used to be, it is not suitable.

The logic for a diskless node during handshake is now supposed to be:
believe the peer, if
 - I don't have a current size myself
 - we agree on the size anyways
 - I do have a current size, am Secondary, and he has the only disk
 - I do have a current size, am Primary, and he has the only disk,
   which is larger than my current size
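The four rules above, written as a pure predicate (names invented for illustration; the real checks are spread through receive_sizes() in the diff below):

```c
/* Should a diskless node believe the size the peer presents during
 * handshake? Sketch of the rules from the commit message; parameter
 * names are invented. */
#include <assert.h>
#include <stdbool.h>

static bool believe_peer_size(unsigned long long cur_size,
			      unsigned long long new_size,
			      bool i_am_primary, bool peer_has_only_disk)
{
	if (cur_size == 0)		/* I don't have a current size myself */
		return true;
	if (new_size == cur_size)	/* we agree on the size anyways */
		return true;
	if (peer_has_only_disk) {
		if (!i_am_primary)	/* Secondary: take the peer's size */
			return true;
		/* Primary: only grow; never shrink while possibly frozen */
		return new_size > cur_size;
	}
	return false;
}
```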

Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 33 +++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 1b9822f264d2..fbf30fe45862 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -3981,6 +3981,7 @@ static int receive_sizes(struct drbd_connection 
*connection, struct packet_info
struct o_qlim *o = (connection->agreed_features & DRBD_FF_WSAME) ? 
p->qlim : NULL;
enum determine_dev_size dd = DS_UNCHANGED;
sector_t p_size, p_usize, p_csize, my_usize;
+   sector_t new_size, cur_size;
int ldsc = 0; /* local disk size changed */
enum dds_flags ddsf;
 
@@ -3988,6 +3989,7 @@ static int receive_sizes(struct drbd_connection 
*connection, struct packet_info
if (!peer_device)
return config_unknown_volume(connection, pi);
device = peer_device->device;
+   cur_size = drbd_get_capacity(device->this_bdev);
 
p_size = be64_to_cpu(p->d_size);
p_usize = be64_to_cpu(p->u_size);
@@ -3998,7 +4000,6 @@ static int receive_sizes(struct drbd_connection 
*connection, struct packet_info
device->p_size = p_size;
 
if (get_ldev(device)) {
-   sector_t new_size, cur_size;
rcu_read_lock();
my_usize = rcu_dereference(device->ldev->disk_conf)->disk_size;
rcu_read_unlock();
@@ -4016,7 +4017,6 @@ static int receive_sizes(struct drbd_connection 
*connection, struct packet_info
/* Never shrink a device with usable data during connect.
   But allow online shrinking if we are connected. */
new_size = drbd_new_dev_size(device, device->ldev, p_usize, 0);
-   cur_size = drbd_get_capacity(device->this_bdev);
if (new_size < cur_size &&
device->state.disk >= D_OUTDATED &&
device->state.conn < C_CONNECTED) {
@@ -4081,9 +4081,36 @@ static int receive_sizes(struct drbd_connection 
*connection, struct packet_info
 *
 * However, if he sends a zero current size,
 * take his (user-capped or) backing disk size anyways.
+*
+* Unless of course he does not have a disk himself.
+* In which case we ignore this completely.
 */
+   sector_t new_size = p_csize ?: p_usize ?: p_size;
drbd_reconsider_queue_parameters(device, NULL, o);
-   drbd_set_my_capacity(device, p_csize ?: p_usize ?: p_size);
+   if (new_size == 0) {
+   /* Ignore, peer does not know nothing. */
+   } else if (new_size == cur_size) {
+   /* nothing to do */
+   } else if (cur_size != 0 && p_size == 0) {
+   drbd_warn(device, "Ignored diskless peer device size 
(peer:%llu != me:%llu sectors)!\n",
+   (unsigned long long)new_size, (unsigned 
long long)cur_size);
+   } else if (new_size < cur_size && device->state.role == 
R_PRIMARY) {
+   drbd_err(device, "The peer's device size is too small! 
(%llu < %llu sectors); demote me first!\n",
+   (unsigned long long)new_size, (unsigned 
long long)cur_size);
+   conn_request_state(peer_device->connection, NS(conn, 
C_DISCONNECTING), CS_HARD);
+   return -EIO;
+   } else {
+   /* I believe the peer, if
+*  - I don't have a current size myself
+*  - we agree on the size anyways
+*  - I do have a 

[PATCH 07/17] drbd: attach on connected diskless peer must not shrink a consistent device

2018-12-20 Thread Lars Ellenberg
If we would reject a new handshake when the peer had attached first
and then connected, we should also force a disconnect when the peer
first connects and only then attaches.

Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 0a9f3c65f70a..85e3d846a23a 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -4014,12 +4014,13 @@ static int receive_sizes(struct drbd_connection 
*connection, struct packet_info
if (device->state.conn == C_WF_REPORT_PARAMS)
p_usize = min_not_zero(my_usize, p_usize);
 
-   /* Never shrink a device with usable data during connect.
-  But allow online shrinking if we are connected. */
+   /* Never shrink a device with usable data during connect,
+* or "attach" on the peer.
+* But allow online shrinking if we are connected. */
new_size = drbd_new_dev_size(device, device->ldev, p_usize, 0);
if (new_size < cur_size &&
device->state.disk >= D_OUTDATED &&
-   device->state.conn < C_CONNECTED) {
+   (device->state.conn < C_CONNECTED || device->state.pdsk == 
D_DISKLESS)) {
drbd_err(device, "The peer's disk size is too small! 
(%llu < %llu sectors)\n",
(unsigned long long)new_size, (unsigned 
long long)cur_size);
conn_request_state(peer_device->connection, NS(conn, 
C_DISCONNECTING), CS_HARD);
@@ -4047,8 +4048,8 @@ static int receive_sizes(struct drbd_connection 
*connection, struct packet_info
synchronize_rcu();
kfree(old_disk_conf);
 
-   drbd_info(device, "Peer sets u_size to %lu sectors\n",
-(unsigned long)my_usize);
+   drbd_info(device, "Peer sets u_size to %lu sectors 
(old: %lu)\n",
+(unsigned long)p_usize, (unsigned 
long)my_usize);
}
 
put_ldev(device);
-- 
2.17.1



[PATCH 03/17] drbd: centralize printk reporting of new size into drbd_set_my_capacity()

2018-12-20 Thread Lars Ellenberg
Previously, some implicit resizes that happened during handshake
were not reported as prominently as an explicit resize.

Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_int.h  |  8 +---
 drivers/block/drbd/drbd_main.c | 17 -
 drivers/block/drbd/drbd_nl.c   |  3 ---
 3 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 1e47db57b9d2..ab718582a092 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1609,13 +1609,7 @@ static inline void drbd_tcp_quickack(struct socket *sock)
 }
 
 /* sets the number of 512 byte sectors of our virtual device */
-static inline void drbd_set_my_capacity(struct drbd_device *device,
-   sector_t size)
-{
-   /* set_capacity(device->this_bdev->bd_disk, size); */
-   set_capacity(device->vdisk, size);
-   device->this_bdev->bd_inode->i_size = (loff_t)size << 9;
-}
+void drbd_set_my_capacity(struct drbd_device *device, sector_t size);
 
 /*
  * used to submit our private bio
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index f973a2a845c8..f9b4228cc2d9 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2034,6 +2034,21 @@ void drbd_init_set_defaults(struct drbd_device *device)
device->local_max_bio_size = DRBD_MAX_BIO_SIZE_SAFE;
 }
 
+static void _drbd_set_my_capacity(struct drbd_device *device, sector_t size)
+{
+   /* set_capacity(device->this_bdev->bd_disk, size); */
+   set_capacity(device->vdisk, size);
+   device->this_bdev->bd_inode->i_size = (loff_t)size << 9;
+}
+
+void drbd_set_my_capacity(struct drbd_device *device, sector_t size)
+{
+   char ppb[10];
+   _drbd_set_my_capacity(device, size);
+   drbd_info(device, "size = %s (%llu KB)\n",
+   ppsize(ppb, size>>1), (unsigned long long)size>>1);
+}
+
 void drbd_device_cleanup(struct drbd_device *device)
 {
int i;
@@ -2059,7 +2074,7 @@ void drbd_device_cleanup(struct drbd_device *device)
}
D_ASSERT(device, first_peer_device(device)->connection->net_conf == 
NULL);
 
-   drbd_set_my_capacity(device, 0);
+   _drbd_set_my_capacity(device, 0);
if (device->bitmap) {
/* maybe never allocated. */
drbd_bm_resize(device, 0, 1);
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index d15703b1ffe8..d02d38fd1288 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -921,7 +921,6 @@ drbd_determine_dev_size(struct drbd_device *device, enum 
dds_flags flags, struct
} prev;
sector_t u_size, size;
	struct drbd_md *md = &device->ldev->md;
-   char ppb[10];
void *buffer;
 
int md_moved, la_size_changed;
@@ -999,8 +998,6 @@ drbd_determine_dev_size(struct drbd_device *device, enum 
dds_flags flags, struct
/* racy, see comments above. */
drbd_set_my_capacity(device, size);
md->la_size_sect = size;
-   drbd_info(device, "size = %s (%llu KB)\n", ppsize(ppb, size>>1),
-(unsigned long long)size>>1);
}
if (rv <= DS_ERROR)
goto err_out;
-- 
2.17.1



[PATCH 01/17] drbd: narrow rcu_read_lock in drbd_sync_handshake

2018-12-20 Thread Lars Ellenberg
From: Roland Kammerer 

So far there was the possibility that we called
genlmsg_new(GFP_NOIO)/mutex_lock() while holding an rcu_read_lock().

This included cases like:

drbd_sync_handshake (acquire the RCU lock)
  drbd_asb_recover_1p
drbd_khelper
  drbd_bcast_event
genlmsg_new(GFP_NOIO) --> may sleep

drbd_sync_handshake (acquire the RCU lock)
  drbd_asb_recover_1p
drbd_khelper
  notify_helper
genlmsg_new(GFP_NOIO) --> may sleep

drbd_sync_handshake (acquire the RCU lock)
  drbd_asb_recover_1p
drbd_khelper
  notify_helper
mutex_lock --> may sleep

While using GFP_ATOMIC whould have been possible in the first two cases,
the real fix is to narrow the rcu_read_lock.

Reported-by: Jia-Ju Bai 
Reviewed-by: Lars Ellenberg 
Signed-off-by: Roland Kammerer 
---
 drivers/block/drbd/drbd_receiver.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 61c392752fe4..1b9822f264d2 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -3364,7 +3364,7 @@ static enum drbd_conns drbd_sync_handshake(struct 
drbd_peer_device *peer_device,
enum drbd_conns rv = C_MASK;
enum drbd_disk_state mydisk;
struct net_conf *nc;
-   int hg, rule_nr, rr_conflict, tentative;
+   int hg, rule_nr, rr_conflict, tentative, always_asbp;
 
mydisk = device->state.disk;
if (mydisk == D_NEGOTIATING)
@@ -3415,8 +3415,12 @@ static enum drbd_conns drbd_sync_handshake(struct 
drbd_peer_device *peer_device,
 
rcu_read_lock();
nc = rcu_dereference(peer_device->connection->net_conf);
+   always_asbp = nc->always_asbp;
+   rr_conflict = nc->rr_conflict;
+   tentative = nc->tentative;
+   rcu_read_unlock();
 
-   if (hg == 100 || (hg == -100 && nc->always_asbp)) {
+   if (hg == 100 || (hg == -100 && always_asbp)) {
int pcount = (device->state.role == R_PRIMARY)
   + (peer_role == R_PRIMARY);
int forced = (hg == -100);
@@ -3455,9 +3459,6 @@ static enum drbd_conns drbd_sync_handshake(struct 
drbd_peer_device *peer_device,
 "Sync from %s node\n",
 (hg < 0) ? "peer" : "this");
}
-   rr_conflict = nc->rr_conflict;
-   tentative = nc->tentative;
-   rcu_read_unlock();
 
if (hg == -100) {
/* FIXME this log message is not correct if we end up here
-- 
2.17.1



[PATCH 00/17] DRBD updates for 4.21

2018-12-20 Thread Lars Ellenberg


Hi Jens,

I hope this is not too late for your for-4.21 branch.

These are all from last April or even older,
I was convinced we sent these for 4.19 already.
But we didn't :-(

The interesting new feature is "introduce P_ZEROES", which is replacing
45c21793a660 drbd: implement REQ_OP_WRITE_ZEROES
because that led to "full allocation" for fstrim on dm-thin or similar,
which is certainly not the intended behavior.

Patch 3 to 8 help with recovery from odd corner cases in
multiple error scenarios in a cluster, potentially after some admin
already "tried things".

All others are sufficiently simple and supposedly "obvious".

Lars Ellenberg (13):
  drbd: must not use connection after kref_put(&connection->kref)
  drbd: centralize printk reporting of new size into drbd_set_my_capacity()
  drbd: ignore "all zero" peer volume sizes in handshake
  drbd: disconnect, if the wrong UUIDs are attached on a connected peer
  drbd: fix confusing error message during attach
  drbd: attach on connected diskless peer must not shrink a consistent device
  drbd: reject attach of unsuitable uuids even if connected
  drbd: fix comment typos
  drbd: do not block when adjusting "disk-options" while IO is frozen
  drbd: avoid spurious self-outdating with concurrent disconnect / down
  drbd: don't retry connection if peers do not agree on "authentication" 
settings
  drbd: skip spurious timeout (ping-timeo) when failing promote
  drbd: introduce P_ZEROES (REQ_OP_WRITE_ZEROES on the "wire")

Luc Van Oostenryck (1):
  drbd: fix print_st_err()'s prototype to match the definition

Nathan Chancellor (2):
  drbd: Avoid Clang warning about pointless switch statment
  drbd: Change drbd_request_detach_interruptible's return type to int

Roland Kammerer (1):
  drbd: narrow rcu_read_lock in drbd_sync_handshake

 drivers/block/drbd/drbd_debugfs.c  |   2 +
 drivers/block/drbd/drbd_int.h  |  19 +--
 drivers/block/drbd/drbd_main.c |  28 +++-
 drivers/block/drbd/drbd_nl.c   | 133 ---
 drivers/block/drbd/drbd_protocol.h |  47 ++
 drivers/block/drbd/drbd_receiver.c | 251 +
 drivers/block/drbd/drbd_req.c  |  19 +--
 drivers/block/drbd/drbd_req.h  |   2 +
 drivers/block/drbd/drbd_state.c|  11 +-
 drivers/block/drbd/drbd_state.h|   5 +-
 drivers/block/drbd/drbd_worker.c   |   2 +-
 include/linux/drbd.h   |   2 +-
 include/linux/genl_magic_struct.h  |   5 +-
 13 files changed, 434 insertions(+), 92 deletions(-)

-- 
2.17.1



Re: [PATCH v2] drbd: Avoid Clang warning about pointless switch statment

2018-12-18 Thread Lars Ellenberg
On Mon, Dec 17, 2018 at 10:29:38AM -0700, Jens Axboe wrote:
> > Hi Lars and Philipp,
> > 
> > Could you please make sure that this patch and the other one I sent make
> > it into 4.21/5.0? I am not sure when you were planning on sending the
> > pull request to Jens that you mentioned in the other thread but I've
> > noticed most maintainers typically send their requests for the impending
> > merge window around -rc7 or so and I wanted to make sure it was on your
> > radar.

I'm sorry.
From my point of view, "fixing the pointless switch" is just "pointless",
so it (again) fell through.  I get it that it is important to others,
and getting rid of Clang warnings is a good thing.  So sorry again.

> It needs to get here now, but drbd hasn't really sent anything in for
> about a year,

Last actual fix was 2018-06-25 64dafbc9530c drbd: fix access after free
so almost six months, yes.

> so I'm starting to doubt how maintained it is at this
> point.

Oh, it is maintained.  It just happens to be "stable".
We don't add new features there,
and I'm currently not aware of any misbehavior.

I'll prepare a pull request containing this,
the other Clang warning patch from Nathan,
and possibly other small stuff that accumulated,
if I can find any, and send that out later today.

Thanks for the reminder,

Lars



Re: [Drbd-dev] [PATCH 28/42] drbd: switch to proc_create_single

2018-05-18 Thread Lars Ellenberg
On Wed, May 16, 2018 at 11:43:32AM +0200, Christoph Hellwig wrote:
> And stop messing with try_module_get on THIS_MODULE, which doesn't make
> any sense here.

The idea was to increase module count on /proc/drbd access.

If someone holds /proc/drbd open, previously rmmod would
"succeed" in starting the unload, but then block on remove_proc_entry,
leading to a situation where the lsmod does not show drbd anymore,
but /proc/drbd being still there (but no longer accessible).

I'd rather have rmmod fail up front in this case.
And try_module_get() seemed most appropriate.

Lars




Re: [PATCH] drbd: fix discard_zeroes_if_aligned regression

2018-01-16 Thread Lars Ellenberg
On Mon, Jan 15, 2018 at 11:26:15PM -0800, Christoph Hellwig wrote:
> NAK.  Calling a discard and expecting zeroing is simply buggy.

What he said.

The bug/misunderstanding was that we now use zeroout even for discards,
with the assumption that it would try to do discards where it can.

Which is even true.

But our expectations of when zeroout "should" use unmap,
and where it actually can do that safely
based on the information it has,
don't really match:
our expectations were wrong, we assumed too much
(in the general case).

Which means in DRBD we have to stop using zeroout for discards,
and again pass down normal discard for discards.

In the specific case where the backend to DRBD is lvm-thin,
AND it does de-alloc blocks on discard,
AND it does NOT have skip_block_zeroing set or its backend
does zero on discard and it does discard passdown, and we tell
DRBD about all of that (using the discard_zeroes_if_aligned flag)
then we can do this "zero head and tail, discard full blocks",
and expect them to be zero.

If skip_block_zeroing is set however, and there is no discard
passdown in thin, or the backend of thin does not zero on discard,
DRBD can still pass down discards to thin, and expect them to be
de-allocated, but can not expect discarded ranges to remain
"zero", any later partial write to an unallocated area could pull
in different "undefined" garbage on different replicas for the
not-written-to part of a new allocated block.

The end result is that you have areas of the block device
that return different data depending on which replica you
read from.

But that is the case even without discard in that setup,
don't do that then or live with it.

"undefined data" is undefined, you have that directly on thin
in that setup already, with DRBD on top you now have several
versions of "undefined".

Anything on top of such a setup must not do "read-modify-write"
of "undefined" data and expect a defined result, adding DRBD
on top does not change that.

DRBD can deal with that just fine, but our "online verify" will
report loads of boring "mismatches" for those areas.

TL;DR: we'll stop doing "discard-is-zeroout"
(our assumptions were wrong).
We still won't do exactly "discard-is-discard", but re-enable our
"discard-is-discard plus zeroout on head and tail", because in
some relevant setups, this gives us the best result, and avoids
the false positives in our online-verify.

Lars


Re: [PATCH] drbd: standardize kthread/workqueue thread naming to include drbd minor number

2018-01-16 Thread Lars Ellenberg
On Mon, Jan 15, 2018 at 03:52:15PM -0800, Eric Wheeler wrote:
> From: Eric Wheeler 
> 
> For DRBD resources with long names that start with the same prefix,
> it was difficult to find all process pids for that minor since names
> are truncated to the task_struct's comm field (16 bytes).
> 
> This patch names all processes associated with a DRBD device as drbdN_*
> where N is the DRBD minor in the same ways that the drbdN_submit workqueue
> is named.  Userspace tools can then lookup the name=>minor=>pid mapping
> and for all pids and use tools like chrt, ioprio, nice, add pids to
> cgroups, or for other useful purpose.


DRBD can do "multi volume" resources, which means most threads
are associated not with one, but with many minors.

"does not work just like that"

Lars




Re: [PATCH] drbd: rename "usermode_helper" to "drbd_usermode_helper"

2017-06-08 Thread Lars Ellenberg
On Mon, Jun 05, 2017 at 11:26:23AM +0200, Greg KH wrote:
> From: Greg Kroah-Hartman 
> 
> Nothing like having a very generic global variable in a tiny driver
> subsystem to make a mess of the global namespace...
> 
> Note, there are many other "generic" named global variables in the drbd
> subsystem, someone should fix those up one day before they hit a linking
> error.

Thanks. We'll prepare a patch to that effect, probably converting some
of them to static, fixing up the rest, and submit with the updates for
the next merge window.

Lars




[PATCH] drbd: fix request leak introduced by locking/atomic, kref: Kill kref_sub()

2017-05-11 Thread Lars Ellenberg
Regression fix for 4.11, which totally broke DRBD

When killing kref_sub(), the unconditional additional kref_get()
was not properly paired with the necessary kref_put(), causing
a leak of struct drbd_requests (~ 224 Bytes) per submitted bio,
and breaking DRBD in general, as the destructor of those "drbd_requests"
does more than just the mempool_free().

Fixes: bdfafc4ffdd2 ("locking/atomic, kref: Kill kref_sub()")
Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com>
---
 drivers/block/drbd/drbd_req.c | 27 +++
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 652114a..1fc8a67 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -314,24 +314,32 @@ void drbd_req_complete(struct drbd_request *req, struct 
bio_and_error *m)
 }
 
 /* still holds resource->req_lock */
-static int drbd_req_put_completion_ref(struct drbd_request *req, struct 
bio_and_error *m, int put)
+static void drbd_req_put_completion_ref(struct drbd_request *req, struct 
bio_and_error *m, int put)
 {
struct drbd_device *device = req->device;
D_ASSERT(device, m || (req->rq_state & RQ_POSTPONED));
 
+   if (!put)
+   return;
+
	if (!atomic_sub_and_test(put, &req->completion_ref))
-   return 0;
+   return;
 
drbd_req_complete(req, m);
 
+   /* local completion may still come in later,
+* we need to keep the req object around. */
+   if (req->rq_state & RQ_LOCAL_ABORTED)
+   return;
+
if (req->rq_state & RQ_POSTPONED) {
/* don't destroy the req object just yet,
 * but queue it for retry */
drbd_restart_request(req);
-   return 0;
+   return;
}
 
-   return 1;
+   kref_put(&req->kref, drbd_req_destroy);
 }
 
 static void set_if_null_req_next(struct drbd_peer_device *peer_device, struct 
drbd_request *req)
@@ -518,12 +526,8 @@ static void mod_rq_state(struct drbd_request *req, struct 
bio_and_error *m,
if (req->i.waiting)
	wake_up(&device->misc_wait);
 
-   if (c_put) {
-   if (drbd_req_put_completion_ref(req, m, c_put))
-   kref_put(&req->kref, drbd_req_destroy);
-   } else {
-   kref_put(&req->kref, drbd_req_destroy);
-   }
+   drbd_req_put_completion_ref(req, m, c_put);
+   kref_put(&req->kref, drbd_req_destroy);
 }
 
 static void drbd_report_io_error(struct drbd_device *device, struct 
drbd_request *req)
@@ -1363,8 +1367,7 @@ static void drbd_send_and_submit(struct drbd_device 
*device, struct drbd_request
}
 
 out:
-   if (drbd_req_put_completion_ref(req, &m, 1))
-   kref_put(&req->kref, drbd_req_destroy);
+   drbd_req_put_completion_ref(req, &m, 1);
	spin_unlock_irq(&resource->req_lock);
 
/* Even though above is a kref_put(), this is safe.
-- 
2.7.4




Re: [PATCH] drbd:fix null pointer deref in _drbd_md_sync_page_io

2017-04-26 Thread Lars Ellenberg
On Wed, Apr 26, 2017 at 02:49:37AM -0700, Heloise wrote:
> The return value of bio_alloc_drbd can be NULL and is used without

No, apparently it cannot, because it is basically a mempool_alloc()
with GFP_NOIO, it may sleep, but it will loop "forever" and not return NULL.

So rather fix that nonsense in bio_alloc_drbd, see below:

Thanks,

Lars

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 92c60cb..9ffd940 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -150,15 +150,10 @@ static const struct block_device_operations drbd_ops = {
 
 struct bio *bio_alloc_drbd(gfp_t gfp_mask)
 {
-   struct bio *bio;
-
if (!drbd_md_io_bio_set)
return bio_alloc(gfp_mask, 1);
 
-   bio = bio_alloc_bioset(gfp_mask, 1, drbd_md_io_bio_set);
-   if (!bio)
-   return NULL;
-   return bio;
+   return bio_alloc_bioset(gfp_mask, 1, drbd_md_io_bio_set);
 }
 
 #ifdef __CHECKER__


> validation, which may cause null-pointer dereference, fix it.
> 
> Signed-off-by: Heloise 
> ---
>  drivers/block/drbd/drbd_actlog.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/block/drbd/drbd_actlog.c 
> b/drivers/block/drbd/drbd_actlog.c
> index 8d7bcfa..d6bb30e 100644
> --- a/drivers/block/drbd/drbd_actlog.c
> +++ b/drivers/block/drbd/drbd_actlog.c
> @@ -151,6 +151,10 @@ static int _drbd_md_sync_page_io(struct drbd_device 
> *device,
>   op_flags |= REQ_SYNC;
>  
>   bio = bio_alloc_drbd(GFP_NOIO);
> + if (!bio) {
> + err = -ENOMEM;
> + return err;
> + }
>   bio->bi_bdev = bdev->md_bdev;
>   bio->bi_iter.bi_sector = sector;
>   err = -EIO;
> -- 
> 2.1.0






Re: [PATCH] drivers:block:drbd:drbd_state:fix null-pointer dereference

2017-04-25 Thread Lars Ellenberg
On Mon, Apr 24, 2017 at 11:35:18PM -0700, Heloise wrote:
> Signed-off-by: Heloise 
> 
> In is_valid_state(), there is NULL validation for the variable nc
> "if (nc)". However,the code will continue to execute when nc is NULL.
> nc->verify_alg[0] is used in subsequent code, which may cause
> null-pointer dereference, fix it.
> ---
>  drivers/block/drbd/drbd_state.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
> index eea0c4a..1963b83 100644
> --- a/drivers/block/drbd/drbd_state.c
> +++ b/drivers/block/drbd/drbd_state.c
> @@ -845,7 +845,7 @@ is_valid_state(struct drbd_device *device, union 
> drbd_state ns)
>   rv = SS_CONNECTED_OUTDATES;
>  
>   else if ((ns.conn == C_VERIFY_S || ns.conn == C_VERIFY_T) &&
> -  (nc->verify_alg[0] == 0))
> +  (nc != NULL && nc->verify_alg[0] == 0))


What the static checker cannot know:
ns.conn != C_STANDALONE implies nc != NULL.

But if you feel like it, the additional check won't hurt.

Thanks,

Lars





Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

2017-03-10 Thread Lars Ellenberg
On Fri, Mar 10, 2017 at 04:07:58PM +0100, Jack Wang wrote:
> On 10.03.2017 15:55, Mikulas Patocka wrote:
> > On Fri, 10 Mar 2017, Mike Snitzer wrote:
> >> On Fri, Mar 10 2017 at  7:34am -0500,
> >> Lars Ellenberg <lars.ellenb...@linbit.com> wrote:
> >>
> >>>> --- a/block/blk-core.c
> >>>> +++ b/block/blk-core.c
> >>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> >>>>   */
> >>>>  blk_qc_t generic_make_request(struct bio *bio)
> >>>>  {
> >>>> -   struct bio_list bio_list_on_stack;
> >>>> +   /*
> >>>> +* bio_list_on_stack[0] contains bios submitted by the current
> >>>> +* make_request_fn.
> >>>> +* bio_list_on_stack[1] contains bios that were submitted before
> >>>> +* the current make_request_fn, but that haven't been processed
> >>>> +* yet.
> >>>> +*/
> >>>> +   struct bio_list bio_list_on_stack[2];
> >>>> blk_qc_t ret = BLK_QC_T_NONE;
> >>>
> >>> May I suggest that, if you intend to assign something that is not a
> >>> plain &(struct bio_list), but a &(struct bio_list[2]),
> >>> you change the task member so it is renamed (current->bio_list vs
> >>> current->bio_lists, plural, is what I did last year).
> >>> Or you will break external modules, silently, and horribly (or,
> >>> rather, they won't notice, but break the kernel).
> >>> Examples of such modules would be DRBD, ZFS, quite possibly others.

> > It's better to make external modules not compile than to silently 
> > introduce bugs in them. So yes, I would rename that.
> > 
> > Mikulas
> 
> Agree, better rename current->bio_list to current->bio_lists
>
> Regards,
> Jack

Thank you.

(I don't know if someone does, but...)
Thing is: *IF* some external thing messes with
current->bio_list in "interesting" ways, and not just the
"I don't care, one level of real recursion fixes this for me"
pattern of
struct bio_list *tmp = current->bio_list;
current->bio_list = NULL;
submit_bio()
current->bio_list = tmp;
you get a chance of stack corruption,
without even as much as a compiler warning.

Which is why I wrote https://lkml.org/lkml/2016/7/8/469
...

Instead, I suggest to distinguish between recursive calls to
generic_make_request(), and pushing back the remainder part in
blk_queue_split(), by pointing current->bio_lists to a
struct recursion_to_iteration_bio_lists {
struct bio_list recursion;
struct bio_list queue;
}

By providing each q->make_request_fn() with an empty "recursion"
bio_list, then merging any recursively submitted bios to the
head of the "queue" list, we can make the recursion-to-iteration
logic in generic_make_request() process deepest level bios first,
and "sibling" bios of the same level in "natural" order.

...

Cheers,

Lars


Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

2017-03-10 Thread Lars Ellenberg
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
>   */
>  blk_qc_t generic_make_request(struct bio *bio)
>  {
> -   struct bio_list bio_list_on_stack;
> +   /*
> +* bio_list_on_stack[0] contains bios submitted by the current
> +* make_request_fn.
> +* bio_list_on_stack[1] contains bios that were submitted before
> +* the current make_request_fn, but that haven't been processed
> +* yet.
> +*/
> +   struct bio_list bio_list_on_stack[2];
> blk_qc_t ret = BLK_QC_T_NONE;

May I suggest that, if you intend to assign something that is not a
plain &(struct bio_list), but a &(struct bio_list[2]),
you change the task member so it is renamed (current->bio_list vs
current->bio_lists, plural, is what I did last year).
Or you will break external modules, silently, and horribly (or,
rather, they won't notice, but break the kernel).
Examples of such modules would be DRBD, ZFS, quite possibly others.

Thanks,

Lars


Re: blk: improve order of bio handling in generic_make_request()

2017-03-08 Thread Lars Ellenberg
On 8 March 2017 at 17:40, Mikulas Patocka  wrote:
>
> On Wed, 8 Mar 2017, NeilBrown wrote:
> > I don't think this will fix the DM snapshot deadlock by itself.
> > Rather, it make it possible for some internal changes to DM to fix it.
> > The DM change might be something vaguely like:
> >
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 3086da5664f3..06ee0960e415 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct 
> > clone_info *ci)
> >
> >   len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
> >
> > + if (len < ci->sector_count) {
> > + struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
>
> fs_bio_set is a shared bio set, so it is prone to deadlocks. For this
> change, we would need two bio sets per dm device, one for the split bio
> and one for the outgoing bio. (this also means having one more kernel
> thread per dm device)
>
> It would be possible to avoid having two bio sets if the incoming bio were
> the same as the outgoing bio (allocate a small structure, move bi_end_io
> and bi_private into it, replace bi_end_io and bi_private with pointers to
> device mapper and send the bio to the target driver), but it would need
> much more changes - basically rewrite the whole bio handling code in dm.c
> and in the targets.
>
> Mikulas

"back then" (see previously posted link into ML archive)
I suggested this:

...

A bit of conflict here may be that DM has all its own
split and clone and queue magic, and wants to process
"all of the bio" before returning back to generic_make_request().

To change that, __split_and_process_bio() and all its helpers
would need to learn to "push back" (pieces of) the bio they are
currently working on, and not push back via "DM_ENDIO_REQUEUE",
but by bio_list_add_head(&current->bio_lists->queue, piece_to_be_done_later).

Then, after they processed each piece,
*return* all the way up to the top-level generic_make_request(),
where the recursion-to-iteration logic would then
make sure that all deeper level bios, submitted via
recursive calls to generic_make_request() will be processed, before the
next, pushed back, piece of the "original incoming" bio.

And *not* do their own iteration over all pieces first.

Probably not as easy as dropping the while loop,
using bio_advance, and pushing that "advanced" bio back to
current->...queue?

static void __split_and_process_bio(struct mapped_device *md,
struct dm_table *map, struct bio *bio)
...
ci.bio = bio;
ci.sector_count = bio_sectors(bio);
while (ci.sector_count && !error)
error = __split_and_process_non_flush(&ci);
...
error = __split_and_process_non_flush(&ci);
if (ci.sector_count)
bio_advance()
bio_list_add_head(&current->bio_lists->queue, bio)
...

Something like that, maybe?


Needs to be adapted to this new and improved recursion-to-iteration
logic, obviously.  Would that be doable, or does device-mapper for some
reason really *need* its own iteration loop (which, because it is called
from generic_make_request(), won't be able to ever submit anything to
any device, ever, so needs all these helper threads just in case).

Lars



Re: blk: improve order of bio handling in generic_make_request()

2017-03-08 Thread Lars Ellenberg
On 7 March 2017 at 17:52, Mike Snitzer  wrote:

> > On 06.03.2017 21:18, Jens Axboe wrote:
> > > I like the change, and thanks for tackling this. It's been a pending
> > > issue for way too long. I do think we should squash Jack's patch
> > > into the original, as it does clean up the code nicely.
> > >
> > > Do we have a proper test case for this, so we can verify that it
> > > does indeed also work in practice?
> > >
> > Hi Jens,
> >
> > I can trigger deadlock with in RAID1 with test below:
> >
> > I create one md with one local loop device and one remote scsi
> > exported by SRP. running fio with mix rw on top of md, force_close
> > session on storage side. mdx_raid1 is wait on free_array in D state,
> > and a lot of fio also in D state in wait_barrier.
> >
> > With the patch from Neil above, I can no longer trigger it anymore.
> >
> > The discussion was in link below:
> > http://www.spinics.net/lists/raid/msg54680.html
>
> In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> albeit unpolished/needy to get running, see:
> https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
>
> But to actually test block core's ability to handle this, upstream
> commit d67a5f4b5947aba4bfe9a80a2b86079c215ca755 ("dm: flush queued bios
> when process blocks to avoid deadlock") would need to be reverted.
>
> Also, I know Lars had a drbd deadlock too.  Not sure if Jack's MD test
> is sufficient to coverage for drbd.  Lars?
>

As this is just a slightly different implementation, trading some bytes of stack
for more local, self-contained, "obvious" code changes (good job!),
but follows the same basic idea as my original RFC [*]  (see the
"inspired-by" tag)
I have no doubt it fixes the issues we are able to provoke with DRBD.
[*] https://lkml.org/lkml/2016/7/19/263
(where I also already suggest to fix the device-mapper issues
by losing the in-device-mapper loop,
relying on the loop in generic_make_request())

Cheers,
Lars



Re: [PATCH] drbd: avoid clang warning about pointless switch statement

2017-02-06 Thread Lars Ellenberg
Ack.

Ok, not exactly "pointless", it's a valid compile-time assert for unique ids,
basically a BUILD_BUG_ON(duplicate-ids). But adding a default clause
there does not hurt.

Thanks,
Lars


On Wed, Feb 01, 2017 at 05:55:02PM +0100, Arnd Bergmann wrote:
> The drbd code causes warnings that we cannot easily disable when building 
> with clang:
> 
> In file included from drivers/block/drbd/drbd_debugfs.c:10:
> In file included from drivers/block/drbd/drbd_int.h:48:
> In file included from include/linux/drbd_genl_api.h:53:
> In file included from include/linux/genl_magic_struct.h:237:
> include/linux/drbd_genl.h:300:1: warning: no case matching constant switch 
> condition '0'
> 
> There is nothing wrong with the code, and adding 'default:' labels
> in the right place is enough to let clang shut up about the warning.
> 
> Fixes: ec2c35ac1ea2 ("drbd: prepare the transition from connector to 
> genetlink")
> Signed-off-by: Arnd Bergmann 
> ---
>  include/linux/genl_magic_struct.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/linux/genl_magic_struct.h 
> b/include/linux/genl_magic_struct.h
> index 6270a56e5edc..c6fbafb834f1 100644
> --- a/include/linux/genl_magic_struct.h
> +++ b/include/linux/genl_magic_struct.h
> @@ -190,6 +190,7 @@ static inline void ct_assert_unique_operations(void)
>  {
>   switch (0) {
>  #include GENL_MAGIC_INCLUDE_FILE
> + default:
>   ;
>   }
>  }
> @@ -208,6 +209,7 @@ static inline void 
> ct_assert_unique_top_level_attributes(void)
>  {
>   switch (0) {
>  #include GENL_MAGIC_INCLUDE_FILE
> + default:
>   ;
>   }
>  }
> @@ -218,6 +220,7 @@ static inline void ct_assert_unique_ ## s_name ## 
> _attributes(void)   \
>  {\
>   switch (0) {\
>   s_fields\
> + default:\
>   ;   \
>   }   \
>  }
> -- 
> 2.9.0


Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

2017-01-07 Thread Lars Ellenberg
On Sat, Jan 07, 2017 at 10:01:07AM +1100, NeilBrown wrote:
> On Sat, Jan 07 2017, Mike Snitzer wrote:
> > On Fri, Jan 06 2017 at 12:34pm -0500,
> > Mikulas Patocka <mpato...@redhat.com> wrote:
> >> On Fri, 6 Jan 2017, Mikulas Patocka wrote:
> >> > On Wed, 4 Jan 2017, Mike Snitzer wrote:
> >> > > On Wed, Jan 04 2017 at 12:12am -0500,
> >> > > NeilBrown <ne...@suse.com> wrote:
...

> >> > And with the above above patch, the snapshot deadlock bug still happens.
> >
> > That is really unfortunate.  Would be useful to dig in and understand
> > why.  Because ordering of the IO in generic_make_request() really should
> > take care of it.
> 
> I *think* you might be able to resolve this by changing
> __split_and_process_bio() to only ever perform a single split.  No
> looping.
> i.e. if the bio is too big to handle directly, then split off the front
> to a new bio, which you bio_chain to the original.  The original then
> has bio_advance() called to step over the front, then
> generic_make_request() so it is queued.
> Then the code proceeds to __clone_and_map_data_bio() on the front that
> got split off.
> When that completes it *doesn't* loop round, but returns into
> generic_make_request() which does the looping and makes sure to handle
> the lowest-level bio next.
> 
> something vaguely like this:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3086da5664f3..06ee0960e415 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct 
> clone_info *ci)
>  
>   len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
>  
> + if (len < ci->sector_count) {
> + struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
> + bio_chain(split, bio);
> + generic_make_request(bio);
> + bio = split;
> + ci->sector_count = len;
> + }
> +
>   r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
>   if (r < 0)
>   return r;
> 
> though I haven't tested, and the change (if it works) should probably be
> more fully integrated into surrounding code.
> 
> You probably don't want to use "fs_bio_set" either - a target-local
> pool would be best.
> 
> NeilBrown

Which is pretty much what I suggested in this thread
back in July already, see below.

Cheers,
Lars

On Tue, Jul 19, 2016 at 11:00:24AM +0200, Lars Ellenberg wrote:
...

> > > C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
> > >   by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
> > >   (was https://patchwork.kernel.org/patch/7398411/)
> 
> As it stands now,
> this is yet an other issue, but related.
> 
> From the link above:
> 
> | ** Here is the dm-snapshot deadlock that was observed:
> | 
> | 1) Process A sends one-page read bio to the dm-snapshot target. The bio
> | spans snapshot chunk boundary and so it is split to two bios by device
> | mapper.
> | 
> | 2) Device mapper creates the first sub-bio and sends it to the snapshot
> | driver.
> | 
> | 3) The function snapshot_map calls track_chunk (that allocates a
> | structure
> | dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
> | the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
> | 
> | 4) The remapped bio is submitted with generic_make_request, but it isn't
> | issued - it is added to current->bio_list instead.
> | 
> | 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
> | chunk affected by the first remapped bio, it takes down_write(&s->lock)
> | and then loops in __check_for_conflicting_io, waiting for
> | dm_snap_tracked_chunk created in step 3) to be released.
> | 
> | 6) Process A continues, it creates a second sub-bio for the rest of the
> | original bio.
> 
> Aha.
> Here is the relation.
> If "A" had only ever processed "just the chunk it can handle now",
> and "pushed back" the rest of the incoming bio,
> it could rely on all deeper level bios to have been submitted already.
> 
> But this does not look like it easily fits into the current DM model.
> 
> | 7) snapshot_map is called for this new bio, it waits on
> | down_write(&s->lock) that is held by Process B (in step 5).
> 
> There is an other suggestion:
> Use down_trylock (or down_timeout),
> and if it fails, push back the currently to-be-processed bio.
> We can introduce a new bio helper for that.
> Kind of what blk_queue_split() does with my patch applied.
> 
> Or even better, ignore the down_trylock suggestion,
> simply n

Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

2017-01-07 Thread Lars Ellenberg
On Sat, Jan 07, 2017 at 10:01:07AM +1100, NeilBrown wrote:
> On Sat, Jan 07 2017, Mike Snitzer wrote:
> > On Fri, Jan 06 2017 at 12:34pm -0500,
> > Mikulas Patocka  wrote:
> >> On Fri, 6 Jan 2017, Mikulas Patocka wrote:
> >> > On Wed, 4 Jan 2017, Mike Snitzer wrote:
> >> > > On Wed, Jan 04 2017 at 12:12am -0500,
> >> > > NeilBrown  wrote:
...

> >> > And with the above above patch, the snapshot deadlock bug still happens.
> >
> > That is really unfortunate.  Would be useful to dig in and understand
> > why.  Because ordering of the IO in generic_make_request() really should
> > take care of it.
> 
> I *think* you might be able to resolve this by changing
> __split_and_process_bio() to only ever perform a single split.  No
> looping.
> i.e. if the bio is too big to handle directly, then split off the front
> to a new bio, which you bio_chain to the original.  The original then
> has bio_advance() called to stop over the front, then
> generic_make_request() so it is queued.
> Then the code proceeds to __clone_and_map_data_bio() on the front that
> got split off.
> When that completes it *doesn't* loop round, but returns into
> generic_make_request() which does the looping and makes sure to handle
> the lowest-level bio next.
> 
> something vaguely like this:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3086da5664f3..06ee0960e415 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct 
> clone_info *ci)
>  
>   len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
>  
> + if (len < ci->sector_count) {
> + struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
> + bio_chain(split, bio);
> + generic_make_request(bio);
> + bio = split;
> + ci->sector_count = len;
> + }
> +
>   r = __clone_and_map_data_bio(ci, ti, ci->sector, );
>   if (r < 0)
>   return r;
> 
> though I haven't tested, and the change (if it works) should probably be
> more fully integrated into surrounding code.
> 
> You probably don't want to use "fs_bio_set" either - a target-local
> pool would be best.
> 
> NeilBrown

Which is pretty much what I suggested in this thread
back in July already, see below.

Cheers,
Lars

On Tue, Jul 19, 2016 at 11:00:24AM +0200, Lars Ellenberg wrote:
...

> > > C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
> > >   by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
> > >   (was https://patchwork.kernel.org/patch/7398411/)
> 
> As it stands now,
> this is yet an other issue, but related.
> 
> From the link above:
> 
> | ** Here is the dm-snapshot deadlock that was observed:
> | 
> | 1) Process A sends one-page read bio to the dm-snapshot target. The bio
> | spans snapshot chunk boundary and so it is split to two bios by device
> | mapper.
> | 
> | 2) Device mapper creates the first sub-bio and sends it to the snapshot
> | driver.
> | 
> | 3) The function snapshot_map calls track_chunk (that allocates a
> | structure
> | dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
> | the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
> | 
> | 4) The remapped bio is submitted with generic_make_request, but it isn't
> | issued - it is added to current->bio_list instead.
> | 
> | 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
> | chunk affected be the first remapped bio, it takes down_write(>lock)
> | and then loops in __check_for_conflicting_io, waiting for
> | dm_snap_tracked_chunk created in step 3) to be released.
> | 
> | 6) Process A continues, it creates a second sub-bio for the rest of the
> | original bio.
> 
> Aha.
> Here is the relation.
> If "A" had only ever processed "just the chunk it can handle now",
> and "pushed back" the rest of the incoming bio,
> it could rely on all deeper level bios to have been submitted already.
> 
> But this does not look like it easily fits into the current DM model.
> 
> | 7) snapshot_map is called for this new bio, it waits on
> | down_write(>lock) that is held by Process B (in step 5).
> 
> There is an other suggestion:
> Use down_trylock (or down_timeout),
> and if it fails, push back the currently to-be-processed bio.
> We can introduce a new bio helper for that.
> Kind of what blk_queue_split() does with my patch applied.
> 
> Or even better, ignore the down_trylock suggestion,
> simply not iterate over all pieces first,
> but process one piece, and return back to the
> iteration in generic_make_request.

Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

2016-12-23 Thread Lars Ellenberg
On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:
> Dear Maintainers
> 
> I'd like to ask for the status of this patch since we hit the
> issue too during our testing on md raid1.
> 
> Split remainder bio_A was queued ahead, followed by bio_B for
> the lower device; at this moment raid starts freezing. The loop takes
> out bio_A first and delivers it, which hangs since raid is
> freezing, while the freezing never ends since it is waiting for
> bio_B to finish, and bio_B is still on the queue, waiting for
> bio_A to finish...
> 
> We're looking for a good solution and we found this patch
> already progressed a lot, but we can't find it on linux-next,
> so we'd like to ask are we still planning to have this fix
> in upstream?

I don't see why not, I'd even like to have it in older kernels,
but did not have the time and energy to push it.

Thanks for the bump.

Lars

On 07/11/2016 04:10 PM, Lars Ellenberg wrote:
> For a long time, generic_make_request() converts recursion into
> iteration by queuing recursive arguments on current->bio_list.
> 
> This is convenient for stacking drivers,
> the top-most driver would take the originally submitted bio,
> and re-submit a re-mapped version of it, or one or more clones,
> or one or more new allocated bios to its backend(s). Which
> are then simply processed in turn, and each can again queue
> more "backend-bios" until we reach the bottom of the driver stack,
> and actually dispatch to the real backend device.
> 
> Any stacking driver ->make_request_fn() could expect that,
> once it returns, any backend-bios it submitted via recursive calls
> to generic_make_request() would now be processed and dispatched, before
> the current task would call into this driver again.
> 
> This is changed by commit
>   54efd50 block: make generic_make_request handle arbitrarily sized bios
> 
> Drivers may call blk_queue_split() inside their ->make_request_fn(),
> which may split the current bio into a front-part to be dealt with
> immediately, and a remainder-part, which may need to be split even
> further. That remainder-part will simply also be pushed to
> current->bio_list, and would end up being head-of-queue, in front
> of any backend-bios the current make_request_fn() might submit during
> processing of the front-part.
> 
> Which means the current task would immediately end up back in the same
> make_request_fn() of the same driver again, before any of its backend
> bios have even been processed.
> 
> This can lead to resource starvation deadlock.
> Drivers could avoid this by learning to not need blk_queue_split(),
> or by submitting their backend bios in a different context (dedicated
> kernel thread, work_queue context, ...). Or by playing funny re-ordering
> games with entries on current->bio_list.
> 
> Instead, I suggest to distinguish between recursive calls to
> generic_make_request(), and pushing back the remainder part in
> blk_queue_split(), by pointing current->bio_lists to a
>   struct recursion_to_iteration_bio_lists {
>   struct bio_list recursion;
>   struct bio_list queue;
>   }
> 
> By providing each q->make_request_fn() with an empty "recursion"
> bio_list, then merging any recursively submitted bios to the
> head of the "queue" list, we can make the recursion-to-iteration
> logic in generic_make_request() process deepest level bios first,
> and "sibling" bios of the same level in "natural" order.
> 
> Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com>
> Signed-off-by: Roland Kammerer <roland.kamme...@linbit.com>
> ---
>  block/bio.c   | 20 +++
>  block/blk-core.c  | 49 +--
>  block/blk-merge.c |  5 -
>  drivers/md/bcache/btree.c | 12 ++--
>  drivers/md/dm-bufio.c |  2 +-
>  drivers/md/raid1.c|  5 ++---
>  drivers/md/raid10.c   |  5 ++---
>  include/linux/bio.h   | 25 
>  include/linux/sched.h |  4 ++--
>  9 files changed, 80 insertions(+), 47 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 848cd35..c2606fd 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -366,12 +366,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>*/
>  
> 	bio_list_init(&punt);
> -	bio_list_init(&nopunt);
>  
> -	while ((bio = bio_list_pop(current->bio_list)))
> +	bio_list_init(&nopunt);
> +	while ((bio = bio_list_pop(&current->bio_lists->recursion)))
> 		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +	current->bio_lists->recursion = nopunt;
>  
> -	*current->bio_list = nopunt;
> +	bio_list_init(&nopunt);

Re: [Drbd-dev] DRBD: ASSERT in drbd_actlog.c:259

2016-11-14 Thread Lars Ellenberg
On Fri, Nov 11, 2016 at 06:45:18PM +0100, Wolfgang Walter wrote:
> Hello,
> 
> when I execute
> 
> mkfs.ext4 -J device=UUID=625d871f-c278-4acb-916d-774dc78dbd8a -v -b 4096 -E 
> stride=$((512/4)),stripe_width=$((512/4*3)),lazy_itable_init=0 -O 
> inline_data,mmp -L dyn -m .01 /dev/test/big
> 
> the command hangs

It should not hang, really, but it is probably too busy spewing below
message to the console over and over again.

There have been some problems with our trim/discard handling in some situations.
That's also why this does not happen if you say nodiscard.

> and the kernel (4.4.30) complains:
>
> block drbd1: ASSERT( (unsigned)(last - first) <= 1 ) in 
> drivers/block/drbd/drbd_actlog.c:259

see:
commit 505675f96cf0f169647a18c3dda1f373eca957b1
Author: Lars Ellenberg <lars.ellenb...@linbit.com>
Date:   Tue Jun 14 00:26:23 2016 +0200

drbd: allow larger max_discard_sectors

Make sure we have at least 67 (> AL_UPDATES_PER_TRANSACTION)
al-extents available, and allow up to half of that to be
discarded in one bio.

Signed-off-by: Philipp Reisner <philipp.reis...@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com>
Signed-off-by: Jens Axboe <ax...@fb.com>

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index d524973..265b2b6 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -258,7 +258,7 @@ bool drbd_al_begin_io_fastpath(struct drbd_device *device, 
struct drbd_interval
unsigned first = i->sector >> (AL_EXTENT_SHIFT-9);
unsigned last = i->size == 0 ? first : (i->sector + (i->size >> 9) - 1) >> (AL_EXTENT_SHIFT-9);
 
-   D_ASSERT(device, (unsigned)(last - first) <= 1);
+   D_ASSERT(device, first <= last);
D_ASSERT(device, atomic_read(&device->local_cnt) > 0);
 
/* FIXME figure out a fast path for bios crossing AL extent boundaries 
*/

...
and some more chunks, but you should not cherry-pick stuff "just so",
if it may change semantics ...

kernel v4.4 is roughly equivalent to out-of-tree drbd 8.4.5,
we are at 8.4.9 meanwhile, so you may want to try that as well.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT



[PATCH v2] drbd: Fix kernel_sendmsg() usage - potential NULL deref

2016-11-09 Thread Lars Ellenberg
From: Richard Weinberger <rich...@nod.at>

Don't pass a size larger than iov_len to kernel_sendmsg().
Otherwise it will cause a NULL pointer deref when kernel_sendmsg()
returns with rv < size.

DRBD as external module has been around in the kernel 2.4 days already.
We used to be compatible to 2.4 and very early 2.6 kernels,
we used to use
 rv = sock_sendmsg(sock, &msg, iov.iov_len);
then later changed to
 rv = kernel_sendmsg(sock, &msg, &iov, 1, size);
when we should have used
 rv = kernel_sendmsg(sock, &msg, &iov, 1, iov.iov_len);

tcp_sendmsg() used to totally ignore the size parameter.
 57be5bd ip: convert tcp_sendmsg() to iov_iter primitives
changes that, and exposes our long standing error.

Even with this error exposed, to trigger the bug, we would need to have
an environment (config or otherwise) causing us to not use sendpage()
for larger transfers, a failing connection, and have it fail "just at the
right time".  Apparently that was unlikely enough for most, so this went
unnoticed for years.

Still, it is known to trigger at least some of these,
and suspected for the others:
[0] http://lists.linbit.com/pipermail/drbd-user/2016-July/023112.html
[1] http://lists.linbit.com/pipermail/drbd-dev/2016-March/003362.html
[2] https://forums.grsecurity.net/viewtopic.php?f=3=4546
[3] https://ubuntuforums.org/showthread.php?t=2336150
[4] http://e2.howsolveproblem.com/i/1175162/

This should go into 4.9,
and into all stable branches since and including v4.0,
which is the first to contain the exposing change.

It is correct for all stable branches older than that as well
(which contain the DRBD driver; which is 2.6.33 and up).

It requires a small "conflict" resolution for v4.4 and earlier, with v4.5
we dropped the comment block immediately preceding the kernel_sendmsg().

Fixes: b411b3637fa7 ("The DRBD driver")
Cc: <sta...@vger.kernel.org> # 2.6.33.x-
Cc: v...@zeniv.linux.org.uk
Cc: christoph.lechleit...@iteg.at
Cc: wolfgang.g...@iteg.at
Reported-by: Christoph Lechleitner <christoph.lechleit...@iteg.at>
Tested-by: Christoph Lechleitner <christoph.lechleit...@iteg.at>
Signed-off-by: Richard Weinberger <rich...@nod.at>
[changed oneliner to be "obvious" without context; more verbose message]
Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com>
---
 drivers/block/drbd/drbd_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 100be55..8348272 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1871,7 +1871,7 @@ int drbd_send(struct drbd_connection *connection, struct 
socket *sock,
drbd_update_congested(connection);
}
do {
-   rv = kernel_sendmsg(sock, &msg, &iov, 1, size);
+   rv = kernel_sendmsg(sock, &msg, &iov, 1, iov.iov_len);
if (rv == -EAGAIN) {
if (we_should_drop_the_connection(connection, sock))
break;
-- 
2.7.4



Re: [Drbd-dev] [PATCH] drbd: Fix kernel_sendmsg() usage

2016-11-09 Thread Lars Ellenberg
On Wed, Nov 09, 2016 at 04:47:15PM +0100, Richard Weinberger wrote:
> > Should I sent two patches, one that applies to 4.5 and later,
> > and one that applies to 2.6.33 ... 4.4, or are you or stable
> > willing to resolve the trivial "missing comment block" conflict yourself?
> 
> BTW: Why did you drop the "Fixes:" tag too?
> 

By accident, probably.

I'm ok with you guys just using the original, if you prefer, just let me
know.  It's the fix, that's important, not how much noise we can make
about that oneliner.

Lars




Re: [PATCH] drbd: Fix kernel_sendmsg() usage

2016-11-09 Thread Lars Ellenberg
On Tue, Nov 08, 2016 at 09:52:04AM -0700, Jens Axboe wrote:
> > > This should go into 4.9,
> > > and into all stable branches since and including v4.0,
> > > which is the first to contain the exposing change.
> > > 
> > > It is correct for all stable branches older than that as well
> > > (which contain the DRBD driver; which is 2.6.33 and up).
> > > 
> > > It requires a small "conflict" resolution for v4.4 and earlier, with v4.5
> > > we dropped the comment block immediately preceding the kernel_sendmsg().
> > > 
> > > Cc: sta...@vger.kernel.org
> > > Cc: v...@zeniv.linux.org.uk
> > > Cc: christoph.lechleit...@iteg.at
> > > Cc: wolfgang.g...@iteg.at
> > > Reported-by: Christoph Lechleitner <christoph.lechleit...@iteg.at>
> > > Tested-by: Christoph Lechleitner <christoph.lechleit...@iteg.at>
> > > Signed-off-by: Richard Weinberger <rich...@nod.at>
> > > Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com>
> > 
> > Changing my patch is perfectly fine, but please clearly state it.
> > I.e. by adding something like that before your S-o-b.
> > [Lars: Massaged patch to match my personal taste...]
> 

> Lars, are you sending a new one? If you do, add the stable tag as well.

So my "change" against his original patch was
- rv = kernel_sendmsg(sock, &msg, &iov, 1, size - sent);
+ rv = kernel_sendmsg(sock, &msg, &iov, 1, iov.iov_len);
to make it "more obviously correct" from looking just at the one line
without even having to read the context.  And a more verbose commit message.

If that requires yet additional noise, sure, so be it :)

Should I sent two patches, one that applies to 4.5 and later,
and one that applies to 2.6.33 ... 4.4, or are you or stable
willing to resolve the trivial "missing comment block" conflict yourself?

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT



Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

2016-07-19 Thread Lars Ellenberg
On Tue, Jul 12, 2016 at 10:32:33PM -0400, Mike Snitzer wrote:
> On Tue, Jul 12 2016 at 10:18pm -0400,
> Eric Wheeler <bca...@lists.ewheeler.net> wrote:
> 
> > On Tue, 12 Jul 2016, NeilBrown wrote:
> > 
> > > On Tue, Jul 12 2016, Lars Ellenberg wrote:
> > > 
> > > >
> > > > Instead, I suggest to distinguish between recursive calls to
> > > > generic_make_request(), and pushing back the remainder part in
> > > > blk_queue_split(), by pointing current->bio_lists to a
> > > > struct recursion_to_iteration_bio_lists {
> > > > struct bio_list recursion;
> > > > struct bio_list queue;
> > > > }
> > > >
> > > > By providing each q->make_request_fn() with an empty "recursion"
> > > > bio_list, then merging any recursively submitted bios to the
> > > > head of the "queue" list, we can make the recursion-to-iteration
> > > > logic in generic_make_request() process deepest level bios first,
> > > > and "sibling" bios of the same level in "natural" order.
> > > >
> > > > Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com>
> > > > Signed-off-by: Roland Kammerer <roland.kamme...@linbit.com>
> > > 
> > > Reviewed-by: NeilBrown <ne...@suse.com>
> > > 
> > > Thanks again for doing this - I think this is a very significant
> > > improvement and could allow other simplifications.
> > 
> > Thank you Lars for all of this work!  
> > 
> > It seems like there have been many 4.3+ blockdev stacking issues and this 
> > will certainly address some of those (maybe all of them?).  (I think we 
> > hit this while trying drbd in 4.4 so we dropped back to 4.1 without 
> > issue.)  It would be great to hear 4.4.y stable pick this up if 
> > compatible.
> > 
> > 
> > Do you believe that this patch would solve any of the proposals by others 
> > since 4.3 related to bio splitting/large bios?  I've been collecting a 
> > list, none of which appear have landed yet as of 4.7-rc7 (but correct me 
> > if I'm wrong):
> > 
> > A.  [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
> > by Ming Lei: https://patchwork.kernel.org/patch/9169483/

That's an independend issue.

> > B.  block: don't make BLK_DEF_MAX_SECTORS too big
> > by Shaohua Li: http://www.spinics.net/lists/linux-bcache/msg03525.html

Yet an other independend issue.

> > C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
> > by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
> > (was https://patchwork.kernel.org/patch/7398411/)

As it stands now,
this is yet an other issue, but related.

From the link above:

| ** Here is the dm-snapshot deadlock that was observed:
| 
| 1) Process A sends one-page read bio to the dm-snapshot target. The bio
| spans snapshot chunk boundary and so it is split to two bios by device
| mapper.
| 
| 2) Device mapper creates the first sub-bio and sends it to the snapshot
| driver.
| 
| 3) The function snapshot_map calls track_chunk (that allocates a
| structure
| dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
| the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
| 
| 4) The remapped bio is submitted with generic_make_request, but it isn't
| issued - it is added to current->bio_list instead.
| 
| 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
| chunk affected by the first remapped bio, it takes down_write(&s->lock)
| and then loops in __check_for_conflicting_io, waiting for
| dm_snap_tracked_chunk created in step 3) to be released.
| 
| 6) Process A continues, it creates a second sub-bio for the rest of the
| original bio.

Aha.
Here is the relation.
If "A" had only ever processed "just the chunk it can handle now",
and "pushed back" the rest of the incoming bio,
it could rely on all deeper level bios to have been submitted already.

But this does not look like it easily fits into the current DM model.

| 7) snapshot_map is called for this new bio, it waits on
| down_write(&s->lock) that is held by Process B (in step 5).

There is an other suggestion:
Use down_trylock (or down_timeout),
and if it fails, push back the currently to-be-processed bio.
We can introduce a new bio helper for that.
Kind of what blk_queue_split() does with my patch applied.

Or even better, ignore the down_trylock suggestion,
simply not iterate over all pieces first,
but process one piece, and return back to the
iteration in generic_make_request.

A bit of conflict here may be that DM has all its own
split and clone and queue magic, and wants to process
"all of the bio"

Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

2016-07-19 Thread Lars Ellenberg
On Tue, Jul 12, 2016 at 10:32:33PM -0400, Mike Snitzer wrote:
> On Tue, Jul 12 2016 at 10:18pm -0400,
> Eric Wheeler  wrote:
> 
> > On Tue, 12 Jul 2016, NeilBrown wrote:
> > 
> > > On Tue, Jul 12 2016, Lars Ellenberg wrote:
> > > 
> > > >
> > > > Instead, I suggest to distinguish between recursive calls to
> > > > generic_make_request(), and pushing back the remainder part in
> > > > blk_queue_split(), by pointing current->bio_lists to a
> > > > struct recursion_to_iteration_bio_lists {
> > > > struct bio_list recursion;
> > > > struct bio_list queue;
> > > > }
> > > >
> > > > By providing each q->make_request_fn() with an empty "recursion"
> > > > bio_list, then merging any recursively submitted bios to the
> > > > head of the "queue" list, we can make the recursion-to-iteration
> > > > logic in generic_make_request() process deepest level bios first,
> > > > and "sibling" bios of the same level in "natural" order.
> > > >
> > > > Signed-off-by: Lars Ellenberg 
> > > > Signed-off-by: Roland Kammerer 
> > > 
> > > Reviewed-by: NeilBrown 
> > > 
> > > Thanks again for doing this - I think this is a very significant
> > > improvement and could allow other simplifications.
> > 
> > Thank you Lars for all of this work!  
> > 
> > It seems like there have been many 4.3+ blockdev stacking issues and this 
> > will certainly address some of those (maybe all of them?).  (I think we 
> > hit this while trying drbd in 4.4 so we dropped back to 4.1 without 
> > issue.)  It would be great to hear 4.4.y stable pick this up if 
> > compatible.
> > 
> > 
> > Do you believe that this patch would solve any of the proposals by others 
> > since 4.3 related to bio splitting/large bios?  I've been collecting a 
> > list, none of which appear have landed yet as of 4.7-rc7 (but correct me 
> > if I'm wrong):
> > 
> > A.  [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
> > by Ming Lei: https://patchwork.kernel.org/patch/9169483/

That's an independend issue.

> > B.  block: don't make BLK_DEF_MAX_SECTORS too big
> > by Shaohua Li: http://www.spinics.net/lists/linux-bcache/msg03525.html

Yet an other independend issue.

> > C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
> > by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
> > (was https://patchwork.kernel.org/patch/7398411/)

As it stands now,
this is yet an other issue, but related.

>From the link above:

| ** Here is the dm-snapshot deadlock that was observed:
| 
| 1) Process A sends one-page read bio to the dm-snapshot target. The bio
| spans snapshot chunk boundary and so it is split to two bios by device
| mapper.
| 
| 2) Device mapper creates the first sub-bio and sends it to the snapshot
| driver.
| 
| 3) The function snapshot_map calls track_chunk (that allocates a
| structure
| dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
| the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
| 
| 4) The remapped bio is submitted with generic_make_request, but it isn't
| issued - it is added to current->bio_list instead.
| 
| 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
| chunk affected by the first remapped bio, it takes down_write(&s->lock)
| and then loops in __check_for_conflicting_io, waiting for
| dm_snap_tracked_chunk created in step 3) to be released.
| 
| 6) Process A continues, it creates a second sub-bio for the rest of the
| original bio.

Aha.
Here is the relation.
If "A" had only ever processed "just the chunk it can handle now",
and "pushed back" the rest of the incoming bio,
it could rely on all deeper level bios to have been submitted already.

But this does not look like it easily fits into the current DM model.

| 7) snapshot_map is called for this new bio, it waits on
| down_write(&s->lock) that is held by Process B (in step 5).

There is an other suggestion:
Use down_trylock (or down_timeout),
and if it fails, push back the currently to-be-processed bio.
We can introduce a new bio helper for that.
Kind of what blk_queue_split() does with my patch applied.

Or even better, ignore the down_trylock suggestion,
simply not iterate over all pieces first,
but process one piece, and return back to the
iteration in generic_make_request.

A bit of conflict here may be that DM has all its own
split and clone and queue magic, and wants to process
"all of the bio" at once.

Re: [PATCH 1/1] block: fix blk_queue_split() resource exhaustion

2016-07-11 Thread Lars Ellenberg
Dropped the XXX comment (oops),
moved the current_has_pending_bios() helper to bio.h
and dropped the identical ones from bio.c and md.h.

Reposted in-thread as
[PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

Thanks,

Lars Ellenberg



[PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

2016-07-11 Thread Lars Ellenberg
For a long time, generic_make_request() converts recursion into
iteration by queuing recursive arguments on current->bio_list.

This is convenient for stacking drivers,
the top-most driver would take the originally submitted bio,
and re-submit a re-mapped version of it, or one or more clones,
or one or more new allocated bios to its backend(s). Which
are then simply processed in turn, and each can again queue
more "backend-bios" until we reach the bottom of the driver stack,
and actually dispatch to the real backend device.

Any stacking driver ->make_request_fn() could expect that,
once it returns, any backend-bios it submitted via recursive calls
to generic_make_request() would now be processed and dispatched, before
the current task would call into this driver again.

This is changed by commit
  54efd50 block: make generic_make_request handle arbitrarily sized bios

Drivers may call blk_queue_split() inside their ->make_request_fn(),
which may split the current bio into a front-part to be dealt with
immediately, and a remainder-part, which may need to be split even
further. That remainder-part will simply also be pushed to
current->bio_list, and would end up being head-of-queue, in front
of any backend-bios the current make_request_fn() might submit during
processing of the front-part.

Which means the current task would immediately end up back in the same
make_request_fn() of the same driver again, before any of its backend
bios have even been processed.

This can lead to resource starvation deadlock.
Drivers could avoid this by learning to not need blk_queue_split(),
or by submitting their backend bios in a different context (dedicated
kernel thread, work_queue context, ...). Or by playing funny re-ordering
games with entries on current->bio_list.

Instead, I suggest to distinguish between recursive calls to
generic_make_request(), and pushing back the remainder part in
blk_queue_split(), by pointing current->bio_lists to a
struct recursion_to_iteration_bio_lists {
struct bio_list recursion;
struct bio_list queue;
}

By providing each q->make_request_fn() with an empty "recursion"
bio_list, then merging any recursively submitted bios to the
head of the "queue" list, we can make the recursion-to-iteration
logic in generic_make_request() process deepest level bios first,
and "sibling" bios of the same level in "natural" order.

Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com>
Signed-off-by: Roland Kammerer <roland.kamme...@linbit.com>
---
 block/bio.c   | 20 +++
 block/blk-core.c  | 49 +--
 block/blk-merge.c |  5 -
 drivers/md/bcache/btree.c | 12 ++--
 drivers/md/dm-bufio.c |  2 +-
 drivers/md/raid1.c|  5 ++---
 drivers/md/raid10.c   |  5 ++---
 include/linux/bio.h   | 25 
 include/linux/sched.h |  4 ++--
 9 files changed, 80 insertions(+), 47 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 848cd35..c2606fd 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -366,12 +366,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 */
 
 	bio_list_init(&punt);
-	bio_list_init(&nopunt);
 
-	while ((bio = bio_list_pop(current->bio_list)))
+	bio_list_init(&nopunt);
+	while ((bio = bio_list_pop(&current->bio_lists->recursion)))
 		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_lists->recursion = nopunt;
 
-	*current->bio_list = nopunt;
+	bio_list_init(&nopunt);
+	while ((bio = bio_list_pop(&current->bio_lists->queue)))
+		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_lists->queue = nopunt;
 
 	spin_lock(&bs->rescue_lock);
 	bio_list_merge(&bs->rescue_list, &punt);
@@ -453,13 +457,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 	 *
 	 * We solve this, and guarantee forward progress, with a rescuer
 	 * workqueue per bio_set. If we go to allocate and there are
-	 * bios on current->bio_list, we first try the allocation
-	 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
-	 * bios we would be blocking to the rescuer workqueue before
-	 * we retry with the original gfp_flags.
+	 * bios on current->bio_lists->{recursion,queue}, we first try the
+	 * allocation without __GFP_DIRECT_RECLAIM; if that fails, we
+	 * punt those bios we would be blocking to the rescuer
+	 * workqueue before we retry with the original gfp_flags.
 	 */
 
-	if (current->bio_list && !bio_list_empty(current->bio_list))
+	if (current_has_pending_bios())
 		gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 

[PATCH 1/1] block: fix blk_queue_split() resource exhaustion

2016-07-08 Thread Lars Ellenberg
For a long time, generic_make_request() converts recursion into
iteration by queuing recursive arguments on current->bio_list.

This is convenient for stacking drivers,
the top-most driver would take the originally submitted bio,
and re-submit a re-mapped version of it, or one or more clones,
or one or more new allocated bios to its backend(s). Which
are then simply processed in turn, and each can again queue
more "backend-bios" until we reach the bottom of the driver stack,
and actually dispatch to the real backend device.

Any stacking driver ->make_request_fn() could expect that,
once it returns, any backend-bios it submitted via recursive calls
to generic_make_request() would now be processed and dispatched, before
the current task would call into this driver again.

This is changed by commit
  54efd50 block: make generic_make_request handle arbitrarily sized bios

Drivers may call blk_queue_split() inside their ->make_request_fn(),
which may split the current bio into a front-part to be dealt with
immediately, and a remainder-part, which may need to be split even
further. That remainder-part will simply also be pushed to
current->bio_list, and would end up being head-of-queue, in front
of any backend-bios the current make_request_fn() might submit during
processing of the front-part.

Which means the current task would immediately end up back in the same
make_request_fn() of the same driver again, before any of its backend
bios have even been processed.

This can lead to resource starvation deadlock.
Drivers could avoid this by learning to not need blk_queue_split(),
or by submitting their backend bios in a different context (dedicated
kernel thread, work_queue context, ...). Or by playing funny re-ordering
games with entries on current->bio_list.

Instead, I suggest to distinguish between recursive calls to
generic_make_request(), and pushing back the remainder part in
blk_queue_split(), by pointing current->bio_lists to a
struct recursion_to_iteration_bio_lists {
struct bio_list recursion;
struct bio_list queue;
}

By providing each q->make_request_fn() with an empty "recursion"
bio_list, then merging any recursively submitted bios to the
head of the "queue" list, we can make the recursion-to-iteration
logic in generic_make_request() process deepest level bios first,
and "sibling" bios of the same level in "natural" order.

Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com>
Signed-off-by: Roland Kammerer <roland.kamme...@linbit.com>
---
 block/bio.c   | 27 +
 block/blk-core.c  | 50 +--
 block/blk-merge.c |  5 -
 drivers/md/bcache/btree.c | 12 ++--
 drivers/md/dm-bufio.c |  2 +-
 drivers/md/md.h   |  7 +++
 drivers/md/raid1.c|  5 ++---
 drivers/md/raid10.c   |  5 ++---
 include/linux/bio.h   | 18 +
 include/linux/sched.h |  4 ++--
 10 files changed, 88 insertions(+), 47 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 848cd35..1f9fcf0 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -366,12 +366,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 */
 
 	bio_list_init(&punt);
-	bio_list_init(&nopunt);
 
-	while ((bio = bio_list_pop(current->bio_list)))
+	bio_list_init(&nopunt);
+	while ((bio = bio_list_pop(&current->bio_lists->recursion)))
 		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_lists->recursion = nopunt;
 
-	*current->bio_list = nopunt;
+	bio_list_init(&nopunt);
+	while ((bio = bio_list_pop(&current->bio_lists->queue)))
+		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_lists->queue = nopunt;
 
 	spin_lock(&bs->rescue_lock);
 	bio_list_merge(&bs->rescue_list, &punt);
@@ -380,6 +384,13 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 	queue_work(bs->rescue_workqueue, &bs->rescue_work);
 }
 
+static bool current_has_pending_bios(void)
+{
+	return current->bio_lists &&
+		(!bio_list_empty(&current->bio_lists->queue) ||
+		 !bio_list_empty(&current->bio_lists->recursion));
+}
+
 /**
  * bio_alloc_bioset - allocate a bio for I/O
  * @gfp_mask:   the GFP_ mask given to the slab allocator
@@ -453,13 +464,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 	 *
 	 * We solve this, and guarantee forward progress, with a rescuer
 	 * workqueue per bio_set. If we go to allocate and there are
-	 * bios on current->bio_list, we first try the allocation
-	 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
-	 * bios we would be blocking to the rescuer workqueue before
-	 * we retry with the original gfp_flags.
+	 * bios on current->bio_lists->{recursion,queue}, we first try the

[PATCH 0/1] block: fix blk_queue_split() resource exhaustion

2016-07-08 Thread Lars Ellenberg
Result of RFC previously discussed here:
https://lkml.org/lkml/2016/6/22/172
[RFC] block: fix blk_queue_split() resource exhaustion

Rebased to linux-block/for-4.8/core as of today.
Would also need to go to Stable 4.3 and later.

Lars Ellenberg (1):
  block: fix blk_queue_split() resource exhaustion

 block/bio.c   | 27 +
 block/blk-core.c  | 50 +--
 block/blk-merge.c |  5 -
 drivers/md/bcache/btree.c | 12 ++--
 drivers/md/dm-bufio.c |  2 +-
 drivers/md/md.h   |  7 +++
 drivers/md/raid1.c|  5 ++---
 drivers/md/raid10.c   |  5 ++---
 include/linux/bio.h   | 18 +
 include/linux/sched.h |  4 ++--
 10 files changed, 88 insertions(+), 47 deletions(-)

-- 
1.9.1



Re: [dm-devel] [RFC] block: fix blk_queue_split() resource exhaustion

2016-07-08 Thread Lars Ellenberg
On Fri, Jul 08, 2016 at 07:24:24PM +0600, Pavel Goran wrote:
> Friday, July 8, 2016, 7:00:35 PM, you wrote:
> > On Fri, Jul 08, 2016 at 07:39:36PM +1000, NeilBrown wrote:
> >> I think we just might be in violent agreement.
> 
> > I thought so, too :-)
> 
> > Should I merge both patches,
> > rename to ".queue" and ".tmp",
> > and submit for inclusion?
> 
> Could you please *not* use ".tmp"? I have a feeling it's even worse than "X"
> and "Y". :)
> 
> I suggest to leave it as ".recursion", or maybe use something like
> ".downstream".

"Naming is hard" :-)

.recursion and .queue, then,
because .remainder no longer fits.

Also, what about "Cc: Stable"?
The patch does not strictly fall into the "stable" rules,
(too many lines...).
But it should go there anyways, I think.
I'll just add those tags, and wait for the heat?

Lars Ellenberg



Re: [dm-devel] [RFC] block: fix blk_queue_split() resource exhaustion

2016-07-08 Thread Lars Ellenberg
On Fri, Jul 08, 2016 at 07:39:36PM +1000, NeilBrown wrote:
> >> To make the patch "perfect", and maybe even more elegant we could treat
> >> ->remainder and ->recursion more alike.
> >> i.e.:
> >>   - generic make request has a private "stack" of requests.
> >>   - before calling ->make_request_fn(), both ->remainder and ->recursion
> >> are initialised
> >>   - after ->make_request_fn(), ->remainder are spliced in to top of
> >> 'stack', then ->recursion is spliced onto that.
> >>   - If stack is not empty, the top request is popped and we loop to top.
> >> 
> >> This reliably follows in-order execution, and handles siblings correctly
> >> (in submitted order) if/when a request splits off multiple siblings.
> >
> > The only splitting that creates siblings on the current level
> > is blk_queue_split(), which splits the current bio into
> > "front piece" and "remainder", already processed in this order.
> 
> Yes.
> I imagine that a driver *could* split a bio into three parts with a
> single allocation, but I cannot actually see any point in doing it.  So
> I was over-complicating things.
> 
> >
> > Anything else creating "siblings" is not creating siblings for the
> > current layer, but for the next deeper layer, which are queued on
> > "recursion" and also processed in the order they have been generated.
> >
> >> I think that as long as requests are submitted in the order they are
> >> created at each level there is no reason to expect performance
> >> regressions.
> >> All we are doing is changing the ordering between requests generated at
> >> different levels, and I think we are restoring a more natural order.
> >
> > I believe both patches combined are doing exactly this already.
> > I could rename .remainder to .todo or .incoming, though.
> 
> :-)  neither "remainder" or "recursion" seem like brilliant names to me,
> but I don't have anything better to suggest.  Naming is hard!
> As long as a comment explains the name clearly I could cope with X and Y.

...

> I think we just might be in violent agreement.

I thought so, too :-)

Should I merge both patches,
rename to ".queue" and ".tmp",
and submit for inclusion?

Lars Ellenberg



Re: [dm-devel] [RFC] block: fix blk_queue_split() resource exhaustion

2016-07-08 Thread Lars Ellenberg
On Fri, Jul 08, 2016 at 07:08:32PM +0800, Ming Lei wrote:
> > So after processing a particular bio, we should then process all the
> > 'child' bios - bios send to underlying devices.  Then the 'sibling'
> > bios, that were split off, and then any remaining parents and ancestors.
> 
> IMHO, that is just what the oneline patch is doing, isn't it?
> 
> | diff --git a/block/blk-core.c b/block/blk-core.c
>  | index 2475b1c7..a5623f6 100644
>  | --- a/block/blk-core.c
>  | +++ b/block/blk-core.c
>  | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>  |   * should be added at the tail
>  |   */
>  |  if (current->bio_list) {
>  | -bio_list_add(current->bio_list, bio);
>  | +bio_list_add_head(current->bio_list, bio);
>  |  goto out;
>  |  }

Almost, but not quite.

As explained earlier, this will re-order.
It will still process bios in "deepest level first" order,
but it will process "sibling" bios in reverse submission order.

Think "very large bio" submitted to a stripe set
with small stripe width/stripe unit size.

So I'd expect this to be a performance hit in some scenarios,
unless the stack at some deeper level does back-merging in its elevator.
(If some driver is not able to merge stuff because of "reverse submission
order" this can easily mean saturating IOPS of the physical device with
small requests, throttling bandwidth to a minimum.)

That's why I mentioned it as "potential easy fix for the deadlock",
but did not suggest it as the proper way to fix this.

If however the powers that be decide that this was a non-issue,
we could use it this way.

Lars



Re: [dm-devel] [RFC] block: fix blk_queue_split() resource exhaustion

2016-07-08 Thread Lars Ellenberg
.recursion = [ bio(l=2,aa), bio(l=2,ab), ... ]

merge_head

.incoming = [ bio(l=2,aa), bio(l=2,ab), ...,
    bio(l=1,a,remainder), bio(l=1,b), bio(l=1,c), ...,
bio(l=0,remainder_1) ]
.recursion = []

...

process away ... until back at l=0

.incoming = [ bio(l=0,remainder_1) ]
.recursion = []

potentially split further
.incoming = [ bio(l=0,now_2), bio(l=0,remainder_2) ]
.recursion = []

rinse, repeat.

Thanks,

Lars Ellenberg




Re: [dm-devel] [RFC] block: fix blk_queue_split() resource exhaustion

2016-07-07 Thread Lars Ellenberg
On Thu, Jul 07, 2016 at 10:16:16AM +0200, Lars Ellenberg wrote:
> > > Instead, I suggest to distinguish between recursive calls to
> > > generic_make_request(), and pushing back the remainder part in
> > > blk_queue_split(), by pointing current->bio_lists to a
> > >   struct recursion_to_iteration_bio_lists {
> > >   struct bio_list recursion;
> > >   struct bio_list remainder;
> > >   }
> > >
> > > To have all bios targeted to drivers lower in the stack processed before
> > > processing the next piece of a bio targeted at the higher levels,
> > > as long as queued bios resulting from recursion are available,
> > > they will continue to be processed in FIFO order.
> > > Pushed back bio-parts resulting from blk_queue_split() will be processed
> > > in LIFO order, one-by-one, whenever the recursion list becomes empty.
> > 
> > I really like this change.  It seems to precisely address the problem.
> > The "problem" being that requests for "this" device are potentially
> > mixed up with requests from underlying devices.
> > However I'm not sure it is quite general enough.
> > 
> > The "remainder" list is a stack of requests aimed at "this" level or
> > higher, and I think it will always exactly fit that description.
> > The "recursion" list needs to be a queue of requests aimed at the next
> > level down, and that doesn't quite work, because once you start acting
> > on the first entry in that list, all the rest become "this" level.
> 
> Uhm, well,
> that's how it has been since you introduced this back in 2007, d89d879.
> And it worked.
> 
> > I think you can address this by always calling ->make_request_fn with an
> > empty "recursion", then after the call completes, splice the "recursion"
> > list that resulted (if any) on top of the "remainder" stack.
> > 
> > This way, the "remainder" stack is always "requests for lower-level
> > devices before request for upper level devices" and the "recursion"
> > queue is always "requests for devices below the current level".
> 
> Yes, I guess that would work as well,
> but may need "empirical proof" to check for performance regressions.
> 
> > I also really *don't* like the idea of punting to a separate thread - it
> > seems to be just delaying the problem.
> > 
> > Can you try move the bio_list_init(->recursion) call to just before
> > the ->make_request_fn() call, and adding
> > bio_list_merge_head(->remainder, ->recursion)
> > just after?
> > (or something like that) and confirm it makes sense, and works?
> 
> Sure, will do.

Attached,
on top of the patch of my initial post.
Also fixes the issue for me.

> I'd suggest this would be a patch on its own though, on top of this one.
> Because it would change the order in which stacked bios are processed
> wrt the way it used to be since 2007 (my suggestion as is does not).
> 
> Which may change performance metrics.
> It may even improve some of them,
> or maybe it does nothing, but we don't know.
> 
> Thanks,
> 
> Lars
> 
>From 73254eae63786aca0af10e42e5b41465c90d8da8 Mon Sep 17 00:00:00 2001
From: Lars Ellenberg <lars.ellenb...@linbit.com>
Date: Thu, 7 Jul 2016 11:03:30 +0200
Subject: [PATCH] block: generic_make_request() recursive bios: process deepest
 levels first

By providing each q->make_request_fn() with an empty "recursion"
bio_list, then merging any recursively submitted bios to the
head of the "remainder" list, we can make the recursion-to-iteration
logic in generic_make_request() process deepest level bios first.

---

As suggested by Neil Brown while discussing
[RFC] block: fix blk_queue_split() resource exhaustion
https://lkml.org/lkml/2016/7/7/27

Stack: qA -> qB -> qC -> qD

=== Without this patch:

generic_make_request(bio_orig to qA)

	recursion: empty, remainder: empty
qA->make_request_fn(bio_orig)
	potential call to bio_queue_split()
	result: bio_S, bio_R
	recursion: empty, remainder: bio_R
	bio_S
	generic_make_request(bio_S to qB)
	recursion: bio_S, remainder: bio_R
<- return
pop:	recursion: empty, remainder: bio_R
qB->make_request_fn(bio_S)
	remap, maybe many clones because of striping
	generic_make_request(clones to qC)
	recursion: bio_C1, bio_C2, bio_C3
	remainder: bio_R
<- return
pop:	recursion: bio_C2, bio_C3,
	remainder: bio_R
qC->make_request_fn(bio_C1)
	remap, ...
	generic_make_request(clones to qD)
	recursion: bio_C2, bio_C3, bio_D1_1, bio_D1_2
	remainder: bio_R
<- return
pop:	recursion: bio_C3, bio_D1_1, bio_D1_2
	remainder: bio_R


Re: [RFC] block: fix blk_queue_split() resource exhaustion

2016-07-07 Thread Lars Ellenberg
On Wed, Jul 06, 2016 at 11:57:51PM +0800, Ming Lei wrote:
> >  my suggestion
> >
> > generic_make_request(bio_orig)
> > NULLin-flight=0
> > bio_origempty   in-flight=0
> > qA->make_request_fn(bio_orig)
> >   blk_queue_split()
> >   result:
> >   bio_s, and bio_r stuffed away to head of remainder list.
> > in-flight=1
> >   bio_c = bio_clone(bio_s)
> >   generic_make_request(bio_c to qB)
> > bio_c
> > <-return
> > bio_c
> >   bio_list_pop()
> > empty
> > qB->make_request_fn(bio_c)
> >   (Assume it does not clone, but only remap.
> >   But it may also be a striping layer,
> >   and queue more than one bio here.)
> >   generic_make_request(bio_c to qC)
> > bio_c
> > <-return
> >   bio_list_pop()
> > empty
> > qC->make_request_fn(bio_c)
> >   generic_make_request(bio_c to qD)
> > bio_c
> > <-return
> >   bio_list_pop()
> > empty
> > qD->make_request_fn(bio_c)
> > dispatches to hardware
> > <-return
> > empty
> >bio_list_pop()
> >NULL, great, let's pop from remainder list
> > qA->make_request_fn(bio_r)  in-flight=?
> >
> > May block, but only until completion of bio_c.
> > Which may already have happened.
> >
> > *makes progress*
> 
> I admit your solution is smart, but it isn't easy to prove it as correct
> in theory.  But if the traversal can be mapped into pre-order traversal
> of the above binary tree, it may be correct.

What are you talking about.
There is no tree.
There is a single fifo.
And I suggest to make that one fifo, and one lifo instead.

  |<-- original bio ->|
  |piece|remainder|
  
  |piece| is then processed, just as it was before,
  all recursive submissions turned into iterative processing,
  in the exact order they have been called recursively.
  Until all deeper level submissions have been fully processed.
  
  If deeper levels are again calling bio_queue_split, their
  respective remainder are queued in front of the "top level"
  remainder.
  
  And only then, the remainders are processed,
  just as if they did come in as "original bio", see above.

So if it did make progress before,
it will make progress now.

Lars




Re: [dm-devel] [RFC] block: fix blk_queue_split() resource exhaustion

2016-07-07 Thread Lars Ellenberg
On Thu, Jul 07, 2016 at 03:35:48PM +1000, NeilBrown wrote:
> On Wed, Jun 22 2016, Lars Ellenberg wrote:
> 
> > For a long time, generic_make_request() converts recursion into
> > iteration by queuing recursive arguments on current->bio_list.
> >
> > This is convenient for stacking drivers,
> > the top-most driver would take the originally submitted bio,
> > and re-submit a re-mapped version of it, or one or more clones,
> > or one or more new allocated bios to its backend(s). Which
> > are then simply processed in turn, and each can again queue
> > more "backend-bios" until we reach the bottom of the driver stack,
> > and actually dispatch to the real backend device.
> >
> > Any stacking driver ->make_request_fn() could expect that,
> > once it returns, any backend-bios it submitted via recursive calls
> > to generic_make_request() would now be processed and dispatched, before
> > the current task would call into this driver again.
> >
> > This is changed by commit
> >   54efd50 block: make generic_make_request handle arbitrarily sized bios
> >
> > Drivers may call blk_queue_split() inside their ->make_request_fn(),
> > which may split the current bio into a front-part to be dealt with
> > immediately, and a remainder-part, which may need to be split even
> > further. That remainder-part will simply also be pushed to
> > current->bio_list, and would end up being head-of-queue, in front
> > of any backend-bios the current make_request_fn() might submit during
> > processing of the front-part.
> >
> > Which means the current task would immediately end up back in the same
> > make_request_fn() of the same driver again, before any of its backend
> > bios have even been processed.
> >
> > This can lead to resource starvation deadlock.
> > Drivers could avoid this by learning to not need blk_queue_split(),
> > or by submitting their backend bios in a different context (dedicated
> > kernel thread, work_queue context, ...). Or by playing funny re-ordering
> > games with entries on current->bio_list.
> >
> > Instead, I suggest to distinguish between recursive calls to
> > generic_make_request(), and pushing back the remainder part in
> > blk_queue_split(), by pointing current->bio_lists to a
> > struct recursion_to_iteration_bio_lists {
> > struct bio_list recursion;
> > struct bio_list remainder;
> > }
> >
> > To have all bios targeted to drivers lower in the stack processed before
> > processing the next piece of a bio targeted at the higher levels,
> > as long as queued bios resulting from recursion are available,
> > they will continue to be processed in FIFO order.
> > Pushed back bio-parts resulting from blk_queue_split() will be processed
> > in LIFO order, one-by-one, whenever the recursion list becomes empty.
> 
> I really like this change.  It seems to precisely address the problem.
> The "problem" being that requests for "this" device are potentially
> mixed up with requests from underlying devices.
> However I'm not sure it is quite general enough.
> 
> The "remainder" list is a stack of requests aimed at "this" level or
> higher, and I think it will always exactly fit that description.
> The "recursion" list needs to be a queue of requests aimed at the next
> level down, and that doesn't quite work, because once you start acting
> on the first entry in that list, all the rest become "this" level.

Uhm, well,
that's how it has been since you introduced this back in 2007, d89d879.
And it worked.

> I think you can address this by always calling ->make_request_fn with an
> empty "recursion", then after the call completes, splice the "recursion"
> list that resulted (if any) on top of the "remainder" stack.
> 
> This way, the "remainder" stack is always "requests for lower-level
> devices before request for upper level devices" and the "recursion"
> queue is always "requests for devices below the current level".

Yes, I guess that would work as well,
but may need "empirical proof" to check for performance regressions.

> I also really *don't* like the idea of punting to a separate thread - it
> seems to be just delaying the problem.
> 
> Can you try move the bio_list_init(->recursion) call to just before
> the ->make_request_fn() call, and adding
> bio_list_merge_head(->remainder, ->recursion)
> just after?
> (or something like that) and confirm it makes sense, and works?

Sure, will do.
I'd suggest this would be a patch on its own though, on top of this one.
Because it would change the order in which stacked bios are processed
wrt the way it used to be since 2007 (my suggestion as is does not).

Which may change performance metrics.
It may even improve some of them,
or maybe it does nothing, but we don't know.

Thanks,

Lars




Re: [RFC] block: fix blk_queue_split() resource exhaustion

2016-07-06 Thread Lars Ellenberg
On Mon, Jul 04, 2016 at 06:47:29PM +0800, Ming Lei wrote:
> >> One clean solution may be to convert the loop of generic_make_request()
> >> into the following way:
> >>
> >> do {
> >> struct bio *splitted, *remainder = NULL;
> >> struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> >>
> >> blk_queue_split(q, , , q->bio_split);
> >>
> >> ret = q->make_request_fn(q, bio);
> >>
> >> if (remainder)
> >>  bio_list_add(current->bio_list, remainder);
> >> bio = bio_list_pop(current->bio_list);
> >> } while (bio)
> >
> > Not good enough.


> > Goal was to first process all "deeper level" bios
> > before processing the remainder.
> 
> For the reported bio splitting issue, I think the goal is to make sure all
> BIOs generated from 'bio' inside .make_request_fn(bio) are queued
> before the 'current' remainder. Cause it is the issue introduced by
> blk_queue_split().

Stacking:
qA (limiting layer)
-> qB (remapping)
-> qC (remapping)
-> qD (hardware)

[in fact you don't even need four layers,
 this is just to clarify that the stack may be more complex than you
 assume it would be]

Columns below:
1) argument to generic_make_request() and its target queue.
2) current->bio_list
3) "in-flight" counter of qA.

 In your new picture, iiuc:

generic_make_request(bio_orig)
NULLin-flight=0
bio_origempty
  blk_queue_split()
  result:
  bio_s, bio_r  
qA->make_request_fn(bio_s)
in-flight=1
  bio_c = bio_clone(bio_s)
  generic_make_request(bio_c to qB)
bio_c
<-return
  bio_list_add(bio_r)
bio_c, bio_r
  bio_list_pop()
bio_r
qB->make_request_fn(bio_c)
  (Assume it does not clone, but only remap.
  But it may also be a striping layer,
  and queue more than one bio here.)
  generic_make_request(bio_c to qC)
bio_r, bio_c
<-return
  bio_list_pop()
bio_c
qA->make_request_fn(bio_r)  in-flight still 1

BLOCKS, because bio_c has not been processed to its final
destination qD yet, and not dispatched to hardware.


 my suggestion

generic_make_request(bio_orig)
NULLin-flight=0
bio_origempty   in-flight=0
qA->make_request_fn(bio_orig)
  blk_queue_split()
  result:
  bio_s, and bio_r stuffed away to head of remainder list.
in-flight=1
  bio_c = bio_clone(bio_s)
  generic_make_request(bio_c to qB)
bio_c
<-return
bio_c
  bio_list_pop()
empty
qB->make_request_fn(bio_c)
  (Assume it does not clone, but only remap.
  But it may also be a striping layer,
  and queue more than one bio here.)
  generic_make_request(bio_c to qC)
bio_c
<-return
  bio_list_pop()
empty
qC->make_request_fn(bio_c)
  generic_make_request(bio_c to qD)
bio_c
<-return
  bio_list_pop()
empty
qD->make_request_fn(bio_c)
dispatches to hardware
<-return
empty
   bio_list_pop()
   NULL, great, let's pop from remainder list
qA->make_request_fn(bio_r)  in-flight=?

May block, but only until completion of bio_c.
Which may already have happened.

*makes progress*

Thanks,

Lars




Re: [RFC] block: fix blk_queue_split() resource exhaustion

2016-07-04 Thread Lars Ellenberg
On Sat, Jul 02, 2016 at 06:28:29PM +0800, Ming Lei wrote:
> The idea looks good, but not sure it can cover all cases of
> dm over drbd or drbd over dm and maintaining two lists
> becomes too complicated.
> 
> One clean solution may be to convert the loop of generic_make_request()
> into the following way:
> 
> do {
> struct bio *splitted, *remainder = NULL;
> struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> 
> blk_queue_split(q, , , q->bio_split);
> 
> ret = q->make_request_fn(q, bio);
> 
> if (remainder)
>  bio_list_add(current->bio_list, remainder);
> bio = bio_list_pop(current->bio_list);
> } while (bio)

Not good enough.

Consider DRBD on device-mapper on device-mapper on scsi,
or insert multipath and / or md raid into the stack.
The iterative call to q->make_request_fn() in the second iteration
may queue bios after the remainder.

Goal was to first process all "deeper level" bios
before processing the remainder.

You can achieve this by doing last-in-first-out on bio_list,
or by using two lists, as I suggested in the original post.

Lars




Re: [RFC] block: fix blk_queue_split() resource exhaustion

2016-06-28 Thread Lars Ellenberg
On Sat, Jun 25, 2016 at 05:30:29PM +0800, Ming Lei wrote:
> On Fri, Jun 24, 2016 at 10:27 PM, Lars Ellenberg
> <lars.ellenb...@linbit.com> wrote:
> > On Fri, Jun 24, 2016 at 07:36:57PM +0800, Ming Lei wrote:
> >> >
> >> > This is not a theoretical problem.
> >> > At least int DRBD, and an unfortunately high IO concurrency wrt. the
> >> > "max-buffers" setting, without this patch we have a reproducible 
> >> > deadlock.
> >>
> >> Is there any log about the deadlock? And is there any lockdep warning
> >> if it is enabled?
> >
> > In DRBD, to avoid potentially very long internal queues as we wait for
> > our replication peer device and local backend, we limit the number of
> > in-flight bios we accept, and block in our ->make_request_fn() if that
> > number exceeds a configured watermark ("max-buffers").
> >
> > Works fine, as long as we could assume that once our make_request_fn()
> > returns, any bios we "recursively" submitted against the local backend
> > would be dispatched. Which used to be the case.
> >
> > commit 54efd50 block: make generic_make_request handle arbitrarily sized 
> > bios
> > introduced blk_queue_split(), which will split any bio that is
> > violating the queue limits into a smaller piece to be processed
> > right away, and queue the "rest" on current->bio_list to be
> > processed by the iteration in generic_make_request() once the
> > current ->make_request_fn() returns.
> >
> > Before that, any user was supposed to go through bio_add_page(),
> > which would call our merge bvec function, and that should already
> > be sufficient to not violate queue limits.
> >
> > Previously, typically the next in line bio to be processed would
> > be the cloned one we passed down to our backend device, which in
> > case of further stacking devices (e.g. logical volume, raid1 or
> > similar) may again push further bios on that list.
> > All of which would simply be processed in turn.
> 
> Could you clarify if the issue is triggered in drbd without dm/md involved?
> Or the issue is always triggered with dm/md over drbd?
> 
> As Mike mentioned, there is one known issue with dm-snapshot.


The issue can always be triggered, even if only DRBD is involved.

> If your ->make_request_fn() is called several times, each time
> at least you should dispatch one bio returnd from blk_queue_split(),
> so I don't understand why even one bio isn't dispatched out.

I'll try to "visualize" the mechanics of "my" deadlock here.

Just to clarify the effect, assume we had a driver that
for $reasons would limit the number of in-flight IO to
one single bio.

=== before bio_queue_split()

Previously, if something would want to read some range of blocks,
it would allocate a bio, call bio_add_page() a number of times,
and once the bio was "full", call generic_make_request(),
and fill the next bio, then submit that one.

Stacking: "single_in_flight" (q1) -> "sda" (q2)

  generic_make_request(bio)           current->bio_list   in-flight
   B_orig_1                           NULL                0
    q1->make_request_fn(B_orig_1)     empty
                                                          1
      recursive call, queues:         B_1_remapped
      iterative call:
    q2->make_request_fn(B_1_remapped) empty
        -> actually dispatched to hardware
  return of generic_make_request(B_orig_1).
   B_orig_2
    q1->make_request_fn(B_orig_2)
                                                          1
      blocks, waits for in-flight to drop
      ...
      completion of B_orig_1                              0
      recursive call, queues:         B_2_remapped
      iterative call:
    q2->make_request_fn(B_2_remapped) empty
        -> actually dispatched to hardware
  return of generic_make_request(B_orig_2).


=== with bio_queue_split()

Now, upper layers build one large bio.

  generic_make_request(bio)           current->bio_list   in-flight
   B_orig                             NULL                0
    q1->make_request_fn(B_orig)       empty
      blk_queue_split(B_orig) splits into
      B_orig_s1 (processed now) and
      remainder B_orig_r1:            B_orig_r1
                                                          1
      recursive call, queues:         B_orig_r1, B_s1_remapped
      iterative call:
    q1->make_request_fn(B_orig_r1)    B_s1_remapped
      blocks, waits for in-flight to drop
      ...
      which never happens,
      because B_s1_remapped has not even been submitted to q2 yet,
      let alone dispatched to hardware.


Obviously we don't limit ourselves to just one request, but with larger
incoming bios, with the number of times we split them, with the number
of stacking layers below us, or even layers below us that *also*
call blk_queue_split (or equivalent open coded clone and split)
themselves to split even further, things get worse.

Lars
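[Editor's note: the two diagrams above can be replayed as a toy Python simulation. All names are invented to match the diagrams; this is not kernel code. The driver q1 admits one in-flight bio and "blocks" until a completion, and a completion requires the remapped bio to actually reach q2 first:]

```python
from collections import deque

def submit(split: bool):
    """Toy replay of the diagrams: q1 allows one in-flight bio and waits
    for a completion before accepting another; a completion only happens
    once the remapped bio is popped for q2 (the 'hardware').
    Returns 'done' or 'deadlock'."""
    bio_list = deque()            # stands in for current->bio_list
    if split:
        # blk_queue_split(): remainder sits ahead of the remapped clone
        bio_list.append(("q1", "B_orig_r1"))
        bio_list.append(("q2", "B_s1_remapped"))
    else:
        bio_list.append(("q2", "B_remapped"))
    in_flight = 1                 # the piece already accepted by q1
    while bio_list:
        q, name = bio_list.popleft()
        if q == "q1":
            if in_flight >= 1:
                # waits for a completion that depends on a bio still
                # sitting *behind us* on the very same list
                return "deadlock"
            in_flight += 1
            bio_list.append(("q2", name + "_remapped"))
        else:
            in_flight -= 1        # dispatched to hardware; completes
    return "done"
```

Without the split the remapped bio is next in line and completes; with the split the remainder is next in line and q1 blocks forever.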




Re: block: fix blk_queue_split() resource exhaustion

2016-06-28 Thread Lars Ellenberg
On Fri, Jun 24, 2016 at 11:15:47AM -0400, Mike Snitzer wrote:
> On Fri, Jun 24 2016 at 10:27am -0400,
> Lars Ellenberg <lars.ellenb...@linbit.com> wrote:
> 
> > On Fri, Jun 24, 2016 at 07:36:57PM +0800, Ming Lei wrote:
> > > >
> > > > This is not a theoretical problem.
> > > > At least int DRBD, and an unfortunately high IO concurrency wrt. the
> > > > "max-buffers" setting, without this patch we have a reproducible 
> > > > deadlock.
> > > 
> > > Is there any log about the deadlock? And is there any lockdep warning
> > > if it is enabled?
> > 
> > In DRBD, to avoid potentially very long internal queues as we wait for
> > our replication peer device and local backend, we limit the number of
> > in-flight bios we accept, and block in our ->make_request_fn() if that
> > number exceeds a configured watermark ("max-buffers").
> > 
> > Works fine, as long as we could assume that once our make_request_fn()
> > returns, any bios we "recursively" submitted against the local backend
> > would be dispatched. Which used to be the case.
> 
> It'd be useful to know whether this patch fixes your issue:
> https://patchwork.kernel.org/patch/7398411/

I would assume so: if current is blocked for any reason,
it will dispatch all bios that are still on current->bio_list
to be processed from other contexts.

Which means we will not deadlock, but make progress,
if the unblock of current depends on processing of those bios.

Also see my other mail on the issue,
where I try to better explain the mechanics of "my" deadlock.

Lars
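[Editor's note: the mechanism referenced here — hand off everything still on the submitter's list to another context before it blocks — can be sketched in a few lines of Python. Names are invented; this is only an illustration of the idea, not the referenced patch:]

```python
import queue
import threading

def punt_to_rescuer(pending):
    """Sketch of the 'punt' idea: before the current task blocks, every
    bio still on its list is handed to a rescuer thread, so progress no
    longer depends on the (soon to be blocked) current task."""
    done = []
    q = queue.Queue()

    def rescuer():
        # 'dispatch' bios from another context
        while True:
            bio = q.get()
            if bio is None:      # sentinel: nothing left to punt
                break
            done.append(bio)

    t = threading.Thread(target=rescuer)
    t.start()
    for bio in pending:          # drain current->bio_list ...
        q.put(bio)
    q.put(None)
    t.join()                     # ... so waiting on them cannot deadlock
    return done
```
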




Re: [RFC] block: fix blk_queue_split() resource exhaustion

2016-06-24 Thread Lars Ellenberg
On Fri, Jun 24, 2016 at 07:36:57PM +0800, Ming Lei wrote:
> >
> > This is not a theoretical problem.
> > At least int DRBD, and an unfortunately high IO concurrency wrt. the
> > "max-buffers" setting, without this patch we have a reproducible deadlock.
> 
> Is there any log about the deadlock? And is there any lockdep warning
> if it is enabled?

In DRBD, to avoid potentially very long internal queues as we wait for
our replication peer device and local backend, we limit the number of
in-flight bios we accept, and block in our ->make_request_fn() if that
number exceeds a configured watermark ("max-buffers").

Works fine, as long as we could assume that once our make_request_fn()
returns, any bios we "recursively" submitted against the local backend
would be dispatched. Which used to be the case.

commit 54efd50 block: make generic_make_request handle arbitrarily sized bios
introduced blk_queue_split(), which will split any bio that is
violating the queue limits into a smaller piece to be processed
right away, and queue the "rest" on current->bio_list to be
processed by the iteration in generic_make_request() once the
current ->make_request_fn() returns.

Before that, any user was supposed to go through bio_add_page(),
which would call our merge bvec function, and that should already
be sufficient to not violate queue limits.

Previously, typically the next in line bio to be processed would
be the cloned one we passed down to our backend device, which in
case of further stacking devices (e.g. logical volume, raid1 or
similar) may again push further bios on that list.
All of which would simply be processed in turn.

Now, with blk_queue_split(), the next-in-line bio would be the
next piece of the "too big" original bio, so we end up calling
several times into our ->make_request_fn() without even
dispatching one bio to the backend.

With highly concurrent IO and/or very small "max-buffers" setting,
this can deadlock, where the current submit path would wait for
the completion of the bios supposedly already passed to the
backend, but which actually are not even dispatched yet, rotting
away on current->bio_list.

I could code around that in various ways inside the DRBD make_request_fn,
or always dispatch the backend bios to a different (thread or work_queue)
context, but I'd rather have the distinction between "recursively
submitted" bios and "pushed back part of originally passed in bio" as
implemented in this patch.

Thanks,

Lars

> > I'm unsure if it could cause problems in md raid in some corner cases
> > or when a rebuild/scrub/... starts in a "bad" moment.
> >
> > It also may increase "stress" on any involved bio_set,
> > because it may increase the number of bios that are allocated
> > before the first of them is actually processed, causing more
> > frequent calls of punt_bios_to_rescuer().
> >
> > Alternatively,
> > a simple one-line change to generic_make_request() would also "fix" it,
> > by processing all recursive bios in LIFO order.
> > But that may change the order in which bios reach the "physical" layer,
> > in case they are in fact split into many smaller pieces, for example in
> > a striping setup, which may have other performance implications.
> >
> >  | diff --git a/block/blk-core.c b/block/blk-core.c
> >  | index 2475b1c7..a5623f6 100644
> >  | --- a/block/blk-core.c
> >  | +++ b/block/blk-core.c
> >  | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio)
> >  |   * should be added at the tail
> >  |   */
> >  |  if (current->bio_list) {
> >  | -bio_list_add(current->bio_list, bio);
> >  | +bio_list_add_head(current->bio_list, bio);
> >  |  goto out;
> >  |  }

> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 2475b1c7..f03ff4c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -2031,7 +2031,7 @@ end_io:
> >   */
> >  blk_qc_t generic_make_request(struct bio *bio)
> >  {
> > -   struct bio_list bio_list_on_stack;
> > +   struct recursion_to_iteration_bio_lists bio_lists_on_stack;
> > blk_qc_t ret = BLK_QC_T_NONE;
> >
> > if (!generic_make_request_checks(bio))
> > @@ -2040,15 +2040,18 @@ blk_qc_t generic_make_request(struct bio *bio)

> > -   if (current->bio_list) {
> > -   bio_list_add(current->bio_list, bio);
> > +   if (current->bio_lists) {
> > +   bio_list_add(>bio_lists->recursion, bio);
> > goto out;
> > }
> >

> > @@ -2067,8 +2070,9 @@ blk_qc_t generic_make_request(struct bio *bio)
> >  * bio_list, and call into ->make_request() again.
> >  */
> > BUG_ON(bio->bi_next);
> > -   bio_list_init(_list_on_stack);
> > -   current->bio_list = _list_on_stack;
> > +   bio_list_init(_lists_on_stack.recursion);
> > +   bio_list_init(_lists_on_stack.remainder);
> > +   current->bio_lists = _lists_on_stack;
> > do {
> > struct request_queue *q = 

[RFC] block: fix blk_queue_split() resource exhaustion

2016-06-22 Thread Lars Ellenberg
For a long time, generic_make_request() converts recursion into
iteration by queuing recursive arguments on current->bio_list.

This is convenient for stacking drivers,
the top-most driver would take the originally submitted bio,
and re-submit a re-mapped version of it, or one or more clones,
or one or more new allocated bios to its backend(s). Which
are then simply processed in turn, and each can again queue
more "backend-bios" until we reach the bottom of the driver stack,
and actually dispatch to the real backend device.

Any stacking driver ->make_request_fn() could expect that,
once it returns, any backend-bios it submitted via recursive calls
to generic_make_request() would now be processed and dispatched, before
the current task would call into this driver again.

This is changed by commit
  54efd50 block: make generic_make_request handle arbitrarily sized bios

Drivers may call blk_queue_split() inside their ->make_request_fn(),
which may split the current bio into a front-part to be dealt with
immediately, and a remainder-part, which may need to be split even
further. That remainder-part will simply also be pushed to
current->bio_list, and would end up being head-of-queue, in front
of any backend-bios the current make_request_fn() might submit during
processing of the front-part.

Which means the current task would immediately end up back in the same
make_request_fn() of the same driver again, before any of its backend
bios have even been processed.

This can lead to resource starvation deadlock.
Drivers could avoid this by learning to not need blk_queue_split(),
or by submitting their backend bios in a different context (dedicated
kernel thread, work_queue context, ...). Or by playing funny re-ordering
games with entries on current->bio_list.

Instead, I suggest to distinguish between recursive calls to
generic_make_request(), and pushing back the remainder part in
blk_queue_split(), by pointing current->bio_lists to a
struct recursion_to_iteration_bio_lists {
struct bio_list recursion;
struct bio_list remainder;
}

To have all bios targeted to drivers lower in the stack processed before
processing the next piece of a bio targeted at the higher levels,
as long as queued bios resulting from recursion are available,
they will continue to be processed in FIFO order.
Pushed back bio-parts resulting from blk_queue_split() will be processed
in LIFO order, one-by-one, whenever the recursion list becomes empty.
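[Editor's note: the two-list discipline just described can be sketched as a small Python data structure — a toy model of the proposed recursion_to_iteration_bio_lists, not the patch itself:]

```python
from collections import deque

class BioLists:
    """Sketch of the proposed two-list scheme: 'recursion' drains in
    FIFO order first; a 'remainder' piece is popped (LIFO) only when
    no recursively submitted bios are pending."""
    def __init__(self):
        self.recursion = deque()    # bios submitted via recursion
        self.remainder = deque()    # pushed-back split remainders

    def add_recursion(self, bio):
        self.recursion.append(bio)        # FIFO: add at the tail

    def add_remainder(self, bio):
        self.remainder.appendleft(bio)    # LIFO: add at the head

    def pop(self):
        if self.recursion:
            return self.recursion.popleft()   # deeper-level bios first
        if self.remainder:
            return self.remainder.popleft()   # then one remainder piece
        return None
```

So even if a remainder was pushed back before the backend bios were queued, every recursive (deeper-level) bio is processed before the next remainder piece.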

Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com>
Signed-off-by: Roland Kammerer <roland.kamme...@linbit.com>
---

This is not a theoretical problem.
In DRBD at least, with an unfortunately high IO concurrency wrt. the
"max-buffers" setting, we have a reproducible deadlock without this patch.

I'm unsure if it could cause problems in md raid in some corner cases
or when a rebuild/scrub/... starts in a "bad" moment.

It also may increase "stress" on any involved bio_set,
because it may increase the number of bios that are allocated
before the first of them is actually processed, causing more
frequent calls of punt_bios_to_rescuer().

Alternatively,
a simple one-line change to generic_make_request() would also "fix" it,
by processing all recursive bios in LIFO order.
But that may change the order in which bios reach the "physical" layer,
in case they are in fact split into many smaller pieces, for example in
a striping setup, which may have other performance implications.

 | diff --git a/block/blk-core.c b/block/blk-core.c
 | index 2475b1c7..a5623f6 100644
 | --- a/block/blk-core.c
 | +++ b/block/blk-core.c
 | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 |   * should be added at the tail
 |   */
 |  if (current->bio_list) {
 | -bio_list_add(current->bio_list, bio);
 | +bio_list_add_head(current->bio_list, bio);
 |  goto out;
 |  }

Any fix would need to go to stable 4.3.x and up.
This patch does apply as is to 4.5 and up,
with a trivial fixup (or a cherry-pick of cda2264) to 4.4,
and with some more (also trivial) fixup to 4.3 because of
GFP_WAIT -> GFP_RECLAIM and comment rewrap.

Any input?
How should I proceed?

---
 block/bio.c   | 14 +++---
 block/blk-core.c  | 32 +---
 block/blk-merge.c |  3 ++-
 drivers/md/bcache/btree.c | 12 ++--
 drivers/md/dm-bufio.c |  2 +-
 drivers/md/md.h   |  7 +++
 drivers/md/raid1.c|  5 ++---
 drivers/md/raid10.c   |  5 ++---
 include/linux/bio.h   | 11 +++
 include/linux/sched.h |  4 ++--
 10 files changed, 57 insertions(+), 38 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 0e4aa42..2ffcea0 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -368,10 +368,10 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
  

Re: [Drbd-dev] [PATCH net-next v3] block/drbd: align properly u64 in nl messages

2016-05-10 Thread Lars Ellenberg
On Tue, May 10, 2016 at 11:39:49AM -0400, David Miller wrote:
> From: Lars Ellenberg <lars.ellenb...@linbit.com>
> Date: Tue, 10 May 2016 11:40:23 +0200

excuse me for reordering the original:

> Anyways, back to the topic, can you please just relent and come to
> some kind of agreement about the fix for this alignment bug?

I thought we did?  I'm fine with the "v3",
it even carries my signed-of-by.

Whether or not Nicholas wants to prefix those headers with drbd_,
I don't really care.

> This is taking a very long time and patches are just rotting in
> patchwork with no resolution.  Why would 

Nicholas asked how to go about DRBD,
I suggested to use 0 as a padding attribute,
and after taking a detour, he did. All good.


Rest of original:

> > If we introduce a new config option,
> > we have to add it to the config scanner (one line),
> > define min, max, default and scale (four short defines),
> > and add it to the netlink definition here (one line).
> > Done, rest of the code is generated,
> > both on the kernel side,
> > and on the drbd-utils side used to talk to the kernel.
> > We found that to be very convenient.
> 
> But it entirely misses the core design point of netlink.
> 
> Sender and receiver _DO NOT_ need to coordinate at all.  That's the
> whole point.  So tightly coupling such coordination is going to run
> you into all kinds of problems.
> 
> When implemented properly, the sender can emit whatever attributes it
> knows about and can generate, and the receiver scans the attributes one
> by one and picks out the ones it understands and processes them.
> 
> If you go against this model
> then you have no clean way to

We don't.
We extend (not violate) that model, so the sender *may* indicate
to the recipient that for some particular attribute, the sender would
rather have an "I don't understand this" return than a silent ignore.
And that we can indicate in the definition of the attributes which ones
are required to make a message meaningful.

> extend things whilst allowing existing software to continue working.

*that* is exactly why we use netlink,
and why we do things with it the way we do.
Actually I think what we are doing there is, comparatively, "elegant".
You obviously don't have to agree.

I could discuss this in more detail,
but I assume you are not really interested,
at least not here and now.

Thanks,

Lars



Re: [Drbd-dev] [PATCH net-next v3] block/drbd: align properly u64 in nl messages

2016-05-10 Thread Lars Ellenberg
On Tue, May 10, 2016 at 11:09:53AM +0200, Nicolas Dichtel wrote:
> Le 09/05/2016 15:15, Lars Ellenberg a écrit :
> > On Mon, May 09, 2016 at 11:40:20AM +0200, Nicolas Dichtel wrote:
> [snip]
> >> Maybe prefixing genl_magic_func.h and genl_magic_struct.h by 'drbd_'
> >> could be interesting so that new module won't use it. What is your
> >> opinion?
> > 
> > This was supposed to not be DRBD specific.  But it might even still
> > need some massaging before it was truly generic. And obviously,
> > it does not meet the taste of genetlink folks, to say the least :(
> Yes, this file is not generic and netlink APIs are never defined like this.
> These tons of macros overcomplicate the code too much. It's overengineering for
> what purpose?

If we introduce a new config option,
we have to add it to the config scanner (one line),
define min, max, default and scale (four short defines),
and add it to the netlink definition here (one line).
Done, rest of the code is generated,
both on the kernel side,
and on the drbd-utils side used to talk to the kernel.
We found that to be very convenient.

> Small examples:
>  - the drbd netlink API is not exported via uapi (I wonder how apps using this
> API get it)

There used to be a time where there was no "uapi".
(I wonder how apps ever worked back then).

>  - v2 of the patch is nacked because adding a new attribute may break existing

No.

But because the "new" attributes you chose were not actually new,
but already in use (though not yet merged back into mainline).
(Which you did not realize, and had no obvious way of knowing.
 Could have been fixed.)

And because your patch introduced useless new members to the structs.
(Could also have been fixed).

And because I did not see any point in defining that many new "padding
attributes" for no reason, where the obvious (to me) choice was to use 0,
and you did not even try to explain why that would have been a bad choice.

> apps (in networking code, a lot of new attributes are added in each
> version)


>  - it's not possible to grep to show the definition of an attribute ('git grep
> -w T_bits_total' returns only 1 line)

Opencoded, it would return 2.

 ;-)

Is this going somewhere?

Cheers,

Lars



Re: [PATCH net-next v3] block/drbd: align properly u64 in nl messages

2016-05-09 Thread Lars Ellenberg
On Mon, May 09, 2016 at 11:40:20AM +0200, Nicolas Dichtel wrote:
> The attribute 0 is never used in drbd, so let's use it as pad attribute
> in netlink messages. This minimizes the patch.
> 
> Note that this patch is only compile-tested.
> 
> Signed-off-by: Nicolas Dichtel <nicolas.dich...@6wind.com>
> Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com>
> ---
> 
> v2 -> v3:
>   use 0 as padattr instead of adding new attributes

Thanks.

> v1 -> v2:
>  rework the patch to handle all cases
> 
> Maybe prefixing genl_magic_func.h and genl_magic_struct.h by 'drbd_'
> could be interesting so that new module won't use it. What is your
> opinion?

This was supposed to not be DRBD specific.  But it might even still
need some massaging before it was truly generic. And obviously,
it does not meet the taste of genetlink folks, to say the least :(

I don't care either way.

Lars Ellenberg



Re: [Drbd-dev] [PATCH net-next v2] block/drbd: use nla_put_u64_64bit()

2016-05-04 Thread Lars Ellenberg
On Wed, May 04, 2016 at 02:49:00PM +0200, Nicolas Dichtel wrote:
> Le 04/05/2016 11:05, Lars Ellenberg a écrit :
> [snip]
> > We don't have an "alignment problem" there, btw.
> > Last time I checked, we did work fine without this alignment magic,
> > we already take care of that, yes, even on affected architectures.
> The code adds several consecutive u64 attributes. The nl attribute header is 4
> bytes, thus the full attribute length is 12 bytes. If the first u64 is aligned
> on 8 (nla_data()), the next one is not aligned on 8: it starts 12 bytes
> (8 (u64) + 4 (nl attr hdr)) after the previous u64.

Yes.  Which in our case is not a problem.
But I don't object to the padding per se,
if that is how things "should be".

I try to understand why you so much object to using 0 as pad.

Lars


