Re: [PATCH] drbd: do not ignore signals in threads

2019-08-12 Thread Philipp Reisner
Hi David,

[...]
> While our code is 'out of tree' (you really don't want it - and since
> it still uses force_sig() it is fine) I suspect that the 'drbd' code
> (with Christoph's allow_signal() patch) now loops in kernel if a user
> sends it a signal.

I am not asking for that out-of-tree code. But you are welcome to learn
from the drbd code that is in the upstream kernel.
It does not loop if root sends a signal; it receives the signal and ignores it.

> If the driver (eg drbd) is using (say) SIGINT to break a thread out of
> (say) a blocking kernel_accept() call then it can detect the unexpected
> signal (maybe double-checking with signal_pending()) but I don't think
> it can clear down the pending signal so that kernel_accept() blocks
> again.

You do that with flush_signals(current)

What we do is, somewhere in the main loop:

	if (signal_pending(current)) {
		flush_signals(current);
		if (!terminate_condition()) {
			warn(connection, "Ignoring an unexpected signal\n");
			continue;
		}
		break;
	}

-- 
LINBIT | Keeping The Digital World Running

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.





Re: [PATCH] drbd: do not ignore signals in threads

2019-08-12 Thread Philipp Reisner
Hi Jens,

Please have a look.

With fee109901f392, Eric W. Biederman changed drbd to use send_sig()
instead of force_sig(). That was part of a tree-wide series making this
change at multiple call sites. By accident it broke drbd, since the
signals are _not_ allowed by default. That got released with v5.2.

On July 29, Christoph Böhmwalder sent a patch that adds two
allow_signal()s to fix drbd.

Then David Laight pointed out that he has code that cannot deal
with send_sig() instead of force_sig(), because allowed signals
can be sent from user-space as well.
I assume that David is referring to out-of-tree code, so I fear it
is up to him to fix that to work with upstream, or to initiate a
revert of Eric's change.

Jens, please consider sending Christoph's patch to Linus for merge in
this cycle, or let us know how you think we should proceed.

best regards,
 Phil

Am Montag, 5. August 2019, 11:41:06 CEST schrieb David Laight:
> From: Christoph Böhmwalder
> 
> > Sent: 05 August 2019 10:33
> > 
> > On 29.07.19 10:50, David Laight wrote:
> > 
> > > Doesn't unmasking the signals and using send_sig() instead  of
> > > force_sig()
> > > have the (probably unwanted) side effect of allowing userspace to send
> > > the signal?
> > 
> > 
> > I have run some tests, and it does look like it is now possible to send
> > signals to the DRBD kthread from userspace. However, ...
> > 
> > 
> > > I've certainly got some driver code that uses force_sig() on a kthread
> > > that it doesn't (ever) want userspace to signal.
> > 
> > 
> > ... we don't feel that it is absolutely necessary for userspace to be
> > unable to send a signal to our kthreads. This is because the DRBD thread
> > independently checks its own state, and (for example) only exits as a
> > result of a signal if its thread state was already "EXITING" to begin
> > with.
> 
> 
> It must 'clear' the signal - otherwise it won't block again.
> 
> I've also got this horrid code fragment:
> 
> init_waitqueue_entry(&w, current);
> 
> /* Tell scheduler we are going to sleep... */
> if (signal_pending(current) && !interruptible)
> /* We don't want waking immediately (again) */
> sleep_state = TASK_UNINTERRUPTIBLE;
> else
> sleep_state = TASK_INTERRUPTIBLE;
> set_current_state(sleep_state);
> 
> /* Connect to condition variable ... */
> add_wait_queue(cvp, &w);
> mutex_unlock(mtxp); /* Release mutex */
> 
> where we want to sleep TASK_UNINTERRUPTIBLE but that f*cks up the 'load
> average', so sleep TASK_INTERRUPTIBLE unless there is a signal pending
> that we want to ignore.
> 
>   David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes,
> MK1 1PT, UK
> Registration No: 1397386 (Wales)


-- 
LINBIT | Keeping The Digital World Running

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.





[PATCH 01/17] drbd: introduce drbd_recv_header_maybe_unplug

2017-08-28 Thread Philipp Reisner
From: Lars Ellenberg 

Recently, drbd_recv_header() was changed to potentially
implicitly "unplug" the backend device(s), in case there
is currently nothing to receive.

Be more explicit about it: re-introduce the original drbd_recv_header(),
and introduce a new drbd_recv_header_maybe_unplug() for use by the
receiver "main loop".

Using explicit plugging via blk_start_plug(); blk_finish_plug();
really helps the io-scheduler of the backend with merging requests.

Wrap the receiver "main loop" with such a plug.
Also catch unplug events on the Primary,
and try to propagate.

This is performance relevant.  Without this, if the receiving side does
not merge requests, the number of IOPS on the peer can be significantly
higher than the IOPS on the Primary, and can easily become the bottleneck.

Together, both changes should help to reduce the number of IOPS
as seen on the backend of the receiving side, by increasing
the chance of merging mergable requests, without trading latency
for more throughput.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_int.h  |  5 +++-
 drivers/block/drbd/drbd_main.c | 13 +
 drivers/block/drbd/drbd_receiver.c | 47 +---
 drivers/block/drbd/drbd_req.c  | 55 ++
 drivers/block/drbd/drbd_req.h  |  6 +
 drivers/block/drbd/drbd_worker.c   | 22 +++
 6 files changed, 139 insertions(+), 9 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 819f9d0..74a7d0b 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -745,6 +745,8 @@ struct drbd_connection {
unsigned current_tle_writes;/* writes seen within this tl epoch */
 
unsigned long last_reconnect_jif;
+   /* empty member on older kernels without blk_start_plug() */
+   struct blk_plug receiver_plug;
struct drbd_thread receiver;
struct drbd_thread worker;
struct drbd_thread ack_receiver;
@@ -1131,7 +1133,8 @@ extern void conn_send_sr_reply(struct drbd_connection 
*connection, enum drbd_sta
 extern int drbd_send_rs_deallocated(struct drbd_peer_device *, struct 
drbd_peer_request *);
 extern void drbd_backing_dev_free(struct drbd_device *device, struct 
drbd_backing_dev *ldev);
 extern void drbd_device_cleanup(struct drbd_device *device);
-void drbd_print_uuids(struct drbd_device *device, const char *text);
+extern void drbd_print_uuids(struct drbd_device *device, const char *text);
+extern void drbd_queue_unplug(struct drbd_device *device);
 
 extern void conn_md_sync(struct drbd_connection *connection);
 extern void drbd_md_write(struct drbd_device *device, void *buffer);
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index e2ed28d..a3b2ee7 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1952,6 +1952,19 @@ static void drbd_release(struct gendisk *gd, fmode_t 
mode)
mutex_unlock(&drbd_main_mutex);
 }
 
+/* need to hold resource->req_lock */
+void drbd_queue_unplug(struct drbd_device *device)
+{
+   if (device->state.pdsk >= D_INCONSISTENT && device->state.conn >= 
C_CONNECTED) {
+   D_ASSERT(device, device->state.role == R_PRIMARY);
+   if (test_and_clear_bit(UNPLUG_REMOTE, &device->flags)) {
+   drbd_queue_work_if_unqueued(
+   
&first_peer_device(device)->connection->sender_work,
+   &device->unplug_work);
+   }
+   }
+}
+
 static void drbd_set_defaults(struct drbd_device *device)
 {
/* Beware! The actual layout differs
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index ece6e5d..1b3f439 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1194,6 +1194,14 @@ static int decode_header(struct drbd_connection 
*connection, void *header, struc
return 0;
 }
 
+static void drbd_unplug_all_devices(struct drbd_connection *connection)
+{
+   if (current->plug == &connection->receiver_plug) {
+   blk_finish_plug(&connection->receiver_plug);
+   blk_start_plug(&connection->receiver_plug);
+   } /* else: maybe just schedule() ?? */
+}
+
 static int drbd_recv_header(struct drbd_connection *connection, struct 
packet_info *pi)
 {
void *buffer = connection->data.rbuf;
@@ -1209,6 +1217,36 @@ static int drbd_recv_header(struct drbd_connection 
*connection, struct packet_in
return err;
 }
 
+static int drbd_recv_header_maybe_unplug(struct drbd_connection *connection, 
struct packet_info *pi)
+{
+   void *buffer = connection->data.rbuf;
+   unsigned int size = drbd_header_size(connection);
+   int err;
+
+   err = drbd_recv_short(connection

Re: [PATCH 01/17] drbd: introduce drbd_recv_header_maybe_unplug

2017-08-28 Thread Philipp Reisner
Am Freitag, 25. August 2017, 19:26:20 CEST schrieb Jens Axboe:
> On 08/24/2017 03:22 PM, Philipp Reisner wrote:
> > +#ifndef blk_queue_plugged
> > +struct drbd_plug_cb {
> > +   struct blk_plug_cb cb;
> > +   struct drbd_request *most_recent_req;
> > +   /* do we need more? */
> > +};
> 
> What is this blk_queue_plugged ifdef?

Facepalm. That escaped from our out-of-tree code, which has compat
code so that it can be compiled with older (distro) kernels. I will
resend it without that.



[PATCH 02/17] drbd: change list_for_each_safe to while(list_first_entry_or_null)

2017-08-24 Thread Philipp Reisner
From: Lars Ellenberg 

Two instances of list_for_each_entry_safe can drop their tmp element;
they really just peel each element in turn off the start of the list.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 5955ab8..85e05ee 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1485,12 +1485,12 @@ static bool prepare_al_transaction_nonblock(struct 
drbd_device *device,
struct list_head *pending,
struct list_head *later)
 {
-   struct drbd_request *req, *tmp;
+   struct drbd_request *req;
int wake = 0;
int err;
 
spin_lock_irq(&device->al_lock);
-   list_for_each_entry_safe(req, tmp, incoming, tl_requests) {
+   while ((req = list_first_entry_or_null(incoming, struct drbd_request, 
tl_requests))) {
err = drbd_al_begin_io_nonblock(device, &req->i);
if (err == -ENOBUFS)
break;
@@ -1509,9 +1509,9 @@ static bool prepare_al_transaction_nonblock(struct 
drbd_device *device,
 
 void send_and_submit_pending(struct drbd_device *device, struct list_head 
*pending)
 {
-   struct drbd_request *req, *tmp;
+   struct drbd_request *req;
 
-   list_for_each_entry_safe(req, tmp, pending, tl_requests) {
+   while ((req = list_first_entry_or_null(pending, struct drbd_request, 
tl_requests))) {
req->rq_state |= RQ_IN_ACT_LOG;
req->in_actlog_jif = jiffies;
atomic_dec(&device->ap_actlog_cnt);
-- 
2.7.4



[PATCH 03/17] drbd: add explicit plugging when submitting batches

2017-08-24 Thread Philipp Reisner
From: Lars Ellenberg 

When submitting batches of requests which had been queued on the
submitter thread, typically because they needed to wait for an
activity log transaction, use explicit plugging to help potential
merging of requests in the backend io-scheduler.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 85e05ee..2c82330 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1292,6 +1292,7 @@ static void drbd_unplug(struct blk_plug_cb *cb, bool 
from_schedule)
struct drbd_resource *resource = plug->cb.data;
struct drbd_request *req = plug->most_recent_req;
 
+   kfree(cb);
if (!req)
return;
 
@@ -1301,8 +1302,8 @@ static void drbd_unplug(struct blk_plug_cb *cb, bool 
from_schedule)
req->rq_state |= RQ_UNPLUG;
/* but also queue a generic unplug */
drbd_queue_unplug(req->device);
-   spin_unlock_irq(&resource->req_lock);
kref_put(&req->kref, drbd_req_destroy);
+   spin_unlock_irq(&resource->req_lock);
 }
 
 static struct drbd_plug_cb* drbd_check_plugged(struct drbd_resource *resource)
@@ -1343,8 +1344,6 @@ static void drbd_send_and_submit(struct drbd_device 
*device, struct drbd_request
bool no_remote = false;
bool submit_private_bio = false;
 
-   struct drbd_plug_cb *plug = drbd_check_plugged(resource);
-
spin_lock_irq(&resource->req_lock);
if (rw == WRITE) {
/* This may temporarily give up the req_lock,
@@ -1409,8 +1408,11 @@ static void drbd_send_and_submit(struct drbd_device 
*device, struct drbd_request
no_remote = true;
}
 
-   if (plug != NULL && no_remote == false)
-   drbd_update_plug(plug, req);
+   if (no_remote == false) {
+   struct drbd_plug_cb *plug = drbd_check_plugged(resource);
+   if (plug)
+   drbd_update_plug(plug, req);
+   }
 
/* If it took the fast path in drbd_request_prepare, add it here.
 * The slow path has added it already. */
@@ -1460,7 +1462,10 @@ void __drbd_make_request(struct drbd_device *device, 
struct bio *bio, unsigned l
 
 static void submit_fast_path(struct drbd_device *device, struct list_head 
*incoming)
 {
+   struct blk_plug plug;
struct drbd_request *req, *tmp;
+
+   blk_start_plug(&plug);
list_for_each_entry_safe(req, tmp, incoming, tl_requests) {
const int rw = bio_data_dir(req->master_bio);
 
@@ -1478,6 +1483,7 @@ static void submit_fast_path(struct drbd_device *device, 
struct list_head *incom
list_del_init(&req->tl_requests);
drbd_send_and_submit(device, req);
}
+   blk_finish_plug(&plug);
 }
 
 static bool prepare_al_transaction_nonblock(struct drbd_device *device,
@@ -1507,10 +1513,12 @@ static bool prepare_al_transaction_nonblock(struct 
drbd_device *device,
return !list_empty(pending);
 }
 
-void send_and_submit_pending(struct drbd_device *device, struct list_head 
*pending)
+static void send_and_submit_pending(struct drbd_device *device, struct 
list_head *pending)
 {
+   struct blk_plug plug;
struct drbd_request *req;
 
+   blk_start_plug(&plug);
while ((req = list_first_entry_or_null(pending, struct drbd_request, 
tl_requests))) {
req->rq_state |= RQ_IN_ACT_LOG;
req->in_actlog_jif = jiffies;
@@ -1518,6 +1526,7 @@ void send_and_submit_pending(struct drbd_device *device, 
struct list_head *pendi
list_del_init(&req->tl_requests);
drbd_send_and_submit(device, req);
}
+   blk_finish_plug(&plug);
 }
 
 void do_submit(struct work_struct *ws)
-- 
2.7.4



[PATCH 05/17] drbd: mark symbols static where possible

2017-08-24 Thread Philipp Reisner
From: Baoyou Xie 

We get a few warnings when building kernel with W=1:
drbd/drbd_receiver.c:1224:6: warning: no previous prototype for 
'one_flush_endio' [-Wmissing-prototypes]
drbd/drbd_req.c:1450:6: warning: no previous prototype for 
'send_and_submit_pending' [-Wmissing-prototypes]
drbd/drbd_main.c:924:6: warning: no previous prototype for 
'assign_p_sizes_qlim' [-Wmissing-prototypes]


In fact, these functions are only used in the file in which they are
declared and don't need a declaration, but can be made static.
So this patch marks these functions with 'static'.

Signed-off-by: Baoyou Xie 
Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index a3b2ee7..11f3852 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -923,7 +923,9 @@ void drbd_gen_and_send_sync_uuid(struct drbd_peer_device 
*peer_device)
 }
 
 /* communicated if (agreed_features & DRBD_FF_WSAME) */
-void assign_p_sizes_qlim(struct drbd_device *device, struct p_sizes *p, struct 
request_queue *q)
+static void
+assign_p_sizes_qlim(struct drbd_device *device, struct p_sizes *p,
+   struct request_queue *q)
 {
if (q) {
p->qlim->physical_block_size = 
cpu_to_be32(queue_physical_block_size(q));
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 1b3f439..2489667 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1261,7 +1261,7 @@ struct one_flush_context {
struct issue_flush_context *ctx;
 };
 
-void one_flush_endio(struct bio *bio)
+static void one_flush_endio(struct bio *bio)
 {
struct one_flush_context *octx = bio->bi_private;
struct drbd_device *device = octx->device;
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index 72cb0bd..e48012d 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -203,7 +203,8 @@ void drbd_peer_request_endio(struct bio *bio)
}
 }
 
-void drbd_panic_after_delayed_completion_of_aborted_request(struct drbd_device 
*device)
+static void
+drbd_panic_after_delayed_completion_of_aborted_request(struct drbd_device 
*device)
 {
panic("drbd%u %s/%u potential random memory corruption caused by 
delayed completion of aborted local request\n",
device->minor, device->resource->name, device->vnr);
-- 
2.7.4



[PATCH 01/17] drbd: introduce drbd_recv_header_maybe_unplug

2017-08-24 Thread Philipp Reisner
From: Lars Ellenberg 

Recently, drbd_recv_header() was changed to potentially
implicitly "unplug" the backend device(s), in case there
is currently nothing to receive.

Be more explicit about it: re-introduce the original drbd_recv_header(),
and introduce a new drbd_recv_header_maybe_unplug() for use by the
receiver "main loop".

Using explicit plugging via blk_start_plug(); blk_finish_plug();
really helps the io-scheduler of the backend with merging requests.

Wrap the receiver "main loop" with such a plug.
Also catch unplug events on the Primary,
and try to propagate.

This is performance relevant.  Without this, if the receiving side does
not merge requests, the number of IOPS on the peer can be significantly
higher than the IOPS on the Primary, and can easily become the bottleneck.

Together, both changes should help to reduce the number of IOPS
as seen on the backend of the receiving side, by increasing
the chance of merging mergable requests, without trading latency
for more throughput.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 819f9d0..74a7d0b 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -745,6 +745,8 @@ struct drbd_connection {
unsigned current_tle_writes;/* writes seen within this tl epoch */
 
unsigned long last_reconnect_jif;
+   /* empty member on older kernels without blk_start_plug() */
+   struct blk_plug receiver_plug;
struct drbd_thread receiver;
struct drbd_thread worker;
struct drbd_thread ack_receiver;
@@ -1131,7 +1133,8 @@ extern void conn_send_sr_reply(struct drbd_connection 
*connection, enum drbd_sta
 extern int drbd_send_rs_deallocated(struct drbd_peer_device *, struct 
drbd_peer_request *);
 extern void drbd_backing_dev_free(struct drbd_device *device, struct 
drbd_backing_dev *ldev);
 extern void drbd_device_cleanup(struct drbd_device *device);
-void drbd_print_uuids(struct drbd_device *device, const char *text);
+extern void drbd_print_uuids(struct drbd_device *device, const char *text);
+extern void drbd_queue_unplug(struct drbd_device *device);
 
 extern void conn_md_sync(struct drbd_connection *connection);
 extern void drbd_md_write(struct drbd_device *device, void *buffer);
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index e2ed28d..a3b2ee7 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1952,6 +1952,19 @@ static void drbd_release(struct gendisk *gd, fmode_t 
mode)
mutex_unlock(&drbd_main_mutex);
 }
 
+/* need to hold resource->req_lock */
+void drbd_queue_unplug(struct drbd_device *device)
+{
+   if (device->state.pdsk >= D_INCONSISTENT && device->state.conn >= 
C_CONNECTED) {
+   D_ASSERT(device, device->state.role == R_PRIMARY);
+   if (test_and_clear_bit(UNPLUG_REMOTE, &device->flags)) {
+   drbd_queue_work_if_unqueued(
+   
&first_peer_device(device)->connection->sender_work,
+   &device->unplug_work);
+   }
+   }
+}
+
 static void drbd_set_defaults(struct drbd_device *device)
 {
/* Beware! The actual layout differs
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index ece6e5d..1b3f439 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1194,6 +1194,14 @@ static int decode_header(struct drbd_connection 
*connection, void *header, struc
return 0;
 }
 
+static void drbd_unplug_all_devices(struct drbd_connection *connection)
+{
+   if (current->plug == &connection->receiver_plug) {
+   blk_finish_plug(&connection->receiver_plug);
+   blk_start_plug(&connection->receiver_plug);
+   } /* else: maybe just schedule() ?? */
+}
+
 static int drbd_recv_header(struct drbd_connection *connection, struct 
packet_info *pi)
 {
void *buffer = connection->data.rbuf;
@@ -1209,6 +1217,36 @@ static int drbd_recv_header(struct drbd_connection 
*connection, struct packet_in
return err;
 }
 
+static int drbd_recv_header_maybe_unplug(struct drbd_connection *connection, 
struct packet_info *pi)
+{
+   void *buffer = connection->data.rbuf;
+   unsigned int size = drbd_header_size(connection);
+   int err;
+
+   err = drbd_recv_short(connection->data.socket, buffer, size, 
MSG_NOSIGNAL|MSG_DONTWAIT);
+   if (err != size) {
+   /* If we have nothing in the receive buffer now, to reduce
+* application latency, try to drain the backend queues as
+* quickly as possible, and let remote TCP know what we have
+* received so far. */
+   if (err == -EAGAIN) {
+

[PATCH 06/17] drbd: Fix resource role for newly created resources in events2

2017-08-24 Thread Philipp Reisner
The conn_highest_role() (a terribly misnamed function) returns
the role of the resource. It returned R_UNKNOWN as long as the
resource did not have a single device.

Resources without devices are short-lived objects.

But it matters for the NOTIFY_CREATE netlink message. It makes
a lot more sense to report R_SECONDARY for a newly created
resource than R_UNKNOWN.

I reviewed all call sites of conn_highest_role(); the change
does not matter for the other call sites.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index eea0c4a..306f116 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -346,7 +346,7 @@ static enum drbd_role min_role(enum drbd_role role1, enum 
drbd_role role2)
 
 enum drbd_role conn_highest_role(struct drbd_connection *connection)
 {
-   enum drbd_role role = R_UNKNOWN;
+   enum drbd_role role = R_SECONDARY;
struct drbd_peer_device *peer_device;
int vnr;
 
-- 
2.7.4



[PATCH 09/17] drbd: Use setup_timer() instead of init_timer() to simplify the code.

2017-08-24 Thread Philipp Reisner
From: Geliang Tang 

Signed-off-by: Geliang Tang 
Signed-off-by: Roland Kammerer 
Signed-off-by: Philipp Reisner 

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 11f3852..056d9ab 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2023,18 +2023,14 @@ void drbd_init_set_defaults(struct drbd_device *device)
device->unplug_work.cb  = w_send_write_hint;
device->bm_io_work.w.cb = w_bitmap_io;
 
-   init_timer(&device->resync_timer);
-   init_timer(&device->md_sync_timer);
-   init_timer(&device->start_resync_timer);
-   init_timer(&device->request_timer);
-   device->resync_timer.function = resync_timer_fn;
-   device->resync_timer.data = (unsigned long) device;
-   device->md_sync_timer.function = md_sync_timer_fn;
-   device->md_sync_timer.data = (unsigned long) device;
-   device->start_resync_timer.function = start_resync_timer_fn;
-   device->start_resync_timer.data = (unsigned long) device;
-   device->request_timer.function = request_timer_fn;
-   device->request_timer.data = (unsigned long) device;
+   setup_timer(&device->resync_timer, resync_timer_fn,
+   (unsigned long)device);
+   setup_timer(&device->md_sync_timer, md_sync_timer_fn,
+   (unsigned long)device);
+   setup_timer(&device->start_resync_timer, start_resync_timer_fn,
+   (unsigned long)device);
+   setup_timer(&device->request_timer, request_timer_fn,
+   (unsigned long)device);
 
init_waitqueue_head(&device->misc_wait);
init_waitqueue_head(&device->state_wait);
-- 
2.7.4



[PATCH 10/17] drbd: fix rmmod cleanup, remove _all_ debugfs entries

2017-08-24 Thread Philipp Reisner
From: Lars Ellenberg 

If there are still resources defined but "empty" (no more volumes
or connections configured), they don't hold module reference counts,
so rmmod is possible.

To avoid DRBD leftovers in debugfs, we need to call our global
drbd_debugfs_cleanup() only after all resources have been cleaned up.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 056d9ab..8b8dd82 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2420,7 +2420,6 @@ static void drbd_cleanup(void)
destroy_workqueue(retry.wq);
 
drbd_genl_unregister();
-   drbd_debugfs_cleanup();
 
idr_for_each_entry(&drbd_devices, device, i)
drbd_delete_device(device);
@@ -2431,6 +2430,8 @@ static void drbd_cleanup(void)
drbd_free_resource(resource);
}
 
+   drbd_debugfs_cleanup();
+
drbd_destroy_mempools();
unregister_blkdev(DRBD_MAJOR, "drbd");
 
-- 
2.7.4



[PATCH 08/17] drbd: fix potential get_ldev/put_ldev refcount imbalance during attach

2017-08-24 Thread Philipp Reisner
From: Lars Ellenberg 

Race:

drbd_adm_attach()   | async drbd_md_endio()
|
device->ldev is still NULL. |
|
drbd_md_read(   |
 .endio = drbd_md_endio;|
 submit;|
    |
 wait for done == 1;|   done = 1;
);  |   wake_up();
.. lot of other stuff,  |
.. includeing taking and|
...giving up locks, |
.. doing further IO,|
.. stuff that takes "some time" |
| while in this context,
| this is the next statement.
| which means this context was scheduled
.. only then, finally,  | away for "some time".
device->ldev = nbc; |
|   if (device->ldev)
|   put_ldev()

Unlikely, but possible. I was able to provoke it "reliably"
by adding an mdelay(500) after the wake_up().
Fixed by moving the if (device->ldev) put_ldev() before done = 1.

Impact of the bug was that the resulting refcount imbalance
could lead to premature destruction of the object, potentially
causing a NULL pointer dereference during a subsequent detach.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index e48012d..f0717a9 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -65,6 +65,11 @@ void drbd_md_endio(struct bio *bio)
device = bio->bi_private;
device->md_io.error = blk_status_to_errno(bio->bi_status);
 
+   /* special case: drbd_md_read() during drbd_adm_attach() */
+   if (device->ldev)
+   put_ldev(device);
+   bio_put(bio);
+
/* We grabbed an extra reference in _drbd_md_sync_page_io() to be able
 * to timeout on the lower level device, and eventually detach from it.
 * If this io completion runs after that timeout expired, this
@@ -79,9 +84,6 @@ void drbd_md_endio(struct bio *bio)
drbd_md_put_buffer(device);
device->md_io.done = 1;
wake_up(&device->misc_wait);
-   bio_put(bio);
-   if (device->ldev) /* special case: drbd_md_read() during 
drbd_adm_attach() */
-   put_ldev(device);
 }
 
 /* reads on behalf of the partner,
-- 
2.7.4



[PATCH 12/17] drbd: fix potential deadlock when trying to detach during handshake

2017-08-24 Thread Philipp Reisner
From: Lars Ellenberg 

When requesting a detach, we first suspend IO, and also inhibit meta-data IO
by means of drbd_md_get_buffer(), because we don't want to "fail" the disk
while there is IO in-flight: the transition into D_FAILED for detach purposes
may get misinterpreted as actual IO error in a confused endio function.

We wrap it all into wait_event(), to retry in case the drbd_req_state()
returns SS_IN_TRANSIENT_STATE, as it does for example during an ongoing
connection handshake.

In that example, the receiver thread may need to grab drbd_md_get_buffer()
during the handshake to make progress.  To avoid potential deadlock with
detach, detach needs to grab and release the meta data buffer inside of
that wait_event retry loop. To avoid lock inversion between
mutex_lock(&device->state_mutex) and drbd_md_get_buffer(device),
introduce a new enum chg_state_flag CS_INHIBIT_MD_IO, and move the
call to drbd_md_get_buffer() inside the state_mutex grabbed in
drbd_req_state().

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index c383b6c..6bb58a6 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -2149,34 +2149,13 @@ int drbd_adm_attach(struct sk_buff *skb, struct 
genl_info *info)
 
 static int adm_detach(struct drbd_device *device, int force)
 {
-   enum drbd_state_rv retcode;
-   void *buffer;
-   int ret;
-
if (force) {
set_bit(FORCE_DETACH, &device->flags);
drbd_force_state(device, NS(disk, D_FAILED));
-   retcode = SS_SUCCESS;
-   goto out;
+   return SS_SUCCESS;
}
 
-   drbd_suspend_io(device); /* so no-one is stuck in drbd_al_begin_io */
-   buffer = drbd_md_get_buffer(device, __func__); /* make sure there is no 
in-flight meta-data IO */
-   if (buffer) {
-   retcode = drbd_request_state(device, NS(disk, D_FAILED));
-   drbd_md_put_buffer(device);
-   } else /* already <= D_FAILED */
-   retcode = SS_NOTHING_TO_DO;
-   /* D_FAILED will transition to DISKLESS. */
-   drbd_resume_io(device);
-   ret = wait_event_interruptible(device->misc_wait,
-   device->state.disk != D_FAILED);
-   if ((int)retcode == (int)SS_IS_DISKLESS)
-   retcode = SS_NOTHING_TO_DO;
-   if (ret)
-   retcode = ERR_INTR;
-out:
-   return retcode;
+   return drbd_request_detach_interruptible(device);
 }
 
 /* Detaching the disk is a process in multiple stages.  First we need to lock
diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 306f116..0813c65 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -579,11 +579,14 @@ drbd_req_state(struct drbd_device *device, union 
drbd_state mask,
unsigned long flags;
union drbd_state os, ns;
enum drbd_state_rv rv;
+   void *buffer = NULL;
 
init_completion(&done);
 
if (f & CS_SERIALIZE)
mutex_lock(device->state_mutex);
+   if (f & CS_INHIBIT_MD_IO)
+   buffer = drbd_md_get_buffer(device, __func__);
 
spin_lock_irqsave(&device->resource->req_lock, flags);
os = drbd_read_state(device);
@@ -636,6 +639,8 @@ drbd_req_state(struct drbd_device *device, union drbd_state 
mask,
}
 
 abort:
+   if (buffer)
+   drbd_md_put_buffer(device);
if (f & CS_SERIALIZE)
mutex_unlock(device->state_mutex);
 
@@ -664,6 +669,47 @@ _drbd_request_state(struct drbd_device *device, union 
drbd_state mask,
return rv;
 }
 
+/*
+ * We grab drbd_md_get_buffer(), because we don't want to "fail" the disk while
+ * there is IO in-flight: the transition into D_FAILED for detach purposes
+ * may get misinterpreted as actual IO error in a confused endio function.
+ *
+ * We wrap it all into wait_event(), to retry in case the drbd_req_state()
+ * returns SS_IN_TRANSIENT_STATE.
+ *
+ * To avoid potential deadlock with e.g. the receiver thread trying to grab
+ * drbd_md_get_buffer() while trying to get out of the "transient state", we
+ * need to grab and release the meta data buffer inside of that wait_event 
loop.
+ */
+static enum drbd_state_rv
+request_detach(struct drbd_device *device)
+{
+   return drbd_req_state(device, NS(disk, D_FAILED),
+   CS_VERBOSE | CS_ORDERED | CS_INHIBIT_MD_IO);
+}
+
+enum drbd_state_rv
+drbd_request_detach_interruptible(struct drbd_device *device)
+{
+   enum drbd_state_rv rv;
+   int ret;
+
+   drbd_suspend_io(device); /* so no-one is stuck in drbd_al_begin_io */
+   wait_event_interruptible(device->state_wait,
+   (rv = request_detach(device)) != SS_IN_TRANSIENT_STATE);
+   drbd_resume_io(device);
+
+ 

[PATCH 07/17] drbd: new disk-option disable-write-same

2017-08-24 Thread Philipp Reisner
From: Lars Ellenberg 

Some backend devices claim to support write-same,
but would fail actual write-same requests.

Allow setting (or toggling) whether DRBD tries to support write-same.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index ad0fcb4..c383b6c 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1236,12 +1236,18 @@ static void fixup_discard_if_not_supported(struct 
request_queue *q)
 
 static void decide_on_write_same_support(struct drbd_device *device,
struct request_queue *q,
-   struct request_queue *b, struct o_qlim *o)
+   struct request_queue *b, struct o_qlim *o,
+   bool disable_write_same)
 {
struct drbd_peer_device *peer_device = first_peer_device(device);
struct drbd_connection *connection = peer_device->connection;
bool can_do = b ? b->limits.max_write_same_sectors : true;
 
+   if (can_do && disable_write_same) {
+   can_do = false;
+   drbd_info(peer_device, "WRITE_SAME disabled by config\n");
+   }
+
if (can_do && connection->cstate >= C_CONNECTED && 
!(connection->agreed_features & DRBD_FF_WSAME)) {
can_do = false;
drbd_info(peer_device, "peer does not support WRITE_SAME\n");
@@ -1302,6 +1308,7 @@ static void drbd_setup_queue_param(struct drbd_device 
*device, struct drbd_backi
struct request_queue *b = NULL;
struct disk_conf *dc;
bool discard_zeroes_if_aligned = true;
+   bool disable_write_same = false;
 
if (bdev) {
b = bdev->backing_bdev->bd_disk->queue;
@@ -1311,6 +1318,7 @@ static void drbd_setup_queue_param(struct drbd_device 
*device, struct drbd_backi
dc = rcu_dereference(device->ldev->disk_conf);
max_segments = dc->max_bio_bvecs;
discard_zeroes_if_aligned = dc->discard_zeroes_if_aligned;
+   disable_write_same = dc->disable_write_same;
rcu_read_unlock();
 
blk_set_stacking_limits(&q->limits);
@@ -1321,7 +1329,7 @@ static void drbd_setup_queue_param(struct drbd_device 
*device, struct drbd_backi
blk_queue_max_segments(q, max_segments ? max_segments : 
BLK_MAX_SEGMENTS);
blk_queue_segment_boundary(q, PAGE_SIZE-1);
decide_on_discard_support(device, q, b, discard_zeroes_if_aligned);
-   decide_on_write_same_support(device, q, b, o);
+   decide_on_write_same_support(device, q, b, o, disable_write_same);
 
if (b) {
blk_queue_stack_limits(q, b);
@@ -1612,7 +1620,8 @@ int drbd_adm_disk_opts(struct sk_buff *skb, struct 
genl_info *info)
if (write_ordering_changed(old_disk_conf, new_disk_conf))
drbd_bump_write_ordering(device->resource, NULL, WO_BDEV_FLUSH);
 
-   if (old_disk_conf->discard_zeroes_if_aligned != 
new_disk_conf->discard_zeroes_if_aligned)
+   if (old_disk_conf->discard_zeroes_if_aligned != 
new_disk_conf->discard_zeroes_if_aligned
+   ||  old_disk_conf->disable_write_same != 
new_disk_conf->disable_write_same)
drbd_reconsider_queue_parameters(device, device->ldev, NULL);
 
drbd_md_sync(device);
diff --git a/include/linux/drbd_genl.h b/include/linux/drbd_genl.h
index 2896f93..4e6d4d4 100644
--- a/include/linux/drbd_genl.h
+++ b/include/linux/drbd_genl.h
@@ -132,7 +132,8 @@ GENL_struct(DRBD_NLA_DISK_CONF, 3, disk_conf,
__flg_field_def(18, DRBD_GENLA_F_MANDATORY, disk_drain, 
DRBD_DISK_DRAIN_DEF)
__flg_field_def(19, DRBD_GENLA_F_MANDATORY, md_flushes, 
DRBD_MD_FLUSHES_DEF)
__flg_field_def(23, 0 /* OPTIONAL */,   al_updates, 
DRBD_AL_UPDATES_DEF)
-   __flg_field_def(24, 0 /* OPTIONAL */,   
discard_zeroes_if_aligned, DRBD_DISCARD_ZEROES_IF_ALIGNED)
+   __flg_field_def(24, 0 /* OPTIONAL */,   
discard_zeroes_if_aligned, DRBD_DISCARD_ZEROES_IF_ALIGNED_DEF)
+   __flg_field_def(26, 0 /* OPTIONAL */,   disable_write_same, 
DRBD_DISABLE_WRITE_SAME_DEF)
 )
 
 GENL_struct(DRBD_NLA_RESOURCE_OPTS, 4, res_opts,
diff --git a/include/linux/drbd_limits.h b/include/linux/drbd_limits.h
index ddac684..24ae1b9 100644
--- a/include/linux/drbd_limits.h
+++ b/include/linux/drbd_limits.h
@@ -209,12 +209,18 @@
 #define DRBD_MD_FLUSHES_DEF1
 #define DRBD_TCP_CORK_DEF  1
 #define DRBD_AL_UPDATES_DEF 1
+
 /* We used to ignore the discard_zeroes_data setting.
  * To not change established (and expected) behaviour,
  * by default assume that, for discard_zeroes_data=0,
  * we can make that an effective discard_zeroes_data=1,
  * if we only explicitly zero-out unaligned partial chunks. */
-#define DRBD_DISCARD_ZEROES_IF_ALIGNED 1
+#define D

[PATCH 13/17] drbd: fix race between handshake and admin disconnect/down

2017-08-24 Thread Philipp Reisner
From: Lars Ellenberg 

conn_try_disconnect() could potentially hit the BUG_ON()
in _conn_set_state() where it iterates over _drbd_set_state()
and "asserts" via BUG_ON() that the latter was successful.

If the STATE_SENT bit was not yet visible to conn_is_valid_transition()
early in _conn_request_state(), but became visible before conn_set_state()
later in that call path, we could hit the BUG_ON() after _drbd_set_state(),
because it returned SS_IN_TRANSIENT_STATE.

To avoid that race, protect set_bit(STATE_SENT) with the spinlock.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 2489667..5e090a1 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1100,7 +1100,10 @@ static int conn_connect(struct drbd_connection 
*connection)
idr_for_each_entry(&connection->peer_devices, peer_device, vnr)
mutex_lock(peer_device->device->state_mutex);
 
+   /* avoid a race with conn_request_state( C_DISCONNECTING ) */
+   spin_lock_irq(&connection->resource->req_lock);
set_bit(STATE_SENT, &connection->flags);
+   spin_unlock_irq(&connection->resource->req_lock);
 
idr_for_each_entry(&connection->peer_devices, peer_device, vnr)
mutex_unlock(peer_device->device->state_mutex);
-- 
2.7.4



[PATCH 15/17] drbd: move global variables to drbd namespace and make some static

2017-08-24 Thread Philipp Reisner
From: Roland Kammerer 

This is a follow-up to Greg's complaint that drbd clutters the global
namespace.
Some of DRBD's module parameters are only used within one compilation
unit. Make these static.

Signed-off-by: Roland Kammerer 
Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 61596af..7e8589c 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -63,19 +63,15 @@
 # define __must_hold(x)
 #endif
 
-/* module parameter, defined in drbd_main.c */
-extern unsigned int minor_count;
-extern bool disable_sendpage;
-extern bool allow_oos;
-void tl_abort_disk_io(struct drbd_device *device);
-
+/* shared module parameters, defined in drbd_main.c */
 #ifdef CONFIG_DRBD_FAULT_INJECTION
-extern int enable_faults;
-extern int fault_rate;
-extern int fault_devs;
+extern int drbd_enable_faults;
+extern int drbd_fault_rate;
 #endif
 
+extern unsigned int drbd_minor_count;
 extern char drbd_usermode_helper[];
+extern int drbd_proc_details;
 
 
 /* This is used to stop/restart our threads.
@@ -181,8 +177,8 @@ _drbd_insert_fault(struct drbd_device *device, unsigned int 
type);
 static inline int
 drbd_insert_fault(struct drbd_device *device, unsigned int type) {
 #ifdef CONFIG_DRBD_FAULT_INJECTION
-   return fault_rate &&
-   (enable_faults & (1<<type)) &&
-/* allow_open_on_secondary */
-MODULE_PARM_DESC(allow_oos, "DONT USE!");
 /* thanks to these macros, if compiled into the kernel (not-module),
- * this becomes the boot parameter drbd.minor_count */
-module_param(minor_count, uint, 0444);
-module_param(disable_sendpage, bool, 0644);
-module_param(allow_oos, bool, 0);
-module_param(proc_details, int, 0644);
+ * these become boot parameters (e.g., drbd.minor_count) */
 
 #ifdef CONFIG_DRBD_FAULT_INJECTION
-int enable_faults;
-int fault_rate;
-static int fault_count;
-int fault_devs;
+int drbd_enable_faults;
+int drbd_fault_rate;
+static int drbd_fault_count;
+static int drbd_fault_devs;
 /* bitmap of enabled faults */
-module_param(enable_faults, int, 0664);
+module_param_named(enable_faults, drbd_enable_faults, int, 0664);
 /* fault rate % value - applies to all enabled faults */
-module_param(fault_rate, int, 0664);
+module_param_named(fault_rate, drbd_fault_rate, int, 0664);
 /* count of faults inserted */
-module_param(fault_count, int, 0664);
+module_param_named(fault_count, drbd_fault_count, int, 0664);
 /* bitmap of devices to insert faults on */
-module_param(fault_devs, int, 0644);
+module_param_named(fault_devs, drbd_fault_devs, int, 0644);
 #endif
 
-/* module parameter, defined */
-unsigned int minor_count = DRBD_MINOR_COUNT_DEF;
-bool disable_sendpage;
-bool allow_oos;
-int proc_details;   /* Detail level in proc drbd*/
-
+/* module parameters we can keep static */
+static bool drbd_allow_oos; /* allow_open_on_secondary */
+static bool drbd_disable_sendpage;
+MODULE_PARM_DESC(allow_oos, "DONT USE!");
+module_param_named(allow_oos, drbd_allow_oos, bool, 0);
+module_param_named(disable_sendpage, drbd_disable_sendpage, bool, 0644);
+
+/* module parameters we share */
+int drbd_proc_details; /* Detail level in proc drbd*/
+module_param_named(proc_details, drbd_proc_details, int, 0644);
+/* module parameters shared with defaults */
+unsigned int drbd_minor_count = DRBD_MINOR_COUNT_DEF;
 /* Module parameter for setting the user mode helper program
  * to run. Default is /sbin/drbdadm */
 char drbd_usermode_helper[80] = "/sbin/drbdadm";
-
+module_param_named(minor_count, drbd_minor_count, uint, 0444);
 module_param_string(usermode_helper, drbd_usermode_helper, 
sizeof(drbd_usermode_helper), 0644);
 
 /* in 2.6.x, our device mapping and config info contains our virtual gendisks
@@ -1562,7 +1562,7 @@ static int _drbd_send_page(struct drbd_peer_device 
*peer_device, struct page *pa
 * put_page(); and would cause either a VM_BUG directly, or
 * __page_cache_release a page that would actually still be referenced
 * by someone, leading to some obscure delayed Oops somewhere else. */
-   if (disable_sendpage || (page_count(page) < 1) || PageSlab(page))
+   if (drbd_disable_sendpage || (page_count(page) < 1) || PageSlab(page))
return _drbd_no_send_page(peer_device, page, offset, size, 
msg_flags);
 
msg_flags |= MSG_NOSIGNAL;
@@ -1934,7 +1934,7 @@ static int drbd_open(struct block_device *bdev, fmode_t 
mode)
if (device->state.role != R_PRIMARY) {
if (mode & FMODE_WRITE)
rv = -EROFS;
-   else if (!allow_oos)
+   else if (!drbd_allow_oos)
rv = -EMEDIUMTYPE;
}
 
@@ -2142,7 +2142,7 @@ static void drbd_destroy_mempools(void)
 static int drbd_create_mempools(void)
 {
struct page *page;
-   const int number = (DRBD_MAX_BIO_SIZE/PAGE_SIZE) * minor

[PATCH 11/17] drbd: A single dot should be put into a sequence.

2017-08-24 Thread Philipp Reisner
From: Markus Elfring 

Thus use the corresponding function "seq_putc".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
Signed-off-by: Roland Kammerer 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_proc.c b/drivers/block/drbd/drbd_proc.c
index 8378142..fc0f627 100644
--- a/drivers/block/drbd/drbd_proc.c
+++ b/drivers/block/drbd/drbd_proc.c
@@ -127,7 +127,7 @@ static void drbd_syncer_progress(struct drbd_device 
*device, struct seq_file *se
seq_putc(seq, '=');
seq_putc(seq, '>');
for (i = 0; i < y; i++)
-   seq_printf(seq, ".");
+   seq_putc(seq, '.');
seq_puts(seq, "] ");
 
if (state.conn == C_VERIFY_S || state.conn == C_VERIFY_T)
-- 
2.7.4



[PATCH 16/17] drbd: abort drbd_start_resync if there is no connection

2017-08-24 Thread Philipp Reisner
From: Roland Kammerer 

This was found by a static analysis tool. While highly unlikely, be sure
to return without dereferencing the NULL pointer.

Reported-by: Shaobo 
Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index f0717a9..03471b3 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -1756,6 +1756,11 @@ void drbd_start_resync(struct drbd_device *device, enum 
drbd_conns side)
return;
}
 
+   if (!connection) {
+   drbd_err(device, "No connection to peer, aborting!\n");
+   return;
+   }
+
if (!test_bit(B_RS_H_DONE, &device->flags)) {
if (side == C_SYNC_TARGET) {
/* Since application IO was locked out during 
C_WF_BITMAP_T and
-- 
2.7.4



[PATCH 17/17] drbd: switch from kmalloc() to kmalloc_array()

2017-08-24 Thread Philipp Reisner
From: Roland Kammerer 

We had one call to kmalloc that actually allocates an array. Switch that
one to the kmalloc_array() function.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 4e8a543..796eaf3 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -4126,7 +4126,7 @@ static int receive_uuids(struct drbd_connection 
*connection, struct packet_info
return config_unknown_volume(connection, pi);
device = peer_device->device;
 
-   p_uuid = kmalloc(sizeof(u64)*UI_EXTENDED_SIZE, GFP_NOIO);
+   p_uuid = kmalloc_array(UI_EXTENDED_SIZE, sizeof(*p_uuid), GFP_NOIO);
if (!p_uuid) {
drbd_err(device, "kmalloc of p_uuid failed\n");
return false;
diff --git a/include/linux/drbd.h b/include/linux/drbd.h
index 002611c..2d02593 100644
--- a/include/linux/drbd.h
+++ b/include/linux/drbd.h
@@ -51,7 +51,7 @@
 #endif
 
 extern const char *drbd_buildtag(void);
-#define REL_VERSION "8.4.7"
+#define REL_VERSION "8.4.10"
 #define API_VERSION 1
 #define PRO_VERSION_MIN 86
 #define PRO_VERSION_MAX 101
-- 
2.7.4



[PATCH 14/17] drbd: rename "usermode_helper" to "drbd_usermode_helper"

2017-08-24 Thread Philipp Reisner
From: Greg Kroah-Hartman 

Nothing like having a very generic global variable in a tiny driver
subsystem to make a mess of the global namespace...

Note, there are many other "generic" named global variables in the drbd
subsystem, someone should fix those up one day before they hit a linking
error.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 74a7d0b..61596af 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -75,7 +75,7 @@ extern int fault_rate;
 extern int fault_devs;
 #endif
 
-extern char usermode_helper[];
+extern char drbd_usermode_helper[];
 
 
 /* This is used to stop/restart our threads.
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 8b8dd82..bdd9ab2 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -109,9 +109,9 @@ int proc_details;   /* Detail level in proc drbd*/
 
 /* Module parameter for setting the user mode helper program
  * to run. Default is /sbin/drbdadm */
-char usermode_helper[80] = "/sbin/drbdadm";
+char drbd_usermode_helper[80] = "/sbin/drbdadm";
 
-module_param_string(usermode_helper, usermode_helper, sizeof(usermode_helper), 
0644);
+module_param_string(usermode_helper, drbd_usermode_helper, 
sizeof(drbd_usermode_helper), 0644);
 
 /* in 2.6.x, our device mapping and config info contains our virtual gendisks
  * as member "struct gendisk *vdisk;"
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 6bb58a6..a12f77e 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -344,7 +344,7 @@ int drbd_khelper(struct drbd_device *device, char *cmd)
 (char[60]) { }, /* address */
NULL };
char mb[14];
-   char *argv[] = {usermode_helper, cmd, mb, NULL };
+   char *argv[] = {drbd_usermode_helper, cmd, mb, NULL };
struct drbd_connection *connection = 
first_peer_device(device)->connection;
struct sib_info sib;
int ret;
@@ -359,19 +359,19 @@ int drbd_khelper(struct drbd_device *device, char *cmd)
 * write out any unsynced meta data changes now */
drbd_md_sync(device);
 
-   drbd_info(device, "helper command: %s %s %s\n", usermode_helper, cmd, 
mb);
+   drbd_info(device, "helper command: %s %s %s\n", drbd_usermode_helper, 
cmd, mb);
sib.sib_reason = SIB_HELPER_PRE;
sib.helper_name = cmd;
drbd_bcast_event(device, &sib);
notify_helper(NOTIFY_CALL, device, connection, cmd, 0);
-   ret = call_usermodehelper(usermode_helper, argv, envp, UMH_WAIT_PROC);
+   ret = call_usermodehelper(drbd_usermode_helper, argv, envp, 
UMH_WAIT_PROC);
if (ret)
drbd_warn(device, "helper command: %s %s %s exit code %u 
(0x%x)\n",
-   usermode_helper, cmd, mb,
+   drbd_usermode_helper, cmd, mb,
(ret >> 8) & 0xff, ret);
else
drbd_info(device, "helper command: %s %s %s exit code %u 
(0x%x)\n",
-   usermode_helper, cmd, mb,
+   drbd_usermode_helper, cmd, mb,
(ret >> 8) & 0xff, ret);
sib.sib_reason = SIB_HELPER_POST;
sib.helper_exit_code = ret;
@@ -396,24 +396,24 @@ enum drbd_peer_state conn_khelper(struct drbd_connection 
*connection, char *cmd)
 (char[60]) { }, /* address */
NULL };
char *resource_name = connection->resource->name;
-   char *argv[] = {usermode_helper, cmd, resource_name, NULL };
+   char *argv[] = {drbd_usermode_helper, cmd, resource_name, NULL };
int ret;
 
setup_khelper_env(connection, envp);
conn_md_sync(connection);
 
-   drbd_info(connection, "helper command: %s %s %s\n", usermode_helper, 
cmd, resource_name);
+   drbd_info(connection, "helper command: %s %s %s\n", 
drbd_usermode_helper, cmd, resource_name);
/* TODO: conn_bcast_event() ?? */
notify_helper(NOTIFY_CALL, NULL, connection, cmd, 0);
 
-   ret = call_usermodehelper(usermode_helper, argv, envp, UMH_WAIT_PROC);
+   ret = call_usermodehelper(drbd_usermode_helper, argv, envp, 
UMH_WAIT_PROC);
if (ret)
drbd_warn(connection, "helper command: %s %s %s exit code %u 
(0x%x)\n",
- usermode_helper, cmd, resource_name,
+ drbd_usermode_helper, cmd, resource_name,
  (ret >> 8) & 0xff, ret);
else
drbd_info(connection, "helper command: %s %s %s exit code %u 
(0x%x)\n",
- usermode_helper, cmd, reso

[PATCH 00/17] DRBD updates

2017-08-24 Thread Philipp Reisner
Hi Jens,

Please consider these patches for your for-4.14 branch.

The first and third patch help with request merging on DRBD's secondary side.
That can improve performance for some workloads.

The other patches are fixes and random maintenance.


Baoyou Xie (1):
  drbd: mark symbols static where possible

Geliang Tang (1):
  drbd: Use setup_timer() instead of init_timer() to simplify the code.

Greg Kroah-Hartman (1):
  drbd: rename "usermode_helper" to "drbd_usermode_helper"

Lars Ellenberg (9):
  drbd: introduce drbd_recv_header_maybe_unplug
  drbd: change list_for_each_safe to while(list_first_entry_or_null)
  drbd: add explicit plugging when submitting batches
  drbd: Send P_NEG_ACK upon write error in protocol != C
  drbd: new disk-option disable-write-same
  drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
  drbd: fix rmmod cleanup, remove _all_ debugfs entries
  drbd: fix potential deadlock when trying to detach during handshake
  drbd: fix race between handshake and admin disconnect/down

Markus Elfring (1):
  drbd: A single dot should be put into a sequence.

Philipp Reisner (1):
  drbd: Fix resource role for newly created resources in events2

Roland Kammerer (3):
  drbd: move global variables to drbd namespace and make some static
  drbd: abort drbd_start_resync if there is no connection
  drbd: switch from kmalloc() to kmalloc_array()

 drivers/block/drbd/drbd_int.h  |  27 +-
 drivers/block/drbd/drbd_main.c | 106 +
 drivers/block/drbd/drbd_nl.c   |  60 +
 drivers/block/drbd/drbd_proc.c |  10 ++--
 drivers/block/drbd/drbd_receiver.c |  56 +---
 drivers/block/drbd/drbd_req.c  |  80 ++--
 drivers/block/drbd/drbd_req.h  |   6 +++
 drivers/block/drbd/drbd_state.c|  48 -
 drivers/block/drbd/drbd_state.h|   8 +++
 drivers/block/drbd/drbd_worker.c   |  46 
 include/linux/drbd.h   |   2 +-
 include/linux/drbd_genl.h  |   3 +-
 include/linux/drbd_limits.h|   8 ++-
 13 files changed, 333 insertions(+), 127 deletions(-)

-- 
2.7.4



[PATCH 04/17] drbd: Send P_NEG_ACK upon write error in protocol != C

2017-08-24 Thread Philipp Reisner
From: Lars Ellenberg 

In protocol != C, we forgot to send the P_NEG_ACK for failing writes.

Once we no longer submit to local disk, because we already "detached",
due to the typical "on-io-error detach;" config setting,
we already send the neg acks right away.

Only those requests that have been submitted,
and have been error-completed by the local disk,
would forget to send the neg-ack,
and only in asynchronous replication (protocol != C).
Unless this happened during resync,
where we already always send acks, regardless of protocol.

The primary side needs the P_NEG_ACK in order to mark
the affected block(s) for resync in its out-of-sync bitmap.

If the blocks in question are not re-written again,
we may fail to resync them later, causing data inconsistencies.

This patch will always send the neg-acks, and also at least try to
persist the out-of-sync status on the local node already.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 

diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index 2745db2..72cb0bd 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -128,6 +128,14 @@ void drbd_endio_write_sec_final(struct drbd_peer_request 
*peer_req) __releases(l
block_id = peer_req->block_id;
peer_req->flags &= ~EE_CALL_AL_COMPLETE_IO;
 
+   if (peer_req->flags & EE_WAS_ERROR) {
+   /* In protocol != C, we usually do not send write acks.
+* In case of a write error, send the neg ack anyways. */
+   if (!__test_and_set_bit(__EE_SEND_WRITE_ACK, &peer_req->flags))
+   inc_unacked(device);
+   drbd_set_out_of_sync(device, peer_req->i.sector, 
peer_req->i.size);
+   }
+
spin_lock_irqsave(&device->resource->req_lock, flags);
device->writ_cnt += peer_req->i.size >> 9;
list_move_tail(&peer_req->w.list, &device->done_ee);
-- 
2.7.4



Re: [PATCH] drbd: mark symbols static where possible

2016-09-02 Thread Philipp Reisner
Hi Baoyou,

thanks for the patch. I applied it to our tree. Will be sent to
one of the next merge windows...

best regards,
 Phil
On Thursday, 1 September 2016, 18:57:53 CEST, Baoyou Xie wrote:
> We get a few warnings when building kernel with W=1:
> drivers/block/drbd/drbd_receiver.c:1224:6: warning: no previous prototype
> for 'one_flush_endio' [-Wmissing-prototypes]
> drivers/block/drbd/drbd_req.c:1450:6: warning: no previous prototype for
> 'send_and_submit_pending' [-Wmissing-prototypes]
> drivers/block/drbd/drbd_main.c:924:6: warning: no previous prototype for
> 'assign_p_sizes_qlim' [-Wmissing-prototypes] 
> 
> In fact, these functions are only used in the file in which they are
> declared and don't need a declaration, but can be made static.
> So this patch marks these functions with 'static'.
> 
> Signed-off-by: Baoyou Xie 
> ---
>  drivers/block/drbd/drbd_main.c | 4 +++-
>  drivers/block/drbd/drbd_receiver.c | 2 +-
>  drivers/block/drbd/drbd_req.c  | 3 ++-
>  drivers/block/drbd/drbd_worker.c   | 3 ++-
>  4 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
> index 100be55..f0aa746 100644
> --- a/drivers/block/drbd/drbd_main.c
> +++ b/drivers/block/drbd/drbd_main.c
> @@ -921,7 +921,9 @@ void drbd_gen_and_send_sync_uuid(struct drbd_peer_device
> *peer_device) }
> 
>  /* communicated if (agreed_features & DRBD_FF_WSAME) */
> -void assign_p_sizes_qlim(struct drbd_device *device, struct p_sizes *p,
> struct request_queue *q) +static void
> +assign_p_sizes_qlim(struct drbd_device *device, struct p_sizes *p,
> + struct request_queue *q)
>  {
>   if (q) {
>   p->qlim->physical_block_size = 
cpu_to_be32(queue_physical_block_size(q));
> diff --git a/drivers/block/drbd/drbd_receiver.c
> b/drivers/block/drbd/drbd_receiver.c index 942384f..432f39a 100644
> --- a/drivers/block/drbd/drbd_receiver.c
> +++ b/drivers/block/drbd/drbd_receiver.c
> @@ -1221,7 +1221,7 @@ struct one_flush_context {
>   struct issue_flush_context *ctx;
>  };
> 
> -void one_flush_endio(struct bio *bio)
> +static void one_flush_endio(struct bio *bio)
>  {
>   struct one_flush_context *octx = bio->bi_private;
>   struct drbd_device *device = octx->device;
> diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
> index de279fe..c725bf5 100644
> --- a/drivers/block/drbd/drbd_req.c
> +++ b/drivers/block/drbd/drbd_req.c
> @@ -1447,7 +1447,8 @@ static bool prepare_al_transaction_nonblock(struct
> drbd_device *device, return !list_empty(pending);
>  }
> 
> -void send_and_submit_pending(struct drbd_device *device, struct list_head
> *pending) +static void
> +send_and_submit_pending(struct drbd_device *device, struct list_head
> *pending) {
>   struct drbd_request *req, *tmp;
> 
> diff --git a/drivers/block/drbd/drbd_worker.c
> b/drivers/block/drbd/drbd_worker.c index c6755c9..70f2706 100644
> --- a/drivers/block/drbd/drbd_worker.c
> +++ b/drivers/block/drbd/drbd_worker.c
> @@ -194,7 +194,8 @@ void drbd_peer_request_endio(struct bio *bio)
>   }
>  }
> 
> -void drbd_panic_after_delayed_completion_of_aborted_request(struct
> drbd_device *device) +static void
> +drbd_panic_after_delayed_completion_of_aborted_request(struct drbd_device
> *device) {
>   panic("drbd%u %s/%u potential random memory corruption caused by delayed
> completion of aborted local request\n", device->minor,
> device->resource->name, device->vnr);




[PATCH 07/30] drbd: adjust assert in w_bitmap_io to account for BM_LOCKED_CHANGE_ALLOWED

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_main.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index b0891c3..64e9525 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -3523,7 +3523,12 @@ static int w_bitmap_io(struct drbd_work *w, int unused)
struct bm_io_work *work = &device->bm_io_work;
int rv = -EIO;
 
-   D_ASSERT(device, atomic_read(&device->ap_bio_cnt) == 0);
+   if (work->flags != BM_LOCKED_CHANGE_ALLOWED) {
+   int cnt = atomic_read(&device->ap_bio_cnt);
+   if (cnt)
+   drbd_err(device, "FIXME: ap_bio_cnt %d, expected 0; 
queued for '%s'\n",
+   cnt, work->why);
+   }
 
if (get_ldev(device)) {
drbd_bm_lock(device, work->why, work->flags);
-- 
2.7.4



[PATCH 04/30] drbd: Implement handling of thinly provisioned storage on resync target nodes

2016-06-13 Thread Philipp Reisner
If during resync we read only zeroes for a range of sectors, assume
that these sectors can be discarded on the sync target node.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_int.h  |  5 +++
 drivers/block/drbd/drbd_main.c | 18 
 drivers/block/drbd/drbd_protocol.h |  4 ++
 drivers/block/drbd/drbd_receiver.c | 88 --
 drivers/block/drbd/drbd_worker.c   | 29 -
 5 files changed, 140 insertions(+), 4 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 33f0b82..9e338ec 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -471,6 +471,9 @@ enum {
/* this originates from application on peer
 * (not some resync or verify or other DRBD internal request) */
__EE_APPLICATION,
+
+   /* If it contains only 0 bytes, send back P_RS_DEALLOCATED */
+   __EE_RS_THIN_REQ,
 };
 #define EE_CALL_AL_COMPLETE_IO (1<<__EE_CALL_AL_COMPLETE_IO)
 #define EE_MAY_SET_IN_SYNC (1<<__EE_MAY_SET_IN_SYNC)
@@ -485,6 +488,7 @@ enum {
 #define EE_SUBMITTED   (1<<__EE_SUBMITTED)
 #define EE_WRITE   (1<<__EE_WRITE)
 #define EE_APPLICATION (1<<__EE_APPLICATION)
+#define EE_RS_THIN_REQ (1<<__EE_RS_THIN_REQ)
 
 /* flag bits per device */
 enum {
@@ -1123,6 +1127,7 @@ extern int drbd_send_ov_request(struct drbd_peer_device 
*, sector_t sector, int
 extern int drbd_send_bitmap(struct drbd_device *device);
 extern void drbd_send_sr_reply(struct drbd_peer_device *, enum drbd_state_rv 
retcode);
 extern void conn_send_sr_reply(struct drbd_connection *connection, enum 
drbd_state_rv retcode);
+extern int drbd_send_rs_deallocated(struct drbd_peer_device *, struct 
drbd_peer_request *);
 extern void drbd_backing_dev_free(struct drbd_device *device, struct 
drbd_backing_dev *ldev);
 extern void drbd_device_cleanup(struct drbd_device *device);
 void drbd_print_uuids(struct drbd_device *device, const char *text);
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 2891631..b0891c3 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1377,6 +1377,22 @@ int drbd_send_ack_ex(struct drbd_peer_device 
*peer_device, enum drbd_packet cmd,
  cpu_to_be64(block_id));
 }
 
+int drbd_send_rs_deallocated(struct drbd_peer_device *peer_device,
+struct drbd_peer_request *peer_req)
+{
+   struct drbd_socket *sock;
+   struct p_block_desc *p;
+
+   sock = &peer_device->connection->data;
+   p = drbd_prepare_command(peer_device, sock);
+   if (!p)
+   return -EIO;
+   p->sector = cpu_to_be64(peer_req->i.sector);
+   p->blksize = cpu_to_be32(peer_req->i.size);
+   p->pad = 0;
+   return drbd_send_command(peer_device, sock, P_RS_DEALLOCATED, 
sizeof(*p), NULL, 0);
+}
+
 int drbd_send_drequest(struct drbd_peer_device *peer_device, int cmd,
   sector_t sector, int size, u64 block_id)
 {
@@ -3683,6 +3699,8 @@ const char *cmdname(enum drbd_packet cmd)
[P_CONN_ST_CHG_REPLY]   = "conn_st_chg_reply",
[P_RETRY_WRITE] = "retry_write",
[P_PROTOCOL_UPDATE] = "protocol_update",
+   [P_RS_THIN_REQ] = "rs_thin_req",
+   [P_RS_DEALLOCATED]  = "rs_deallocated",
 
/* enum drbd_packet, but not commands - obsoleted flags:
 *  P_MAY_IGNORE
diff --git a/drivers/block/drbd/drbd_protocol.h 
b/drivers/block/drbd/drbd_protocol.h
index 129f8c7..ce0e72c 100644
--- a/drivers/block/drbd/drbd_protocol.h
+++ b/drivers/block/drbd/drbd_protocol.h
@@ -60,6 +60,10 @@ enum drbd_packet {
 * which is why I chose TRIM here, to disambiguate. */
P_TRIM= 0x31,
 
+   /* Only use these two if both support FF_THIN_RESYNC */
+   P_RS_THIN_REQ = 0x32, /* Request a block for resync or reply 
P_RS_DEALLOCATED */
+   P_RS_DEALLOCATED  = 0x33, /* Contains only zeros on sync source 
node */
+
P_MAY_IGNORE  = 0x100, /* Flag to test if (cmd > P_MAY_IGNORE) 
... */
P_MAX_OPT_CMD = 0x101,
 
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index dcadea2..f5eef97 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1418,9 +1418,15 @@ int drbd_submit_peer_request(struct drbd_device *device,
 * so we can find it to present it in debugfs */
peer_req->submit_jif = jiffies;
peer_req->flags |= EE_SUBMITTED;
-   spin_lock_irq(&device->resource->req_lock);
-   list_add_tail(&peer_req->w.list, &device->active_

[PATCH 08/30] drbd: fix regression: protocol A sometimes synchronous, C sometimes double-latency

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Regression introduced with 8.4.5
 drbd: application writes may set-in-sync in protocol != C

Overwriting the same block (LBA) while a former version is still
"in-flight" to the peer (to be exact: we did not receive the
P_BARRIER_ACK for its epoch yet) would wait for the full epoch of that
former version to be acknowledged by the peer.

In synchronous and quasi-synchronous protocols C and B,
this may double the latency on overwrites.

With protocol A, which is supposed to be asynchronous and only wait for
local completion, it is even worse: it would make overwrites
quasi-synchronous, they would be hit by the full RTT, which protocol A
was specifically meant to avoid, and possibly the additional time it
takes to drain the buffers first.

Particularly bad for databases, or anything else that
does frequent updates to the same blocks (various file system meta data).

No impact if >= rtt passes between updates to the same block.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_req.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index eef6e95..74903ab 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -977,16 +977,20 @@ static void complete_conflicting_writes(struct drbd_request *req)
sector_t sector = req->i.sector;
int size = req->i.size;
 
-   i = drbd_find_overlap(&device->write_requests, sector, size);
-   if (!i)
-   return;
-
for (;;) {
-   prepare_to_wait(&device->misc_wait, &wait, TASK_UNINTERRUPTIBLE);
-   i = drbd_find_overlap(&device->write_requests, sector, size);
-   if (!i)
+   drbd_for_each_overlap(i, &device->write_requests, sector, size) {
+   /* Ignore, if already completed to upper layers. */
+   if (i->completed)
+   continue;
+   /* Handle the first found overlap.  After the schedule
+* we have to restart the tree walk. */
break;
+   }
+   if (!i) /* if any */
+   break;
+
/* Indicate to wake up device->misc_wait on progress.  */
+   prepare_to_wait(&device->misc_wait, &wait, TASK_UNINTERRUPTIBLE);
i->waiting = true;
spin_unlock_irq(&device->resource->req_lock);
schedule();
-- 
2.7.4



[PATCH 27/30] drbd: get rid of empty statement in is_valid_state

2016-06-13 Thread Philipp Reisner
From: Roland Kammerer 

This should silence a warning about an empty statement. Thanks to Fabian
Frederick, who sent a patch that I modified to be smaller and to avoid an
additional indent level.
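For illustration, a compilable sketch of the before/after shape (illustrative names; `-2` stands in for the SS_DEVICE_IN_USE return code):

```c
#include <assert.h>

/* An `if` whose body is only a comment plus `;` is an empty statement
 * and triggers compiler warnings; jumping past the remaining checks
 * keeps the early-abort semantics without the empty body. */
static int validate(int rv, int open_cnt)
{
    if (rv <= 0)
        goto out;           /* already found a reason to abort */
    if (open_cnt)
        rv = -2;            /* stands in for SS_DEVICE_IN_USE */
out:
    return rv;
}
```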

Signed-off-by: Roland Kammerer 
Signed-off-by: Philipp Reisner 
---
 drivers/block/drbd/drbd_state.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index aca68a5..eea0c4a 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -814,7 +814,7 @@ is_valid_state(struct drbd_device *device, union drbd_state ns)
}
 
if (rv <= 0)
-   /* already found a reason to abort */;
+   goto out; /* already found a reason to abort */
else if (ns.role == R_SECONDARY && device->open_cnt)
rv = SS_DEVICE_IN_USE;
 
@@ -862,6 +862,7 @@ is_valid_state(struct drbd_device *device, union drbd_state ns)
else if (ns.conn >= C_CONNECTED && ns.pdsk == D_UNKNOWN)
rv = SS_CONNECTED_OUTDATES;
 
+out:
rcu_read_unlock();
 
return rv;
-- 
2.7.4



[PATCH 28/30] drbd: finally report ms, not jiffies, in log message

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Also skip the message unless bitmap IO took longer than 5 ms.
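The conversion and threshold can be modeled in userspace as a sketch (HZ is an assumed constant here; the kernel helper jiffies_to_msecs() handles any HZ):

```c
#include <assert.h>

/* Userspace model of the change; HZ is an assumed build-time constant. */
#define HZ 250

static unsigned int jiffies_to_ms(unsigned long j)
{
    return (unsigned int)(j * 1000UL / HZ);
}

/* Only log the bitmap IO summary when it took longer than 5 ms. */
static int should_log(unsigned long jiffies_taken)
{
    return jiffies_to_ms(jiffies_taken) > 5;
}
```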

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_bitmap.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 095625b..0807fcb 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -1121,10 +1121,14 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
kref_put(&ctx->kref, &drbd_bm_aio_ctx_destroy);
 
/* summary for global bitmap IO */
-   if (flags == 0)
-   drbd_info(device, "bitmap %s of %u pages took %lu jiffies\n",
-(flags & BM_AIO_READ) ? "READ" : "WRITE",
-count, jiffies - now);
+   if (flags == 0) {
+   unsigned int ms = jiffies_to_msecs(jiffies - now);
+   if (ms > 5) {
+   drbd_info(device, "bitmap %s of %u pages took %u ms\n",
+(flags & BM_AIO_READ) ? "READ" : "WRITE",
+count, ms);
+   }
+   }
 
if (ctx->error) {
drbd_alert(device, "we had at least one MD IO ERROR during bitmap IO\n");
-- 
2.7.4



[PATCH 24/30] drbd: disallow promotion during resync handshake, avoid deadlock and hard reset

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

We already serialize connection state changes,
and other, non-connection state changes (role changes)
while we are establishing a connection.

But if we have an established connection and then trigger a resync
handshake (by primary --force or similar), until now we just had to be
"lucky".

Consider this sequence (e.g. deployment scenario):
create-md; up;
  -> Connected Secondary/Secondary Inconsistent/Inconsistent
then do a racy primary --force on both peers.

 block drbd0: drbd_sync_handshake:
 block drbd0: self 0004::: bits:25590 flags:0
 block drbd0: peer 0004::: bits:25590 flags:0
 block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent )
 block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
  *** HERE things go wrong. ***
 block drbd0: role( Secondary -> Primary )
 block drbd0: drbd_sync_handshake:
 block drbd0: self 0005::: bits:25590 flags:0
 block drbd0: peer C90D2FC716D232AB:0004:: bits:25590 flags:0
 block drbd0: Becoming sync target due to disk states.
 block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
 block drbd0: Remote failed to finish a request within 6007ms > ko-count (2) * timeout (30 * 0.1s)
 drbd s0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )

The problem here is that the local promotion happens before the sync handshake
triggered by the remote promotion was completed.  Some assumptions elsewhere
become wrong, and when the expected resync handshake is then received and
processed, we get stuck in a deadlock, which can only be recovered by reboot :-(

Fix: if we know the peer has good data,
and our own disk is present, but NOT good,
and there is no resync going on yet,
we expect a sync handshake to happen "soon".
So reject a racy promotion with SS_IN_TRANSIENT_STATE.

Result:
 ... as above ...
 block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
  *** local promotion being postponed until ... ***
 block drbd0: drbd_sync_handshake:
 block drbd0: self 0004::: bits:25590 flags:0
 block drbd0: peer 77868BDA836E12A5:0004:: bits:25590 flags:0
  ...
 block drbd0: conn( WFBitMapT -> WFSyncUUID )
 block drbd0: updated sync uuid 85D06D0E8887AD44:::
 block drbd0: conn( WFSyncUUID -> SyncTarget )
  *** ... after the resync handshake ***
 block drbd0: role( Secondary -> Primary )
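The rejection condition from the patch can be modeled standalone. The enum values below are illustrative, not the kernel enums; only their relative order matters (mirroring the kernel's ordering, with C_WF_SYNC_UUID before the running Sync* states):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the state fields involved. */
enum role { R_SECONDARY, R_PRIMARY };
enum disk { D_DISKLESS, D_INCONSISTENT, D_UP_TO_DATE };
enum conn { C_CONNECTED, C_WF_SYNC_UUID, C_SYNC_TARGET };

/* Postpone Secondary -> Primary while a resync handshake is expected:
 * the peer's data is good, our disk is present but not good, and no
 * resync is running yet. */
static bool reject_racy_promotion(enum role os_role, enum role ns_role,
                                  enum disk pdsk, enum disk disk,
                                  enum conn os_conn, enum conn ns_conn)
{
    return os_role != R_PRIMARY && ns_role == R_PRIMARY
        && pdsk == D_UP_TO_DATE
        && disk != D_UP_TO_DATE && disk != D_DISKLESS
        && (ns_conn <= C_WF_SYNC_UUID || ns_conn != os_conn);
}
```

Once the resync is actually running (or the local disk is good or absent), the predicate is false and the promotion proceeds.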

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_state.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 24422e8..7562c5c 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -906,6 +906,15 @@ is_valid_soft_transition(union drbd_state os, union drbd_state ns, struct drbd_c
  (ns.conn >= C_CONNECTED && os.conn == C_WF_REPORT_PARAMS)))
rv = SS_IN_TRANSIENT_STATE;
 
+   /* Do not promote during resync handshake triggered by "force primary".
+* This is a hack. It should really be rejected by the peer during the
+* cluster wide state change request. */
+   if (os.role != R_PRIMARY && ns.role == R_PRIMARY
+   && ns.pdsk == D_UP_TO_DATE
+   && ns.disk != D_UP_TO_DATE && ns.disk != D_DISKLESS
+   && (ns.conn <= C_WF_SYNC_UUID || ns.conn != os.conn))
+   rv = SS_IN_TRANSIENT_STATE;
+
if ((ns.conn == C_VERIFY_S || ns.conn == C_VERIFY_T) && os.conn < C_CONNECTED)
rv = SS_NEED_CONNECTION;
 
-- 
2.7.4



[PATCH 25/30] drbd: bump current uuid when resuming IO with diskless peer

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Scenario, starting with normal operation
 Connected Primary/Secondary UpToDate/UpToDate
 NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen)
 ... more failures happen, secondary loses its disk,
 but eventually is able to re-establish the replication link ...
 Connected Primary/Secondary UpToDate/Diskless (resumed; needs to bump uuid!)

We used to just resume/resend suspended requests,
without bumping the UUID.

Which will lead to problems later, when we want to re-attach the disk on
the peer, without first disconnecting, or if we experience additional
failures, because we now have diverging data without being able to
recognize it.

Make sure we also bump the current data generation UUID,
if we notice "peer disk unknown" -> "peer disk known bad".

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_state.c | 34 --
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 7562c5c..a1b5e6c9 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -1637,6 +1637,26 @@ static void broadcast_state_change(struct drbd_state_change *state_change)
 #undef REMEMBER_STATE_CHANGE
 }
 
+/* takes old and new peer disk state */
+static bool lost_contact_to_peer_data(enum drbd_disk_state os, enum drbd_disk_state ns)
+{
+   if ((os >= D_INCONSISTENT && os != D_UNKNOWN && os != D_OUTDATED)
+   &&  (ns < D_INCONSISTENT || ns == D_UNKNOWN || ns == D_OUTDATED))
+   return true;
+
+   /* Scenario, starting with normal operation
+* Connected Primary/Secondary UpToDate/UpToDate
+* NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen)
+* ...
+* Connected Primary/Secondary UpToDate/Diskless (resumed; needs to bump uuid!)
+*/
+   if (os == D_UNKNOWN
+   &&  (ns == D_DISKLESS || ns == D_FAILED || ns == D_OUTDATED))
+   return true;
+
+   return false;
+}
+
 /**
  * after_state_ch() - Perform after state change actions that may sleep
 * @device:	DRBD device.
@@ -1708,6 +1728,13 @@ static void after_state_ch(struct drbd_device *device, union drbd_state os,
idr_for_each_entry(&connection->peer_devices, peer_device, vnr)
clear_bit(NEW_CUR_UUID, &peer_device->device->flags);
rcu_read_unlock();
+
+   /* We should actively create a new uuid, _before_
+* we resume/resent, if the peer is diskless
+* (recovery from a multiple error scenario).
+* Currently, this happens with a slight delay
+* below when checking lost_contact_to_peer_data() ...
+*/
_tl_restart(connection, RESEND);
_conn_request_state(connection,
(union drbd_state) { { .susp_fen = 
1 } },
@@ -1751,12 +1778,7 @@ static void after_state_ch(struct drbd_device *device, union drbd_state os,
BM_LOCKED_TEST_ALLOWED);
 
/* Lost contact to peer's copy of the data */
-   if ((os.pdsk >= D_INCONSISTENT &&
-os.pdsk != D_UNKNOWN &&
-os.pdsk != D_OUTDATED)
-   &&  (ns.pdsk < D_INCONSISTENT ||
-ns.pdsk == D_UNKNOWN ||
-ns.pdsk == D_OUTDATED)) {
+   if (lost_contact_to_peer_data(os.pdsk, ns.pdsk)) {
if (get_ldev(device)) {
if ((ns.role == R_PRIMARY || ns.peer == R_PRIMARY) &&
device->ldev->md.uuid[UI_BITMAP] == 0 && ns.disk >= D_UP_TO_DATE) {
-- 
2.7.4



[PATCH 26/30] drbd: code cleanups without semantic changes

2016-06-13 Thread Philipp Reisner
From: Fabian Frederick 

This contains various cosmetic fixes ranging from simple typos to
const-ifying, and using booleans properly.

Original commit messages from Fabian's patch set:
drbd: debugfs: constify drbd_version_fops
drbd: use seq_put instead of seq_print where possible
drbd: include linux/uaccess.h instead of asm/uaccess.h
drbd: use const char * const for drbd strings
drbd: kerneldoc warning fix in w_e_end_data_req()
drbd: use unsigned for one bit fields
drbd: use bool for peer is_ states
drbd: fix typo
drbd: use | for bitmask combination
drbd: use true/false for bool
drbd: fix drbd_bm_init() comments
drbd: introduce peer state union
drbd: fix maybe_pull_ahead() locking comments
drbd: use bool for growing
drbd: remove redundant declarations
drbd: replace if/BUG by BUG_ON

Signed-off-by: Fabian Frederick 
Signed-off-by: Roland Kammerer 
---
 drivers/block/drbd/drbd_bitmap.c   |  6 +++---
 drivers/block/drbd/drbd_debugfs.c  |  2 +-
 drivers/block/drbd/drbd_int.h  |  4 +---
 drivers/block/drbd/drbd_interval.h | 14 +++---
 drivers/block/drbd/drbd_main.c |  2 +-
 drivers/block/drbd/drbd_nl.c   | 14 --
 drivers/block/drbd/drbd_proc.c | 30 +++---
 drivers/block/drbd/drbd_receiver.c |  8 
 drivers/block/drbd/drbd_req.c  |  2 +-
 drivers/block/drbd/drbd_state.c|  4 +---
 drivers/block/drbd/drbd_state.h|  2 +-
 drivers/block/drbd/drbd_strings.c  |  8 
 drivers/block/drbd/drbd_worker.c   |  9 -
 include/linux/drbd.h   |  8 
 14 files changed, 59 insertions(+), 54 deletions(-)

diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index e5d89f6..095625b 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -427,8 +427,7 @@ static struct page **bm_realloc_pages(struct drbd_bitmap *b, unsigned long want)
 }
 
 /*
- * called on driver init only. TODO call when a device is created.
- * allocates the drbd_bitmap, and stores it in device->bitmap.
+ * allocates the drbd_bitmap and stores it in device->bitmap.
  */
 int drbd_bm_init(struct drbd_device *device)
 {
@@ -633,7 +632,8 @@ int drbd_bm_resize(struct drbd_device *device, sector_t capacity, int set_new_bi
unsigned long bits, words, owords, obits;
unsigned long want, have, onpages; /* number of pages */
struct page **npages, **opages = NULL;
-   int err = 0, growing;
+   int err = 0;
+   bool growing;
 
if (!expect(b))
return -ENOMEM;
diff --git a/drivers/block/drbd/drbd_debugfs.c b/drivers/block/drbd/drbd_debugfs.c
index 8a90812..be91a8d 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -903,7 +903,7 @@ static int drbd_version_open(struct inode *inode, struct file *file)
return single_open(file, drbd_version_show, NULL);
 }
 
-static struct file_operations drbd_version_fops = {
+static const struct file_operations drbd_version_fops = {
.owner = THIS_MODULE,
.open = drbd_version_open,
.llseek = seq_lseek,
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 995aa8d..2c9194d 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1499,7 +1499,7 @@ extern enum drbd_state_rv drbd_set_role(struct drbd_device *device,
int force);
 extern bool conn_try_outdate_peer(struct drbd_connection *connection);
 extern void conn_try_outdate_peer_async(struct drbd_connection *connection);
-extern int conn_khelper(struct drbd_connection *connection, char *cmd);
+extern enum drbd_peer_state conn_khelper(struct drbd_connection *connection, char *cmd);
 extern int drbd_khelper(struct drbd_device *device, char *cmd);
 
 /* drbd_worker.c */
@@ -1648,8 +1648,6 @@ void drbd_bump_write_ordering(struct drbd_resource *resource, struct drbd_backin
 /* drbd_proc.c */
 extern struct proc_dir_entry *drbd_proc;
 extern const struct file_operations drbd_proc_fops;
-extern const char *drbd_conn_str(enum drbd_conns s);
-extern const char *drbd_role_str(enum drbd_role s);
 
 /* drbd_actlog.c */
extern bool drbd_al_begin_io_prepare(struct drbd_device *device, struct drbd_interval *i);
diff --git a/drivers/block/drbd/drbd_interval.h b/drivers/block/drbd/drbd_interval.h
index f210543..23c5a94 100644
--- a/drivers/block/drbd/drbd_interval.h
+++ b/drivers/block/drbd/drbd_interval.h
@@ -6,13 +6,13 @@
 
 struct drbd_interval {
struct rb_node rb;
-   sector_t sector;/* start sector of the interval */
-   unsigned int size;  /* size in bytes */
-   sector_t end;   /* highest interval end in subtree */
-   int local:1 /* local or remote request? */;
-   int waiting:1;  /* someone is waiting for this to complete */
-   int completed:1;/* this has been completed already;
-* ignore for confli

[PATCH 22/30] drbd: introduce WRITE_SAME support

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

We will support WRITE_SAME, if
 * all peers support WRITE_SAME (both in kernel and DRBD version),
 * all peer devices support WRITE_SAME, and
 * logical_block_size is identical on all peers.
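The three conditions above can be sketched as a standalone predicate (the `struct peer_caps` layout and names are illustrative, not the DRBD structures):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-peer capabilities as negotiated at connect time. */
struct peer_caps {
    bool proto_wsame;                 /* kernel + DRBD support the feature */
    bool dev_wsame;                   /* backing device supports WRITE_SAME */
    unsigned int logical_block_size;  /* in bytes */
};

/* WRITE_SAME is usable only if every peer supports it and all peers
 * agree on the logical block size. */
static bool can_use_write_same(const struct peer_caps *peers, int n)
{
    for (int i = 0; i < n; i++) {
        if (!peers[i].proto_wsame || !peers[i].dev_wsame)
            return false;
        if (peers[i].logical_block_size != peers[0].logical_block_size)
            return false;
    }
    return true;
}
```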

We may at some point introduce a fallback on the receiving side
for devices/kernels that do not support WRITE_SAME,
by open-coding a submit loop. But not yet.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_actlog.c   |   9 ++-
 drivers/block/drbd/drbd_debugfs.c  |  11 +--
 drivers/block/drbd/drbd_int.h  |  13 ++--
 drivers/block/drbd/drbd_main.c |  82 +++---
 drivers/block/drbd/drbd_nl.c   |  88 +---
 drivers/block/drbd/drbd_protocol.h |  74 ++--
 drivers/block/drbd/drbd_receiver.c | 137 +++--
 drivers/block/drbd/drbd_req.c  |  13 ++--
 drivers/block/drbd/drbd_req.h  |   5 +-
 drivers/block/drbd/drbd_worker.c   |   8 ++-
 10 files changed, 360 insertions(+), 80 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index cafa9c4..f9af555 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -840,6 +840,13 @@ static int update_sync_bits(struct drbd_device *device,
return count;
 }
 
+static bool plausible_request_size(int size)
+{
+   return size > 0
+   && size <= DRBD_MAX_BATCH_BIO_SIZE
+   && IS_ALIGNED(size, 512);
+}
+
 /* clear the bit corresponding to the piece of storage in question:
  * size byte of data starting from sector.  Only clear a bits of the affected
  * one ore more _aligned_ BM_BLOCK_SIZE blocks.
@@ -859,7 +866,7 @@ int __drbd_change_sync(struct drbd_device *device, sector_t sector, int size,
if ((mode == SET_OUT_OF_SYNC) && size == 0)
return 0;
 
-   if (size <= 0 || !IS_ALIGNED(size, 512) || size > DRBD_MAX_DISCARD_SIZE) {
+   if (!plausible_request_size(size)) {
drbd_err(device, "%s: sector=%llus size=%d nonsense!\n",
drbd_change_sync_fname[mode],
(unsigned long long)sector, size);
diff --git a/drivers/block/drbd/drbd_debugfs.c b/drivers/block/drbd/drbd_debugfs.c
index 4de95bb..8a90812 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -237,14 +237,9 @@ static void seq_print_peer_request_flags(struct seq_file *m, struct drbd_peer_re
seq_print_rq_state_bit(m, f & EE_SEND_WRITE_ACK, &sep, "C");
seq_print_rq_state_bit(m, f & EE_MAY_SET_IN_SYNC, &sep, "set-in-sync");
 
-   if (f & EE_IS_TRIM) {
-   seq_putc(m, sep);
-   sep = '|';
-   if (f & EE_IS_TRIM_USE_ZEROOUT)
-   seq_puts(m, "zero-out");
-   else
-   seq_puts(m, "trim");
-   }
+   if (f & EE_IS_TRIM)
+   __seq_print_rq_state_bit(m, f & EE_IS_TRIM_USE_ZEROOUT, &sep, "zero-out", "trim");
+   seq_print_rq_state_bit(m, f & EE_WRITE_SAME, &sep, "write-same");
seq_putc(m, '\n');
 }
 
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 5ee8da3..995aa8d 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -468,6 +468,9 @@ enum {
/* this is/was a write request */
__EE_WRITE,
 
+   /* this is/was a write same request */
+   __EE_WRITE_SAME,
+
/* this originates from application on peer
 * (not some resync or verify or other DRBD internal request) */
__EE_APPLICATION,
@@ -487,6 +490,7 @@ enum {
 #define EE_IN_INTERVAL_TREE(1<<__EE_IN_INTERVAL_TREE)
 #define EE_SUBMITTED   (1<<__EE_SUBMITTED)
 #define EE_WRITE   (1<<__EE_WRITE)
+#define EE_WRITE_SAME  (1<<__EE_WRITE_SAME)
 #define EE_APPLICATION (1<<__EE_APPLICATION)
 #define EE_RS_THIN_REQ (1<<__EE_RS_THIN_REQ)
 
@@ -1350,8 +1354,8 @@ struct bm_extent {
 /* For now, don't allow more than half of what we can "activate" in one
  * activity log transaction to be discarded in one go. We may need to rework
  * drbd_al_begin_io() to allow for even larger discard ranges */
-#define DRBD_MAX_DISCARD_SIZE  (AL_UPDATES_PER_TRANSACTION/2*AL_EXTENT_SIZE)
-#define DRBD_MAX_DISCARD_SECTORS (DRBD_MAX_DISCARD_SIZE >> 9)
+#define DRBD_MAX_BATCH_BIO_SIZE (AL_UPDATES_PER_TRANSACTION/2*AL_EXTENT_SIZE)
+#define DRBD_MAX_BBIO_SECTORS (DRBD_MAX_BATCH_BIO_SIZE >> 9)
 
 extern int  drbd_bm_init(struct drbd_device *device);
extern int  drbd_bm_resize(struct drbd_device *device, sector_t sectors, int set_new_bits);
@@ -1488,7 +1492,

[PATCH 11/30] drbd: when receiving P_TRIM, zero-out partial unaligned chunks

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

We can avoid spurious data divergence caused by partially-ignored
discards on certain backends with discard_zeroes_data=0, if we
translate partial unaligned discard requests into explicit zero-out.

The relevant use case is LVM/DM thin.

If on different nodes, DRBD is backed by devices with differing
discard characteristics, discards may lead to data divergence
(old data or garbage left over on one backend, zeroes due to
unmapped areas on the other backend). Online verify would now
potentially report tons of spurious differences.

While probably harmless for most use cases (fstrim on a file system),
DRBD cannot have that, it would violate our promise to upper layers
that our data instances on the nodes are identical.

To be correct and play safe (make sure data is identical on both copies),
we would have to disable discard support, if our local backend (on a
Primary) does not support "discard_zeroes_data=true".

We'd also have to translate discards to explicit zero-out on the
receiving (typically: Secondary) side, unless the receiving side
supports "discard_zeroes_data=true".

Which both would allocate those blocks, instead of unmapping them,
in contrast with expectations.

LVM/DM thin does set discard_zeroes_data=0,
because it silently ignores discards to partial chunks.

We can work around this by checking the alignment first.
For discards that are unaligned (wrt. offset and granularity) or too small,
we zero-out the initial and/or trailing unaligned partial chunks,
but discard all the aligned full chunks.
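A userspace sketch of that split (units and names are illustrative; the real code works on bios and queue limits):

```c
#include <assert.h>

/* Split a discard [start, start+len) into a zeroed head, an aligned
 * middle that can be discarded, and a zeroed tail.  'gran' is the
 * discard granularity, all values in the same (illustrative) units. */
struct split {
    unsigned long long head;   /* zero-out before the middle */
    unsigned long long mid;    /* aligned part, safe to discard */
    unsigned long long tail;   /* zero-out after the middle */
};

static struct split split_discard(unsigned long long start,
                                  unsigned long long len,
                                  unsigned long long gran)
{
    struct split s = { 0, 0, 0 };
    unsigned long long a_start = (start + gran - 1) / gran * gran;
    unsigned long long end = start + len;
    unsigned long long a_end = end / gran * gran;

    if (a_start >= a_end) {    /* too small or fully unaligned: zero it all */
        s.head = len;
        return s;
    }
    s.head = a_start - start;
    s.mid  = a_end - a_start;
    s.tail = end - a_end;
    return s;
}
```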

At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".

Arguably it should behave this way internally, by default,
and we'll try to make that happen.

But our workaround is still valid for already deployed setups,
and for other devices that may behave this way.

Setting discard-zeroes-if-aligned=yes will allow DRBD to use
discards, and to announce discard_zeroes_data=true, even on
backends that announce discard_zeroes_data=false.

Setting discard-zeroes-if-aligned=no will cause DRBD to always
fall-back to zero-out on the receiving side, and to not even
announce discard capabilities on the Primary, if the respective
backend announces discard_zeroes_data=false.

We used to ignore the discard_zeroes_data setting completely.
To not break established and expected behaviour, and suddenly
cause fstrim on thin-provisioned LVs to run out-of-space,
instead of freeing up space, the default value is "yes".

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_int.h  |   2 +-
 drivers/block/drbd/drbd_nl.c   |  15 ++--
 drivers/block/drbd/drbd_receiver.c | 140 ++---
 include/linux/drbd_genl.h  |   1 +
 include/linux/drbd_limits.h|   6 ++
 5 files changed, 134 insertions(+), 30 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 9e338ec..f49ff86 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1488,7 +1488,7 @@ enum determine_dev_size {
 extern enum determine_dev_size
drbd_determine_dev_size(struct drbd_device *, enum dds_flags, struct resize_parms *) __must_hold(local);
 extern void resync_after_online_grow(struct drbd_device *);
-extern void drbd_reconsider_max_bio_size(struct drbd_device *device, struct drbd_backing_dev *bdev);
+extern void drbd_reconsider_queue_parameters(struct drbd_device *device, struct drbd_backing_dev *bdev);
 extern enum drbd_state_rv drbd_set_role(struct drbd_device *device,
enum drbd_role new_role,
int force);
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 3643f9c..8d757d6 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1161,13 +1161,17 @@ static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backi
unsigned int max_hw_sectors = max_bio_size >> 9;
unsigned int max_segments = 0;
struct request_queue *b = NULL;
+   struct disk_conf *dc;
+   bool discard_zeroes_if_aligned = true;
 
if (bdev) {
b = bdev->backing_bdev->bd_disk->queue;
 
max_hw_sectors = min(queue_max_hw_sectors(b), max_bio_size >> 9);
rcu_read_lock();
-   max_segments = rcu_dereference(device->ldev->disk_conf)->max_bio_bvecs;
+   dc = rcu_dereference(device->ldev->disk_conf);
+   max_segments = dc->max_bio_bvecs;
+   discard_zeroes_if_aligned = dc->discard_zeroes_if_aligned;
rcu_read_unlock();
 
blk_set_stacking_limits(&q->limits);
@@ -1185,7 +1189,7 @@ static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backi
 
blk_queue_max_discard_sec

[PATCH 06/30] drbd: Create the protocol feature THIN_RESYNC

2016-06-13 Thread Philipp Reisner
If thinly provisioned volumes are used, during a resync the sync source
tries to find out if a block is deallocated. If it is deallocated, then
the resync target uses blkdev_issue_zeroout() on the range in
question.
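Protocol features like FF_THIN_RESYNC are agreed by intersecting what both sides offer; a minimal sketch of that negotiation:

```c
#include <assert.h>

#define FF_TRIM        1u
#define FF_THIN_RESYNC 2u
#define PRO_FEATURES   (FF_TRIM | FF_THIN_RESYNC)

/* A feature is used only when both peers advertise it; older peers
 * simply do not set the bit, so the intersection degrades gracefully. */
static unsigned int agreed_features(unsigned int mine, unsigned int theirs)
{
    return mine & theirs;
}
```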

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_protocol.h |  1 +
 drivers/block/drbd/drbd_receiver.c |  5 -
 drivers/block/drbd/drbd_worker.c   | 13 -
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_protocol.h b/drivers/block/drbd/drbd_protocol.h
index ce0e72c..95ca458 100644
--- a/drivers/block/drbd/drbd_protocol.h
+++ b/drivers/block/drbd/drbd_protocol.h
@@ -165,6 +165,7 @@ struct p_block_req {
  */
 
 #define FF_TRIM  1
+#define FF_THIN_RESYNC 2
 
 struct p_connection_features {
u32 protocol_min;
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index f5eef97..a50cc99 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -48,7 +48,7 @@
 #include "drbd_req.h"
 #include "drbd_vli.h"
 
-#define PRO_FEATURES (FF_TRIM)
+#define PRO_FEATURES (FF_TRIM | FF_THIN_RESYNC)
 
 struct packet_info {
enum drbd_packet cmd;
@@ -4991,6 +4991,9 @@ static int drbd_do_features(struct drbd_connection *connection)
drbd_info(connection, "Agreed to%ssupport TRIM on protocol level\n",
  connection->agreed_features & FF_TRIM ? " " : " not ");
 
+   drbd_info(connection, "Agreed to%ssupport THIN_RESYNC on protocol level\n",
+ connection->agreed_features & FF_THIN_RESYNC ? " " : " not ");
+
return 1;
 
  incompat:
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index dd85433..154dbfc 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -583,6 +583,7 @@ static int make_resync_request(struct drbd_device *const device, int cancel)
int number, rollback_i, size;
int align, requeue = 0;
int i = 0;
+   int discard_granularity = 0;
 
if (unlikely(cancel))
return 0;
@@ -602,6 +603,12 @@ static int make_resync_request(struct drbd_device *const device, int cancel)
return 0;
}
 
+   if (connection->agreed_features & FF_THIN_RESYNC) {
+   rcu_read_lock();
+   discard_granularity = rcu_dereference(device->ldev->disk_conf)->rs_discard_granularity;
+   rcu_read_unlock();
+   }
+
max_bio_size = queue_max_hw_sectors(device->rq_queue) << 9;
number = drbd_rs_number_requests(device);
if (number <= 0)
@@ -666,6 +673,9 @@ next_sector:
if (sector & ((1<<(align+3))-1))
break;
 
+   if (discard_granularity && size == discard_granularity)
+   break;
+
/* do not cross extent boundaries */
if (((bit+1) & BM_BLOCKS_PER_BM_EXT_MASK) == 0)
break;
@@ -712,7 +722,8 @@ next_sector:
int err;
 
inc_rs_pending(device);
-   err = drbd_send_drequest(peer_device, P_RS_DATA_REQUEST,
+   err = drbd_send_drequest(peer_device,
+size == discard_granularity ? P_RS_THIN_REQ : P_RS_DATA_REQUEST,
 sector, size, ID_SYNCER);
if (err) {
drbd_err(device, "drbd_send_drequest() failed, aborting...\n");
-- 
2.7.4



[PATCH 10/30] drbd: allow parallel flushes for multi-volume resources

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

To maintain write-order fidelity across all volumes in a DRBD resource,
the receiver of a P_BARRIER needs to issue flushes to all volumes.
We used to do this by calling blkdev_issue_flush(), synchronously,
one volume at a time.

We now submit all flushes to all volumes in parallel, then wait for all
completions, to reduce worst-case latencies on multi-volume resources.
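The submit-all-then-wait pattern used below can be modeled in userspace. This sketch keeps only the reference counting (one reference held by the submitter, one per in-flight flush; the last drop signals completion), with a plain `int` standing in for `struct completion`:

```c
#include <assert.h>
#include <stdatomic.h>

/* Minimal model of issue_flush_context from the patch. */
struct flush_ctx {
    atomic_int pending;
    int error;
    int done;            /* stands in for struct completion */
};

static void ctx_init(struct flush_ctx *c)
{
    atomic_init(&c->pending, 1);   /* the submitter's own reference */
    c->error = 0;
    c->done = 0;
}

static void submit_one(struct flush_ctx *c)
{
    atomic_fetch_add(&c->pending, 1);
    /* ...submit_bio() would go here... */
}

/* Called from each flush completion, and once by the submitter to
 * drop its own reference after issuing all flushes. */
static void complete_one(struct flush_ctx *c, int error)
{
    if (error)
        c->error = error;
    if (atomic_fetch_sub(&c->pending, 1) == 1)
        c->done = 1;       /* last reference: complete(&ctx->done) */
}
```

The submitter's extra initial reference guarantees `done` cannot fire before all volumes have been submitted, even if every flush completes immediately.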

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 114 +
 1 file changed, 89 insertions(+), 25 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index a50cc99..15b2a0d 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1204,13 +1204,84 @@ static int drbd_recv_header(struct drbd_connection *connection, struct packet_in
return err;
 }
 
-static void drbd_flush(struct drbd_connection *connection)
+/* This is blkdev_issue_flush, but asynchronous.
+ * We want to submit to all component volumes in parallel,
+ * then wait for all completions.
+ */
+struct issue_flush_context {
+   atomic_t pending;
+   int error;
+   struct completion done;
+};
+struct one_flush_context {
+   struct drbd_device *device;
+   struct issue_flush_context *ctx;
+};
+
+void one_flush_endio(struct bio *bio)
 {
-   int rv;
-   struct drbd_peer_device *peer_device;
-   int vnr;
+   struct one_flush_context *octx = bio->bi_private;
+   struct drbd_device *device = octx->device;
+   struct issue_flush_context *ctx = octx->ctx;
+
+   if (bio->bi_error) {
+   ctx->error = bio->bi_error;
+   drbd_info(device, "local disk FLUSH FAILED with status %d\n", 
bio->bi_error);
+   }
+   kfree(octx);
+   bio_put(bio);
+
+   clear_bit(FLUSH_PENDING, &device->flags);
+   put_ldev(device);
+   kref_put(&device->kref, drbd_destroy_device);
+
+   if (atomic_dec_and_test(&ctx->pending))
+   complete(&ctx->done);
+}
+
+static void submit_one_flush(struct drbd_device *device, struct issue_flush_context *ctx)
+{
+   struct bio *bio = bio_alloc(GFP_NOIO, 0);
+   struct one_flush_context *octx = kmalloc(sizeof(*octx), GFP_NOIO);
+   if (!bio || !octx) {
+   drbd_warn(device, "Could not allocate a bio, CANNOT ISSUE 
FLUSH\n");
+   /* FIXME: what else can I do now?  disconnecting or detaching
+* really does not help to improve the state of the world, 
either.
+*/
+   kfree(octx);
+   if (bio)
+   bio_put(bio);
 
+   ctx->error = -ENOMEM;
+   put_ldev(device);
+   kref_put(&device->kref, drbd_destroy_device);
+   return;
+   }
+
+   octx->device = device;
+   octx->ctx = ctx;
+   bio->bi_bdev = device->ldev->backing_bdev;
+   bio->bi_private = octx;
+   bio->bi_end_io = one_flush_endio;
+   bio_set_op_attrs(bio, REQ_OP_FLUSH, WRITE_FLUSH);
+
+   device->flush_jif = jiffies;
+   set_bit(FLUSH_PENDING, &device->flags);
+   atomic_inc(&ctx->pending);
+   submit_bio(bio);
+}
+
+static void drbd_flush(struct drbd_connection *connection)
+{
if (connection->resource->write_ordering >= WO_BDEV_FLUSH) {
+   struct drbd_peer_device *peer_device;
+   struct issue_flush_context ctx;
+   int vnr;
+
+   atomic_set(&ctx.pending, 1);
+   ctx.error = 0;
+   init_completion(&ctx.done);
+
rcu_read_lock();
idr_for_each_entry(&connection->peer_devices, peer_device, vnr) {
struct drbd_device *device = peer_device->device;
@@ -1220,31 +1291,24 @@ static void drbd_flush(struct drbd_connection *connection)
kref_get(&device->kref);
rcu_read_unlock();
 
-   /* Right now, we have only this one synchronous code path
-* for flushes between request epochs.
-* We may want to make those asynchronous,
-* or at least parallelize the flushes to the volume devices.
-*/
-*/
-   device->flush_jif = jiffies;
-   set_bit(FLUSH_PENDING, &device->flags);
-   rv = blkdev_issue_flush(device->ldev->backing_bdev,
-   GFP_NOIO, NULL);
-   clear_bit(FLUSH_PENDING, &device->flags);
-   if (rv) {
-   drbd_info(device, "local disk flush failed with 
status %d\n", rv);
- 

[PATCH 05/30] drbd: Introduce new disk config option rs-discard-granularity

2016-06-13 Thread Philipp Reisner
As long as the value is 0, the feature is disabled. When it is set to a
positive value, DRBD limits and aligns its resync requests
to the rs-discard-granularity setting. If the sync source detects
all zeros in such a block, the resync target discards the range
on disk.
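The sanitizing done against the backing queue limits can be sketched as a pure function. This is a simplified model with illustrative names, and it rounds up to a proper multiple of the device granularity rather than reproducing the in-tree arithmetic exactly:

```c
#include <assert.h>

/* Sketch: disable the feature if the device cannot discard, raise the
 * setting to (a multiple of) the device granularity, and cap it at the
 * device's maximum discard size (all values in bytes). */
static unsigned int sanitize_rs_discard(unsigned int rs_gran,
                                        unsigned int dev_gran,
                                        unsigned int max_bytes,
                                        int can_discard)
{
    if (!can_discard || rs_gran == 0)
        return 0;                          /* feature disabled */
    if (rs_gran < dev_gran)
        rs_gran = dev_gran;
    rs_gran = (rs_gran + dev_gran - 1) / dev_gran * dev_gran;
    if (rs_gran > max_bytes)
        rs_gran = max_bytes;
    return rs_gran;
}
```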

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 32 +---
 include/linux/drbd_genl.h|  6 +++---
 include/linux/drbd_limits.h  |  6 ++
 3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index fad03e4..99339df 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1348,12 +1348,38 @@ static bool write_ordering_changed(struct disk_conf *a, struct disk_conf *b)
a->disk_drain != b->disk_drain;
 }
 
-static void sanitize_disk_conf(struct disk_conf *disk_conf, struct drbd_backing_dev *nbc)
+static void sanitize_disk_conf(struct drbd_device *device, struct disk_conf *disk_conf,
+  struct drbd_backing_dev *nbc)
 {
+   struct request_queue * const q = nbc->backing_bdev->bd_disk->queue;
+
if (disk_conf->al_extents < DRBD_AL_EXTENTS_MIN)
disk_conf->al_extents = DRBD_AL_EXTENTS_MIN;
if (disk_conf->al_extents > drbd_al_extents_max(nbc))
disk_conf->al_extents = drbd_al_extents_max(nbc);
+
+   if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data) {
+   disk_conf->rs_discard_granularity = 0; /* disable feature */
+   drbd_info(device, "rs_discard_granularity feature disabled\n");
+   }
+
+   if (disk_conf->rs_discard_granularity) {
+   int orig_value = disk_conf->rs_discard_granularity;
+   int remainder;
+
+   if (q->limits.discard_granularity > disk_conf->rs_discard_granularity)
+   disk_conf->rs_discard_granularity = q->limits.discard_granularity;
+
+   remainder = disk_conf->rs_discard_granularity % q->limits.discard_granularity;
+   disk_conf->rs_discard_granularity += remainder;
+
+   if (disk_conf->rs_discard_granularity > q->limits.max_discard_sectors << 9)
+   disk_conf->rs_discard_granularity = q->limits.max_discard_sectors << 9;
+
+   if (disk_conf->rs_discard_granularity != orig_value)
+   drbd_info(device, "rs_discard_granularity changed to %d\n",
+ disk_conf->rs_discard_granularity);
+   }
 }
 
 int drbd_adm_disk_opts(struct sk_buff *skb, struct genl_info *info)
@@ -1403,7 +1429,7 @@ int drbd_adm_disk_opts(struct sk_buff *skb, struct genl_info *info)
if (!expect(new_disk_conf->resync_rate >= 1))
new_disk_conf->resync_rate = 1;
 
-   sanitize_disk_conf(new_disk_conf, device->ldev);
+   sanitize_disk_conf(device, new_disk_conf, device->ldev);
 
if (new_disk_conf->c_plan_ahead > DRBD_C_PLAN_AHEAD_MAX)
new_disk_conf->c_plan_ahead = DRBD_C_PLAN_AHEAD_MAX;
@@ -1698,7 +1724,7 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info *info)
if (retcode != NO_ERROR)
goto fail;
 
-   sanitize_disk_conf(new_disk_conf, nbc);
+   sanitize_disk_conf(device, new_disk_conf, nbc);
 
if (drbd_get_max_capacity(nbc) < new_disk_conf->disk_size) {
drbd_err(device, "max capacity %llu smaller than disk size %llu\n",
diff --git a/include/linux/drbd_genl.h b/include/linux/drbd_genl.h
index 2d0e5ad..ab649d8 100644
--- a/include/linux/drbd_genl.h
+++ b/include/linux/drbd_genl.h
@@ -123,14 +123,14 @@ GENL_struct(DRBD_NLA_DISK_CONF, 3, disk_conf,
__u32_field_def(13, DRBD_GENLA_F_MANDATORY, c_fill_target, DRBD_C_FILL_TARGET_DEF)
__u32_field_def(14, DRBD_GENLA_F_MANDATORY, c_max_rate, DRBD_C_MAX_RATE_DEF)
__u32_field_def(15, DRBD_GENLA_F_MANDATORY, c_min_rate, DRBD_C_MIN_RATE_DEF)
+   __u32_field_def(20, DRBD_GENLA_F_MANDATORY, disk_timeout, DRBD_DISK_TIMEOUT_DEF)
+   __u32_field_def(21, 0 /* OPTIONAL */,   read_balancing, DRBD_READ_BALANCING_DEF)
+   __u32_field_def(25, 0 /* OPTIONAL */,   rs_discard_granularity, DRBD_RS_DISCARD_GRANULARITY_DEF)

__flg_field_def(16, DRBD_GENLA_F_MANDATORY, disk_barrier, DRBD_DISK_BARRIER_DEF)
__flg_field_def(17, DRBD_GENLA_F_MANDATORY, disk_flushes, DRBD_DISK_FLUSHES_DEF)
__flg_field_def(18, DRBD_GENLA_F_MANDATORY, disk_drain, DRBD_DISK_DRAIN_DEF)
__flg_field_def(19, DRBD_GENLA_F_MANDATORY, md_flushes, DRBD_MD_FLUSHES_DEF)
-   __u32_field_def(20, DRBD_GENLA_F_MANDATORY, disk_timeout, DRBD_DISK_TIMEOUT_DEF)
-   __u32_fie

[PATCH 09/30] drbd: fix for truncated minor number in callback command line

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

The command line parameter the kernel module uses to communicate the
device minor to the userland helper is flawed: the device
identifier "minor-%d" is truncated to minors with a maximum
of 5 digits.

But DRBD 8.4 allows 2^20 == 1048576 minors,
thus a minimum of 7 digits must be supported.

Reported by Veit Wahlich on drbd-dev.
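
The sizing can be checked in user space; this sketch (illustrative,
not kernel code) shows why "minor-" (6 chars) plus up to 7 digits plus
the terminating NUL needs 14 bytes:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* snprintf returns the would-be length; if it is >= len,
 * the output was truncated. 2^20 - 1 = 1048575 has 7 digits. */
static int format_minor(char *buf, size_t len, unsigned int minor)
{
	return snprintf(buf, len, "minor-%u", minor);
}
```

With the old 12-byte buffer, "minor-1048575" (13 characters) is cut
down to "minor-10485", silently pointing the helper at the wrong minor.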

Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 99339df..3643f9c 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -343,7 +343,7 @@ int drbd_khelper(struct drbd_device *device, char *cmd)
 (char[20]) { }, /* address family */
 (char[60]) { }, /* address */
NULL };
-   char mb[12];
+   char mb[14];
char *argv[] = {usermode_helper, cmd, mb, NULL };
struct drbd_connection *connection = first_peer_device(device)->connection;
struct sib_info sib;
@@ -352,7 +352,7 @@ int drbd_khelper(struct drbd_device *device, char *cmd)
if (current == connection->worker.task)
set_bit(CALLBACK_PENDING, &connection->flags);
 
-   snprintf(mb, 12, "minor-%d", device_to_minor(device));
+   snprintf(mb, 14, "minor-%d", device_to_minor(device));
setup_khelper_env(connection, envp);
 
/* The helper may take some time.
-- 
2.7.4



[PATCH 12/30] drbd: possibly disable discard support, if backend has discard_zeroes_data=0

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Now that we have the discard_zeroes_if_aligned setting, we should also
check it when setting up our queue parameters on the primary,
not only on the receiving side.

We announce discard support,
UNLESS

 * we are connected to a peer that does not support TRIM
   on the DRBD protocol level.  Otherwise, it would either discard, or
   do a fallback to zero-out, depending on its backend and configuration.

 * our local backend does not support discards,
   or (discard_zeroes_data=0 AND discard_zeroes_if_aligned=no).
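
The two conditions above reduce to a small predicate. A hedged
user-space sketch (the bool parameters stand in for the queue limits
and connection state; they are not the kernel's types):

```c
#include <assert.h>
#include <stdbool.h>

/* Announce discard support unless the backend cannot discard,
 * cannot be treated as zeroing discarded blocks, or a connected
 * peer lacks DRBD-protocol TRIM support. Before the connection is
 * established, the peer check does not apply yet. */
static bool announce_discards(bool backend_can_discard,
			      bool discard_zeroes_data,
			      bool discard_zeroes_if_aligned,
			      bool connected, bool peer_has_trim)
{
	bool can_do = backend_can_discard;

	if (can_do && !discard_zeroes_data && !discard_zeroes_if_aligned)
		can_do = false;
	if (can_do && connected && !peer_has_trim)
		can_do = false;
	return can_do;
}
```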

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 80 ++--
 1 file changed, 55 insertions(+), 25 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 8d757d6..12e9b31 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1154,6 +1154,59 @@ static int drbd_check_al_size(struct drbd_device *device, struct disk_conf *dc)
return 0;
 }
 
+static void blk_queue_discard_granularity(struct request_queue *q, unsigned int granularity)
+{
+   q->limits.discard_granularity = granularity;
+}
+static void decide_on_discard_support(struct drbd_device *device,
+   struct request_queue *q,
+   struct request_queue *b,
+   bool discard_zeroes_if_aligned)
+{
+   /* q = drbd device queue (device->rq_queue)
+* b = backing device queue (device->ldev->backing_bdev->bd_disk->queue),
+* or NULL if diskless
+*/
+   struct drbd_connection *connection = first_peer_device(device)->connection;
+   bool can_do = b ? blk_queue_discard(b) : true;
+
+   if (can_do && b && !b->limits.discard_zeroes_data && !discard_zeroes_if_aligned) {
+   can_do = false;
+   drbd_info(device, "discard_zeroes_data=0 and discard_zeroes_if_aligned=no: disabling discards\n");
+   }
+   if (can_do && connection->cstate >= C_CONNECTED && !(connection->agreed_features & FF_TRIM)) {
+   can_do = false;
+   drbd_info(connection, "peer DRBD too old, does not support TRIM: disabling discards\n");
+   }
+   if (can_do) {
+   /* We don't care for the granularity, really.
+* Stacking limits below should fix it for the local
+* device.  Whether or not it is a suitable granularity
+* on the remote device is not our problem, really. If
+* you care, you need to use devices with similar
+* topology on all peers. */
+   blk_queue_discard_granularity(q, 512);
+   q->limits.max_discard_sectors = DRBD_MAX_DISCARD_SECTORS;
+   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
+   } else {
+   queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
+   blk_queue_discard_granularity(q, 0);
+   q->limits.max_discard_sectors = 0;
+   }
+}
+
+static void fixup_discard_if_not_supported(struct request_queue *q)
+{
+   /* To avoid confusion, if this queue does not support discard, clear
+* max_discard_sectors, which is what lsblk -D reports to the user.
+* Older kernels got this wrong in "stack limits".
+* */
+   if (!blk_queue_discard(q)) {
+   blk_queue_max_discard_sectors(q, 0);
+   blk_queue_discard_granularity(q, 0);
+   }
+}
+
static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backing_dev *bdev,
   unsigned int max_bio_size)
 {
@@ -1183,26 +1236,8 @@ static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backi
/* This is the workaround for "bio would need to, but cannot, be split" */
blk_queue_max_segments(q, max_segments ? max_segments : BLK_MAX_SEGMENTS);
blk_queue_segment_boundary(q, PAGE_SIZE-1);
-
+   decide_on_discard_support(device, q, b, discard_zeroes_if_aligned);
if (b) {
-   struct drbd_connection *connection = first_peer_device(device)->connection;
-
-   blk_queue_max_discard_sectors(q, DRBD_MAX_DISCARD_SECTORS);
-
-   if (blk_queue_discard(b) && (b->limits.discard_zeroes_data || discard_zeroes_if_aligned) &&
-   (connection->cstate < C_CONNECTED || connection->agreed_features & FF_TRIM)) {
-   /* We don't care, stacking below should fix it for the local device.
-* Whether or not it is a suitable granularity on the remote device
-* is not our problem, really. If you care, you need to
-* use devices with similar topology on a
[PATCH 02/30] drbd: change bitmap write-out when leaving resync states

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

When leaving resync states because of disconnect,
do the bitmap write-out synchronously in the drbd_disconnected() path.

When leaving resync states because we go back to AHEAD/BEHIND, or
because resync actually finished, or some disk was lost during resync,
trigger the write-out from after_state_ch().

The bitmap write-out for resync -> ahead/behind was missing completely before.

Note that this is all only an optimization to avoid double-resyncs of
already completed blocks in case this node crashes.
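
The state transition this patch keys on can be written as a small
predicate. A user-space sketch; the enum below is an illustrative
subset of drbd's connection states with their relative ordering, not
the full enum from drbd.h:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative subset; the real enum drbd_conns has many more
 * members between these, but the ordering shown here is what the
 * range comparisons rely on. */
enum conn { C_CONNECTED = 10, C_SYNC_SOURCE = 16, C_SYNC_TARGET = 17,
	    C_AHEAD = 22, C_BEHIND = 23 };

/* Trigger bitmap write-out from after_state_ch() when leaving a
 * resync state (strictly between C_CONNECTED and C_AHEAD) towards
 * connected or ahead/behind. Connection loss is handled separately
 * in drbd_disconnected(). */
static bool writeout_on_state_change(enum conn os, enum conn ns)
{
	return os > C_CONNECTED && os < C_AHEAD &&
	       (ns == C_CONNECTED || ns >= C_AHEAD);
}
```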

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 8 +---
 drivers/block/drbd/drbd_state.c| 9 +++--
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 1ee0023..dcadea2 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -4795,9 +4795,11 @@ static int drbd_disconnected(struct drbd_peer_device *peer_device)
 
drbd_md_sync(device);
 
-   /* serialize with bitmap writeout triggered by the state change,
-* if any. */
-   wait_event(device->misc_wait, !test_bit(BITMAP_IO, &device->flags));
+   if (get_ldev(device)) {
+   drbd_bitmap_io(device, &drbd_bm_write_copy_pages,
+   "write from disconnected", BM_LOCKED_CHANGE_ALLOWED);
+   put_ldev(device);
+   }
 
/* tcp_close and release of sendpage pages can be deferred.  I don't
 * want to use SO_LINGER, because apparently it can be deferred for
diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 5a7ef78..59c6467 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -1934,12 +1934,17 @@ static void after_state_ch(struct drbd_device *device, union drbd_state os,
 
/* This triggers bitmap writeout of potentially still unwritten pages
 * if the resync finished cleanly, or aborted because of peer disk
-* failure, or because of connection loss.
+* failure, or on transition from resync back to AHEAD/BEHIND.
+*
+* Connection loss is handled in drbd_disconnected() by the receiver.
+*
 * For resync aborted because of local disk failure, we cannot do
 * any bitmap writeout anymore.
+*
 * No harm done if some bits change during this phase.
 */
-   if (os.conn > C_CONNECTED && ns.conn <= C_CONNECTED && get_ldev(device)) {
+   if ((os.conn > C_CONNECTED && os.conn < C_AHEAD) &&
+   (ns.conn == C_CONNECTED || ns.conn >= C_AHEAD) && get_ldev(device)) {
drbd_queue_bitmap_io(device, &drbd_bm_write_copy_pages, NULL,
"write from resync_finished", BM_LOCKED_CHANGE_ALLOWED);
put_ldev(device);
-- 
2.7.4



[PATCH 03/30] drbd: Kill code duplication

2016-06-13 Thread Philipp Reisner
Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 0bac9c8..fad03e4 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1348,6 +1348,14 @@ static bool write_ordering_changed(struct disk_conf *a, struct disk_conf *b)
a->disk_drain != b->disk_drain;
 }
 
+static void sanitize_disk_conf(struct disk_conf *disk_conf, struct drbd_backing_dev *nbc)
+{
+   if (disk_conf->al_extents < DRBD_AL_EXTENTS_MIN)
+   disk_conf->al_extents = DRBD_AL_EXTENTS_MIN;
+   if (disk_conf->al_extents > drbd_al_extents_max(nbc))
+   disk_conf->al_extents = drbd_al_extents_max(nbc);
+}
+
 int drbd_adm_disk_opts(struct sk_buff *skb, struct genl_info *info)
 {
struct drbd_config_context adm_ctx;
@@ -1395,10 +1403,7 @@ int drbd_adm_disk_opts(struct sk_buff *skb, struct genl_info *info)
if (!expect(new_disk_conf->resync_rate >= 1))
new_disk_conf->resync_rate = 1;
 
-   if (new_disk_conf->al_extents < DRBD_AL_EXTENTS_MIN)
-   new_disk_conf->al_extents = DRBD_AL_EXTENTS_MIN;
-   if (new_disk_conf->al_extents > drbd_al_extents_max(device->ldev))
-   new_disk_conf->al_extents = drbd_al_extents_max(device->ldev);
+   sanitize_disk_conf(new_disk_conf, device->ldev);
 
if (new_disk_conf->c_plan_ahead > DRBD_C_PLAN_AHEAD_MAX)
new_disk_conf->c_plan_ahead = DRBD_C_PLAN_AHEAD_MAX;
@@ -1693,10 +1698,7 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info *info)
if (retcode != NO_ERROR)
goto fail;
 
-   if (new_disk_conf->al_extents < DRBD_AL_EXTENTS_MIN)
-   new_disk_conf->al_extents = DRBD_AL_EXTENTS_MIN;
-   if (new_disk_conf->al_extents > drbd_al_extents_max(nbc))
-   new_disk_conf->al_extents = drbd_al_extents_max(nbc);
+   sanitize_disk_conf(new_disk_conf, nbc);
 
if (drbd_get_max_capacity(nbc) < new_disk_conf->disk_size) {
drbd_err(device, "max capacity %llu smaller than disk size %llu\n",
-- 
2.7.4



[PATCH 01/30] drbd: bitmap bulk IO: do not always suspend IO

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

The intention was to only suspend IO if some normal bitmap operation is
supposed to be locked out, not always. If the bulk operation is flagged
as BM_LOCKED_CHANGE_ALLOWED, we do not need to suspend IO.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_main.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 2b37744..2891631 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -3587,18 +3587,20 @@ void drbd_queue_bitmap_io(struct drbd_device *device,
int drbd_bitmap_io(struct drbd_device *device, int (*io_fn)(struct drbd_device *),
char *why, enum bm_flag flags)
 {
+   /* Only suspend io, if some operation is supposed to be locked out */
+   const bool do_suspend_io = flags & (BM_DONT_CLEAR|BM_DONT_SET|BM_DONT_TEST);
int rv;
 
D_ASSERT(device, current != first_peer_device(device)->connection->worker.task);
 
-   if ((flags & BM_LOCKED_SET_ALLOWED) == 0)
+   if (do_suspend_io)
drbd_suspend_io(device);
 
drbd_bm_lock(device, why, flags);
rv = io_fn(device);
drbd_bm_unlock(device);
 
-   if ((flags & BM_LOCKED_SET_ALLOWED) == 0)
+   if (do_suspend_io)
drbd_resume_io(device);
 
return rv;
-- 
2.7.4



[PATCH 23/30] drbd: sync_handshake: handle identical uuids with current (frozen) Primary

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

If in a two-primary scenario, we lost our peer, freeze IO,
and are still frozen (no UUID rotation) when the peer comes back
as Secondary after a hard crash, we will see identical UUIDs.

The "rule_nr = 40" chose to use the "CRASHED_PRIMARY" bit as
arbitration, but that would cause the still running (but frozen) Primary
to become SyncTarget (which it typically refuses), and the handshake is
declined.

Fix: check current roles.
If we have *one* current primary, the Primary wins.
(rule_nr = 41)

Since that is a protocol change, use the newly introduced DRBD_FF_WSAME
to determine if rule_nr = 41 can be applied.
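
The arbitration the patch describes can be summarized as a pure
function. A user-space sketch (the function name and bool parameters
are illustrative stand-ins for the role checks in drbd_uuid_compare()):

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome convention as in drbd_uuid_compare():
 * 1 = I become sync source, -1 = peer becomes sync source,
 * -100 = unresolvable, 0 = fall through to the old
 * crashed-primary arbitration (rule_nr = 40). */
static int rule_41(bool self_primary, bool peer_primary)
{
	if (self_primary && peer_primary)
		return -100;	/* both primary, no UUID rotation: give up */
	if (self_primary)
		return 1;	/* the single current Primary wins */
	if (peer_primary)
		return -1;
	return 0;		/* both secondary: rule 40 applies */
}
```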

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 47 +++---
 1 file changed, 44 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index b25600e..577a187 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -3194,7 +3194,8 @@ static void drbd_uuid_dump(struct drbd_device *device, char *text, u64 *uuid,
 -1091   requires proto 91
 -1096   requires proto 96
  */
-static int drbd_uuid_compare(struct drbd_device *const device, int *rule_nr) __must_hold(local)
+
+static int drbd_uuid_compare(struct drbd_device *const device, enum drbd_role const peer_role, int *rule_nr) __must_hold(local)
 {
struct drbd_peer_device *const peer_device = first_peer_device(device);
struct drbd_connection *const connection = peer_device ? peer_device->connection : NULL;
@@ -3274,8 +3275,39 @@ static int drbd_uuid_compare(struct drbd_device *const device, int *rule_nr) __m
 * next bit (weight 2) is set when peer was primary */
*rule_nr = 40;
 
+   /* Neither has the "crashed primary" flag set,
+* only a replication link hickup. */
+   if (rct == 0)
+   return 0;
+
+   /* Current UUID equal and no bitmap uuid; does not necessarily
+* mean this was a "simultaneous hard crash", maybe IO was
+* frozen, so no UUID-bump happened.
+* This is a protocol change, overload DRBD_FF_WSAME as flag
+* for "new-enough" peer DRBD version. */
+   if (device->state.role == R_PRIMARY || peer_role == R_PRIMARY) {
+   *rule_nr = 41;
+   if (!(connection->agreed_features & DRBD_FF_WSAME)) {
+   drbd_warn(peer_device, "Equivalent unrotated UUIDs, but current primary present.\n");
+   return -(0x1 | PRO_VERSION_MAX | (DRBD_FF_WSAME << 8));
+   }
+   if (device->state.role == R_PRIMARY && peer_role == R_PRIMARY) {
+   /* At least one has the "crashed primary" bit set,
+* both are primary now, but neither has rotated its UUIDs?
+* "Can not happen." */
+   drbd_err(peer_device, "Equivalent unrotated UUIDs, but both are primary. Can not resolve this.\n");
+   return -100;
+   }
+   if (device->state.role == R_PRIMARY)
+   return 1;
+   return -1;
+   }
+
+   /* Both are secondary.
+* Really looks like recovery from simultaneous hard crash.
+* Check which had been primary before, and arbitrate. */
switch (rct) {
-   case 0: /* !self_pri && !peer_pri */ return 0;
case 0: /* !self_pri && !peer_pri */ return 0; /* already handled */
case 1: /*  self_pri && !peer_pri */ return 1;
case 2: /* !self_pri &&  peer_pri */ return -1;
case 3: /*  self_pri &&  peer_pri */
@@ -3402,7 +3434,7 @@ static enum drbd_conns drbd_sync_handshake(struct drbd_peer_device *peer_device,
drbd_uuid_dump(device, "peer", device->p_uuid,
   device->p_uuid[UI_SIZE], device->p_uuid[UI_FLAGS]);
 
-   hg = drbd_uuid_compare(device, &rule_nr);
+   hg = drbd_uuid_compare(device, peer_role, &rule_nr);
spin_unlock_irq(&device->ldev->md.uuid_lock);
 
drbd_info(device, "uuid_compare()=%d by rule %d\n", hg, rule_nr);
@@ -3411,6 +3443,15 @@ static enum drbd_conns drbd_sync_handshake(struct drbd_peer_device *peer_device,
drbd_alert(device, "Unrelated data, aborting!\n");
return C_MASK;
}
+   if (hg < -0x1) {
+   int proto, fflags;
+ 

[PATCH 18/30] drbd: if there is no good data accessible, writes should be IO errors

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

If DRBD lost all path to good data,
and the on-no-data-accessible policy is OND_SUSPEND_IO,
all pending and new IO requests are suspended (will block).

If that setting is OND_IO_ERROR, IO will still be completed.
READ to "clean" areas (e.g. on a D_INCONSISTENT device,
where the bitmap indicates a block is already in sync) will succeed.
READ to "unclean" areas (bitmap indicates block is out-of-sync)
will return EIO.

If we are already D_DISKLESS (or D_FAILED), we also return EIO.

Unfortunately, on a former R_PRIMARY C_SYNC_TARGET D_INCONSISTENT,
after replication link loss, new WRITE requests still went through OK.

They would also set the "out-of-sync" bit on their way, so READ after
WRITE would still return EIO. Also, the data generation UUIDs had not
been bumped, we would cause data divergence, without being able to
detect it on the next sync handshake, given the right sequence of events
in a multiple error scenario and "improper" order of recovery actions.

The right thing to do is to return EIO for all new writes,
unless we have access to good, current, D_UP_TO_DATE data.

The "established best practices" way to avoid these situations in the
first place is to set OND_SUSPEND_IO, or even do a hard-reset from
the pri-on-incon-degr policy helper hook.
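
The write gate this patch adds boils down to a two-line predicate; a
user-space restatement (the two bools stand in for the checks of
device->state.disk and ->pdsk against D_UP_TO_DATE):

```c
#include <assert.h>
#include <stdbool.h>

/* Allow new writes only if at least one copy, local backing disk or
 * peer disk, is known to be D_UP_TO_DATE; otherwise the request is
 * failed (EIO) or suspended, depending on the on-no-data-accessible
 * policy. */
static bool may_do_writes(bool local_up_to_date, bool peer_up_to_date)
{
	return local_up_to_date || peer_up_to_date;
}
```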

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_req.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 355cf10..68151271f 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1258,6 +1258,22 @@ drbd_request_prepare(struct drbd_device *device, struct bio *bio, unsigned long
return NULL;
 }
 
+/* Require at least one path to current data.
+ * We don't want to allow writes on C_STANDALONE D_INCONSISTENT:
+ * We would not allow to read what was written,
+ * we would not have bumped the data generation uuids,
+ * we would cause data divergence for all the wrong reasons.
+ *
+ * If we don't see at least one D_UP_TO_DATE, we will fail this request,
+ * which either returns EIO, or, if OND_SUSPEND_IO is set, suspends IO,
+ * and queues for retry later.
+ */
+static bool may_do_writes(struct drbd_device *device)
+{
+   const union drbd_dev_state s = device->state;
+   return s.disk == D_UP_TO_DATE || s.pdsk == D_UP_TO_DATE;
+}
+
 static void drbd_send_and_submit(struct drbd_device *device, struct 
drbd_request *req)
 {
struct drbd_resource *resource = device->resource;
@@ -1312,6 +1328,12 @@ static void drbd_send_and_submit(struct drbd_device *device, struct drbd_request
}
 
if (rw == WRITE) {
+   if (req->private_bio && !may_do_writes(device)) {
+   bio_put(req->private_bio);
+   req->private_bio = NULL;
+   put_ldev(device);
+   goto nodata;
+   }
if (!drbd_process_write_request(req))
no_remote = true;
} else {
-- 
2.7.4



[PATCH 19/30] drbd: only restart frozen disk io when D_UP_TO_DATE

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

When re-attaching the local backend device to a C_STANDALONE D_DISKLESS
R_PRIMARY with OND_SUSPEND_IO, we may only resume IO if we recognize the
backend that is being attached as D_UP_TO_DATE.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 59c6467..24422e8 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -1675,7 +1675,7 @@ static void after_state_ch(struct drbd_device *device, union drbd_state os,
what = RESEND;
 
if ((os.disk == D_ATTACHING || os.disk == D_NEGOTIATING) &&
-   conn_lowest_disk(connection) > D_NEGOTIATING)
+   conn_lowest_disk(connection) == D_UP_TO_DATE)
what = RESTART_FROZEN_DISK_IO;
 
if (resource->susp_nod && what != NOTHING) {
-- 
2.7.4



[PATCH 13/30] drbd: zero-out partial unaligned discards on local backend

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

For consistency, also zero-out partial unaligned chunks of discard
requests on the local backend.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_int.h |  2 ++
 drivers/block/drbd/drbd_req.c | 29 +++--
 2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index f49ff86..0b5a658 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1553,6 +1553,8 @@ extern void start_resync_timer_fn(unsigned long data);
 extern void drbd_endio_write_sec_final(struct drbd_peer_request *peer_req);
 
 /* drbd_receiver.c */
+extern int drbd_issue_discard_or_zero_out(struct drbd_device *device,
+   sector_t start, unsigned int nr_sectors, bool discard);
 extern int drbd_receiver(struct drbd_thread *thi);
 extern int drbd_ack_receiver(struct drbd_thread *thi);
 extern void drbd_send_ping_wf(struct work_struct *ws);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 74903ab..355cf10 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1156,6 +1156,16 @@ static int drbd_process_write_request(struct drbd_request *req)
return remote;
 }
 
+static void drbd_process_discard_req(struct drbd_request *req)
+{
+   int err = drbd_issue_discard_or_zero_out(req->device,
+   req->i.sector, req->i.size >> 9, true);
+
+   if (err)
+   req->private_bio->bi_error = -EIO;
+   bio_endio(req->private_bio);
+}
+
 static void
 drbd_submit_req_private_bio(struct drbd_request *req)
 {
@@ -1176,6 +1186,8 @@ drbd_submit_req_private_bio(struct drbd_request *req)
: rw == READ  ? DRBD_FAULT_DT_RD
:   DRBD_FAULT_DT_RA))
bio_io_error(bio);
+   else if (bio_op(bio) == REQ_OP_DISCARD)
+   drbd_process_discard_req(req);
else
generic_make_request(bio);
put_ldev(device);
@@ -1227,18 +1239,23 @@ drbd_request_prepare(struct drbd_device *device, struct bio *bio, unsigned long
/* Update disk stats */
_drbd_start_io_acct(device, req);
 
+   /* process discards always from our submitter thread */
+   if (bio_op(bio) & REQ_OP_DISCARD)
+   goto queue_for_submitter_thread;
+
if (rw == WRITE && req->private_bio && req->i.size
&& !test_bit(AL_SUSPENDED, &device->flags)) {
-   if (!drbd_al_begin_io_fastpath(device, &req->i)) {
-   atomic_inc(&device->ap_actlog_cnt);
-   drbd_queue_write(device, req);
-   return NULL;
-   }
+   if (!drbd_al_begin_io_fastpath(device, &req->i))
+   goto queue_for_submitter_thread;
req->rq_state |= RQ_IN_ACT_LOG;
req->in_actlog_jif = jiffies;
}
-
return req;
+
+ queue_for_submitter_thread:
+   atomic_inc(&device->ap_actlog_cnt);
+   drbd_queue_write(device, req);
+   return NULL;
 }
 
static void drbd_send_and_submit(struct drbd_device *device, struct drbd_request *req)
-- 
2.7.4



[PATCH 20/30] drbd: discard_zeroes_if_aligned allows "thin" resync for discard_zeroes_data=0

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Even if discard_zeroes_data != 0,
if discard_zeroes_if_aligned is set, we assume we can reliably
zero-out/discard using the drbd_issue_peer_discard() helper.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index e5fdcc6..169e3e1 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1408,9 +1408,12 @@ static void sanitize_disk_conf(struct drbd_device *device, struct disk_conf *dis
if (disk_conf->al_extents > drbd_al_extents_max(nbc))
disk_conf->al_extents = drbd_al_extents_max(nbc);
 
-   if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data) {
-   disk_conf->rs_discard_granularity = 0; /* disable feature */
-   drbd_info(device, "rs_discard_granularity feature disabled\n");
+   if (!blk_queue_discard(q)
+   || (!q->limits.discard_zeroes_data && !disk_conf->discard_zeroes_if_aligned)) {
+   if (disk_conf->rs_discard_granularity) {
+   disk_conf->rs_discard_granularity = 0; /* disable feature */
+   drbd_info(device, "rs_discard_granularity feature disabled\n");
+   }
}
 
if (disk_conf->rs_discard_granularity) {
-- 
2.7.4



[PATCH 15/30] drbd: finish resync on sync source only by notification from sync target

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

If the replication link breaks exactly during "resync finished" detection,
finishing too early on the sync source could again lead to UUIDs rotated
too fast, and potentially a spurious full resync on next handshake.

Always wait for explicit resync finished state change notification from
the sync target.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_actlog.c | 16 
 drivers/block/drbd/drbd_int.h| 19 ++-
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 265b2b6..cafa9c4 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -770,10 +770,18 @@ static bool lazy_bitmap_update_due(struct drbd_device *device)
 
 static void maybe_schedule_on_disk_bitmap_update(struct drbd_device *device, 
bool rs_done)
 {
-   if (rs_done)
-   set_bit(RS_DONE, &device->flags);
-   /* and also set RS_PROGRESS below */
-   else if (!lazy_bitmap_update_due(device))
+   if (rs_done) {
+   struct drbd_connection *connection = first_peer_device(device)->connection;
+   if (connection->agreed_pro_version <= 95 ||
+   is_sync_target_state(device->state.conn))
+   set_bit(RS_DONE, &device->flags);
+   /* and also set RS_PROGRESS below */
+
+   /* Else: rather wait for explicit notification via receive_state,
+* to avoid uuids-rotated-too-fast causing full resync
+* in next handshake, in case the replication link breaks
+* at the most unfortunate time... */
+   } else if (!lazy_bitmap_update_due(device))
return;
 
drbd_device_post_work(device, RS_PROGRESS);
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 9c68ec5..c5dbc85 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -2102,13 +2102,22 @@ static inline void _sub_unacked(struct drbd_device *device, int n, const char *f
ERR_IF_CNT_IS_NEGATIVE(unacked_cnt, func, line);
 }
 
+static inline bool is_sync_target_state(enum drbd_conns connection_state)
+{
+   return  connection_state == C_SYNC_TARGET ||
+   connection_state == C_PAUSED_SYNC_T;
+}
+
+static inline bool is_sync_source_state(enum drbd_conns connection_state)
+{
+   return  connection_state == C_SYNC_SOURCE ||
+   connection_state == C_PAUSED_SYNC_S;
+}
+
 static inline bool is_sync_state(enum drbd_conns connection_state)
 {
-   return
-  (connection_state == C_SYNC_SOURCE
-   ||  connection_state == C_SYNC_TARGET
-   ||  connection_state == C_PAUSED_SYNC_S
-   ||  connection_state == C_PAUSED_SYNC_T);
+   return  is_sync_source_state(connection_state) ||
+   is_sync_target_state(connection_state);
 }
 
 /**
-- 
2.7.4



[PATCH 14/30] drbd: allow larger max_discard_sectors

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Make sure we have at least 67 (> AL_UPDATES_PER_TRANSACTION)
al-extents available, and allow up to half of that to be
discarded in one bio.
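
The resulting limit can be computed directly. This sketch assumes the
4 MiB activity-log extent size (AL_EXTENT_SHIFT = 22) and 64 AL
updates per transaction from drbd_int.h of that era; if those constants
differ in your tree, the numbers shift accordingly:

```c
#include <assert.h>

/* Assumed constants (see drbd_int.h): */
#define AL_EXTENT_SHIFT 22			/* 4 MiB per AL extent */
#define AL_EXTENT_SIZE (1U << AL_EXTENT_SHIFT)
#define AL_UPDATES_PER_TRANSACTION 64

/* Half of one transaction's worth of extents per discard bio,
 * as introduced by this patch: */
#define DRBD_MAX_DISCARD_SIZE (AL_UPDATES_PER_TRANSACTION / 2 * AL_EXTENT_SIZE)
#define DRBD_MAX_DISCARD_SECTORS (DRBD_MAX_DISCARD_SIZE >> 9)
```

Under these assumptions one discard bio may cover 32 extents, i.e.
128 MiB, and DRBD_AL_EXTENTS_MIN = 67 indeed stays above
AL_UPDATES_PER_TRANSACTION.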

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_actlog.c | 2 +-
 drivers/block/drbd/drbd_int.h| 8 
 include/linux/drbd_limits.h  | 3 +--
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index d524973..265b2b6 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -258,7 +258,7 @@ bool drbd_al_begin_io_fastpath(struct drbd_device *device, struct drbd_interval
unsigned first = i->sector >> (AL_EXTENT_SHIFT-9);
unsigned last = i->size == 0 ? first : (i->sector + (i->size >> 9) - 1) >> (AL_EXTENT_SHIFT-9);
 
-   D_ASSERT(device, (unsigned)(last - first) <= 1);
+   D_ASSERT(device, first <= last);
D_ASSERT(device, atomic_read(&device->local_cnt) > 0);
 
	/* FIXME figure out a fast path for bios crossing AL extent boundaries */
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 0b5a658..9c68ec5 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1347,10 +1347,10 @@ struct bm_extent {
 #define DRBD_MAX_SIZE_H80_PACKET (1U << 15) /* Header 80 only allows packets up to 32KiB data */
 #define DRBD_MAX_BIO_SIZE_P95    (1U << 17) /* Protocol 95 to 99 allows bios up to 128KiB */
 
-/* For now, don't allow more than one activity log extent worth of data
- * to be discarded in one go. We may need to rework drbd_al_begin_io()
- * to allow for even larger discard ranges */
-#define DRBD_MAX_DISCARD_SIZE  AL_EXTENT_SIZE
+/* For now, don't allow more than half of what we can "activate" in one
+ * activity log transaction to be discarded in one go. We may need to rework
+ * drbd_al_begin_io() to allow for even larger discard ranges */
+#define DRBD_MAX_DISCARD_SIZE  (AL_UPDATES_PER_TRANSACTION/2*AL_EXTENT_SIZE)
 #define DRBD_MAX_DISCARD_SECTORS (DRBD_MAX_DISCARD_SIZE >> 9)
 
 extern int  drbd_bm_init(struct drbd_device *device);
diff --git a/include/linux/drbd_limits.h b/include/linux/drbd_limits.h
index a351c40..ddac684 100644
--- a/include/linux/drbd_limits.h
+++ b/include/linux/drbd_limits.h
@@ -126,8 +126,7 @@
 #define DRBD_RESYNC_RATE_DEF 250
 #define DRBD_RESYNC_RATE_SCALE 'k'  /* kilobytes */
 
-  /* less than 7 would hit performance unnecessarily. */
-#define DRBD_AL_EXTENTS_MIN  7
+#define DRBD_AL_EXTENTS_MIN  67
   /* we use u16 as "slot number", (u16)~0 is "FREE".
* If you use >= 292 kB on-disk ring buffer,
* this is the maximum you can use: */
-- 
2.7.4



[PATCH 21/30] drbd: report sizes if rejecting too small peer disk

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index cb80fb4..367b8e9 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -3952,6 +3952,7 @@ static int receive_sizes(struct drbd_connection *connection, struct packet_info
device->p_size = p_size;
 
if (get_ldev(device)) {
+   sector_t new_size, cur_size;
rcu_read_lock();
my_usize = rcu_dereference(device->ldev->disk_conf)->disk_size;
rcu_read_unlock();
@@ -3968,11 +3969,13 @@ static int receive_sizes(struct drbd_connection *connection, struct packet_info
 
/* Never shrink a device with usable data during connect.
   But allow online shrinking if we are connected. */
-   if (drbd_new_dev_size(device, device->ldev, p_usize, 0) <
-   drbd_get_capacity(device->this_bdev) &&
+   new_size = drbd_new_dev_size(device, device->ldev, p_usize, 0);
+   cur_size = drbd_get_capacity(device->this_bdev);
+   if (new_size < cur_size &&
device->state.disk >= D_OUTDATED &&
device->state.conn < C_CONNECTED) {
-   drbd_err(device, "The peer's disk size is too small!\n");
+   drbd_err(device, "The peer's disk size is too small! (%llu < %llu sectors)\n",
+   (unsigned long long)new_size, (unsigned long long)cur_size);
conn_request_state(peer_device->connection, NS(conn, C_DISCONNECTING), CS_HARD);
put_ldev(device);
return -EIO;
-- 
2.7.4



[PATCH 16/30] drbd: introduce unfence-peer handler

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

When resync is finished, we already call the "after-resync-target"
handler (on the former sync target, obviously), once per volume.

Paired with the before-resync-target handler, you can create snapshots,
before the resync causes the volumes to become inconsistent,
and discard those snapshots again, once they are no longer needed.

It was also overloaded to be paired with the "fence-peer" handler,
to "unfence" once the volumes are up-to-date and known good.

This has some disadvantages, though: we call "fence-peer" for the whole
connection (once for the group of volumes), but would call unfence as
side-effect of after-resync-target once for each volume.

Also, we fence on a (current, or about to become) Primary,
which will later become the sync-source.

Calling unfence only as a side effect of the after-resync-target
handler opens a race window, between a new fence on the Primary
(SyncTarget) and the unfence on the SyncTarget, which is difficult to
close without some kind of "cluster wide lock" in those handlers.

We would not need those handlers if we could still communicate.
Which makes trying to acquire a cluster wide lock from those handlers
seem like a very bad idea.

This introduces the "unfence-peer" handler, which will be called
per connection (once for the group of volumes), just like the fence
handler, only once all volumes are back in sync, and on the SyncSource.

Which is expected to be the node that previously called "fence", the
node that is currently allowed to be Primary, and thus the only node
that could trigger a new "fence" that could race with this unfence.

Which makes us not need any cluster wide synchronization here;
serializing two scripts running on the same node is trivial.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_int.h|  1 +
 drivers/block/drbd/drbd_nl.c |  2 +-
 drivers/block/drbd/drbd_worker.c | 28 ++--
 3 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index c5dbc85..5ee8da3 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1494,6 +1494,7 @@ extern enum drbd_state_rv drbd_set_role(struct drbd_device *device,
int force);
 extern bool conn_try_outdate_peer(struct drbd_connection *connection);
 extern void conn_try_outdate_peer_async(struct drbd_connection *connection);
+extern int conn_khelper(struct drbd_connection *connection, char *cmd);
 extern int drbd_khelper(struct drbd_device *device, char *cmd);
 
 /* drbd_worker.c */
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 12e9b31..4a4eb80 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -387,7 +387,7 @@ int drbd_khelper(struct drbd_device *device, char *cmd)
return ret;
 }
 
-static int conn_khelper(struct drbd_connection *connection, char *cmd)
+int conn_khelper(struct drbd_connection *connection, char *cmd)
 {
char *envp[] = { "HOME=/",
"TERM=linux",
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index 154dbfc..8cc2ffb 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -840,6 +840,7 @@ static void ping_peer(struct drbd_device *device)
 
 int drbd_resync_finished(struct drbd_device *device)
 {
+   struct drbd_connection *connection = first_peer_device(device)->connection;
unsigned long db, dt, dbdt;
unsigned long n_oos;
union drbd_state os, ns;
@@ -861,8 +862,7 @@ int drbd_resync_finished(struct drbd_device *device)
if (dw) {
dw->w.cb = w_resync_finished;
dw->device = device;
-   drbd_queue_work(&first_peer_device(device)->connection->sender_work,
-   &dw->w);
+   drbd_queue_work(&connection->sender_work, &dw->w);
return 1;
}
drbd_err(device, "Warn failed to drbd_rs_del_all() and to kmalloc(dw).\n");
@@ -975,6 +975,30 @@ int drbd_resync_finished(struct drbd_device *device)
_drbd_set_state(device, ns, CS_VERBOSE, NULL);
 out_unlock:
spin_unlock_irq(&device->resource->req_lock);
+
+   /* If we have been sync source, and have an effective fencing-policy,
+* once *all* volumes are back in sync, call "unfence". */
+   if (os.conn == C_SYNC_SOURCE) {
+   enum drbd_disk_state disk_state = D_MASK;
+   enum drbd_disk_state pdsk_state = D_MASK;
+   enum drbd_fencing_p fp = FP_DONT_CARE;
+
+   rcu_read_lock();
+   fp = rcu_dereference(

[PATCH 29/30] drbd: al_write_transaction: skip re-scanning of bitmap page pointer array

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

For larger devices, the array of bitmap page pointers can grow very
large (8000 pointers per TB of storage).

For each activity log transaction, we need to flush the associated
bitmap pages to stable storage. Currently, we just "mark" the respective
pages while setting up the transaction, then tell the bitmap code to
write out all marked pages, but skip unchanged pages.

But one such transaction can affect only a small number of bitmap pages,
so there is no need to scan the full array of several (ten-)thousand
page pointers just to find the few marked ones.

Instead, remember the index numbers of the few affected pages,
and later only re-check those to skip duplicates and unchanged ones.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_actlog.c |  2 ++
 drivers/block/drbd/drbd_bitmap.c | 66 +++-
 drivers/block/drbd/drbd_int.h|  1 +
 3 files changed, 54 insertions(+), 15 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index f9af555..0a1aaf8 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -341,6 +341,8 @@ static int __al_write_transaction(struct drbd_device *device, struct al_transact
 
i = 0;
 
+   drbd_bm_reset_al_hints(device);
+
/* Even though no one can start to change this list
 * once we set the LC_LOCKED -- from drbd_al_begin_io(),
 * lc_try_lock_for_transaction() --, someone may still
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 0807fcb..ab62b81 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -96,6 +96,13 @@ struct drbd_bitmap {
struct page **bm_pages;
spinlock_t bm_lock;
 
+   /* exclusively to be used by __al_write_transaction(),
+* drbd_bm_mark_for_writeout() and
+* and drbd_bm_write_hinted() -> bm_rw() called from there.
+*/
+   unsigned int n_bitmap_hints;
+   unsigned int al_bitmap_hints[AL_UPDATES_PER_TRANSACTION];
+
/* see LIMITATIONS: above */
 
unsigned long bm_set;   /* nr of set bits; THINK maybe atomic_t? */
@@ -242,6 +249,11 @@ static void bm_set_page_need_writeout(struct page *page)
set_bit(BM_PAGE_NEED_WRITEOUT, &page_private(page));
 }
 
+void drbd_bm_reset_al_hints(struct drbd_device *device)
+{
+   device->bitmap->n_bitmap_hints = 0;
+}
+
 /**
  * drbd_bm_mark_for_writeout() - mark a page with a "hint" to be considered for writeout
  * @device:DRBD device.
@@ -253,6 +265,7 @@ static void bm_set_page_need_writeout(struct page *page)
  */
 void drbd_bm_mark_for_writeout(struct drbd_device *device, int page_nr)
 {
+   struct drbd_bitmap *b = device->bitmap;
struct page *page;
if (page_nr >= device->bitmap->bm_number_of_pages) {
drbd_warn(device, "BAD: page_nr: %u, number_of_pages: %u\n",
@@ -260,7 +273,9 @@ void drbd_bm_mark_for_writeout(struct drbd_device *device, int page_nr)
return;
}
page = device->bitmap->bm_pages[page_nr];
-   set_bit(BM_PAGE_HINT_WRITEOUT, &page_private(page));
+   BUG_ON(b->n_bitmap_hints >= ARRAY_SIZE(b->al_bitmap_hints));
+   if (!test_and_set_bit(BM_PAGE_HINT_WRITEOUT, &page_private(page)))
+   b->al_bitmap_hints[b->n_bitmap_hints++] = page_nr;
 }
 
 static int bm_test_page_unchanged(struct page *page)
@@ -1030,7 +1045,7 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
 {
struct drbd_bm_aio_ctx *ctx;
struct drbd_bitmap *b = device->bitmap;
-   int num_pages, i, count = 0;
+   unsigned int num_pages, i, count = 0;
unsigned long now;
char ppb[10];
int err = 0;
@@ -1078,16 +1093,37 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
now = jiffies;
 
/* let the layers below us try to merge these bios... */
-   for (i = 0; i < num_pages; i++) {
-   /* ignore completely unchanged pages */
-   if (lazy_writeout_upper_idx && i == lazy_writeout_upper_idx)
-   break;
-   if (!(flags & BM_AIO_READ)) {
-   if ((flags & BM_AIO_WRITE_HINTED) &&
-   !test_and_clear_bit(BM_PAGE_HINT_WRITEOUT,
-   &page_private(b->bm_pages[i])))
-   continue;
 
+   if (flags & BM_AIO_READ) {
+   for (i = 0; i < num_pages; i++) {
+   atomic_inc(&ctx->in_flight);
+   bm_page_io_async(ctx, i);
+   ++count;
+   cond_resched();
+   }
+   } else if (flags & BM_AIO_WRITE_HINTED) {

[PATCH 30/30] drbd: correctly handle failed crypto_alloc_hash

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

crypto_alloc_hash returns an ERR_PTR(), not NULL.

Also reset peer_integrity_tfm to NULL, to not call crypto_free_hash()
on an errno in the cleanup path.

Reported-by: Insu Yun 

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 3 ++-
 include/linux/drbd.h   | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 0d74602..df45713 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -3681,7 +3681,8 @@ static int receive_protocol(struct drbd_connection *connection, struct packet_in
 */
 
peer_integrity_tfm = crypto_alloc_ahash(integrity_alg, 0, CRYPTO_ALG_ASYNC);
-   if (!peer_integrity_tfm) {
+   if (IS_ERR(peer_integrity_tfm)) {
+   peer_integrity_tfm = NULL;
drbd_err(connection, "peer data-integrity-alg %s not supported\n",
 integrity_alg);
goto disconnect;
diff --git a/include/linux/drbd.h b/include/linux/drbd.h
index 2b26156..002611c 100644
--- a/include/linux/drbd.h
+++ b/include/linux/drbd.h
@@ -51,7 +51,7 @@
 #endif
 
 extern const char *drbd_buildtag(void);
-#define REL_VERSION "8.4.6"
+#define REL_VERSION "8.4.7"
 #define API_VERSION 1
 #define PRO_VERSION_MIN 86
 #define PRO_VERSION_MAX 101
-- 
2.7.4



[PATCH 17/30] drbd: don't forget error completion when "unsuspending" IO

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Possible sequence of events:
SyncTarget is made Primary, then loses replication link
(only path to good data on SyncSource).

Behavior is then controlled by the on-no-data-accessible policy,
which defaults to OND_IO_ERROR (may be set to OND_SUSPEND_IO).

If OND_IO_ERROR is in fact the current policy, we clear the susp_fen
(IO suspended due to fencing policy) flag, do NOT set the susp_nod
(IO suspended due to no data) flag.

But we forgot to call the IO error completion for all pending,
suspended, requests.

While at it, also add a race check for a theoretically possible
race with a new handshake (network hiccup): we may be able to
re-send requests, and can avoid passing IO errors up the stack.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 48 +---
 1 file changed, 32 insertions(+), 16 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 4a4eb80..e5fdcc6 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -442,19 +442,17 @@ static enum drbd_fencing_p highest_fencing_policy(struct drbd_connection *connec
}
rcu_read_unlock();
 
-   if (fp == FP_NOT_AVAIL) {
-   /* IO Suspending works on the whole resource.
-  Do it only for one device. */
-   vnr = 0;
-   peer_device = idr_get_next(&connection->peer_devices, &vnr);
drbd_change_state(peer_device->device, CS_VERBOSE | CS_HARD, NS(susp_fen, 0));
-   }
-
return fp;
 }
 
+static bool resource_is_supended(struct drbd_resource *resource)
+{
+   return resource->susp || resource->susp_fen || resource->susp_nod;
+}
+
 bool conn_try_outdate_peer(struct drbd_connection *connection)
 {
+   struct drbd_resource * const resource = connection->resource;
unsigned int connect_cnt;
union drbd_state mask = { };
union drbd_state val = { };
@@ -462,21 +460,41 @@ bool conn_try_outdate_peer(struct drbd_connection *connection)
char *ex_to_string;
int r;
 
-   spin_lock_irq(&connection->resource->req_lock);
+   spin_lock_irq(&resource->req_lock);
if (connection->cstate >= C_WF_REPORT_PARAMS) {
drbd_err(connection, "Expected cstate < C_WF_REPORT_PARAMS\n");
-   spin_unlock_irq(&connection->resource->req_lock);
+   spin_unlock_irq(&resource->req_lock);
return false;
}
 
connect_cnt = connection->connect_cnt;
-   spin_unlock_irq(&connection->resource->req_lock);
+   spin_unlock_irq(&resource->req_lock);
 
fp = highest_fencing_policy(connection);
switch (fp) {
case FP_NOT_AVAIL:
drbd_warn(connection, "Not fencing peer, I'm not even Consistent myself.\n");
-   goto out;
+   spin_lock_irq(&resource->req_lock);
+   if (connection->cstate < C_WF_REPORT_PARAMS) {
+   _conn_request_state(connection,
+   (union drbd_state) { { .susp_fen = 1 } },
+   (union drbd_state) { { .susp_fen = 0 } },
+   CS_VERBOSE | CS_HARD | CS_DC_SUSP);
+   /* We are no longer suspended due to the fencing policy.
+* We may still be suspended due to the on-no-data-accessible policy.
+* If that was OND_IO_ERROR, fail pending requests. */
+   if (!resource_is_supended(resource))
+   _tl_restart(connection, CONNECTION_LOST_WHILE_PENDING);
+   }
+   /* Else: in case we raced with a connection handshake,
+* let the handshake figure out if we maybe can RESEND,
+* and do not resume/fail pending requests here.
+* Worst case is we stay suspended for now, which may be
+* resolved by either re-establishing the replication link, or
+* the next link failure, or eventually the administrator.  */
+   spin_unlock_irq(&resource->req_lock);
+   return false;
+
case FP_DONT_CARE:
return true;
default: ;
@@ -529,13 +547,11 @@ bool conn_try_outdate_peer(struct drbd_connection *connection)
drbd_info(connection, "fence-peer helper returned %d (%s)\n",
  (r>>8) & 0xff, ex_to_string);
 
- out:
-
/* Not using
   conn_request_state(connection, mask, val, CS_VERBOSE);
   here, because we might were able to re-establish the connection in the
   meantime. */
-   spin_lock_irq(&connection->resource->req_lock);

Re: [PATCH 00/30] DRBD updates

2016-06-13 Thread Philipp Reisner
[...]
> If you want me to add it to that branch (which is where it should go),
> then why aren't the patches against that branch? I get rejects on
> several of the patches, mainly because they are not done on top of this
> particular branch.
>
> We can do two things here. I can skip patches, I don't like doing that.
> Or you can respin against the proper branch, as it should have been from
> the beginning. What do you want to do?

Sorry. It was based on Linus' 4.7-rc3. Shame on me. I rebased it onto your
for-4.8/drivers.

cheers,
 Phil

Fabian Frederick (1):
  drbd: code cleanups without semantic changes

Lars Ellenberg (24):
  drbd: bitmap bulk IO: do not always suspend IO
  drbd: change bitmap write-out when leaving resync states
  drbd: adjust assert in w_bitmap_io to account for
BM_LOCKED_CHANGE_ALLOWED
  drbd: fix regression: protocol A sometimes synchronous, C sometimes
double-latency
  drbd: fix for truncated minor number in callback command line
  drbd: allow parallel flushes for multi-volume resources
  drbd: when receiving P_TRIM, zero-out partial unaligned chunks
  drbd: possibly disable discard support, if backend has
discard_zeroes_data=0
  drbd: zero-out partial unaligned discards on local backend
  drbd: allow larger max_discard_sectors
  drbd: finish resync on sync source only by notification from sync
target
  drbd: introduce unfence-peer handler
  drbd: don't forget error completion when "unsuspending" IO
  drbd: if there is no good data accessible, writes should be IO errors
  drbd: only restart frozen disk io when D_UP_TO_DATE
  drbd: discard_zeroes_if_aligned allows "thin" resync for
discard_zeroes_data=0
  drbd: report sizes if rejecting too small peer disk
  drbd: introduce WRITE_SAME support
  drbd: sync_handshake: handle identical uuids with current (frozen)
Primary
  drbd: disallow promotion during resync handshake, avoid deadlock and
hard reset
  drbd: bump current uuid when resuming IO with diskless peer
  drbd: finally report ms, not jiffies, in log message
  drbd: al_write_transaction: skip re-scanning of bitmap page pointer
array
  drbd: correctly handle failed crypto_alloc_hash

Philipp Reisner (4):
  drbd: Kill code duplication
  drbd: Implement handling of thinly provisioned storage on resync
target nodes
  drbd: Introduce new disk config option rs-discard-granularity
  drbd: Create the protocol feature THIN_RESYNC

Roland Kammerer (1):
  drbd: get rid of empty statement in is_valid_state

 drivers/block/drbd/drbd_actlog.c   |  29 +-
 drivers/block/drbd/drbd_bitmap.c   |  84 --
 drivers/block/drbd/drbd_debugfs.c  |  13 +-
 drivers/block/drbd/drbd_int.h  |  49 +++-
 drivers/block/drbd/drbd_interval.h |  14 +-
 drivers/block/drbd/drbd_main.c | 115 +++-
 drivers/block/drbd/drbd_nl.c   | 282 ++-
 drivers/block/drbd/drbd_proc.c |  30 +--
 drivers/block/drbd/drbd_protocol.h |  77 +-
 drivers/block/drbd/drbd_receiver.c | 535 ++---
 drivers/block/drbd/drbd_req.c  |  84 --
 drivers/block/drbd/drbd_req.h  |   5 +-
 drivers/block/drbd/drbd_state.c|  61 -
 drivers/block/drbd/drbd_state.h|   2 +-
 drivers/block/drbd/drbd_strings.c  |   8 +-
 drivers/block/drbd/drbd_worker.c   |  85 +-
 include/linux/drbd.h   |  10 +-
 include/linux/drbd_genl.h  |   7 +-
 include/linux/drbd_limits.h|  15 +-
 19 files changed, 1206 insertions(+), 299 deletions(-)

-- 
2.7.4



[PATCH 10/30] drbd: allow parallel flushes for multi-volume resources

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

To maintain write-order fidelity across all volumes in a DRBD resource,
the receiver of a P_BARRIER needs to issue flushes to all volumes.
We used to do this by calling blkdev_issue_flush(), synchronously,
one volume at a time.

We now submit all flushes to all volumes in parallel, then wait for all
completions, to reduce worst-case latencies on multi-volume resources.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 113 +
 1 file changed, 88 insertions(+), 25 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 4cfc721..a2e7ba9 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1204,13 +1204,83 @@ static int drbd_recv_header(struct drbd_connection *connection, struct packet_in
return err;
 }
 
-static void drbd_flush(struct drbd_connection *connection)
+/* This is blkdev_issue_flush, but asynchronous.
+ * We want to submit to all component volumes in parallel,
+ * then wait for all completions.
+ */
+struct issue_flush_context {
+   atomic_t pending;
+   int error;
+   struct completion done;
+};
+struct one_flush_context {
+   struct drbd_device *device;
+   struct issue_flush_context *ctx;
+};
+
+void one_flush_endio(struct bio *bio)
 {
-   int rv;
-   struct drbd_peer_device *peer_device;
-   int vnr;
+   struct one_flush_context *octx = bio->bi_private;
+   struct drbd_device *device = octx->device;
+   struct issue_flush_context *ctx = octx->ctx;
+
+   if (bio->bi_error) {
+   ctx->error = bio->bi_error;
+   drbd_info(device, "local disk FLUSH FAILED with status %d\n", bio->bi_error);
+   }
+   kfree(octx);
+   bio_put(bio);
+
+   clear_bit(FLUSH_PENDING, &device->flags);
+   put_ldev(device);
+   kref_put(&device->kref, drbd_destroy_device);
+
+   if (atomic_dec_and_test(&ctx->pending))
+   complete(&ctx->done);
+}
+
+static void submit_one_flush(struct drbd_device *device, struct issue_flush_context *ctx)
+{
+   struct bio *bio = bio_alloc(GFP_NOIO, 0);
+   struct one_flush_context *octx = kmalloc(sizeof(*octx), GFP_NOIO);
+   if (!bio || !octx) {
+   drbd_warn(device, "Could not allocate a bio, CANNOT ISSUE FLUSH\n");
+   /* FIXME: what else can I do now?  disconnecting or detaching
+* really does not help to improve the state of the world, either.
+*/
+   kfree(octx);
+   if (bio)
+   bio_put(bio);
 
+   ctx->error = -ENOMEM;
+   put_ldev(device);
+   kref_put(&device->kref, drbd_destroy_device);
+   return;
+   }
+
+   octx->device = device;
+   octx->ctx = ctx;
+   bio->bi_bdev = device->ldev->backing_bdev;
+   bio->bi_private = octx;
+   bio->bi_end_io = one_flush_endio;
+
+   device->flush_jif = jiffies;
+   set_bit(FLUSH_PENDING, &device->flags);
+   atomic_inc(&ctx->pending);
+   submit_bio(WRITE_FLUSH, bio);
+}
+
+static void drbd_flush(struct drbd_connection *connection)
+{
if (connection->resource->write_ordering >= WO_BDEV_FLUSH) {
+   struct drbd_peer_device *peer_device;
+   struct issue_flush_context ctx;
+   int vnr;
+
+   atomic_set(&ctx.pending, 1);
+   ctx.error = 0;
+   init_completion(&ctx.done);
+
rcu_read_lock();
idr_for_each_entry(&connection->peer_devices, peer_device, vnr) {
struct drbd_device *device = peer_device->device;
@@ -1220,31 +1290,24 @@ static void drbd_flush(struct drbd_connection *connection)
kref_get(&device->kref);
rcu_read_unlock();
 
-   /* Right now, we have only this one synchronous code path
-* for flushes between request epochs.
-* We may want to make those asynchronous,
-* or at least parallelize the flushes to the volume devices.
-*/
-   device->flush_jif = jiffies;
-   set_bit(FLUSH_PENDING, &device->flags);
-   rv = blkdev_issue_flush(device->ldev->backing_bdev,
-   GFP_NOIO, NULL);
-   clear_bit(FLUSH_PENDING, &device->flags);
-   if (rv) {
-   drbd_info(device, "local disk flush failed with status %d\n", rv);
-   /* would rath

[PATCH 07/30] drbd: adjust assert in w_bitmap_io to account for BM_LOCKED_CHANGE_ALLOWED

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_main.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index dd2432e..782e430 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -3521,7 +3521,12 @@ static int w_bitmap_io(struct drbd_work *w, int unused)
struct bm_io_work *work = &device->bm_io_work;
int rv = -EIO;
 
-   D_ASSERT(device, atomic_read(&device->ap_bio_cnt) == 0);
+   if (work->flags != BM_LOCKED_CHANGE_ALLOWED) {
+   int cnt = atomic_read(&device->ap_bio_cnt);
+   if (cnt)
+   drbd_err(device, "FIXME: ap_bio_cnt %d, expected 0; queued for '%s'\n",
+   cnt, work->why);
+   }
 
if (get_ldev(device)) {
drbd_bm_lock(device, work->why, work->flags);
-- 
2.7.4



[PATCH 28/30] drbd: finally report ms, not jiffies, in log message

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Also skip the message unless bitmap IO took longer than 5 ms.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_bitmap.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 17e5e60..801b8f3 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -1121,10 +1121,14 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
kref_put(&ctx->kref, &drbd_bm_aio_ctx_destroy);
 
/* summary for global bitmap IO */
-   if (flags == 0)
-   drbd_info(device, "bitmap %s of %u pages took %lu jiffies\n",
-(flags & BM_AIO_READ) ? "READ" : "WRITE",
-count, jiffies - now);
+   if (flags == 0) {
+   unsigned int ms = jiffies_to_msecs(jiffies - now);
+   if (ms > 5) {
+   drbd_info(device, "bitmap %s of %u pages took %u ms\n",
+(flags & BM_AIO_READ) ? "READ" : "WRITE",
+count, ms);
+   }
+   }
 
if (ctx->error) {
drbd_alert(device, "we had at least one MD IO ERROR during 
bitmap IO\n");
-- 
2.7.4



[PATCH 22/30] drbd: introduce WRITE_SAME support

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

We will support WRITE_SAME, if
 * all peers support WRITE_SAME (both in kernel and DRBD version),
 * all peer devices support WRITE_SAME
 * logical_block_size is identical on all peers.

We may at some point introduce a fallback on the receiving side
for devices/kernels that do not support WRITE_SAME,
by open-coding a submit loop. But not yet.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_actlog.c   |   9 ++-
 drivers/block/drbd/drbd_debugfs.c  |  11 +--
 drivers/block/drbd/drbd_int.h  |  13 ++--
 drivers/block/drbd/drbd_main.c |  82 +++---
 drivers/block/drbd/drbd_nl.c   |  88 +---
 drivers/block/drbd/drbd_protocol.h |  74 ++--
 drivers/block/drbd/drbd_receiver.c | 137 +++--
 drivers/block/drbd/drbd_req.c  |  13 ++--
 drivers/block/drbd/drbd_req.h  |   5 +-
 drivers/block/drbd/drbd_worker.c   |   8 ++-
 10 files changed, 360 insertions(+), 80 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 4e07cff..99a2b92 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -838,6 +838,13 @@ static int update_sync_bits(struct drbd_device *device,
return count;
 }
 
+static bool plausible_request_size(int size)
+{
+   return size > 0
+   && size <= DRBD_MAX_BATCH_BIO_SIZE
+   && IS_ALIGNED(size, 512);
+}
+
 /* clear the bit corresponding to the piece of storage in question:
  * size byte of data starting from sector.  Only clear a bits of the affected
  * one ore more _aligned_ BM_BLOCK_SIZE blocks.
@@ -857,7 +864,7 @@ int __drbd_change_sync(struct drbd_device *device, sector_t sector, int size,
if ((mode == SET_OUT_OF_SYNC) && size == 0)
return 0;
 
-   if (size <= 0 || !IS_ALIGNED(size, 512) || size > DRBD_MAX_DISCARD_SIZE) {
+   if (!plausible_request_size(size)) {
drbd_err(device, "%s: sector=%llus size=%d nonsense!\n",
drbd_change_sync_fname[mode],
(unsigned long long)sector, size);
diff --git a/drivers/block/drbd/drbd_debugfs.c b/drivers/block/drbd/drbd_debugfs.c
index 4de95bb..8a90812 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -237,14 +237,9 @@ static void seq_print_peer_request_flags(struct seq_file *m, struct drbd_peer_re
seq_print_rq_state_bit(m, f & EE_SEND_WRITE_ACK, &sep, "C");
seq_print_rq_state_bit(m, f & EE_MAY_SET_IN_SYNC, &sep, "set-in-sync");
 
-   if (f & EE_IS_TRIM) {
-   seq_putc(m, sep);
-   sep = '|';
-   if (f & EE_IS_TRIM_USE_ZEROOUT)
-   seq_puts(m, "zero-out");
-   else
-   seq_puts(m, "trim");
-   }
+   if (f & EE_IS_TRIM)
+   __seq_print_rq_state_bit(m, f & EE_IS_TRIM_USE_ZEROOUT, &sep, "zero-out", "trim");
+   seq_print_rq_state_bit(m, f & EE_WRITE_SAME, &sep, "write-same");
seq_putc(m, '\n');
 }
 
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index cb42f6c..cb47809 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -468,6 +468,9 @@ enum {
/* this is/was a write request */
__EE_WRITE,
 
+   /* this is/was a write same request */
+   __EE_WRITE_SAME,
+
/* this originates from application on peer
 * (not some resync or verify or other DRBD internal request) */
__EE_APPLICATION,
@@ -487,6 +490,7 @@ enum {
 #define EE_IN_INTERVAL_TREE(1<<__EE_IN_INTERVAL_TREE)
 #define EE_SUBMITTED   (1<<__EE_SUBMITTED)
 #define EE_WRITE   (1<<__EE_WRITE)
+#define EE_WRITE_SAME  (1<<__EE_WRITE_SAME)
 #define EE_APPLICATION (1<<__EE_APPLICATION)
 #define EE_RS_THIN_REQ (1<<__EE_RS_THIN_REQ)
 
@@ -1350,8 +1354,8 @@ struct bm_extent {
 /* For now, don't allow more than half of what we can "activate" in one
  * activity log transaction to be discarded in one go. We may need to rework
  * drbd_al_begin_io() to allow for even larger discard ranges */
-#define DRBD_MAX_DISCARD_SIZE  (AL_UPDATES_PER_TRANSACTION/2*AL_EXTENT_SIZE)
-#define DRBD_MAX_DISCARD_SECTORS (DRBD_MAX_DISCARD_SIZE >> 9)
+#define DRBD_MAX_BATCH_BIO_SIZE  (AL_UPDATES_PER_TRANSACTION/2*AL_EXTENT_SIZE)
+#define DRBD_MAX_BBIO_SECTORS    (DRBD_MAX_BATCH_BIO_SIZE >> 9)
 
 extern int  drbd_bm_init(struct drbd_device *device);
 extern int  drbd_bm_resize(struct drbd_device *device, sector_t sectors, int 
set_new_bits);
@@ -1488,7 +1492,

[PATCH 24/30] drbd: disallow promotion during resync handshake, avoid deadlock and hard reset

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

We already serialize connection state changes,
and other, non-connection state changes (role changes)
while we are establishing a connection.

But if we have an established connection,
and then trigger a resync handshake (by primary --force or similar),
until now we just had to be "lucky".

Consider this sequence (e.g. deployment scenario):
create-md; up;
  -> Connected Secondary/Secondary Inconsistent/Inconsistent
then do a racy primary --force on both peers.

 block drbd0: drbd_sync_handshake:
 block drbd0: self 
0004::: bits:25590 
flags:0
 block drbd0: peer 
0004::: bits:25590 
flags:0
 block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) 
pdsk( DUnknown -> Inconsistent )
 block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
  *** HERE things go wrong. ***
 block drbd0: role( Secondary -> Primary )
 block drbd0: drbd_sync_handshake:
 block drbd0: self 
0005::: bits:25590 
flags:0
 block drbd0: peer 
C90D2FC716D232AB:0004:: bits:25590 
flags:0
 block drbd0: Becoming sync target due to disk states.
 block drbd0: Writing the whole bitmap, full sync required after 
drbd_sync_handshake.
 block drbd0: Remote failed to finish a request within 6007ms > ko-count (2) * 
timeout (30 * 0.1s)
 drbd s0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( 
UpToDate -> DUnknown )

The problem here is that the local promotion happens before the sync handshake
triggered by the remote promotion was completed.  Some assumptions elsewhere
become wrong, and when the expected resync handshake is then received and
processed, we get stuck in a deadlock, which can only be recovered by reboot :-(

Fix: if we know the peer has good data,
and our own disk is present, but NOT good,
and there is no resync going on yet,
we expect a sync handshake to happen "soon".
So reject a racy promotion with SS_IN_TRANSIENT_STATE.

Result:
 ... as above ...
 block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
  *** local promotion being postponed until ... ***
 block drbd0: drbd_sync_handshake:
 block drbd0: self 
0004::: bits:25590 
flags:0
 block drbd0: peer 
77868BDA836E12A5:0004:: bits:25590 
flags:0
  ...
 block drbd0: conn( WFBitMapT -> WFSyncUUID )
 block drbd0: updated sync uuid 
85D06D0E8887AD44:::
 block drbd0: conn( WFSyncUUID -> SyncTarget )
  *** ... after the resync handshake ***
 block drbd0: role( Secondary -> Primary )

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_state.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 24422e8..7562c5c 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -906,6 +906,15 @@ is_valid_soft_transition(union drbd_state os, union 
drbd_state ns, struct drbd_c
  (ns.conn >= C_CONNECTED && os.conn == C_WF_REPORT_PARAMS)))
rv = SS_IN_TRANSIENT_STATE;
 
+   /* Do not promote during resync handshake triggered by "force primary".
+* This is a hack. It should really be rejected by the peer during the
+* cluster wide state change request. */
+   if (os.role != R_PRIMARY && ns.role == R_PRIMARY
+   && ns.pdsk == D_UP_TO_DATE
+   && ns.disk != D_UP_TO_DATE && ns.disk != D_DISKLESS
+   && (ns.conn <= C_WF_SYNC_UUID || ns.conn != os.conn))
+   rv = SS_IN_TRANSIENT_STATE;
+
if ((ns.conn == C_VERIFY_S || ns.conn == C_VERIFY_T) && os.conn < 
C_CONNECTED)
rv = SS_NEED_CONNECTION;
 
-- 
2.7.4



[PATCH 18/30] drbd: if there is no good data accessible, writes should be IO errors

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

If DRBD lost all path to good data,
and the on-no-data-accessible policy is OND_SUSPEND_IO,
all pending and new IO requests are suspended (will block).

If that setting is OND_IO_ERROR, IO will still be completed.
READ to "clean" areas (e.g. on an D_INCONSISTENT device,
and bitmap indicates a block is already in sync) will succeed.
READ to "unclean" areas (bitmap indicates block is out-of-sync),
will return EIO.

If we are already D_DISKLESS (or D_FAILED), we also return EIO.

Unfortunately, on a former R_PRIMARY C_SYNC_TARGET D_INCONSISTENT,
after replication link loss, new WRITE requests still went through OK.

They would also set the "out-of-sync" bit on their way, so READ after
WRITE would still return EIO. Also, the data generation UUIDs had not
been bumped, we would cause data divergence, without being able to
detect it on the next sync handshake, given the right sequence of events
in a multiple error scenario and "improper" order of recovery actions.

The right thing to do is to return EIO for all new writes,
unless we have access to good, current, D_UP_TO_DATE data.

The "established best practices" way to avoid these situations in the
first place is to set OND_SUSPEND_IO, or even do a hard-reset from
the pri-on-incon-degr policy helper hook.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_req.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 7e441ff..8260dfa6 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1258,6 +1258,22 @@ drbd_request_prepare(struct drbd_device *device, struct 
bio *bio, unsigned long
return NULL;
 }
 
+/* Require at least one path to current data.
+ * We don't want to allow writes on C_STANDALONE D_INCONSISTENT:
+ * We would not allow to read what was written,
+ * we would not have bumped the data generation uuids,
+ * we would cause data divergence for all the wrong reasons.
+ *
+ * If we don't see at least one D_UP_TO_DATE, we will fail this request,
+ * which either returns EIO, or, if OND_SUSPEND_IO is set, suspends IO,
+ * and queues for retry later.
+ */
+static bool may_do_writes(struct drbd_device *device)
+{
+   const union drbd_dev_state s = device->state;
+   return s.disk == D_UP_TO_DATE || s.pdsk == D_UP_TO_DATE;
+}
+
 static void drbd_send_and_submit(struct drbd_device *device, struct 
drbd_request *req)
 {
struct drbd_resource *resource = device->resource;
@@ -1312,6 +1328,12 @@ static void drbd_send_and_submit(struct drbd_device 
*device, struct drbd_request
}
 
if (rw == WRITE) {
+   if (req->private_bio && !may_do_writes(device)) {
+   bio_put(req->private_bio);
+   req->private_bio = NULL;
+   put_ldev(device);
+   goto nodata;
+   }
if (!drbd_process_write_request(req))
no_remote = true;
} else {
-- 
2.7.4



[PATCH 06/30] drbd: Create the protocol feature THIN_RESYNC

2016-06-13 Thread Philipp Reisner
If thinly provisioned volumes are used, during a resync the sync source
tries to find out if a block is deallocated. If it is deallocated, then
the resync target uses blkdev_issue_zeroout() on the range in
question.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_protocol.h |  1 +
 drivers/block/drbd/drbd_receiver.c |  5 -
 drivers/block/drbd/drbd_worker.c   | 13 -
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_protocol.h 
b/drivers/block/drbd/drbd_protocol.h
index e5e74e3..7acc2e0 100644
--- a/drivers/block/drbd/drbd_protocol.h
+++ b/drivers/block/drbd/drbd_protocol.h
@@ -165,6 +165,7 @@ struct p_block_req {
  */
 
 #define FF_TRIM  1
+#define FF_THIN_RESYNC 2
 
 struct p_connection_features {
u32 protocol_min;
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 3a6c2ec..4cfc721 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -48,7 +48,7 @@
 #include "drbd_req.h"
 #include "drbd_vli.h"
 
-#define PRO_FEATURES (FF_TRIM)
+#define PRO_FEATURES (FF_TRIM | FF_THIN_RESYNC)
 
 struct packet_info {
enum drbd_packet cmd;
@@ -4979,6 +4979,9 @@ static int drbd_do_features(struct drbd_connection 
*connection)
drbd_info(connection, "Agreed to%ssupport TRIM on protocol level\n",
  connection->agreed_features & FF_TRIM ? " " : " not ");
 
+   drbd_info(connection, "Agreed to%ssupport THIN_RESYNC on protocol 
level\n",
+ connection->agreed_features & FF_THIN_RESYNC ? " " : " not ");
+
return 1;
 
  incompat:
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index 01d74ee2..fa63c22 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -582,6 +582,7 @@ static int make_resync_request(struct drbd_device *const 
device, int cancel)
int number, rollback_i, size;
int align, requeue = 0;
int i = 0;
+   int discard_granularity = 0;
 
if (unlikely(cancel))
return 0;
@@ -601,6 +602,12 @@ static int make_resync_request(struct drbd_device *const 
device, int cancel)
return 0;
}
 
+   if (connection->agreed_features & FF_THIN_RESYNC) {
+   rcu_read_lock();
+   discard_granularity = 
rcu_dereference(device->ldev->disk_conf)->rs_discard_granularity;
+   rcu_read_unlock();
+   }
+
max_bio_size = queue_max_hw_sectors(device->rq_queue) << 9;
number = drbd_rs_number_requests(device);
if (number <= 0)
@@ -665,6 +672,9 @@ next_sector:
if (sector & ((1<<(align+3))-1))
break;
 
+   if (discard_granularity && size == discard_granularity)
+   break;
+
/* do not cross extent boundaries */
if (((bit+1) & BM_BLOCKS_PER_BM_EXT_MASK) == 0)
break;
@@ -711,7 +721,8 @@ next_sector:
int err;
 
inc_rs_pending(device);
-   err = drbd_send_drequest(peer_device, P_RS_DATA_REQUEST,
+   err = drbd_send_drequest(peer_device,
+size == discard_granularity ? 
P_RS_THIN_REQ : P_RS_DATA_REQUEST,
 sector, size, ID_SYNCER);
if (err) {
drbd_err(device, "drbd_send_drequest() failed, 
aborting...\n");
-- 
2.7.4



[PATCH 26/30] drbd: code cleanups without semantic changes

2016-06-13 Thread Philipp Reisner
From: Fabian Frederick 

This contains various cosmetic fixes ranging from simple typos to
const-ifying, and using booleans properly.

Original commit messages from Fabian's patch set:
drbd: debugfs: constify drbd_version_fops
drbd: use seq_put instead of seq_print where possible
drbd: include linux/uaccess.h instead of asm/uaccess.h
drbd: use const char * const for drbd strings
drbd: kerneldoc warning fix in w_e_end_data_req()
drbd: use unsigned for one bit fields
drbd: use bool for peer is_ states
drbd: fix typo
drbd: use | for bitmask combination
drbd: use true/false for bool
drbd: fix drbd_bm_init() comments
drbd: introduce peer state union
drbd: fix maybe_pull_ahead() locking comments
drbd: use bool for growing
drbd: remove redundant declarations
drbd: replace if/BUG by BUG_ON

Signed-off-by: Fabian Frederick 
Signed-off-by: Roland Kammerer 
---
 drivers/block/drbd/drbd_bitmap.c   |  6 +++---
 drivers/block/drbd/drbd_debugfs.c  |  2 +-
 drivers/block/drbd/drbd_int.h  |  4 +---
 drivers/block/drbd/drbd_interval.h | 14 +++---
 drivers/block/drbd/drbd_main.c |  2 +-
 drivers/block/drbd/drbd_nl.c   | 14 --
 drivers/block/drbd/drbd_proc.c | 30 +++---
 drivers/block/drbd/drbd_receiver.c |  8 
 drivers/block/drbd/drbd_req.c  |  2 +-
 drivers/block/drbd/drbd_state.c|  4 +---
 drivers/block/drbd/drbd_state.h|  2 +-
 drivers/block/drbd/drbd_strings.c  |  8 
 drivers/block/drbd/drbd_worker.c   |  9 -
 include/linux/drbd.h   |  8 
 14 files changed, 59 insertions(+), 54 deletions(-)

diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 92d6fc0..17e5e60 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -427,8 +427,7 @@ static struct page **bm_realloc_pages(struct drbd_bitmap 
*b, unsigned long want)
 }
 
 /*
- * called on driver init only. TODO call when a device is created.
- * allocates the drbd_bitmap, and stores it in device->bitmap.
+ * allocates the drbd_bitmap and stores it in device->bitmap.
  */
 int drbd_bm_init(struct drbd_device *device)
 {
@@ -633,7 +632,8 @@ int drbd_bm_resize(struct drbd_device *device, sector_t 
capacity, int set_new_bi
unsigned long bits, words, owords, obits;
unsigned long want, have, onpages; /* number of pages */
struct page **npages, **opages = NULL;
-   int err = 0, growing;
+   int err = 0;
+   bool growing;
 
if (!expect(b))
return -ENOMEM;
diff --git a/drivers/block/drbd/drbd_debugfs.c 
b/drivers/block/drbd/drbd_debugfs.c
index 8a90812..be91a8d 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -903,7 +903,7 @@ static int drbd_version_open(struct inode *inode, struct 
file *file)
return single_open(file, drbd_version_show, NULL);
 }
 
-static struct file_operations drbd_version_fops = {
+static const struct file_operations drbd_version_fops = {
.owner = THIS_MODULE,
.open = drbd_version_open,
.llseek = seq_lseek,
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index cb47809..acb1462 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1499,7 +1499,7 @@ extern enum drbd_state_rv drbd_set_role(struct 
drbd_device *device,
int force);
 extern bool conn_try_outdate_peer(struct drbd_connection *connection);
 extern void conn_try_outdate_peer_async(struct drbd_connection *connection);
-extern int conn_khelper(struct drbd_connection *connection, char *cmd);
+extern enum drbd_peer_state conn_khelper(struct drbd_connection *connection, 
char *cmd);
 extern int drbd_khelper(struct drbd_device *device, char *cmd);
 
 /* drbd_worker.c */
@@ -1648,8 +1648,6 @@ void drbd_bump_write_ordering(struct drbd_resource 
*resource, struct drbd_backin
 /* drbd_proc.c */
 extern struct proc_dir_entry *drbd_proc;
 extern const struct file_operations drbd_proc_fops;
-extern const char *drbd_conn_str(enum drbd_conns s);
-extern const char *drbd_role_str(enum drbd_role s);
 
 /* drbd_actlog.c */
 extern bool drbd_al_begin_io_prepare(struct drbd_device *device, struct 
drbd_interval *i);
diff --git a/drivers/block/drbd/drbd_interval.h 
b/drivers/block/drbd/drbd_interval.h
index f210543..23c5a94 100644
--- a/drivers/block/drbd/drbd_interval.h
+++ b/drivers/block/drbd/drbd_interval.h
@@ -6,13 +6,13 @@
 
 struct drbd_interval {
struct rb_node rb;
-   sector_t sector;/* start sector of the interval */
-   unsigned int size;  /* size in bytes */
-   sector_t end;   /* highest interval end in subtree */
-   int local:1 /* local or remote request? */;
-   int waiting:1;  /* someone is waiting for this to complete */
-   int completed:1;/* this has been completed already;
-* ignore for confli

[PATCH 11/30] drbd: when receiving P_TRIM, zero-out partial unaligned chunks

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

We can avoid spurious data divergence caused by partially-ignored
discards on certain backends with discard_zeroes_data=0, if we
translate partial unaligned discard requests into explicit zero-out.

The relevant use case is LVM/DM thin.

If on different nodes, DRBD is backed by devices with differing
discard characteristics, discards may lead to data divergence
(old data or garbage left over on one backend, zeroes due to
unmapped areas on the other backend). Online verify would now
potentially report tons of spurious differences.

While probably harmless for most use cases (fstrim on a file system),
DRBD cannot have that, it would violate our promise to upper layers
that our data instances on the nodes are identical.

To be correct and play safe (make sure data is identical on both copies),
we would have to disable discard support, if our local backend (on a
Primary) does not support "discard_zeroes_data=true".

We'd also have to translate discards to explicit zero-out on the
receiving (typically: Secondary) side, unless the receiving side
supports "discard_zeroes_data=true".

Which both would allocate those blocks, instead of unmapping them,
in contrast with expectations.

LVM/DM thin does set discard_zeroes_data=0,
because it silently ignores discards to partial chunks.

We can work around this by checking the alignment first.
For unaligned (wrt. alignment and granularity) or too small discards,
we zero-out the initial (and/or) trailing unaligned partial chunks,
but discard all the aligned full chunks.

At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".

Arguably it should behave this way internally, by default,
and we'll try to make that happen.

But our workaround is still valid for already deployed setups,
and for other devices that may behave this way.

Setting discard-zeroes-if-aligned=yes will allow DRBD to use
discards, and to announce discard_zeroes_data=true, even on
backends that announce discard_zeroes_data=false.

Setting discard-zeroes-if-aligned=no will cause DRBD to always
fall-back to zero-out on the receiving side, and to not even
announce discard capabilities on the Primary, if the respective
backend announces discard_zeroes_data=false.

We used to ignore the discard_zeroes_data setting completely.
To not break established and expected behaviour, and suddenly
cause fstrim on thin-provisioned LVs to run out-of-space,
instead of freeing up space, the default value is "yes".

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_int.h  |   2 +-
 drivers/block/drbd/drbd_nl.c   |  15 ++--
 drivers/block/drbd/drbd_receiver.c | 140 ++---
 include/linux/drbd_genl.h  |   1 +
 include/linux/drbd_limits.h|   6 ++
 5 files changed, 134 insertions(+), 30 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 1a93f4f..8cc2955 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1488,7 +1488,7 @@ enum determine_dev_size {
 extern enum determine_dev_size
 drbd_determine_dev_size(struct drbd_device *, enum dds_flags, struct 
resize_parms *) __must_hold(local);
 extern void resync_after_online_grow(struct drbd_device *);
-extern void drbd_reconsider_max_bio_size(struct drbd_device *device, struct 
drbd_backing_dev *bdev);
+extern void drbd_reconsider_queue_parameters(struct drbd_device *device, 
struct drbd_backing_dev *bdev);
 extern enum drbd_state_rv drbd_set_role(struct drbd_device *device,
enum drbd_role new_role,
int force);
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 3643f9c..8d757d6 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1161,13 +1161,17 @@ static void drbd_setup_queue_param(struct drbd_device 
*device, struct drbd_backi
unsigned int max_hw_sectors = max_bio_size >> 9;
unsigned int max_segments = 0;
struct request_queue *b = NULL;
+   struct disk_conf *dc;
+   bool discard_zeroes_if_aligned = true;
 
if (bdev) {
b = bdev->backing_bdev->bd_disk->queue;
 
max_hw_sectors = min(queue_max_hw_sectors(b), max_bio_size >> 
9);
rcu_read_lock();
-   max_segments = 
rcu_dereference(device->ldev->disk_conf)->max_bio_bvecs;
+   dc = rcu_dereference(device->ldev->disk_conf);
+   max_segments = dc->max_bio_bvecs;
+   discard_zeroes_if_aligned = dc->discard_zeroes_if_aligned;
rcu_read_unlock();
 
blk_set_stacking_limits(&q->limits);
@@ -1185,7 +1189,7 @@ static void drbd_setup_queue_param(struct drbd_device 
*device, struct drbd_backi
 
blk_queue_max_discard_sec

[PATCH 00/30] DRBD updates

2016-06-13 Thread Philipp Reisner
Hi Jens,

I sent this already on April 25; I guess it was too late in the cycle
at that time. Apart from the usual maintenance and bug fixes this time comes
support for WRITE_SAME and lots of improvements for DISCARD.

At that time we had a discussion about (1) the all_zero() heuristic introduced
with [PATCH 04/30] drbd: Implement handling of thinly provisioned storage...
not being efficient, and (2) about the rs-discard-granularity configuration
parameter.

Regarding (1): I intend to work on block-devices being able to export their
allocation map by either FIEMAP or SEEK_HOLE/SEEK_DATA or both for the next
cycle. Then I will change DRBD to use that as well.

Regarding (2): We need to announce the discard granularity when we create the
device/minor. At that point it might be that there is no connection to the peer
node. So we are left with information about the discard granularity of the
local backing device only. Therefore we decided to delegate it to the
user/admin to provide the discard granularity for the resync process.


Please add it to your for-4.8/drivers branch.
Thanks!

Fabian Frederick (1):
  drbd: code cleanups without semantic changes

Lars Ellenberg (24):
  drbd: bitmap bulk IO: do not always suspend IO
  drbd: change bitmap write-out when leaving resync states
  drbd: adjust assert in w_bitmap_io to account for
BM_LOCKED_CHANGE_ALLOWED
  drbd: fix regression: protocol A sometimes synchronous, C sometimes
double-latency
  drbd: fix for truncated minor number in callback command line
  drbd: allow parallel flushes for multi-volume resources
  drbd: when receiving P_TRIM, zero-out partial unaligned chunks
  drbd: possibly disable discard support, if backend has
discard_zeroes_data=0
  drbd: zero-out partial unaligned discards on local backend
  drbd: allow larger max_discard_sectors
  drbd: finish resync on sync source only by notification from sync
target
  drbd: introduce unfence-peer handler
  drbd: don't forget error completion when "unsuspending" IO
  drbd: if there is no good data accessible, writes should be IO errors
  drbd: only restart frozen disk io when D_UP_TO_DATE
  drbd: discard_zeroes_if_aligned allows "thin" resync for
discard_zeroes_data=0
  drbd: report sizes if rejecting too small peer disk
  drbd: introduce WRITE_SAME support
  drbd: sync_handshake: handle identical uuids with current (frozen)
Primary
  drbd: disallow promotion during resync handshake, avoid deadlock and
hard reset
  drbd: bump current uuid when resuming IO with diskless peer
  drbd: finally report ms, not jiffies, in log message
  drbd: al_write_transaction: skip re-scanning of bitmap page pointer
array
  drbd: correctly handle failed crypto_alloc_hash

Philipp Reisner (4):
  drbd: Kill code duplication
  drbd: Implement handling of thinly provisioned storage on resync
target nodes
  drbd: Introduce new disk config option rs-discard-granularity
  drbd: Create the protocol feature THIN_RESYNC

Roland Kammerer (1):
  drbd: get rid of empty statement in is_valid_state

 drivers/block/drbd/drbd_actlog.c   |  29 +-
 drivers/block/drbd/drbd_bitmap.c   |  84 --
 drivers/block/drbd/drbd_debugfs.c  |  13 +-
 drivers/block/drbd/drbd_int.h  |  49 +++-
 drivers/block/drbd/drbd_interval.h |  14 +-
 drivers/block/drbd/drbd_main.c | 115 +++-
 drivers/block/drbd/drbd_nl.c   | 282 +++-
 drivers/block/drbd/drbd_proc.c |  30 +--
 drivers/block/drbd/drbd_protocol.h |  77 +-
 drivers/block/drbd/drbd_receiver.c | 534 ++---
 drivers/block/drbd/drbd_req.c  |  84 --
 drivers/block/drbd/drbd_req.h  |   5 +-
 drivers/block/drbd/drbd_state.c|  61 -
 drivers/block/drbd/drbd_state.h|   2 +-
 drivers/block/drbd/drbd_strings.c  |   8 +-
 drivers/block/drbd/drbd_worker.c   |  85 +-
 include/linux/drbd.h   |  10 +-
 include/linux/drbd_genl.h  |   7 +-
 include/linux/drbd_limits.h|  15 +-
 19 files changed, 1205 insertions(+), 299 deletions(-)

-- 
2.7.4



[PATCH 04/30] drbd: Implement handling of thinly provisioned storage on resync target nodes

2016-06-13 Thread Philipp Reisner
If during resync we read only zeroes for a range of sectors, assume
that these sectors can be discarded on the sync target node.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_int.h  |  5 +++
 drivers/block/drbd/drbd_main.c | 18 
 drivers/block/drbd/drbd_protocol.h |  4 ++
 drivers/block/drbd/drbd_receiver.c | 88 --
 drivers/block/drbd/drbd_worker.c   | 29 -
 5 files changed, 140 insertions(+), 4 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 7a1cf7e..1a93f4f 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -471,6 +471,9 @@ enum {
/* this originates from application on peer
 * (not some resync or verify or other DRBD internal request) */
__EE_APPLICATION,
+
+   /* If it contains only 0 bytes, send back P_RS_DEALLOCATED */
+   __EE_RS_THIN_REQ,
 };
 #define EE_CALL_AL_COMPLETE_IO (1<<__EE_CALL_AL_COMPLETE_IO)
 #define EE_MAY_SET_IN_SYNC (1<<__EE_MAY_SET_IN_SYNC)
@@ -485,6 +488,7 @@ enum {
 #define EE_SUBMITTED   (1<<__EE_SUBMITTED)
 #define EE_WRITE   (1<<__EE_WRITE)
 #define EE_APPLICATION (1<<__EE_APPLICATION)
+#define EE_RS_THIN_REQ (1<<__EE_RS_THIN_REQ)
 
 /* flag bits per device */
 enum {
@@ -1123,6 +1127,7 @@ extern int drbd_send_ov_request(struct drbd_peer_device 
*, sector_t sector, int
 extern int drbd_send_bitmap(struct drbd_device *device);
 extern void drbd_send_sr_reply(struct drbd_peer_device *, enum drbd_state_rv 
retcode);
 extern void conn_send_sr_reply(struct drbd_connection *connection, enum 
drbd_state_rv retcode);
+extern int drbd_send_rs_deallocated(struct drbd_peer_device *, struct 
drbd_peer_request *);
 extern void drbd_backing_dev_free(struct drbd_device *device, struct 
drbd_backing_dev *ldev);
 extern void drbd_device_cleanup(struct drbd_device *device);
 void drbd_print_uuids(struct drbd_device *device, const char *text);
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 4c64cb9..dd2432e 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1377,6 +1377,22 @@ int drbd_send_ack_ex(struct drbd_peer_device 
*peer_device, enum drbd_packet cmd,
  cpu_to_be64(block_id));
 }
 
+int drbd_send_rs_deallocated(struct drbd_peer_device *peer_device,
+struct drbd_peer_request *peer_req)
+{
+   struct drbd_socket *sock;
+   struct p_block_desc *p;
+
+   sock = &peer_device->connection->data;
+   p = drbd_prepare_command(peer_device, sock);
+   if (!p)
+   return -EIO;
+   p->sector = cpu_to_be64(peer_req->i.sector);
+   p->blksize = cpu_to_be32(peer_req->i.size);
+   p->pad = 0;
+   return drbd_send_command(peer_device, sock, P_RS_DEALLOCATED, 
sizeof(*p), NULL, 0);
+}
+
 int drbd_send_drequest(struct drbd_peer_device *peer_device, int cmd,
   sector_t sector, int size, u64 block_id)
 {
@@ -3681,6 +3697,8 @@ const char *cmdname(enum drbd_packet cmd)
[P_CONN_ST_CHG_REPLY]   = "conn_st_chg_reply",
[P_RETRY_WRITE] = "retry_write",
[P_PROTOCOL_UPDATE] = "protocol_update",
+   [P_RS_THIN_REQ] = "rs_thin_req",
+   [P_RS_DEALLOCATED]  = "rs_deallocated",
 
/* enum drbd_packet, but not commands - obsoleted flags:
 *  P_MAY_IGNORE
diff --git a/drivers/block/drbd/drbd_protocol.h 
b/drivers/block/drbd/drbd_protocol.h
index ef92453..e5e74e3 100644
--- a/drivers/block/drbd/drbd_protocol.h
+++ b/drivers/block/drbd/drbd_protocol.h
@@ -60,6 +60,10 @@ enum drbd_packet {
 * which is why I chose TRIM here, to disambiguate. */
P_TRIM= 0x31,
 
+   /* Only use these two if both support FF_THIN_RESYNC */
+   P_RS_THIN_REQ = 0x32, /* Request a block for resync or reply 
P_RS_DEALLOCATED */
+   P_RS_DEALLOCATED  = 0x33, /* Contains only zeros on sync source 
node */
+
P_MAY_IGNORE  = 0x100, /* Flag to test if (cmd > P_MAY_IGNORE) 
... */
P_MAX_OPT_CMD = 0x101,
 
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 8b30ab5..3a6c2ec 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1417,9 +1417,15 @@ int drbd_submit_peer_request(struct drbd_device *device,
 * so we can find it to present it in debugfs */
peer_req->submit_jif = jiffies;
peer_req->flags |= EE_SUBMITTED;
-   spin_lock_irq(&device->resource->req_lock);
-   list_add_tail(&peer_req->w.list, &device->active_

[PATCH 05/30] drbd: Introduce new disk config option rs-discard-granularity

2016-06-13 Thread Philipp Reisner
As long as the value is 0 the feature is disabled. With setting
it to a positive value, DRBD limits and aligns its resync requests
to the rs-discard-granularity setting. If the sync source detects
all zeros in such a block, the resync target discards the range
on disk.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 32 +---
 include/linux/drbd_genl.h|  6 +++---
 include/linux/drbd_limits.h  |  6 ++
 3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index fad03e4..99339df 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1348,12 +1348,38 @@ static bool write_ordering_changed(struct disk_conf *a, 
struct disk_conf *b)
a->disk_drain != b->disk_drain;
 }
 
-static void sanitize_disk_conf(struct disk_conf *disk_conf, struct 
drbd_backing_dev *nbc)
+static void sanitize_disk_conf(struct drbd_device *device, struct disk_conf 
*disk_conf,
+  struct drbd_backing_dev *nbc)
 {
+   struct request_queue * const q = nbc->backing_bdev->bd_disk->queue;
+
if (disk_conf->al_extents < DRBD_AL_EXTENTS_MIN)
disk_conf->al_extents = DRBD_AL_EXTENTS_MIN;
if (disk_conf->al_extents > drbd_al_extents_max(nbc))
disk_conf->al_extents = drbd_al_extents_max(nbc);
+
+   if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data) {
+   disk_conf->rs_discard_granularity = 0; /* disable feature */
+   drbd_info(device, "rs_discard_granularity feature disabled\n");
+   }
+
+   if (disk_conf->rs_discard_granularity) {
+   int orig_value = disk_conf->rs_discard_granularity;
+   int remainder;
+
+   if (q->limits.discard_granularity > 
disk_conf->rs_discard_granularity)
+   disk_conf->rs_discard_granularity = 
q->limits.discard_granularity;
+
+   remainder = disk_conf->rs_discard_granularity % 
q->limits.discard_granularity;
+   disk_conf->rs_discard_granularity += remainder;
+
+   if (disk_conf->rs_discard_granularity > 
q->limits.max_discard_sectors << 9)
+   disk_conf->rs_discard_granularity = 
q->limits.max_discard_sectors << 9;
+
+   if (disk_conf->rs_discard_granularity != orig_value)
+   drbd_info(device, "rs_discard_granularity changed to 
%d\n",
+ disk_conf->rs_discard_granularity);
+   }
 }
 
 int drbd_adm_disk_opts(struct sk_buff *skb, struct genl_info *info)
@@ -1403,7 +1429,7 @@ int drbd_adm_disk_opts(struct sk_buff *skb, struct 
genl_info *info)
if (!expect(new_disk_conf->resync_rate >= 1))
new_disk_conf->resync_rate = 1;
 
-   sanitize_disk_conf(new_disk_conf, device->ldev);
+   sanitize_disk_conf(device, new_disk_conf, device->ldev);
 
if (new_disk_conf->c_plan_ahead > DRBD_C_PLAN_AHEAD_MAX)
new_disk_conf->c_plan_ahead = DRBD_C_PLAN_AHEAD_MAX;
@@ -1698,7 +1724,7 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info 
*info)
if (retcode != NO_ERROR)
goto fail;
 
-   sanitize_disk_conf(new_disk_conf, nbc);
+   sanitize_disk_conf(device, new_disk_conf, nbc);
 
if (drbd_get_max_capacity(nbc) < new_disk_conf->disk_size) {
drbd_err(device, "max capacity %llu smaller than disk size 
%llu\n",
diff --git a/include/linux/drbd_genl.h b/include/linux/drbd_genl.h
index 2d0e5ad..ab649d8 100644
--- a/include/linux/drbd_genl.h
+++ b/include/linux/drbd_genl.h
@@ -123,14 +123,14 @@ GENL_struct(DRBD_NLA_DISK_CONF, 3, disk_conf,
__u32_field_def(13, DRBD_GENLA_F_MANDATORY, c_fill_target, 
DRBD_C_FILL_TARGET_DEF)
__u32_field_def(14, DRBD_GENLA_F_MANDATORY, c_max_rate, 
DRBD_C_MAX_RATE_DEF)
__u32_field_def(15, DRBD_GENLA_F_MANDATORY, c_min_rate, 
DRBD_C_MIN_RATE_DEF)
+   __u32_field_def(20, DRBD_GENLA_F_MANDATORY, disk_timeout, 
DRBD_DISK_TIMEOUT_DEF)
+   __u32_field_def(21, 0 /* OPTIONAL */,   read_balancing, 
DRBD_READ_BALANCING_DEF)
+   __u32_field_def(25, 0 /* OPTIONAL */,   rs_discard_granularity, 
DRBD_RS_DISCARD_GRANULARITY_DEF)
 
__flg_field_def(16, DRBD_GENLA_F_MANDATORY, disk_barrier, 
DRBD_DISK_BARRIER_DEF)
__flg_field_def(17, DRBD_GENLA_F_MANDATORY, disk_flushes, 
DRBD_DISK_FLUSHES_DEF)
__flg_field_def(18, DRBD_GENLA_F_MANDATORY, disk_drain, 
DRBD_DISK_DRAIN_DEF)
__flg_field_def(19, DRBD_GENLA_F_MANDATORY, md_flushes, 
DRBD_MD_FLUSHES_DEF)
-   __u32_field_def(20, DRBD_GENLA_F_MANDATORY, disk_timeout, 
DRBD_DISK_TIMEOUT_DEF)
-   __u32_fie

[PATCH 03/30] drbd: Kill code duplication

2016-06-13 Thread Philipp Reisner
Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 0bac9c8..fad03e4 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1348,6 +1348,14 @@ static bool write_ordering_changed(struct disk_conf *a, 
struct disk_conf *b)
a->disk_drain != b->disk_drain;
 }
 
+static void sanitize_disk_conf(struct disk_conf *disk_conf, struct 
drbd_backing_dev *nbc)
+{
+   if (disk_conf->al_extents < DRBD_AL_EXTENTS_MIN)
+   disk_conf->al_extents = DRBD_AL_EXTENTS_MIN;
+   if (disk_conf->al_extents > drbd_al_extents_max(nbc))
+   disk_conf->al_extents = drbd_al_extents_max(nbc);
+}
+
 int drbd_adm_disk_opts(struct sk_buff *skb, struct genl_info *info)
 {
struct drbd_config_context adm_ctx;
@@ -1395,10 +1403,7 @@ int drbd_adm_disk_opts(struct sk_buff *skb, struct 
genl_info *info)
if (!expect(new_disk_conf->resync_rate >= 1))
new_disk_conf->resync_rate = 1;
 
-   if (new_disk_conf->al_extents < DRBD_AL_EXTENTS_MIN)
-   new_disk_conf->al_extents = DRBD_AL_EXTENTS_MIN;
-   if (new_disk_conf->al_extents > drbd_al_extents_max(device->ldev))
-   new_disk_conf->al_extents = drbd_al_extents_max(device->ldev);
+   sanitize_disk_conf(new_disk_conf, device->ldev);
 
if (new_disk_conf->c_plan_ahead > DRBD_C_PLAN_AHEAD_MAX)
new_disk_conf->c_plan_ahead = DRBD_C_PLAN_AHEAD_MAX;
@@ -1693,10 +1698,7 @@ int drbd_adm_attach(struct sk_buff *skb, struct 
genl_info *info)
if (retcode != NO_ERROR)
goto fail;
 
-   if (new_disk_conf->al_extents < DRBD_AL_EXTENTS_MIN)
-   new_disk_conf->al_extents = DRBD_AL_EXTENTS_MIN;
-   if (new_disk_conf->al_extents > drbd_al_extents_max(nbc))
-   new_disk_conf->al_extents = drbd_al_extents_max(nbc);
+   sanitize_disk_conf(new_disk_conf, nbc);
 
if (drbd_get_max_capacity(nbc) < new_disk_conf->disk_size) {
drbd_err(device, "max capacity %llu smaller than disk size 
%llu\n",
-- 
2.7.4



[PATCH 25/30] drbd: bump current uuid when resuming IO with diskless peer

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Scenario, starting with normal operation
 Connected Primary/Secondary UpToDate/UpToDate
 NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen)
 ... more failures happen, secondary loses its disk,
 but eventually is able to re-establish the replication link ...
 Connected Primary/Secondary UpToDate/Diskless (resumed; needs to bump uuid!)

We used to just resume/resend suspended requests
without bumping the UUID.

Which will lead to problems later, when we want to re-attach the disk on
the peer, without first disconnecting, or if we experience additional
failures, because we now have diverging data without being able to
recognize it.

Make sure we also bump the current data generation UUID,
if we notice "peer disk unknown" -> "peer disk known bad".
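The new predicate can be exercised in isolation. Below is a hedged userspace
sketch: the enum mirrors the relative ordering of the kernel's
enum drbd_disk_state (D_DISKLESS lowest, D_UP_TO_DATE highest), which the
comparisons in the diff rely on; the standalone harness is not part of the
patch.

```c
#include <stdbool.h>

/* Assumption: same relative order as enum drbd_disk_state in drbd.h. */
enum drbd_disk_state {
	D_DISKLESS,
	D_ATTACHING,
	D_FAILED,
	D_NEGOTIATING,
	D_INCONSISTENT,
	D_OUTDATED,
	D_UNKNOWN,
	D_CONSISTENT,
	D_UP_TO_DATE,
};

/* Same logic as the patch; takes old and new peer disk state. */
static bool lost_contact_to_peer_data(enum drbd_disk_state os,
				      enum drbd_disk_state ns)
{
	/* Peer had known-usable data, and now it no longer does. */
	if ((os >= D_INCONSISTENT && os != D_UNKNOWN && os != D_OUTDATED)
	&&  (ns < D_INCONSISTENT || ns == D_UNKNOWN || ns == D_OUTDATED))
		return true;

	/* New case: "peer disk unknown" (frozen NetworkFailure) resolves
	 * to "peer disk known bad" -- this must also bump the UUID. */
	if (os == D_UNKNOWN
	&&  (ns == D_DISKLESS || ns == D_FAILED || ns == D_OUTDATED))
		return true;

	return false;
}
```

Note that D_OUTDATED as the *old* state is deliberately excluded from the
first clause: an already-outdated peer does not constitute new divergence.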

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_state.c | 34 --
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 7562c5c..a1b5e6c9 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -1637,6 +1637,26 @@ static void broadcast_state_change(struct 
drbd_state_change *state_change)
 #undef REMEMBER_STATE_CHANGE
 }
 
+/* takes old and new peer disk state */
+static bool lost_contact_to_peer_data(enum drbd_disk_state os, enum 
drbd_disk_state ns)
+{
+   if ((os >= D_INCONSISTENT && os != D_UNKNOWN && os != D_OUTDATED)
+   &&  (ns < D_INCONSISTENT || ns == D_UNKNOWN || ns == D_OUTDATED))
+   return true;
+
+   /* Scenario, starting with normal operation
+* Connected Primary/Secondary UpToDate/UpToDate
+* NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen)
+* ...
+* Connected Primary/Secondary UpToDate/Diskless (resumed; needs to 
bump uuid!)
+*/
+   if (os == D_UNKNOWN
+   &&  (ns == D_DISKLESS || ns == D_FAILED || ns == D_OUTDATED))
+   return true;
+
+   return false;
+}
+
 /**
  * after_state_ch() - Perform after state change actions that may sleep
  * @device:DRBD device.
@@ -1708,6 +1728,13 @@ static void after_state_ch(struct drbd_device *device, 
union drbd_state os,
idr_for_each_entry(&connection->peer_devices, 
peer_device, vnr)
clear_bit(NEW_CUR_UUID, 
&peer_device->device->flags);
rcu_read_unlock();
+
+   /* We should actively create a new uuid, _before_
+* we resume/resent, if the peer is diskless
+* (recovery from a multiple error scenario).
+* Currently, this happens with a slight delay
+* below when checking lost_contact_to_peer_data() ...
+*/
_tl_restart(connection, RESEND);
_conn_request_state(connection,
(union drbd_state) { { .susp_fen = 
1 } },
@@ -1751,12 +1778,7 @@ static void after_state_ch(struct drbd_device *device, 
union drbd_state os,
BM_LOCKED_TEST_ALLOWED);
 
/* Lost contact to peer's copy of the data */
-   if ((os.pdsk >= D_INCONSISTENT &&
-os.pdsk != D_UNKNOWN &&
-os.pdsk != D_OUTDATED)
-   &&  (ns.pdsk < D_INCONSISTENT ||
-ns.pdsk == D_UNKNOWN ||
-ns.pdsk == D_OUTDATED)) {
+   if (lost_contact_to_peer_data(os.pdsk, ns.pdsk)) {
if (get_ldev(device)) {
if ((ns.role == R_PRIMARY || ns.peer == R_PRIMARY) &&
device->ldev->md.uuid[UI_BITMAP] == 0 && ns.disk >= 
D_UP_TO_DATE) {
-- 
2.7.4



[PATCH 12/30] drbd: possibly disable discard support, if backend has discard_zeroes_data=0

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Now that we have the discard_zeroes_if_aligned setting, we should also
check it when setting up our queue parameters on the primary,
not only on the receiving side.

We announce discard support,
UNLESS

 * we are connected to a peer that does not support TRIM
   on the DRBD protocol level.  Otherwise, it would either discard, or
   do a fallback to zero-out, depending on its backend and configuration.

 * our local backend does not support discards,
   or (discard_zeroes_data=0 AND discard_zeroes_if_aligned=no).
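The two conditions above reduce to a small predicate. A hedged userspace
sketch follows the commit message and the decide_on_discard_support() hunk
below; the boolean parameter names are invented here (in the driver they come
from the backing queue limits, the connection state and the agreed features),
and the diskless case (no backing queue, treated as "can do") is folded into
backend_discard:

```c
#include <stdbool.h>

static bool announce_discard_support(bool backend_discard,
				     bool discard_zeroes_data,
				     bool discard_zeroes_if_aligned,
				     bool connected,
				     bool peer_supports_trim)
{
	/* Local backend does not support discards at all. */
	if (!backend_discard)
		return false;
	/* discard_zeroes_data=0 AND discard_zeroes_if_aligned=no */
	if (!discard_zeroes_data && !discard_zeroes_if_aligned)
		return false;
	/* Connected to a peer that does not speak TRIM on the DRBD level. */
	if (connected && !peer_supports_trim)
		return false;
	return true;
}
```

While not yet connected, the peer's capabilities are unknown, so only the
local conditions apply; the handshake may disable discards later.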

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 80 ++--
 1 file changed, 55 insertions(+), 25 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 8d757d6..12e9b31 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1154,6 +1154,59 @@ static int drbd_check_al_size(struct drbd_device 
*device, struct disk_conf *dc)
return 0;
 }
 
+static void blk_queue_discard_granularity(struct request_queue *q, unsigned 
int granularity)
+{
+   q->limits.discard_granularity = granularity;
+}
+static void decide_on_discard_support(struct drbd_device *device,
+   struct request_queue *q,
+   struct request_queue *b,
+   bool discard_zeroes_if_aligned)
+{
+   /* q = drbd device queue (device->rq_queue)
+* b = backing device queue 
(device->ldev->backing_bdev->bd_disk->queue),
+* or NULL if diskless
+*/
+   struct drbd_connection *connection = 
first_peer_device(device)->connection;
+   bool can_do = b ? blk_queue_discard(b) : true;
+
+   if (can_do && b && !b->limits.discard_zeroes_data && 
!discard_zeroes_if_aligned) {
+   can_do = false;
+   drbd_info(device, "discard_zeroes_data=0 and 
discard_zeroes_if_aligned=no: disabling discards\n");
+   }
+   if (can_do && connection->cstate >= C_CONNECTED && 
!(connection->agreed_features & FF_TRIM)) {
+   can_do = false;
+   drbd_info(connection, "peer DRBD too old, does not support 
TRIM: disabling discards\n");
+   }
+   if (can_do) {
+   /* We don't care for the granularity, really.
+* Stacking limits below should fix it for the local
+* device.  Whether or not it is a suitable granularity
+* on the remote device is not our problem, really. If
+* you care, you need to use devices with similar
+* topology on all peers. */
+   blk_queue_discard_granularity(q, 512);
+   q->limits.max_discard_sectors = DRBD_MAX_DISCARD_SECTORS;
+   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
+   } else {
+   queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
+   blk_queue_discard_granularity(q, 0);
+   q->limits.max_discard_sectors = 0;
+   }
+}
+
+static void fixup_discard_if_not_supported(struct request_queue *q)
+{
+   /* To avoid confusion, if this queue does not support discard, clear
+* max_discard_sectors, which is what lsblk -D reports to the user.
+* Older kernels got this wrong in "stack limits".
+* */
+   if (!blk_queue_discard(q)) {
+   blk_queue_max_discard_sectors(q, 0);
+   blk_queue_discard_granularity(q, 0);
+   }
+}
+
 static void drbd_setup_queue_param(struct drbd_device *device, struct 
drbd_backing_dev *bdev,
   unsigned int max_bio_size)
 {
@@ -1183,26 +1236,8 @@ static void drbd_setup_queue_param(struct drbd_device 
*device, struct drbd_backi
/* This is the workaround for "bio would need to, but cannot, be split" 
*/
blk_queue_max_segments(q, max_segments ? max_segments : 
BLK_MAX_SEGMENTS);
blk_queue_segment_boundary(q, PAGE_SIZE-1);
-
+   decide_on_discard_support(device, q, b, discard_zeroes_if_aligned);
if (b) {
-   struct drbd_connection *connection = 
first_peer_device(device)->connection;
-
-   blk_queue_max_discard_sectors(q, DRBD_MAX_DISCARD_SECTORS);
-
-   if (blk_queue_discard(b) && (b->limits.discard_zeroes_data || 
discard_zeroes_if_aligned) &&
-   (connection->cstate < C_CONNECTED || 
connection->agreed_features & FF_TRIM)) {
-   /* We don't care, stacking below should fix it for the 
local device.
-* Whether or not it is a suitable granularity on the 
remote device
-* is not our problem, really. If you care, you need to
-* use devices with similar topology on a

[PATCH 14/30] drbd: allow larger max_discard_sectors

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Make sure we have at least 67 (> AL_UPDATES_PER_TRANSACTION)
al-extents available, and allow up to half of that to be
discarded in one bio.
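The resulting limit is easy to compute. Assuming DRBD's usual constants,
AL_EXTENT_SHIFT = 22 (4 MiB activity-log extents) and
AL_UPDATES_PER_TRANSACTION = 64 -- values defined elsewhere, not shown in
this hunk -- the sketch below reproduces the macro arithmetic from the patch:

```c
/* Assumed DRBD constants (not part of this diff). */
#define AL_EXTENT_SHIFT 22
#define AL_EXTENT_SIZE  (1U << AL_EXTENT_SHIFT)	/* 4 MiB */
#define AL_UPDATES_PER_TRANSACTION 64

/* As in the patch: half of one AL transaction's worth of extents. */
#define DRBD_MAX_DISCARD_SIZE    (AL_UPDATES_PER_TRANSACTION / 2 * AL_EXTENT_SIZE)
#define DRBD_MAX_DISCARD_SECTORS (DRBD_MAX_DISCARD_SIZE >> 9)
```

With those values, one discard bio may span up to 32 extents, i.e. 128 MiB
or 262144 sectors, and DRBD_AL_EXTENTS_MIN = 67 guarantees more than one
full transaction (64 updates) worth of extents is always available.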

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_actlog.c | 2 +-
 drivers/block/drbd/drbd_int.h| 8 
 include/linux/drbd_limits.h  | 3 +--
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 10459a1..1664762 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -256,7 +256,7 @@ bool drbd_al_begin_io_fastpath(struct drbd_device *device, 
struct drbd_interval
unsigned first = i->sector >> (AL_EXTENT_SHIFT-9);
unsigned last = i->size == 0 ? first : (i->sector + (i->size >> 9) - 1) 
>> (AL_EXTENT_SHIFT-9);
 
-   D_ASSERT(device, (unsigned)(last - first) <= 1);
+   D_ASSERT(device, first <= last);
D_ASSERT(device, atomic_read(&device->local_cnt) > 0);
 
/* FIXME figure out a fast path for bios crossing AL extent boundaries 
*/
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index d818e7d..d82e531 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1347,10 +1347,10 @@ struct bm_extent {
 #define DRBD_MAX_SIZE_H80_PACKET (1U << 15) /* Header 80 only allows packets 
up to 32KiB data */
 #define DRBD_MAX_BIO_SIZE_P95(1U << 17) /* Protocol 95 to 99 allows bios 
up to 128KiB */
 
-/* For now, don't allow more than one activity log extent worth of data
- * to be discarded in one go. We may need to rework drbd_al_begin_io()
- * to allow for even larger discard ranges */
-#define DRBD_MAX_DISCARD_SIZE  AL_EXTENT_SIZE
+/* For now, don't allow more than half of what we can "activate" in one
+ * activity log transaction to be discarded in one go. We may need to rework
+ * drbd_al_begin_io() to allow for even larger discard ranges */
+#define DRBD_MAX_DISCARD_SIZE  (AL_UPDATES_PER_TRANSACTION/2*AL_EXTENT_SIZE)
 #define DRBD_MAX_DISCARD_SECTORS (DRBD_MAX_DISCARD_SIZE >> 9)
 
 extern int  drbd_bm_init(struct drbd_device *device);
diff --git a/include/linux/drbd_limits.h b/include/linux/drbd_limits.h
index a351c40..ddac684 100644
--- a/include/linux/drbd_limits.h
+++ b/include/linux/drbd_limits.h
@@ -126,8 +126,7 @@
 #define DRBD_RESYNC_RATE_DEF 250
 #define DRBD_RESYNC_RATE_SCALE 'k'  /* kilobytes */
 
-  /* less than 7 would hit performance unnecessarily. */
-#define DRBD_AL_EXTENTS_MIN  7
+#define DRBD_AL_EXTENTS_MIN  67
   /* we use u16 as "slot number", (u16)~0 is "FREE".
* If you use >= 292 kB on-disk ring buffer,
* this is the maximum you can use: */
-- 
2.7.4



[PATCH 13/30] drbd: zero-out partial unaligned discards on local backend

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

For consistency, also zero-out partial unaligned chunks of discard
requests on the local backend.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_int.h |  2 ++
 drivers/block/drbd/drbd_req.c | 29 +++--
 2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 8cc2955..d818e7d 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1553,6 +1553,8 @@ extern void start_resync_timer_fn(unsigned long data);
 extern void drbd_endio_write_sec_final(struct drbd_peer_request *peer_req);
 
 /* drbd_receiver.c */
+extern int drbd_issue_discard_or_zero_out(struct drbd_device *device,
+   sector_t start, unsigned int nr_sectors, bool discard);
 extern int drbd_receiver(struct drbd_thread *thi);
 extern int drbd_ack_receiver(struct drbd_thread *thi);
 extern void drbd_send_ping_wf(struct work_struct *ws);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 6dbf1f1..7e441ff 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1156,6 +1156,16 @@ static int drbd_process_write_request(struct 
drbd_request *req)
return remote;
 }
 
+static void drbd_process_discard_req(struct drbd_request *req)
+{
+   int err = drbd_issue_discard_or_zero_out(req->device,
+   req->i.sector, req->i.size >> 9, true);
+
+   if (err)
+   req->private_bio->bi_error = -EIO;
+   bio_endio(req->private_bio);
+}
+
 static void
 drbd_submit_req_private_bio(struct drbd_request *req)
 {
@@ -1176,6 +1186,8 @@ drbd_submit_req_private_bio(struct drbd_request *req)
: rw == READ  ? DRBD_FAULT_DT_RD
:   DRBD_FAULT_DT_RA))
bio_io_error(bio);
+   else if (bio->bi_rw & REQ_DISCARD)
+   drbd_process_discard_req(req);
else
generic_make_request(bio);
put_ldev(device);
@@ -1227,18 +1239,23 @@ drbd_request_prepare(struct drbd_device *device, struct 
bio *bio, unsigned long
/* Update disk stats */
_drbd_start_io_acct(device, req);
 
+   /* process discards always from our submitter thread */
+   if (bio->bi_rw & REQ_DISCARD)
+   goto queue_for_submitter_thread;
+
if (rw == WRITE && req->private_bio && req->i.size
&& !test_bit(AL_SUSPENDED, &device->flags)) {
-   if (!drbd_al_begin_io_fastpath(device, &req->i)) {
-   atomic_inc(&device->ap_actlog_cnt);
-   drbd_queue_write(device, req);
-   return NULL;
-   }
+   if (!drbd_al_begin_io_fastpath(device, &req->i))
+   goto queue_for_submitter_thread;
req->rq_state |= RQ_IN_ACT_LOG;
req->in_actlog_jif = jiffies;
}
-
return req;
+
+ queue_for_submitter_thread:
+   atomic_inc(&device->ap_actlog_cnt);
+   drbd_queue_write(device, req);
+   return NULL;
 }
 
 static void drbd_send_and_submit(struct drbd_device *device, struct 
drbd_request *req)
-- 
2.7.4



[PATCH 17/30] drbd: don't forget error completion when "unsuspending" IO

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Possible sequence of events:
SyncTarget is made Primary, then loses replication link
(only path to good data on SyncSource).

Behavior is then controlled by the on-no-data-accessible policy,
which defaults to OND_IO_ERROR (may be set to OND_SUSPEND_IO).

If OND_IO_ERROR is in fact the current policy, we clear the susp_fen
(IO suspended due to fencing policy) flag, do NOT set the susp_nod
(IO suspended due to no data) flag.

But we forgot to call the IO error completion for all pending,
suspended, requests.

While at it, also add a check for a theoretically possible
race with a new handshake (network hiccup): we may be able to
re-send requests, and can avoid passing IO errors up the stack.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 48 +---
 1 file changed, 32 insertions(+), 16 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 4a4eb80..e5fdcc6 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -442,19 +442,17 @@ static enum drbd_fencing_p highest_fencing_policy(struct 
drbd_connection *connec
}
rcu_read_unlock();
 
-   if (fp == FP_NOT_AVAIL) {
-   /* IO Suspending works on the whole resource.
-  Do it only for one device. */
-   vnr = 0;
-   peer_device = idr_get_next(&connection->peer_devices, &vnr);
-   drbd_change_state(peer_device->device, CS_VERBOSE | CS_HARD, 
NS(susp_fen, 0));
-   }
-
return fp;
 }
 
+static bool resource_is_supended(struct drbd_resource *resource)
+{
+   return resource->susp || resource->susp_fen || resource->susp_nod;
+}
+
 bool conn_try_outdate_peer(struct drbd_connection *connection)
 {
+   struct drbd_resource * const resource = connection->resource;
unsigned int connect_cnt;
union drbd_state mask = { };
union drbd_state val = { };
@@ -462,21 +460,41 @@ bool conn_try_outdate_peer(struct drbd_connection 
*connection)
char *ex_to_string;
int r;
 
-   spin_lock_irq(&connection->resource->req_lock);
+   spin_lock_irq(&resource->req_lock);
if (connection->cstate >= C_WF_REPORT_PARAMS) {
drbd_err(connection, "Expected cstate < C_WF_REPORT_PARAMS\n");
-   spin_unlock_irq(&connection->resource->req_lock);
+   spin_unlock_irq(&resource->req_lock);
return false;
}
 
connect_cnt = connection->connect_cnt;
-   spin_unlock_irq(&connection->resource->req_lock);
+   spin_unlock_irq(&resource->req_lock);
 
fp = highest_fencing_policy(connection);
switch (fp) {
case FP_NOT_AVAIL:
drbd_warn(connection, "Not fencing peer, I'm not even 
Consistent myself.\n");
-   goto out;
+   spin_lock_irq(&resource->req_lock);
+   if (connection->cstate < C_WF_REPORT_PARAMS) {
+   _conn_request_state(connection,
+   (union drbd_state) { { .susp_fen = 
1 } },
+   (union drbd_state) { { .susp_fen = 
0 } },
+   CS_VERBOSE | CS_HARD | CS_DC_SUSP);
+   /* We are no longer suspended due to the fencing policy.
+* We may still be suspended due to the 
on-no-data-accessible policy.
+* If that was OND_IO_ERROR, fail pending requests. */
+   if (!resource_is_supended(resource))
+   _tl_restart(connection, 
CONNECTION_LOST_WHILE_PENDING);
+   }
+   /* Else: in case we raced with a connection handshake,
+* let the handshake figure out if we maybe can RESEND,
+* and do not resume/fail pending requests here.
+* Worst case is we stay suspended for now, which may be
+* resolved by either re-establishing the replication link, or
+* the next link failure, or eventually the administrator.  */
+   spin_unlock_irq(&resource->req_lock);
+   return false;
+
case FP_DONT_CARE:
return true;
default: ;
@@ -529,13 +547,11 @@ bool conn_try_outdate_peer(struct drbd_connection 
*connection)
drbd_info(connection, "fence-peer helper returned %d (%s)\n",
  (r>>8) & 0xff, ex_to_string);
 
- out:
-
/* Not using
   conn_request_state(connection, mask, val, CS_VERBOSE);
   here, because we might were able to re-establish the connection in 
the
   meantime. */
-   spin_lock_irq(&connection->resource->req_lock);

[PATCH 02/30] drbd: change bitmap write-out when leaving resync states

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

When leaving resync states because of disconnect,
do the bitmap write-out synchronously in the drbd_disconnected() path.

When leaving resync states because we go back to AHEAD/BEHIND, or
because resync actually finished, or some disk was lost during resync,
trigger the write-out from after_state_ch().

The bitmap write-out for resync -> ahead/behind was missing completely before.

Note that this is all only an optimization to avoid double-resyncs of
already completed blocks in case this node crashes.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 8 +---
 drivers/block/drbd/drbd_state.c| 9 +++--
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 050aaa1..8b30ab5 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -4783,9 +4783,11 @@ static int drbd_disconnected(struct drbd_peer_device 
*peer_device)
 
drbd_md_sync(device);
 
-   /* serialize with bitmap writeout triggered by the state change,
-* if any. */
-   wait_event(device->misc_wait, !test_bit(BITMAP_IO, &device->flags));
+   if (get_ldev(device)) {
+   drbd_bitmap_io(device, &drbd_bm_write_copy_pages,
+   "write from disconnected", 
BM_LOCKED_CHANGE_ALLOWED);
+   put_ldev(device);
+   }
 
/* tcp_close and release of sendpage pages can be deferred.  I don't
 * want to use SO_LINGER, because apparently it can be deferred for
diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 5a7ef78..59c6467 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -1934,12 +1934,17 @@ static void after_state_ch(struct drbd_device *device, 
union drbd_state os,
 
/* This triggers bitmap writeout of potentially still unwritten pages
 * if the resync finished cleanly, or aborted because of peer disk
-* failure, or because of connection loss.
+* failure, or on transition from resync back to AHEAD/BEHIND.
+*
+* Connection loss is handled in drbd_disconnected() by the receiver.
+*
 * For resync aborted because of local disk failure, we cannot do
 * any bitmap writeout anymore.
+*
 * No harm done if some bits change during this phase.
 */
-   if (os.conn > C_CONNECTED && ns.conn <= C_CONNECTED && 
get_ldev(device)) {
+   if ((os.conn > C_CONNECTED && os.conn < C_AHEAD) &&
+   (ns.conn == C_CONNECTED || ns.conn >= C_AHEAD) && get_ldev(device)) 
{
drbd_queue_bitmap_io(device, &drbd_bm_write_copy_pages, NULL,
"write from resync_finished", BM_LOCKED_CHANGE_ALLOWED);
put_ldev(device);
-- 
2.7.4



[PATCH 09/30] drbd: fix for truncated minor number in callback command line

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

The command line parameter the kernel module uses to communicate the
device minor to the userland helper is flawed: the device
identifier "minor-%d" is truncated to at most 5 minor digits.

But DRBD 8.4 allows 2^20 == 1048576 minors,
thus a minimum of 7 digits must be supported.

Reported by Veit Wahlich on drbd-dev.
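The sizing is easy to verify in userspace: "minor-" is 6 characters, a 2^20
minor needs up to 7 digits, and the terminating NUL makes 14 bytes; with the
old 12-byte buffer, snprintf() silently truncates. A hedged sketch (the
helper is illustrative, not driver code):

```c
#include <stdio.h>

/* Largest minor DRBD 8.4 allows: 2^20 - 1 = 1048575, i.e. 7 digits. */
#define MAX_MINOR ((1 << 20) - 1)

/* Format "minor-%d" into buf; returns the length snprintf *wanted*
 * to write (excluding the NUL), so callers can detect truncation. */
static int format_minor(char *buf, size_t size, int minor)
{
	return snprintf(buf, size, "minor-%d", minor);
}
```

With size 12 the output for MAX_MINOR is cut to "minor-10485" while the
return value is still 13; with size 14 the full "minor-1048575" fits.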

Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 99339df..3643f9c 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -343,7 +343,7 @@ int drbd_khelper(struct drbd_device *device, char *cmd)
 (char[20]) { }, /* address family */
 (char[60]) { }, /* address */
NULL };
-   char mb[12];
+   char mb[14];
char *argv[] = {usermode_helper, cmd, mb, NULL };
struct drbd_connection *connection = 
first_peer_device(device)->connection;
struct sib_info sib;
@@ -352,7 +352,7 @@ int drbd_khelper(struct drbd_device *device, char *cmd)
if (current == connection->worker.task)
set_bit(CALLBACK_PENDING, &connection->flags);
 
-   snprintf(mb, 12, "minor-%d", device_to_minor(device));
+   snprintf(mb, 14, "minor-%d", device_to_minor(device));
setup_khelper_env(connection, envp);
 
/* The helper may take some time.
-- 
2.7.4



[PATCH 08/30] drbd: fix regression: protocol A sometimes synchronous, C sometimes double-latency

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Regression introduced with 8.4.5
 drbd: application writes may set-in-sync in protocol != C

Overwriting the same block (LBA) while a former version is still
"in-flight" to the peer (to be exact: we did not receive the
P_BARRIER_ACK for its epoch yet) would wait for the full epoch of that
former version to be acknowledged by the peer.

In synchronous and quasi-synchronous protocols C and B,
this may double the latency on overwrites.

With protocol A, which is supposed to be asynchronous and only wait for
local completion, it is even worse: it would make overwrites
quasi-synchronous, they would be hit by the full RTT, which protocol A
was specifically meant to avoid, and possibly the additional time it
takes to drain the buffers first.

Particularly bad for databases, or anything else that
does frequent updates to the same blocks (various file system meta data).

No impact if >= rtt passes between updates to the same block.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_req.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 2255dcf..6dbf1f1 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -977,16 +977,20 @@ static void complete_conflicting_writes(struct 
drbd_request *req)
sector_t sector = req->i.sector;
int size = req->i.size;
 
-   i = drbd_find_overlap(&device->write_requests, sector, size);
-   if (!i)
-   return;
-
for (;;) {
-   prepare_to_wait(&device->misc_wait, &wait, 
TASK_UNINTERRUPTIBLE);
-   i = drbd_find_overlap(&device->write_requests, sector, size);
-   if (!i)
+   drbd_for_each_overlap(i, &device->write_requests, sector, size) 
{
+   /* Ignore, if already completed to upper layers. */
+   if (i->completed)
+   continue;
+   /* Handle the first found overlap.  After the schedule
+* we have to restart the tree walk. */
break;
+   }
+   if (!i) /* if any */
+   break;
+
/* Indicate to wake up device->misc_wait on progress.  */
+   prepare_to_wait(&device->misc_wait, &wait, 
TASK_UNINTERRUPTIBLE);
i->waiting = true;
spin_unlock_irq(&device->resource->req_lock);
schedule();
-- 
2.7.4



[PATCH 30/30] drbd: correctly handle failed crypto_alloc_hash

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

crypto_alloc_hash returns an ERR_PTR(), not NULL.

Also reset peer_integrity_tfm to NULL, to not call crypto_free_hash()
on an errno in the cleanup path.
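This is the classic ERR_PTR-vs-NULL confusion: the crypto allocators return
an encoded errno pointer on failure, never NULL, so a "!ptr" check passes
and the error pointer would later be handed to crypto_free_*(). A hedged
userspace mimic (simplified from the kernel's include/linux/err.h; the
fake allocator stands in for crypto_alloc_ahash()):

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified userspace mimic of the kernel's err.h helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Stand-in allocator: like crypto_alloc_ahash(), it returns an
 * ERR_PTR-encoded errno, not NULL, when the algorithm is unknown. */
static void *fake_alloc(bool known_alg)
{
	static int dummy_tfm;
	return known_alg ? (void *)&dummy_tfm : ERR_PTR(-ENOENT);
}
```

The buggy check "if (!tfm)" never fires on such a return value; the fix is
to test IS_ERR() and reset the pointer to NULL before the cleanup path.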

Reported-by: Insu Yun 

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 3 ++-
 include/linux/drbd.h   | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 5c06286..80a6aff 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -3668,7 +3668,8 @@ static int receive_protocol(struct drbd_connection 
*connection, struct packet_in
 */
 
peer_integrity_tfm = crypto_alloc_ahash(integrity_alg, 0, 
CRYPTO_ALG_ASYNC);
-   if (!peer_integrity_tfm) {
+   if (IS_ERR(peer_integrity_tfm)) {
+   peer_integrity_tfm = NULL;
drbd_err(connection, "peer data-integrity-alg %s not 
supported\n",
 integrity_alg);
goto disconnect;
diff --git a/include/linux/drbd.h b/include/linux/drbd.h
index 2b26156..002611c 100644
--- a/include/linux/drbd.h
+++ b/include/linux/drbd.h
@@ -51,7 +51,7 @@
 #endif
 
 extern const char *drbd_buildtag(void);
-#define REL_VERSION "8.4.6"
+#define REL_VERSION "8.4.7"
 #define API_VERSION 1
 #define PRO_VERSION_MIN 86
 #define PRO_VERSION_MAX 101
-- 
2.7.4



[PATCH 21/30] drbd: report sizes if rejecting too small peer disk

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 078c4d98..99f4519 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -3939,6 +3939,7 @@ static int receive_sizes(struct drbd_connection 
*connection, struct packet_info
device->p_size = p_size;
 
if (get_ldev(device)) {
+   sector_t new_size, cur_size;
rcu_read_lock();
my_usize = rcu_dereference(device->ldev->disk_conf)->disk_size;
rcu_read_unlock();
@@ -3955,11 +3956,13 @@ static int receive_sizes(struct drbd_connection 
*connection, struct packet_info
 
/* Never shrink a device with usable data during connect.
   But allow online shrinking if we are connected. */
-   if (drbd_new_dev_size(device, device->ldev, p_usize, 0) <
-   drbd_get_capacity(device->this_bdev) &&
+   new_size = drbd_new_dev_size(device, device->ldev, p_usize, 0);
+   cur_size = drbd_get_capacity(device->this_bdev);
+   if (new_size < cur_size &&
device->state.disk >= D_OUTDATED &&
device->state.conn < C_CONNECTED) {
-   drbd_err(device, "The peer's disk size is too 
small!\n");
+   drbd_err(device, "The peer's disk size is too small! 
(%llu < %llu sectors)\n",
+   (unsigned long long)new_size, (unsigned 
long long)cur_size);
conn_request_state(peer_device->connection, NS(conn, 
C_DISCONNECTING), CS_HARD);
put_ldev(device);
return -EIO;
-- 
2.7.4



[PATCH 16/30] drbd: introduce unfence-peer handler

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

When resync is finished, we already call the "after-resync-target"
handler (on the former sync target, obviously), once per volume.

Paired with the before-resync-target handler, you can create snapshots,
before the resync causes the volumes to become inconsistent,
and discard those snapshots again, once they are no longer needed.

It was also overloaded to be paired with the "fence-peer" handler,
to "unfence" once the volumes are up-to-date and known good.

This has some disadvantages, though: we call "fence-peer" for the whole
connection (once for the group of volumes), but would call unfence as
side-effect of after-resync-target once for each volume.

Also, we fence on a (current, or about to become) Primary,
which will later become the sync-source.

Calling unfence only as a side effect of the after-resync-target
handler opens a race window, between a new fence on the Primary
(SyncTarget) and the unfence on the SyncTarget, which is difficult to
close without some kind of "cluster wide lock" in those handlers.

We would not need those handlers if we could still communicate.
Which makes trying to acquire a cluster wide lock from those handlers
seem like a very bad idea.

This introduces the "unfence-peer" handler, which will be called
per connection (once for the group of volumes), just like the fence
handler, only once all volumes are back in sync, and on the SyncSource.

Which is expected to be the node that previously called "fence", the
node that is currently allowed to be Primary, and thus the only node
that could trigger a new "fence" that could race with this unfence.

Which makes us not need any cluster wide synchronization here,
serializing two scripts running on the same node is trivial.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_int.h|  1 +
 drivers/block/drbd/drbd_nl.c |  2 +-
 drivers/block/drbd/drbd_worker.c | 28 ++--
 3 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 451a745..cb42f6c 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1494,6 +1494,7 @@ extern enum drbd_state_rv drbd_set_role(struct 
drbd_device *device,
int force);
 extern bool conn_try_outdate_peer(struct drbd_connection *connection);
 extern void conn_try_outdate_peer_async(struct drbd_connection *connection);
+extern int conn_khelper(struct drbd_connection *connection, char *cmd);
 extern int drbd_khelper(struct drbd_device *device, char *cmd);
 
 /* drbd_worker.c */
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 12e9b31..4a4eb80 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -387,7 +387,7 @@ int drbd_khelper(struct drbd_device *device, char *cmd)
return ret;
 }
 
-static int conn_khelper(struct drbd_connection *connection, char *cmd)
+int conn_khelper(struct drbd_connection *connection, char *cmd)
 {
char *envp[] = { "HOME=/",
"TERM=linux",
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index fa63c22..f9e142d 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -839,6 +839,7 @@ static void ping_peer(struct drbd_device *device)
 
 int drbd_resync_finished(struct drbd_device *device)
 {
+   struct drbd_connection *connection = first_peer_device(device)->connection;
unsigned long db, dt, dbdt;
unsigned long n_oos;
union drbd_state os, ns;
@@ -860,8 +861,7 @@ int drbd_resync_finished(struct drbd_device *device)
if (dw) {
dw->w.cb = w_resync_finished;
dw->device = device;
-   drbd_queue_work(&first_peer_device(device)->connection->sender_work,
-   &dw->w);
+   drbd_queue_work(&connection->sender_work, &dw->w);
+   drbd_queue_work(&connection->sender_work, &dw->w);
return 1;
}
drbd_err(device, "Warn failed to drbd_rs_del_all() and to kmalloc(dw).\n");
@@ -974,6 +974,30 @@ int drbd_resync_finished(struct drbd_device *device)
_drbd_set_state(device, ns, CS_VERBOSE, NULL);
 out_unlock:
spin_unlock_irq(&device->resource->req_lock);
+
+   /* If we have been sync source, and have an effective fencing-policy,
+* once *all* volumes are back in sync, call "unfence". */
+   if (os.conn == C_SYNC_SOURCE) {
+   enum drbd_disk_state disk_state = D_MASK;
+   enum drbd_disk_state pdsk_state = D_MASK;
+   enum drbd_fencing_p fp = FP_DONT_CARE;
+
+   rcu_read_lock();
+   fp = rcu_dereference(

[PATCH 15/30] drbd: finish resync on sync source only by notification from sync target

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

If the replication link breaks exactly during "resync finished" detection,
finishing too early on the sync source could again lead to UUIDs rotated
too fast, and potentially a spurious full resync on next handshake.

Always wait for explicit resync finished state change notification from
the sync target.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_actlog.c | 16 
 drivers/block/drbd/drbd_int.h| 19 ++-
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 1664762..4e07cff 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -768,10 +768,18 @@ static bool lazy_bitmap_update_due(struct drbd_device *device)
 
 static void maybe_schedule_on_disk_bitmap_update(struct drbd_device *device, bool rs_done)
 {
-   if (rs_done)
-   set_bit(RS_DONE, &device->flags);
-   /* and also set RS_PROGRESS below */
-   else if (!lazy_bitmap_update_due(device))
+   if (rs_done) {
+   struct drbd_connection *connection = first_peer_device(device)->connection;
+   if (connection->agreed_pro_version <= 95 ||
+   is_sync_target_state(device->state.conn))
+   set_bit(RS_DONE, &device->flags);
+   /* and also set RS_PROGRESS below */
+
+   /* Else: rather wait for explicit notification via receive_state,
+* to avoid uuids-rotated-too-fast causing full resync
+* in next handshake, in case the replication link breaks
+* at the most unfortunate time... */
+   } else if (!lazy_bitmap_update_due(device))
return;
 
drbd_device_post_work(device, RS_PROGRESS);
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index d82e531..451a745 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -2102,13 +2102,22 @@ static inline void _sub_unacked(struct drbd_device *device, int n, const char *f
ERR_IF_CNT_IS_NEGATIVE(unacked_cnt, func, line);
 }
 
+static inline bool is_sync_target_state(enum drbd_conns connection_state)
+{
+   return  connection_state == C_SYNC_TARGET ||
+   connection_state == C_PAUSED_SYNC_T;
+}
+
+static inline bool is_sync_source_state(enum drbd_conns connection_state)
+{
+   return  connection_state == C_SYNC_SOURCE ||
+   connection_state == C_PAUSED_SYNC_S;
+}
+
 static inline bool is_sync_state(enum drbd_conns connection_state)
 {
-   return
-  (connection_state == C_SYNC_SOURCE
-   ||  connection_state == C_SYNC_TARGET
-   ||  connection_state == C_PAUSED_SYNC_S
-   ||  connection_state == C_PAUSED_SYNC_T);
+   return  is_sync_source_state(connection_state) ||
+   is_sync_target_state(connection_state);
 }
 
 /**
-- 
2.7.4



[PATCH 01/30] drbd: bitmap bulk IO: do not always suspend IO

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

The intention was to only suspend IO if some normal bitmap operation is
supposed to be locked out, not always. If the bulk operation is flagged
as BM_LOCKED_CHANGE_ALLOWED, we do not need to suspend IO.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_main.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 2ba1494..4c64cb9 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -3585,18 +3585,20 @@ void drbd_queue_bitmap_io(struct drbd_device *device,
 int drbd_bitmap_io(struct drbd_device *device, int (*io_fn)(struct drbd_device *),
char *why, enum bm_flag flags)
 {
+   /* Only suspend io, if some operation is supposed to be locked out */
+   const bool do_suspend_io = flags & (BM_DONT_CLEAR|BM_DONT_SET|BM_DONT_TEST);
int rv;
 
D_ASSERT(device, current != first_peer_device(device)->connection->worker.task);
 
-   if ((flags & BM_LOCKED_SET_ALLOWED) == 0)
+   if (do_suspend_io)
drbd_suspend_io(device);
 
drbd_bm_lock(device, why, flags);
rv = io_fn(device);
drbd_bm_unlock(device);
 
-   if ((flags & BM_LOCKED_SET_ALLOWED) == 0)
+   if (do_suspend_io)
drbd_resume_io(device);
 
return rv;
-- 
2.7.4



[PATCH 23/30] drbd: sync_handshake: handle identical uuids with current (frozen) Primary

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

If, in a two-primary scenario, we lost our peer, froze IO,
and are still frozen (no UUID rotation) when the peer comes back
as Secondary after a hard crash, we will see identical UUIDs.

The "rule_nr = 40" chose to use the "CRASHED_PRIMARY" bit as
arbitration, but that would cause the still running (but frozen) Primary
to become SyncTarget (which it typically refuses), and the handshake is
declined.

Fix: check current roles.
If we have *one* current primary, the Primary wins.
(rule_nr = 41)

Since that is a protocol change, use the newly introduced DRBD_FF_WSAME
to determine if rule_nr = 41 can be applied.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 47 +++---
 1 file changed, 44 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 1320bb8..8e7afa3 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -3181,7 +3181,8 @@ static void drbd_uuid_dump(struct drbd_device *device, char *text, u64 *uuid,
 -1091   requires proto 91
 -1096   requires proto 96
  */
-static int drbd_uuid_compare(struct drbd_device *const device, int *rule_nr) __must_hold(local)
+
+static int drbd_uuid_compare(struct drbd_device *const device, enum drbd_role const peer_role, int *rule_nr) __must_hold(local)
 {
struct drbd_peer_device *const peer_device = first_peer_device(device);
struct drbd_connection *const connection = peer_device ? peer_device->connection : NULL;
@@ -3261,8 +3262,39 @@ static int drbd_uuid_compare(struct drbd_device *const device, int *rule_nr) __m
 * next bit (weight 2) is set when peer was primary */
*rule_nr = 40;
 
+   /* Neither has the "crashed primary" flag set,
+* only a replication link hickup. */
+   if (rct == 0)
+   return 0;
+
+   /* Current UUID equal and no bitmap uuid; does not necessarily
+* mean this was a "simultaneous hard crash", maybe IO was
+* frozen, so no UUID-bump happened.
+* This is a protocol change, overload DRBD_FF_WSAME as flag
+* for "new-enough" peer DRBD version. */
+   if (device->state.role == R_PRIMARY || peer_role == R_PRIMARY) {
+   *rule_nr = 41;
+   if (!(connection->agreed_features & DRBD_FF_WSAME)) {
+   drbd_warn(peer_device, "Equivalent unrotated UUIDs, but current primary present.\n");
+   return -(0x1 | PRO_VERSION_MAX | (DRBD_FF_WSAME << 8));
+   }
+   if (device->state.role == R_PRIMARY && peer_role == R_PRIMARY) {
+   /* At least one has the "crashed primary" bit set,
+    * both are primary now, but neither has rotated its UUIDs?
+    * "Can not happen." */
+   drbd_err(peer_device, "Equivalent unrotated UUIDs, but both are primary. Can not resolve this.\n");
+   return -100;
+   }
+   if (device->state.role == R_PRIMARY)
+   return 1;
+   return -1;
+   }
+
+   /* Both are secondary.
+* Really looks like recovery from simultaneous hard crash.
+* Check which had been primary before, and arbitrate. */
switch (rct) {
-   case 0: /* !self_pri && !peer_pri */ return 0;
case 0: /* !self_pri && !peer_pri */ return 0; /* already handled */
case 1: /*  self_pri && !peer_pri */ return 1;
case 2: /* !self_pri &&  peer_pri */ return -1;
case 3: /*  self_pri &&  peer_pri */
@@ -3389,7 +3421,7 @@ static enum drbd_conns drbd_sync_handshake(struct drbd_peer_device *peer_device,
drbd_uuid_dump(device, "peer", device->p_uuid,
   device->p_uuid[UI_SIZE], device->p_uuid[UI_FLAGS]);
 
-   hg = drbd_uuid_compare(device, &rule_nr);
+   hg = drbd_uuid_compare(device, peer_role, &rule_nr);
spin_unlock_irq(&device->ldev->md.uuid_lock);
 
drbd_info(device, "uuid_compare()=%d by rule %d\n", hg, rule_nr);
@@ -3398,6 +3430,15 @@ static enum drbd_conns drbd_sync_handshake(struct drbd_peer_device *peer_device,
drbd_alert(device, "Unrelated data, aborting!\n");
return C_MASK;
}
+   if (hg < -0x1) {
+   int proto, fflags;
+ 

[PATCH 27/30] drbd: get rid of empty statement in is_valid_state

2016-06-13 Thread Philipp Reisner
From: Roland Kammerer 

This should silence a warning about an empty statement. Thanks to Fabian
Frederick, who sent a patch that I modified to be smaller and to avoid
an additional indent level.

Signed-off-by: Roland Kammerer 
Signed-off-by: Philipp Reisner 
---
 drivers/block/drbd/drbd_state.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index aca68a5..eea0c4a 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -814,7 +814,7 @@ is_valid_state(struct drbd_device *device, union drbd_state ns)
}
 
if (rv <= 0)
-   /* already found a reason to abort */;
+   goto out; /* already found a reason to abort */
else if (ns.role == R_SECONDARY && device->open_cnt)
rv = SS_DEVICE_IN_USE;
 
@@ -862,6 +862,7 @@ is_valid_state(struct drbd_device *device, union drbd_state ns)
else if (ns.conn >= C_CONNECTED && ns.pdsk == D_UNKNOWN)
rv = SS_CONNECTED_OUTDATES;
 
+out:
rcu_read_unlock();
 
return rv;
-- 
2.7.4



[PATCH 20/30] drbd: discard_zeroes_if_aligned allows "thin" resync for discard_zeroes_data=0

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

Even if discard_zeroes_data != 0,
if discard_zeroes_if_aligned is set, we assume we can reliably
zero-out/discard using the drbd_issue_peer_discard() helper.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index e5fdcc6..169e3e1 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1408,9 +1408,12 @@ static void sanitize_disk_conf(struct drbd_device *device, struct disk_conf *dis
if (disk_conf->al_extents > drbd_al_extents_max(nbc))
disk_conf->al_extents = drbd_al_extents_max(nbc);
 
-   if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data) {
-   disk_conf->rs_discard_granularity = 0; /* disable feature */
-   drbd_info(device, "rs_discard_granularity feature disabled\n");
+   if (!blk_queue_discard(q)
+   || (!q->limits.discard_zeroes_data && !disk_conf->discard_zeroes_if_aligned)) {
+   if (disk_conf->rs_discard_granularity) {
+   disk_conf->rs_discard_granularity = 0; /* disable feature */
+   drbd_info(device, "rs_discard_granularity feature disabled\n");
+   }
}
 
if (disk_conf->rs_discard_granularity) {
-- 
2.7.4



[PATCH 19/30] drbd: only restart frozen disk io when D_UP_TO_DATE

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

When re-attaching the local backend device to a C_STANDALONE D_DISKLESS
R_PRIMARY with OND_SUSPEND_IO, we may only resume IO if we recognize the
backend that is being attached as D_UP_TO_DATE.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 59c6467..24422e8 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -1675,7 +1675,7 @@ static void after_state_ch(struct drbd_device *device, union drbd_state os,
what = RESEND;
 
if ((os.disk == D_ATTACHING || os.disk == D_NEGOTIATING) &&
-   conn_lowest_disk(connection) > D_NEGOTIATING)
+   conn_lowest_disk(connection) == D_UP_TO_DATE)
what = RESTART_FROZEN_DISK_IO;
 
if (resource->susp_nod && what != NOTHING) {
-- 
2.7.4



[PATCH 29/30] drbd: al_write_transaction: skip re-scanning of bitmap page pointer array

2016-06-13 Thread Philipp Reisner
From: Lars Ellenberg 

For larger devices, the array of bitmap page pointers can grow very
large (8000 pointers per TB of storage).

For each activity log transaction, we need to flush the associated
bitmap pages to stable storage. Currently, we just "mark" the respective
pages while setting up the transaction, then tell the bitmap code to
write out all marked pages, but skip unchanged pages.

But one such transaction can affect only a small number of bitmap pages,
there is no need to scan the full array of several (ten-)thousand
page pointers to find the few marked ones.

Instead, remember the index numbers of the few affected pages,
and later only re-check those to skip duplicates and unchanged ones.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_actlog.c |  2 ++
 drivers/block/drbd/drbd_bitmap.c | 66 +++-
 drivers/block/drbd/drbd_int.h|  1 +
 3 files changed, 54 insertions(+), 15 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 99a2b92..8305615 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -339,6 +339,8 @@ static int __al_write_transaction(struct drbd_device *device, struct al_transact
 
i = 0;
 
+   drbd_bm_reset_al_hints(device);
+
/* Even though no one can start to change this list
 * once we set the LC_LOCKED -- from drbd_al_begin_io(),
 * lc_try_lock_for_transaction() --, someone may still
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 801b8f3..b1c2a57 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -96,6 +96,13 @@ struct drbd_bitmap {
struct page **bm_pages;
spinlock_t bm_lock;
 
+   /* exclusively to be used by __al_write_transaction(),
+* drbd_bm_mark_for_writeout() and
+* and drbd_bm_write_hinted() -> bm_rw() called from there.
+*/
+   unsigned int n_bitmap_hints;
+   unsigned int al_bitmap_hints[AL_UPDATES_PER_TRANSACTION];
+
/* see LIMITATIONS: above */
 
unsigned long bm_set;   /* nr of set bits; THINK maybe atomic_t? */
@@ -242,6 +249,11 @@ static void bm_set_page_need_writeout(struct page *page)
set_bit(BM_PAGE_NEED_WRITEOUT, &page_private(page));
 }
 
+void drbd_bm_reset_al_hints(struct drbd_device *device)
+{
+   device->bitmap->n_bitmap_hints = 0;
+}
+
 /**
 * drbd_bm_mark_for_writeout() - mark a page with a "hint" to be considered for writeout
  * @device:DRBD device.
@@ -253,6 +265,7 @@ static void bm_set_page_need_writeout(struct page *page)
  */
 void drbd_bm_mark_for_writeout(struct drbd_device *device, int page_nr)
 {
+   struct drbd_bitmap *b = device->bitmap;
struct page *page;
if (page_nr >= device->bitmap->bm_number_of_pages) {
drbd_warn(device, "BAD: page_nr: %u, number_of_pages: %u\n",
@@ -260,7 +273,9 @@ void drbd_bm_mark_for_writeout(struct drbd_device *device, int page_nr)
return;
}
page = device->bitmap->bm_pages[page_nr];
-   set_bit(BM_PAGE_HINT_WRITEOUT, &page_private(page));
+   BUG_ON(b->n_bitmap_hints >= ARRAY_SIZE(b->al_bitmap_hints));
+   if (!test_and_set_bit(BM_PAGE_HINT_WRITEOUT, &page_private(page)))
+   b->al_bitmap_hints[b->n_bitmap_hints++] = page_nr;
 }
 
 static int bm_test_page_unchanged(struct page *page)
@@ -1030,7 +1045,7 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
 {
struct drbd_bm_aio_ctx *ctx;
struct drbd_bitmap *b = device->bitmap;
-   int num_pages, i, count = 0;
+   unsigned int num_pages, i, count = 0;
unsigned long now;
char ppb[10];
int err = 0;
@@ -1078,16 +1093,37 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
now = jiffies;
 
/* let the layers below us try to merge these bios... */
-   for (i = 0; i < num_pages; i++) {
-   /* ignore completely unchanged pages */
-   if (lazy_writeout_upper_idx && i == lazy_writeout_upper_idx)
-   break;
-   if (!(flags & BM_AIO_READ)) {
-   if ((flags & BM_AIO_WRITE_HINTED) &&
-   !test_and_clear_bit(BM_PAGE_HINT_WRITEOUT,
-   &page_private(b->bm_pages[i])))
-   continue;
 
+   if (flags & BM_AIO_READ) {
+   for (i = 0; i < num_pages; i++) {
+   atomic_inc(&ctx->in_flight);
+   bm_page_io_async(ctx, i);
+   ++count;
+   cond_resched();
+   }
+   } else if (flags & BM_AIO_WRITE_HINTED) {

Re: [Drbd-dev] [PATCH 05/30] drbd: Introduce new disk config option rs-discard-granularity

2016-04-25 Thread Philipp Reisner
On Monday, 25 April 2016 at 11:48:30, Bart Van Assche wrote:
> On 04/25/2016 09:42 AM, Philipp Reisner wrote:
> > On Monday, 25 April 2016 at 08:35:26, Bart Van Assche wrote:
> >> On 04/25/2016 05:10 AM, Philipp Reisner wrote:
> >>> As long as the value is 0 the feature is disabled. With setting
> >>> it to a positive value, DRBD limits and aligns its resync requests
> >>> to the rs-discard-granularity setting. If the sync source detects
> >>> all zeros in such a block, the resync target discards the range
> >>> on disk.
> >> 
> >> Can you explain why rs-discard-granularity is configurable instead of
> >> e.g. setting it to the least common multiple of the discard
> >> granularities of the underlying block devices at both sides?
> > 
> > we had this idea as well. It seems that real world devices like larger
> > discards better than smaller discards. The other motivation was that
> > a device mapper logical volume might change it on the fly...
> > So we think it is best to delegate the decision on the discard chunk
> > size to user space.
> 
> Hello Phil,
> 
> Are you aware that for aligned discard requests the discard granularity
> does not affect the size of discard requests at all?
> 
> Regarding LVM volumes: if the discard granularity for such volumes can
> change on the fly, shouldn't I/O be quiesced by the LVM kernel driver
> before it changes the discard granularity? I think that increasing
> discard granularity while I/O is in progress should be considered as a bug.
> 
> Bart.

Hi Bart,

I worked on this about 6 months ago, sorry for not having all the details
at the top of my head immediately. I think it has come back to me now:
We need to announce the discard granularity when we create the device/minor.
At that point there might be no connection to the peer node, so we
are left with information about the discard granularity of the local
backing device only.
Therefore we decided to delegate it to the user/admin to provide the
discard granularity for the resync process.

best regards,
 phil 


Re: [Drbd-dev] [PATCH 05/30] drbd: Introduce new disk config option rs-discard-granularity

2016-04-25 Thread Philipp Reisner
On Monday, 25 April 2016 at 08:35:26, Bart Van Assche wrote:
> On 04/25/2016 05:10 AM, Philipp Reisner wrote:
> > As long as the value is 0 the feature is disabled. With setting
> > it to a positive value, DRBD limits and aligns its resync requests
> > to the rs-discard-granularity setting. If the sync source detects
> > all zeros in such a block, the resync target discards the range
> > on disk.
> 
> Hello Phil,
> 
> Can you explain why rs-discard-granularity is configurable instead of
> e.g. setting it to the least common multiple of the discard
> granularities of the underlying block devices at both sides?
> 
> Thanks,
> 

Hi Bart,

we had this idea as well. It seems that real world devices like larger
discards better than smaller discards. The other motivation was that
a device mapper logical volume might change it on the fly...
So we think it is best to delegate the decision on the discard chunk
size to user space.

best regards,
 Phil



Re: [Drbd-dev] [PATCH 04/30] drbd: Implement handling of thinly provisioned storage on resync target nodes

2016-04-25 Thread Philipp Reisner
On Monday, 25 April 2016 at 08:28:45, Bart Van Assche wrote:
> On 04/25/2016 05:10 AM, Philipp Reisner wrote:
> > If during resync we read only zeroes for a range of sectors, assume
> > that these sectors can be discarded on the sync target node.
> 
> Hello Phil,
> 
> With which interconnect(s) has this patch been tested? I'm afraid that
> for high-speed interconnects this patch will slow down I/O instead of
> making it faster because all_zero() examines all data before it is sent.
> 

Hi Bart,

It is true that this might make things slower. The benefit it
provides is that it de-allocates blocks on the secondary.
The whole feature is optional, and it is off by default.

Obviously we want to have a generic interface like SEEK_HOLE/
SEEK_DATA for block devices, but that does not exist as of today.

best regards,
 Phil


[PATCH 24/30] drbd: disallow promotion during resync handshake, avoid deadlock and hard reset

2016-04-25 Thread Philipp Reisner
From: Lars Ellenberg 

We already serialize connection state changes,
and other, non-connection state changes (role changes)
while we are establishing a connection.

But if we have an established connection,
then trigger a resync handshake (by primary --force or similar),
until now we just had to be "lucky".

Consider this sequence (e.g. deployment scenario):
create-md; up;
  -> Connected Secondary/Secondary Inconsistent/Inconsistent
then do a racy primary --force on both peers.

 block drbd0: drbd_sync_handshake:
 block drbd0: self 
0004::: bits:25590 
flags:0
 block drbd0: peer 
0004::: bits:25590 
flags:0
 block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent )
 block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
  *** HERE things go wrong. ***
 block drbd0: role( Secondary -> Primary )
 block drbd0: drbd_sync_handshake:
 block drbd0: self 
0005::: bits:25590 
flags:0
 block drbd0: peer 
C90D2FC716D232AB:0004:: bits:25590 
flags:0
 block drbd0: Becoming sync target due to disk states.
 block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
 block drbd0: Remote failed to finish a request within 6007ms > ko-count (2) * timeout (30 * 0.1s)
 drbd s0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )

The problem here is that the local promotion happens before the sync handshake
triggered by the remote promotion was completed.  Some assumptions elsewhere
become wrong, and when the expected resync handshake is then received and
processed, we get stuck in a deadlock, which can only be recovered by reboot :-(

Fix: if we know the peer has good data,
and our own disk is present, but NOT good,
and there is no resync going on yet,
we expect a sync handshake to happen "soon".
So reject a racy promotion with SS_IN_TRANSIENT_STATE.

Result:
 ... as above ...
 block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
  *** local promotion being postponed until ... ***
 block drbd0: drbd_sync_handshake:
 block drbd0: self 
0004::: bits:25590 
flags:0
 block drbd0: peer 
77868BDA836E12A5:0004:: bits:25590 
flags:0
  ...
 block drbd0: conn( WFBitMapT -> WFSyncUUID )
 block drbd0: updated sync uuid 
85D06D0E8887AD44:::
 block drbd0: conn( WFSyncUUID -> SyncTarget )
  *** ... after the resync handshake ***
 block drbd0: role( Secondary -> Primary )

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_state.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 24422e8..7562c5c 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -906,6 +906,15 @@ is_valid_soft_transition(union drbd_state os, union drbd_state ns, struct drbd_c
  (ns.conn >= C_CONNECTED && os.conn == C_WF_REPORT_PARAMS)))
rv = SS_IN_TRANSIENT_STATE;
 
+   /* Do not promote during resync handshake triggered by "force primary".
+* This is a hack. It should really be rejected by the peer during the
+* cluster wide state change request. */
+   if (os.role != R_PRIMARY && ns.role == R_PRIMARY
+   && ns.pdsk == D_UP_TO_DATE
+   && ns.disk != D_UP_TO_DATE && ns.disk != D_DISKLESS
+   && (ns.conn <= C_WF_SYNC_UUID || ns.conn != os.conn))
+   rv = SS_IN_TRANSIENT_STATE;
+
if ((ns.conn == C_VERIFY_S || ns.conn == C_VERIFY_T) && os.conn < C_CONNECTED)
rv = SS_NEED_CONNECTION;
 
-- 
1.9.1



[PATCH 29/30] drbd: al_write_transaction: skip re-scanning of bitmap page pointer array

2016-04-25 Thread Philipp Reisner
From: Lars Ellenberg 

For larger devices, the array of bitmap page pointers can grow very
large (8000 pointers per TB of storage).

For each activity log transaction, we need to flush the associated
bitmap pages to stable storage. Currently, we just "mark" the respective
pages while setting up the transaction, then tell the bitmap code to
write out all marked pages, but skip unchanged pages.

But one such transaction can affect only a small number of bitmap pages,
there is no need to scan the full array of several (ten-)thousand
page pointers to find the few marked ones.

Instead, remember the index numbers of the few affected pages,
and later only re-check those to skip duplicates and unchanged ones.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_actlog.c |  2 ++
 drivers/block/drbd/drbd_bitmap.c | 66 +++-
 drivers/block/drbd/drbd_int.h|  1 +
 3 files changed, 54 insertions(+), 15 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 99a2b92..8305615 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -339,6 +339,8 @@ static int __al_write_transaction(struct drbd_device *device, struct al_transact
 
i = 0;
 
+   drbd_bm_reset_al_hints(device);
+
/* Even though no one can start to change this list
 * once we set the LC_LOCKED -- from drbd_al_begin_io(),
 * lc_try_lock_for_transaction() --, someone may still
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 801b8f3..b1c2a57 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -96,6 +96,13 @@ struct drbd_bitmap {
struct page **bm_pages;
spinlock_t bm_lock;
 
+   /* exclusively to be used by __al_write_transaction(),
+* drbd_bm_mark_for_writeout() and
+* and drbd_bm_write_hinted() -> bm_rw() called from there.
+*/
+   unsigned int n_bitmap_hints;
+   unsigned int al_bitmap_hints[AL_UPDATES_PER_TRANSACTION];
+
/* see LIMITATIONS: above */
 
unsigned long bm_set;   /* nr of set bits; THINK maybe atomic_t? */
@@ -242,6 +249,11 @@ static void bm_set_page_need_writeout(struct page *page)
set_bit(BM_PAGE_NEED_WRITEOUT, &page_private(page));
 }
 
+void drbd_bm_reset_al_hints(struct drbd_device *device)
+{
+   device->bitmap->n_bitmap_hints = 0;
+}
+
 /**
 * drbd_bm_mark_for_writeout() - mark a page with a "hint" to be considered for writeout
  * @device:DRBD device.
@@ -253,6 +265,7 @@ static void bm_set_page_need_writeout(struct page *page)
  */
 void drbd_bm_mark_for_writeout(struct drbd_device *device, int page_nr)
 {
+   struct drbd_bitmap *b = device->bitmap;
struct page *page;
if (page_nr >= device->bitmap->bm_number_of_pages) {
drbd_warn(device, "BAD: page_nr: %u, number_of_pages: %u\n",
@@ -260,7 +273,9 @@ void drbd_bm_mark_for_writeout(struct drbd_device *device, int page_nr)
return;
}
page = device->bitmap->bm_pages[page_nr];
-   set_bit(BM_PAGE_HINT_WRITEOUT, &page_private(page));
+   BUG_ON(b->n_bitmap_hints >= ARRAY_SIZE(b->al_bitmap_hints));
+   if (!test_and_set_bit(BM_PAGE_HINT_WRITEOUT, &page_private(page)))
+   b->al_bitmap_hints[b->n_bitmap_hints++] = page_nr;
 }
 
 static int bm_test_page_unchanged(struct page *page)
@@ -1030,7 +1045,7 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
 {
struct drbd_bm_aio_ctx *ctx;
struct drbd_bitmap *b = device->bitmap;
-   int num_pages, i, count = 0;
+   unsigned int num_pages, i, count = 0;
unsigned long now;
char ppb[10];
int err = 0;
@@ -1078,16 +1093,37 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
now = jiffies;
 
/* let the layers below us try to merge these bios... */
-   for (i = 0; i < num_pages; i++) {
-   /* ignore completely unchanged pages */
-   if (lazy_writeout_upper_idx && i == lazy_writeout_upper_idx)
-   break;
-   if (!(flags & BM_AIO_READ)) {
-   if ((flags & BM_AIO_WRITE_HINTED) &&
-   !test_and_clear_bit(BM_PAGE_HINT_WRITEOUT,
-   &page_private(b->bm_pages[i])))
-   continue;
 
+   if (flags & BM_AIO_READ) {
+   for (i = 0; i < num_pages; i++) {
+   atomic_inc(&ctx->in_flight);
+   bm_page_io_async(ctx, i);
+   ++count;
+   cond_resched();
+   }
+   } else if (flags & BM_AIO_WRITE_HINTED) {

[PATCH 17/30] drbd: don't forget error completion when "unsuspending" IO

2016-04-25 Thread Philipp Reisner
From: Lars Ellenberg 

Possibly sequence of events:
SyncTarget is made Primary, then loses replication link
(only path to good data on SyncSource).

Behavior is then controlled by the on-no-data-accessible policy,
which defaults to OND_IO_ERROR (may be set to OND_SUSPEND_IO).

If OND_IO_ERROR is in fact the current policy, we clear the susp_fen
(IO suspended due to fencing policy) flag, do NOT set the susp_nod
(IO suspended due to no data) flag.

But we forgot to call the IO error completion for all pending,
suspended, requests.

While at it, also add a race check for a theoretically possible
race with a new handshake (network hiccup): we may be able to
re-send requests, and can avoid passing IO errors up the stack.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_nl.c | 48 +---
 1 file changed, 32 insertions(+), 16 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index f16084a..a703a0e 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -442,19 +442,17 @@ static enum drbd_fencing_p highest_fencing_policy(struct drbd_connection *connec
}
rcu_read_unlock();
 
-   if (fp == FP_NOT_AVAIL) {
-   /* IO Suspending works on the whole resource.
-  Do it only for one device. */
-   vnr = 0;
-   peer_device = idr_get_next(&connection->peer_devices, &vnr);
-   drbd_change_state(peer_device->device, CS_VERBOSE | CS_HARD, NS(susp_fen, 0));
-   }
-
return fp;
 }
 
+static bool resource_is_supended(struct drbd_resource *resource)
+{
+   return resource->susp || resource->susp_fen || resource->susp_nod;
+}
+
 bool conn_try_outdate_peer(struct drbd_connection *connection)
 {
+   struct drbd_resource * const resource = connection->resource;
unsigned int connect_cnt;
union drbd_state mask = { };
union drbd_state val = { };
@@ -462,21 +460,41 @@ bool conn_try_outdate_peer(struct drbd_connection *connection)
char *ex_to_string;
int r;
 
-   spin_lock_irq(&connection->resource->req_lock);
+   spin_lock_irq(&resource->req_lock);
if (connection->cstate >= C_WF_REPORT_PARAMS) {
drbd_err(connection, "Expected cstate < C_WF_REPORT_PARAMS\n");
-   spin_unlock_irq(&connection->resource->req_lock);
+   spin_unlock_irq(&resource->req_lock);
return false;
}
 
connect_cnt = connection->connect_cnt;
-   spin_unlock_irq(&connection->resource->req_lock);
+   spin_unlock_irq(&resource->req_lock);
 
fp = highest_fencing_policy(connection);
switch (fp) {
case FP_NOT_AVAIL:
drbd_warn(connection, "Not fencing peer, I'm not even Consistent myself.\n");
-   goto out;
+   spin_lock_irq(&resource->req_lock);
+   if (connection->cstate < C_WF_REPORT_PARAMS) {
+   _conn_request_state(connection,
+   (union drbd_state) { { .susp_fen = 1 } },
+   (union drbd_state) { { .susp_fen = 0 } },
+   CS_VERBOSE | CS_HARD | CS_DC_SUSP);
+   /* We are no longer suspended due to the fencing policy.
+* We may still be suspended due to the on-no-data-accessible policy.
+* If that was OND_IO_ERROR, fail pending requests. */
+   if (!resource_is_supended(resource))
+   _tl_restart(connection, CONNECTION_LOST_WHILE_PENDING);
+   }
+   /* Else: in case we raced with a connection handshake,
+* let the handshake figure out if we maybe can RESEND,
+* and do not resume/fail pending requests here.
+* Worst case is we stay suspended for now, which may be
+* resolved by either re-establishing the replication link, or
+* the next link failure, or eventually the administrator.  */
+   spin_unlock_irq(&resource->req_lock);
+   return false;
+
case FP_DONT_CARE:
return true;
default: ;
@@ -529,13 +547,11 @@ bool conn_try_outdate_peer(struct drbd_connection 
*connection)
drbd_info(connection, "fence-peer helper returned %d (%s)\n",
  (r>>8) & 0xff, ex_to_string);
 
- out:
-
/* Not using
   conn_request_state(connection, mask, val, CS_VERBOSE);
   here, because we might were able to re-establish the connection in the
   meantime. */
-   spin_lock_irq(&connection->resource->req_lock);

[PATCH 04/30] drbd: Implement handling of thinly provisioned storage on resync target nodes

2016-04-25 Thread Philipp Reisner
If during resync we read only zeroes for a range of sectors, assume
that these sectors can be discarded on the sync target node.
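The sync-source-side decision this describes boils down to scanning the block read from disk for any non-zero byte. A minimal userspace sketch of such a check (the function name and buffer handling are illustrative, not the DRBD kernel code):

```c
#include <stdbool.h>
#include <string.h>

/* Sketch (not the DRBD code): report whether a resync data buffer
 * contains only zero bytes.  Uses the overlapping-memcmp trick:
 * if byte 0 is zero and every byte equals its successor, the whole
 * buffer must be zero. */
static bool buffer_is_all_zero(const unsigned char *buf, size_t len)
{
        if (len == 0)
                return true;
        return buf[0] == 0 && memcmp(buf, buf + 1, len - 1) == 0;
}
```

Judging from the protocol comments in this series, a range found all-zero on the sync source is answered with P_RS_DEALLOCATED instead of a data packet, so the sync target can discard it rather than write zeroes.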

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_int.h  |  5 +++
 drivers/block/drbd/drbd_main.c | 18 
 drivers/block/drbd/drbd_protocol.h |  4 ++
 drivers/block/drbd/drbd_receiver.c | 88 --
 drivers/block/drbd/drbd_worker.c   | 29 -
 5 files changed, 140 insertions(+), 4 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 7a1cf7e..1a93f4f 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -471,6 +471,9 @@ enum {
/* this originates from application on peer
 * (not some resync or verify or other DRBD internal request) */
__EE_APPLICATION,
+
+   /* If it contains only 0 bytes, send back P_RS_DEALLOCATED */
+   __EE_RS_THIN_REQ,
 };
 #define EE_CALL_AL_COMPLETE_IO (1<<__EE_CALL_AL_COMPLETE_IO)
 #define EE_MAY_SET_IN_SYNC (1<<__EE_MAY_SET_IN_SYNC)
@@ -485,6 +488,7 @@ enum {
 #define EE_SUBMITTED   (1<<__EE_SUBMITTED)
 #define EE_WRITE   (1<<__EE_WRITE)
 #define EE_APPLICATION (1<<__EE_APPLICATION)
+#define EE_RS_THIN_REQ (1<<__EE_RS_THIN_REQ)
 
 /* flag bits per device */
 enum {
@@ -1123,6 +1127,7 @@ extern int drbd_send_ov_request(struct drbd_peer_device 
*, sector_t sector, int
 extern int drbd_send_bitmap(struct drbd_device *device);
extern void drbd_send_sr_reply(struct drbd_peer_device *, enum drbd_state_rv retcode);
extern void conn_send_sr_reply(struct drbd_connection *connection, enum drbd_state_rv retcode);
+extern int drbd_send_rs_deallocated(struct drbd_peer_device *, struct drbd_peer_request *);
extern void drbd_backing_dev_free(struct drbd_device *device, struct drbd_backing_dev *ldev);
 extern void drbd_device_cleanup(struct drbd_device *device);
 void drbd_print_uuids(struct drbd_device *device, const char *text);
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 802d729..3cecc4f 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1377,6 +1377,22 @@ int drbd_send_ack_ex(struct drbd_peer_device 
*peer_device, enum drbd_packet cmd,
  cpu_to_be64(block_id));
 }
 
+int drbd_send_rs_deallocated(struct drbd_peer_device *peer_device,
+struct drbd_peer_request *peer_req)
+{
+   struct drbd_socket *sock;
+   struct p_block_desc *p;
+
+   sock = &peer_device->connection->data;
+   p = drbd_prepare_command(peer_device, sock);
+   if (!p)
+   return -EIO;
+   p->sector = cpu_to_be64(peer_req->i.sector);
+   p->blksize = cpu_to_be32(peer_req->i.size);
+   p->pad = 0;
+   return drbd_send_command(peer_device, sock, P_RS_DEALLOCATED, sizeof(*p), NULL, 0);
+}
+
 int drbd_send_drequest(struct drbd_peer_device *peer_device, int cmd,
   sector_t sector, int size, u64 block_id)
 {
@@ -3681,6 +3697,8 @@ const char *cmdname(enum drbd_packet cmd)
[P_CONN_ST_CHG_REPLY]   = "conn_st_chg_reply",
[P_RETRY_WRITE] = "retry_write",
[P_PROTOCOL_UPDATE] = "protocol_update",
+   [P_RS_THIN_REQ] = "rs_thin_req",
+   [P_RS_DEALLOCATED]  = "rs_deallocated",
 
/* enum drbd_packet, but not commands - obsoleted flags:
 *  P_MAY_IGNORE
diff --git a/drivers/block/drbd/drbd_protocol.h b/drivers/block/drbd/drbd_protocol.h
index ef92453..e5e74e3 100644
--- a/drivers/block/drbd/drbd_protocol.h
+++ b/drivers/block/drbd/drbd_protocol.h
@@ -60,6 +60,10 @@ enum drbd_packet {
 * which is why I chose TRIM here, to disambiguate. */
P_TRIM= 0x31,
 
+   /* Only use these two if both support FF_THIN_RESYNC */
+   P_RS_THIN_REQ = 0x32, /* Request a block for resync or reply P_RS_DEALLOCATED */
+   P_RS_DEALLOCATED  = 0x33, /* Contains only zeros on sync source node */
+
P_MAY_IGNORE  = 0x100, /* Flag to test if (cmd > P_MAY_IGNORE) ... */
P_MAX_OPT_CMD = 0x101,
 
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 8b30ab5..3a6c2ec 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1417,9 +1417,15 @@ int drbd_submit_peer_request(struct drbd_device *device,
 * so we can find it to present it in debugfs */
peer_req->submit_jif = jiffies;
peer_req->flags |= EE_SUBMITTED;
-   spin_lock_irq(&device->resource->req_lock);
-   list_add_tail(&peer_req->w.list, &device->active_

[PATCH 15/30] drbd: finish resync on sync source only by notification from sync target

2016-04-25 Thread Philipp Reisner
From: Lars Ellenberg 

If the replication link breaks exactly during "resync finished" detection,
finishing too early on the sync source could again lead to UUIDs rotated
too fast, and potentially a spurious full resync on next handshake.

Always wait for explicit resync finished state change notification from
the sync target.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_actlog.c | 16 
 drivers/block/drbd/drbd_int.h| 19 ++-
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 1664762..4e07cff 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -768,10 +768,18 @@ static bool lazy_bitmap_update_due(struct drbd_device 
*device)
 
 static void maybe_schedule_on_disk_bitmap_update(struct drbd_device *device, bool rs_done)
 {
-   if (rs_done)
-   set_bit(RS_DONE, &device->flags);
-   /* and also set RS_PROGRESS below */
-   else if (!lazy_bitmap_update_due(device))
+   if (rs_done) {
+   struct drbd_connection *connection = first_peer_device(device)->connection;
+   if (connection->agreed_pro_version <= 95 ||
+   is_sync_target_state(device->state.conn))
+   set_bit(RS_DONE, &device->flags);
+   /* and also set RS_PROGRESS below */
+
+   /* Else: rather wait for explicit notification via receive_state,
+* to avoid uuids-rotated-too-fast causing full resync
+* in next handshake, in case the replication link breaks
+* at the most unfortunate time... */
+   } else if (!lazy_bitmap_update_due(device))
return;
 
drbd_device_post_work(device, RS_PROGRESS);
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index d82e531..451a745 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -2102,13 +2102,22 @@ static inline void _sub_unacked(struct drbd_device 
*device, int n, const char *f
ERR_IF_CNT_IS_NEGATIVE(unacked_cnt, func, line);
 }
 
+static inline bool is_sync_target_state(enum drbd_conns connection_state)
+{
+   return  connection_state == C_SYNC_TARGET ||
+   connection_state == C_PAUSED_SYNC_T;
+}
+
+static inline bool is_sync_source_state(enum drbd_conns connection_state)
+{
+   return  connection_state == C_SYNC_SOURCE ||
+   connection_state == C_PAUSED_SYNC_S;
+}
+
 static inline bool is_sync_state(enum drbd_conns connection_state)
 {
-   return
-  (connection_state == C_SYNC_SOURCE
-   ||  connection_state == C_SYNC_TARGET
-   ||  connection_state == C_PAUSED_SYNC_S
-   ||  connection_state == C_PAUSED_SYNC_T);
+   return  is_sync_source_state(connection_state) ||
+   is_sync_target_state(connection_state);
 }
 
 /**
-- 
1.9.1



[PATCH 11/30] drbd: when receiving P_TRIM, zero-out partial unaligned chunks

2016-04-25 Thread Philipp Reisner
From: Lars Ellenberg 

We can avoid spurious data divergence caused by partially-ignored
discards on certain backends with discard_zeroes_data=0, if we
translate partial unaligned discard requests into explicit zero-out.

The relevant use case is LVM/DM thin.

If on different nodes, DRBD is backed by devices with differing
discard characteristics, discards may lead to data divergence
(old data or garbage left over on one backend, zeroes due to
unmapped areas on the other backend). Online verify would now
potentially report tons of spurious differences.

While probably harmless for most use cases (fstrim on a file system),
DRBD cannot have that, it would violate our promise to upper layers
that our data instances on the nodes are identical.

To be correct and play safe (make sure data is identical on both copies),
we would have to disable discard support, if our local backend (on a
Primary) does not support "discard_zeroes_data=true".

We'd also have to translate discards to explicit zero-out on the
receiving (typically: Secondary) side, unless the receiving side
supports "discard_zeroes_data=true".

Which both would allocate those blocks, instead of unmapping them,
in contrast with expectations.

LVM/DM thin does set discard_zeroes_data=0,
because it silently ignores discards to partial chunks.

We can work around this by checking the alignment first.
For unaligned (wrt. alignment and granularity) or too small discards,
we zero-out the initial (and/or) trailing unaligned partial chunks,
but discard all the aligned full chunks.

At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".
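The head/middle/tail split described above can be sketched as standalone arithmetic (all names here are illustrative; the in-kernel code works on bios and the backend's queue limits):

```c
#include <stdint.h>

/* Illustrative sketch (not the DRBD code): split a discard request
 * [start, start+len) into an unaligned head, an aligned middle, and an
 * unaligned tail with respect to the backend's discard granularity.
 * Head and tail would be zeroed out explicitly; only the aligned
 * middle is passed down as a real discard. */
struct discard_split {
        uint64_t head_len;   /* bytes to zero-out before the aligned part */
        uint64_t mid_start;  /* start of the aligned discardable middle   */
        uint64_t mid_len;    /* length of the aligned middle (may be 0)   */
        uint64_t tail_len;   /* bytes to zero-out after the middle        */
};

static struct discard_split split_discard(uint64_t start, uint64_t len,
                                          uint64_t granularity)
{
        struct discard_split s = { 0, start, 0, 0 };
        uint64_t end = start + len;
        /* round start up, end down, to the granularity */
        uint64_t mid_start = (start + granularity - 1) / granularity * granularity;
        uint64_t mid_end = end / granularity * granularity;

        if (mid_end <= mid_start) {
                /* too small or unaligned: zero-out the whole range */
                s.head_len = len;
                return s;
        }
        s.head_len  = mid_start - start;
        s.mid_start = mid_start;
        s.mid_len   = mid_end - mid_start;
        s.tail_len  = end - mid_end;
        return s;
}
```

With this split, a thin backend that silently drops partial-chunk discards still ends up reading back zeroes everywhere, which is what makes the announced discard_zeroes_data=1 behavior honest.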

Arguably it should behave this way internally, by default,
and we'll try to make that happen.

But our workaround is still valid for already deployed setups,
and for other devices that may behave this way.

Setting discard-zeroes-if-aligned=yes will allow DRBD to use
discards, and to announce discard_zeroes_data=true, even on
backends that announce discard_zeroes_data=false.

Setting discard-zeroes-if-aligned=no will cause DRBD to always
fall-back to zero-out on the receiving side, and to not even
announce discard capabilities on the Primary, if the respective
backend announces discard_zeroes_data=false.

We used to ignore the discard_zeroes_data setting completely.
To not break established and expected behaviour, and suddenly
cause fstrim on thin-provisioned LVs to run out-of-space,
instead of freeing up space, the default value is "yes".

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_int.h  |   2 +-
 drivers/block/drbd/drbd_nl.c   |  15 ++--
 drivers/block/drbd/drbd_receiver.c | 140 ++---
 include/linux/drbd_genl.h  |   1 +
 include/linux/drbd_limits.h|   6 ++
 5 files changed, 134 insertions(+), 30 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 1a93f4f..8cc2955 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1488,7 +1488,7 @@ enum determine_dev_size {
 extern enum determine_dev_size
 drbd_determine_dev_size(struct drbd_device *, enum dds_flags, struct resize_parms *) __must_hold(local);
 extern void resync_after_online_grow(struct drbd_device *);
-extern void drbd_reconsider_max_bio_size(struct drbd_device *device, struct drbd_backing_dev *bdev);
+extern void drbd_reconsider_queue_parameters(struct drbd_device *device, struct drbd_backing_dev *bdev);
 extern enum drbd_state_rv drbd_set_role(struct drbd_device *device,
enum drbd_role new_role,
int force);
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index e63c5c4..4a0b184 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1161,13 +1161,17 @@ static void drbd_setup_queue_param(struct drbd_device 
*device, struct drbd_backi
unsigned int max_hw_sectors = max_bio_size >> 9;
unsigned int max_segments = 0;
struct request_queue *b = NULL;
+   struct disk_conf *dc;
+   bool discard_zeroes_if_aligned = true;
 
if (bdev) {
b = bdev->backing_bdev->bd_disk->queue;
 
max_hw_sectors = min(queue_max_hw_sectors(b), max_bio_size >> 9);
rcu_read_lock();
-   max_segments = rcu_dereference(device->ldev->disk_conf)->max_bio_bvecs;
+   dc = rcu_dereference(device->ldev->disk_conf);
+   max_segments = dc->max_bio_bvecs;
+   discard_zeroes_if_aligned = dc->discard_zeroes_if_aligned;
rcu_read_unlock();
 
blk_set_stacking_limits(&q->limits);
@@ -1185,7 +1189,7 @@ static void drbd_setup_queue_param(struct drbd_device 
*device, struct drbd_backi
 
blk_queue_max_discard_sec

[PATCH 00/30] DRBD updates

2016-04-25 Thread Philipp Reisner
Hi Jens,

apart from the usual maintenance and bug fixes, this time comes
support for WRITE_SAME and lots of improvements for DISCARD.

Overview:
As a replication technology, we want to use DISCARDs only if they
really zero the backing storage. Thin LVM does that, but claims
not to. To make this reasonably usable with DRBD we
added the "discard_zeroes_if_aligned" hack^H^H^H^H configure
option. Please see the commit messages for all the details.

Please add it to your for-4.7/drivers branch.
Thanks!


Fabian Frederick (1):
  drbd: code cleanups without semantic changes

Lars Ellenberg (24):
  drbd: bitmap bulk IO: do not always suspend IO
  drbd: change bitmap write-out when leaving resync states
  drbd: adjust assert in w_bitmap_io to account for
BM_LOCKED_CHANGE_ALLOWED
  drbd: fix regression: protocol A sometimes synchronous, C sometimes
double-latency
  drbd: fix for truncated minor number in callback command line
  drbd: allow parallel flushes for multi-volume resources
  drbd: when receiving P_TRIM, zero-out partial unaligned chunks
  drbd: possibly disable discard support, if backend has
discard_zeroes_data=0
  drbd: zero-out partial unaligned discards on local backend
  drbd: allow larger max_discard_sectors
  drbd: finish resync on sync source only by notification from sync
target
  drbd: introduce unfence-peer handler
  drbd: don't forget error completion when "unsuspending" IO
  drbd: if there is no good data accessible, writes should be IO errors
  drbd: only restart frozen disk io when D_UP_TO_DATE
  drbd: discard_zeroes_if_aligned allows "thin" resync for
discard_zeroes_data=0
  drbd: report sizes if rejecting too small peer disk
  drbd: introduce WRITE_SAME support
  drbd: sync_handshake: handle identical uuids with current (frozen)
Primary
  drbd: disallow promotion during resync handshake, avoid deadlock and
hard reset
  drbd: bump current uuid when resuming IO with diskless peer
  drbd: finally report ms, not jiffies, in log message
  drbd: al_write_transaction: skip re-scanning of bitmap page pointer
array
  drbd: correctly handle failed crypto_alloc_hash

Philipp Reisner (4):
  drbd: Kill code duplication
  drbd: Implement handling of thinly provisioned storage on resync
target nodes
  drbd: Introduce new disk config option rs-discard-granularity
  drbd: Create the protocol feature THIN_RESYNC

Roland Kammerer (1):
  drbd: get rid of empty statement in is_valid_state

 drivers/block/drbd/drbd_actlog.c   |  29 +-
 drivers/block/drbd/drbd_bitmap.c   |  84 --
 drivers/block/drbd/drbd_debugfs.c  |  13 +-
 drivers/block/drbd/drbd_int.h  |  49 +++-
 drivers/block/drbd/drbd_interval.h |  14 +-
 drivers/block/drbd/drbd_main.c | 115 +++-
 drivers/block/drbd/drbd_nl.c   | 282 +++-
 drivers/block/drbd/drbd_proc.c |  30 +--
 drivers/block/drbd/drbd_protocol.h |  77 +-
 drivers/block/drbd/drbd_receiver.c | 534 ++---
 drivers/block/drbd/drbd_req.c  |  84 --
 drivers/block/drbd/drbd_req.h  |   5 +-
 drivers/block/drbd/drbd_state.c|  61 -
 drivers/block/drbd/drbd_state.h|   2 +-
 drivers/block/drbd/drbd_strings.c  |   8 +-
 drivers/block/drbd/drbd_worker.c   |  85 +-
 include/linux/drbd.h   |  10 +-
 include/linux/drbd_genl.h  |   7 +-
 include/linux/drbd_limits.h|  15 +-
 19 files changed, 1205 insertions(+), 299 deletions(-)

-- 
1.9.1



[PATCH 21/30] drbd: report sizes if rejecting too small peer disk

2016-04-25 Thread Philipp Reisner
From: Lars Ellenberg 

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 078c4d98..99f4519 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -3939,6 +3939,7 @@ static int receive_sizes(struct drbd_connection 
*connection, struct packet_info
device->p_size = p_size;
 
if (get_ldev(device)) {
+   sector_t new_size, cur_size;
rcu_read_lock();
my_usize = rcu_dereference(device->ldev->disk_conf)->disk_size;
rcu_read_unlock();
@@ -3955,11 +3956,13 @@ static int receive_sizes(struct drbd_connection 
*connection, struct packet_info
 
/* Never shrink a device with usable data during connect.
   But allow online shrinking if we are connected. */
-   if (drbd_new_dev_size(device, device->ldev, p_usize, 0) <
-   drbd_get_capacity(device->this_bdev) &&
+   new_size = drbd_new_dev_size(device, device->ldev, p_usize, 0);
+   cur_size = drbd_get_capacity(device->this_bdev);
+   if (new_size < cur_size &&
device->state.disk >= D_OUTDATED &&
device->state.conn < C_CONNECTED) {
-   drbd_err(device, "The peer's disk size is too small!\n");
+   drbd_err(device, "The peer's disk size is too small! (%llu < %llu sectors)\n",
+   (unsigned long long)new_size, (unsigned long long)cur_size);
conn_request_state(peer_device->connection, NS(conn, C_DISCONNECTING), CS_HARD);
put_ldev(device);
return -EIO;
-- 
1.9.1



[PATCH 30/30] drbd: correctly handle failed crypto_alloc_hash

2016-04-25 Thread Philipp Reisner
From: Lars Ellenberg 

crypto_alloc_hash returns an ERR_PTR(), not NULL.

Also reset peer_integrity_tfm to NULL, to not call crypto_free_hash()
on an errno in the cleanup path.
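For context, allocators in the kernel crypto API signal failure with an ERR_PTR()-encoded errno rather than NULL, which is why the original NULL check could never fire. A simplified userspace model of that convention (MAX_ERRNO and the helpers mirror the kernel's <linux/err.h>, reduced for illustration):

```c
#include <errno.h>
#include <stddef.h>

/* Simplified userspace model of the kernel's <linux/err.h> convention:
 * failure is a small negative errno encoded into the pointer value.
 * Such a pointer is non-NULL, so `if (!ptr)` never catches it, and it
 * must never be handed to a free routine. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
        return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
        return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
        return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
```

A failed crypto_alloc_ahash() therefore passes IS_ERR() but not a NULL test, which is exactly the bug fixed here; resetting the variable to NULL afterwards keeps the shared cleanup path, which frees the tfm, safe.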

Reported-by: Insu Yun 

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 3 ++-
 include/linux/drbd.h   | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 5c06286..80a6aff 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -3668,7 +3668,8 @@ static int receive_protocol(struct drbd_connection 
*connection, struct packet_in
 */
 
peer_integrity_tfm = crypto_alloc_ahash(integrity_alg, 0, CRYPTO_ALG_ASYNC);
-   if (!peer_integrity_tfm) {
+   if (IS_ERR(peer_integrity_tfm)) {
+   peer_integrity_tfm = NULL;
drbd_err(connection, "peer data-integrity-alg %s not supported\n",
 integrity_alg);
goto disconnect;
diff --git a/include/linux/drbd.h b/include/linux/drbd.h
index 2b26156..002611c 100644
--- a/include/linux/drbd.h
+++ b/include/linux/drbd.h
@@ -51,7 +51,7 @@
 #endif
 
 extern const char *drbd_buildtag(void);
-#define REL_VERSION "8.4.6"
+#define REL_VERSION "8.4.7"
 #define API_VERSION 1
 #define PRO_VERSION_MIN 86
 #define PRO_VERSION_MAX 101
-- 
1.9.1



[PATCH 22/30] drbd: introduce WRITE_SAME support

2016-04-25 Thread Philipp Reisner
From: Lars Ellenberg 

We will support WRITE_SAME, if
 * all peers support WRITE_SAME (both in kernel and DRBD version),
 * all peer devices support WRITE_SAME, and
 * logical_block_size is identical on all peers.

We may at some point introduce a fallback on the receiving side
for devices/kernels that do not support WRITE_SAME,
by open-coding a submit loop. But not yet.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_actlog.c   |   9 ++-
 drivers/block/drbd/drbd_debugfs.c  |  11 +--
 drivers/block/drbd/drbd_int.h  |  13 ++--
 drivers/block/drbd/drbd_main.c |  82 +++---
 drivers/block/drbd/drbd_nl.c   |  88 +---
 drivers/block/drbd/drbd_protocol.h |  74 ++--
 drivers/block/drbd/drbd_receiver.c | 137 +++--
 drivers/block/drbd/drbd_req.c  |  13 ++--
 drivers/block/drbd/drbd_req.h  |   5 +-
 drivers/block/drbd/drbd_worker.c   |   8 ++-
 10 files changed, 360 insertions(+), 80 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 4e07cff..99a2b92 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -838,6 +838,13 @@ static int update_sync_bits(struct drbd_device *device,
return count;
 }
 
+static bool plausible_request_size(int size)
+{
+   return size > 0
+   && size <= DRBD_MAX_BATCH_BIO_SIZE
+   && IS_ALIGNED(size, 512);
+}
+
 /* clear the bit corresponding to the piece of storage in question:
  * size byte of data starting from sector.  Only clear a bits of the affected
  * one ore more _aligned_ BM_BLOCK_SIZE blocks.
@@ -857,7 +864,7 @@ int __drbd_change_sync(struct drbd_device *device, sector_t 
sector, int size,
if ((mode == SET_OUT_OF_SYNC) && size == 0)
return 0;
 
-   if (size <= 0 || !IS_ALIGNED(size, 512) || size > DRBD_MAX_DISCARD_SIZE) {
+   if (!plausible_request_size(size)) {
drbd_err(device, "%s: sector=%llus size=%d nonsense!\n",
drbd_change_sync_fname[mode],
(unsigned long long)sector, size);
diff --git a/drivers/block/drbd/drbd_debugfs.c b/drivers/block/drbd/drbd_debugfs.c
index 4de95bb..8a90812 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -237,14 +237,9 @@ static void seq_print_peer_request_flags(struct seq_file 
*m, struct drbd_peer_re
seq_print_rq_state_bit(m, f & EE_SEND_WRITE_ACK, &sep, "C");
seq_print_rq_state_bit(m, f & EE_MAY_SET_IN_SYNC, &sep, "set-in-sync");
 
-   if (f & EE_IS_TRIM) {
-   seq_putc(m, sep);
-   sep = '|';
-   if (f & EE_IS_TRIM_USE_ZEROOUT)
-   seq_puts(m, "zero-out");
-   else
-   seq_puts(m, "trim");
-   }
+   if (f & EE_IS_TRIM)
+   __seq_print_rq_state_bit(m, f & EE_IS_TRIM_USE_ZEROOUT, &sep, "zero-out", "trim");
+   seq_print_rq_state_bit(m, f & EE_WRITE_SAME, &sep, "write-same");
seq_putc(m, '\n');
 }
 
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index cb42f6c..cb47809 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -468,6 +468,9 @@ enum {
/* this is/was a write request */
__EE_WRITE,
 
+   /* this is/was a write same request */
+   __EE_WRITE_SAME,
+
/* this originates from application on peer
 * (not some resync or verify or other DRBD internal request) */
__EE_APPLICATION,
@@ -487,6 +490,7 @@ enum {
 #define EE_IN_INTERVAL_TREE(1<<__EE_IN_INTERVAL_TREE)
 #define EE_SUBMITTED   (1<<__EE_SUBMITTED)
 #define EE_WRITE   (1<<__EE_WRITE)
+#define EE_WRITE_SAME  (1<<__EE_WRITE_SAME)
 #define EE_APPLICATION (1<<__EE_APPLICATION)
 #define EE_RS_THIN_REQ (1<<__EE_RS_THIN_REQ)
 
@@ -1350,8 +1354,8 @@ struct bm_extent {
 /* For now, don't allow more than half of what we can "activate" in one
  * activity log transaction to be discarded in one go. We may need to rework
  * drbd_al_begin_io() to allow for even larger discard ranges */
-#define DRBD_MAX_DISCARD_SIZE  (AL_UPDATES_PER_TRANSACTION/2*AL_EXTENT_SIZE)
-#define DRBD_MAX_DISCARD_SECTORS (DRBD_MAX_DISCARD_SIZE >> 9)
+#define DRBD_MAX_BATCH_BIO_SIZE (AL_UPDATES_PER_TRANSACTION/2*AL_EXTENT_SIZE)
+#define DRBD_MAX_BBIO_SECTORS (DRBD_MAX_BATCH_BIO_SIZE >> 9)
 
 extern int  drbd_bm_init(struct drbd_device *device);
 extern int  drbd_bm_resize(struct drbd_device *device, sector_t sectors, int set_new_bits);
@@ -1488,7 +1492,

[PATCH 28/30] drbd: finally report ms, not jiffies, in log message

2016-04-25 Thread Philipp Reisner
From: Lars Ellenberg 

Also skip the message unless bitmap IO took longer than 5 ms.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_bitmap.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 17e5e60..801b8f3 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -1121,10 +1121,14 @@ static int bm_rw(struct drbd_device *device, const 
unsigned int flags, unsigned
kref_put(&ctx->kref, &drbd_bm_aio_ctx_destroy);
 
/* summary for global bitmap IO */
-   if (flags == 0)
-   drbd_info(device, "bitmap %s of %u pages took %lu jiffies\n",
-(flags & BM_AIO_READ) ? "READ" : "WRITE",
-count, jiffies - now);
+   if (flags == 0) {
+   unsigned int ms = jiffies_to_msecs(jiffies - now);
+   if (ms > 5) {
+   drbd_info(device, "bitmap %s of %u pages took %u ms\n",
+(flags & BM_AIO_READ) ? "READ" : "WRITE",
+count, ms);
+   }
+   }
 
if (ctx->error) {
drbd_alert(device, "we had at least one MD IO ERROR during bitmap IO\n");
-- 
1.9.1



[PATCH 02/30] drbd: change bitmap write-out when leaving resync states

2016-04-25 Thread Philipp Reisner
From: Lars Ellenberg 

When leaving resync states because of disconnect,
do the bitmap write-out synchronously in the drbd_disconnected() path.

When leaving resync states because we go back to AHEAD/BEHIND, or
because resync actually finished, or some disk was lost during resync,
trigger the write-out from after_state_ch().

The bitmap write-out for resync -> ahead/behind was missing completely before.

Note that this is all only an optimization to avoid double-resyncs of
already completed blocks in case this node crashes.

Signed-off-by: Philipp Reisner 
Signed-off-by: Lars Ellenberg 
---
 drivers/block/drbd/drbd_receiver.c | 8 +---
 drivers/block/drbd/drbd_state.c| 9 +++--
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 050aaa1..8b30ab5 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -4783,9 +4783,11 @@ static int drbd_disconnected(struct drbd_peer_device 
*peer_device)
 
drbd_md_sync(device);
 
-   /* serialize with bitmap writeout triggered by the state change,
-* if any. */
-   wait_event(device->misc_wait, !test_bit(BITMAP_IO, &device->flags));
+   if (get_ldev(device)) {
+   drbd_bitmap_io(device, &drbd_bm_write_copy_pages,
+   "write from disconnected", BM_LOCKED_CHANGE_ALLOWED);
+   put_ldev(device);
+   }
 
/* tcp_close and release of sendpage pages can be deferred.  I don't
 * want to use SO_LINGER, because apparently it can be deferred for
diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 5a7ef78..59c6467 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -1934,12 +1934,17 @@ static void after_state_ch(struct drbd_device *device, 
union drbd_state os,
 
/* This triggers bitmap writeout of potentially still unwritten pages
 * if the resync finished cleanly, or aborted because of peer disk
-* failure, or because of connection loss.
+* failure, or on transition from resync back to AHEAD/BEHIND.
+*
+* Connection loss is handled in drbd_disconnected() by the receiver.
+*
 * For resync aborted because of local disk failure, we cannot do
 * any bitmap writeout anymore.
+*
 * No harm done if some bits change during this phase.
 */
-   if (os.conn > C_CONNECTED && ns.conn <= C_CONNECTED && get_ldev(device)) {
+   if ((os.conn > C_CONNECTED && os.conn < C_AHEAD) &&
+   (ns.conn == C_CONNECTED || ns.conn >= C_AHEAD) && get_ldev(device)) {
drbd_queue_bitmap_io(device, &drbd_bm_write_copy_pages, NULL,
"write from resync_finished", BM_LOCKED_CHANGE_ALLOWED);
put_ldev(device);
-- 
1.9.1


