Re: [Nbd] [PATCH] nbd: improve request timeouts handling

2015-03-01 Thread Markus Pargmann
Hi Michal,

First of all, thanks for resending this. It works for my test setup but
I have some comments inline below.

On Thu, Feb 26, 2015 at 05:37:12PM +0100, Michal Belczyk wrote:
 The main idea behind it is to be able to quickly detect broken replica
 and switch over to another when used with any sort of mirror type device
 built on top of any number of nbd devices.
 
 Before this change a request would time out causing the socket to be
 shut down and the device to fail in case of a dead server or removed
 network cable only if:
 
   a) either the timer around kernel_sendmsg() kicked in
   b) or the TCP failures on retransmission finally caused an error
  on the socket, likely blocked on kernel_recvmsg() at this time,
  waiting for replies from the server
 
 Case a) depends mostly on the size of requests issued and on the maximum
 size of the socket buffer -- a lot of read request headers or small
 write requests could be sent without triggering the requested timeout
 
 Case b) timeout is independent of the nbd-client -t <timeout> option
 as there is no TCP_USER_TIMEOUT set on the client socket by default.
 And even if such a timeout was set, it would not solve the problem of
 an nbd-client hung on receiving replies for a much longer time without
 also setting TCP keep-alives... and that would be the third, independent
 timeout setting required to make it work almost as expected...
 
 So, instead, take the big hammer approach and:
 
   *) trace the number of outstanding requests sent to the server
  (nbd->inflight)
   *) enable the timer (nbd->req_timer) before the first request
  is submitted and leave it enabled
   *) when sending next request do not touch the timer (it is up
  to the receiving side to control it at this point)
   *) on receive side update the timer every time a response
  is collected but there are more to read from the server
   *) disable the timer whenever the inflight counter drops
  to zero or an error (leading to the socket shutdown)
  is returned
 
 This patch does NOT prevent the server from processing a request for
 longer than the specified timeout, as long as it replies to some other
 request submitted within the timeout (the server still may reply to a
 batch of requests in any order).
 
 Only the nbd->xmit_timeout != 0 code path is changed so the patch should
 not affect nbd connections running without an explicit timeout set
 on the nbd-client command line.
 
 There is also no way to enable or disable the timeout on an active
 (nbd->pid != 0) nbd device; it is, however, possible to change its value.
 Otherwise the inflight request counter would have to affect the nbd
 devices enabled without nbd-client -t <timeout>.
 
 Also move nbd->pid modifications behind nbd->tx_lock wherever possible
 to avoid races between concurrent nbd-client invocations.

This commit message is really good.

 
 Signed-off-by: Michal Belczyk belc...@bsd.krakow.pl
 ---
  drivers/block/nbd.c | 150 +---
  include/linux/nbd.h |   6 +++
  2 files changed, 125 insertions(+), 31 deletions(-)
 
 diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
 index 4bc2a5c..cc4a98a 100644
 --- a/drivers/block/nbd.c
 +++ b/drivers/block/nbd.c
 @@ -140,11 +140,24 @@ static void sock_shutdown(struct nbd_device *nbd, int lock)
  
  static void nbd_xmit_timeout(unsigned long arg)
  {
 - struct task_struct *task = (struct task_struct *)arg;
 + struct nbd_device *nbd = (struct nbd_device *)arg;
 + struct task_struct *task_ary[2];
 + unsigned long flags;
 + int i;
 
 - printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
 - task->comm, task->pid);
 - force_sig(SIGKILL, task);
 + spin_lock_irqsave(&nbd->timer_lock, flags);
 + nbd->timedout = 1;
 + task_ary[0] = nbd->sender;
 + task_ary[1] = nbd->receiver;
 + for (i = 0; i < 2; i++) {
 + if (task_ary[i] == NULL)
 + continue;
 + printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
 + task_ary[i]->comm, task_ary[i]->pid);
 + force_sig(SIGKILL, task_ary[i]);
 + break;
 + }
 + spin_unlock_irqrestore(&nbd->timer_lock, flags);
  }
  
  /*
 @@ -158,7 +171,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
   struct msghdr msg;
   struct kvec iov;
   sigset_t blocked, oldset;
 - unsigned long pflags = current->flags;
 + unsigned long flags, pflags = current->flags;

I would prefer variable declarations on separate lines, especially if
there is an assignment involved, for readability.
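
For example (illustrative only):

        unsigned long flags;
        unsigned long pflags = current->flags;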

  
   if (unlikely(!sock)) {
 dev_err(disk_to_dev(nbd->disk),
 @@ -183,23 +196,39 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
   msg.msg_controllen = 0;
   msg.msg_flags = msg_flags | MSG_NOSIGNAL;
  
 - if (send) {
 - struct 

[Nbd] [PATCH] nbd: improve request timeouts handling

2015-02-26 Thread Michal Belczyk
The main idea behind it is to be able to quickly detect broken replica
and switch over to another when used with any sort of mirror type device
built on top of any number of nbd devices.

Before this change a request would time out causing the socket to be
shut down and the device to fail in case of a dead server or removed
network cable only if:

  a) either the timer around kernel_sendmsg() kicked in
  b) or the TCP failures on retransmission finally caused an error
 on the socket, likely blocked on kernel_recvmsg() at this time,
 waiting for replies from the server

Case a) depends mostly on the size of requests issued and on the maximum
size of the socket buffer -- a lot of read request headers or small
write requests could be sent without triggering the requested timeout

Case b) timeout is independent of the nbd-client -t <timeout> option
as there is no TCP_USER_TIMEOUT set on the client socket by default.
And even if such a timeout was set, it would not solve the problem of
an nbd-client hung on receiving replies for a much longer time without
also setting TCP keep-alives... and that would be the third, independent
timeout setting required to make it work almost as expected...
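
(For context, a user-space sketch -- not part of this patch -- of what it
would take to get comparable failure detection from TCP alone, on an
already connected client socket fd:)

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Sketch only: TCP_USER_TIMEOUT covers data that stays unacked, and
 * keep-alives cover a connection sitting idle while waiting for
 * replies -- two more timeout knobs on top of nbd-client -t. */
static int set_tcp_timeouts(int fd, unsigned int timeout_sec)
{
        unsigned int ms = timeout_sec * 1000;
        int on = 1;
        int idle = timeout_sec; /* idle seconds before the first probe */
        int intvl = 1;          /* seconds between probes */
        int cnt = 3;            /* failed probes before the drop */

        if (setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &ms, sizeof(ms)))
                return -1;
        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)))
                return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)))
                return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)))
                return -1;
        return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
}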

So, instead, take the big hammer approach (see the sketch after this list) and:

  *) trace the number of outstanding requests sent to the server
 (nbd->inflight)
  *) enable the timer (nbd->req_timer) before the first request
 is submitted and leave it enabled
  *) when sending next request do not touch the timer (it is up
 to the receiving side to control it at this point)
  *) on receive side update the timer every time a response
 is collected but there are more to read from the server
  *) disable the timer whenever the inflight counter drops
 to zero or an error (leading to the socket shutdown)
 is returned
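
A rough sketch of the scheme, for illustration only -- the field and
helper names below are assumed from this description, since the actual
hunks are truncated in this archive:

/* Send side: arm the timer only when the first request goes out. */
static void nbd_mark_sent(struct nbd_device *nbd)
{
        if (nbd->xmit_timeout && atomic_inc_return(&nbd->inflight) == 1)
                mod_timer(&nbd->req_timer, jiffies + nbd->xmit_timeout);
        /* Later requests leave the timer alone: from here on the
         * receive side owns it. */
}

/* Receive side: every collected reply proves the server is alive. */
static void nbd_mark_received(struct nbd_device *nbd)
{
        if (!nbd->xmit_timeout)
                return;
        if (atomic_dec_return(&nbd->inflight) == 0)
                del_timer(&nbd->req_timer);     /* nothing outstanding */
        else
                mod_timer(&nbd->req_timer, jiffies + nbd->xmit_timeout);
}

Any send/receive error likewise disables the timer before the socket is
shut down.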

This patch does NOT prevent the server from processing a request for
longer than the specified timeout, as long as it replies to some other
request submitted within the timeout (the server still may reply to a
batch of requests in any order).

Only the nbd->xmit_timeout != 0 code path is changed so the patch should
not affect nbd connections running without an explicit timeout set
on the nbd-client command line.

There is also no way to enable or disable the timeout on an active
(nbd->pid != 0) nbd device; it is, however, possible to change its value.
Otherwise the inflight request counter would have to affect the nbd
devices enabled without nbd-client -t <timeout>.

Also move nbd->pid modifications behind nbd->tx_lock wherever possible
to avoid races between concurrent nbd-client invocations.

Signed-off-by: Michal Belczyk belc...@bsd.krakow.pl
---
 drivers/block/nbd.c | 150 +---
 include/linux/nbd.h |   6 +++
 2 files changed, 125 insertions(+), 31 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 4bc2a5c..cc4a98a 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -140,11 +140,24 @@ static void sock_shutdown(struct nbd_device *nbd, int lock)
 
 static void nbd_xmit_timeout(unsigned long arg)
 {
-   struct task_struct *task = (struct task_struct *)arg;
+   struct nbd_device *nbd = (struct nbd_device *)arg;
+   struct task_struct *task_ary[2];
+   unsigned long flags;
+   int i;
 
-   printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
-   task->comm, task->pid);
-   force_sig(SIGKILL, task);
+   spin_lock_irqsave(&nbd->timer_lock, flags);
+   nbd->timedout = 1;
+   task_ary[0] = nbd->sender;
+   task_ary[1] = nbd->receiver;
+   for (i = 0; i < 2; i++) {
+   if (task_ary[i] == NULL)
+   continue;
+   printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
+   task_ary[i]->comm, task_ary[i]->pid);
+   force_sig(SIGKILL, task_ary[i]);
+   break;
+   }
+   spin_unlock_irqrestore(&nbd->timer_lock, flags);
 }
 
 /*
@@ -158,7 +171,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
struct msghdr msg;
struct kvec iov;
sigset_t blocked, oldset;
-   unsigned long pflags = current->flags;
+   unsigned long flags, pflags = current->flags;
 
if (unlikely(!sock)) {
dev_err(disk_to_dev(nbd->disk),
@@ -183,23 +196,39 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
msg.msg_controllen = 0;
msg.msg_flags = msg_flags | MSG_NOSIGNAL;
 
-   if (send) {
-   struct timer_list ti;
-
-   if (nbd->xmit_timeout) {
-   init_timer(&ti);
-   ti.function = nbd_xmit_timeout;
-   ti.data = (unsigned long)current;
-   ti.expires = jiffies + nbd->xmit_timeout;
-   add_timer(&ti);
+   if (nbd->xmit_timeout)

Re: [Nbd] [PATCH] nbd: improve request timeouts handling

2013-07-29 Thread Paul Clements
Michal,

thanks for this...a couple of preliminary questions...

On Mon, Jun 17, 2013 at 4:10 PM, Michal Belczyk belc...@bsd.krakow.pl wrote:

 The main idea behind it is to be able to quickly detect broken replica
 and switch over to another when used with any sort of mirror type device
 built on top of any number of nbd devices.


I'm wondering if a simpler change would have the same result. Namely, could
you just apply the xmit_timeout, as currently implemented, to the recvmsg
path (ignoring the timeout, of course, when no requests are queued). Is
there a case that this would not handle that the proposed patch handles?
They seem to be equivalent.


 Also move nbd->pid modifications behind nbd->tx_lock wherever possible
 to avoid races between concurrent nbd-client invocations.


Is there a bug related to this that you've seen, or is this just a
precaution?

Thanks,
Paul


Re: [Nbd] [PATCH] nbd: improve request timeouts handling

2013-07-29 Thread Michal Belczyk
On Mon, Jul 29, 2013 at 12:45:33PM -0400, Paul Clements wrote:
 Michal,
 
 thanks for this...a couple of preliminary questions...
 
 On Mon, Jun 17, 2013 at 4:10 PM, Michal Belczyk belc...@bsd.krakow.pl wrote:
 
  The main idea behind it is to be able to quickly detect broken replica
  and switch over to another when used with any sort of mirror type device
  built on top of any number of nbd devices.
 
 
 I'm wondering if a simpler change would have the same result. Namely, could
 you just apply the xmit_timeout, as currently implemented, to the recvmsg
 path (ignoring the timeout, of course, when no requests are queued). Is
 there a case that this would not handle that the proposed patch handles?
 They seem to be equivalent.

Oh... no, they would not be equivalent.
It is all my fault -- the commit message lacks very important information:

Timeouts implemented via timers directly on top of kernel_sendmsg() or
kernel_recvmsg() routines apply to a single bio_vec being sent/received at
a time, while the proposed patch makes the whole block queue request
subject to the timeout -- timeout handling was moved from sock_xmit()
way up to nbd_send_req() and nbd_do_it()...

Currently, assuming for this example max_sectors_kb=512, a single
direct read request from an nbd device:

# dd if=/dev/nbd0 of=/dev/null bs=512K iflag=direct count=1

generates:

[1301353.424080] nbd0: request 8802253d22f0: got reply
[1301353.424089] nbd0: request 8802253d22f0: got 4096 bytes data
[1301353.424092] nbd0: request 8802253d22f0: got 4096 bytes data
[ ... 128 4KB bio_vec-s / 128 sock_xmit() calls? ]

So, assuming a timeout set to 30s, a single 512KB read request could
take up to 128 times 30s, which makes... 64 minutes, if the timeouts
were implemented the way you ask...
Well, they actually _are_ implemented this way in the upstream code, so
a single direct write request like this will generate 128 timer
arms/disarms AND a 30s timeout actually means that such a request can
take up to 64 minutes...
I think this is wrong and I believe that if a maximum-sized request
(subject to the now-configurable max_sectors_kb tunable) cannot be
fulfilled within the specified timeout, then this NBD device queue
should die, hopefully being handled by a DM layer on top... and such a
request could be happily requeued to a well-behaving replica, possibly
another NBD device...


 Also move nbd->pid modifications behind nbd->tx_lock wherever possible
  to avoid races between concurrent nbd-client invocations.
 
 
 Is there a bug related to this that you've seen, or is this just a
 precaution?

Precaution, but 1) I don't like that the current code checks for nbd->pid
with nbd->tx_lock held and modifies it later with the same lock released
in nbd_do_it(), and 2) the proposed patch uses nbd->pid to determine if
the device is in use (nbd-client alive) -- should it do it some other
way?
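
Roughly this shape, I mean -- a sketch with an assumed helper name, not
the actual patch hunks:

/* Claim the device under nbd->tx_lock so two nbd-client invocations
 * cannot both pass the "is it free?" check before either one sets
 * nbd->pid. */
static int nbd_claim(struct nbd_device *nbd)
{
        mutex_lock(&nbd->tx_lock);
        if (nbd->pid) {         /* someone else already runs nbd_do_it() */
                mutex_unlock(&nbd->tx_lock);
                return -EBUSY;
        }
        nbd->pid = task_pid_nr(current);
        mutex_unlock(&nbd->tx_lock);
        return 0;
}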

Thank you for the feedback!

-- 
Michal Belczyk Sr.



Re: [Nbd] [PATCH] nbd: improve request timeouts handling (rebased)

2013-07-15 Thread Goswin von Brederlow
On Sat, Jul 13, 2013 at 10:01:20PM +0200, Michal Belczyk wrote:
 On Thu, Jul 11, 2013 at 03:23:46PM +0200, Goswin von Brederlow wrote:
  On Wed, Jul 10, 2013 at 09:40:42PM +0200, Michal Belczyk wrote:
   [rebased against the master, no other changes]
   
   The main idea behind it is to be able to quickly detect broken replica
   and switch over to another when used with any sort of mirror type device
   built on top of any number of nbd devices.
  
  It is good to see someone checking and fixing timeout issues.
  
  But I would also like to have the opposite behaviour and delay
  detecting failures. Insane? read on...
  
  
  I have a small server that runs 24/7 with minimal power consumption. I
  also have a storage node that exports a number of disks via NBD. This
  should be powered down when not in use. The problem is that "not in
  use" is incompatible with NBD as is. The storage node must be running
  and must have an active connection to the server.
  
  What I'm suggesting is that the nbd-client should be able to create a
  network block device in disconnected state (without attached socket)
  and the kernel would return on the next access. The nbd client can
  then actually connect to the NBD server and restart the device with a
  socket to fulfill the request. In my case that would involve a
  wake-on-lan call to first bring up the storage node.
  
  Secondly it would be nice to have an idle timeout in the kernel. If no
  access happened to the NBD for some time the kernel should return so
  the client can switch to disconnected mode. In my case that would
  involve sending the storage node to sleep.
  
  Interested in trying to implement this?
 
 I'm currently busy trying to push bnbd to its first release which is not
 easy being distracted by fixing bugs in other but related components on
 my way...
 So far the biggest winners of my project are the nbd driver and the
 nbd client which already have some of the bugfixes upstreamed ;-)
 
 I just wonder -- have you tried with autofs plus some scripting over it
 on mount and unmount upon timeout?  I think I have used it once in my
 life long ago, so not sure if it would work at all, but it seems possible?
 
 -- 
 Michal Belczyk Sr.

I don't think autofs can umount a filesystem with open files, like a
shell having its CWD in the FS. Right? The FS would then easily be
blocked accidentally without actually being in real use.

MfG
Goswin




Re: [Nbd] [PATCH] nbd: improve request timeouts handling (rebased)

2013-07-15 Thread Michal Belczyk
On Mon, Jul 15, 2013 at 10:38:35AM +0200, Goswin von Brederlow wrote:
 
 I don't think autofs can umount a filesystem with open files, like a
 shell having its CWD in the FS. Right? The FS would then easily be
 blocked accidentally without actually being in real use.

Yeah, I guess you're right... then it probably should go into the
driver.  As I said before -- I'm way too busy currently to work on it
now...


-- 
Michal Belczyk Sr.



Re: [Nbd] [PATCH] nbd: improve request timeouts handling (rebased)

2013-07-13 Thread Michal Belczyk
On Thu, Jul 11, 2013 at 03:23:46PM +0200, Goswin von Brederlow wrote:
 On Wed, Jul 10, 2013 at 09:40:42PM +0200, Michal Belczyk wrote:
  [rebased against the master, no other changes]
  
  The main idea behind it is to be able to quickly detect broken replica
  and switch over to another when used with any sort of mirror type device
  built on top of any number of nbd devices.
 
 It is good to see someone checking and fixing timeout issues.
 
 But I would also like to have the opposite behaviour and delay
 detecting failures. Insane? read on...
 
 
 I have a small server that runs 24/7 with minimal power consumption. I
 also have a storage node that exports a number of disks via NBD. This
 should be powered down when not in use. The problem is that "not in
 use" is incompatible with NBD as is. The storage node must be running
 and must have an active connection to the server.
 
 What I'm suggesting is that the nbd-client should be able to create a
 network block device in disconnected state (without attached socket)
 and the kernel would return on the next access. The nbd client can
 then actually connect to the NBD server and restart the device with a
 socket to fulfill the request. In my case that would involve a
 wake-on-lan call to first bring up the storage node.
 
 Secondly it would be nice to have an idle timeout in the kernel. If no
 access happened to the NBD for some time the kernel should return so
 the client can switch to disconnected mode. In my case that would
 involve sending the storage node to sleep.
 
 Interested in trying to implement this?

I'm currently busy trying to push bnbd to its first release which is not
easy being distracted by fixing bugs in other but related components on
my way...
So far the biggest winners of my project are the nbd driver and the
nbd client which already have some of the bugfixes upstreamed ;-)

I just wonder -- have you tried with autofs plus some scripting over it
on mount and unmount upon timeout?  I think I have used it once in my
life long ago, so not sure if it would work at all, but it seems possible?
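
Something along these lines is what I had in mind -- a completely
untested sketch, with made-up paths and timeout, where /etc/auto.nbd
would be an executable (program) map script that first wakes the
storage node, then runs nbd-client and prints the mount entry:

# /etc/auto.master (sketch)
/mnt/storage  program:/etc/auto.nbd  --timeout=600

The --timeout would handle the umount side; putting the node back to
sleep would still need some extra scripting around it.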

-- 
Michal Belczyk Sr.



Re: [Nbd] [PATCH] nbd: improve request timeouts handling (rebased)

2013-07-11 Thread Goswin von Brederlow
On Wed, Jul 10, 2013 at 09:40:42PM +0200, Michal Belczyk wrote:
 [rebased against the master, no other changes]
 
 The main idea behind it is to be able to quickly detect broken replica
 and switch over to another when used with any sort of mirror type device
 built on top of any number of nbd devices.

It is good to see someone checking and fixing timeout issues.

But I would also like to have the opposite behaviour and delay
detecting failures. Insane? read on...


I have a small server that runs 24/7 with minimal power consumption. I
also have a storage node that exports a number of disks via NBD. This
should be powered down when not in use. The problem is that "not in
use" is incompatible with NBD as is. The storage node must be running
and must have an active connection to the server.

What I'm suggesting is that the nbd-client should be able to create a
network block device in disconnected state (without attached socket)
and the kernel would return on the next access. The nbd client can
then actually connect to the NBD server and restart the device with a
socket to fulfill the request. In my case that would involve a
wake-on-lan call to first bring up the storage node.

Secondly it would be nice to have an idle timeout in the kernel. If no
access happened to the NBD for some time the kernel should return so
the client can switch to disconnected mode. In my case that would
involve sending the storage node to sleep.

Interested in trying to implement this?

MfG
Goswin
 



[Nbd] [PATCH] nbd: improve request timeouts handling (rebased)

2013-07-10 Thread Michal Belczyk
[rebased against the master, no other changes]

The main idea behind it is to be able to quickly detect broken replica
and switch over to another when used with any sort of mirror type device
built on top of any number of nbd devices.

Before this change a request would time out causing the socket to be
shut down and the device to fail in case of a dead server or removed
network cable only if:

  a) either the timer around kernel_sendmsg() kicked in
  b) or the TCP failures on retransmission finally caused an error
 on the socket, likely blocked on kernel_recvmsg() at this time,
 waiting for replies from the server

Case a) depends mostly on the size of requests issued and on the maximum
size of the socket buffer -- a lot of read request headers or small
write requests could be sent without triggering the requested timeout

Case b) timeout is independent of the nbd-client -t <timeout> option
as there is no TCP_USER_TIMEOUT set on the client socket by default.
And even if such a timeout was set, it would not solve the problem of
an nbd-client hung on receiving replies for a much longer time without
also setting TCP keep-alives... and that would be the third, independent
timeout setting required to make it work almost as expected...

So, instead, take the big hammer approach and:

  *) trace the number of outstanding requests sent to the server
 (nbd->inflight)
  *) enable the timer (nbd->req_timer) before the first request
 is submitted and leave it enabled
  *) when sending next request do not touch the timer (it is up
 to the receiving side to control it at this point)
  *) on receive side update the timer every time a response
 is collected but there are more to read from the server
  *) disable the timer whenever the inflight counter drops
 to zero or an error (leading to the socket shutdown)
 is returned

This patch does NOT prevent the server from processing a request for
longer than the specified timeout, as long as it replies to some other
request submitted within the timeout (the server still may reply to a
batch of requests in any order).

Only the nbd->xmit_timeout != 0 code path is changed so the patch should
not affect nbd connections running without an explicit timeout set
on the nbd-client command line.

There is also no way to enable or disable the timeout on an active
(nbd->pid != 0) nbd device; it is, however, possible to change its value.
Otherwise the inflight request counter would have to affect the nbd
devices enabled without nbd-client -t <timeout>.

Also move nbd->pid modifications behind nbd->tx_lock wherever possible
to avoid races between concurrent nbd-client invocations.

Signed-off-by: Michal Belczyk belc...@bsd.krakow.pl
---
 drivers/block/nbd.c | 150 +---
 include/linux/nbd.h |   6 +++
 2 files changed, 125 insertions(+), 31 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 2dc3b51..a20d5eb 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -140,11 +140,24 @@ static void sock_shutdown(struct nbd_device *nbd, int lock)
 
 static void nbd_xmit_timeout(unsigned long arg)
 {
-   struct task_struct *task = (struct task_struct *)arg;
+   struct nbd_device *nbd = (struct nbd_device *)arg;
+   struct task_struct *task_ary[2];
+   unsigned long flags;
+   int i;
 
-   printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
-   task->comm, task->pid);
-   force_sig(SIGKILL, task);
+   spin_lock_irqsave(&nbd->timer_lock, flags);
+   nbd->timedout = 1;
+   task_ary[0] = nbd->sender;
+   task_ary[1] = nbd->receiver;
+   for (i = 0; i < 2; i++) {
+   if (task_ary[i] == NULL)
+   continue;
+   printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
+   task_ary[i]->comm, task_ary[i]->pid);
+   force_sig(SIGKILL, task_ary[i]);
+   break;
+   }
+   spin_unlock_irqrestore(&nbd->timer_lock, flags);
 }
 
 /*
@@ -158,7 +171,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
struct msghdr msg;
struct kvec iov;
sigset_t blocked, oldset;
-   unsigned long pflags = current->flags;
+   unsigned long flags, pflags = current->flags;
 
if (unlikely(!sock)) {
dev_err(disk_to_dev(nbd->disk),
@@ -183,23 +196,39 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
msg.msg_controllen = 0;
msg.msg_flags = msg_flags | MSG_NOSIGNAL;
 
-   if (send) {
-   struct timer_list ti;
-
-   if (nbd->xmit_timeout) {
-   init_timer(&ti);
-   ti.function = nbd_xmit_timeout;
-   ti.data = (unsigned long)current;
-   ti.expires = jiffies + nbd->xmit_timeout;
-   

[Nbd] [PATCH] nbd: improve request timeouts handling

2013-06-17 Thread Michal Belczyk
The main idea behind it is to be able to quickly detect broken replica
and switch over to another when used with any sort of mirror type device
built on top of any number of nbd devices.

Before this change a request would time out causing the socket to be
shut down and the device to fail in case of a dead server or removed
network cable only if:

  a) either the timer around kernel_sendmsg() kicked in
  b) or the TCP failures on retransmission finally caused an error
 on the socket, likely blocked on kernel_recvmsg() at this time,
 waiting for replies from the server

Case a) depends mostly on the size of requests issued and on the maximum
size of the socket buffer -- a lot of read request headers or small
write requests could be sent without triggering the requested timeout

Case b) timeout is independent of the nbd-client -t <timeout> option
as there is no TCP_USER_TIMEOUT set on the client socket by default.
And even if such a timeout was set, it would not solve the problem of
an nbd-client hung on receiving replies for a much longer time without
also setting TCP keep-alives... and that would be the third, independent
timeout setting required to make it work almost as expected...

So, instead, take the big hammer approach and:

  *) trace the number of outstanding requests sent to the server
 (nbd->inflight)
  *) enable the timer (nbd->req_timer) before the first request
 is submitted and leave it enabled
  *) when sending next request do not touch the timer (it is up
 to the receiving side to control it at this point)
  *) on receive side update the timer every time a response
 is collected but there are more to read from the server
  *) disable the timer whenever the inflight counter drops
 to zero or an error (leading to the socket shutdown)
 is returned

This patch does NOT prevent the server from processing a request for
longer than the specified timeout, as long as it replies to some other
request submitted within the timeout (the server still may reply to a
batch of requests in any order).

Only the nbd->xmit_timeout != 0 code path is changed so the patch should
not affect nbd connections running without an explicit timeout set
on the nbd-client command line.

There is also no way to enable or disable the timeout on an active
(nbd->pid != 0) nbd device; it is, however, possible to change its value.
Otherwise the inflight request counter would have to affect the nbd
devices enabled without nbd-client -t <timeout>.

Also move nbd->pid modifications behind nbd->tx_lock wherever possible
to avoid races between concurrent nbd-client invocations.

Signed-off-by: Michal Belczyk belc...@bsd.krakow.pl
---
 drivers/block/nbd.c | 150 +---
 include/linux/nbd.h |   6 +++
 2 files changed, 125 insertions(+), 31 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index c09640a..a61f1d3 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -140,11 +140,24 @@ static void sock_shutdown(struct nbd_device *nbd, int lock)
 
 static void nbd_xmit_timeout(unsigned long arg)
 {
-   struct task_struct *task = (struct task_struct *)arg;
+   struct nbd_device *nbd = (struct nbd_device *)arg;
+   struct task_struct *task_ary[2];
+   unsigned long flags;
+   int i;
 
-   printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
-   task->comm, task->pid);
-   force_sig(SIGKILL, task);
+   spin_lock_irqsave(&nbd->timer_lock, flags);
+   nbd->timedout = 1;
+   task_ary[0] = nbd->sender;
+   task_ary[1] = nbd->receiver;
+   for (i = 0; i < 2; i++) {
+   if (task_ary[i] == NULL)
+   continue;
+   printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
+   task_ary[i]->comm, task_ary[i]->pid);
+   force_sig(SIGKILL, task_ary[i]);
+   break;
+   }
+   spin_unlock_irqrestore(&nbd->timer_lock, flags);
 }
 
 /*
@@ -158,7 +171,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
struct msghdr msg;
struct kvec iov;
sigset_t blocked, oldset;
-   unsigned long pflags = current->flags;
+   unsigned long flags, pflags = current->flags;
 
if (unlikely(!sock)) {
dev_err(disk_to_dev(nbd->disk),
@@ -183,23 +196,39 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
msg.msg_controllen = 0;
msg.msg_flags = msg_flags | MSG_NOSIGNAL;
 
-   if (send) {
-   struct timer_list ti;
-
-   if (nbd->xmit_timeout) {
-   init_timer(&ti);
-   ti.function = nbd_xmit_timeout;
-   ti.data = (unsigned long)current;
-   ti.expires = jiffies + nbd->xmit_timeout;
-   add_timer(&ti);
+   if (nbd->xmit_timeout)