RE: [PATCH] scsi_debug: deadlock between completions and surprise module removal

2014-10-03 Thread Elliott, Robert (Server Storage)


 -Original Message-
 From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
 ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
 Sent: Thursday, 25 September, 2014 7:13 AM
 To: Douglas Gilbert
 Cc: SCSI development list; linux-kernel; James Bottomley; Christoph Hellwig;
 Milan Broz
 Subject: Re: [PATCH] scsi_debug: deadlock between completions and surprise
 module removal
 
 Review ping again?
 
 While I think the shutdown code in scsi_debug needs a bit more of an
 overhault I'd really like to include the fix at least for 3.18 and
 3.17-stable now that we have missed the 3.17 window.
 
 On Sun, Aug 31, 2014 at 07:09:59PM -0400, Douglas Gilbert wrote:
  A deadlock has been reported when the completion
  of SCSI commands (simulated by a timer) was surprised
  by a module removal. This patch removes one half of
  the offending locks around timer deletions. This fix
  is applied both to stop_all_queued() which is were
  the deadlock was discovered and stop_queued_cmnd()
  which has very similar logic.
 
  This patch should be applied both to the lk 3.17 tree
  and Christoph's drivers-for-3.18 tree.
 
  Tested-and-reported-by: Milan Broz gmazyl...@gmail.com
  Signed-off-by: Douglas Gilbert dgilb...@interlog.com

Reviewed-by: Robert Elliott elli...@hp.com



--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal

2014-09-25 Thread Christoph Hellwig
Review ping again?

While I think the shutdown code in scsi_debug needs a bit more of an
overhault I'd really like to include the fix at least for 3.18 and
3.17-stable now that we have missed the 3.17 window.

On Sun, Aug 31, 2014 at 07:09:59PM -0400, Douglas Gilbert wrote:
 A deadlock has been reported when the completion
 of SCSI commands (simulated by a timer) was surprised
 by a module removal. This patch removes one half of
 the offending locks around timer deletions. This fix
 is applied both to stop_all_queued() which is were
 the deadlock was discovered and stop_queued_cmnd()
 which has very similar logic.
 
 This patch should be applied both to the lk 3.17 tree
 and Christoph's drivers-for-3.18 tree.
 
 Tested-and-reported-by: Milan Broz gmazyl...@gmail.com
 Signed-off-by: Douglas Gilbert dgilb...@interlog.com

 --- a/drivers/scsi/scsi_debug.c   2014-08-26 13:24:51.646948507 -0400
 +++ b/drivers/scsi/scsi_debug.c   2014-08-30 18:04:54.589226679 -0400
 @@ -2743,6 +2743,13 @@ static int stop_queued_cmnd(struct scsi_
   if (test_bit(k, queued_in_use_bm)) {
   sqcp = queued_arr[k];
   if (cmnd == sqcp-a_cmnd) {
 + devip = (struct sdebug_dev_info *)
 + cmnd-device-hostdata;
 + if (devip)
 + atomic_dec(devip-num_in_q);
 + sqcp-a_cmnd = NULL;
 + spin_unlock_irqrestore(queued_arr_lock,
 +iflags);
   if (scsi_debug_ndelay  0) {
   if (sqcp-sd_hrtp)
   hrtimer_cancel(
 @@ -2755,18 +2762,13 @@ static int stop_queued_cmnd(struct scsi_
   if (sqcp-tletp)
   tasklet_kill(sqcp-tletp);
   }
 - __clear_bit(k, queued_in_use_bm);
 - devip = (struct sdebug_dev_info *)
 - cmnd-device-hostdata;
 - if (devip)
 - atomic_dec(devip-num_in_q);
 - sqcp-a_cmnd = NULL;
 - break;
 + clear_bit(k, queued_in_use_bm);
 + return 1;
   }
   }
   }
   spin_unlock_irqrestore(queued_arr_lock, iflags);
 - return (k  qmax) ? 1 : 0;
 + return 0;
  }
  
  /* Deletes (stops) timers or tasklets of all queued commands */
 @@ -2782,6 +2784,13 @@ static void stop_all_queued(void)
   if (test_bit(k, queued_in_use_bm)) {
   sqcp = queued_arr[k];
   if (sqcp-a_cmnd) {
 + devip = (struct sdebug_dev_info *)
 + sqcp-a_cmnd-device-hostdata;
 + if (devip)
 + atomic_dec(devip-num_in_q);
 + sqcp-a_cmnd = NULL;
 + spin_unlock_irqrestore(queued_arr_lock,
 +iflags);
   if (scsi_debug_ndelay  0) {
   if (sqcp-sd_hrtp)
   hrtimer_cancel(
 @@ -2794,12 +2803,8 @@ static void stop_all_queued(void)
   if (sqcp-tletp)
   tasklet_kill(sqcp-tletp);
   }
 - __clear_bit(k, queued_in_use_bm);
 - devip = (struct sdebug_dev_info *)
 - sqcp-a_cmnd-device-hostdata;
 - if (devip)
 - atomic_dec(devip-num_in_q);
 - sqcp-a_cmnd = NULL;
 + clear_bit(k, queued_in_use_bm);
 + spin_lock_irqsave(queued_arr_lock, iflags);
   }
   }
   }

---end quoted text---
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal

2014-09-09 Thread Christoph Hellwig
On Mon, Sep 08, 2014 at 04:31:01PM -0400, Douglas Gilbert wrote:
 stop_all_queued() is doing hrtimer_cancel(), del_timer_sync()
 or tasklet_kill() on all the scsi_cmnd objects that are
 in play. Unless another mechanism calls the .eh_abort_handler
 entry point reliably on each in play command then the module
 cannot be removed. That is because some timer expiry callbacks
 are pending.

scsi_remove_host disabled all queueing of new commands, so all these
timers and tasklets will eventually expire or run and allow the
removal to complete.  Of course this could be sped up by cancelling
them, but you don't need the sync version

 Something like the (untested) patch below would do the trick.
 We'd still need Dougs patch for the EH case, though.

 
 The only other call to stop_all_queued() is from the
 .eh_host_reset_handler entry point.

True, but you also have stop_queued_cmnd for a abort case which
also needs that treatment.

--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal

2014-09-08 Thread Bart Van Assche

On 09/06/14 16:40, Douglas Gilbert wrote:

On 14-09-05 11:25 AM, Bart Van Assche wrote:

An LLD must call scsi_remove_host() directly or indirectly from the
module
cleanup path. scsi_remove_host() triggers a call to
blk_cleanup_queue(). That
last function sets the flag QUEUE_FLAG_DYING which prevents that new
I/O is
queued and waits until previously queued requests have finished before
returning.


And they do call scsi_remove_host(). But they do that toward
the end of their clean-up. The problem that I observed has
already happened before that.

IOW I think the QUEUE_FLAG_DYING state needs to be set and
acknowledged as the first order of business by the code
that implements 'rmmod LLD'.


Hello Doug,

In the scsi_debug driver scsi_remove_host() is called from inside the 
sdebug_driver_remove() callback function. Unless I have missed something 
it is not guaranteed that that callback function is invoked before 
unloading of the scsi_debug driver has finished. I think most of the 
code in sdebug_driver_remove() should be moved to sdebug_remove_adapter().


Bart.

--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal

2014-09-08 Thread Christoph Hellwig
On Mon, Sep 08, 2014 at 11:11:23AM +0200, Bart Van Assche wrote:
 Hello Doug,
 
 In the scsi_debug driver scsi_remove_host() is called from inside the
 sdebug_driver_remove() callback function. Unless I have missed something it
 is not guaranteed that that callback function is invoked before unloading of
 the scsi_debug driver has finished. I think most of the code in
 sdebug_driver_remove() should be moved to sdebug_remove_adapter().

I'm not sure that's right.  scsi_debug uses the driver mode in a
slightly unusual way, and includes both the bus driver, device and
device driver.

sdebug_driver_remove is a bus method, but as we don't have driver methods
should act very much like all other _remove callbacks.

sdebug_remove_adapter is more a bus-level function that calls into
the driver model to unbind devices from the driver.

But we defintively shouldn't stop and free queued command before we
fully remove the hosts.  As far as I can tell the stop_all_queued
call can be entirely removed from the remove path, as the midlayer
will take care of waiting for all commands to return, and the
free_all_queued should be after all hosts are unregistered.

Something like the (untested) patch below would do the trick.
We'd still need Dougs patch for the EH case, though.

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index d19c0e3..d022c2f 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -3983,14 +3983,13 @@ static void __exit scsi_debug_exit(void)
 {
int k = scsi_debug_add_host;
 
-   stop_all_queued();
-   free_all_queued();
for (; k; k--)
sdebug_remove_adapter();
driver_unregister(sdebug_driverfs_driver);
bus_unregister(pseudo_lld_bus);
root_device_unregister(pseudo_primary);
 
+   free_all_queued();
if (dif_storep)
vfree(dif_storep);
 
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal

2014-09-08 Thread Douglas Gilbert

On 14-09-08 11:07 AM, Christoph Hellwig wrote:

On Mon, Sep 08, 2014 at 11:11:23AM +0200, Bart Van Assche wrote:

Hello Doug,

In the scsi_debug driver scsi_remove_host() is called from inside the
sdebug_driver_remove() callback function. Unless I have missed something it
is not guaranteed that that callback function is invoked before unloading of
the scsi_debug driver has finished. I think most of the code in
sdebug_driver_remove() should be moved to sdebug_remove_adapter().


I'm not sure that's right.  scsi_debug uses the driver mode in a
slightly unusual way, and includes both the bus driver, device and
device driver.

sdebug_driver_remove is a bus method, but as we don't have driver methods
should act very much like all other _remove callbacks.

sdebug_remove_adapter is more a bus-level function that calls into
the driver model to unbind devices from the driver.

But we defintively shouldn't stop and free queued command before we
fully remove the hosts.  As far as I can tell the stop_all_queued
call can be entirely removed from the remove path, as the midlayer
will take care of waiting for all commands to return, and the
free_all_queued should be after all hosts are unregistered.


stop_all_queued() is doing hrtimer_cancel(), del_timer_sync()
or tasklet_kill() on all the scsi_cmnd objects that are
in play. Unless another mechanism calls the .eh_abort_handler
entry point reliably on each in play command then the module
cannot be removed. That is because some timer expiry callbacks
are pending.


Something like the (untested) patch below would do the trick.
We'd still need Dougs patch for the EH case, though.

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index d19c0e3..d022c2f 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -3983,14 +3983,13 @@ static void __exit scsi_debug_exit(void)
  {
int k = scsi_debug_add_host;

-   stop_all_queued();
-   free_all_queued();
for (; k; k--)
sdebug_remove_adapter();
driver_unregister(sdebug_driverfs_driver);
bus_unregister(pseudo_lld_bus);
root_device_unregister(pseudo_primary);

+   free_all_queued();
if (dif_storep)
vfree(dif_storep);

--


The only other call to stop_all_queued() is from the
.eh_host_reset_handler entry point.

Doug Gilbert


--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal

2014-09-05 Thread Bart Van Assche

On 09/05/14 15:56, Douglas Gilbert wrote:

With scsi-mq I think many LLDs probably have a new
race possibility between a surprise rmmod of the LLD
and another thread presenting a new command at about
the same time (or another thread's command completing
around that time). Does anything above the LLD stop
this happening?

Looking at mpt3sas and hpsa module exit calls, they don't
seem to guard against this possibility.

The test is pretty easy: build the LLD as a module, load
it and fire up a multi-thread, libaio fio test on one or
more devices (SSDs would probably be good) on that LLD.
While the test is running, do 'rmmod LLD'.


An LLD must call scsi_remove_host() directly or indirectly from the 
module cleanup path. scsi_remove_host() triggers a call to 
blk_cleanup_queue(). That last function sets the flag QUEUE_FLAG_DYING 
which prevents that new I/O is queued and waits until previously queued 
requests have finished before returning.


Bart.

--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal

2014-09-04 Thread Christoph Hellwig
Can I get another review for this one?

On Sun, Aug 31, 2014 at 07:09:59PM -0400, Douglas Gilbert wrote:
 A deadlock has been reported when the completion
 of SCSI commands (simulated by a timer) was surprised
 by a module removal. This patch removes one half of
 the offending locks around timer deletions. This fix
 is applied both to stop_all_queued() which is were
 the deadlock was discovered and stop_queued_cmnd()
 which has very similar logic.
 
 This patch should be applied both to the lk 3.17 tree
 and Christoph's drivers-for-3.18 tree.
 
 Tested-and-reported-by: Milan Broz gmazyl...@gmail.com
 Signed-off-by: Douglas Gilbert dgilb...@interlog.com

 --- a/drivers/scsi/scsi_debug.c   2014-08-26 13:24:51.646948507 -0400
 +++ b/drivers/scsi/scsi_debug.c   2014-08-30 18:04:54.589226679 -0400
 @@ -2743,6 +2743,13 @@ static int stop_queued_cmnd(struct scsi_
   if (test_bit(k, queued_in_use_bm)) {
   sqcp = queued_arr[k];
   if (cmnd == sqcp-a_cmnd) {
 + devip = (struct sdebug_dev_info *)
 + cmnd-device-hostdata;
 + if (devip)
 + atomic_dec(devip-num_in_q);
 + sqcp-a_cmnd = NULL;
 + spin_unlock_irqrestore(queued_arr_lock,
 +iflags);
   if (scsi_debug_ndelay  0) {
   if (sqcp-sd_hrtp)
   hrtimer_cancel(
 @@ -2755,18 +2762,13 @@ static int stop_queued_cmnd(struct scsi_
   if (sqcp-tletp)
   tasklet_kill(sqcp-tletp);
   }
 - __clear_bit(k, queued_in_use_bm);
 - devip = (struct sdebug_dev_info *)
 - cmnd-device-hostdata;
 - if (devip)
 - atomic_dec(devip-num_in_q);
 - sqcp-a_cmnd = NULL;
 - break;
 + clear_bit(k, queued_in_use_bm);
 + return 1;
   }
   }
   }
   spin_unlock_irqrestore(queued_arr_lock, iflags);
 - return (k  qmax) ? 1 : 0;
 + return 0;
  }
  
  /* Deletes (stops) timers or tasklets of all queued commands */
 @@ -2782,6 +2784,13 @@ static void stop_all_queued(void)
   if (test_bit(k, queued_in_use_bm)) {
   sqcp = queued_arr[k];
   if (sqcp-a_cmnd) {
 + devip = (struct sdebug_dev_info *)
 + sqcp-a_cmnd-device-hostdata;
 + if (devip)
 + atomic_dec(devip-num_in_q);
 + sqcp-a_cmnd = NULL;
 + spin_unlock_irqrestore(queued_arr_lock,
 +iflags);
   if (scsi_debug_ndelay  0) {
   if (sqcp-sd_hrtp)
   hrtimer_cancel(
 @@ -2794,12 +2803,8 @@ static void stop_all_queued(void)
   if (sqcp-tletp)
   tasklet_kill(sqcp-tletp);
   }
 - __clear_bit(k, queued_in_use_bm);
 - devip = (struct sdebug_dev_info *)
 - sqcp-a_cmnd-device-hostdata;
 - if (devip)
 - atomic_dec(devip-num_in_q);
 - sqcp-a_cmnd = NULL;
 + clear_bit(k, queued_in_use_bm);
 + spin_lock_irqsave(queued_arr_lock, iflags);
   }
   }
   }

---end quoted text---
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal

2014-09-01 Thread Christoph Hellwig
 --- a/drivers/scsi/scsi_debug.c   2014-08-26 13:24:51.646948507 -0400
 +++ b/drivers/scsi/scsi_debug.c   2014-08-30 18:04:54.589226679 -0400
 @@ -2743,6 +2743,13 @@ static int stop_queued_cmnd(struct scsi_
   if (test_bit(k, queued_in_use_bm)) {
   sqcp = queued_arr[k];
   if (cmnd == sqcp-a_cmnd) {
 + devip = (struct sdebug_dev_info *)
 + cmnd-device-hostdata;
 + if (devip)
 + atomic_dec(devip-num_in_q);
 + sqcp-a_cmnd = NULL;

Why would the hostdata every be NULL here?  We should never
call -slave_destroy on a device that has outstanding commands.

--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal

2014-09-01 Thread Douglas Gilbert

On 14-09-01 11:36 AM, Christoph Hellwig wrote:

--- a/drivers/scsi/scsi_debug.c 2014-08-26 13:24:51.646948507 -0400
+++ b/drivers/scsi/scsi_debug.c 2014-08-30 18:04:54.589226679 -0400
@@ -2743,6 +2743,13 @@ static int stop_queued_cmnd(struct scsi_
if (test_bit(k, queued_in_use_bm)) {
sqcp = queued_arr[k];
if (cmnd == sqcp-a_cmnd) {
+   devip = (struct sdebug_dev_info *)
+   cmnd-device-hostdata;
+   if (devip)
+   atomic_dec(devip-num_in_q);
+   sqcp-a_cmnd = NULL;


Why would the hostdata every be NULL here?  We should never
call -slave_destroy on a device that has outstanding commands.


To your first question, perhaps it could not be. In
the scsi_debug driver almost all uses of 'devip'
check for NULL, so it may not always have been true.

'rmmod scsi_debug' would lead to scsi_debug_exit()
being called and that called stop_all_queued() which
deadlocked with a command completion, or worse command
commencement. scsi_debug_exit() looks a bit racy: the
first thing it does is stop_all_queued() but has
anything been done to stop new commands being issued?
Later scsi_debug_exit() brings down the hosts.


This patch is primarily a bug fix for the lk 3.17 series and
the code you highlight was simply moved from being under a
lock to outside that lock. I didn't want to be too creative,
it's too easy to slip up and upset the management.


Enhancements could be sent to your drivers-for-3.18 tree. Rob
Elliott already has a few in mind, to improve performance.
Removing all 'devip' NULL checks would also improve performance,
and readability.

Doug Gilbert


--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html