Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime

2013-10-23 Thread Hannes Reinecke

On 10/16/2013 09:22 PM, James Bottomley wrote:

On Mon, 2013-07-01 at 08:50 +0200, Hannes Reinecke wrote:

This patchs adds an 'eh_deadline' sysfs attribute to the scsi
host which limits the overall runtime of the SCSI EH.
The 'eh_deadline' value is stored in the now obsolete field
'resetting'.
When a command is failed the start time of the EH is stored
in 'last_reset'. If the overall runtime of the SCSI EH is longer
than last_reset + eh_deadline, the EH is short-circuited and
falls through to issue a host reset only.


OK, so the specific problem with this one is that potentially it will
spend all its time mucking about with aborts (which most often time out
on non FC kit because of the issue problems) and then proceed to host
reset, which mostly does nothing for failing devices.

If you want to impose a deadline, then we need to spend only 50% of the
time attempting aborts and the rest of the time escalating the resets.

[...]

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index f43de1e..84369f2 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -89,6 +89,18 @@ void scsi_schedule_eh(struct Scsi_Host *shost)
  }
  EXPORT_SYMBOL_GPL(scsi_schedule_eh);

+static int scsi_host_eh_past_deadline(struct Scsi_Host *shost)
+{
+   if (!shost-last_reset || !shost-eh_deadline)
+   return 0;
+
+   if (time_before(jiffies,
+   shost-last_reset + shost-eh_deadline))
+   return 0;
+
+   return 1;
+}
+


What about instead:

static int scsi_host_eh_past_deadline(struct Scsi_Host *shost, int percent) {
if (!shost-last_reset || !shost-eh_deadline)
return 0;

if (time_before(jiffies,
shost-last_reset + shost-eh_deadline * percent/100))
return 0;

return 1;
}

which allows us to have

if (scsi_host_eh_past_deadline(shost, 50)) {

in scsi_eh_abort_cmds()

if (scsi_host_eh_past_deadline(shost, 66) {

in scsi_eh_bus_device_reset()

say 83 in target reset, and 100 in bus reset.

Thus ensuring we at least get a crack at the reset chain?

Alternatively we could just escalate directly to LUN reset once the 
first abort fails. That looks like a cleaner solution here.


Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime

2013-10-23 Thread James Bottomley
On Wed, 2013-10-23 at 11:25 +0200, Hannes Reinecke wrote:
 On 10/16/2013 09:22 PM, James Bottomley wrote:
  On Mon, 2013-07-01 at 08:50 +0200, Hannes Reinecke wrote:
  This patchs adds an 'eh_deadline' sysfs attribute to the scsi
  host which limits the overall runtime of the SCSI EH.
  The 'eh_deadline' value is stored in the now obsolete field
  'resetting'.
  When a command is failed the start time of the EH is stored
  in 'last_reset'. If the overall runtime of the SCSI EH is longer
  than last_reset + eh_deadline, the EH is short-circuited and
  falls through to issue a host reset only.
 
  OK, so the specific problem with this one is that potentially it will
  spend all its time mucking about with aborts (which most often time out
  on non FC kit because of the issue problems) and then proceed to host
  reset, which mostly does nothing for failing devices.
 
  If you want to impose a deadline, then we need to spend only 50% of the
  time attempting aborts and the rest of the time escalating the resets.
 
  [...]
  diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
  index f43de1e..84369f2 100644
  --- a/drivers/scsi/scsi_error.c
  +++ b/drivers/scsi/scsi_error.c
  @@ -89,6 +89,18 @@ void scsi_schedule_eh(struct Scsi_Host *shost)
}
EXPORT_SYMBOL_GPL(scsi_schedule_eh);
 
  +static int scsi_host_eh_past_deadline(struct Scsi_Host *shost)
  +{
  +  if (!shost-last_reset || !shost-eh_deadline)
  +  return 0;
  +
  +  if (time_before(jiffies,
  +  shost-last_reset + shost-eh_deadline))
  +  return 0;
  +
  +  return 1;
  +}
  +
 
  What about instead:
 
  static int scsi_host_eh_past_deadline(struct Scsi_Host *shost, int percent) 
  {
  if (!shost-last_reset || !shost-eh_deadline)
  return 0;
 
  if (time_before(jiffies,
  shost-last_reset + shost-eh_deadline * percent/100))
  return 0;
 
  return 1;
  }
 
  which allows us to have
 
  if (scsi_host_eh_past_deadline(shost, 50)) {
 
  in scsi_eh_abort_cmds()
 
  if (scsi_host_eh_past_deadline(shost, 66) {
 
  in scsi_eh_bus_device_reset()
 
  say 83 in target reset, and 100 in bus reset.
 
  Thus ensuring we at least get a crack at the reset chain?
 
 Alternatively we could just escalate directly to LUN reset once the 
 first abort fails. That looks like a cleaner solution here.

I'm OK with that ... if you want this in the current merge window, you
have about a week to code it up ...

James

--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime

2013-10-23 Thread Hannes Reinecke

On 10/23/2013 09:46 AM, James Bottomley wrote:

On Wed, 2013-10-23 at 11:25 +0200, Hannes Reinecke wrote:

On 10/16/2013 09:22 PM, James Bottomley wrote:

On Mon, 2013-07-01 at 08:50 +0200, Hannes Reinecke wrote:

This patchs adds an 'eh_deadline' sysfs attribute to the scsi
host which limits the overall runtime of the SCSI EH.
The 'eh_deadline' value is stored in the now obsolete field
'resetting'.
When a command is failed the start time of the EH is stored
in 'last_reset'. If the overall runtime of the SCSI EH is longer
than last_reset + eh_deadline, the EH is short-circuited and
falls through to issue a host reset only.


OK, so the specific problem with this one is that potentially it will
spend all its time mucking about with aborts (which most often time out
on non FC kit because of the issue problems) and then proceed to host
reset, which mostly does nothing for failing devices.

If you want to impose a deadline, then we need to spend only 50% of the
time attempting aborts and the rest of the time escalating the resets.

[...]

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index f43de1e..84369f2 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -89,6 +89,18 @@ void scsi_schedule_eh(struct Scsi_Host *shost)
   }
   EXPORT_SYMBOL_GPL(scsi_schedule_eh);

+static int scsi_host_eh_past_deadline(struct Scsi_Host *shost)
+{
+   if (!shost-last_reset || !shost-eh_deadline)
+   return 0;
+
+   if (time_before(jiffies,
+   shost-last_reset + shost-eh_deadline))
+   return 0;
+
+   return 1;
+}
+


What about instead:

static int scsi_host_eh_past_deadline(struct Scsi_Host *shost, int percent) {
if (!shost-last_reset || !shost-eh_deadline)
return 0;

if (time_before(jiffies,
shost-last_reset + shost-eh_deadline * percent/100))
return 0;

return 1;
}

which allows us to have

if (scsi_host_eh_past_deadline(shost, 50)) {

in scsi_eh_abort_cmds()

if (scsi_host_eh_past_deadline(shost, 66) {

in scsi_eh_bus_device_reset()

say 83 in target reset, and 100 in bus reset.

Thus ensuring we at least get a crack at the reset chain?


Alternatively we could just escalate directly to LUN reset once the
first abort fails. That looks like a cleaner solution here.


I'm OK with that ... if you want this in the current merge window, you
have about a week to code it up ...


Patch is already compiling :-)

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime

2013-10-17 Thread Ewan Milne
On Wed, 2013-10-16 at 19:22 +, James Bottomley wrote:
 What about instead:
 
 static int scsi_host_eh_past_deadline(struct Scsi_Host *shost, int percent) {
   if (!shost-last_reset || !shost-eh_deadline)
   return 0;
 
   if (time_before(jiffies,
   shost-last_reset + shost-eh_deadline * percent/100))
   return 0;
 
   return 1;
 }
 
 which allows us to have
 
 if (scsi_host_eh_past_deadline(shost, 50)) {
 
 in scsi_eh_abort_cmds()
 
 if (scsi_host_eh_past_deadline(shost, 66) {
 
 in scsi_eh_bus_device_reset()
 
 say 83 in target reset, and 100 in bus reset.
 
 Thus ensuring we at least get a crack at the reset chain?
 
 James
 

Well, conceptually that seems like a good idea, since if there
is limited time available it is probably wiser to spend it on
higher-level recovery instead of timing out trying to deal with
individual devices, but we didn't do any testing on how long
the bus device reset/target reset/bus reset take.  The host
reset was about 10 seconds for lpfc, and the maximum time was
(command timeout) + (eh deadline) + (host reset time).

However...

With this enhancement, the maximum time could be much longer if
we attempt to e.g. perform a bus reset right before the
eh_deadline expires, because drivers like lpfc iterate over the
targets and send a target reset to each one (with a timeout).

The original problem that prompted this change was that a
target became inaccessible, and nothing the EH did was ever going
to do anything except timeout, until the host reset was performed,
at which point the FC login would fail and the HBA would start
failing commands immediately instead of them timing out.

I guess the main thing is that there should be some way to
explain to people what value to use for eh_deadline in order
for I/O to complete within a specified amount of time (e.g. before
some other node in a cluster shoots us because we are unresponsive).

-Ewan




--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime

2013-10-16 Thread James Bottomley
On Mon, 2013-07-01 at 08:50 +0200, Hannes Reinecke wrote:
 This patchs adds an 'eh_deadline' sysfs attribute to the scsi
 host which limits the overall runtime of the SCSI EH.
 The 'eh_deadline' value is stored in the now obsolete field
 'resetting'.
 When a command is failed the start time of the EH is stored
 in 'last_reset'. If the overall runtime of the SCSI EH is longer
 than last_reset + eh_deadline, the EH is short-circuited and
 falls through to issue a host reset only.

OK, so the specific problem with this one is that potentially it will
spend all its time mucking about with aborts (which most often time out
on non FC kit because of the issue problems) and then proceed to host
reset, which mostly does nothing for failing devices.

If you want to impose a deadline, then we need to spend only 50% of the
time attempting aborts and the rest of the time escalating the resets.

[...]
 diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
 index f43de1e..84369f2 100644
 --- a/drivers/scsi/scsi_error.c
 +++ b/drivers/scsi/scsi_error.c
 @@ -89,6 +89,18 @@ void scsi_schedule_eh(struct Scsi_Host *shost)
  }
  EXPORT_SYMBOL_GPL(scsi_schedule_eh);
  
 +static int scsi_host_eh_past_deadline(struct Scsi_Host *shost)
 +{
 + if (!shost-last_reset || !shost-eh_deadline)
 + return 0;
 +
 + if (time_before(jiffies,
 + shost-last_reset + shost-eh_deadline))
 + return 0;
 +
 + return 1;
 +}
 +

What about instead:

static int scsi_host_eh_past_deadline(struct Scsi_Host *shost, int percent) {
if (!shost-last_reset || !shost-eh_deadline)
return 0;

if (time_before(jiffies,
shost-last_reset + shost-eh_deadline * percent/100))
return 0;

return 1;
}

which allows us to have

if (scsi_host_eh_past_deadline(shost, 50)) {

in scsi_eh_abort_cmds()

if (scsi_host_eh_past_deadline(shost, 66) {

in scsi_eh_bus_device_reset()

say 83 in target reset, and 100 in bus reset.

Thus ensuring we at least get a crack at the reset chain?

James

--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime

2013-07-01 Thread Hannes Reinecke
This patchs adds an 'eh_deadline' sysfs attribute to the scsi
host which limits the overall runtime of the SCSI EH.
The 'eh_deadline' value is stored in the now obsolete field
'resetting'.
When a command is failed the start time of the EH is stored
in 'last_reset'. If the overall runtime of the SCSI EH is longer
than last_reset + eh_deadline, the EH is short-circuited and
falls through to issue a host reset only.

Signed-off-by: Hannes Reinecke h...@suse.de
---
 drivers/scsi/hosts.c  |   7 +++
 drivers/scsi/scsi_error.c | 130 +++---
 drivers/scsi/scsi_sysfs.c |  37 +
 include/scsi/scsi_host.h  |   4 +-
 4 files changed, 170 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index df0c3c7..f334859 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -316,6 +316,12 @@ static void scsi_host_dev_release(struct device *dev)
kfree(shost);
 }
 
+static unsigned int shost_eh_deadline;
+
+module_param_named(eh_deadline, shost_eh_deadline, uint, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(eh_deadline,
+SCSI EH timeout in seconds (should be between 1 and 2^32-1));
+
 static struct device_type scsi_host_type = {
.name = scsi_host,
.release =  scsi_host_dev_release,
@@ -388,6 +394,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template 
*sht, int privsize)
shost-unchecked_isa_dma = sht-unchecked_isa_dma;
shost-use_clustering = sht-use_clustering;
shost-ordered_tag = sht-ordered_tag;
+   shost-eh_deadline = shost_eh_deadline * HZ;
 
if (sht-supported_mode == MODE_UNKNOWN)
/* means we didn't set it ... default to INITIATOR */
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index f43de1e..84369f2 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -89,6 +89,18 @@ void scsi_schedule_eh(struct Scsi_Host *shost)
 }
 EXPORT_SYMBOL_GPL(scsi_schedule_eh);
 
+static int scsi_host_eh_past_deadline(struct Scsi_Host *shost)
+{
+   if (!shost-last_reset || !shost-eh_deadline)
+   return 0;
+
+   if (time_before(jiffies,
+   shost-last_reset + shost-eh_deadline))
+   return 0;
+
+   return 1;
+}
+
 /**
  * scsi_eh_scmd_add - add scsi cmd to error handling.
  * @scmd:  scmd to run eh on.
@@ -111,6 +123,9 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
if (scsi_host_set_state(shost, SHOST_CANCEL_RECOVERY))
goto out_unlock;
 
+   if (shost-eh_deadline  !shost-last_reset)
+   shost-last_reset = jiffies;
+
ret = 1;
scmd-eh_eflags |= eh_flag;
list_add_tail(scmd-eh_entry, shost-eh_cmd_q);
@@ -140,6 +155,9 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
trace_scsi_dispatch_cmd_timeout(scmd);
scsi_log_completion(scmd, TIMEOUT_ERROR);
 
+   if (host-eh_deadline  !host-last_reset)
+   host-last_reset = jiffies;
+
if (host-transportt-eh_timed_out)
rtn = host-transportt-eh_timed_out(scmd);
else if (host-hostt-eh_timed_out)
@@ -928,13 +946,26 @@ int scsi_eh_get_sense(struct list_head *work_q,
  struct list_head *done_q)
 {
struct scsi_cmnd *scmd, *next;
+   struct Scsi_Host *shost;
int rtn;
+   unsigned long flags;
 
list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
if ((scmd-eh_eflags  SCSI_EH_CANCEL_CMD) ||
SCSI_SENSE_VALID(scmd))
continue;
 
+   shost = scmd-device-host;
+   spin_lock_irqsave(shost-host_lock, flags);
+   if (scsi_host_eh_past_deadline(shost)) {
+   spin_unlock_irqrestore(shost-host_lock, flags);
+   SCSI_LOG_ERROR_RECOVERY(3,
+   shost_printk(KERN_INFO, shost,
+   skip %s, past eh deadline\n,
+__func__));
+   break;
+   }
+   spin_unlock_irqrestore(shost-host_lock, flags);
SCSI_LOG_ERROR_RECOVERY(2, scmd_printk(KERN_INFO, scmd,
  %s: requesting sense\n,
  current-comm));
@@ -1019,11 +1050,28 @@ static int scsi_eh_test_devices(struct list_head 
*cmd_list,
struct scsi_cmnd *scmd, *next;
struct scsi_device *sdev;
int finish_cmds;
+   unsigned long flags;
 
while (!list_empty(cmd_list)) {
scmd = list_entry(cmd_list-next, struct scsi_cmnd, eh_entry);
sdev = scmd-device;
 
+   if (!try_stu) {
+   spin_lock_irqsave(sdev-host-host_lock, flags);
+   if 

Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime

2013-06-28 Thread Hannes Reinecke
On 06/28/2013 09:29 AM, Bart Van Assche wrote:
 On 06/10/13 13:11, Hannes Reinecke wrote:
 +static int sdev_eh_deadline(struct Scsi_Host *shost,
 +   unsigned long eh_start)
 +{
 +if (!shost-eh_deadline)
 +return 0;
 +
 +if (shost-last_reset != 0 
 +time_before(shost-last_reset, eh_start))
 +eh_start = shost-last_reset;
 +
 +if (time_before(jiffies,
 +eh_start + shost-eh_deadline))
 +return 0;
 +
 +return 1;
 +}
 +
 +static int scsi_host_eh_deadline(struct Scsi_Host *shost)
 +{
 +if (!shost-last_reset)
 +return 0;
 +
 +return sdev_eh_deadline(shost, shost-last_reset);
 +}
 
 Hello Hannes,
 
 I would appreciate if you would choose other names for these two
 functions and also for shost-eh_deadline. To me a deadline is a
 time instant. As far as I can see the two functions above check
 whether a deadline has been passed, and shost-eh_deadline is a time
 interval. How about the following names: sdev_eh_past_deadline(),
 shost_eh_past_deadline() and shost-max_eh_jiffies ?
 
Sure. I changed the naming once already, so I don't have any issues
with that.
And yes, the suggested naming does make more sense.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime

2013-06-28 Thread Ewan Milne
On Fri, 2013-06-28 at 09:14 +0200, Hannes Reinecke wrote:

  @@ -232,6 +272,9 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int 
  eh_flag)
 if (scsi_host_set_state(shost, SHOST_CANCEL_RECOVERY))
 goto out_unlock;
   
  +  if (sdev-eh_deadline  !shost-last_reset)
  +  shost-last_reset = jiffies;
  +
  
  I think this is supposed to be if (shost-eh_deadline ...
  
 No. -last_reset is set to the time the first command timeout/failure.
 
 -last_reset + -eh_deadline will give you the expiry time.

Sorry, I wasn't clear what I meant.  What I meant was:

  if (shost-eh_deadline  !shost-last_reset)
  shost-last_reset = jiffies;

...because the eh_deadline field is in struct Scsi_Host, not
in struct scsi_device.

-Ewan



--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime

2013-06-27 Thread Ewan Milne
The eh_deadline changes allow for a significant improvement
in multipath failover time.  It works very well in our testing.
I do have a few corrections, see below:

On Mon, 2013-06-10 at 13:11 +0200, Hannes Reinecke wrote:
 This patchs adds an 'eh_deadline' attribute to the scsi
 host which limits the overall runtime of the SCSI EH.
 When a command is failed the start time of the EH is stored
 in 'last_reset'. If the overall runtime of the SCSI EH is longer
 than last_reset + eh_deadline, the EH is short-circuited and
 falls through to issue a host reset only.
 
 Signed-off-by: Hannes Reinecke h...@suse.de
 ---
  drivers/scsi/hosts.c  |   7 +++
  drivers/scsi/scsi_error.c | 142 
 +++---
  drivers/scsi/scsi_sysfs.c |  37 
  include/scsi/scsi_host.h  |   2 +-
  4 files changed, 180 insertions(+), 8 deletions(-)
 
 diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
 index df0c3c7..c8d828f 100644
 --- a/drivers/scsi/hosts.c
 +++ b/drivers/scsi/hosts.c
 @@ -316,6 +316,12 @@ static void scsi_host_dev_release(struct device *dev)
   kfree(shost);
  }
  
 +static unsigned int shost_eh_deadline;
 +
 +module_param_named(eh_deadline, shost_eh_deadline, uint, S_IRUGO|S_IWUSR);
 +MODULE_PARM_DESC(eh_deadline,
 +  SCSI EH deadline in seconds (should be between 1 and 
 2^32-1));
 +
  static struct device_type scsi_host_type = {
   .name = scsi_host,
   .release =  scsi_host_dev_release,
 @@ -388,6 +394,7 @@ struct Scsi_Host *scsi_host_alloc(struct 
 scsi_host_template *sht, int privsize)
   shost-unchecked_isa_dma = sht-unchecked_isa_dma;
   shost-use_clustering = sht-use_clustering;
   shost-ordered_tag = sht-ordered_tag;
 + shost-eh_deadline = shost_eh_deadline;

This should be shost-eh_deadline = shost_eh_deadline * HZ; since the
parameter is specified in seconds.

  
   if (sht-supported_mode == MODE_UNKNOWN)
   /* means we didn't set it ... default to INITIATOR */
 diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
 index 467cb3c..cf30475 100644
 --- a/drivers/scsi/scsi_error.c
 +++ b/drivers/scsi/scsi_error.c
 @@ -91,6 +91,31 @@ void scsi_schedule_eh(struct Scsi_Host *shost)
  }
  EXPORT_SYMBOL_GPL(scsi_schedule_eh);
  
 +static int sdev_eh_deadline(struct Scsi_Host *shost,
 +unsigned long eh_start)
 +{
 + if (!shost-eh_deadline)
 + return 0;
 +
 + if (shost-last_reset != 0 
 + time_before(shost-last_reset, eh_start))
 + eh_start = shost-last_reset;
 +
 + if (time_before(jiffies,
 + eh_start + shost-eh_deadline))
 + return 0;
 +
 + return 1;
 +}
 +
 +static int scsi_host_eh_deadline(struct Scsi_Host *shost)
 +{
 + if (!shost-last_reset)
 + return 0;
 +
 + return sdev_eh_deadline(shost, shost-last_reset);
 +}
 +
  /**
   * scsi_eh_abort_handler - Handle command aborts
   * @work:sdev on which commands should be aborted.
 @@ -102,13 +127,15 @@ scsi_eh_abort_handler(struct work_struct *work)
   container_of(work, struct scsi_device, abort_work);
   struct scsi_cmnd *scmd, *tmp;
   LIST_HEAD(abort_list);
 - unsigned long flags;
 + unsigned long flags, eh_start;
   int rtn;
  
   spin_lock_irqsave(sdev-list_lock, flags);
   list_splice_init(sdev-eh_abort_list, abort_list);
   spin_unlock_irqrestore(sdev-list_lock, flags);
  
 + eh_start = jiffies;
 +
   list_for_each_entry_safe(scmd, tmp, abort_list, eh_entry) {
   list_del_init(scmd-eh_entry);
   if (sdev-sdev_state == SDEV_CANCEL) {
 @@ -119,6 +146,13 @@ scsi_eh_abort_handler(struct work_struct *work)
   scsi_finish_command(scmd);
   continue;
   }
 + if (sdev_eh_deadline(sdev-host, eh_start)) {
 + SCSI_LOG_ERROR_RECOVERY(3,
 + scmd_printk(KERN_INFO, scmd,
 +  eh timeout, not aborting\n));
 + list_move_tail(scmd-eh_entry, abort_list);
 + goto start_eh;
 + }
   SCSI_LOG_ERROR_RECOVERY(3,
   scmd_printk(KERN_INFO, scmd,
   aborting command %p\n, scmd));
 @@ -151,6 +185,12 @@ scsi_eh_abort_handler(struct work_struct *work)
   return;
  
  start_eh:
 + spin_lock_irqsave(sdev-host-host_lock, flags);
 + if (sdev-host-eh_deadline 
 + (!sdev-host-last_reset ||
 +  time_before(eh_start, sdev-host-last_reset)))
 + sdev-host-last_reset = eh_start;
 + spin_unlock_irqrestore(sdev-host-host_lock, flags);
   list_for_each_entry_safe(scmd, tmp, abort_list, eh_entry) {
   scmd-result |= DID_TIME_OUT  16;
   if (!scsi_eh_scmd_add(scmd, 0)) {
 @@ -232,6 +272,9 @@ int scsi_eh_scmd_add(struct scsi_cmnd 

[PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime

2013-06-10 Thread Hannes Reinecke
This patchs adds an 'eh_deadline' attribute to the scsi
host which limits the overall runtime of the SCSI EH.
When a command is failed the start time of the EH is stored
in 'last_reset'. If the overall runtime of the SCSI EH is longer
than last_reset + eh_deadline, the EH is short-circuited and
falls through to issue a host reset only.

Signed-off-by: Hannes Reinecke h...@suse.de
---
 drivers/scsi/hosts.c  |   7 +++
 drivers/scsi/scsi_error.c | 142 +++---
 drivers/scsi/scsi_sysfs.c |  37 
 include/scsi/scsi_host.h  |   2 +-
 4 files changed, 180 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index df0c3c7..c8d828f 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -316,6 +316,12 @@ static void scsi_host_dev_release(struct device *dev)
kfree(shost);
 }
 
+static unsigned int shost_eh_deadline;
+
+module_param_named(eh_deadline, shost_eh_deadline, uint, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(eh_deadline,
+SCSI EH deadline in seconds (should be between 1 and 
2^32-1));
+
 static struct device_type scsi_host_type = {
.name = scsi_host,
.release =  scsi_host_dev_release,
@@ -388,6 +394,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template 
*sht, int privsize)
shost-unchecked_isa_dma = sht-unchecked_isa_dma;
shost-use_clustering = sht-use_clustering;
shost-ordered_tag = sht-ordered_tag;
+   shost-eh_deadline = shost_eh_deadline;
 
if (sht-supported_mode == MODE_UNKNOWN)
/* means we didn't set it ... default to INITIATOR */
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 467cb3c..cf30475 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -91,6 +91,31 @@ void scsi_schedule_eh(struct Scsi_Host *shost)
 }
 EXPORT_SYMBOL_GPL(scsi_schedule_eh);
 
+static int sdev_eh_deadline(struct Scsi_Host *shost,
+  unsigned long eh_start)
+{
+   if (!shost-eh_deadline)
+   return 0;
+
+   if (shost-last_reset != 0 
+   time_before(shost-last_reset, eh_start))
+   eh_start = shost-last_reset;
+
+   if (time_before(jiffies,
+   eh_start + shost-eh_deadline))
+   return 0;
+
+   return 1;
+}
+
+static int scsi_host_eh_deadline(struct Scsi_Host *shost)
+{
+   if (!shost-last_reset)
+   return 0;
+
+   return sdev_eh_deadline(shost, shost-last_reset);
+}
+
 /**
  * scsi_eh_abort_handler - Handle command aborts
  * @work:  sdev on which commands should be aborted.
@@ -102,13 +127,15 @@ scsi_eh_abort_handler(struct work_struct *work)
container_of(work, struct scsi_device, abort_work);
struct scsi_cmnd *scmd, *tmp;
LIST_HEAD(abort_list);
-   unsigned long flags;
+   unsigned long flags, eh_start;
int rtn;
 
spin_lock_irqsave(sdev-list_lock, flags);
list_splice_init(sdev-eh_abort_list, abort_list);
spin_unlock_irqrestore(sdev-list_lock, flags);
 
+   eh_start = jiffies;
+
list_for_each_entry_safe(scmd, tmp, abort_list, eh_entry) {
list_del_init(scmd-eh_entry);
if (sdev-sdev_state == SDEV_CANCEL) {
@@ -119,6 +146,13 @@ scsi_eh_abort_handler(struct work_struct *work)
scsi_finish_command(scmd);
continue;
}
+   if (sdev_eh_deadline(sdev-host, eh_start)) {
+   SCSI_LOG_ERROR_RECOVERY(3,
+   scmd_printk(KERN_INFO, scmd,
+eh timeout, not aborting\n));
+   list_move_tail(scmd-eh_entry, abort_list);
+   goto start_eh;
+   }
SCSI_LOG_ERROR_RECOVERY(3,
scmd_printk(KERN_INFO, scmd,
aborting command %p\n, scmd));
@@ -151,6 +185,12 @@ scsi_eh_abort_handler(struct work_struct *work)
return;
 
 start_eh:
+   spin_lock_irqsave(sdev-host-host_lock, flags);
+   if (sdev-host-eh_deadline 
+   (!sdev-host-last_reset ||
+time_before(eh_start, sdev-host-last_reset)))
+   sdev-host-last_reset = eh_start;
+   spin_unlock_irqrestore(sdev-host-host_lock, flags);
list_for_each_entry_safe(scmd, tmp, abort_list, eh_entry) {
scmd-result |= DID_TIME_OUT  16;
if (!scsi_eh_scmd_add(scmd, 0)) {
@@ -232,6 +272,9 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
if (scsi_host_set_state(shost, SHOST_CANCEL_RECOVERY))
goto out_unlock;
 
+   if (sdev-eh_deadline  !shost-last_reset)
+   shost-last_reset = jiffies;
+
ret = 1;
scmd-eh_eflags |= eh_flag;
list_add_tail(scmd-eh_entry, shost-eh_cmd_q);