Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime
On 10/16/2013 09:22 PM, James Bottomley wrote: On Mon, 2013-07-01 at 08:50 +0200, Hannes Reinecke wrote: This patchs adds an 'eh_deadline' sysfs attribute to the scsi host which limits the overall runtime of the SCSI EH. The 'eh_deadline' value is stored in the now obsolete field 'resetting'. When a command is failed the start time of the EH is stored in 'last_reset'. If the overall runtime of the SCSI EH is longer than last_reset + eh_deadline, the EH is short-circuited and falls through to issue a host reset only. OK, so the specific problem with this one is that potentially it will spend all its time mucking about with aborts (which most often time out on non FC kit because of the issue problems) and then proceed to host reset, which mostly does nothing for failing devices. If you want to impose a deadline, then we need to spend only 50% of the time attempting aborts and the rest of the time escalating the resets. [...] diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index f43de1e..84369f2 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -89,6 +89,18 @@ void scsi_schedule_eh(struct Scsi_Host *shost) } EXPORT_SYMBOL_GPL(scsi_schedule_eh); +static int scsi_host_eh_past_deadline(struct Scsi_Host *shost) +{ + if (!shost-last_reset || !shost-eh_deadline) + return 0; + + if (time_before(jiffies, + shost-last_reset + shost-eh_deadline)) + return 0; + + return 1; +} + What about instead: static int scsi_host_eh_past_deadline(struct Scsi_Host *shost, int percent) { if (!shost-last_reset || !shost-eh_deadline) return 0; if (time_before(jiffies, shost-last_reset + shost-eh_deadline * percent/100)) return 0; return 1; } which allows us to have if (scsi_host_eh_past_deadline(shost, 50)) { in scsi_eh_abort_cmds() if (scsi_host_eh_past_deadline(shost, 66) { in scsi_eh_bus_device_reset() say 83 in target reset, and 100 in bus reset. Thus ensuring we at least get a crack at the reset chain? Alternatively we could just escalate directly to LUN reset once the first abort fails. That looks like a cleaner solution here. Cheers, Hannes -- Dr. Hannes Reinecke zSeries Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime
On Wed, 2013-10-23 at 11:25 +0200, Hannes Reinecke wrote: On 10/16/2013 09:22 PM, James Bottomley wrote: On Mon, 2013-07-01 at 08:50 +0200, Hannes Reinecke wrote: This patchs adds an 'eh_deadline' sysfs attribute to the scsi host which limits the overall runtime of the SCSI EH. The 'eh_deadline' value is stored in the now obsolete field 'resetting'. When a command is failed the start time of the EH is stored in 'last_reset'. If the overall runtime of the SCSI EH is longer than last_reset + eh_deadline, the EH is short-circuited and falls through to issue a host reset only. OK, so the specific problem with this one is that potentially it will spend all its time mucking about with aborts (which most often time out on non FC kit because of the issue problems) and then proceed to host reset, which mostly does nothing for failing devices. If you want to impose a deadline, then we need to spend only 50% of the time attempting aborts and the rest of the time escalating the resets. [...] diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index f43de1e..84369f2 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -89,6 +89,18 @@ void scsi_schedule_eh(struct Scsi_Host *shost) } EXPORT_SYMBOL_GPL(scsi_schedule_eh); +static int scsi_host_eh_past_deadline(struct Scsi_Host *shost) +{ + if (!shost-last_reset || !shost-eh_deadline) + return 0; + + if (time_before(jiffies, + shost-last_reset + shost-eh_deadline)) + return 0; + + return 1; +} + What about instead: static int scsi_host_eh_past_deadline(struct Scsi_Host *shost, int percent) { if (!shost-last_reset || !shost-eh_deadline) return 0; if (time_before(jiffies, shost-last_reset + shost-eh_deadline * percent/100)) return 0; return 1; } which allows us to have if (scsi_host_eh_past_deadline(shost, 50)) { in scsi_eh_abort_cmds() if (scsi_host_eh_past_deadline(shost, 66) { in scsi_eh_bus_device_reset() say 83 in target reset, and 100 in bus reset. Thus ensuring we at least get a crack at the reset chain? Alternatively we could just escalate directly to LUN reset once the first abort fails. That looks like a cleaner solution here. I'm OK with that ... if you want this in the current merge window, you have about a week to code it up ... James -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime
On 10/23/2013 09:46 AM, James Bottomley wrote: On Wed, 2013-10-23 at 11:25 +0200, Hannes Reinecke wrote: On 10/16/2013 09:22 PM, James Bottomley wrote: On Mon, 2013-07-01 at 08:50 +0200, Hannes Reinecke wrote: This patchs adds an 'eh_deadline' sysfs attribute to the scsi host which limits the overall runtime of the SCSI EH. The 'eh_deadline' value is stored in the now obsolete field 'resetting'. When a command is failed the start time of the EH is stored in 'last_reset'. If the overall runtime of the SCSI EH is longer than last_reset + eh_deadline, the EH is short-circuited and falls through to issue a host reset only. OK, so the specific problem with this one is that potentially it will spend all its time mucking about with aborts (which most often time out on non FC kit because of the issue problems) and then proceed to host reset, which mostly does nothing for failing devices. If you want to impose a deadline, then we need to spend only 50% of the time attempting aborts and the rest of the time escalating the resets. [...] diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index f43de1e..84369f2 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -89,6 +89,18 @@ void scsi_schedule_eh(struct Scsi_Host *shost) } EXPORT_SYMBOL_GPL(scsi_schedule_eh); +static int scsi_host_eh_past_deadline(struct Scsi_Host *shost) +{ + if (!shost-last_reset || !shost-eh_deadline) + return 0; + + if (time_before(jiffies, + shost-last_reset + shost-eh_deadline)) + return 0; + + return 1; +} + What about instead: static int scsi_host_eh_past_deadline(struct Scsi_Host *shost, int percent) { if (!shost-last_reset || !shost-eh_deadline) return 0; if (time_before(jiffies, shost-last_reset + shost-eh_deadline * percent/100)) return 0; return 1; } which allows us to have if (scsi_host_eh_past_deadline(shost, 50)) { in scsi_eh_abort_cmds() if (scsi_host_eh_past_deadline(shost, 66) { in scsi_eh_bus_device_reset() say 83 in target reset, and 100 in bus reset. Thus ensuring we at least get a crack at the reset chain? Alternatively we could just escalate directly to LUN reset once the first abort fails. That looks like a cleaner solution here. I'm OK with that ... if you want this in the current merge window, you have about a week to code it up ... Patch is already compiling :-) Cheers, Hannes -- Dr. Hannes Reinecke zSeries Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime
On Wed, 2013-10-16 at 19:22 +, James Bottomley wrote: What about instead: static int scsi_host_eh_past_deadline(struct Scsi_Host *shost, int percent) { if (!shost-last_reset || !shost-eh_deadline) return 0; if (time_before(jiffies, shost-last_reset + shost-eh_deadline * percent/100)) return 0; return 1; } which allows us to have if (scsi_host_eh_past_deadline(shost, 50)) { in scsi_eh_abort_cmds() if (scsi_host_eh_past_deadline(shost, 66) { in scsi_eh_bus_device_reset() say 83 in target reset, and 100 in bus reset. Thus ensuring we at least get a crack at the reset chain? James Well, conceptually that seems like a good idea, since if there is limited time available it is probably wiser to spend it on higher-level recovery instead of timing out trying to deal with individual devices, but we didn't do any testing on how long the bus device reset/target reset/bus reset take. The host reset was about 10 seconds for lpfc, and the maximum time was (command timeout) + (eh deadline) + (host reset time). However... With this enhancement, the maximum time could be much longer if we attempt to e.g. perform a bus reset right before the eh_deadline expires, because drivers like lpfc iterate over the targets and send a target reset to each one (with a timeout). The original problem that prompted this change was that a target became inaccessible, and nothing the EH did was ever going to do anything except timeout, until the host reset was performed, at which point the FC login would fail and the HBA would start failing commands immediately instead of them timing out. I guess the main thing is that there should be some way to explain to people what value to use for eh_deadline in order for I/O to complete within a specified amount of time (e.g. before some other node in a cluster shoots us because we are unresponsive). -Ewan -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime
On Mon, 2013-07-01 at 08:50 +0200, Hannes Reinecke wrote: This patchs adds an 'eh_deadline' sysfs attribute to the scsi host which limits the overall runtime of the SCSI EH. The 'eh_deadline' value is stored in the now obsolete field 'resetting'. When a command is failed the start time of the EH is stored in 'last_reset'. If the overall runtime of the SCSI EH is longer than last_reset + eh_deadline, the EH is short-circuited and falls through to issue a host reset only. OK, so the specific problem with this one is that potentially it will spend all its time mucking about with aborts (which most often time out on non FC kit because of the issue problems) and then proceed to host reset, which mostly does nothing for failing devices. If you want to impose a deadline, then we need to spend only 50% of the time attempting aborts and the rest of the time escalating the resets. [...] diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index f43de1e..84369f2 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -89,6 +89,18 @@ void scsi_schedule_eh(struct Scsi_Host *shost) } EXPORT_SYMBOL_GPL(scsi_schedule_eh); +static int scsi_host_eh_past_deadline(struct Scsi_Host *shost) +{ + if (!shost-last_reset || !shost-eh_deadline) + return 0; + + if (time_before(jiffies, + shost-last_reset + shost-eh_deadline)) + return 0; + + return 1; +} + What about instead: static int scsi_host_eh_past_deadline(struct Scsi_Host *shost, int percent) { if (!shost-last_reset || !shost-eh_deadline) return 0; if (time_before(jiffies, shost-last_reset + shost-eh_deadline * percent/100)) return 0; return 1; } which allows us to have if (scsi_host_eh_past_deadline(shost, 50)) { in scsi_eh_abort_cmds() if (scsi_host_eh_past_deadline(shost, 66) { in scsi_eh_bus_device_reset() say 83 in target reset, and 100 in bus reset. Thus ensuring we at least get a crack at the reset chain? James -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime
This patchs adds an 'eh_deadline' sysfs attribute to the scsi host which limits the overall runtime of the SCSI EH. The 'eh_deadline' value is stored in the now obsolete field 'resetting'. When a command is failed the start time of the EH is stored in 'last_reset'. If the overall runtime of the SCSI EH is longer than last_reset + eh_deadline, the EH is short-circuited and falls through to issue a host reset only. Signed-off-by: Hannes Reinecke h...@suse.de --- drivers/scsi/hosts.c | 7 +++ drivers/scsi/scsi_error.c | 130 +++--- drivers/scsi/scsi_sysfs.c | 37 + include/scsi/scsi_host.h | 4 +- 4 files changed, 170 insertions(+), 8 deletions(-) diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c index df0c3c7..f334859 100644 --- a/drivers/scsi/hosts.c +++ b/drivers/scsi/hosts.c @@ -316,6 +316,12 @@ static void scsi_host_dev_release(struct device *dev) kfree(shost); } +static unsigned int shost_eh_deadline; + +module_param_named(eh_deadline, shost_eh_deadline, uint, S_IRUGO|S_IWUSR); +MODULE_PARM_DESC(eh_deadline, +SCSI EH timeout in seconds (should be between 1 and 2^32-1)); + static struct device_type scsi_host_type = { .name = scsi_host, .release = scsi_host_dev_release, @@ -388,6 +394,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize) shost-unchecked_isa_dma = sht-unchecked_isa_dma; shost-use_clustering = sht-use_clustering; shost-ordered_tag = sht-ordered_tag; + shost-eh_deadline = shost_eh_deadline * HZ; if (sht-supported_mode == MODE_UNKNOWN) /* means we didn't set it ... default to INITIATOR */ diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index f43de1e..84369f2 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -89,6 +89,18 @@ void scsi_schedule_eh(struct Scsi_Host *shost) } EXPORT_SYMBOL_GPL(scsi_schedule_eh); +static int scsi_host_eh_past_deadline(struct Scsi_Host *shost) +{ + if (!shost-last_reset || !shost-eh_deadline) + return 0; + + if (time_before(jiffies, + shost-last_reset + shost-eh_deadline)) + return 0; + + return 1; +} + /** * scsi_eh_scmd_add - add scsi cmd to error handling. * @scmd: scmd to run eh on. @@ -111,6 +123,9 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag) if (scsi_host_set_state(shost, SHOST_CANCEL_RECOVERY)) goto out_unlock; + if (shost-eh_deadline !shost-last_reset) + shost-last_reset = jiffies; + ret = 1; scmd-eh_eflags |= eh_flag; list_add_tail(scmd-eh_entry, shost-eh_cmd_q); @@ -140,6 +155,9 @@ enum blk_eh_timer_return scsi_times_out(struct request *req) trace_scsi_dispatch_cmd_timeout(scmd); scsi_log_completion(scmd, TIMEOUT_ERROR); + if (host-eh_deadline !host-last_reset) + host-last_reset = jiffies; + if (host-transportt-eh_timed_out) rtn = host-transportt-eh_timed_out(scmd); else if (host-hostt-eh_timed_out) @@ -928,13 +946,26 @@ int scsi_eh_get_sense(struct list_head *work_q, struct list_head *done_q) { struct scsi_cmnd *scmd, *next; + struct Scsi_Host *shost; int rtn; + unsigned long flags; list_for_each_entry_safe(scmd, next, work_q, eh_entry) { if ((scmd-eh_eflags SCSI_EH_CANCEL_CMD) || SCSI_SENSE_VALID(scmd)) continue; + shost = scmd-device-host; + spin_lock_irqsave(shost-host_lock, flags); + if (scsi_host_eh_past_deadline(shost)) { + spin_unlock_irqrestore(shost-host_lock, flags); + SCSI_LOG_ERROR_RECOVERY(3, + shost_printk(KERN_INFO, shost, + skip %s, past eh deadline\n, +__func__)); + break; + } + spin_unlock_irqrestore(shost-host_lock, flags); SCSI_LOG_ERROR_RECOVERY(2, scmd_printk(KERN_INFO, scmd, %s: requesting sense\n, current-comm)); @@ -1019,11 +1050,28 @@ static int scsi_eh_test_devices(struct list_head *cmd_list, struct scsi_cmnd *scmd, *next; struct scsi_device *sdev; int finish_cmds; + unsigned long flags; while (!list_empty(cmd_list)) { scmd = list_entry(cmd_list-next, struct scsi_cmnd, eh_entry); sdev = scmd-device; + if (!try_stu) { + spin_lock_irqsave(sdev-host-host_lock, flags); + if
Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime
On 06/28/2013 09:29 AM, Bart Van Assche wrote: On 06/10/13 13:11, Hannes Reinecke wrote: +static int sdev_eh_deadline(struct Scsi_Host *shost, + unsigned long eh_start) +{ +if (!shost-eh_deadline) +return 0; + +if (shost-last_reset != 0 +time_before(shost-last_reset, eh_start)) +eh_start = shost-last_reset; + +if (time_before(jiffies, +eh_start + shost-eh_deadline)) +return 0; + +return 1; +} + +static int scsi_host_eh_deadline(struct Scsi_Host *shost) +{ +if (!shost-last_reset) +return 0; + +return sdev_eh_deadline(shost, shost-last_reset); +} Hello Hannes, I would appreciate if you would choose other names for these two functions and also for shost-eh_deadline. To me a deadline is a time instant. As far as I can see the two functions above check whether a deadline has been passed, and shost-eh_deadline is a time interval. How about the following names: sdev_eh_past_deadline(), shost_eh_past_deadline() and shost-max_eh_jiffies ? Sure. I changed the naming once already, so I don't have any issues with that. And yes, the suggested naming does make more sense. Cheers, Hannes -- Dr. Hannes Reinecke zSeries Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime
On Fri, 2013-06-28 at 09:14 +0200, Hannes Reinecke wrote: @@ -232,6 +272,9 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag) if (scsi_host_set_state(shost, SHOST_CANCEL_RECOVERY)) goto out_unlock; + if (sdev-eh_deadline !shost-last_reset) + shost-last_reset = jiffies; + I think this is supposed to be if (shost-eh_deadline ... No. -last_reset is set to the time the first command timeout/failure. -last_reset + -eh_deadline will give you the expiry time. Sorry, I wasn't clear what I meant. What I meant was: if (shost-eh_deadline !shost-last_reset) shost-last_reset = jiffies; ...because the eh_deadline field is in struct Scsi_Host, not in struct scsi_device. -Ewan -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime
The eh_deadline changes allow for a significant improvement in multipath failover time. It works very well in our testing. I do have a few corrections, see below: On Mon, 2013-06-10 at 13:11 +0200, Hannes Reinecke wrote: This patchs adds an 'eh_deadline' attribute to the scsi host which limits the overall runtime of the SCSI EH. When a command is failed the start time of the EH is stored in 'last_reset'. If the overall runtime of the SCSI EH is longer than last_reset + eh_deadline, the EH is short-circuited and falls through to issue a host reset only. Signed-off-by: Hannes Reinecke h...@suse.de --- drivers/scsi/hosts.c | 7 +++ drivers/scsi/scsi_error.c | 142 +++--- drivers/scsi/scsi_sysfs.c | 37 include/scsi/scsi_host.h | 2 +- 4 files changed, 180 insertions(+), 8 deletions(-) diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c index df0c3c7..c8d828f 100644 --- a/drivers/scsi/hosts.c +++ b/drivers/scsi/hosts.c @@ -316,6 +316,12 @@ static void scsi_host_dev_release(struct device *dev) kfree(shost); } +static unsigned int shost_eh_deadline; + +module_param_named(eh_deadline, shost_eh_deadline, uint, S_IRUGO|S_IWUSR); +MODULE_PARM_DESC(eh_deadline, + SCSI EH deadline in seconds (should be between 1 and 2^32-1)); + static struct device_type scsi_host_type = { .name = scsi_host, .release = scsi_host_dev_release, @@ -388,6 +394,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize) shost-unchecked_isa_dma = sht-unchecked_isa_dma; shost-use_clustering = sht-use_clustering; shost-ordered_tag = sht-ordered_tag; + shost-eh_deadline = shost_eh_deadline; This should be shost-eh_deadline = shost_eh_deadline * HZ; since the parameter is specified in seconds. if (sht-supported_mode == MODE_UNKNOWN) /* means we didn't set it ... default to INITIATOR */ diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 467cb3c..cf30475 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -91,6 +91,31 @@ void scsi_schedule_eh(struct Scsi_Host *shost) } EXPORT_SYMBOL_GPL(scsi_schedule_eh); +static int sdev_eh_deadline(struct Scsi_Host *shost, +unsigned long eh_start) +{ + if (!shost-eh_deadline) + return 0; + + if (shost-last_reset != 0 + time_before(shost-last_reset, eh_start)) + eh_start = shost-last_reset; + + if (time_before(jiffies, + eh_start + shost-eh_deadline)) + return 0; + + return 1; +} + +static int scsi_host_eh_deadline(struct Scsi_Host *shost) +{ + if (!shost-last_reset) + return 0; + + return sdev_eh_deadline(shost, shost-last_reset); +} + /** * scsi_eh_abort_handler - Handle command aborts * @work:sdev on which commands should be aborted. @@ -102,13 +127,15 @@ scsi_eh_abort_handler(struct work_struct *work) container_of(work, struct scsi_device, abort_work); struct scsi_cmnd *scmd, *tmp; LIST_HEAD(abort_list); - unsigned long flags; + unsigned long flags, eh_start; int rtn; spin_lock_irqsave(sdev-list_lock, flags); list_splice_init(sdev-eh_abort_list, abort_list); spin_unlock_irqrestore(sdev-list_lock, flags); + eh_start = jiffies; + list_for_each_entry_safe(scmd, tmp, abort_list, eh_entry) { list_del_init(scmd-eh_entry); if (sdev-sdev_state == SDEV_CANCEL) { @@ -119,6 +146,13 @@ scsi_eh_abort_handler(struct work_struct *work) scsi_finish_command(scmd); continue; } + if (sdev_eh_deadline(sdev-host, eh_start)) { + SCSI_LOG_ERROR_RECOVERY(3, + scmd_printk(KERN_INFO, scmd, + eh timeout, not aborting\n)); + list_move_tail(scmd-eh_entry, abort_list); + goto start_eh; + } SCSI_LOG_ERROR_RECOVERY(3, scmd_printk(KERN_INFO, scmd, aborting command %p\n, scmd)); @@ -151,6 +185,12 @@ scsi_eh_abort_handler(struct work_struct *work) return; start_eh: + spin_lock_irqsave(sdev-host-host_lock, flags); + if (sdev-host-eh_deadline + (!sdev-host-last_reset || + time_before(eh_start, sdev-host-last_reset))) + sdev-host-last_reset = eh_start; + spin_unlock_irqrestore(sdev-host-host_lock, flags); list_for_each_entry_safe(scmd, tmp, abort_list, eh_entry) { scmd-result |= DID_TIME_OUT 16; if (!scsi_eh_scmd_add(scmd, 0)) { @@ -232,6 +272,9 @@ int scsi_eh_scmd_add(struct scsi_cmnd
[PATCH 7/7] scsi: Add 'eh_deadline' to limit SCSI EH runtime
This patchs adds an 'eh_deadline' attribute to the scsi host which limits the overall runtime of the SCSI EH. When a command is failed the start time of the EH is stored in 'last_reset'. If the overall runtime of the SCSI EH is longer than last_reset + eh_deadline, the EH is short-circuited and falls through to issue a host reset only. Signed-off-by: Hannes Reinecke h...@suse.de --- drivers/scsi/hosts.c | 7 +++ drivers/scsi/scsi_error.c | 142 +++--- drivers/scsi/scsi_sysfs.c | 37 include/scsi/scsi_host.h | 2 +- 4 files changed, 180 insertions(+), 8 deletions(-) diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c index df0c3c7..c8d828f 100644 --- a/drivers/scsi/hosts.c +++ b/drivers/scsi/hosts.c @@ -316,6 +316,12 @@ static void scsi_host_dev_release(struct device *dev) kfree(shost); } +static unsigned int shost_eh_deadline; + +module_param_named(eh_deadline, shost_eh_deadline, uint, S_IRUGO|S_IWUSR); +MODULE_PARM_DESC(eh_deadline, +SCSI EH deadline in seconds (should be between 1 and 2^32-1)); + static struct device_type scsi_host_type = { .name = scsi_host, .release = scsi_host_dev_release, @@ -388,6 +394,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize) shost-unchecked_isa_dma = sht-unchecked_isa_dma; shost-use_clustering = sht-use_clustering; shost-ordered_tag = sht-ordered_tag; + shost-eh_deadline = shost_eh_deadline; if (sht-supported_mode == MODE_UNKNOWN) /* means we didn't set it ... default to INITIATOR */ diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 467cb3c..cf30475 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -91,6 +91,31 @@ void scsi_schedule_eh(struct Scsi_Host *shost) } EXPORT_SYMBOL_GPL(scsi_schedule_eh); +static int sdev_eh_deadline(struct Scsi_Host *shost, + unsigned long eh_start) +{ + if (!shost-eh_deadline) + return 0; + + if (shost-last_reset != 0 + time_before(shost-last_reset, eh_start)) + eh_start = shost-last_reset; + + if (time_before(jiffies, + eh_start + shost-eh_deadline)) + return 0; + + return 1; +} + +static int scsi_host_eh_deadline(struct Scsi_Host *shost) +{ + if (!shost-last_reset) + return 0; + + return sdev_eh_deadline(shost, shost-last_reset); +} + /** * scsi_eh_abort_handler - Handle command aborts * @work: sdev on which commands should be aborted. @@ -102,13 +127,15 @@ scsi_eh_abort_handler(struct work_struct *work) container_of(work, struct scsi_device, abort_work); struct scsi_cmnd *scmd, *tmp; LIST_HEAD(abort_list); - unsigned long flags; + unsigned long flags, eh_start; int rtn; spin_lock_irqsave(sdev-list_lock, flags); list_splice_init(sdev-eh_abort_list, abort_list); spin_unlock_irqrestore(sdev-list_lock, flags); + eh_start = jiffies; + list_for_each_entry_safe(scmd, tmp, abort_list, eh_entry) { list_del_init(scmd-eh_entry); if (sdev-sdev_state == SDEV_CANCEL) { @@ -119,6 +146,13 @@ scsi_eh_abort_handler(struct work_struct *work) scsi_finish_command(scmd); continue; } + if (sdev_eh_deadline(sdev-host, eh_start)) { + SCSI_LOG_ERROR_RECOVERY(3, + scmd_printk(KERN_INFO, scmd, +eh timeout, not aborting\n)); + list_move_tail(scmd-eh_entry, abort_list); + goto start_eh; + } SCSI_LOG_ERROR_RECOVERY(3, scmd_printk(KERN_INFO, scmd, aborting command %p\n, scmd)); @@ -151,6 +185,12 @@ scsi_eh_abort_handler(struct work_struct *work) return; start_eh: + spin_lock_irqsave(sdev-host-host_lock, flags); + if (sdev-host-eh_deadline + (!sdev-host-last_reset || +time_before(eh_start, sdev-host-last_reset))) + sdev-host-last_reset = eh_start; + spin_unlock_irqrestore(sdev-host-host_lock, flags); list_for_each_entry_safe(scmd, tmp, abort_list, eh_entry) { scmd-result |= DID_TIME_OUT 16; if (!scsi_eh_scmd_add(scmd, 0)) { @@ -232,6 +272,9 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag) if (scsi_host_set_state(shost, SHOST_CANCEL_RECOVERY)) goto out_unlock; + if (sdev-eh_deadline !shost-last_reset) + shost-last_reset = jiffies; + ret = 1; scmd-eh_eflags |= eh_flag; list_add_tail(scmd-eh_entry, shost-eh_cmd_q);