Re: Reset eh timer if cmd is really making progress
On 1 Feb 2010 at 11:52, Mike Christie wrote: [...] I was not sure though if I should check if any cmds to the target made progress or if any cmds to the same disk. It could be that just one disk went bad, so we might want to check per disk. However, this could be the first IO to the disk and it just got stuck behind a bunch of other IO to other disks, so in that case we want to check per target. I'm no specialist on SCSI, but I think there is a command abort as well as a target abort command. So when there are multiple outstanding commands, you should not abort the target as long as a single command for that target is making any progress (and you should re-arm the target abort timer (if such a thing exists)), but if none of the commands for a specific target is making any progress, you should either try a target abort or abort each command. As said above, my view on SCSI may be wrong... Regards, Ulrich -- You received this message because you are subscribed to the Google Groups open-iscsi group. To post to this group, send email to open-is...@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
Re: Reset eh timer if cmd is really making progress
On 02/02/2010 03:39 AM, Ulrich Windl wrote: On 1 Feb 2010 at 11:52, Mike Christie wrote: [...] I was not sure though if I should check if any cmds to the target made progress or if any cmds to the same disk. It could be that just one disk went bad, so we might want to check per disk. However, this could be the first IO to the disk and it just got stuck behind a bunch of other IO to other disks, so in that case we want to check per target. I'm no specialist on SCSI, but I think there is a command abort as well as a target abort command. So when there are multiple There is a abort_task which aborts one task. Then there is a warm_target_reset and a cold_target_reset. They both basically abort all the cmds on the target. For the target resets there are also other side effects though. outstanding commands, you should not abort the target as long as a single command for that target is making any progress (and you should re-arm the target abort timer (if such a thing exists)), but if none of the commands for a specific target is making any progress, you should either try a target abort or abort each command. If a command times out we first try to abort each command with a abort_task. If that fails we try a lun_reset to each device that had a timed out command. If that fails we try a target_reset. -- You received this message because you are subscribed to the Google Groups open-iscsi group. To post to this group, send email to open-is...@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
Reset eh timer if cmd is really making progress
When iscsi_eh_cmd_timed_out gets called, we can ask scsi-ml to give us more time if the cmd is making progress (i.e. if there was some data transfer since the last timeout). The problem is that task-last_xfer task-last_timeout are set to the value of 'jiffies' when allocating the task. If the target machine is already dead when we send the cmd, no progress will be made. Still, when iscsi_eh_cmd_timed_out will be called, it will think that data was sent since the last timeout and reset the timer (and waste time). In order to solve that, iscsi_eh_cmd_timed_out should also check if there was any data transfer after the task was allocated. How about the following patch? diff --git a/kernel/libiscsi.c b/kernel/libiscsi.c index 90f3018..b727c10 100644 --- a/kernel/libiscsi.c +++ b/kernel/libiscsi.c @@ -1419,6 +1419,7 @@ static inline struct iscsi_task *iscsi_alloc_task(struct iscsi_conn *conn, task-conn = conn; task-sc = sc; task-have_checked_conn = 0; + task-init_time = jiffies; task-last_timeout = jiffies; task-last_xfer = jiffies; INIT_LIST_HEAD(task-running); @@ -1817,7 +1818,8 @@ static enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc) * we can check if it is the task or connection when we send the * nop as a ping. */ - if (time_after_eq(task-last_xfer, task-last_timeout)) { + if (time_after_eq(task-last_xfer, task-last_timeout) + time_after(task-last_xfer, task-init_time)) { ISCSI_DBG_EH(session, Command making progress. Asking scsi-ml for more time to complete. Last data recv at %lu. Last timeout was at diff --git a/kernel/libiscsi.h b/kernel/libiscsi.h index 1990243..4712ea9 100644 --- a/kernel/libiscsi.h +++ b/kernel/libiscsi.h @@ -126,6 +126,7 @@ struct iscsi_task { struct iscsi_conn *conn; /* used connection*/ /* data processing tracking */ + unsigned long init_time; unsigned long last_xfer; unsigned long last_timeout; int have_checked_conn; -- You received this message because you are subscribed to the Google Groups open-iscsi group. To post to this group, send email to open-is...@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
Re: Reset eh timer if cmd is really making progress
On 02/01/2010 05:14 AM, Erez Zilber wrote: When iscsi_eh_cmd_timed_out gets called, we can ask scsi-ml to give us more time if the cmd is making progress (i.e. if there was some data transfer since the last timeout). The problem is that task-last_xfer task-last_timeout are set to the value of 'jiffies' when allocating the task. If the target machine is already dead when we send the cmd, no progress will be made. Still, when iscsi_eh_cmd_timed_out will be called, it will think that data was sent since the last timeout and reset the timer (and waste time). In order to solve that, iscsi_eh_cmd_timed_out should also check if there was any data transfer after the task was allocated. I agree it is a problem with the code. The problem is that the check also handled the case where we are so backed up that we cannot even send a cmd/data within the cmd timeout. For that case, the check was giving it a extra cmd timeout seconds to get it off. That code is not really good though. It should probably just loop over all the cmds there and see if any cmds have made progress. If so give the cmd more time, if not then fail. I was not sure though if I should check if any cmds to the target made progress or if any cmds to the same disk. It could be that just one disk went bad, so we might want to check per disk. However, this could be the first IO to the disk and it just got stuck behind a bunch of other IO to other disks, so in that case we want to check per target. For your problem, I guess you can just change this: if (time_after_eq(task-last_xfer, task-last_timeout)) { to if (time_after(task-last_xfer, task-last_timeout)) { It is probably unlikely that we hit the case where the timeout and last_xfer happen on the same jiffie, so there is not much point in adding extra code to handle it. -- You received this message because you are subscribed to the Google Groups open-iscsi group. To post to this group, send email to open-is...@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
Re: Reset eh timer if cmd is really making progress
On 02/01/2010 11:52 AM, Mike Christie wrote: On 02/01/2010 05:14 AM, Erez Zilber wrote: When iscsi_eh_cmd_timed_out gets called, we can ask scsi-ml to give us more time if the cmd is making progress (i.e. if there was some data transfer since the last timeout). The problem is that task-last_xfer task-last_timeout are set to the value of 'jiffies' when allocating the task. If the target machine is already dead when we send the cmd, no progress will be made. Still, when iscsi_eh_cmd_timed_out will be called, it will think that data was sent since the last timeout and reset the timer (and waste time). In order to solve that, iscsi_eh_cmd_timed_out should also check if there was any data transfer after the task was allocated. I agree it is a problem with the code. The problem is that the check also handled the case where we are so backed up that we cannot even send a cmd/data within the cmd timeout. For that case, the check was giving it a extra cmd timeout seconds to get it off. That code is not really good though. It should probably just loop over all the cmds there and see if any cmds have made progress. If so give the cmd more time, if not then fail. I was not sure though if I should check if any cmds to the target made progress or if any cmds to the same disk. It could be that just one disk went bad, so we might want to check per disk. However, this could be the first IO to the disk and it just got stuck behind a bunch of other IO to other disks, so in that case we want to check per target. Give me until tomorrow. I think I can cook up patch. Before when deciding when to check for dev vs target, I was mixed up with some reordering stuff, but I think I have a patch that should work for both of us. -- You received this message because you are subscribed to the Google Groups open-iscsi group. To post to this group, send email to open-is...@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
Re: Reset eh timer if cmd is really making progress
On 02/01/2010 01:21 PM, Mike Christie wrote: On 02/01/2010 11:52 AM, Mike Christie wrote: On 02/01/2010 05:14 AM, Erez Zilber wrote: When iscsi_eh_cmd_timed_out gets called, we can ask scsi-ml to give us more time if the cmd is making progress (i.e. if there was some data transfer since the last timeout). The problem is that task-last_xfer task-last_timeout are set to the value of 'jiffies' when allocating the task. If the target machine is already dead when we send the cmd, no progress will be made. Still, when iscsi_eh_cmd_timed_out will be called, it will think that data was sent since the last timeout and reset the timer (and waste time). In order to solve that, iscsi_eh_cmd_timed_out should also check if there was any data transfer after the task was allocated. I agree it is a problem with the code. The problem is that the check also handled the case where we are so backed up that we cannot even send a cmd/data within the cmd timeout. For that case, the check was giving it a extra cmd timeout seconds to get it off. That code is not really good though. It should probably just loop over all the cmds there and see if any cmds have made progress. If so give the cmd more time, if not then fail. I was not sure though if I should check if any cmds to the target made progress or if any cmds to the same disk. It could be that just one disk went bad, so we might want to check per disk. However, this could be the first IO to the disk and it just got stuck behind a bunch of other IO to other disks, so in that case we want to check per target. Give me until tomorrow. I think I can cook up patch. Before when deciding when to check for dev vs target, I was mixed up with some reordering stuff, but I think I have a patch that should work for both of us. I think the attached patch should do what we both want. Instead of always waiting an extra cmd timeout seconds, we will only get extra time if: 1. The command has made progress. (changed test to time_after) 2. Commands queued before the timed out one have made transfers since we started the task or since it last timedout. Patch was made over scsi-misc. -- You received this message because you are subscribed to the Google Groups open-iscsi group. To post to this group, send email to open-is...@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en. From 5eb20f1e53e3b2ed770fe9b410c26e45cdb2d55e Mon Sep 17 00:00:00 2001 From: Mike Christie micha...@cs.wisc.edu Date: Mon, 1 Feb 2010 22:37:07 -0600 Subject: [PATCH 4/4] libiscsi: reset cmd timer if cmds are making progress This patch resets the cmd timer if cmds started before the timedout command are making progress. The idea is that the cmd probably timed out because we are trying to exeucte too many commands. If it turns out that the device the IO timedout on was bad or the cmd just got screwed up but other IO/devs were ok then we will will figure this out when the cmds ahead of the timed out one complete ok. This also fixes a bug where we were sort of detecting this by setting the last_timeout and last_xfer to the same value when the task was allocated. That caught the case where we never got to send any IO for it. However, if the problem had started right before we started the new task, then we were forced to wait an extra cmd timeout seconds to start the scsi eh. Signed-off-by: Mike Christie micha...@cs.wisc.edu --- drivers/scsi/libiscsi.c | 53 +++--- 1 files changed, 49 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c index c28a712..e556f52 100644 --- a/drivers/scsi/libiscsi.c +++ b/drivers/scsi/libiscsi.c @@ -1919,10 +1919,11 @@ static int iscsi_has_ping_timed_out(struct iscsi_conn *conn) static enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc) { enum blk_eh_timer_return rc = BLK_EH_NOT_HANDLED; - struct iscsi_task *task = NULL; + struct iscsi_task *task = NULL, *running_task; struct iscsi_cls_session *cls_session; struct iscsi_session *session; struct iscsi_conn *conn; + int i; cls_session = starget_to_session(scsi_target(sc-device)); session = cls_session-dd_data; @@ -1947,8 +1948,15 @@ static enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc) } task = (struct iscsi_task *)sc-SCp.ptr; - if (!task) + if (!task) { + /* + * Raced with completion. Just reset timer, and let it + * complete normally + */ + rc = BLK_EH_RESET_TIMER; goto done; + } + /* * If we have sent (at least queued to the network layer) a pdu or * recvd one for the task since the last timeout ask for @@ -1956,10 +1964,10 @@ static enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc) * we can check if it is the task or connection when we send the * nop as a ping. */ - if
Re: Reset eh timer if cmd is really making progress
On Tue, Feb 2, 2010 at 6:47 AM, Mike Christie micha...@cs.wisc.edu wrote: On 02/01/2010 01:21 PM, Mike Christie wrote: On 02/01/2010 11:52 AM, Mike Christie wrote: On 02/01/2010 05:14 AM, Erez Zilber wrote: When iscsi_eh_cmd_timed_out gets called, we can ask scsi-ml to give us more time if the cmd is making progress (i.e. if there was some data transfer since the last timeout). The problem is that task-last_xfer task-last_timeout are set to the value of 'jiffies' when allocating the task. If the target machine is already dead when we send the cmd, no progress will be made. Still, when iscsi_eh_cmd_timed_out will be called, it will think that data was sent since the last timeout and reset the timer (and waste time). In order to solve that, iscsi_eh_cmd_timed_out should also check if there was any data transfer after the task was allocated. I agree it is a problem with the code. The problem is that the check also handled the case where we are so backed up that we cannot even send a cmd/data within the cmd timeout. For that case, the check was giving it a extra cmd timeout seconds to get it off. That code is not really good though. It should probably just loop over all the cmds there and see if any cmds have made progress. If so give the cmd more time, if not then fail. I was not sure though if I should check if any cmds to the target made progress or if any cmds to the same disk. It could be that just one disk went bad, so we might want to check per disk. However, this could be the first IO to the disk and it just got stuck behind a bunch of other IO to other disks, so in that case we want to check per target. Give me until tomorrow. I think I can cook up patch. Before when deciding when to check for dev vs target, I was mixed up with some reordering stuff, but I think I have a patch that should work for both of us. I think the attached patch should do what we both want. Instead of always waiting an extra cmd timeout seconds, we will only get extra time if: 1. The command has made progress. (changed test to time_after) 2. Commands queued before the timed out one have made transfers since we started the task or since it last timedout. Patch was made over scsi-misc. Looks okay to me. Erez -- You received this message because you are subscribed to the Google Groups open-iscsi group. To post to this group, send email to open-is...@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.