Re: SCSI error handling -- one error blocks the whole SCSI host

2013-05-28 Thread Jeremy Linton
On 5/27/2013 8:32 PM, Baruch Even wrote:

 necessary but the command itself if it is already actively handled
 continues in its path. The abort only cancels those commands that are in
 the queue and if there really was a problem and the disk is engaging in
 error recovery of its own you'll just have no response from it and it will
 seem dead (abort may timeout).

Yes, the abort seems to be handled more like a hint in many cases. Having
coded a couple of targets, abort handling is often _REALLY_ hard to get 100%
right, especially when it's an actual error that is causing the delay rather
than a correctly functioning long-running command. That said, I've seen devices
respond to aborts on tape ERASE and similar commands by actually aborting the
command as one would expect. So it does sometimes work.

Besides abort timeouts (which are major bad karma), the abort may be accepted,
and the next non-INQUIRY/TUR type command that gets queued simply blocks
waiting for the abort to complete internally. From the target device's
perspective, if you don't send a response for ABTS within 2*RA_TOV then your
problems start to multiply. So it encourages target devices to treat
aborts in an async manner. As you said, the device simply finds the indicated
command on a queue, marks it as being aborted, and hopes whatever is processing
the command notices and terminates its operation. On subsequent commands the
nicer devices will notice the abort hasn't completed and return "becoming
ready" or similar in response to TUR/etc. for some number of minutes.
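
To make the async handling above concrete, here is a rough Python toy of a
target, not taken from any real firmware. The class, method names and status
strings are all made up for illustration; the only behaviour it tries to
capture is "answer the abort at once, mark the command, let the worker notice
later, and report becoming ready to TUR in the meantime".

```python
class Target:
    """Toy target that accepts aborts immediately and cleans up later."""

    def __init__(self):
        # tag -> "queued" | "active" | "aborting" (hypothetical states)
        self.state = {}

    def start(self, tag, active=True):
        """Record a command as either actively processed or still queued."""
        self.state[tag] = "active" if active else "queued"

    def abort(self, tag):
        """Answer the abort right away; the real cleanup happens later."""
        if self.state.get(tag) == "queued":
            del self.state[tag]            # never started, drop it now
        elif self.state.get(tag) == "active":
            self.state[tag] = "aborting"   # mark only; the worker must notice
        return "ACCEPTED"

    def worker_noticed(self, tag):
        """The code processing the command saw the mark and stopped."""
        if self.state.get(tag) == "aborting":
            del self.state[tag]

    def abort_pending(self):
        """True while any marked command has not yet been torn down."""
        return "aborting" in self.state.values()

    def test_unit_ready(self):
        """A 'nicer' device reports that it is not ready yet."""
        return "BECOMING READY" if self.abort_pending() else "GOOD"
```

The point of the split between abort() and worker_noticed() is that the
initiator sees a quick answer to the abort even though the command may keep
running inside the device for a long time.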




 
 This view of aborts also means that reducing timeouts for commands and TMFs
 is mostly useless and sometimes even a really bad idea. I prefer to just
 let the device go on with its error recovery and just forget about the 
 command. I want to forget about the DMA so I issue an abort but anything 
 higher than that means a link is dead to me.

Well, invariably the manufacturers have timeouts that are really long and
based on internal error recovery logic. See
http://www-01.ibm.com/support/docview.wss?uid=ssg1S7003556aid=1 page 468.
Notice the timeouts are specified in minutes, not seconds. Furthermore, the
commands that normally complete in fractions of a second have actual timeouts
that can be tens of minutes (READ/WRITE for example). So, doing anything
before that timeout has expired is a good way to knock the device offline.
Some of the newer disks have mode page options to shorten their read/write
error recovery, but short error recovery can still be many tens of seconds
rather than a couple minutes. Plus, it doesn't help compound commands like
SYNCHRONIZE CACHE which may take multiple errors during operation.

This is another part of what formed my opinions about error isolation. If one
of your devices goes out to lunch and isn't recovering via abort/LUN reset,
it's done! Wrecking the rest of the SAN by doing bus resets and HBA resets is a
good way to take a serious problem and turn it into a full-blown catastrophe.




--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SCSI error handling -- one error blocks the whole SCSI host

2013-05-28 Thread Baruch Even
On Tue, May 28, 2013 at 5:38 PM, Jeremy Linton jlin...@tributary.com wrote:
 This is another part of what formed my opinions about error isolation. If one
 of your devices goes out to lunch and isn't recovering via abort/lun reset.
 Its done! Wrecking the rest of the SAN doing bus resets and HBA resets is a
 good way to take a serious problem and turn it into a full blown catastrophe.

This is the gist of the issue: once you get to an abort you are screwed already.
You need the abort, but anything else should be reserved for when things are
really dead (the HBA might still recover on a host reset, but only do it if the
host is really unresponsive).

That's why I prefer to have a long timeout for the command and a long timeout
for the abort. The application above should handle itself with its own timeout
once the abort has been sent (the buffer remains locked until the abort
returns). The device itself is likely stuck in error recovery, and it will come
out of it when its own internal timeouts are exhausted, which can be infinite
and will generally be very large.

Baruch


Re: SCSI error handling -- one error blocks the whole SCSI host

2013-05-27 Thread Hannes Reinecke
On 05/27/2013 12:44 AM, James Bottomley wrote:
 
 On Thu, 2013-05-23 at 11:14 -0700, Roland Dreier wrote:
 At LSF this year, we had a discussion about error handling and in
 particular the problem that SCSI midlayer error handling waits for the
 entire SCSI host (HBA) to quiesce before it starts to abort commands
 etc.

 James made the suggestion that FC should handle things the way SAS
 does, because SAS has a strategy handler that does things the right
 way.  However, now that I finally sit down and look at the code, I
 don't see how this is the case.  It seems inherent in the way that
 scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in
 particular the strategy handler can't even be called until host_failed
 == host_busy; we don't bump host_failed without SHOST_RECOVERY set,
 which stops queueing commands to any devices attached to the whole
 HBA).

 James, am I understanding your suggestion properly?  If so can you
 explain what you meant about the libsas code -- I see that it has its
 own strategy handler but as I said before we've already stopped every
 device attached to the HBA before we ever get there.
 
 It is, but I checked: Apparently it's not implemented in the sas
 transport class.  The original discussion when libsas was constructed,
 as I remember it, was about using the scsi timeout handler to implement
 a running abort.  The idea is fairly simple: you use the first fire of
 eh_timed_out to trigger the abort (or LUN reset) while simultaneously
 returning BLK_EH_RESET_TIMER.  If the timer fires again and the abort
 hasn't returned, you escalate, otherwise you resend the command when the
 abort returns.  This allows you to handle single command failures (up to
 LUN reset) without stopping the host.  Obviously, if you have to
 escalate to device reset, then you need to start the eh thread.
 
There are some problems with that:

- Returning BLK_EH_RESET_TIMER will restart the timer with the
  _default_ blk timeout, whereas the _abort_ timeout might be
  (and, for some LLDDs, definitely is) different from
  that.
- Leaving the command running while abort is active will
  inevitably risk a double completion on the original command;
  the command abort might terminate the command at the
  same time as the (real) completion comes in.
  'Normal' command timeouts are protected against this via
  REQ_ATOM_COMPLETE; commands aborted via scsi_finish_cmnd()
  are not.
- LLDDs typically won't return a command status even for a
  command which has been aborted via ABORT TASK TMF.
  So the midlayer probably will never get notified if
  the command got aborted via ABORT TASK.
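
To illustrate the double-completion point in the second item, here is a small
userspace Python sketch of a one-shot completion flag, in the spirit of what
REQ_ATOM_COMPLETE does for normal timeouts. It is not kernel code; the
Request class and its method names are hypothetical, and a lock stands in for
the atomic test-and-set.

```python
import threading


class Request:
    """Toy request guarded by a one-shot completion flag."""

    def __init__(self):
        self._lock = threading.Lock()
        self._complete = False
        self.finished_by = []   # records which path actually finished it

    def mark_complete(self):
        """Test-and-set under a lock: True only for the first caller."""
        with self._lock:
            if self._complete:
                return False
            self._complete = True
            return True

    def finish(self, who):
        """Finish the request at most once, whichever path arrives first."""
        if self.mark_complete():
            self.finished_by.append(who)
            return True
        return False


req = Request()
print(req.finish("abort"))        # True: the abort path wins
print(req.finish("completion"))   # False: the late completion is ignored
```

Without such a guard on the abort path, both the abort and the real completion
could call finish() and the command would be completed twice.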

Especially the last point made me abandon this idea for my EH
rewrite. We would be having a real benefit if we somehow could get
the command status _from the target_ for an aborted command.
But as it appears we won't.
So as any status is made up anyway I'd very much prefer to have it
set by the midlayer. Which renders the whole operation quite
pointless and we're better off using the existing syntax for command
aborts.
Plus it makes life _so much_ easier for the implementation ...

But to answer Roland: Have you checked my patchset?
It should help for command timeouts ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


Re: SCSI error handling -- one error blocks the whole SCSI host

2013-05-27 Thread James Bottomley
On Mon, 2013-05-27 at 16:39 +0200, Hannes Reinecke wrote:
 On 05/27/2013 12:44 AM, James Bottomley wrote:
  
  On Thu, 2013-05-23 at 11:14 -0700, Roland Dreier wrote:
  At LSF this year, we had a discussion about error handling and in
  particular the problem that SCSI midlayer error handling waits for the
  entire SCSI host (HBA) to quiesce before it starts to abort commands
  etc.
 
  James made the suggestion that FC should handle things the way SAS
  does, because SAS has a strategy handler that does things the right
  way.  However, now that I finally sit down and look at the code, I
  don't see how this is the case.  It seems inherent in the way that
  scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in
  particular the strategy handler can't even be called until host_failed
  == host_busy; we don't bump host_failed without SHOST_RECOVERY set,
  which stops queueing commands to any devices attached to the whole
  HBA).
 
  James, am I understanding your suggestion properly?  If so can you
  explain what you meant about the libsas code -- I see that it has its
  own strategy handler but as I said before we've already stopped every
  device attached to the HBA before we ever get there.
  
  It is, but I checked: Apparently it's not implemented in the sas
  transport class.  The original discussion when libsas was constructed,
  as I remember it, was about using the scsi timeout handler to implement
  a running abort.  The idea is fairly simple: you use the first fire of
  eh_timed_out to trigger the abort (or LUN reset) while simultaneously
  returning BLK_EH_RESET_TIMER.  If the timer fires again and the abort
  hasn't returned, you escalate, otherwise you resend the command when the
  abort returns.  This allows you to handle single command failures (up to
  LUN reset) without stopping the host.  Obviously, if you have to
  escalate to device reset, then you need to start the eh thread.
  
 There are some problems with that:
 
 - Returning BLK_EH_RESET_TIMER will restart the timer with the
   _default_ blk timeout. Whereas the _abort_ timeout might
   (and, for some LLDDs, it definitely is) different from
   that.

Right ... you don't reuse the command, you have to start a new one.
libsas actually has a task abstraction, which is what you use to send
TMFs.

 - Leaving the command running while abort is active will
   inevitably risk a double completion on the original command;
   the command abort might terminate the command at the
   same time as the (real) completion comes in.
   'Normal' command timeouts are protected against this via
   REQ_ATOM_COMPLETE; commands aborted via scsi_finish_cmnd()
   are not.

That's not a bug, it's a requirement.  The way you handle commands in a
running abort or LUN reset is only in the status return code from the
command, so you have to tie the success of the eh action to the base
command and return DID_ABORT (or DID_RESET) in the actual command ...
this is how retries get done without troubling the error handler.
Essentially, this requires a low level tie with the HBA machine
description of the command, which is what avoids double completion.

 - LLDDs typically won't return a command status even for a
   command which has been aborted via ABORT TASK TMF.
   So the midlayer probably will never get notified if
   the command got aborted via ABORT TASK.

Well, that's true, but irrelevant.  If the HBA can't inform you of the
status of the abort, then abort is useless as a first step in the
traditional eh as well as in this method, so you just don't do that and
proceed to resets.

There's actually a school of thought that says even if the HBA *can*
give you all the status you need, aborts are still pointless because
it's sending in yet another state transition to an already failed state
machine (because the device is timing out).  Therefore, since the chance
of recovering the state machine with an abort is so tiny, you should
start with the lowest reset anyway because that takes the state machine
to a known state.

James

 Especially the last point made me abandon this idea for my EH
 rewrite. We would be having a real benefit if we somehow could get
 the command status _from the target_ for an aborted command.
 But as it appears we won't.
 So as any status is made up anyway I'd very much prefer to have it
 set by the midlayer. Which renders the whole operation quite
 pointless and we're better off using the existing syntax for command
 aborts.
 Plus it makes life _so much_ easier for the implementation ...
 
 But to answer Roland: Have you checked my patchset?
 It should help for command timeouts ...
 
 Cheers,
 
 Hannes





Re: SCSI error handling -- one error blocks the whole SCSI host

2013-05-27 Thread Baruch Even
On Mon, May 27, 2013 at 11:41 PM, James Bottomley
james.bottom...@hansenpartnership.com wrote:
 On Mon, 2013-05-27 at 16:39 +0200, Hannes Reinecke wrote:

 - LLDDs typically won't return a command status even for a
   command which has been aborted via ABORT TASK TMF.
   So the midlayer probably will never get notified if
   the command got aborted via ABORT TASK.

 Well, that's true, but irrelevant.  If the HBA can't inform you of the
 status of the abort, then abort is useless as a first step in the
 traditional eh as well as in this method, so you just don't do that and
 proceed to resets.

 There's actually a school of thought that says even if the HBA *can*
 give you all the status you need, aborts are still pointless because
 it's sending in yet another state transition to an already failed state
 machine (because the device is timing out).  Therefore, since the chance
 of recovering the state machine with an abort is so tiny, you should
 start with the lowest reset anyway because that takes the state machine
 to a known state.

Most devices I know do not really abort the command in any normal sense
anyhow, not even when doing a reset. The disks (HDD & SSD) and also SAN
systems normally just treat an abort or a reset as a signal that no real reply
is necessary, but the command itself, if it is already actively handled,
continues in its path. The abort only cancels those commands that are in the
queue, and if there really was a problem and the disk is engaging in error
recovery of its own, you'll just have no response from it and it will seem dead
(the abort may time out).

The one thing aborts/resets help with is to clear your HBA of any pending
commands so that your DMA buffers will no longer be affected and you can forget
the command and do your application-level recovery (RAID or lose data and
panic).

It is also an important part of handling bad links but at least in SAS that is
done internally in the HBA anyway.

This view of aborts also means that reducing timeouts for commands and
TMFs is mostly useless and sometimes even a really bad idea. I prefer
to just let the device go on with its error recovery and just forget about the
command. I want to forget about the DMA so I issue an abort but anything
higher than that means a link is dead to me.

Baruch


Re: SCSI error handling -- one error blocks the whole SCSI host

2013-05-26 Thread James Bottomley

On Thu, 2013-05-23 at 11:14 -0700, Roland Dreier wrote:
 At LSF this year, we had a discussion about error handling and in
 particular the problem that SCSI midlayer error handling waits for the
 entire SCSI host (HBA) to quiesce before it starts to abort commands
 etc.
 
 James made the suggestion that FC should handle things the way SAS
 does, because SAS has a strategy handler that does things the right
 way.  However, now that I finally sit down and look at the code, I
 don't see how this is the case.  It seems inherent in the way that
 scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in
 particular the strategy handler can't even be called until host_failed
 == host_busy; we don't bump host_failed without SHOST_RECOVERY set,
 which stops queueing commands to any devices attached to the whole
 HBA).
 
 James, am I understanding your suggestion properly?  If so can you
 explain what you meant about the libsas code -- I see that it has its
 own strategy handler but as I said before we've already stopped every
 device attached to the HBA before we ever get there.

It is, but I checked: Apparently it's not implemented in the sas
transport class.  The original discussion when libsas was constructed,
as I remember it, was about using the scsi timeout handler to implement
a running abort.  The idea is fairly simple: you use the first fire of
eh_timed_out to trigger the abort (or LUN reset) while simultaneously
returning BLK_EH_RESET_TIMER.  If the timer fires again and the abort
hasn't returned, you escalate, otherwise you resend the command when the
abort returns.  This allows you to handle single command failures (up to
LUN reset) without stopping the host.  Obviously, if you have to
escalate to device reset, then you need to start the eh thread.
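
A minimal sketch of the running-abort idea above, written as a userspace
Python toy rather than kernel code. RESET_TIMER stands in for
BLK_EH_RESET_TIMER; START_EH, the Command class and abort_returned() are
hypothetical names for illustration only.

```python
from enum import Enum, auto


class TimerAction(Enum):
    RESET_TIMER = auto()   # stands in for BLK_EH_RESET_TIMER
    START_EH = auto()      # hypothetical: escalate and wake the eh thread


class Command:
    """Toy command carrying the per-command state the scheme needs."""

    def __init__(self, name):
        self.name = name
        self.abort_pending = False
        self.resent = 0


def eh_timed_out(cmd):
    """Timeout callback: first fire starts the abort, second fire escalates."""
    if cmd.abort_pending:
        # The timer fired again and the abort has not returned yet.
        return TimerAction.START_EH
    # First fire: issue the abort (or LUN reset) and re-arm the timer.
    cmd.abort_pending = True
    return TimerAction.RESET_TIMER


def abort_returned(cmd):
    """Abort completed in time: resend the command, host never stopped."""
    cmd.abort_pending = False
    cmd.resent += 1
```

Only the START_EH path would stop the host; the single-command failure is
handled entirely from the timer callback and the abort completion.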

James




Re: SCSI error handling -- one error blocks the whole SCSI host

2013-05-25 Thread James Smart

Roland,

I agree, and am already working around that limitation.

-- james s


On 5/23/2013 2:14 PM, Roland Dreier wrote:

At LSF this year, we had a discussion about error handling and in
particular the problem that SCSI midlayer error handling waits for the
entire SCSI host (HBA) to quiesce before it starts to abort commands
etc.

James made the suggestion that FC should handle things the way SAS
does, because SAS has a strategy handler that does things the right
way.  However, now that I finally sit down and look at the code, I
don't see how this is the case.  It seems inherent in the way that
scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in
particular the strategy handler can't even be called until host_failed
== host_busy; we don't bump host_failed without SHOST_RECOVERY set,
which stops queueing commands to any devices attached to the whole
HBA).

James, am I understanding your suggestion properly?  If so can you
explain what you meant about the libsas code -- I see that it has its
own strategy handler but as I said before we've already stopped every
device attached to the HBA before we ever get there.

To recapitulate the problem here, we might have a whole fabric
attached to an HBA via SAS or FC, and be doing 500K IOPS happily to 50
devices.  Then a single LUN goes wonky and all the IO stops while we
try to recover that single device, which might take minutes.

I know this has been discussed before, but can we find a way forward
here?  Is there some way we can start with per-device error recovery
and avoid disrupting IO that we can see is working fine?

Thanks,
   Roland


SCSI error handling -- one error blocks the whole SCSI host

2013-05-23 Thread Roland Dreier
At LSF this year, we had a discussion about error handling and in
particular the problem that SCSI midlayer error handling waits for the
entire SCSI host (HBA) to quiesce before it starts to abort commands
etc.

James made the suggestion that FC should handle things the way SAS
does, because SAS has a strategy handler that does things the right
way.  However, now that I finally sit down and look at the code, I
don't see how this is the case.  It seems inherent in the way that
scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in
particular the strategy handler can't even be called until host_failed
== host_busy; we don't bump host_failed without SHOST_RECOVERY set,
which stops queueing commands to any devices attached to the whole
HBA).
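
To spell out the gating I mean, here is a rough Python model of the counters.
It is a sketch of my reading of the behaviour described above, not the
midlayer code itself: the Host class and its methods are made up, and
in_recovery stands in for the SHOST_RECOVERY state.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Toy model of one SCSI host (HBA)."""
    host_busy: int = 0         # commands currently outstanding on the HBA
    host_failed: int = 0       # commands handed over to error handling
    in_recovery: bool = False  # stands in for SHOST_RECOVERY

    def queue(self, lun: int) -> bool:
        """Try to issue a command; lun is ignored because gating is host-wide."""
        if self.in_recovery:
            return False
        self.host_busy += 1
        return True

    def complete(self) -> None:
        """A command finished normally."""
        self.host_busy -= 1

    def fail(self) -> None:
        """A command timed out: enter recovery and count it as failed."""
        self.in_recovery = True
        self.host_failed += 1

    def eh_can_run(self) -> bool:
        """The strategy handler runs only once every outstanding command
        has either completed or been counted as failed."""
        return self.host_failed > 0 and self.host_failed == self.host_busy


host = Host()
for lun in (0, 1, 2):
    host.queue(lun)
host.fail()                 # the command on LUN 0 times out
print(host.queue(1))        # False: healthy LUNs are blocked too
print(host.eh_can_run())    # False: two commands are still in flight
host.complete()
host.complete()
print(host.eh_can_run())    # True: host_failed == host_busy
```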

James, am I understanding your suggestion properly?  If so can you
explain what you meant about the libsas code -- I see that it has its
own strategy handler but as I said before we've already stopped every
device attached to the HBA before we ever get there.

To recapitulate the problem here, we might have a whole fabric
attached to an HBA via SAS or FC, and be doing 500K IOPS happily to 50
devices.  Then a single LUN goes wonky and all the IO stops while we
try to recover that single device, which might take minutes.
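
As back-of-the-envelope arithmetic for the numbers above, here is a tiny
Python helper. It assumes the load is spread evenly over the devices and uses
a purely hypothetical 120-second recovery time; neither assumption comes from
a measurement.

```python
def stalled_ios(total_iops, devices, stall_seconds, host_wide):
    """I/Os that cannot be issued during recovery, assuming even load."""
    per_device = total_iops / devices
    affected = devices if host_wide else 1
    return int(per_device * affected * stall_seconds)


# Hypothetical 120 s recovery of one LUN out of 50 at 500K IOPS total.
print(stalled_ios(500_000, 50, 120, host_wide=True))    # 60000000
print(stalled_ios(500_000, 50, 120, host_wide=False))   # 1200000
```

Under those assumptions, stopping the whole host costs 50 times more I/O than
stopping only the failing device.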

I know this has been discussed before, but can we find a way forward
here?  Is there some way we can start with per-device error recovery
and avoid disrupting IO that we can see is working fine?

Thanks,
  Roland


Re: SCSI error handling -- one error blocks the whole SCSI host

2013-05-23 Thread Jack Wang
 James, am I understanding your suggestion properly?  If so can you
 explain what you meant about the libsas code -- I see that it has its
 own strategy handler but as I said before we've already stopped every
 device attached to the HBA before we ever get there.
 
 To recapitulate the problem here, we might have a whole fabric
 attached to an HBA via SAS or FC, and be doing 500K IOPS happily to 50
 devices.  Then a single LUN goes wonky and all the IO stops while we
 try to recover that single device, which might take minutes.

I'm not James, but from my experience with pm8001 and libsas, your
understanding is right: when one error happens on one LUN, the SCSI core
does hold the whole SCSI host.

I think Hannes made a good proposal a few weeks ago. It looks reasonable,
but I don't know what its status is now.


Regards
Jack Wang