RE: SCSI: usage of DID_REQUEUE vs DID_RESET for returning SCSI commands to be retried

2016-12-14 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Hannes Reinecke
> Sent: Wednesday, December 14, 2016 9:07 PM
> To: Sumit Saxena; linux-scsi
> Subject: Re: SCSI: usage of DID_REQUEUE vs DID_RESET for returning SCSI
> commands to be retried
>
> On 12/13/2016 02:19 PM, Sumit Saxena wrote:
> > Hi all,
> >
> > I have query regarding usage of host_byte DID_REQUEUE vs DID_RESET
> > returned by LLD to SCSI mid layer.
> >
> > Let me give some background here.
> > I am using megaraid_sas controller. megaraid_sas driver returns all
> > outstanding SCSI commands back to SCSI layer with DID_RESET host_byte
> > before resetting controller.
> > The intent is- all these commands returned with DID_RESET before
> > controller reset should be retried by SCSI.
> >
> > In few distros, I have observed that if SYNCHRONIZE_CACHE
> > command(should be applicable for all non Read write commands) is
> > outstanding before resetting controller  and driver returns those
> > commands back with DID_RESET then SYNCHRONIZE_CACHE command not
> > retried but failed to upper layer but other READ/WRITE commands were
> > not failed but retried. I was running filesystem IOs and
> > SYNHRONIZE_CACHE returned with error cause filesystem to go in READ
> > only mode.
> >
> > Later I checked and realized below commit was missing in that distro
> > kernel and seems to cause the problem-
> >
> > a621bac scsi_lib: correctly retry failed zero length REQ_TYPE_FS
> > commands
> >
> > But distro kernel has below commit-
> >
> > 89fb4cd scsi: handle flush errors properly
> >
> > Issue does not hit on the kernels which don't have both commits but
> > hits when commit- "89fb4cd scsi: handle flush errors properly " is
> > there but commit-  "a621bac scsi_lib: correctly retry failed zero
> > length REQ_TYPE_FS commands" is missing.
> >
> > This issue is observed with mpt3sas driver as well and should be
> > applicable to all LLDs returning non read write commands with DID_RESET.
> >
> > Returning DID_REQUEUE instead of DID_RESET from driver solves the
> > problem irrespective of whether these above mentioned commits are
> > there or not in kernel.
> > So I am thinking to use DID_REQUEUE instead of DID_RESET in
> > megaraid_sas driver for all SCSI commands(not only limited to
> > SYNCHRONIZE_CACHE or non read write commands) before resetting
> > controller. As I mentioned intent is those outstanding commands
> > returned by driver before doing controller reset should be retried and
> > as soon as reset is complete, driver will be processing those
> > commands. There is no sense key associated with SCSI commands
> > returned.
> >
> > I browsed SCSI code and get to know DID_REQUEUE causes command to be
> > retried by calling- scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY).
> > This seems to be good enough for our requirement of commands to be
> > re-tried by SCSI layer.
> >
> > Please provide feedback if anyone forsee any issue with using
> > DID_REQUEUE for this use case.
> > I will be doing some testing with DID_REQUEUE in place of DID_RESET in
> > megaraid_sas driver.
> >
> > I have attached here a proposed patch for megaraid_sas driver.
> > If there are no concerns, I will send this patch to SCSI mailing list.
> >
> Hmm.
>
> DID_RESET is meant to be an indicator to the SCSI midlayer that the host /
> device was reset, and the command _should_ be retried.
> DID_REQUEUE OTOH is an indicator to the SCSI midlayer to retry the
> command.
>
> The problem with DID_RESET is that it's slightly underspecified; there is
> no
> indication if the command has been processed (and if so, up to which
> parts?)
> DID_REQUEUE, OTOH, will always cause a retry.
>
> So yeah, I guess DID_REQUEUE is a better choice here.

Thanks Hannes. We thought DID_REQUEUE functionality will be a better choice
and we are planning to move that with proposed patch.

Thanks for your feedback.  We will be sending the final patch to upstream.
Now, queuing this change for internal testing.


>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke  Teamlead Storage & Networking
> h...@suse.de +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284
> (AG
> Nürnberg)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of
> a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SCSI: usage of DID_REQUEUE vs DID_RESET for returning SCSI commands to be retried

2016-12-14 Thread Hannes Reinecke

On 12/13/2016 02:19 PM, Sumit Saxena wrote:

Hi all,

I have query regarding usage of host_byte DID_REQUEUE vs DID_RESET
returned by LLD to SCSI mid layer.

Let me give some background here.
I am using megaraid_sas controller. megaraid_sas driver returns all
outstanding SCSI commands back to SCSI layer with DID_RESET host_byte
before resetting controller.
The intent is- all these commands returned with DID_RESET before
controller reset should be retried by SCSI.

In few distros, I have observed that if SYNCHRONIZE_CACHE command(should
be applicable for all non Read write commands) is outstanding before
resetting controller  and driver returns those commands back with
DID_RESET then SYNCHRONIZE_CACHE command not retried but failed to upper
layer
but other READ/WRITE commands were not failed but retried. I was running
filesystem IOs and SYNHRONIZE_CACHE returned with error cause filesystem
to go in READ only mode.

Later I checked and realized below commit was missing in that distro
kernel and seems to cause the problem-

a621bac scsi_lib: correctly retry failed zero length REQ_TYPE_FS commands

But distro kernel has below commit-

89fb4cd scsi: handle flush errors properly

Issue does not hit on the kernels which don't have both commits but hits
when commit- "89fb4cd scsi: handle flush errors properly " is there
but commit-  "a621bac scsi_lib: correctly retry failed zero length
REQ_TYPE_FS commands" is missing.

This issue is observed with mpt3sas driver as well and should be
applicable to all LLDs returning non read write commands with DID_RESET.

Returning DID_REQUEUE instead of DID_RESET from driver solves the problem
irrespective of whether these above mentioned commits are there or not in
kernel.
So I am thinking to use DID_REQUEUE instead of DID_RESET in megaraid_sas
driver for all SCSI commands(not only limited to SYNCHRONIZE_CACHE or non
read write commands)
before resetting controller. As I mentioned intent is those outstanding
commands returned by driver before doing controller reset should be
retried and as soon as reset is complete,
driver will be processing those commands. There is no sense key associated
with SCSI commands returned.

I browsed SCSI code and get to know DID_REQUEUE causes command to be
retried by calling- scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY).
This seems to be good enough for our requirement of commands to be
re-tried by SCSI layer.

Please provide feedback if anyone forsee any issue with using DID_REQUEUE
for this use case.
I will be doing some testing with DID_REQUEUE in place of DID_RESET in
megaraid_sas driver.

I have attached here a proposed patch for megaraid_sas driver.
If there are no concerns, I will send this patch to SCSI mailing list.


Hmm.

DID_RESET is meant to be an indicator to the SCSI midlayer that the host 
/ device was reset, and the command _should_ be retried.

DID_REQUEUE OTOH is an indicator to the SCSI midlayer to retry the command.

The problem with DID_RESET is that it's slightly underspecified; there 
is no indication if the command has been processed (and if so, up to 
which parts?)

DID_REQUEUE, OTOH, will always cause a retry.

So yeah, I guess DID_REQUEUE is a better choice here.

Cheers,

Hannes
--
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: SCSI: usage of DID_REQUEUE vs DID_RESET for returning SCSI commands to be retried

2016-12-13 Thread Sumit Saxena
Adding direct email addresses of few people to avoid any filters.

Hannes/Martin/James/Tomas/Christoph,

Can you please comment on this?

Thanks,
Sumit

>-Original Message-
>From: Sumit Saxena [mailto:sumit.sax...@broadcom.com]
>Sent: Tuesday, December 13, 2016 6:50 PM
>To: 'linux-scsi'
>Subject: SCSI: usage of DID_REQUEUE vs DID_RESET for returning SCSI
>commands to be retried
>
>Hi all,
>
>I have query regarding usage of host_byte DID_REQUEUE vs DID_RESET
returned
>by LLD to SCSI mid layer.
>
>Let me give some background here.
>I am using megaraid_sas controller. megaraid_sas driver returns all
outstanding
>SCSI commands back to SCSI layer with DID_RESET host_byte before
resetting
>controller.
>The intent is- all these commands returned with DID_RESET before
controller
>reset should be retried by SCSI.
>
>In few distros, I have observed that if SYNCHRONIZE_CACHE command(should
be
>applicable for all non Read write commands) is outstanding before
resetting
>controller  and driver returns those commands back with DID_RESET then
>SYNCHRONIZE_CACHE command not retried but failed to upper layer but other
>READ/WRITE commands were not failed but retried. I was running filesystem
IOs
>and SYNHRONIZE_CACHE returned with error cause filesystem to go in READ
only
>mode.
>
>Later I checked and realized below commit was missing in that distro
kernel and
>seems to cause the problem-
>
>a621bac scsi_lib: correctly retry failed zero length REQ_TYPE_FS commands
>
>But distro kernel has below commit-
>
>89fb4cd scsi: handle flush errors properly
>
>Issue does not hit on the kernels which don't have both commits but hits
when
>commit- "89fb4cd scsi: handle flush errors properly " is there but
commit-
>"a621bac scsi_lib: correctly retry failed zero length REQ_TYPE_FS
commands" is
>missing.
>
>This issue is observed with mpt3sas driver as well and should be
applicable to all
>LLDs returning non read write commands with DID_RESET.
>
>Returning DID_REQUEUE instead of DID_RESET from driver solves the problem
>irrespective of whether these above mentioned commits are there or not in
>kernel.
>So I am thinking to use DID_REQUEUE instead of DID_RESET in megaraid_sas
>driver for all SCSI commands(not only limited to SYNCHRONIZE_CACHE or non
>read write commands) before resetting controller. As I mentioned intent
is those
>outstanding commands returned by driver before doing controller reset
should be
>retried and as soon as reset is complete, driver will be processing those
>commands. There is no sense key associated with SCSI commands returned.
>
>I browsed SCSI code and get to know DID_REQUEUE causes command to be
>retried by calling- scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY).
>This seems to be good enough for our requirement of commands to be
re-tried
>by SCSI layer.
>
>Please provide feedback if anyone forsee any issue with using DID_REQUEUE
for
>this use case.
>I will be doing some testing with DID_REQUEUE in place of DID_RESET in
>megaraid_sas driver.
>
>I have attached here a proposed patch for megaraid_sas driver.
>If there are no concerns, I will send this patch to SCSI mailing list.
>
>Thanks,
>Sumit
--- Begin Message ---
Driver returns outstanding SCSI commands to SCSI layer with host_byte
set to DID_RESET before resetting controller so that SCSI layer should
retry
those commands.
Sending DID_RESET for non RW commands(e.g SYNCHRONIZE_CACHE) may cause
those commands getting failed and returning I/O erros on few distros which
has included commit- 89fb4cd scsi: handle flush errors properly but missed
commit- a621bac scsi_lib: correctly retry failed zero length REQ_TYPE_FS
commands.
However Read write commands returned with DID_RESET are not failed but
retried.

DID_REQUEUE seems safer to use instead of DID_RESET for all outstanding
commands before doing chip reset as it serves purpose of getting all
commands
to be retried by SCSI layer.

Signed-off-by: Sumit Saxena 
Signed-off-by: Kashyap Desai 
---
 drivers/scsi/megaraid/megaraid_sas_base.c   | 4 ++--
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
b/drivers/scsi/megaraid/megaraid_sas_base.c
index 6484c38..959ce3e 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -1662,7 +1662,7 @@ megasas_queue_command(struct Scsi_Host *shost,
struct scsi_cmnd *scmd)
/* Check for an mpio path and adjust behavior */
if (atomic_read(>adprecovery) ==
MEGASAS_ADPRESET_SM_INFAULT) {
if (megasas_check_mpio_paths(instance, scmd) ==
-   (DID_RESET << 16)) {
+   (DID_REQUEUE << 16)) {
return SCSI_MLQUEUE_HOST_BUSY;
} else {
scmd->result = DID_NO_CONNECT << 16;
@@ -2483,7 +2483,7 @@ static int megasas_wait_for_outstanding(struct