From: Kevin Barnett <kevin.barn...@microsemi.com>

Problem:
The Linux kernel takes a logical volume offline after a LUN reset.
This is generally accompanied by this message in the dmesg output:

Device offlined - not ready after error recovery

Root Cause:
The root cause is a "quirk" in the timeout handling in the Linux
SCSI layer. The Linux kernel places a 30-second timeout on most media
access commands (reads and writes) that it sends to device drivers.
When a media access command times out, the Linux kernel goes into
error recovery mode for the LUN that was the target of the command
that timed out. Every command that timed out is kept on a list inside
of the Linux kernel to be retried later. The kernel attempts to
recover the command(s) that timed out by issuing a LUN reset
followed by a TEST UNIT READY. If the LUN reset and
TEST UNIT READY commands are successful, the kernel retries
the command(s) that timed out.

Each SCSI command issued by the kernel has a result field associated
with it. This field indicates the final result of the command (success
or error). When a command times out, the kernel places a value in this
result field indicating that the command timed out.

The "quirk" is that after the LUN reset and TEST UNIT READY commands
are completed, the kernel checks each command on the timed-out command
list before retrying it. If the result field is still "timed out", the
kernel treats that command as not having been successfully recovered
for a retry. If the number of commands in this state is
greater than two, the kernel takes the LUN offline.
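
As a rough illustration of the check described above, here is a minimal
sketch. It is a simplified model, not the kernel's error-handling code:
the struct, the result values, and the function name are all
hypothetical, and the "more than two" threshold is taken from the
description in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's result values. */
enum model_result {
	MODEL_RESULT_SUCCESS = 0,
	MODEL_RESULT_TIMED_OUT = 1,
};

/* Hypothetical, reduced view of a SCSI command: only the result field. */
struct model_cmd {
	int result;
};

/*
 * Model of the post-recovery check: returns true if more than two
 * commands on the timed-out list still carry the "timed out" result
 * after the LUN reset and TEST UNIT READY have completed, i.e. the
 * case in which the LUN would be taken offline.
 */
static bool model_lun_goes_offline(const struct model_cmd *cmds, size_t count)
{
	size_t still_timed_out = 0;
	size_t i;

	assert(cmds != NULL || count == 0);

	for (i = 0; i < count; i++) {
		if (cmds[i].result == MODEL_RESULT_TIMED_OUT)
			still_timed_out++;
	}

	return still_timed_out > 2;
}
```

In this model, three commands whose result was never rewritten are
enough to take the LUN offline, even if the device completed all of
them successfully.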

Fix:
When our RAIDStack receives a LUN reset, it simply waits until all
outstanding commands complete. Generally, all of these outstanding
commands complete successfully. Therefore, the fix in the smartpqi
driver is to always set the command result field to indicate success
when a request completes successfully. This normally isn’t necessary
because the result field is always initialized to success when the
command is submitted to the driver. So when the command completes
successfully, the result field is left untouched. But in this case,
the kernel changes the result field behind the driver’s back and
then expects the field to be changed by the driver as the
commands that timed out complete.
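
The sequence can be modeled in a few lines. This is only a sketch under
stated assumptions: every name below is hypothetical, the result values
are placeholders rather than the kernel's encoding, and the
write_result flag exists solely to contrast the old behavior with the
fixed one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder result values for this sketch only. */
#define SKETCH_SUCCESS   0
#define SKETCH_TIMED_OUT 1

/* Hypothetical, reduced view of a SCSI command. */
struct sketch_cmd {
	int result;
};

/* Kernel side: the result starts out as success at submission. */
static void sketch_submit(struct sketch_cmd *cmd)
{
	assert(cmd != NULL);
	cmd->result = SKETCH_SUCCESS;
}

/* Kernel side: timeout handling overwrites the result field. */
static void sketch_mark_timed_out(struct sketch_cmd *cmd)
{
	assert(cmd != NULL);
	cmd->result = SKETCH_TIMED_OUT;
}

/*
 * Driver side: successful completion.
 * write_result == false models the old behavior (field left untouched).
 * write_result == true models the fix (field always set to success).
 * The NULL check mirrors the "if (io_request->scmd)" guard in the patch.
 */
static void sketch_complete_ok(struct sketch_cmd *cmd, bool write_result)
{
	if (cmd && write_result)
		cmd->result = SKETCH_SUCCESS;
}
```

With write_result false, a command that is submitted, marked as timed
out, and then completed still reports "timed out". With write_result
true, the same sequence ends with the result set to success.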

Reviewed-by: Dave Carroll <david.carr...@microsemi.com>
Reviewed-by: Scott Teel <scott.t...@microsemi.com>
Signed-off-by: Kevin Barnett <kevin.barn...@microsemi.com>
Signed-off-by: Don Brace <don.br...@microsemi.com>
---
 drivers/scsi/smartpqi/smartpqi_init.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index bee14fc8a35e..2f2a07a38dad 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -2841,6 +2841,9 @@ static unsigned int pqi_process_io_intr(struct pqi_ctrl_info *ctrl_info,
                switch (response->header.iu_type) {
                case PQI_RESPONSE_IU_RAID_PATH_IO_SUCCESS:
                case PQI_RESPONSE_IU_AIO_PATH_IO_SUCCESS:
+                       if (io_request->scmd)
+                               io_request->scmd->result = 0;
+                       /* fall through */
                case PQI_RESPONSE_IU_GENERAL_MANAGEMENT:
                        break;
                case PQI_RESPONSE_IU_VENDOR_GENERAL:
