Device not ready after error recovery?

2009-03-19 Thread dave

Can anyone tell me why the SCSI layer says the device is not ready
when iscsiadm reports it is logged in?

Can I manually online the device? How should I recover from here?

Is this a known problem, and has it been fixed in newer open-iscsi
versions?

Mar 18 18:21:33 eq1-vz2 kernel:  connection1:0: detected conn error
(1011)
Mar 18 18:21:36 eq1-vz2 kernel:  session1: host reset succeeded
Mar 18 18:22:16 eq1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined -
not ready after error recovery
Mar 18 18:22:16 eq1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined -
not ready after error recovery
Mar 18 18:22:16 eq1-vz2 kernel: sd 6:0:0:0: SCSI error: return code =
0x0002
Mar 18 18:22:16 eq1-vz2 kernel: end_request: I/O error, dev sdc,
sector 523643177
Mar 18 18:22:16 eq1-vz2 kernel: device-mapper: multipath: Failing path
8:32.
Mar 18 18:22:16 eq1-vz2 kernel: sd 6:0:0:0: SCSI error: return code =
0x0001
Mar 18 18:22:16 eq1-vz2 kernel: end_request: I/O error, dev sdc,
sector 552260889
... snip - more I/O error messages ...

$ sudo iscsiadm -m session -P3
iSCSI Transport Class version 2.0-869
iscsiadm version 2.0-869
Target: iqn.1986-03.com.sun:02:271d5722-0206-6ad0-fe1f-d44007068ec4
Current Portal: 10.0.15.0:3260,1
Persistent Portal: 10.0.15.0:3260,1
**
Interface:
**
Iface Name: iface.bond0
Iface Transport: tcp
Iface Initiatorname: iqn.2005-03.com.equest:eq1-vz2
Iface IPaddress: 10.0.10.1
Iface HWaddress: default
Iface Netdev: bond0
SID: 1
iSCSI Connection State: LOGGED IN
iSCSI Session State: LOGGED_IN
Internal iscsid Session State: NO CHANGE

Negotiated iSCSI params:

HeaderDigest: None
DataDigest: None
MaxRecvDataSegmentLength: 131072
MaxXmitDataSegmentLength: 131072
FirstBurstLength: 262144
MaxBurstLength: 16776192
ImmediateData: Yes
InitialR2T: Yes
MaxOutstandingR2T: 1

Attached SCSI devices:

Host Number: 6  State: running
scsi6 Channel 00 Id 0 Lun: 0
Attached scsi disk sdc  State: offline


201151723430d2a0048d003dddm-3 SUN,SOLARIS
[size=300G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=0][enabled]
 \_ 6:0:0:0 sdc 8:32  [failed][faulty]

$ cat /etc/multipath.conf
defaults {
default_features 1 queue_if_no_path
}

devnode_blacklist {
devnode ^hd[a-z]$
devnode ^sd[ab]$
devnode ^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*
devnode ^cciss!c[0-9]d[0-9]*[p[0-9]*]
}

Kernel is custom compiled from 2.6.18 source on Debian 4.0
$ uname -a
Linux eq1-vz2 2.6.18-prep-92.1.1.el5.028stab057.2-ovz #1 SMP Mon Aug
25 16:43:00 MDT 2008 x86_64 GNU/Linux

The open-iscsi tools and module were compiled by hand as well.

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
open-iscsi group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: Device not ready after error recovery?

2009-03-19 Thread Mike Christie

dave wrote:
> Can anyone tell me why the SCSI layer says the device is not ready
> when iscsiadm reports it is logged in?
>
> Can I manually online the device? How should I recover from here?

You can do

echo running > /sys/block/sdX/device/state

but you might not want to because the device may not be back.
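
A minimal sketch of that manual step, assuming you have already confirmed
the disk is healthy on the target side. The function name
online_scsi_disk and the SYSFS_BLOCK override are hypothetical (the
override only exists so the function can be tried against a scratch
directory); the sysfs state file is the one named above. It reads the
current state first and only writes "running" if the device is not
already in that state.

```shell
#!/bin/sh
# Hypothetical override for testing; on a real host this is /sys/block.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

# online_scsi_disk DEV
# Set an offlined SCSI disk (e.g. sdc) back to "running" via sysfs.
# Returns 1 if the device has no state file.
online_scsi_disk() {
    dev="$1"
    state_file="$SYSFS_BLOCK/$dev/device/state"

    if [ ! -f "$state_file" ]; then
        echo "no state file for $dev" >&2
        return 1
    fi

    state=$(cat "$state_file")
    if [ "$state" = "running" ]; then
        echo "$dev already running"
        return 0
    fi

    echo "$dev is $state, setting to running"
    echo running > "$state_file"
}
```

Run as root on the initiator it would be called as: online_scsi_disk sdc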


> Is this a known problem, and has it been fixed in newer open-iscsi
> versions?

Are you using an older version of the Sun target?

> Mar 18 18:21:33 eq1-vz2 kernel:  connection1:0: detected conn error
> (1011)
> Mar 18 18:21:36 eq1-vz2 kernel:  session1: host reset succeeded

When we log back in we tell scsi-ml that we are ok.

> Mar 18 18:22:16 eq1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined -
> not ready after error recovery

scsi-ml will send a Test unit ready (TUR) command to check that the 
device is ready to go. The TUR seems to be failing and so the scsi layer 
sets the device offline.

I think there was a target issue that was fixed in newer versions.

If you can easily replicate this, then you should take a wireshark/ethereal
trace and send it here so we can see why the TUR failed and make
sure it is not our fault before you go to the trouble of updating.




Re: Device not ready after error recovery?

2009-03-19 Thread dave



On Mar 19, 10:56 am, Mike Christie micha...@cs.wisc.edu wrote:
> dave wrote:
> > Can anyone tell me why the SCSI layer says the device is not ready
> > when iscsiadm reports it is logged in?
> >
> > Can I manually online the device? How should I recover from here?
>
> You can do
>
> echo running > /sys/block/sdX/device/state
>
> but you might not want to because the device may not be back.

A disk in the Sun iscsi target server died. When a disk fails in the
server, the iscsi target pauses all read/writes for about 1-2 minutes
until it marks the disk as faulted, then continues normal operation
using the rest of the RAID pool. I had tested this before and
dm-multipath with iscsi seemed to work just fine when the iscsi target
paused and eventually resumed, so I was just a little surprised this
time. Usually I see timing closer to a minute between conn error and
recovery... what are the reconnect/recovery timers of open-iscsi for
this scenario?


> > Is this a known problem, and has it been fixed in newer open-iscsi
> > versions?
>
> Are you using an older version of the Sun target?

I am. I am running OpenSolaris SXCE build 93, which is about 8 months
old. I'll be upgrading this soon.




> > Mar 18 18:21:33 eq1-vz2 kernel:  connection1:0: detected conn error
> > (1011)
> > Mar 18 18:21:36 eq1-vz2 kernel:  session1: host reset succeeded
>
> When we log back in we tell scsi-ml that we are ok.

At what level does the connection receive an error and reset (can't
log in to target, read/write errors, etc), and what functionality is
needed to be considered ok? If the device wasn't really ready to be
used again, shouldn't iscsi know this and attempt another recovery?
I'm not particularly well versed in iscsi protocol.


> > Mar 18 18:22:16 eq1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined -
> > not ready after error recovery
>
> scsi-ml will send a Test unit ready (TUR) command to check that the
> device is ready to go. The TUR seems to be failing and so the scsi layer
> sets the device offline.

Is there only one TUR sent? I would have assumed a more robust
recovery procedure here.


> I think there was some target issue and was fixed in newer ones.
>
> If you can easily replicate this then you should take wireshark/ethereal
> trace and send the trace here so we can see why the TUR failed and make
> sure it is not our fault before you go to the trouble of updating.

I'll see what I can do to get a wire trace next time I have an
opportunity to intentionally hiccup the iscsi target.

Thanks, Mike.

--
Dave



Re: Device not ready after error recovery?

2009-03-19 Thread Mike Christie

dave wrote:
 
 
> On Mar 19, 10:56 am, Mike Christie micha...@cs.wisc.edu wrote:
> > dave wrote:
> > > Can anyone tell me why the SCSI layer says the device is not ready
> > > when iscsiadm reports it is logged in?
> > > Can I manually online the device? How should I recover from here?
> > You can do
> >
> > echo running > /sys/block/sdX/device/state
> >
> > but you might not want to because the device may not be back.
>
> A disk in the Sun iscsi target server died. When a disk fails in the
> server, the iscsi target pauses all read/writes for about 1-2 minutes
> until it marks the disk as faulted, then continues normal operation
> using the rest of the RAID pool. I had tested this before and
> dm-multipath with iscsi seemed to work just fine when the iscsi target
> paused and eventually resumed, so I was just a little surprised this
> time. Usually I see timing closer to a minute between conn error and
> recovery... what are the reconnect/recovery timers of open-iscsi for
> this scenario?

First the scsi command timer would expire. You can see/set this in
/sys/block/sdX/device/timeout (there is also a udev rule). This causes
the scsi eh to run. That will try to abort the tasks on the device. If
that fails we try an LU reset. If that fails we drop the sessions on the
host and relogin (that is where the host reset messages come from). So
for a disk failure, we can log back in quickly because the target is
fine. The scsi eh will then send a TUR to the device to verify it is
back. The TUR would/could then fail quickly like you saw because the
disk really is bad. In this case, once you know the disk is back online,
you would want to manually set the state to running. Eventually
multipathd will then set the path back online in the multipath device.
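
As a rough illustration of the first step in that chain, the sketch
below reads and optionally changes the scsi command timer (in seconds)
through the sysfs file mentioned above. The function name
scsi_cmd_timeout and the SYSFS_BLOCK override are hypothetical; the
override is only there so the function can be tried against a scratch
directory. A longer timer gives a paused target more time before the
scsi eh starts, at the cost of slower failure detection.

```shell
#!/bin/sh
# Hypothetical override for testing; on a real host this is /sys/block.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

# scsi_cmd_timeout DEV [SECONDS]
# With one argument, print the current scsi command timeout for DEV.
# With two, write the new value first and then print what is stored.
scsi_cmd_timeout() {
    dev="$1"
    new="${2:-}"
    timeout_file="$SYSFS_BLOCK/$dev/device/timeout"

    if [ ! -f "$timeout_file" ]; then
        echo "no timeout file for $dev" >&2
        return 1
    fi

    if [ -n "$new" ]; then
        echo "$new" > "$timeout_file"
    fi

    cat "$timeout_file"
}
```

For example, scsi_cmd_timeout sdc prints the current value and
scsi_cmd_timeout sdc 120 raises it to two minutes (needs root on a real
host, and does not persist across reboots or device rescans).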


 
> > > Is this a known problem, and has it been fixed in newer open-iscsi
> > > versions?
> > Are you using an older version of the Sun target?
>
> I am. I am running OpenSolaris SXCE build 93, which is about 8 months
> old. I'll be upgrading this soon.
 


> > > Mar 18 18:21:33 eq1-vz2 kernel:  connection1:0: detected conn error
> > > (1011)
> > > Mar 18 18:21:36 eq1-vz2 kernel:  session1: host reset succeeded
> > When we log back in we tell scsi-ml that we are ok.
>
> At what level does the connection receive an error and reset (can't
> log in to target, read/write errors, etc), and what functionality is
> needed to be considered ok? If the device wasn't really ready to be
> used again, shouldn't iscsi know this and attempt another recovery?
> I'm not particularly well versed in iscsi protocol.

iSCSI does not know this and does not really deal with the device. It
deals with the connections/session to the target port/portal. So the
target seems fine, and so we can log back in quickly. The connections are
fine and we can send iscsi level IOs like logins and nops to the target and
it will respond ok. The target could tell the initiator that it is
temporarily unavailable when we try to login again, but if it can allow
IO to other disks while this problem on the one bad disk is going on it
probably would not want to do this.

If the target is returning something in the TUR that indicates that the 
device is only temporarily gone, then maybe we would want to change the 
scsi layer so that instead of failing and setting the device offline 
right away it retries its eh a little later.


 
> > > Mar 18 18:22:16 eq1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined -
> > > not ready after error recovery
> > scsi-ml will send a Test unit ready (TUR) command to check that the
> > device is ready to go. The TUR seems to be failing and so the scsi layer
> > sets the device offline.
>
> Is there only one TUR sent? I would have assumed a more robust
> recovery procedure here.

Only a TUR is sent to check if the aborts or resets worked.


 
> > I think there was some target issue and was fixed in newer ones.
> >
> > If you can easily replicate this then you should take wireshark/ethereal
> > trace and send the trace here so we can see why the TUR failed and make
> > sure it is not our fault before you go to the trouble of updating.
>
> I'll see what I can do to get a wire trace next time I have an
> opportunity to intentionally hiccup the iscsi target.
 

You probably do not need to worry about this. It is working as expected.

But if you could get a trace we can see what the TUR failed with and
maybe see if we can add some code so that if the device is telling us it
is only a temporary problem then we do not fail right away.
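
One possible way to collect such a trace is with tcpdump, whose capture
files wireshark can open. This is only a sketch: the helper name
iscsi_capture_cmd and the default output file name are hypothetical,
and the example values (bond0, 10.0.15.0, port 3260) are taken from the
iscsiadm output in the first post. The helper only prints the command,
so it can be reviewed before being run as root.

```shell
#!/bin/sh
# iscsi_capture_cmd IFACE PORTAL_IP [PORT] [OUTFILE]
# Print a tcpdump command line that captures full packets (-s 0)
# between this host and the given iSCSI portal into a pcap file.
iscsi_capture_cmd() {
    iface="$1"
    portal_ip="$2"
    port="${3:-3260}"
    outfile="${4:-iscsi-trace.pcap}"

    echo "tcpdump -i $iface -s 0 -w $outfile host $portal_ip and port $port"
}
```

Calling iscsi_capture_cmd bond0 10.0.15.0 prints the command for the
setup in this thread. Start the capture before triggering the target
pause and stop it with Ctrl-C after the device is offlined, so the
failed TUR response is included.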
