Re: How to resurrect SCSI device

2012-06-11 Thread niejasiek
i'm not using multipath, and it seems that i have the exact same problem.
(identical ietd version, debian squeeze on amd64, open-iscsi as initiator)

when i restart ietd during i/o, the session on the initiator side becomes
unusable (read-only), with no way to get it working again other than a
logout/login. this of course fails all pending i/o, which is bad.
longer timeouts on the initiator side just make the wait for the failure longer.
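
for reference, the logout/login dance i end up doing looks roughly like this
(target name and portal are placeholders for whatever iscsiadm -m session
shows on your box):

  # example target name and portal; substitute the values iscsiadm -m session reports
  # give up on the stuck session (this fails all outstanding i/o)
  iscsiadm -m node -T iqn.2001-04.example:storage -p 192.0.2.1:3260 --logout
  # log back in; the block device comes back usable
  iscsiadm -m node -T iqn.2001-04.example:storage -p 192.0.2.1:3260 --login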

have you found a solution yet?

maybe the trick described here will help?
http://blog.wpkg.org/2007/09/27/solving-reliability-and-scalability-problems-with-iscsi-part-2/
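
if i understand that post correctly, the idea is to drop the stale scsi
device and rescan the session instead of doing a full relogin, something
along these lines (sdc and the session id are examples, untested here):

  # sdc and session id 3 are examples; untested on my setup
  # remove the stale device from the scsi layer
  echo 1 > /sys/block/sdc/device/delete
  # rescan the iscsi session so the lun shows up again
  iscsiadm -m session -r 3 --rescan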

regards





Re: How to resurrect SCSI device

2012-06-11 Thread Mike Christie
On 06/11/2012 07:43 AM, niejasiek wrote:
 i'm not using multipath, and it seems that i have the exact same problem.
 (identical ietd version, debian squeeze on amd64, open-iscsi as initiator)
 
 when i restart ietd during i/o, the session on the initiator side becomes
 unusable (read-only), with no way to get it working again other than a
 logout/login. this of course fails all pending i/o, which is bad.
 longer timeouts on the initiator side just make the wait for the failure longer.
 
 have you found a solution yet?
 
 maybe the trick described here will help?
 http://blog.wpkg.org/2007/09/27/solving-reliability-and-scalability-problems-with-iscsi-part-2/
 

Yes, that sounds like the same problem. You should contact the ietd list to
check whether a newer ietd version has already fixed it.




Re: How to resurrect SCSI device

2011-03-02 Thread Székelyi Szabolcs
Hi Mike,

On Wednesday 02 March 2011 03:59:39 Mike Christie wrote:
 On 03/01/2011 12:18 PM, Székelyi Szabolcs wrote:
  I'm facing a somewhat strange situation. As a part of testing a multipath
  solution, I've written a script that simulates the failure and recovery
  of one path, which happens to be an iSCSI connection. The script runs on the
  target side, periodically stopping and starting the target. After some
  successful failovers and failbacks the SCSI device on the initiator side
  seems to block all operations on the low-level (i.e. non-multipathed)
  device. However, the iSCSI session seems to reestablish properly. I'm
  seeking your advice about how to get this device to work again. I don't
  even know what can cause such a problem. Is it the SCSI layer, the
  initiator or the target? I'm quite sure that a relogin could solve this,
  but I'd like to avoid that if possible.
  
  I'm running Open-iSCSI 2.0.871.3, kernel 2.6.32 as distributed in Debian
  Squeeze on x86-64 hardware on both sides, with IET 1.4.20.2 as the
  target.
 
  Regarding the SCSI device in question, I see log entries like this:
 Do you have the time stamps for those errors?

I've run the test again, here are the log entries with timestamps.

Mar  2 12:21:02 debian multipathd: sdc: directio checker reports path is down
Mar  2 12:21:07 debian multipathd: sdc: directio checker reports path is down
Mar  2 12:21:12 debian multipathd: sdc: directio checker reports path is down
Mar  2 12:21:17 debian multipathd: sdc: directio checker reports path is down
Mar  2 12:21:22 debian multipathd: sdc: directio checker reports path is down
Mar  2 12:21:25 debian kernel: [601122.678387]  connection3:0: detected conn 
error (1020)
Mar  2 12:21:27 debian iscsid: Kernel reported iSCSI connection 3:0 error 
(1020) state (3)
Mar  2 12:21:27 debian multipathd: sdc: directio checker reports path is down
Mar  2 12:21:29 debian iscsid: connect to 193.225.36.17:3260 failed 
(Connection refused)
Mar  2 12:21:32 debian multipathd: sdc: directio checker reports path is down
Mar  2 12:21:33 debian iscsid: connect to [ipaddress]:3260 failed (Connection 
refused)
Mar  2 12:21:37 debian iscsid: connect to [ipaddress]:3260 failed (Connection 
refused)
Mar  2 12:21:37 debian multipathd: sdc: directio checker reports path is down

[Stripped messages identical to the last two]

Mar  2 12:23:18 debian iscsid: connect to 193.225.36.17:3260 failed 
(Connection refused)
Mar  2 12:23:22 debian iscsid: connect to 193.225.36.17:3260 failed 
(Connection refused)
Mar  2 12:23:22 debian multipathd: sdc: directio checker reports path is down
Mar  2 12:23:26 debian iscsid: connect to 193.225.36.17:3260 failed 
(Connection refused)
Mar  2 12:23:26 debian kernel: [601242.928075]  session3: session recovery 
timed out after 120 secs
Mar  2 12:23:27 debian multipathd: sdc: directio checker reports path is down
Mar  2 12:23:29 debian iscsid: connect to 193.225.36.17:3260 failed 
(Connection refused)
Mar  2 12:23:32 debian multipathd: sdc: directio checker reports path is down

[Stripped messages identical to the last two]

Mar  2 12:26:15 debian iscsid: connect to 193.225.36.17:3260 failed 
(Connection refused)
Mar  2 12:26:17 debian multipathd: sdc: directio checker reports path is down
Mar  2 12:26:19 debian iscsid: connection3:0 is operational after recovery (78 
attempts)
Mar  2 12:26:22 debian multipathd: sdc: directio checker reports path is down
Mar  2 12:26:27 debian multipathd: sdc: directio checker reports path is down
Mar  2 12:26:32 debian multipathd: sdc: directio checker reports path is down
Mar  2 12:26:37 debian multipathd: sdc: directio checker reports path is down
Mar  2 12:26:42 debian multipathd: sdc: directio checker reports path is down

You can see that I shut the target down at 12:21:25; the recovery timed out 
two minutes later, as expected (at 12:23:26). I restarted the target at 
12:26:19 and the session was reestablished, but the SCSI device still fails 
to work, as reported by multipathd.
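
The 120 second window is node.session.timeo.replacement_timeout, which I've
left at its default; for reference, it can be changed per node with something
like this (the target name is a placeholder, 180 an example value):

  # target name is a placeholder; 180 seconds is just an example
  iscsiadm -m node -T iqn.2001-04.example:storage -o update \
      -n node.session.timeo.replacement_timeout -v 180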

 Is everything going through 1 session?

No, there are two separate sessions to two separate machines. The backend 
storage is the same: I'm multipathing to a volume that lives on shared 
storage and is served by two iSCSI gateways.

 At this point the iscsi layer should now be up. If you run
 
 iscsiadm -m session -P 3
 
 does the session state indicate logged in, and do the device states
 indicate running?

Yes, the session state is LOGGED_IN and the device state is running.
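
For what it's worth, the same states should also be visible directly in
sysfs (3:0:0:0 is just an example H:C:T:L for this device):

  # 3:0:0:0 is an example H:C:T:L; expect "running"
  cat /sys/class/scsi_device/3:0:0:0/device/state
  # iSCSI session state; expect "LOGGED_IN"
  cat /sys/class/iscsi_session/session3/state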

  multipathd: sdc: directio checker reports path is down
 
 If this happened around the same time as the recovery message before it,
 it could have been a race.

No, it keeps repeating indefinitely after the session is reestablished, as 
you can see above.

 If at this point you send your own IO using SG/passthrough (something
 like sg_turs /dev/sdc) does that succeed or fail?

It blocks forever; I can't stop it even with SIGTERM. It succeeds on the 
other path.
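
In case it helps to see where the command is stuck: assuming SysRq is
enabled, blocked tasks can be dumped to the kernel log, and the per-device
in-flight counters show whether requests are still queued:

  # dump tasks stuck in uninterruptible sleep (with kernel stacks) to dmesg
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 50
  # two counters: reads and writes currently outstanding on sdc
  cat /sys/block/sdc/inflight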

Cheers,
-- 
cc


Re: How to resurrect SCSI device

2011-03-02 Thread Mark Lehrer
I'm facing a somewhat strange situation. As a part of testing a 
multipath solution, I've written a script that simulates the failure and


This is an excellent thread, thanks.

Does this happen even without multipathd?  I have been trying to work out a 
procedure to fix a similar problem where the device stays read-only even 
though the session was re-established, and there is no way either to unmount 
it or to get it back to a usable state.
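
For reference, the read-only flag can at least be inspected and toggled at
the block layer (sdc is a placeholder; I don't know yet whether this alone
is enough to recover the device):

  # sdc is a placeholder; prints 1 if the kernel has it marked read-only
  blockdev --getro /dev/sdc
  # attempt to clear the read-only flag
  blockdev --setrw /dev/sdc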


Mark




How to resurrect SCSI device

2011-03-01 Thread Székelyi Szabolcs
Hi all,

I'm facing a somewhat strange situation. As a part of testing a multipath 
solution, I've written a script that simulates the failure and recovery of one 
path, which happens to be an iSCSI connection. The script runs on the target 
side, periodically stopping and starting the target. After some successful 
failovers and 
failbacks the SCSI device on the initiator side seems to block all operations 
on the low-level (i.e. non-multipathed) device. However, the iSCSI session seems 
to reestablish properly. I'm seeking your advice about how to get this device 
to work again. I don't even know what can cause such a problem. Is it the SCSI 
layer, the initiator or the target? I'm quite sure that a relogin could solve 
this, but I'd like to avoid that if possible.

I'm running Open-iSCSI 2.0.871.3, kernel 2.6.32 as distributed in Debian 
Squeeze on x86-64 hardware on both sides, with IET 1.4.20.2 as the target.

Regarding the SCSI device in question, I see log entries like this:

kernel: [538607.929428]  connection3:0: detected conn error (1020)
iscsid: Kernel reported iSCSI connection 3:0 error (1020) state (3)
kernel: [538728.180083]  session3: session recovery timed out after 120 secs
iscsid: connect to [ipaddress]:3260 failed (Connection refused)
multipathd: sdc: directio checker reports path is down
iscsid: connect to 193.225.36.16:3260 failed (Connection refused)
iscsid: connection3:0 is operational after recovery (34 attempts)
multipathd: sdc: directio checker reports path is down

I'm quite suspicious about the last two lines: if the connection is 
operational, why is the path still down? Any ideas on how to handle this 
situation?
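
For reference, the path state above comes from the standard multipath-tools; 
it can be watched with:

  # topology and per-path status of all multipath maps
  multipath -ll
  # live path status from the running daemon
  multipathd -k"show paths"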

Thanks,
-- 
cc
