Hi Mike,
On Wednesday 02 March 2011 03:59:39 Mike Christie wrote:
On 03/01/2011 12:18 PM, Székelyi Szabolcs wrote:
I'm facing a somewhat strange situation. As a part of testing a multipath
solution, I've written a script that simulates the failure and recovery
of one path which happens to be an iSCSI connection. It's running on the
target side, periodically stopping and starting it. After some
successful failovers and failbacks the SCSI device on the initiator side
seems to block all operations on the low-level (i.e. non-multipathed)
device. However, the iSCSI session seems to reestablish properly. I'm
seeking your advice about how to get this device to work again. I don't
even know what can cause such a problem. Is it the SCSI layer, the
initiator or the target? I'm quite sure that a relogin could solve this,
but I'd like to avoid that if possible.
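
For readers who want to reproduce this kind of test, here is a minimal sketch of a path-flapping loop. It is not the script used in the test above. The init script path /etc/init.d/iscsitarget, the function name flap_target and the intervals are all assumptions to be adapted to the target in use.

```python
import subprocess
import time

# Assumption: the target is controlled through this init script.
TARGET_INIT = "/etc/init.d/iscsitarget"

def flap_target(cycles, down_secs, up_secs,
                run=subprocess.check_call, sleep=time.sleep):
    """Stop and start the target `cycles` times to simulate a path
    failing and recovering. `run` and `sleep` are injectable so the
    loop can be dry-run without touching a real target."""
    for _ in range(cycles):
        run([TARGET_INIT, "stop"])    # path goes down
        sleep(down_secs)
        run([TARGET_INIT, "start"])   # path comes back
        sleep(up_secs)

if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    flap_target(2, 0, 0, run=print, sleep=lambda secs: None)
```

Choosing down_secs longer than the initiator's recovery timeout exercises the case described in this thread, where the session recovery times out before the target returns.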
I'm running Open-iSCSI 2.0.871.3, kernel 2.6.32 as distributed in Debian
Squeeze on x86-64 hardware on both sides, with IET 1.4.20.2 as the
target.
Regarding the SCSI device in question, I see log entries like this:
Do you have the time stamps for those errors?
I've run the test again, here are the log entries with timestamps.
Mar 2 12:21:02 debian multipathd: sdc: directio checker reports path is down
Mar 2 12:21:07 debian multipathd: sdc: directio checker reports path is down
Mar 2 12:21:12 debian multipathd: sdc: directio checker reports path is down
Mar 2 12:21:17 debian multipathd: sdc: directio checker reports path is down
Mar 2 12:21:22 debian multipathd: sdc: directio checker reports path is down
Mar 2 12:21:25 debian kernel: [601122.678387] connection3:0: detected conn
error (1020)
Mar 2 12:21:27 debian iscsid: Kernel reported iSCSI connection 3:0 error
(1020) state (3)
Mar 2 12:21:27 debian multipathd: sdc: directio checker reports path is down
Mar 2 12:21:29 debian iscsid: connect to 193.225.36.17:3260 failed
(Connection refused)
Mar 2 12:21:32 debian multipathd: sdc: directio checker reports path is down
Mar 2 12:21:33 debian iscsid: connect to [ipaddress]:3260 failed (Connection
refused)
Mar 2 12:21:37 debian iscsid: connect to [ipaddress]:3260 failed (Connection
refused)
Mar 2 12:21:37 debian multipathd: sdc: directio checker reports path is down
[Stripped messages identical to the last two]
Mar 2 12:23:18 debian iscsid: connect to 193.225.36.17:3260 failed
(Connection refused)
Mar 2 12:23:22 debian iscsid: connect to 193.225.36.17:3260 failed
(Connection refused)
Mar 2 12:23:22 debian multipathd: sdc: directio checker reports path is down
Mar 2 12:23:26 debian iscsid: connect to 193.225.36.17:3260 failed
(Connection refused)
Mar 2 12:23:26 debian kernel: [601242.928075] session3: session recovery
timed out after 120 secs
Mar 2 12:23:27 debian multipathd: sdc: directio checker reports path is down
Mar 2 12:23:29 debian iscsid: connect to 193.225.36.17:3260 failed
(Connection refused)
Mar 2 12:23:32 debian multipathd: sdc: directio checker reports path is down
[Stripped messages identical to the last two]
Mar 2 12:26:15 debian iscsid: connect to 193.225.36.17:3260 failed
(Connection refused)
Mar 2 12:26:17 debian multipathd: sdc: directio checker reports path is down
Mar 2 12:26:19 debian iscsid: connection3:0 is operational after recovery (78
attempts)
Mar 2 12:26:22 debian multipathd: sdc: directio checker reports path is down
Mar 2 12:26:27 debian multipathd: sdc: directio checker reports path is down
Mar 2 12:26:32 debian multipathd: sdc: directio checker reports path is down
Mar 2 12:26:37 debian multipathd: sdc: directio checker reports path is down
Mar 2 12:26:42 debian multipathd: sdc: directio checker reports path is down
As you can see, I shut the target down at 12:21:25, and the recovery timed out
two minutes later as expected (at 12:23:26). I restarted the target at
12:26:19 and the session was reestablished, but the SCSI device still fails to
work, as reported by multipathd.
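
The intervals can be checked directly from the log above. This is a small sketch; the helper name elapsed is only illustrative, and the values are copied from the log lines quoted in this message.

```python
from datetime import datetime

def elapsed(start, end, fmt="%H:%M:%S"):
    """Whole seconds between two same-day syslog clock times."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# Syslog wall-clock times from the log above.
conn_error = "12:21:25"        # kernel: detected conn error (1020)
recovery_timeout = "12:23:26"  # kernel: session recovery timed out
reconnected = "12:26:19"       # iscsid: operational after recovery

print(elapsed(conn_error, recovery_timeout))  # 121
print(elapsed(conn_error, reconnected))       # 294

# The kernel's own timestamps (seconds since boot) give a finer figure
# for the same interval, matching the 120 second recovery timeout.
print(round(601242.928075 - 601122.678387, 2))  # 120.25
```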
Is everything going through 1 session?
No, there are two separate sessions to two separate machines. The backend
storage is the same: I'm trying to multipath to a storage area on shared
storage, served by two iSCSI gateways.
At this point the iSCSI layer should be up. If you run
iscsiadm -m session -P 3
does the session state indicate logged in, and do the device states
indicate running?
Yes, the session state is LOGGED_IN and the device state is running.
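
If this check has to be repeated after every test cycle, it could be scripted roughly as follows. This is a sketch: the sample text is an assumption shaped like typical iscsiadm output, not captured from the system above, and the function name session_summary is illustrative.

```python
import re

def session_summary(text):
    """Return (session_state, {disk: state}) parsed from the text
    printed by `iscsiadm -m session -P 3`."""
    match = re.search(r"iSCSI Session State:\s*(\S+)", text)
    session_state = match.group(1) if match else None
    disks = dict(re.findall(
        r"Attached scsi disk (\S+)\s+State:\s*(\S+)", text))
    return session_state, disks

# Assumed sample input; the real layout may differ between versions.
sample = """\
    iSCSI Connection State: LOGGED IN
    iSCSI Session State: LOGGED_IN
        Attached scsi disk sdc    State: running
"""

print(session_summary(sample))  # ('LOGGED_IN', {'sdc': 'running'})
```

On a live initiator the text could come from subprocess.check_output(["iscsiadm", "-m", "session", "-P", "3"], text=True). As this thread shows, both states can look healthy while I/O to the device still hangs, so this check alone is not sufficient.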
multipathd: sdc: directio checker reports path is down
If this happened around the same time as the recovery message before it,
it could have been a race.
No, this keeps repeating indefinitely after the session is reestablished, as
you can see above.
If at this point you send your own IO using SG/passthrough (something
like sg_turs /dev/sdc) does that succeed or fail?
It blocks forever. I can't stop it even with SIGTERM. It succeeds on the other
path.
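
Since the probe can hang and ignore signals, a test harness may want to run it with a deadline and never wait on it unconditionally. The following is a minimal sketch; the function name probe and the 10 second deadline are assumptions.

```python
import subprocess

def probe(cmd, timeout_secs):
    """Run cmd and return its exit status, or None if it has not
    finished within timeout_secs. A stuck child is never waited on
    without a deadline, so the caller cannot hang with it."""
    proc = subprocess.Popen(cmd,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        return proc.wait(timeout=timeout_secs)
    except subprocess.TimeoutExpired:
        # A process blocked inside the kernel may not react to this
        # signal; the caller still gets control back either way.
        proc.kill()
        return None

if __name__ == "__main__":
    status = probe(["sg_turs", "/dev/sdc"], 10)
    print("hung" if status is None else "exit status %d" % status)
```

A return value of None on one path and 0 on the other would match the behaviour described above.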
Cheers,
--
cc