On 08/01/2013 01:43 PM, Tracy Reed wrote:
On Wed, Jul 24, 2013 at 05:15:16PM PDT, Mike Christie spake thusly:
Did you bring the target back up and if so did you do it with the same
target name?
Sorry for the delay in getting back, been travelling on business. But thanks
very much for the reply!
Yes, I did bring the target back up, and with the same name, although some of
the LUNs have moved around because I rebuilt the machine to match its partner,
which the VMs mirror it against with RAID 1.
What is your replacement/recovery timeout setting in /etc/iscsi/iscsid.conf?
Looks like 120 but just in case, here's the entire contents:
iscsid.startup = /etc/rc.d/init.d/iscsid force-start
node.startup = automatic
node.leading_login = No
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.initial_login_retry_max = 8
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.xmit_thread_priority = -20
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
node.conn[0].iscsi.HeaderDigest = None
node.session.nr_sessions = 1
node.session.iscsi.FastAbort = Yes
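For anyone checking the same values on their own initiator, here is a minimal Python sketch that pulls the timeout settings out of a config like the one above. It assumes the plain "key = value" layout shown here, with "#" starting a comment line; the function name parse_iscsid_conf and the three-line sample are made up for illustration and are not part of open-iscsi.

```python
def parse_iscsid_conf(text):
    """Return a dict of key -> value from iscsid.conf-style text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Hypothetical sample: three of the settings quoted in this thread.
sample = """\
# timeouts
node.session.timeo.replacement_timeout = 120

node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
"""

conf = parse_iscsid_conf(sample)
# Keep only the session and connection timeout keys, as integers.
timeouts = {k: int(v) for k, v in conf.items() if ".timeo." in k}
for key in sorted(timeouts):
    print(key, "=", timeouts[key])
```

To run it against the real file, read /etc/iscsi/iscsid.conf and pass its contents to parse_iscsid_conf instead of the sample string.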
See anything amiss? I now have around 8 processes stuck on this system. I'm
going to have to reboot it this weekend to clear up the issue but I would
really like to find out what is really going on and how to avoid it before
taking such measures.
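A rough way to list which processes are stuck, sketched in Python. It assumes the hung processes are in uninterruptible sleep (state D), which this thread does not confirm, and that /proc/<pid>/stat has the usual "pid (comm) state ..." layout. The helper names parse_stat_line and blocked_processes are made up for illustration.

```python
import os


def parse_stat_line(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so take the first "(" and the last ")".
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes in state D."""
    found = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found  # no /proc on this system
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in blocked_processes():
        print(pid, comm)
```

If the list shows scan or udev helpers sitting in D state, that would be consistent with I/O waiting on a device that never came back, but it is only a hint and not a diagnosis.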
It sounds like the scsi scan IO is stuck on a target that disappeared
and never came back, or it is a CentOS scsi layer bug. Could you send
/var/log/messages?
The entire file is rather large but here are some of the messages relevant to
iscsi:
Jul 4 15:18:44 cpu03 kernel: connection8:0: detected conn error (1020)
Jul 4 15:18:45 cpu03 iscsid: Kernel reported iSCSI connection 8:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Jul 4 15:18:46 cpu03 kernel: connection6:0: detected conn error (1020)
Jul 4 15:18:46 cpu03 kernel: connection7:0: detected conn error (1020)
Jul 4 15:18:47 cpu03 iscsid: Kernel reported iSCSI connection 6:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Jul 4 15:18:47 cpu03 iscsid: Kernel reported iSCSI connection 7:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Jul 4 15:18:47 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
Jul 4 15:18:50 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
Jul 4 15:18:50 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
Jul 4 15:18:51 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
Jul 4 15:19:27 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to host)
Jul 4 15:19:30 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to host)
Jul 4 15:19:30 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to host)
Jul 4 15:19:33 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to host)
Jul 4 15:19:36 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to host)
(skip many of these no route to host messages, which happened while I was
rebuilding the target with IP 10.0.1.11)
Jul 4 15:20:45 cpu03 kernel: session8: session recovery timed out after 120 secs
Jul 4 15:20:45 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to host)
Jul 4 15:20:47 cpu03 kernel: session6: session recovery timed out after 120 secs
Jul 4 15:20:47 cpu03 kernel: session7: session recovery timed out after 120 secs
Jul 8 20:37:04 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
Jul 8 20:37:04 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
Jul 8 20:37:07 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
(skip lots of these connection refused messages)
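As a sanity check on the Jul 4 entries, here is a small Python sketch that measures the gap between the first "detected conn error" for a connection and the matching "session recovery timed out" message. It assumes that connectionN:0 belongs to sessionN, that each entry sits on one line, and that the year is 2013 (syslog timestamps carry no year). The function names timestamp and recovery_gaps are made up for illustration.

```python
import re
from datetime import datetime

CONN_ERR = re.compile(r"kernel: connection(\d+):\d+: detected conn error")
RECOVERY = re.compile(
    r"kernel: session(\d+): session recovery timed out after (\d+) secs"
)


def timestamp(line, year=2013):
    """Parse the leading 'Mon D HH:MM:SS' of a syslog line (year assumed)."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def recovery_gaps(lines):
    """Map session id -> (seconds from first conn error, reported timeout)."""
    first_error = {}
    gaps = {}
    for line in lines:
        m = CONN_ERR.search(line)
        if m:
            # Remember only the first error seen for each session.
            first_error.setdefault(m.group(1), timestamp(line))
            continue
        m = RECOVERY.search(line)
        if m and m.group(1) in first_error:
            delta = timestamp(line) - first_error[m.group(1)]
            gaps[m.group(1)] = (int(delta.total_seconds()), int(m.group(2)))
    return gaps


# Kernel lines copied from the Jul 4 entries above, one entry per line.
log = [
    "Jul 4 15:18:44 cpu03 kernel: connection8:0: detected conn error (1020)",
    "Jul 4 15:18:46 cpu03 kernel: connection6:0: detected conn error (1020)",
    "Jul 4 15:18:46 cpu03 kernel: connection7:0: detected conn error (1020)",
    "Jul 4 15:20:45 cpu03 kernel: session8: session recovery timed out after 120 secs",
    "Jul 4 15:20:47 cpu03 kernel: session6: session recovery timed out after 120 secs",
    "Jul 4 15:20:47 cpu03 kernel: session7: session recovery timed out after 120 secs",
]

for session, (gap, configured) in sorted(recovery_gaps(log).items()):
    print(f"session{session}: {gap}s after first error (timeout {configured}s)")
```

For these six lines each gap comes out at 121 seconds, which lines up with the 120 second replacement_timeout in the config plus about a second of syslog timestamp granularity.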
Jul 12 14:33:08 cpu03 kernel: connection8:0: detected conn error (1020)
Jul 12 14:33:08 cpu03 kernel: connection6:0: detected conn error (1020)
Jul 12 14:33:08 cpu03 kernel: connection7:0: detected conn error (1020)
Jul 12 14:33:09 cpu03 iscsid: conn 0 login rejected: initiator error -