Re: Rescan LUNs hangs

2013-08-01 Thread Tracy Reed
On Wed, Jul 24, 2013 at 05:15:16PM PDT, Mike Christie spake thusly:
 Did you bring the target back up and if so did you do it with the same
 target name?

Sorry for the delay in getting back, been travelling on business. But thanks
very much for the reply!

Yes, I did bring the target back up, and with the same name, although some of
the LUNs have moved around as I rebuilt the machine to match its partner,
which the VMs RAID 1 against.

 What is your replacement/recovery timeout setting in /etc/iscsi/iscsid.conf?

Looks like 120, but just in case, here's the entire file:

iscsid.startup = /etc/rc.d/init.d/iscsid force-start
node.startup = automatic
node.leading_login = No
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.initial_login_retry_max = 8
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.xmit_thread_priority = -20
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
node.conn[0].iscsi.HeaderDigest = None
node.session.nr_sessions = 1
node.session.iscsi.FastAbort = Yes
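
In case it matters: the values a running session is actually using can differ
from the file above, since node settings are baked in when the node record is
created. Assuming this version of iscsiadm supports print level 3, they can be
dumped with:

# Show negotiated parameters and timeouts for all active sessions
iscsiadm -m session -P 3 | grep -Ei 'timeout|Target:'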

See anything amiss? I now have around 8 stuck processes on this system. I'm
going to have to reboot it this weekend to clear things up, but I would really
like to understand what is actually going on, and how to avoid it, before
taking such measures.
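
If the stuck processes are sitting in uninterruptible sleep (D state), one
generic way to see where they are blocked, using nothing iscsi-specific, is:

# List D-state processes with the kernel symbol they are sleeping in (wchan)
ps -eo pid,stat,wchan:32,args | awk '$2 ~ /^D/'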

 It sounds like the SCSI scan IO is stuck on a target that disappeared
 and never came back, or it is a CentOS SCSI layer bug. Could you send
 the /var/log/messages?

The entire file is rather large but here are some of the messages relevant to
iscsi:

Jul  4 15:18:44 cpu03 kernel: connection8:0: detected conn error (1020)
Jul  4 15:18:45 cpu03 iscsid: Kernel reported iSCSI connection 8:0 error (1020 
- ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Jul  4 15:18:46 cpu03 kernel: connection6:0: detected conn error (1020)
Jul  4 15:18:46 cpu03 kernel: connection7:0: detected conn error (1020)
Jul  4 15:18:47 cpu03 iscsid: Kernel reported iSCSI connection 6:0 error (1020 
- ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Jul  4 15:18:47 cpu03 iscsid: Kernel reported iSCSI connection 7:0 error (1020 
- ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Jul  4 15:18:47 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection 
refused)
Jul  4 15:18:50 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection 
refused)
Jul  4 15:18:50 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection 
refused)
Jul  4 15:18:51 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection 
refused)
Jul  4 15:19:27 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to 
host)
Jul  4 15:19:30 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to 
host)
Jul  4 15:19:30 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to 
host)
Jul  4 15:19:33 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to 
host)
Jul  4 15:19:36 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to 
host)
[skipped many of these "No route to host" messages; they happened while I was
rebuilding the target with IP 10.0.1.11]
Jul  4 15:20:45 cpu03 kernel: session8: session recovery timed out after 120 
secs
Jul  4 15:20:45 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to 
host)
Jul  4 15:20:47 cpu03 kernel: session6: session recovery timed out after 120 
secs
Jul  4 15:20:47 cpu03 kernel: session7: session recovery timed out after 120 
secs
Jul  8 20:37:04 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection 
refused)
Jul  8 20:37:04 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection 
refused)
Jul  8 20:37:07 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection 
refused)
[skipped lots of these "Connection refused" messages]
Jul 12 14:33:08 cpu03 kernel: connection8:0: detected conn error (1020)
Jul 12 14:33:08 cpu03 kernel: connection6:0: detected conn error (1020)
Jul 12 14:33:08 cpu03 kernel: connection7:0: detected conn error (1020)
Jul 12 14:33:09 cpu03 iscsid: conn 0 login rejected: initiator error - target 
not found (02/03)
Jul 12 14:33:09 cpu03 iscsid: conn 0 login rejected: initiator error - target 
not found (02/03)
Jul 12 14:33:09 cpu03 
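
The "target not found (02/03)" rejections at the end suggest the node records
on this initiator no longer match the target names the rebuilt box advertises.
Assuming the portal is still 10.0.1.11:3260, one way to refresh them would be
to re-run discovery and log in again (the IQN below is only a placeholder):

# Refresh node records from the rebuilt target, then log back in
iscsiadm -m discovery -t sendtargets -p 10.0.1.11:3260
iscsiadm -m node -T iqn.2013-07.com.example:placeholder -p 10.0.1.11:3260 -l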

Re: Rescan LUNs hangs

2013-07-24 Thread Mike Christie
On 07/23/2013 07:32 PM, Tracy Reed wrote:
 Hello all,
 
 I am running iscsi-initiator-utils-6.2.0.872-13.el5 on a CentOS release 5.4
 initiator (will patch up at next reboot) against
 scsi-target-utils-1.0.24-2.el6.x86_64 on CentOS 6.4.
 
 I have Xen running on the initiator machine with luns from the target machine
 as storage. I actually have two target machines and do software raid1 in the
 VMs. 
 
 I needed to upgrade one of the target machines so I split the mirrors in the
 VMs and shutdown the target machine, upgraded some disk, reinstalled new OS
 etc. Now when I do /sbin/iscsiadm -m node -R on the initiator machine it 
 hangs.

Did you bring the target back up and if so did you do it with the same
target name?
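
You can compare what the initiator thinks it is logged into against what the
rebuilt box now advertises with something like this (the portal below is a
placeholder for whatever IP:port your target listens on):

# Target names of the sessions the initiator currently has open
iscsiadm -m session
# Target names the rebuilt target is advertising now
iscsiadm -m discovery -t sendtargets -p <portal>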

 The process seems uninterruptible. I can't kill -9 it, and it spins using 100%
 of a CPU. I think perhaps I should have logged the initiator out of the target
 machine before I shut it down. Now I have a bunch of hung processes and I
 can't access new disk volumes because I can't rescan the LUNs. 
 

What is your replacement/recovery timeout setting in /etc/iscsi/iscsid.conf?

It sounds like the SCSI scan IO is stuck on a target that disappeared
and never came back, or it is a CentOS SCSI layer bug. Could you send
the /var/log/messages?
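
In the meantime, rescanning per session instead of per node record may avoid
wedging on the dead target. I believe your version of iscsiadm supports it;
the SID comes from the iscsiadm -m session listing:

# Rescan a single live session by SID (e.g. SID 6), skipping dead node records
iscsiadm -m session -r 6 --rescan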
