Re: Rescan LUNs hangs
On 08/01/2013 01:43 PM, Tracy Reed wrote:
> On Wed, Jul 24, 2013 at 05:15:16PM PDT, Mike Christie spake thusly:
>> Did you bring the target back up and if so did you do it with the same
>> target name?
>
> Sorry for the delay in getting back, been travelling on business. But
> thanks very much for the reply! Yes, I did bring the target back up and
> with the same name, although some of the LUNs have moved around as I
> rebuilt the machine to match its partner, which the VMs RAID 1 against.
>
>> What is your replacement/recovery timeout setting in
>> /etc/iscsi/iscsid.conf?
>
> Looks like 120, but just in case, here's the entire contents:
>
> iscsid.startup = /etc/rc.d/init.d/iscsid force-start
> node.startup = automatic
> node.leading_login = No
> node.session.timeo.replacement_timeout = 120
> node.conn[0].timeo.login_timeout = 15
> node.conn[0].timeo.logout_timeout = 15
> node.conn[0].timeo.noop_out_interval = 5
> node.conn[0].timeo.noop_out_timeout = 5
> node.session.err_timeo.abort_timeout = 15
> node.session.err_timeo.lu_reset_timeout = 30
> node.session.err_timeo.tgt_reset_timeout = 30
> node.session.initial_login_retry_max = 8
> node.session.cmds_max = 128
> node.session.queue_depth = 32
> node.session.xmit_thread_priority = -20
> node.session.iscsi.InitialR2T = No
> node.session.iscsi.ImmediateData = Yes
> node.session.iscsi.FirstBurstLength = 262144
> node.session.iscsi.MaxBurstLength = 16776192
> node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
> node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
> discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
> node.conn[0].iscsi.HeaderDigest = None
> node.session.nr_sessions = 1
> node.session.iscsi.FastAbort = Yes
>
> See anything amiss? I now have around 8 processes stuck on this system.
> I'm going to have to reboot it this weekend to clear up the issue, but I
> would really like to find out what is really going on, and how to avoid
> it, before taking such measures.
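The node.session.timeo.replacement_timeout of 120 above means stuck IO waits two minutes for session recovery before being failed upward. For reference, the value actually in effect on each live session can be read from sysfs, and the stored node record can be changed without hand-editing iscsid.conf. A minimal sketch (session numbers vary; the IQN below is a placeholder, while 10.0.1.11:3260 is the portal from this thread):

    # Read the recovery timeout in effect on each live session:
    for f in /sys/class/iscsi_session/session*/recovery_tmo; do
        echo "$f: $(cat "$f")"
    done

    # Update the stored node record; this takes effect on the next
    # login, not on sessions that are already established.
    # Substitute your own target IQN and portal:
    iscsiadm -m node -T iqn.2013-07.example:target0 -p 10.0.1.11:3260 \
        -o update -n node.session.timeo.replacement_timeout -v 30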
>> It sounds like the scsi scan IO is stuck on a target that disappeared
>> and never came back, or it is a CentOS scsi layer bug. Could you send
>> the /var/log/messages.
>
> The entire file is rather large, but here are some of the messages
> relevant to iscsi:
>
> Jul 4 15:18:44 cpu03 kernel: connection8:0: detected conn error (1020)
> Jul 4 15:18:45 cpu03 iscsid: Kernel reported iSCSI connection 8:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Jul 4 15:18:46 cpu03 kernel: connection6:0: detected conn error (1020)
> Jul 4 15:18:46 cpu03 kernel: connection7:0: detected conn error (1020)
> Jul 4 15:18:47 cpu03 iscsid: Kernel reported iSCSI connection 6:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Jul 4 15:18:47 cpu03 iscsid: Kernel reported iSCSI connection 7:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Jul 4 15:18:47 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
> Jul 4 15:18:50 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
> Jul 4 15:18:50 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
> Jul 4 15:18:51 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
> Jul 4 15:19:27 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to host)
> Jul 4 15:19:30 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to host)
> Jul 4 15:19:30 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to host)
> Jul 4 15:19:33 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to host)
> Jul 4 15:19:36 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to host)
>
> (skipping many of these "no route to host" messages; they happened while
> I was rebuilding the target with IP 10.0.1.11)
>
> Jul 4 15:18:47 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
> Jul 4 15:18:50 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
> Jul 4 15:18:50 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
> Jul 4 15:18:51 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
> Jul 4 15:20:45 cpu03 kernel: session8: session recovery timed out after 120 secs
> Jul 4 15:20:45 cpu03 iscsid: connect to 10.0.1.11:3260 failed (No route to host)
> Jul 4 15:20:47 cpu03 kernel: session6: session recovery timed out after 120 secs
> Jul 4 15:20:47 cpu03 kernel: session7: session recovery timed out after 120 secs
> Jul 8 20:37:04 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
> Jul 8 20:37:04 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
> Jul 8 20:37:07 cpu03 iscsid: connect to 10.0.1.11:3260 failed (Connection refused)
>
> (skipping lots of these "connection refused" messages)
>
> Jul 12 14:33:08 cpu03 kernel: connection8:0: detected conn error (1020)
> Jul 12 14:33:08 cpu03 kernel: connection6:0: detected conn error (1020)
> Jul 12 14:33:08 cpu03 kernel: connection7:0: detected conn error (1020)
> Jul 12 14:33:09 cpu03 iscsid: conn 0 login rejected: initiator error - target not found (02/03)
> Jul 12 14:33:09 cpu03 iscsid: conn 0 login rejected: initiator error - target not found (02/03)
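If the hung rescan and the ~8 stuck processes are blocked in uninterruptible sleep (which would also explain kill -9 having no effect), their kernel stacks would show whether they are sitting in the scsi scan path on the vanished target, which is what Mike suspects. A rough sketch for gathering that, assuming sysrq is available on the box:

    # List tasks in uninterruptible (D) sleep and what they are waiting on:
    ps -eo pid,stat,wchan:30,cmd | awk 'NR==1 || $2 ~ /D/'

    # Dump kernel stacks of all blocked tasks into the kernel log
    # (readable afterwards via dmesg or /var/log/messages):
    echo 1 > /proc/sys/kernel/sysrq
    echo w > /proc/sysrq-trigger

    # If the process is genuinely spinning at 100% CPU rather than
    # sleeping, 'l' backtraces all active CPUs instead, on kernels
    # that support it:
    echo l > /proc/sysrq-trigger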
Re: Rescan LUNs hangs
On 07/23/2013 07:32 PM, Tracy Reed wrote:
> Hello all, I am running iscsi-initiator-utils-6.2.0.872-13.el5 on CentOS
> release 5.4 (will patch up at next reboot) as the initiator, against
> scsi-target-utils-1.0.24-2.el6.x86_64 on CentOS 6.4. I have Xen running
> on the initiator machine with LUNs from the target machine as storage. I
> actually have two target machines and do software RAID 1 in the VMs. I
> needed to upgrade one of the target machines, so I split the mirrors in
> the VMs, shut down the target machine, upgraded some disks, reinstalled
> a new OS, etc. Now when I run /sbin/iscsiadm -m node -R on the initiator
> machine, it hangs.

Did you bring the target back up, and if so, did you do it with the same target name?

> The process seems uninterruptible. I can't kill -9 it, and it spins
> using 100% of a CPU. I think perhaps before I shut down the target
> machine I should have logged the initiator out of it. Now I have a bunch
> of hung processes, and I can't access new disk volumes because I can't
> rescan the LUNs.

What is your replacement/recovery timeout setting in /etc/iscsi/iscsid.conf?

It sounds like the scsi scan IO is stuck on a target that disappeared and never came back, or it is a CentOS scsi layer bug. Could you send the /var/log/messages.
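For anyone hitting the same thing: the clean sequence hinted at above ("I should have logged the initiator out of it") would look roughly like the following before a planned target outage. The IQN is a placeholder for the real target name; 10.0.1.11:3260 is the portal from this thread:

    # List active sessions to find the target being taken down:
    iscsiadm -m session

    # Log out of that target before shutting it down, so no session is
    # left behind to hang (substitute your own IQN and portal):
    iscsiadm -m node -T iqn.2013-07.example:target0 -p 10.0.1.11:3260 --logout

    # Once the rebuilt target is back up, log in again and rescan:
    iscsiadm -m node -T iqn.2013-07.example:target0 -p 10.0.1.11:3260 --login
    iscsiadm -m node -R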