Re: Target or Initiator closing connection?
On 04/14/2011 02:17 PM, Mike Christie wrote: This message means we have sent 2592 bytes out of 4096. The 0 means that we are trying to send the rest of the data, but we are not able to sent anything (0 is the number of bytes we were able to send when we did sendpage/sendmsg). So send the full logs so I can see what else is in there. Ok in your current logs we have Apr 13 11:07:23 oim61024008 kernel: connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 559699, last ping 560949, now 562199 Then looking over session3/connection3:0 we have connection3:0: iscsi_tcp_segment_done copied 2632 0 size 4096 xmit from Apr 13 11:07:23 to the middle of Apr 13 11:07:13. So for 10 seconds we are trying to do sendpage but it keeps returning 0. For some reason the network layer is just not sending new IO. The network layer is probably retransmitting and so the network layer runs out of writespace. Did you by any chance also take a network trace (tcpdump or ethereal/wireshark)? Could you send it? Can you run a rhel kernel? I can add some debugging in to the kernel. -- You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
Re: Target or Initiator closing connection?
On 04/14/2011 03:12 PM, Joe Hoot wrote: sorry.. ignore my last email this time :P We've been doing a ton of tests, so I just confused my self. That test is from a default OVM system that had its default kernel and iscsi tools. We also had disconnects in an OEL5.6 environment with latest released 872 as well. Do you know if the iscsi-initiator-utils-6.2.0.872-6.0.2.el5 rpm contains that fix patch as well? If so, I have the dt tests running as of 16:04 in that OEL5.6 environment (2.6.18-238.el5). The patches are kernel patches. Red Hat's 2.6.18-238.el5 kernel has the fix pointed to by the http://git.kernel.org link in the other mail. I do not think you are hitting that though. In the logs you sent we just cannot send data. Send the logs and we will know for sure. -- You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
Re: Target or Initiator closing connection?
sorry.. ignore my last email this time :P We've been doing a ton of tests, so I just confused my self. That test is from a default OVM system that had its default kernel and iscsi tools. We also had disconnects in an OEL5.6 environment with latest released 872 as well. Do you know if the iscsi-initiator-utils-6.2.0.872-6.0.2.el5 rpm contains that fix patch as well? If so, I have the dt tests running as of 16:04 in that OEL5.6 environment (2.6.18-238.el5). I figured instead of testing in OVM, if I can reproduce it with OEL5.6, it probably has a larger audience. On Apr 14, 2011, at 3:28 PM, Joe Hoot wrote: > This test was done on an OEL5.6 system, using the latest released RHEL kernel > + 872 code. > > Mike, we just discovered that our Cisco switch didn't have FlowControl > enabled properly. It looks like it _should_ have autonegotiated to on, but > may not have. We've manually forced it on and are running another test at > the moment. I don't want to eat up too much of your time, so let me gather > the new set of info from a new test fist. Then I'll attach the full logs. > > Thanks, > Joe > > > > On Apr 14, 2011, at 3:10 PM, Mike Christie wrote: > >> On 04/14/2011 06:39 AM, Joe Hoot wrote: >>> Here's what I am seeing (roughly... with lines scraped in between.. and I'm >>> only grepping for connect7/session7): >>> >>> Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >>> copied 2592 0 size 4096 xmit >>> ...lots of the above types of messages... >>> Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >>> copied 2592 0 size 4096 xmit >>> Apr 11 13:22:24 oim61024009 kernel: connection7:0: >>> iscsi_check_transport_timeouts Setting next tmo 2279563 >>> Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >>> copied 2592 0 size 4096 xmit >>> ...lots of the above types of messages... >>> Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >>> copied 2592 0 size 4096 xmit >>> Apr 11 13:22:24 oim61024009 kernel: connection7:0: >>> iscsi_check_transport_timeouts Setting next tmo 2279563 >>> Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >>> copied 2592 0 size 4096 xmit >>> ...lots of the above types of messages... >>> Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >>> copied 2592 0 size 4096 xmit >>> Apr 11 13:22:25 oim61024009 kernel: connection7:0: >>> iscsi_check_transport_timeouts Sending nopout as ping >>> Apr 11 13:22:25 oim61024009 kernel: connection7:0: >>> iscsi_check_transport_timeouts Setting next tmo 2280813 >>> Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >>> copied 2592 0 size 4096 xmit >>> ...lots of the above types of messages... >>> Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >>> copied 2592 0 size 4096 xmit >> >> Does it always say "xmit" in the iscsi_tcp_segment_done messages from >> 13:22:20 - 13:22:30 or does it sometimes say "recv"? Just send the full >> logs actually. >> >> And are you running with that kernel fix (I think oracle merged if for >> you right). This fix: >> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=4c48a82935f833d94fcf44c2b0c5d2922acfc77a;hp=d1acfae514425d680912907c6554852f1e258551 >> >> >> If there is no "recv"s in there then you need the attached patch. No >> idea if this applies to the oracle kernel you are using. >> >> The weird thing though would be that it means for 10 seconds we just >> send data and never get a response to any of the pdus we have sent. >> >> What is the output of iscsiadm -m session -P 2 when you are logged into >> the target? >> >>> Apr 11 13:22:30 oim61024009 kernel: connection7:0: ping timeout of 5 secs >>> expired, recv timeout 5, last rx 2278313, last ping 2279563, now 2280813 >> >> -- >> You received this message because you are subscribed to the Google Groups >> "open-iscsi" group. >> To post to this group, send email to open-iscsi@googlegroups.com. >> To unsubscribe from this group, send email to >> open-iscsi+unsubscr...@googlegroups.com. >> For more options, visit this group at >> http://groups.google.com/group/open-iscsi?hl=en. >> >> > > -- > You received this message because you are subscribed to the Google Groups > "open-iscsi" group. > To post to this group, send email to open-iscsi@googlegroups.com. > To unsubscribe from this group, send email to > open-iscsi+unsubscr...@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/open-iscsi?hl=en. > -- You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
Re: Target or Initiator closing connection?
This test was done on an OEL5.6 system, using the latest released RHEL kernel + 872 code. Mike, we just discovered that our Cisco switch didn't have FlowControl enabled properly. It looks like it _should_ have autonegotiated to on, but may not have. We've manually forced it on and are running another test at the moment. I don't want to eat up too much of your time, so let me gather the new set of info from a new test fist. Then I'll attach the full logs. Thanks, Joe On Apr 14, 2011, at 3:10 PM, Mike Christie wrote: > On 04/14/2011 06:39 AM, Joe Hoot wrote: >> Here's what I am seeing (roughly... with lines scraped in between.. and I'm >> only grepping for connect7/session7): >> >> Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >> copied 2592 0 size 4096 xmit >> ...lots of the above types of messages... >> Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >> copied 2592 0 size 4096 xmit >> Apr 11 13:22:24 oim61024009 kernel: connection7:0: >> iscsi_check_transport_timeouts Setting next tmo 2279563 >> Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >> copied 2592 0 size 4096 xmit >> ...lots of the above types of messages... >> Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >> copied 2592 0 size 4096 xmit >> Apr 11 13:22:24 oim61024009 kernel: connection7:0: >> iscsi_check_transport_timeouts Setting next tmo 2279563 >> Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >> copied 2592 0 size 4096 xmit >> ...lots of the above types of messages... >> Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >> copied 2592 0 size 4096 xmit >> Apr 11 13:22:25 oim61024009 kernel: connection7:0: >> iscsi_check_transport_timeouts Sending nopout as ping >> Apr 11 13:22:25 oim61024009 kernel: connection7:0: >> iscsi_check_transport_timeouts Setting next tmo 2280813 >> Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >> copied 2592 0 size 4096 xmit >> ...lots of the above types of messages... >> Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done >> copied 2592 0 size 4096 xmit > > Does it always say "xmit" in the iscsi_tcp_segment_done messages from > 13:22:20 - 13:22:30 or does it sometimes say "recv"? Just send the full > logs actually. > > And are you running with that kernel fix (I think oracle merged if for > you right). This fix: > http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=4c48a82935f833d94fcf44c2b0c5d2922acfc77a;hp=d1acfae514425d680912907c6554852f1e258551 > > > If there is no "recv"s in there then you need the attached patch. No > idea if this applies to the oracle kernel you are using. > > The weird thing though would be that it means for 10 seconds we just > send data and never get a response to any of the pdus we have sent. > > What is the output of iscsiadm -m session -P 2 when you are logged into > the target? > >> Apr 11 13:22:30 oim61024009 kernel: connection7:0: ping timeout of 5 secs >> expired, recv timeout 5, last rx 2278313, last ping 2279563, now 2280813 > > -- > You received this message because you are subscribed to the Google Groups > "open-iscsi" group. > To post to this group, send email to open-iscsi@googlegroups.com. > To unsubscribe from this group, send email to > open-iscsi+unsubscr...@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/open-iscsi?hl=en. > > -- You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
Re: Target or Initiator closing connection?
On 04/14/2011 02:10 PM, Mike Christie wrote: On 04/14/2011 06:39 AM, Joe Hoot wrote: Here's what I am seeing (roughly... with lines scraped in between.. and I'm only grepping for connect7/session7): Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Setting next tmo 2279563 Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Setting next tmo 2279563 Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Sending nopout as ping Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Setting next tmo 2280813 Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Ignore the previous mail. Your mail was the first one I read today and my eyes are not wide open yet :) Send the full logs. And were you using that patch to send the sk_sndtimeo lower? It looks like you were. Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit This message means we have sent 2592 bytes out of 4096. The 0 means that we are trying to send the rest of the data, but we are not able to sent anything (0 is the number of bytes we were able to send when we did sendpage/sendmsg). So send the full logs so I can see what else is in there. -- You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
Re: Target or Initiator closing connection?
On 04/14/2011 06:39 AM, Joe Hoot wrote: Here's what I am seeing (roughly... with lines scraped in between.. and I'm only grepping for connect7/session7): Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Setting next tmo 2279563 Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Setting next tmo 2279563 Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Sending nopout as ping Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Setting next tmo 2280813 Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Does it always say "xmit" in the iscsi_tcp_segment_done messages from 13:22:20 - 13:22:30 or does it sometimes say "recv"? Just send the full logs actually. And are you running with that kernel fix (I think oracle merged if for you right). This fix: http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=4c48a82935f833d94fcf44c2b0c5d2922acfc77a;hp=d1acfae514425d680912907c6554852f1e258551 If there is no "recv"s in there then you need the attached patch. No idea if this applies to the oracle kernel you are using. The weird thing though would be that it means for 10 seconds we just send data and never get a response to any of the pdus we have sent. What is the output of iscsiadm -m session -P 2 when you are logged into the target? Apr 11 13:22:30 oim61024009 kernel: connection7:0: ping timeout of 5 secs expired, recv timeout 5, last rx 2278313, last ping 2279563, now 2280813 -- You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en. diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c index a860452..c188267 100644 --- a/drivers/scsi/iscsi_tcp.c +++ b/drivers/scsi/iscsi_tcp.c @@ -274,7 +274,8 @@ static int iscsi_sw_tcp_xmit_segment(struct iscsi_tcp_conn *tcp_conn, if (r < 0) { iscsi_tcp_segment_unmap(segment); return r; - } + } else if (r > 0) + tcp_conn->iscsi_conn->last_xfer = jiffies; copied += r; } return copied; diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c index 0c550d5..5d981f2 100644 --- a/drivers/scsi/libiscsi.c +++ b/drivers/scsi/libiscsi.c @@ -610,7 +610,7 @@ void iscsi_complete_scsi_task(struct iscsi_task *task, ISCSI_DBG_SESSION(conn->session, "[itt 0x%x]\n", task->itt); - conn->last_recv = jiffies; + conn->last_xfer = jiffies; __iscsi_update_cmdsn(conn->session, exp_cmdsn, max_cmdsn); iscsi_complete_task(task, ISCSI_TASK_COMPLETED); } @@ -1130,7 +1130,7 @@ int __iscsi_complete_pdu(struct iscsi_conn *conn, struct iscsi_hdr *hdr, struct iscsi_task *task; uint32_t itt; - conn->last_recv = jiffies; + conn->last_xfer = jiffies; rc = iscsi_verify_itt(conn, hdr->itt); if (rc) return rc; @@ -1902,7 +1902,7 @@ static void iscsi_start_tx(struct iscsi_conn *conn) static int iscsi_has_ping_timed_out(struct iscsi_conn *conn) { if (conn->ping_task && - time_before_eq(conn->last_recv + (conn->recv_timeout * HZ) + + time_before_eq(conn->last_xfer+ (conn->recv_timeout * HZ) + (conn->ping_timeout * HZ), jiffies)) return 1; else @@ -2047,7 +2047,7 @@ static void iscsi_check_transport_timeouts(unsigned long data) { struct iscsi_conn *conn = (struct iscsi_conn *)data; struct iscsi_session *session = conn->session; - unsigned long recv_timeout, next_timeout = 0, last_recv; + unsigned long recv_timeout, next_timeout = 0, last_xfer; s
Re: Target or Initiator closing connection?
sorry... I'm confusing the issue... the messages that I showed below were from a iSCSI Disconnect test that we had done while connecting to the EqualLogic. I will re-setup my test tot he tgtd to see what it and the initiator show. I will take you suggestion of writting to dev/null. On Apr 14, 2011, at 7:39 AM, Joe Hoot wrote: > Here's what I am seeing (roughly... with lines scraped in between.. and I'm > only grepping for connect7/session7): > > Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done > copied 2592 0 size 4096 xmit > ...lots of the above types of messages... > Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done > copied 2592 0 size 4096 xmit > Apr 11 13:22:24 oim61024009 kernel: connection7:0: > iscsi_check_transport_timeouts Setting next tmo 2279563 > Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done > copied 2592 0 size 4096 xmit > ...lots of the above types of messages... > Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done > copied 2592 0 size 4096 xmit > Apr 11 13:22:24 oim61024009 kernel: connection7:0: > iscsi_check_transport_timeouts Setting next tmo 2279563 > Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done > copied 2592 0 size 4096 xmit > ...lots of the above types of messages... > Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done > copied 2592 0 size 4096 xmit > Apr 11 13:22:25 oim61024009 kernel: connection7:0: > iscsi_check_transport_timeouts Sending nopout as ping > Apr 11 13:22:25 oim61024009 kernel: connection7:0: > iscsi_check_transport_timeouts Setting next tmo 2280813 > Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done > copied 2592 0 size 4096 xmit > ...lots of the above types of messages... > Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done > copied 2592 0 size 4096 xmit > Apr 11 13:22:30 oim61024009 kernel: connection7:0: ping timeout of 5 secs > expired, recv timeout 5, last rx 2278313, last ping 2279563, now 2280813 > Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done > copied 2592 0 size 4096 xmit > Apr 11 13:22:30 oim61024009 kernel: connection7:0: detected conn error (1011) > Apr 11 13:22:30 oim61024009 kernel: session7: iscsi_data_xmit Tx suspended! > Apr 11 13:22:30 oim61024009 kernel: session7: iscsi_start_session_recovery > blocking session > Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing > pending sc df493a80 itt 0x108 > ...etc..etc.. > Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing > pending sc d81de300 itt 0x159 > Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing > requeued sc d89b6800 itt 0xd8 > Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_cleanup_task > pending r2t dropped > Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing > requeued sc ca305ac0 itt 0xd9 > Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_cleanup_task > pending r2t dropped > Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing > requeued sc d81de440 itt 0xdb > Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_cleanup_task > pending r2t dropped > Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing > requeued sc ca305700 itt 0xd7 > ...pattern continues.. > Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing > requeued sc d847ee40 itt 0xfa > Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_cleanup_task > pending r2t dropped > Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing in > progress sc c9ac8840 itt 0xfb > Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing in > progress sc d1e04800 itt 0xfc > Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing in > progress sc df0cdc00 itt 0xfd > ...pattern continues... > Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing in > progress sc d8811700 itt 0x107 > Apr 11 13:22:30 oim61024009 kernel: session7: flush_control_queues flushing > pending mgmt task itt 0x15a > Apr 11 13:22:30 oim61024009 kernel: session7: iscsi2_queuecommand cmd 0x2a > rejected (7) > Apr 11 13:22:32 oim61024009 kernel: connection7:0: iscsi_tcp_hdr_recv_prep > (digest disabled) > Apr 11 13:22:32 oim61024009 kernel: connection7:0: iscsi_tcp_task_init mtask > deq [itt 0x0] > Apr 11 13:22:32 oim61024009 kernel: connection7:0: > iscsi_sw_tcp_send_hdr_prep digest disabled > Apr 11 13:22:32 oim61024009 kernel: connection7:0: > iscsi_sw_tcp_send_linear_data_prep datalen=208 digest disabled > Apr 11 13:22:32 oim61024009 kernel: session7: iscsi_prep_mgmt_task mgmtpdu > [op 0x3 hdr->itt 0x1000 datalen 208] > Apr 11 13:22:32 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done > copied 0 0 size 48 xmit > >
Re: Target or Initiator closing connection?
Here's what I am seeing (roughly... with lines scraped in between.. and I'm only grepping for connect7/session7): Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Setting next tmo 2279563 Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Setting next tmo 2279563 Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Sending nopout as ping Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Setting next tmo 2280813 Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:30 oim61024009 kernel: connection7:0: ping timeout of 5 secs expired, recv timeout 5, last rx 2278313, last ping 2279563, now 2280813 Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:30 oim61024009 kernel: connection7:0: detected conn error (1011) Apr 11 13:22:30 oim61024009 kernel: session7: iscsi_data_xmit Tx suspended! Apr 11 13:22:30 oim61024009 kernel: session7: iscsi_start_session_recovery blocking session Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing pending sc df493a80 itt 0x108 ...etc..etc.. Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing pending sc d81de300 itt 0x159 Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing requeued sc d89b6800 itt 0xd8 Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_cleanup_task pending r2t dropped Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing requeued sc ca305ac0 itt 0xd9 Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_cleanup_task pending r2t dropped Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing requeued sc d81de440 itt 0xdb Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_cleanup_task pending r2t dropped Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing requeued sc ca305700 itt 0xd7 ...pattern continues.. Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing requeued sc d847ee40 itt 0xfa Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_cleanup_task pending r2t dropped Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing in progress sc c9ac8840 itt 0xfb Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing in progress sc d1e04800 itt 0xfc Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing in progress sc df0cdc00 itt 0xfd ...pattern continues... Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing in progress sc d8811700 itt 0x107 Apr 11 13:22:30 oim61024009 kernel: session7: flush_control_queues flushing pending mgmt task itt 0x15a Apr 11 13:22:30 oim61024009 kernel: session7: iscsi2_queuecommand cmd 0x2a rejected (7) Apr 11 13:22:32 oim61024009 kernel: connection7:0: iscsi_tcp_hdr_recv_prep (digest disabled) Apr 11 13:22:32 oim61024009 kernel: connection7:0: iscsi_tcp_task_init mtask deq [itt 0x0] Apr 11 13:22:32 oim61024009 kernel: connection7:0: iscsi_sw_tcp_send_hdr_prep digest disabled Apr 11 13:22:32 oim61024009 kernel: connection7:0: iscsi_sw_tcp_send_linear_data_prep datalen=208 digest disabled Apr 11 13:22:32 oim61024009 kernel: session7: iscsi_prep_mgmt_task mgmtpdu [op 0x3 hdr->itt 0x1000 datalen 208] Apr 11 13:22:32 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 0 0 size 48 xmit On Apr 14, 2011, at 6:45 AM, Mike Christie wrote: > On 04/14/2011 05:12 AM, Joe Hoot wrote: >> Mike, >> >> I'm still digging into this "iSCSI Disconnect" issue that we've been dealing >> with for about two years here in SUNY. The iscsi-initiator that I'm using >> is this: >> >> iscsi-initiator-utils-6.2.0.872-6.0.2.el5 >> >> I've been running dt tests to the equallogic in our Oracle VM environment >> with the following set: >> >> echo 1> /sys/module/libiscsi2/parameters/debug_libiscsi >> echo 1
Re: Target or Initiator closing connection?
On 04/14/2011 05:12 AM, Joe Hoot wrote: Mike, I'm still digging into this "iSCSI Disconnect" issue that we've been dealing with for about two years here in SUNY. The iscsi-initiator that I'm using is this: iscsi-initiator-utils-6.2.0.872-6.0.2.el5 I've been running dt tests to the equallogic in our Oracle VM environment with the following set: echo 1> /sys/module/libiscsi2/parameters/debug_libiscsi echo 1> /sys/module/libiscsi_tcp/parameters/debug_libiscsi_tcp echo 1> /sys/module/iscsi_tcp/parameters/debug_iscsi_tcp When I try to look for "echo 1> /sys/module/libiscsi/parameters/debug_libiscsi_eh" I don't see that as an available option. Are the debug lines above correct? Are there other debug messages that I should be gathering? It is only in newer kernels. debug_libiscsi will print out what debug_libiscsi_eh does plus more. [root@oim61024001 src]# ls -l /sys/module/libiscsi libiscsi2/libiscsi_tcp/ We have done numerous tests and different hardware. I have eliminated OVM and have been testing this now on OEL5.6. The scenario that seems to break connections quickly is the following: - cluster of 5 nodes using OCFS2 (I will be trying to rule OCFS2 and dm-multipath out shortly) - 5-7 iSCSI volumes connected to each of those 5 nodes - 4 threads of dt running against each of 5-7 volumes from each host = 28 threads of dt slamming the volumes per host = 140 threads of dt per cluster. - test only lasts about 2-5 minutes before I start seeing ping timeouts and disconnects. The issue that we've seen thus far is mainly with EqualLogic and open-iscsi. From what EQL is telling us, the initiator is "aborting" the connection. But from the initiator-side, we just see "ping timeout" messages (and then the connection eventually goes away). When the ping times out, the initiator will abort/drop the connection. It will then try to relogin. So you should probably be seeing the ping/nop timeout message, then a conn error 1011, then a message about reconnected in X retries, right? The thing is that below with tgtd we seem to be seeing scsi commands timing out. On the initiator logs did you see ping/nop timeout messages when you hit the problem below with tgtd? With tgtd let's take take the disks out of it and just test iscsi. So do something like this: tgtadm --op new --mode target --tid 1 -T iqn.2001-04.com.pertest tgtadm --op new --mode logicalunit --tid 1 --lun 1 --bstype null -b /dev/null tgtadm --op bind --mode target --tid 1 -I ALL This of course will not do any real IO so you cannot do tests that verify day. Just do really harsh heavy read/write tests. When you do IO to fake disks do you see any ping/nop or scsi timeout errors? We recently saw a thread (Apr 4) regarding cfq scheduler. So we quickly tested noop and deadline, just to see if that would change anything-- it didn't. So my most recent test was to try out a different target, just to see if we could rule out the EqualLogic. Each time I changed from EQLX to the tgtd, I would reset (and rescan in my volumes) the iscsid.conf's "FastAbort = No", or yes (if I was testing tgtd), to conform with EQLX's best practices. The abort setting is not going to help at all for any target. That basically kicks in after you have already hit the problem you are hitting. It just speeds up the handling of the problem. So at this point, after I get dm-multipath and OCFS2 out of the equation, it will be down to a tartget + kernel/initiator + I/O scheduler and I want to make sure that I'm getting all the debug information that I might need to analyze what is going on. Are there any other debug tunables that you might recommend adding to my script? No, the ones you have above are all there is. Maybe tcpdump -w iscsi.out -i ethXYZ Did you have the initiator /var/log/messages with debugging on when you hit the problem below? If so send them. On Apr 14, 2011, at 12:02 AM, Mike Christie wrote: On 04/12/2011 12:43 PM, Joe Hoot wrote: I'm trying to understand the following messages: Apr 12 13:23:52 oim60025001 tgtd: conn_close(88) connection closed 0x94d80c4 1 lots of the above messages... Apr 12 13:37:19 oim60025001 tgtd: abort_task_set(979) found 271 0 lots of the above messages... Apr 12 13:37:27 oim60025001 tgtd: abort_cmd(955) found e9 e lots of the above messages... Apr 12 13:39:08 oim60025001 tgtd: conn_close(88) connection closed 0xa9ab8ec 1 -- You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
Re: Target or Initiator closing connection?
Mike, I'm still digging into this "iSCSI Disconnect" issue that we've been dealing with for about two years here in SUNY. The iscsi-initiator that I'm using is this: iscsi-initiator-utils-6.2.0.872-6.0.2.el5 I've been running dt tests to the equallogic in our Oracle VM environment with the following set: echo 1 > /sys/module/libiscsi2/parameters/debug_libiscsi echo 1 > /sys/module/libiscsi_tcp/parameters/debug_libiscsi_tcp echo 1 > /sys/module/iscsi_tcp/parameters/debug_iscsi_tcp When I try to look for "echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_eh" I don't see that as an available option. Are the debug lines above correct? Are there other debug messages that I should be gathering? [root@oim61024001 src]# ls -l /sys/module/libiscsi libiscsi2/libiscsi_tcp/ We have done numerous tests and different hardware. I have eliminated OVM and have been testing this now on OEL5.6. The scenario that seems to break connections quickly is the following: - cluster of 5 nodes using OCFS2 (I will be trying to rule OCFS2 and dm-multipath out shortly) - 5-7 iSCSI volumes connected to each of those 5 nodes - 4 threads of dt running against each of 5-7 volumes from each host = 28 threads of dt slamming the volumes per host = 140 threads of dt per cluster. - test only lasts about 2-5 minutes before I start seeing ping timeouts and disconnects. The issue that we've seen thus far is mainly with EqualLogic and open-iscsi. From what EQL is telling us, the initiator is "aborting" the connection. But from the initiator-side, we just see "ping timeout" messages (and then the connection eventually goes away). We recently saw a thread (Apr 4) regarding cfq scheduler. So we quickly tested noop and deadline, just to see if that would change anything-- it didn't. So my most recent test was to try out a different target, just to see if we could rule out the EqualLogic. Each time I changed from EQLX to the tgtd, I would reset (and rescan in my volumes) the iscsid.conf's "FastAbort = No", or yes (if I was testing tgtd), to conform with EQLX's best practices. So at this point, after I get dm-multipath and OCFS2 out of the equation, it will be down to a tartget + kernel/initiator + I/O scheduler and I want to make sure that I'm getting all the debug information that I might need to analyze what is going on. Are there any other debug tunables that you might recommend adding to my script? On Apr 14, 2011, at 12:02 AM, Mike Christie wrote: > On 04/12/2011 12:43 PM, Joe Hoot wrote: >> I'm trying to understand the following messages: >> >> Apr 12 13:23:52 oim60025001 tgtd: conn_close(88) connection closed >> 0x94d80c4 1 >> lots of the above messages... >> Apr 12 13:37:19 oim60025001 tgtd: abort_task_set(979) found 271 0 >> lots of the above messages... >> Apr 12 13:37:27 oim60025001 tgtd: abort_cmd(955) found e9 e >> lots of the above messages... >> Apr 12 13:39:08 oim60025001 tgtd: conn_close(88) connection closed >> 0xa9ab8ec 1 >> >> Does that typically mean that the target has closed the connection or the >> initiator? >> > > Might be best to ask the tgtd list, but I think due to the abort > messsages it is probably the initiator if you are using open-iscsi for > the initiator and if you see the abort messages before conn_close ones. > > If you see abort ones first, then see conn close ones, then probably a > scsi command is timing out. This causes the scsi layer to have the > initiator abort the command. If the abort fails, the initiator could try > a lun reset or target reset. If we cannot reset or abort the problem > away we drop the connection/session. > > On the initiator if you did > > echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_eh > > you would see a "wait for relogin" message in /var/log/messages then a > "session reset succeeded" or "failing session reset: Could not log back > into" message. The wait for relogin would match the conn close messages > on the target. Then the success or failed messages indicated if we were > able to relogin. > > -- > You received this message because you are subscribed to the Google Groups > "open-iscsi" group. > To post to this group, send email to open-iscsi@googlegroups.com. > To unsubscribe from this group, send email to > open-iscsi+unsubscr...@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/open-iscsi?hl=en. > -- You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
Re: Target or Initiator closing connection?
On 04/12/2011 12:43 PM, Joe Hoot wrote: I'm trying to understand the following messages: Apr 12 13:23:52 oim60025001 tgtd: conn_close(88) connection closed 0x94d80c4 1 lots of the above messages... Apr 12 13:37:19 oim60025001 tgtd: abort_task_set(979) found 271 0 lots of the above messages... Apr 12 13:37:27 oim60025001 tgtd: abort_cmd(955) found e9 e lots of the above messages... Apr 12 13:39:08 oim60025001 tgtd: conn_close(88) connection closed 0xa9ab8ec 1 Does that typically mean that the target has closed the connection or the initiator? Might be best to ask the tgtd list, but I think due to the abort messsages it is probably the initiator if you are using open-iscsi for the initiator and if you see the abort messages before conn_close ones. If you see abort ones first, then see conn close ones, then probably a scsi command is timing out. This causes the scsi layer to have the initiator abort the command. If the abort fails, the initiator could try a lun reset or target reset. If we cannot reset or abort the problem away we drop the connection/session. On the initiator if you did echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_eh you would see a "wait for relogin" message in /var/log/messages then a "session reset succeeded" or "failing session reset: Could not log back into" message. The wait for relogin would match the conn close messages on the target. Then the success or failed messages indicated if we were able to relogin. -- You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.