Re: Target or Initiator closing connection?

2011-04-15 Thread Mike Christie

On 04/14/2011 02:17 PM, Mike Christie wrote:

This message means we have sent 2592 bytes out of 4096. The 0 means that
we are trying to send the rest of the data, but we are not able to sent
anything (0 is the number of bytes we were able to send when we did
sendpage/sendmsg). So send the full logs so I can see what else is in
there.



Ok in your current logs we have

Apr 13 11:07:23 oim61024008 kernel:  connection3:0: ping timeout of 5 
secs expired, recv timeout 5, last rx 559699, last ping 560949, now 562199



Then looking over session3/connection3:0 we have

connection3:0: iscsi_tcp_segment_done copied 2632 0 size 4096 xmit

from Apr 13 11:07:23  to the middle of Apr 13 11:07:13.

So for 10 seconds we are trying to do sendpage but it keeps returning 0. 
For some reason the network layer is just not sending new IO. The 
network layer is probably retransmitting and so the network layer runs 
out of writespace.


Did you by any chance also take a network trace (tcpdump or 
ethereal/wireshark)? Could you send it?


Can you run a rhel kernel? I can add some debugging in to the kernel.

--
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.



Re: Target or Initiator closing connection?

2011-04-14 Thread Mike Christie

On 04/14/2011 03:12 PM, Joe Hoot wrote:

sorry.. ignore my last email this time :P   We've been doing a ton of tests, so 
I just confused my self.  That test is from a default OVM system that had its 
default kernel and iscsi tools.  We also had disconnects in an OEL5.6 
environment with latest released 872 as well.

Do you know if the iscsi-initiator-utils-6.2.0.872-6.0.2.el5 rpm contains that 
fix patch as well?  If so, I have the dt tests running as of 16:04 in that 
OEL5.6 environment (2.6.18-238.el5).



The patches are kernel patches. Red Hat's 2.6.18-238.el5 kernel has the 
fix pointed to by the http://git.kernel.org link in the other mail.


I do not think you are hitting that though. In the logs you sent we just 
cannot send data. Send the logs and we will know for sure.


--
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.



Re: Target or Initiator closing connection?

2011-04-14 Thread Joe Hoot
sorry.. ignore my last email this time :P   We've been doing a ton of tests, so 
I just confused my self.  That test is from a default OVM system that had its 
default kernel and iscsi tools.  We also had disconnects in an OEL5.6 
environment with latest released 872 as well.

Do you know if the iscsi-initiator-utils-6.2.0.872-6.0.2.el5 rpm contains that 
fix patch as well?  If so, I have the dt tests running as of 16:04 in that 
OEL5.6 environment (2.6.18-238.el5).

I figured instead of testing in OVM, if I can reproduce it with OEL5.6, it 
probably has a larger audience.



On Apr 14, 2011, at 3:28 PM, Joe Hoot wrote:

> This test was done on an OEL5.6 system, using the latest released RHEL kernel 
> + 872 code.
> 
> Mike, we just discovered that our Cisco switch didn't have FlowControl 
> enabled properly.  It looks like it _should_ have autonegotiated to on, but 
> may not have.  We've manually forced it on and are running another test at 
> the moment.  I don't want to eat up too much of your time, so let me gather 
> the new set of info from a new test fist.  Then I'll attach the full logs.
> 
> Thanks,
> Joe
> 
> 
> 
> On Apr 14, 2011, at 3:10 PM, Mike Christie wrote:
> 
>> On 04/14/2011 06:39 AM, Joe Hoot wrote:
>>> Here's what I am seeing (roughly... with lines scraped in between.. and I'm 
>>> only grepping for connect7/session7):
>>> 
>>> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>>> copied 2592 0 size 4096 xmit
>>> ...lots of the above types of messages...
>>> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>>> copied 2592 0 size 4096 xmit
>>> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: 
>>> iscsi_check_transport_timeouts Setting next tmo 2279563
>>> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>>> copied 2592 0 size 4096 xmit
>>> ...lots of the above types of messages...
>>> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>>> copied 2592 0 size 4096 xmit
>>> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: 
>>> iscsi_check_transport_timeouts Setting next tmo 2279563
>>> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>>> copied 2592 0 size 4096 xmit
>>> ...lots of the above types of messages...
>>> Apr 11 13:22:25 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>>> copied 2592 0 size 4096 xmit
>>> Apr 11 13:22:25 oim61024009 kernel:  connection7:0: 
>>> iscsi_check_transport_timeouts Sending nopout as ping
>>> Apr 11 13:22:25 oim61024009 kernel:  connection7:0: 
>>> iscsi_check_transport_timeouts Setting next tmo 2280813
>>> Apr 11 13:22:25 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>>> copied 2592 0 size 4096 xmit
>>> ...lots of the above types of messages...
>>> Apr 11 13:22:30 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>>> copied 2592 0 size 4096 xmit
>> 
>> Does it always say "xmit" in the iscsi_tcp_segment_done messages from 
>> 13:22:20 - 13:22:30 or does it sometimes say "recv"? Just send the full 
>> logs actually.
>> 
>> And are you running with that kernel fix (I think oracle merged if for 
>> you right). This fix:
>> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=4c48a82935f833d94fcf44c2b0c5d2922acfc77a;hp=d1acfae514425d680912907c6554852f1e258551
>> 
>> 
>> If there is no "recv"s in there then you need the attached patch. No 
>> idea if this applies to the oracle kernel you are using.
>> 
>> The weird thing though would be that it means for 10 seconds we just 
>> send data and never get a response to any of the pdus we have sent.
>> 
>> What is the output of iscsiadm -m session -P 2 when you are logged into 
>> the target?
>> 
>>> Apr 11 13:22:30 oim61024009 kernel:  connection7:0: ping timeout of 5 secs 
>>> expired, recv timeout 5, last rx 2278313, last ping 2279563, now 2280813
>> 
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "open-iscsi" group.
>> To post to this group, send email to open-iscsi@googlegroups.com.
>> To unsubscribe from this group, send email to 
>> open-iscsi+unsubscr...@googlegroups.com.
>> For more options, visit this group at 
>> http://groups.google.com/group/open-iscsi?hl=en.
>> 
>> 
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "open-iscsi" group.
> To post to this group, send email to open-iscsi@googlegroups.com.
> To unsubscribe from this group, send email to 
> open-iscsi+unsubscr...@googlegroups.com.
> For more options, visit this group at 
> http://groups.google.com/group/open-iscsi?hl=en.
> 

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.



Re: Target or Initiator closing connection?

2011-04-14 Thread Joe Hoot
This test was done on an OEL5.6 system, using the latest released RHEL kernel + 
872 code.

Mike, we just discovered that our Cisco switch didn't have FlowControl enabled 
properly.  It looks like it _should_ have autonegotiated to on, but may not 
have.  We've manually forced it on and are running another test at the moment.  
I don't want to eat up too much of your time, so let me gather the new set of 
info from a new test fist.  Then I'll attach the full logs.

Thanks,
Joe



On Apr 14, 2011, at 3:10 PM, Mike Christie wrote:

> On 04/14/2011 06:39 AM, Joe Hoot wrote:
>> Here's what I am seeing (roughly... with lines scraped in between.. and I'm 
>> only grepping for connect7/session7):
>> 
>> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>> copied 2592 0 size 4096 xmit
>> ...lots of the above types of messages...
>> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>> copied 2592 0 size 4096 xmit
>> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: 
>> iscsi_check_transport_timeouts Setting next tmo 2279563
>> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>> copied 2592 0 size 4096 xmit
>> ...lots of the above types of messages...
>> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>> copied 2592 0 size 4096 xmit
>> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: 
>> iscsi_check_transport_timeouts Setting next tmo 2279563
>> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>> copied 2592 0 size 4096 xmit
>> ...lots of the above types of messages...
>> Apr 11 13:22:25 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>> copied 2592 0 size 4096 xmit
>> Apr 11 13:22:25 oim61024009 kernel:  connection7:0: 
>> iscsi_check_transport_timeouts Sending nopout as ping
>> Apr 11 13:22:25 oim61024009 kernel:  connection7:0: 
>> iscsi_check_transport_timeouts Setting next tmo 2280813
>> Apr 11 13:22:25 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>> copied 2592 0 size 4096 xmit
>> ...lots of the above types of messages...
>> Apr 11 13:22:30 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
>> copied 2592 0 size 4096 xmit
> 
> Does it always say "xmit" in the iscsi_tcp_segment_done messages from 
> 13:22:20 - 13:22:30 or does it sometimes say "recv"? Just send the full 
> logs actually.
> 
> And are you running with that kernel fix (I think oracle merged if for 
> you right). This fix:
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=4c48a82935f833d94fcf44c2b0c5d2922acfc77a;hp=d1acfae514425d680912907c6554852f1e258551
> 
> 
> If there is no "recv"s in there then you need the attached patch. No 
> idea if this applies to the oracle kernel you are using.
> 
> The weird thing though would be that it means for 10 seconds we just 
> send data and never get a response to any of the pdus we have sent.
> 
> What is the output of iscsiadm -m session -P 2 when you are logged into 
> the target?
> 
>> Apr 11 13:22:30 oim61024009 kernel:  connection7:0: ping timeout of 5 secs 
>> expired, recv timeout 5, last rx 2278313, last ping 2279563, now 2280813
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "open-iscsi" group.
> To post to this group, send email to open-iscsi@googlegroups.com.
> To unsubscribe from this group, send email to 
> open-iscsi+unsubscr...@googlegroups.com.
> For more options, visit this group at 
> http://groups.google.com/group/open-iscsi?hl=en.
> 
> 

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.



Re: Target or Initiator closing connection?

2011-04-14 Thread Mike Christie

On 04/14/2011 02:10 PM, Mike Christie wrote:

On 04/14/2011 06:39 AM, Joe Hoot wrote:

Here's what I am seeing (roughly... with lines scraped in between..
and I'm only grepping for connect7/session7):

Apr 11 13:22:24 oim61024009 kernel: connection7:0:
iscsi_tcp_segment_done copied 2592 0 size 4096 xmit
...lots of the above types of messages...
Apr 11 13:22:24 oim61024009 kernel: connection7:0:
iscsi_tcp_segment_done copied 2592 0 size 4096 xmit
Apr 11 13:22:24 oim61024009 kernel: connection7:0:
iscsi_check_transport_timeouts Setting next tmo 2279563
Apr 11 13:22:24 oim61024009 kernel: connection7:0:
iscsi_tcp_segment_done copied 2592 0 size 4096 xmit
...lots of the above types of messages...
Apr 11 13:22:24 oim61024009 kernel: connection7:0:
iscsi_tcp_segment_done copied 2592 0 size 4096 xmit
Apr 11 13:22:24 oim61024009 kernel: connection7:0:
iscsi_check_transport_timeouts Setting next tmo 2279563
Apr 11 13:22:24 oim61024009 kernel: connection7:0:
iscsi_tcp_segment_done copied 2592 0 size 4096 xmit
...lots of the above types of messages...
Apr 11 13:22:25 oim61024009 kernel: connection7:0:
iscsi_tcp_segment_done copied 2592 0 size 4096 xmit
Apr 11 13:22:25 oim61024009 kernel: connection7:0:
iscsi_check_transport_timeouts Sending nopout as ping
Apr 11 13:22:25 oim61024009 kernel: connection7:0:
iscsi_check_transport_timeouts Setting next tmo 2280813
Apr 11 13:22:25 oim61024009 kernel: connection7:0:
iscsi_tcp_segment_done copied 2592 0 size 4096 xmit
...lots of the above types of messages...
Apr 11 13:22:30 oim61024009 kernel: connection7:0:
iscsi_tcp_segment_done copied 2592 0 size 4096 xmit




Ignore the previous mail. Your mail was the first one I read today and 
my eyes are not wide open yet :)


Send the full logs. And were you using that patch to send the 
sk_sndtimeo lower? It looks like you were.


Apr 11 13:22:24 oim61024009 kernel:  connection7:0: 
iscsi_tcp_segment_done copied 2592 0 size 4096 xmit


This message means we have sent 2592 bytes out of 4096. The 0 means that 
we are trying to send the rest of the data, but we are not able to sent 
anything (0 is the number of bytes we were able to send when we did 
sendpage/sendmsg). So send the full logs so I can see what else is in there.


--
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.



Re: Target or Initiator closing connection?

2011-04-14 Thread Mike Christie

On 04/14/2011 06:39 AM, Joe Hoot wrote:

Here's what I am seeing (roughly... with lines scraped in between.. and I'm 
only grepping for connect7/session7):

Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
...lots of the above types of messages...
Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
Apr 11 13:22:24 oim61024009 kernel:  connection7:0: 
iscsi_check_transport_timeouts Setting next tmo 2279563
Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
...lots of the above types of messages...
Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
Apr 11 13:22:24 oim61024009 kernel:  connection7:0: 
iscsi_check_transport_timeouts Setting next tmo 2279563
Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
...lots of the above types of messages...
Apr 11 13:22:25 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
Apr 11 13:22:25 oim61024009 kernel:  connection7:0: 
iscsi_check_transport_timeouts Sending nopout as ping
Apr 11 13:22:25 oim61024009 kernel:  connection7:0: 
iscsi_check_transport_timeouts Setting next tmo 2280813
Apr 11 13:22:25 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
...lots of the above types of messages...
Apr 11 13:22:30 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit


Does it always say "xmit" in the iscsi_tcp_segment_done messages from 
13:22:20 - 13:22:30 or does it sometimes say "recv"? Just send the full 
logs actually.


And are you running with that kernel fix (I think oracle merged if for 
you right). This fix:

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=4c48a82935f833d94fcf44c2b0c5d2922acfc77a;hp=d1acfae514425d680912907c6554852f1e258551


If there is no "recv"s in there then you need the attached patch. No 
idea if this applies to the oracle kernel you are using.


The weird thing though would be that it means for 10 seconds we just 
send data and never get a response to any of the pdus we have sent.


What is the output of iscsiadm -m session -P 2 when you are logged into 
the target?



Apr 11 13:22:30 oim61024009 kernel:  connection7:0: ping timeout of 5 secs 
expired, recv timeout 5, last rx 2278313, last ping 2279563, now 2280813


--
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.

diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index a860452..c188267 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -274,7 +274,8 @@ static int iscsi_sw_tcp_xmit_segment(struct iscsi_tcp_conn 
*tcp_conn,
if (r < 0) {
iscsi_tcp_segment_unmap(segment);
return r;
-   }
+   } else if (r > 0)
+   tcp_conn->iscsi_conn->last_xfer = jiffies;
copied += r;
}
return copied;
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 0c550d5..5d981f2 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -610,7 +610,7 @@ void iscsi_complete_scsi_task(struct iscsi_task *task,
 
ISCSI_DBG_SESSION(conn->session, "[itt 0x%x]\n", task->itt);
 
-   conn->last_recv = jiffies;
+   conn->last_xfer = jiffies;
__iscsi_update_cmdsn(conn->session, exp_cmdsn, max_cmdsn);
iscsi_complete_task(task, ISCSI_TASK_COMPLETED);
 }
@@ -1130,7 +1130,7 @@ int __iscsi_complete_pdu(struct iscsi_conn *conn, struct 
iscsi_hdr *hdr,
struct iscsi_task *task;
uint32_t itt;
 
-   conn->last_recv = jiffies;
+   conn->last_xfer = jiffies;
rc = iscsi_verify_itt(conn, hdr->itt);
if (rc)
return rc;
@@ -1902,7 +1902,7 @@ static void iscsi_start_tx(struct iscsi_conn *conn)
 static int iscsi_has_ping_timed_out(struct iscsi_conn *conn)
 {
if (conn->ping_task &&
-   time_before_eq(conn->last_recv + (conn->recv_timeout * HZ) +
+   time_before_eq(conn->last_xfer+ (conn->recv_timeout * HZ) +
   (conn->ping_timeout * HZ), jiffies))
return 1;
else
@@ -2047,7 +2047,7 @@ static void iscsi_check_transport_timeouts(unsigned long 
data)
 {
struct iscsi_conn *conn = (struct iscsi_conn *)data;
struct iscsi_session *session = conn->session;
-   unsigned long recv_timeout, next_timeout = 0, last_recv;
+   unsigned long recv_timeout, next_timeout = 0, last_xfer;
 
s

Re: Target or Initiator closing connection?

2011-04-14 Thread Joe Hoot
sorry... I'm confusing the issue...   the messages that I showed below were 
from a iSCSI Disconnect test that we had done while connecting to the 
EqualLogic.  I will re-setup my test tot he tgtd to see what it and the 
initiator show.  I will take you suggestion of writting to dev/null.



On Apr 14, 2011, at 7:39 AM, Joe Hoot wrote:

> Here's what I am seeing (roughly... with lines scraped in between.. and I'm 
> only grepping for connect7/session7):
> 
> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
> copied 2592 0 size 4096 xmit
> ...lots of the above types of messages...
> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
> copied 2592 0 size 4096 xmit
> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: 
> iscsi_check_transport_timeouts Setting next tmo 2279563
> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
> copied 2592 0 size 4096 xmit
> ...lots of the above types of messages...
> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
> copied 2592 0 size 4096 xmit
> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: 
> iscsi_check_transport_timeouts Setting next tmo 2279563
> Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
> copied 2592 0 size 4096 xmit
> ...lots of the above types of messages...
> Apr 11 13:22:25 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
> copied 2592 0 size 4096 xmit
> Apr 11 13:22:25 oim61024009 kernel:  connection7:0: 
> iscsi_check_transport_timeouts Sending nopout as ping
> Apr 11 13:22:25 oim61024009 kernel:  connection7:0: 
> iscsi_check_transport_timeouts Setting next tmo 2280813
> Apr 11 13:22:25 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
> copied 2592 0 size 4096 xmit
> ...lots of the above types of messages...
> Apr 11 13:22:30 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
> copied 2592 0 size 4096 xmit
> Apr 11 13:22:30 oim61024009 kernel:  connection7:0: ping timeout of 5 secs 
> expired, recv timeout 5, last rx 2278313, last ping 2279563, now 2280813
> Apr 11 13:22:30 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
> copied 2592 0 size 4096 xmit
> Apr 11 13:22:30 oim61024009 kernel:  connection7:0: detected conn error (1011)
> Apr 11 13:22:30 oim61024009 kernel:  session7: iscsi_data_xmit Tx suspended!
> Apr 11 13:22:30 oim61024009 kernel:  session7: iscsi_start_session_recovery 
> blocking session
> Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing 
> pending sc df493a80 itt 0x108
> ...etc..etc..
> Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing 
> pending sc d81de300 itt 0x159
> Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing 
> requeued sc d89b6800 itt 0xd8
> Apr 11 13:22:30 oim61024009 kernel:  connection7:0: iscsi_tcp_cleanup_task 
> pending r2t dropped
> Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing 
> requeued sc ca305ac0 itt 0xd9
> Apr 11 13:22:30 oim61024009 kernel:  connection7:0: iscsi_tcp_cleanup_task 
> pending r2t dropped
> Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing 
> requeued sc d81de440 itt 0xdb
> Apr 11 13:22:30 oim61024009 kernel:  connection7:0: iscsi_tcp_cleanup_task 
> pending r2t dropped
> Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing 
> requeued sc ca305700 itt 0xd7
> ...pattern continues..
> Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing 
> requeued sc d847ee40 itt 0xfa
> Apr 11 13:22:30 oim61024009 kernel:  connection7:0: iscsi_tcp_cleanup_task 
> pending r2t dropped
> Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing in 
> progress sc c9ac8840 itt 0xfb
> Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing in 
> progress sc d1e04800 itt 0xfc
> Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing in 
> progress sc df0cdc00 itt 0xfd
> ...pattern continues...
> Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing in 
> progress sc d8811700 itt 0x107
> Apr 11 13:22:30 oim61024009 kernel:  session7: flush_control_queues flushing 
> pending mgmt task itt 0x15a
> Apr 11 13:22:30 oim61024009 kernel:  session7: iscsi2_queuecommand cmd 0x2a 
> rejected (7)
> Apr 11 13:22:32 oim61024009 kernel:  connection7:0: iscsi_tcp_hdr_recv_prep 
> (digest disabled)
> Apr 11 13:22:32 oim61024009 kernel:  connection7:0: iscsi_tcp_task_init mtask 
> deq [itt 0x0]
> Apr 11 13:22:32 oim61024009 kernel:  connection7:0: 
> iscsi_sw_tcp_send_hdr_prep digest disabled
> Apr 11 13:22:32 oim61024009 kernel:  connection7:0: 
> iscsi_sw_tcp_send_linear_data_prep datalen=208 digest disabled
> Apr 11 13:22:32 oim61024009 kernel:  session7: iscsi_prep_mgmt_task mgmtpdu 
> [op 0x3 hdr->itt 0x1000 datalen 208]
> Apr 11 13:22:32 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
> copied 0 0 size 48 xmit
> 
> 

Re: Target or Initiator closing connection?

2011-04-14 Thread Joe Hoot
Here's what I am seeing (roughly... with lines scraped in between.. and I'm 
only grepping for connect7/session7):

Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
...lots of the above types of messages...
Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
Apr 11 13:22:24 oim61024009 kernel:  connection7:0: 
iscsi_check_transport_timeouts Setting next tmo 2279563
Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
...lots of the above types of messages...
Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
Apr 11 13:22:24 oim61024009 kernel:  connection7:0: 
iscsi_check_transport_timeouts Setting next tmo 2279563
Apr 11 13:22:24 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
...lots of the above types of messages...
Apr 11 13:22:25 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
Apr 11 13:22:25 oim61024009 kernel:  connection7:0: 
iscsi_check_transport_timeouts Sending nopout as ping
Apr 11 13:22:25 oim61024009 kernel:  connection7:0: 
iscsi_check_transport_timeouts Setting next tmo 2280813
Apr 11 13:22:25 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
...lots of the above types of messages...
Apr 11 13:22:30 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
Apr 11 13:22:30 oim61024009 kernel:  connection7:0: ping timeout of 5 secs 
expired, recv timeout 5, last rx 2278313, last ping 2279563, now 2280813
Apr 11 13:22:30 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 2592 0 size 4096 xmit
Apr 11 13:22:30 oim61024009 kernel:  connection7:0: detected conn error (1011)
Apr 11 13:22:30 oim61024009 kernel:  session7: iscsi_data_xmit Tx suspended!
Apr 11 13:22:30 oim61024009 kernel:  session7: iscsi_start_session_recovery 
blocking session
Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing 
pending sc df493a80 itt 0x108
...etc..etc..
Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing 
pending sc d81de300 itt 0x159
Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing 
requeued sc d89b6800 itt 0xd8
Apr 11 13:22:30 oim61024009 kernel:  connection7:0: iscsi_tcp_cleanup_task 
pending r2t dropped
Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing 
requeued sc ca305ac0 itt 0xd9
Apr 11 13:22:30 oim61024009 kernel:  connection7:0: iscsi_tcp_cleanup_task 
pending r2t dropped
Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing 
requeued sc d81de440 itt 0xdb
Apr 11 13:22:30 oim61024009 kernel:  connection7:0: iscsi_tcp_cleanup_task 
pending r2t dropped
Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing 
requeued sc ca305700 itt 0xd7
...pattern continues..
Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing 
requeued sc d847ee40 itt 0xfa
Apr 11 13:22:30 oim61024009 kernel:  connection7:0: iscsi_tcp_cleanup_task 
pending r2t dropped
Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing in 
progress sc c9ac8840 itt 0xfb
Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing in 
progress sc d1e04800 itt 0xfc
Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing in 
progress sc df0cdc00 itt 0xfd
...pattern continues...
Apr 11 13:22:30 oim61024009 kernel:  session7: fail_all_commands failing in 
progress sc d8811700 itt 0x107
Apr 11 13:22:30 oim61024009 kernel:  session7: flush_control_queues flushing 
pending mgmt task itt 0x15a
Apr 11 13:22:30 oim61024009 kernel:  session7: iscsi2_queuecommand cmd 0x2a 
rejected (7)
Apr 11 13:22:32 oim61024009 kernel:  connection7:0: iscsi_tcp_hdr_recv_prep 
(digest disabled)
Apr 11 13:22:32 oim61024009 kernel:  connection7:0: iscsi_tcp_task_init mtask 
deq [itt 0x0]
Apr 11 13:22:32 oim61024009 kernel:  connection7:0: iscsi_sw_tcp_send_hdr_prep 
digest disabled
Apr 11 13:22:32 oim61024009 kernel:  connection7:0: 
iscsi_sw_tcp_send_linear_data_prep datalen=208 digest disabled
Apr 11 13:22:32 oim61024009 kernel:  session7: iscsi_prep_mgmt_task mgmtpdu [op 
0x3 hdr->itt 0x1000 datalen 208]
Apr 11 13:22:32 oim61024009 kernel:  connection7:0: iscsi_tcp_segment_done 
copied 0 0 size 48 xmit






On Apr 14, 2011, at 6:45 AM, Mike Christie wrote:

> On 04/14/2011 05:12 AM, Joe Hoot wrote:
>> Mike,
>> 
>> I'm still digging into this "iSCSI Disconnect" issue that we've been dealing 
>> with for about two years here in SUNY.  The iscsi-initiator that I'm using 
>> is this:
>> 
>>   iscsi-initiator-utils-6.2.0.872-6.0.2.el5
>> 
>> I've been running dt tests to the equallogic in our Oracle VM environment 
>> with the following set:
>> 
>>   echo 1>  /sys/module/libiscsi2/parameters/debug_libiscsi
>>   echo 1

Re: Target or Initiator closing connection?

2011-04-14 Thread Mike Christie

On 04/14/2011 05:12 AM, Joe Hoot wrote:

Mike,

I'm still digging into this "iSCSI Disconnect" issue that we've been dealing 
with for about two years here in SUNY.  The iscsi-initiator that I'm using is this:

   iscsi-initiator-utils-6.2.0.872-6.0.2.el5

I've been running dt tests to the equallogic in our Oracle VM environment with 
the following set:

   echo 1>  /sys/module/libiscsi2/parameters/debug_libiscsi
   echo 1>  /sys/module/libiscsi_tcp/parameters/debug_libiscsi_tcp
   echo 1>  /sys/module/iscsi_tcp/parameters/debug_iscsi_tcp

When I try to look for "echo 1>  
/sys/module/libiscsi/parameters/debug_libiscsi_eh" I don't see that as an available 
option.  Are the debug lines above correct?  Are there other debug messages that I should be 
gathering?



It is only in newer kernels. debug_libiscsi will print out what 
debug_libiscsi_eh does plus more.




[root@oim61024001 src]# ls -l /sys/module/libiscsi
libiscsi2/libiscsi_tcp/


We have done numerous tests and different hardware.  I have eliminated OVM and 
have been testing this now on OEL5.6.

The scenario that seems to break connections quickly is the following:

- cluster of 5 nodes using OCFS2 (I will be trying to rule OCFS2 and 
dm-multipath out shortly)
- 5-7 iSCSI volumes connected to each of those 5 nodes
- 4 threads of dt running against each of 5-7 volumes from each host = 28 
threads of dt slamming the volumes per host = 140 threads of dt per cluster.
- test only lasts about 2-5 minutes before I start seeing ping timeouts and 
disconnects.

The issue that we've seen thus far is mainly with EqualLogic and open-iscsi.   From what EQL is 
telling us, the initiator is "aborting" the connection.  But from the initiator-side, we 
just see "ping timeout" messages (and then the connection eventually goes away).



When the ping times out, the initiator will abort/drop the connection. 
It will then try to relogin. So you should probably be seeing the 
ping/nop timeout message, then a conn error 1011, then a message about 
reconnected in X retries, right?



The thing is that below with tgtd we seem to be seeing scsi commands 
timing out. On the initiator logs did you see ping/nop timeout messages 
when you hit the problem below with tgtd?


With tgtd let's take take the disks out of it and just test iscsi.

So do something like this:

tgtadm --op new --mode target --tid 1 -T iqn.2001-04.com.pertest
tgtadm --op new --mode logicalunit --tid 1 --lun 1 --bstype null -b 
/dev/null

tgtadm --op bind --mode target --tid 1 -I ALL

This of course will not do any real IO so you cannot do tests that 
verify day. Just do really harsh heavy read/write tests.



When you do IO to fake disks do you see any ping/nop or scsi timeout errors?



We recently saw a thread (Apr 4) regarding cfq scheduler.  So we quickly tested 
noop and deadline, just to see if that would change anything-- it didn't.

So my most recent test was to try out a different target, just to see if we could rule 
out the EqualLogic.  Each time I changed from EQLX to the tgtd, I would reset (and rescan 
in my volumes) the iscsid.conf's "FastAbort = No", or yes (if I was testing 
tgtd), to conform with EQLX's best practices.



The abort setting is not going to help at all for any target. That 
basically kicks in after you have already hit the problem you are 
hitting. It just speeds up the handling of the problem.




So at this point, after I get dm-multipath and OCFS2 out of the equation, it 
will be down to a tartget + kernel/initiator + I/O scheduler and I want to make 
sure that I'm getting all the debug information that I might need to analyze 
what is going on.

Are there any other debug tunables that you might recommend adding to my script?


No, the ones you have above are all there is. Maybe tcpdump -w iscsi.out 
-i ethXYZ


Did you have the initiator /var/log/messages with debugging on when you 
hit the problem below? If so send them.




On Apr 14, 2011, at 12:02 AM, Mike Christie wrote:


On 04/12/2011 12:43 PM, Joe Hoot wrote:

I'm trying to understand the following messages:

   Apr 12 13:23:52 oim60025001 tgtd: conn_close(88) connection closed 0x94d80c4 
1
    lots of the above messages...
   Apr 12 13:37:19 oim60025001 tgtd: abort_task_set(979) found 271 0
    lots of the above messages...
   Apr 12 13:37:27 oim60025001 tgtd: abort_cmd(955) found e9 e
    lots of the above messages...
   Apr 12 13:39:08 oim60025001 tgtd: conn_close(88) connection closed 0xa9ab8ec 
1



--
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.



Re: Target or Initiator closing connection?

2011-04-14 Thread Joe Hoot
Mike,

I'm still digging into this "iSCSI Disconnect" issue that we've been dealing 
with for about two years here in SUNY.  The iscsi-initiator that I'm using is 
this:

  iscsi-initiator-utils-6.2.0.872-6.0.2.el5

I've been running dt tests to the equallogic in our Oracle VM environment with 
the following set:
  
  echo 1 > /sys/module/libiscsi2/parameters/debug_libiscsi
  echo 1 > /sys/module/libiscsi_tcp/parameters/debug_libiscsi_tcp
  echo 1 > /sys/module/iscsi_tcp/parameters/debug_iscsi_tcp

When I try to look for "echo 1 > 
/sys/module/libiscsi/parameters/debug_libiscsi_eh" I don't see that as an 
available option.  Are the debug lines above correct?  Are there other debug 
messages that I should be gathering?

[root@oim61024001 src]# ls -l /sys/module/libiscsi
libiscsi2/libiscsi_tcp/ 


We have done numerous tests and different hardware.  I have eliminated OVM and 
have been testing this now on OEL5.6.  

The scenario that seems to break connections quickly is the following:

- cluster of 5 nodes using OCFS2 (I will be trying to rule OCFS2 and 
dm-multipath out shortly)
- 5-7 iSCSI volumes connected to each of those 5 nodes
- 4 threads of dt running against each of 5-7 volumes from each host = 28 
threads of dt slamming the volumes per host = 140 threads of dt per cluster.
- test only lasts about 2-5 minutes before I start seeing ping timeouts and 
disconnects.

The issue that we've seen thus far is mainly with EqualLogic and open-iscsi.   
From what EQL is telling us, the initiator is "aborting" the connection.  But 
from the initiator-side, we just see "ping timeout" messages (and then the 
connection eventually goes away).   

We recently saw a thread (Apr 4) regarding cfq scheduler.  So we quickly tested 
noop and deadline, just to see if that would change anything-- it didn't.

So my most recent test was to try out a different target, just to see if we 
could rule out the EqualLogic.  Each time I changed from EQLX to the tgtd, I 
would reset (and rescan in my volumes) the iscsid.conf's "FastAbort = No", or 
yes (if I was testing tgtd), to conform with EQLX's best practices.

So at this point, after I get dm-multipath and OCFS2 out of the equation, it 
will be down to a tartget + kernel/initiator + I/O scheduler and I want to make 
sure that I'm getting all the debug information that I might need to analyze 
what is going on.

Are there any other debug tunables that you might recommend adding to my script?


On Apr 14, 2011, at 12:02 AM, Mike Christie wrote:

> On 04/12/2011 12:43 PM, Joe Hoot wrote:
>> I'm trying to understand the following messages:
>> 
>>   Apr 12 13:23:52 oim60025001 tgtd: conn_close(88) connection closed 
>> 0x94d80c4 1
>>    lots of the above messages...
>>   Apr 12 13:37:19 oim60025001 tgtd: abort_task_set(979) found 271 0
>>    lots of the above messages...
>>   Apr 12 13:37:27 oim60025001 tgtd: abort_cmd(955) found e9 e
>>    lots of the above messages...
>>   Apr 12 13:39:08 oim60025001 tgtd: conn_close(88) connection closed 
>> 0xa9ab8ec 1
>> 
>> Does that typically mean that the target has closed the connection or the 
>> initiator?
>> 
> 
> Might be best to ask the tgtd list, but I think due to the abort 
> messsages it is probably the initiator if you are using open-iscsi for 
> the initiator and if you see the abort messages before conn_close ones.
> 
> If you see abort ones first, then see conn close ones, then probably a 
> scsi command is timing out. This causes the scsi layer to have the 
> initiator abort the command. If the abort fails, the initiator could try 
> a lun reset or target reset. If we cannot reset or abort the problem 
> away we drop the connection/session.
> 
> On the initiator if you did
> 
> echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_eh
> 
> you would see a "wait for relogin" message in /var/log/messages then a 
> "session reset succeeded" or "failing session reset: Could not log back 
> into" message. The wait for relogin would match the conn close messages 
> on the target. Then the success or failed messages indicated if we were 
> able to relogin.
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "open-iscsi" group.
> To post to this group, send email to open-iscsi@googlegroups.com.
> To unsubscribe from this group, send email to 
> open-iscsi+unsubscr...@googlegroups.com.
> For more options, visit this group at 
> http://groups.google.com/group/open-iscsi?hl=en.
> 

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.



Re: Target or Initiator closing connection?

2011-04-13 Thread Mike Christie

On 04/12/2011 12:43 PM, Joe Hoot wrote:

I'm trying to understand the following messages:

   Apr 12 13:23:52 oim60025001 tgtd: conn_close(88) connection closed 0x94d80c4 
1
    lots of the above messages...
   Apr 12 13:37:19 oim60025001 tgtd: abort_task_set(979) found 271 0
    lots of the above messages...
   Apr 12 13:37:27 oim60025001 tgtd: abort_cmd(955) found e9 e
    lots of the above messages...
   Apr 12 13:39:08 oim60025001 tgtd: conn_close(88) connection closed 0xa9ab8ec 
1

Does that typically mean that the target has closed the connection or the 
initiator?



Might be best to ask the tgtd list, but I think due to the abort 
messsages it is probably the initiator if you are using open-iscsi for 
the initiator and if you see the abort messages before conn_close ones.


If you see abort ones first, then see conn close ones, then probably a 
scsi command is timing out. This causes the scsi layer to have the 
initiator abort the command. If the abort fails, the initiator could try 
a lun reset or target reset. If we cannot reset or abort the problem 
away we drop the connection/session.


On the initiator if you did

echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_eh

you would see a "wait for relogin" message in /var/log/messages then a 
"session reset succeeded" or "failing session reset: Could not log back 
into" message. The wait for relogin would match the conn close messages 
on the target. Then the success or failed messages indicated if we were 
able to relogin.


--
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.