Re: Very strange problem with an Infortrend A16E iSCSI storage array

2009-02-04 Thread Santi Saez


Hi Mike,

On 3/2/09 20:19, Mike Christie wrote:

 *Randomly*, one of these channels resets, making the 4 servers connected
 to the channel timeout. The other 3 channels are not affected at all.

(..)

 The initiator sends an iSCSI ping every X seconds. If we do not get a
 response in Y seconds we drop the session (drop connection and relogin).

Yes, we were aware of this bug. In fact, you helped us with it not too 
long ago:

http://tinyurl.com/cywy3j


 There was a bug in the initiator where we would spit out this timeout
 error by accident. What kernel are you using? Are you using the iscsi
 modules in the kernel or modules from a open-iscsi.org release and what
 release of open-iscsi.org?

# iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-724
iscsiadm version 2.0-868
Target: iqn.2002-10.com.infortrend:raid.sn7457155.30
 Current Portal: 10.15.17.133:3260,1
 Persistent Portal: 10.15.17.133:3260,1
 **
 Interface:
 **
 Iface Name: default
 Iface Transport: tcp
 Iface Initiatorname: iqn.2001-05.net.example:vz11
 Iface IPaddress: 10.15.17.137
 Iface HWaddress: default
 Iface Netdev: default
 SID: 2
 iSCSI Connection State: LOGGED IN
 iSCSI Session State: Unknown
 Internal iscsid Session State: NO CHANGE
 
 Negotiated iSCSI params:
 
 HeaderDigest: None
 DataDigest: None
 MaxRecvDataSegmentLength: 131072
 MaxXmitDataSegmentLength: 65536
 FirstBurstLength: 65536
 MaxBurstLength: 262144
 ImmediateData: Yes
 InitialR2T: No
 MaxOutstandingR2T: 1
 
 Attached SCSI devices:
 
 Host Number: 2  State: running
 scsi2 Channel 00 Id 0 Lun: 0
 Attached scsi disk sdb  State: running


We're using CentOS 5.2 with the default iscsi-initiator-utils package:

# rpm -qa iscsi-initiator-utils
iscsi-initiator-utils-6.2.0.868-0.7.el5

We are also using the default iSCSI modules.


 connection4:0: iscsi: detected conn error (1011)
 session4: iscsi: session recovery timed out after 120 secs

 I do not think it is the bug, because you would normally log right back in.

 The recovery timed out error means that the initiator tried to log back
 in for 120 seconds and during that time we could not reconnect/relogin.

 I think this makes sense when looking at the switch messages below. If
 something causes the link to go down, the iscsi ping would fail/timeout.

 I am not sure if the iscsi layer dropping the session would cause the
 link to go down/up.
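
The two timers described above (ping timeout, then recovery timeout) can be sketched as a rough Python model. This is only an illustration under stated assumptions, not the initiator's actual logic: the function and parameter names are hypothetical, the 5 s ping timeout and 120 s recovery timeout are taken from the log lines in this thread, and the 5 s ping interval is an assumed value.

```python
def classify_outage(outage_s, ping_interval_s=5.0, ping_timeout_s=5.0,
                    recovery_timeout_s=120.0):
    """Classify a link outage under a simplified two-timer model.

    Assumptions (not the real initiator code): the outage starts right
    after a successful ping, the next ping is sent ping_interval_s later,
    and a relogin succeeds as soon as the link is back.
    """
    # Seconds after the outage began at which the ping timeout fires.
    detect_at = ping_interval_s + ping_timeout_s
    if outage_s < detect_at:
        return "unnoticed"
    # From detection on, the initiator keeps retrying the login until the
    # recovery timeout expires.
    give_up_at = detect_at + recovery_timeout_s
    if outage_s < give_up_at:
        return "recovered"
    # Matches "session recovery timed out after 120 secs": I/O is failed.
    return "io_failed"


for seconds in (3, 30, 200):
    print(seconds, classify_outage(seconds))
```

In this model, the "session recovery timed out" message only appears when the target stays unreachable for longer than the detection time plus the recovery timeout.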

The link that goes down/up isn't the link between the switch and the host;
the affected link is the one between the *switch and the array*, which is
very strange. It appears that some iSCSI client is triggering something
that makes the iSCSI interface in the array reset.

I think it's not a problem with Open-iSCSI but an Infortrend array bug;
perhaps someone can shed some light on this problem.

As I said, when this occurs it affects all servers connected to this
iSCSI interface/channel, including Windows hosts, etc.

Regards,

-- 
Santi Saez
http://woop.es

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
open-iscsi group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



EqualLogic load-balancing logout/re-login behavior (asynchronous event logout)

2009-02-04 Thread Konrad Rzeszutek

In version 2.0-865, when we received an ISCSI_ASYNC_MSG_REQUEST_LOGOUT, we
would log out and then retry logging back in:

- 28Jul 28 20:15:40 iscsid: Target requests logout within 3 seconds for connection
- 28Jul 28 20:15:45 iscsid: connection5:0 is operational after recovery (2 attempts)

And we would have a short hiccup (5 seconds) while the connection was gone.

As I understand it, this was a mechanism for the EqualLogic box to move a
session to a different port (re-establishing allegiance), which allows
load balancing.

In 2.0-869, the git commit 052d014485d2ce5bb7fa8dd0df875dafd1db77df changed this
behavior so that we now actually log out and delete the session. No more retries.

2.0-865:
static int iscsi_xmit_mtask(struct iscsi_conn *conn)
{
    struct iscsi_hdr *hdr = conn->mtask->hdr;
    int rc, was_logout = 0;

    spin_unlock_bh(&conn->session->lock);
    if ((hdr->opcode & ISCSI_OPCODE_MASK) == ISCSI_OP_LOGOUT) {
        conn->session->state = ISCSI_STATE_IN_RECOVERY;
        iscsi_block_session(session_to_cls(conn->session));

...
2.0-869:
static int iscsi_xmit_mtask(struct iscsi_conn *conn)
{
    struct iscsi_hdr *hdr = conn->mtask->hdr;
    int rc;

    if ((hdr->opcode & ISCSI_OPCODE_MASK) == ISCSI_OP_LOGOUT)
        conn->session->state = ISCSI_STATE_LOGGING_OUT;
    spin_unlock_bh(&conn->session->lock);

.. and..
    if (conn->session->state == ISCSI_STATE_LOGGING_OUT) {
        iscsi_free_mgmt_task(conn, conn->mtask);
        conn->mtask = NULL;
        continue;
    }

In short, 2.0-869 terminates the session without trying to log back in.

With the EqualLogic boxes, that means we never reconnect.

So my question is: was this change intentional?

If so, does EqualLogic know about it, and have they changed their firmware to
send ISCSI_ASYNC_MSG_DROPPING_CONNECTION (0x02) instead?
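
To summarize the behavior difference described above, here is a small Python sketch. The constant names are the ones used in this message and the numeric values follow the AsyncEvent codes of RFC 3720; the function `initiator_reaction` and its return strings are hypothetical and only restate what this thread reports, not the initiator's code.

```python
from enum import IntEnum


class AsyncEvent(IntEnum):
    # AsyncEvent codes as defined in RFC 3720; names as used in this thread.
    ISCSI_ASYNC_MSG_REQUEST_LOGOUT = 1
    ISCSI_ASYNC_MSG_DROPPING_CONNECTION = 2


def initiator_reaction(event, version):
    """Return the reaction reported in this thread (hypothetical helper)."""
    if event == AsyncEvent.ISCSI_ASYNC_MSG_REQUEST_LOGOUT:
        if version == "2.0-865":
            return "logout, then relogin"
        if version == "2.0-869":
            return "logout, session deleted"
    # Anything else is not covered by the observations in this message.
    return "not described in this thread"


for version in ("2.0-865", "2.0-869"):
    event = AsyncEvent.ISCSI_ASYNC_MSG_REQUEST_LOGOUT
    print(version, event.name, "->", initiator_reaction(event, version))
```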




Re: EqualLogic load-balancing logout/re-login behavior (asynchronous event logout)

2009-02-04 Thread Konrad Rzeszutek

On Wed, Feb 04, 2009 at 04:33:13PM -0500, Konrad Rzeszutek wrote:
 
 In the 2.0-865 version when we received a ISCSI_ASYNC_MSG_REQUEST_LOGOUT we 
 would
 logout, and then retry logging back in:
 
 - 28Jul 28 20:15:40 iscsid: Target requests logout within 3 seconds for connection
 - 28Jul 28 20:15:45 iscsid: connection5:0 is operational after recovery (2 attempts)
 
 And we would have a short hiccup (5 seconds) of the connection being gone.
 
 This as my understanding was a mechanism for the EqualLogic box to move 
 (re-establishing
 allegiance) a session to a different port, hence allowing a load-balancing 
 mechanism.
 
 In 2.0-869, the git commit 052d014485d2ce5bb7fa8dd0df875dafd1db77df changed 
 this

The right git commit was:

commit b3a7ea8d50f6028964b468d13a095dfb2508b2fb
Author: Mike Christie micha...@cs.wisc.edu
Date:   Thu Dec 13 12:43:26 2007 -0600

[SCSI] libiscsi: do not block session during logout

There is not need to block the session during logout. Since
we are going to fail the commands that were blocked just fail them
immediately instead.





RE: Very strange problem with an Infortrend A16E iSCSI storage array

2009-02-04 Thread Tristan Ball

You should be able to turn up the logging verbosity on the switch quite
a bit. If the switch is making the choice to disconnect the port, the
higher log levels should show why. If it's the Infortrend making the
choice - well, the switch probably won't show much more than what you've
got, as it will just see the link go down.

If you're not already, make sure the switch is sending its log messages
to a syslog server so you don't miss them!

Regards,
T

-----Original Message-----
From: open-iscsi@googlegroups.com [mailto:open-is...@googlegroups.com]
On Behalf Of Santi Saez
Sent: Wednesday, 4 February 2009 3:55 AM
To: open-iscsi@googlegroups.com
Subject: Very strange problem with an Infortrend A16E iSCSI storage
array



Hi,

We have a very strange problem with an Infortrend A16E iSCSI storage
array [1]. I think it's not a Open-iSCSI related problem, but someone
here may shed some light :-)

This array has 4 iSCSI interfaces to distribute/balance ethernet
traffic. There are 16 hosts connected to this array via iSCSI, with 4
hosts per channel/interface.

*Randomly*, one of these channels resets, making the 4 servers connected
to the channel timeout. The other 3 channels are not affected at all.

Open-iSCSI logs this:

ping timeout of 5 secs expired, last rx 502453156, last ping 502446907, now 502463156
connection4:0: iscsi: detected conn error (1011)
session4: iscsi: session recovery timed out after 120 secs
iscsi: cmd 0x28 is not queued (8)
iscsi: cmd 0x28 is not queued (8)
iscsi: cmd 0x28 is not queued (8)
sd 4:0:0:0: SCSI error: return code = 0x0001
end_request: I/O error, dev sdc, sector 338694423
(..)
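
The counters in the "ping timeout" line above can be turned into time gaps with a short Python sketch. It assumes the three counters are kernel jiffies and that the kernel runs at HZ=1000; both are assumptions, not something stated in the log, and the helper name `ping_gaps` is made up for illustration.

```python
import re

PING_RE = re.compile(
    r"ping timeout of (?P<timeout>\d+) secs expired, "
    r"last rx (?P<last_rx>\d+), last ping (?P<last_ping>\d+), "
    r"now (?P<now>\d+)"
)


def ping_gaps(line, hz=1000):
    """Return the gaps between 'now' and the last rx/ping, in seconds.

    hz is the assumed jiffies-per-second rate of the initiator's kernel.
    """
    m = PING_RE.search(line)
    if m is None:
        return None
    now = int(m["now"])
    return {
        "timeout_s": int(m["timeout"]),
        "since_last_rx_s": (now - int(m["last_rx"])) / hz,
        "since_last_ping_s": (now - int(m["last_ping"])) / hz,
    }


line = ("ping timeout of 5 secs expired, last rx 502453156, "
        "last ping 502446907, now 502463156")
print(ping_gaps(line))
```

Under the HZ=1000 assumption, the last data was received 10 seconds and the last ping was sent about 16.2 seconds before the error was logged.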


The switch port where it is connected shows:

%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/5, changed state to down
%LINK-3-UPDOWN: Interface GigabitEthernet0/5, changed state to down
%LINK-3-UPDOWN: Interface GigabitEthernet0/5, changed state to up
%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/5, changed state to up
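
If the switch messages above are collected on a syslog server, port flaps can be counted per interface with something like the following Python sketch. The regular expression is written against the four messages shown here only, and `count_flaps` is a hypothetical helper; other message variants would need their own handling.

```python
import re

UPDOWN_RE = re.compile(
    r"%(?P<facility>LINK|LINEPROTO)-\d-UPDOWN: .*?Interface "
    r"(?P<ifname>[^,\s]+), changed state to (?P<state>up|down)"
)


def count_flaps(lines):
    """Count down->up transitions of the physical link per interface."""
    last_state = {}
    flaps = {}
    for line in lines:
        m = UPDOWN_RE.search(line)
        # Only the %LINK messages describe the physical link itself.
        if m is None or m["facility"] != "LINK":
            continue
        ifname, state = m["ifname"], m["state"]
        if last_state.get(ifname) == "down" and state == "up":
            flaps[ifname] = flaps.get(ifname, 0) + 1
        last_state[ifname] = state
    return flaps


sample = [
    "%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/5, changed state to down",
    "%LINK-3-UPDOWN: Interface GigabitEthernet0/5, changed state to down",
    "%LINK-3-UPDOWN: Interface GigabitEthernet0/5, changed state to up",
    "%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/5, changed state to up",
]
print(count_flaps(sample))
```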


It appears that the iSCSI channel *resets* and starts a port down+up
cycle. We have changed the cable and the switch, and we still get the
same error.

The Infortrend array is logging nothing and the official support people 
have no idea about this issue :-/

We believe that the source of the problem is a single server. When we 
move this server to a different iSCSI channel we get the same error 
there, and the channel where it previously was starts working as 
expected, with no interface resets.

Anyone could say that something in that faulty server is making the 
interface reset; but we've checked it several times and we really 
believe that the server is configured as the other 16 we have attached 
to the array.

The switch connecting the servers and the array is a Cisco Catalyst
2960G.

Has anyone ever experienced anything similar?

Regards,

[1] http://www.infortrend.com/main/2_product/es_a16e-g2130-4.asp

-- 
Santi Saez
http://woop.es







Workaround: Filesystem corruption using iser transport

2009-02-04 Thread Dr. Volker Jaenisch

Thank you Ulrich!

Ulrich Windl wrote:
 Well, if you think the problem is transport related, and you are doing
 performance evaluation anyway, you could try to use iozone: I think it
 has a checksumming/verification feature. Maybe this helps to reveal the
 problems. Maybe even write your own code (e.g. write some long piece of
 data to a partition directly, using no file system, then read it back
 and compare it byte by byte).
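
The write/read-back/compare test suggested above could look roughly like this Python sketch. The helper names, the seed and the chunk size are made up for illustration; the target path must already exist (a block device or a pre-created file). Note one important caveat: a read-back right after the write may be served from the page cache, so to really exercise the transport the cache would have to be dropped or bypassed first, which this sketch does not do.

```python
import hashlib
import os


def pattern(index, size, seed=b"iser-test"):
    """Deterministic pseudo-random bytes for chunk number `index`."""
    return hashlib.shake_256(seed + index.to_bytes(8, "big")).digest(size)


def write_pattern(path, total_bytes, chunk_size=1 << 20):
    """Fill the first total_bytes of an existing file or device."""
    with open(path, "r+b") as f:
        offset = 0
        index = 0
        while offset < total_bytes:
            size = min(chunk_size, total_bytes - offset)
            f.write(pattern(index, size))
            offset += size
            index += 1
        f.flush()
        os.fsync(f.fileno())  # push the data out of the page cache


def verify_pattern(path, total_bytes, chunk_size=1 << 20):
    """Return the byte offsets of all chunks that do not match."""
    bad = []
    with open(path, "rb") as f:
        offset = 0
        index = 0
        while offset < total_bytes:
            size = min(chunk_size, total_bytes - offset)
            if f.read(size) != pattern(index, size):
                bad.append(offset)
            offset += size
            index += 1
    return bad
```

A run would call `write_pattern(path, n)` followed by `verify_pattern(path, n)`; an empty list means every chunk was read back intact.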
   
We dug deeper into this and found the following surprising results:

* Whether the error occurs depends on the amount of available RAM on the
  initiator machine.
* We regularly boot our machines limited to 256 MB RAM for I/O testing.
* If we give the machine 4 GB RAM, the error vanishes.

The error itself, however, does not depend on the amount of data moved.
With 256 MB RAM on the initiator, tiotest -f 1200 -t4, which uses 4.8 GB
of I/O space, fails.
With 4 GB of RAM, not a single error occurs, even with a 100 GB iSCSI
device and tiotest on 50 GB.
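
As a quick check of the numbers above, the following Python sketch compares the amount of test data with the initiator's RAM in both cases. It assumes that -f gives the file size per thread in MB (which matches the 4.8 GB stated above) and that 1 GB = 1024 MB; the labels are only illustrative.

```python
def data_to_ram_ratio(data_mb, ram_mb):
    """How many times the test data exceeds the initiator's RAM."""
    return data_mb / ram_mb


threads, file_mb = 4, 1200           # tiotest -f 1200 -t4
total_mb = threads * file_mb         # 4800 MB of I/O space

cases = {
    "256 MB RAM, 4800 MB test (fails)": (total_mb, 256),
    "4 GB RAM, 50 GB test (no errors)": (50 * 1024, 4 * 1024),
}
for label, (data_mb, ram_mb) in cases.items():
    ratio = data_to_ram_ratio(data_mb, ram_mb)
    print(f"{label}: data is {ratio:.2f}x RAM")
```

In both cases the data is more than ten times larger than RAM, so the test data exceeding RAM does not by itself separate the failing run from the clean one.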

Is there anybody out there with InfiniBand who could try to reproduce this
error? This may be of interest for the InfiniBand community as well as the
iSCSI community.

Our test system consists of two servers:

* Supermicro H8DME-2 board
* 2 x AMD Opteron Quadcore 2356
* 32 GB RAM
* HCA: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 2),
  tested back-to-back as well as over a Flextronics IB switch.

Identical BIOS, BIOS version and OS setup on the target and initiator side.

On the target side we use the current stgtd 0.9.3 compiled against the
Debian lenny IB dev libs.

Target (ATHENE):
tgtadm --lld iscsi --op new --mode target --tid 1 -T de.inqbus.athene:test
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/vg0/test
tgtadm --lld iscsi --op bind --mode target --tid 1 -I 10.1.3.0/24


Initiator (ARES):

ares:~# iscsi_discovery 10.1.3.32 -tiser
iscsiadm: No active sessions.
Set target de.inqbus.athene:test to automatic login over iser to portal 10.1.3.32:3260
Logging out of session [sid: 3, target: de.inqbus.athene:test, portal: 10.1.3.32,3260]
Logout of [sid: 3, target: de.inqbus.athene:test, portal: 10.1.3.32,3260]: successful
discovered 1 targets at 10.1.3.32
ares:~# iscsi_discovery 10.1.3.32 -tiser -l
iscsiadm: No active sessions.
Set target de.inqbus.athene:test to automatic login over iser to portal 10.1.3.32:3260
discovered 1 targets at 10.1.3.32
ares:~# mkfs.ext3 /dev/disk/by-path/ip-10.1.3.32:3260-iscsi-de.inqbus.athene:test-lun-1
ares:~# mount /dev/disk/by-path/ip-10.1.3.32:3260-iscsi-de.inqbus.athene:test-lun-1 /mnt/test
ares:~# cd /mnt/test
ares:/mnt/test# tiotest -f1200 -t4
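Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        4800 MBs |   22.2 s | 215.801 MB/s |  -5.1 %  | 423.4 % |
| Random Write   16 MBs |    0.1 s | 163.298 MB/s |   8.4 %  | 255.0 % |
| Read         4800 MBs |    9.7 s | 492.869 MB/s |   6.1 %  | 134.5 % |
| Random Read    16 MBs |    0.1 s | 191.520 MB/s | -29.4 %  | 181.4 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.059 ms |     2285.150 ms |      0.8 |       0.0 |
| Random Write |        0.008 ms |        0.254 ms |      0.0 |       0.0 |
| Read         |        0.030 ms |      709.000 ms |      0.0 |       0.0 |
| Random Read  |        0.076 ms |        2.059 ms |      0.0 |       0.0 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.045 ms |     2285.150 ms |      0.4 |       0.0 |
`--------------+-----------------+-----------------+----------+-----------'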

ares:~# dmesg
[ 1069.687429] EXT3-fs error (device sdc): ext3_free_blocks_sb: bit already cleared for block 176656
[ 1069.698140] EXT3-fs error (device sdc): ext3_free_blocks_sb: bit
..
 
We use the current release candidate of Debian lenny. Open-iscsi, the
kernel and the kernel modules are from Debian. The user-space IB tools are
home-built from OFED 1.3. The error also reproduces with OFED 1.4.

Last question: where is the best place to report this error?

Any help welcome.

Best Regards

Volker


-- 

   inqbus it-consulting  +49 ( 341 )  5643800
   Dr.  Volker Jaenisch  http://www.inqbus.de
   Herloßsohnstr.12  0 4 1 5 5Leipzig
   N  O  T -  F Ä L L E  +49 ( 170 )  3113748


