Re: Very strange problem with an Infortrend A16E iSCSI storage array
Hi Mike,

On 3/2/09 20:19, Mike Christie wrote:

>> *Randomly*, one of these channels resets, making the 4 servers
>> connected to the channel time out. The other 3 channels are not
>> affected at all.
>> (..)
>
> The initiator sends an iSCSI ping every X seconds. If we do not get a
> response in Y seconds we drop the session (drop connection and relogin).

Yes, we were aware of this bug. In fact, you helped us with it not too
long ago:

http://tinyurl.com/cywy3j

> There was a bug in the initiator where we would spit out this timeout
> error by accident. What kernel are you using? Are you using the iscsi
> modules in the kernel or modules from an open-iscsi.org release, and
> which release of open-iscsi.org?

# iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-724
iscsiadm version 2.0-868
Target: iqn.2002-10.com.infortrend:raid.sn7457155.30
    Current Portal: 10.15.17.133:3260,1
    Persistent Portal: 10.15.17.133:3260,1
        ** Interface: **
        Iface Name: default
        Iface Transport: tcp
        Iface Initiatorname: iqn.2001-05.net.example:vz11
        Iface IPaddress: 10.15.17.137
        Iface HWaddress: default
        Iface Netdev: default
        SID: 2
        iSCSI Connection State: LOGGED IN
        iSCSI Session State: Unknown
        Internal iscsid Session State: NO CHANGE
        Negotiated iSCSI params:
        HeaderDigest: None
        DataDigest: None
        MaxRecvDataSegmentLength: 131072
        MaxXmitDataSegmentLength: 65536
        FirstBurstLength: 65536
        MaxBurstLength: 262144
        ImmediateData: Yes
        InitialR2T: No
        MaxOutstandingR2T: 1
        Attached SCSI devices:
        Host Number: 2 State: running
        scsi2 Channel 00 Id 0 Lun: 0
            Attached scsi disk sdb State: running

We're using CentOS 5.2 with the default iscsi-initiator-utils package:

# rpm -qa iscsi-initiator-utils
iscsi-initiator-utils-6.2.0.868-0.7.el5

We are also using the default iSCSI modules.

>> connection4:0: iscsi: detected conn error (1011)
>> session4: iscsi: session recovery timed out after 120 secs
>
> I do not think it is the bug, because you would normally log right
> back in.
>
> The recovery timed out error means that the initiator tried to log
> back in for 120 seconds and during that time we could not
> reconnect/relogin. I think this makes sense when looking at the switch
> messages below. If something causes the link to go down, the iscsi
> ping would fail/timeout. I am not sure if the iscsi layer dropping the
> session would cause the link to go down/up.

The link that goes down/up isn't the link between the switch and the
host; the affected link is the one between the *switch and the array*,
which is very strange. It appears that some iSCSI client is doing
something that makes the iSCSI interface in the array reset.

I think it's not a problem with Open-iSCSI but an Infortrend array bug,
but perhaps someone can shed some light on this problem. As I said, when
this occurs it affects all servers connected to this iSCSI
interface/channel, including Windows hosts, etc.

Regards,

--
Santi Saez
http://woop.es

You received this message because you are subscribed to the Google Groups open-iscsi group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
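The two timeouts discussed in this message (the ping that is sent every X seconds and dropped after Y seconds without a reply, and the 120-second session recovery window) can be illustrated with a small C model. This is only a sketch under stated assumptions, not code from open-iscsi: the struct, the enum, and the function names are hypothetical, and all times are plain seconds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the timeouts discussed in the thread.
 * This is NOT taken from the open-iscsi sources. Times are in seconds. */
enum ping_action {
    PING_NOTHING,    /* connection looks healthy, nothing to do        */
    PING_SEND,       /* idle for a whole interval: send a ping         */
    PING_CONN_ERROR  /* ping went unanswered: report a connection error */
};

struct conn_times {
    uint64_t last_rx;      /* when the last PDU was received      */
    uint64_t last_ping;    /* when the last ping was sent         */
    bool     ping_pending; /* a ping is still waiting for a reply */
};

/* interval is "X" and timeout is "Y" from the explanation above. */
enum ping_action check_ping(const struct conn_times *c, uint64_t now,
                            uint64_t interval, uint64_t timeout)
{
    if (c->ping_pending)
        return (now - c->last_ping >= timeout) ? PING_CONN_ERROR
                                               : PING_NOTHING;
    return (now - c->last_rx >= interval) ? PING_SEND : PING_NOTHING;
}

/* After a connection error the initiator keeps retrying the login.
 * Once recovery_timeout seconds pass without success, recovery has
 * expired (the "session recovery timed out" message in the log). */
bool recovery_expired(uint64_t error_at, uint64_t now,
                      uint64_t recovery_timeout)
{
    return now - error_at >= recovery_timeout;
}
```

With interval and timeout both set to 5 and a recovery timeout of 120, a link that stays down for more than about two minutes would reach the point where recovery_expired() returns true, which matches the sequence of log messages quoted above.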
EqualLogic load-balancing logout/re-login behavior (asynchronous event logout)
In the 2.0-865 version, when we received an ISCSI_ASYNC_MSG_REQUEST_LOGOUT
we would log out and then retry logging back in:

    Jul 28 20:15:40 iscsid: Target requests logout within 3 seconds for connection
    Jul 28 20:15:45 iscsid: connection5:0 is operational after recovery (2 attempts)

And we would have a short hiccup (5 seconds) of the connection being
gone. As I understand it, this was a mechanism for the EqualLogic box to
move a session to a different port (re-establishing allegiance), which
allows load balancing.

In 2.0-869, the git commit 052d014485d2ce5bb7fa8dd0df875dafd1db77df
changed this behavior so that we now actually log out and delete the
session. No more retries.

2.0-865:

    static int iscsi_xmit_mtask(struct iscsi_conn *conn)
    {
            struct iscsi_hdr *hdr = conn->mtask->hdr;
            int rc, was_logout = 0;

            spin_unlock_bh(&conn->session->lock);
            if ((hdr->opcode & ISCSI_OPCODE_MASK) == ISCSI_OP_LOGOUT) {
                    conn->session->state = ISCSI_STATE_IN_RECOVERY;
                    iscsi_block_session(session_to_cls(conn->session));
            ...

2.0-869:

    static int iscsi_xmit_mtask(struct iscsi_conn *conn)
    {
            struct iscsi_hdr *hdr = conn->mtask->hdr;
            int rc;

            if ((hdr->opcode & ISCSI_OPCODE_MASK) == ISCSI_OP_LOGOUT)
                    conn->session->state = ISCSI_STATE_LOGGING_OUT;
            spin_unlock_bh(&conn->session->lock);
            ..

and..

            if (conn->session->state == ISCSI_STATE_LOGGING_OUT) {
                    iscsi_free_mgmt_task(conn, conn->mtask);
                    conn->mtask = NULL;
                    continue;
            }

This comes down to 2.0-869 terminating the session without trying to
log back in. With the EqualLogic boxes that means we never reconnect.

So my question is: was this change intentional? If so, does EqualLogic
know this, and have they changed their firmware to send
ISCSI_ASYNC_MSG_DROPPING_CONNECTION (0x02) instead?
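For readers who do not have the event codes at hand, the following C sketch lists the AsyncEvent values from the iSCSI specification (RFC 3720, Asynchronous Message) and models the difference reported in this message. It is a minimal illustration only: the enum names are local to the sketch, and logout_policy is a hypothetical switch that stands in for the two releases as described above, not a real open-iscsi setting.

```c
#include <stdbool.h>
#include <stdint.h>

/* AsyncEvent codes from RFC 3720 (Asynchronous Message PDU).
 * The enum names are local to this sketch. */
enum async_event {
    ASYNC_SCSI_EVENT               = 0,
    ASYNC_REQUEST_LOGOUT           = 1,
    ASYNC_DROPPING_CONNECTION      = 2,
    ASYNC_DROPPING_ALL_CONNECTIONS = 3,
    ASYNC_PARAM_NEGOTIATION        = 4
};

/* HYPOTHETICAL: stands in for the two behaviours reported in the
 * message above; it is not an option that open-iscsi exposes. */
enum logout_policy {
    POLICY_RELOGIN, /* reported for 2.0-865: log out, then log back in */
    POLICY_DELETE   /* reported for 2.0-869: log out, session deleted  */
};

/* Human-readable name for an AsyncEvent code. */
const char *async_event_name(uint8_t event)
{
    switch (event) {
    case ASYNC_SCSI_EVENT:               return "SCSI async event";
    case ASYNC_REQUEST_LOGOUT:           return "target requests logout";
    case ASYNC_DROPPING_CONNECTION:      return "target will drop connection";
    case ASYNC_DROPPING_ALL_CONNECTIONS: return "target will drop all connections";
    case ASYNC_PARAM_NEGOTIATION:        return "target requests negotiation";
    default:                             return "unknown or vendor specific";
    }
}

/* True if the session is expected to come back after the target has
 * asked for a logout, under the given (hypothetical) policy. */
bool session_returns_after_logout_request(enum logout_policy policy)
{
    return policy == POLICY_RELOGIN;
}
```

Under POLICY_DELETE the function returns false, which corresponds to the "we never reconnect" symptom seen with the EqualLogic boxes.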
Re: EqualLogic load-balancing logout/re-login behavior (asynchronous event logout)
On Wed, Feb 04, 2009 at 04:33:13PM -0500, Konrad Rzeszutek wrote:
> In the 2.0-865 version, when we received an ISCSI_ASYNC_MSG_REQUEST_LOGOUT
> we would log out and then retry logging back in:
>
>     Jul 28 20:15:40 iscsid: Target requests logout within 3 seconds for connection
>     Jul 28 20:15:45 iscsid: connection5:0 is operational after recovery (2 attempts)
>
> And we would have a short hiccup (5 seconds) of the connection being
> gone. As I understand it, this was a mechanism for the EqualLogic box
> to move a session to a different port (re-establishing allegiance),
> which allows load balancing.
>
> In 2.0-869, the git commit 052d014485d2ce5bb7fa8dd0df875dafd1db77df
> changed this

The right git commit was:

commit b3a7ea8d50f6028964b468d13a095dfb2508b2fb
Author: Mike Christie <micha...@cs.wisc.edu>
Date:   Thu Dec 13 12:43:26 2007 -0600

    [SCSI] libiscsi: do not block session during logout

    There is not need to block the session during logout. Since we are
    going to fail the commands that were blocked just fail them
    immediately instead.
RE: Very strange problem with an Infortrend A16E iSCSI storage array
You should be able to turn up the logging verbosity on the switch quite
a bit. If the switch is making the choice to disconnect the port, the
higher log levels should show why. If it's the Infortrend making the
choice, the switch probably won't show much more than what you've got,
as it will just see the link go down.

If you're not already doing so, make sure the switch is sending its log
messages to a syslog server so you don't miss them!

Regards,
T

-----Original Message-----
From: open-iscsi@googlegroups.com [mailto:open-is...@googlegroups.com] On Behalf Of Santi Saez
Sent: Wednesday, 4 February 2009 3:55 AM
To: open-iscsi@googlegroups.com
Subject: Very strange problem with an Infortrend A16E iSCSI storage array

Hi,

We have a very strange problem with an Infortrend A16E iSCSI storage
array [1]. I think it's not an Open-iSCSI related problem, but someone
here may shed some light :-)

This array has 4 iSCSI interfaces to distribute/balance Ethernet
traffic. There are 16 hosts connected to this array via iSCSI, with 4
hosts per channel/interface.

*Randomly*, one of these channels resets, making the 4 servers connected
to the channel time out. The other 3 channels are not affected at all.
Open-iSCSI logs this:

    ping timeout of 5 secs expired, last rx 502453156, last ping 502446907, now 502463156
    connection4:0: iscsi: detected conn error (1011)
    session4: iscsi: session recovery timed out after 120 secs
    iscsi: cmd 0x28 is not queued (8)
    iscsi: cmd 0x28 is not queued (8)
    iscsi: cmd 0x28 is not queued (8)
    sd 4:0:0:0: SCSI error: return code = 0x0001
    end_request: I/O error, dev sdc, sector 338694423
    (..)

The switch port where it is connected shows:

    %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/5, changed state to down
    %LINK-3-UPDOWN: Interface GigabitEthernet0/5, changed state to down
    %LINK-3-UPDOWN: Interface GigabitEthernet0/5, changed state to up
    %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/5, changed state to up

It looks as if the iSCSI channel *resets* and starts a port down+up
cycle. We have changed the cable and the switch, and we still get the
same error. The Infortrend array logs nothing, and the official support
people have no idea about this issue :-/

We believe that the source of the problem is a single server. When we
move this server to a different iSCSI channel we get the same error
there, and the channel where it previously was starts working as
expected, with no interface resets.

One could say that something in that faulty server is making the
interface reset, but we've checked it several times and we really
believe that the server is configured like the other 16 we have attached
to the array.

The switch connecting the servers and the array is a Cisco Catalyst
2960G.

Has anyone ever experienced anything similar?

Regards,

[1] http://www.infortrend.com/main/2_product/es_a16e-g2130-4.asp

--
Santi Saez
http://woop.es
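The counters in the "ping timeout" log line quoted in this message look like kernel tick values, so the gaps between them can be turned into time. The C sketch below does that arithmetic. It rests on an assumption that is not stated anywhere in the thread: a tick rate of 1000 per second (ASSUMED_HZ). The function names are hypothetical.

```c
#include <stdint.h>

/* ASSUMPTION: tick rate (HZ) of the initiator's kernel. The thread does
 * not state it; 1000 is used here purely for illustration. */
#define ASSUMED_HZ 1000u

/* Ticks elapsed between two counter values taken from the log line. */
uint64_t ticks_between(uint64_t later, uint64_t earlier)
{
    return later - earlier;
}

/* Milliseconds that a tick count represents at ASSUMED_HZ. */
uint64_t ticks_to_ms(uint64_t ticks)
{
    return ticks * 1000u / ASSUMED_HZ;
}
```

For the values in the log, ticks_between(502463156, 502453156) gives 10000 ticks since the last receive, and ticks_between(502463156, 502446907) gives 16249 ticks since the last ping. If the assumed tick rate is right, nothing had been received for 10 seconds when the error fired, which would be consistent with a 5-second ping interval followed by the 5-second ping timeout that the message reports. The interval value is also an assumption.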
Workaround: Filesystem corruption using iser transport
Thank you Ulrich!

Ulrich Windl wrote:
> Well, if you think the problem is transport related, and you are doing
> performance evaluation anyway, you could try to use iozone: I think it
> has a checksumming/verification feature. Maybe this helps to reveal
> the problems. Maybe even write your own code (e.g. write some long
> piece of data to a partition directly, using no file system, then read
> it back and compare it byte-by-byte).

We dug deeper into the matter and found the following surprising
results:

* The occurrence of the error depends on the amount of available RAM on
  the initiator machine.
* We regularly boot our machines limited to 256 MB RAM for I/O testing.
* But if we give the machine 4 GB RAM the error vanishes.

The error itself does not depend on the amount of data moved. With
256 MB RAM on the initiator, tiotest -f 1200 -t4, which uses 4.8 GB of
I/O space, fails. With 4 GB of RAM, even with an iSCSI device of 100 GB
and tiotest on 50 GB, not a single error occurs.

Is there anybody out there with IB who can reproduce this error? This
may be of interest to the InfiniBand community as well as the iSCSI
community.

Our test system consists of two servers:

* Supermicro H8DME-2 board
* 2 x AMD Opteron Quadcore 2356
* 32 GB RAM
* HCA: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 2)

tested back-to-back as well as over a Flextronics IB switch. BIOS, BIOS
version, and OS setup are identical on the target and initiator sides.

On the target side we use the current stgtd 0.9.3 compiled against the
Debian lenny IB dev libs.

Target (ATHENE):

    tgtadm --lld iscsi --op new --mode target --tid 1 -T de.inqbus.athene:test
    tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/vg0/test
    tgtadm --lld iscsi --op bind --mode target --tid 1 -I 10.1.3.0/24

Initiator (ARES):

    ares:~# iscsi_discovery 10.1.3.32 -tiser
    iscsiadm: No active sessions.
    Set target de.inqbus.athene:test to automatic login over iser to portal 10.1.3.32:3260
    Logging out of session [sid: 3, target: de.inqbus.athene:test, portal: 10.1.3.32,3260]
    Logout of [sid: 3, target: de.inqbus.athene:test, portal: 10.1.3.32,3260]: successful
    discovered 1 targets at 10.1.3.32

    ares:~# iscsi_discovery 10.1.3.32 -tiser -l
    iscsiadm: No active sessions.
    Set target de.inqbus.athene:test to automatic login over iser to portal 10.1.3.32:3260
    discovered 1 targets at 10.1.3.32

    ares:~# mkfs.ext3 /dev/disk/by-path/ip-10.1.3.32:3260-iscsi-de.inqbus.athene:test-lun-1
    ares:~# mount /dev/disk/by-path/ip-10.1.3.32:3260-iscsi-de.inqbus.athene:test-lun-1 /mnt/test
    ares:~# cd /mnt/test
    ares:/mnt/test# tiotest -f1200 -t4
    Tiotest results for 4 concurrent io threads:
    ,-------------------------------------------------------------------.
    | Item                  | Time   | Rate         | Usr CPU | Sys CPU |
    +-----------------------+--------+--------------+---------+---------+
    | Write        4800 MBs | 22.2 s | 215.801 MB/s |  -5.1 % | 423.4 % |
    | Random Write   16 MBs |  0.1 s | 163.298 MB/s |   8.4 % | 255.0 % |
    | Read         4800 MBs |  9.7 s | 492.869 MB/s |   6.1 % | 134.5 % |
    | Random Read    16 MBs |  0.1 s | 191.520 MB/s | -29.4 % | 181.4 % |
    `-------------------------------------------------------------------'
    Tiotest latency results:
    ,-------------------------------------------------------------------------.
    | Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
    +--------------+-----------------+-----------------+----------+-----------+
    | Write        |        0.059 ms |     2285.150 ms |      0.8 |       0.0 |
    | Random Write |        0.008 ms |        0.254 ms |      0.0 |       0.0 |
    | Read         |        0.030 ms |      709.000 ms |      0.0 |       0.0 |
    | Random Read  |        0.076 ms |        2.059 ms |      0.0 |       0.0 |
    |--------------+-----------------+-----------------+----------+-----------|
    | Total        |        0.045 ms |     2285.150 ms |      0.4 |       0.0 |
    `--------------+-----------------+-----------------+----------+-----------'

    ares:~# dmesg
    [ 1069.687429] EXT3-fs error (device sdc): ext3_free_blocks_sb: bit already cleared for block 176656
    [ 1069.698140] EXT3-fs error (device sdc): ext3_free_blocks_sb: bit ..

We use the current release candidate of Debian lenny. Open-iscsi, kernel
and kernel modules are from Debian. The user-space IB tools are
home-built from OFED 1.3. The error also reproduces with OFED 1.4.

Last question: where is the best place to file this error?

Any help welcome.

Best Regards

Volker

--
inqbus it-consulting      +49 ( 341 ) 5643800
Dr. Volker Jaenisch       http://www.inqbus.de
Herloßsohnstr. 12         04155 Leipzig
Emergencies               +49 ( 170 ) 3113748
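The test Ulrich suggests in the quoted text (write a long, known piece of data without a file system, read it back, and compare byte by byte) could look roughly like the following C sketch. It is an illustration under stated assumptions rather than a finished tool: the function names, the 4096-byte block size, and the pattern formula are all made up for this example, and only standard C stdio is used.

```c
#include <stddef.h>
#include <stdio.h>

#define BLOCK_SIZE 4096UL /* arbitrary block size chosen for this sketch */

/* Fill one block with bytes derived from the block number and the
 * position, so that a block landing in the wrong place is detected. */
static void fill_block(unsigned char *buf, unsigned long block_no)
{
    size_t i;
    for (i = 0; i < BLOCK_SIZE; i++)
        buf[i] = (unsigned char)((block_no * 131UL + i * 7UL) & 0xffUL);
}

/* Write nblocks pattern blocks to path. Returns 0 on success, -1 on error.
 * WARNING: this overwrites whatever is stored at path. */
int write_pattern(const char *path, unsigned long nblocks)
{
    unsigned char buf[BLOCK_SIZE];
    unsigned long b;
    FILE *f = fopen(path, "wb");

    if (f == NULL)
        return -1;
    for (b = 0; b < nblocks; b++) {
        fill_block(buf, b);
        if (fwrite(buf, 1, BLOCK_SIZE, f) != BLOCK_SIZE) {
            fclose(f);
            return -1;
        }
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Read the blocks back and compare them byte by byte.
 * Returns -1 if everything matches, the byte offset of the first
 * mismatch otherwise, or -2 on an I/O error. */
long verify_pattern(const char *path, unsigned long nblocks)
{
    unsigned char expect[BLOCK_SIZE], got[BLOCK_SIZE];
    unsigned long b;
    size_t i;
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return -2;
    for (b = 0; b < nblocks; b++) {
        fill_block(expect, b);
        if (fread(got, 1, BLOCK_SIZE, f) != BLOCK_SIZE) {
            fclose(f);
            return -2;
        }
        for (i = 0; i < BLOCK_SIZE; i++) {
            if (got[i] != expect[i]) {
                fclose(f);
                return (long)(b * BLOCK_SIZE + i);
            }
        }
    }
    fclose(f);
    return -1;
}
```

In a real run, path would be the iSCSI device node and nblocks would be chosen to cover several gigabytes. Such a run would also need to make sure that the read pass is served by the target and not by the initiator's page cache, for example by dropping caches or logging out and back in between the two passes. That step is outside this sketch.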