Re: open-iscsi possibly causing servers to hang (HP blade, CentOS, MSA2012i)

2009-11-16 Thread Niels Callesøe

On Nov 11, 5:14 pm, Mike Christie  wrote:
> Niels Callesøe wrote:
> > Hello group
>
> > I am running a number of HP blade servers in a C7200 enclosure.
> > Several of them have access to individual LUNs on an MSA 2012i using
> > open-iscsi. Recently, however, I have experienced unexplained hangs of
> > the servers in question and the only apparent thing they have in
> > common (besides being blade servers) is that they have access to the
> > IP-SAN.
>
> > When the servers fail, they do so in a fashion where they will still
> > respond to, for example, ping requests. But they refuse to respond to
> > higher level access, such as spawning a shell for login. This means
> > that when the error occurs, I cannot even log into the machines to
> > troubleshoot the problem (regardless of remote or local login), even
> > though the console greeting is printed readily.
>
> > My question is primarily whether this sounds like something the iscsi-
> > driver could cause and, equally importantly, how one would go about
> > troubleshooting the issue. One thing that makes it particularly
> > elusive is that I cannot seem to provoke the error state and it does
> > not occur very often (at least not while the platform is not yet in
> > full production).
>
> > Possibly relevant information follows:
>
> > OS: centos-release-5-3.el5.centos.1
> > iscsi version: iscsid version 2.0-868
> > MSA: Current Storage Controller Code Version J210P12
>
> > I can upgrade all three to more recent versions, and have started
> > doing so. However, those were the versions running when the problem
> > last occurred -- and since I cannot provoke it, I have no real way of
> > knowing if version upgrades will solve the issue (unless someone in
> > this group can confirm that they will, of course).
>
> It could be iscsi. Are you using multipath and do you know if there are
> path failures when the system hangs? Is there anything in the log?

I am using multipath, I believe, as I can access either of the MSA
controllers via either of two Gbit interfaces on the blades. I'll
paste what I believe to be the relevant lines from messages below.

Other than the startup messages, as best I can tell there is nothing
else relevant in the logs. I do have logs of what happens during a
network failure, which I'll paste below also, but this failure at
least did not cause the machine to hang. I suspect that whatever
causes the hang also prevents writing to the log...

I can, of course, induce almost any kind of failure on one or both of
the links if you think that will help troubleshooting.
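
For checking the logs around the time of a hang, a minimal Python sketch along these lines might help. It scans syslog-style lines for messages that could indicate a path failure within a window before the hang. The message fragments in PATTERNS are assumptions about what the kernel, iscsid and multipathd log on a given setup, and the function names are hypothetical; adjust both to match what actually shows up in messages.

```python
import re
import sys
from datetime import datetime, timedelta

# ASSUMPTION: these fragments are guesses at typical path-failure messages.
# Replace them with whatever your kernel/iscsid/multipathd really logs.
PATTERNS = [
    re.compile(r"detected conn error"),
    re.compile(r"remaining active paths"),
    re.compile(r"NIC .* Link is Down"),
]

# Classic syslog prefix: "Nov  9 13:42:39 hostname rest-of-message"
SYSLOG_LINE = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")


def parse_line(line, year):
    """Return (timestamp, host, message), or None for lines without a syslog prefix."""
    m = SYSLOG_LINE.match(line)
    if m is None:
        return None
    month, day, clock, host, msg = m.groups()
    # Syslog omits the year, so the caller has to supply it.
    stamp = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return stamp, host, msg


def find_path_events(lines, hang_time, window=timedelta(minutes=10), year=2009):
    """Collect suspicious messages logged in the window leading up to hang_time."""
    hits = []
    for line in lines:
        parsed = parse_line(line.rstrip("\n"), year)
        if parsed is None:
            continue  # e.g. wrapped continuation lines
        stamp, host, msg = parsed
        in_window = hang_time - window <= stamp <= hang_time
        if in_window and any(p.search(msg) for p in PATTERNS):
            hits.append((stamp, host, msg))
    return hits


if __name__ == "__main__":
    # Usage: script.py /var/log/messages "2009-11-09 13:55:00"
    path, when = sys.argv[1], sys.argv[2]
    hang = datetime.strptime(when, "%Y-%m-%d %H:%M:%S")
    with open(path, errors="replace") as fh:
        for stamp, host, msg in find_path_events(fh, hang, year=hang.year):
            print(stamp, host, msg)
```

If this prints nothing for the minutes before a hang, that would fit the suspicion that whatever causes the hang also stops the log from being written.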

> If there is nothing in the log at the time of the hang, could you hook
> up a serial line? I am hoping an oops will get spit out at the time of
> the hang.

I can attach a remote console, if that will do the trick? Usually I
only open one to attempt login after something goes wrong and prevents
ssh, but I should be able to open one and just keep it there to watch
for any console dumpage. Or perhaps I am misunderstanding you?
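
If keeping a console open is the way to go, something like the following Python sketch could record whatever it prints, so an oops is not lost to scrollback. It assumes the console output can be piped into the script on a separate machine; the capture function and the output file name are hypothetical, and how the remote console gets attached is left out because it depends on the management interface.

```python
import os
import sys
import time


def capture(stream, out, clock=time.time):
    """Copy lines from stream to out, prefixing each with a local timestamp.

    Every line is flushed (and fsynced where the target supports it) right
    away, so the record survives if the capturing session is cut off.
    Returns the number of lines written.
    """
    count = 0
    for line in stream:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        out.write(f"{stamp} {line.rstrip()}\n")
        out.flush()
        try:
            os.fsync(out.fileno())
        except (AttributeError, OSError, ValueError):
            pass  # in-memory buffers and some pipes cannot be fsynced
        count += 1
    return count


if __name__ == "__main__":
    # Usage: <command that prints the console> | script.py console.log
    with open(sys.argv[1], "a") as log:
        capture(sys.stdin, log)
```

The timestamps come from the capturing machine, not the hung server, so they only show when the text arrived.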

On to the log-dumps:

>>> iscsi starting (previous) <<<
Nov  9 13:42:39 promethium kernel: Loading iSCSI transport class v2.0-724.
Nov  9 13:42:39 promethium kernel: iscsi: registered transport (tcp)
Nov  9 13:42:39 promethium kernel: iscsi: registered transport (iser)
Nov  9 13:42:39 promethium kernel: bnx2: eth0: using MSI
Nov  9 13:42:39 promethium kernel: bnx2: eth0 NIC SerDes Link is Up, 1000 Mbps full duplex
Nov  9 13:42:39 promethium kernel: bnx2: eth1: using MSI
Nov  9 13:42:39 promethium kernel: bnx2: eth1 NIC SerDes Link is Up, 1000 Mbps full duplex
Nov  9 13:42:39 promethium kernel: bnx2: eth2: using MSI
Nov  9 13:42:39 promethium kernel: bnx2: eth2 NIC SerDes Link is Up, 1000 Mbps full duplex
Nov  9 13:42:39 promethium kernel: bnx2: eth3: using MSI
Nov  9 13:42:39 promethium kernel: bnx2: eth3 NIC SerDes Link is Up, 1000 Mbps full duplex
Nov  9 13:42:39 promethium kernel: scsi0 : iSCSI Initiator over TCP/IP
Nov  9 13:42:39 promethium kernel: scsi1 : iSCSI Initiator over TCP/IP
Nov  9 13:42:39 promethium kernel: scsi2 : iSCSI Initiator over TCP/IP
Nov  9 13:42:39 promethium kernel: scsi3 : iSCSI Initiator over TCP/IP
Nov  9 13:42:39 promethium kernel:   Vendor: HP  Model: MSA2012i  Rev: J210
Nov  9 13:42:39 promethium kernel:   Type: Enclosure  ANSI SCSI revision: 05
Nov  9 13:42:39 promethium kernel:   Vendor: HP  Model: MSA2012i  Rev: J210
Nov  9 13:42:39 promethium kernel:   Type: Enclosure  ANSI SCSI revision: 05
Nov  9 13:42:39 promethium kernel:   Vendor: HP  Model: MSA2012i  Rev: J210
Nov  9 13:42:39 promethium kernel:   Type: Enclosure  ANSI SCSI revision: 05
Nov  9 13:42:39 promethium kernel:   Vendor: HP  Model: MSA2012i  Rev: J210
Nov  9 13:42:39 promethium kernel:   Type: Enclosure  ANSI SCSI revision: 05
Nov  9 13:42:39 promethium kernel:   Vendor: HP  Model: MSA2012i

Re: open-iscsi possibly causing servers to hang (HP blade, CentOS, MSA2012i)

2009-11-11 Thread Mike Christie

Niels Callesøe wrote:
> Hello group
> 
> I am running a number of HP blade servers in a C7200 enclosure.
> Several of them have access to individual LUNs on an MSA 2012i using
> open-iscsi. Recently, however, I have experienced unexplained hangs of
> the servers in question and the only apparent thing they have in
> common (besides being blade servers) is that they have access to the
> IP-SAN.
> 
> When the servers fail, they do so in a fashion where they will still
> respond to, for example, ping requests. But they refuse to respond to
> higher level access, such as spawning a shell for login. This means
> that when the error occurs, I cannot even log into the machines to
> troubleshoot the problem (regardless of remote or local login), even
> though the console greeting is printed readily.
> 
> My question is primarily whether this sounds like something the iscsi-
> driver could cause and, equally importantly, how one would go about
> troubleshooting the issue. One thing that makes it particularly
> elusive is that I cannot seem to provoke the error state and it does
> not occur very often (at least not while the platform is not yet in
> full production).
> 
> Possibly relevant information follows:
> 
> OS: centos-release-5-3.el5.centos.1
> iscsi version: iscsid version 2.0-868
> MSA: Current Storage Controller Code Version J210P12
> 
> I can upgrade all three to more recent versions, and have started
> doing so. However, those were the versions running when the problem
> last occurred -- and since I cannot provoke it, I have no real way of
> knowing if version upgrades will solve the issue (unless someone in
> this group can confirm that they will, of course).
> 

It could be iscsi. Are you using multipath and do you know if there are 
path failures when the system hangs? Is there anything in the log?

If there is nothing in the log at the time of the hang, could you hook
up a serial line? I am hoping an oops will get spit out at the time of
the hang.
