Ahsan,
This link might be helpful in trying to diagnose and treat IB fabric issues:
http://docs.oracle.com/cd/E18476_01/doc.220/e18478/fabric.htm#CIHIHJGD
You might try resetting the problematic port, or just use port 2 for your
jobs as a quick workaround:
-mca btl_openib_if_include mlx4_0:2
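For example, a full command line might look something like this (the hostfile
name and the application are placeholders, not from your setup):
    mpirun -np 16 --hostfile myhosts \
        -mca btl_openib_if_include mlx4_0:2 ./my_mpi_app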
It seems that the network was not consistently wired.
Port DOWN means that the port was not wired (or that the cable is bad). Moreover,
on some nodes port 1 is connected, on others port 2.
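If the wiring really is mixed like that, one option (a sketch only; I believe
btl_openib_if_include accepts a comma-separated list of device:port pairs) is to
list both ports and let Open MPI use whichever one is active on each node:
    -mca btl_openib_if_include mlx4_0:1,mlx4_0:2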
My concern is that they are not connected to the same subnet. If you have at
least one port on each node connected to the
Hi Josh
It was my mistake. The status of the error-generating node is pasted below:
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0018:8b90:97fe:94fe
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state:
Dear Pasha
The ibstatus is not of two different machines; it is of the same machine.
There are two InfiniBand ports showing up on all nodes. I checked on all the
nodes that one of the ports is always in INIT status and the other one is active.
Now please see below the ibstatus of the problem-causing node.
Hmm, this does not make sense.
Your copy-and-paste shows that both machines (00 and 01) have the same guid/lid
(sort of the equivalent of a MAC address in the Ethernet world).
As you can guess, these two cannot be identical for two different machines
(unless you moved the card around).
Best,
Pasha
On Jul
Sayed,
You might try this link (or have your sysadmin do it if you do not have
admin privileges). To me it looks like your second port is in the "INIT"
state but has not yet been brought to ACTIVE by the subnet manager.
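A quick way to confirm whether a subnet manager is reachable is something like
the following (assuming the infiniband-diags tools are installed; the OpenSM
service name varies by distribution, so treat this as a sketch):
    # Query the subnet manager; an error here usually means no SM is running
    sminfo
    # If no SM is found, start one (as root) on a node attached to the fabric
    /etc/init.d/opensmd start    # or: systemctl start opensm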
And where can I find the run/job/submission?
On Mon, Jul 21, 2014 at 6:57 PM, Shamis, Pavel wrote:
>
> You have to check the port states on *all* nodes in the
> run/job/submission. Checking on a single node is not enough.
> My guess is that 01-00 tries to connect to 01-01 and the
Yes, I had checked by running mpirun on all nodes one by one to find the
problematic one. I had already mentioned that compute-01-01 is causing the
problem; when I remove it from the hostlist, mpirun works fine. Here is the
ibstatus of compute-01-01.
Infiniband device 'mlx4_0' port 1 status:
default
You have to check the port states on *all* nodes in the run/job/submission.
Checking on a single node is not enough.
My guess is that 01-00 tries to connect to 01-01 and the ports are down on 01-01.
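A minimal sketch of such a check, assuming passwordless ssh and a file
hosts.txt listing one node per line (the file name is a placeholder):
    # Show the state of every IB port on every node in the list
    for h in $(cat hosts.txt); do
        echo "== $h =="
        ssh "$h" "ibstatus | grep -E 'device|state'"
    done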
You may disable support for InfiniBand by adding --mca btl ^openib.
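For example (TCP will then be used instead of InfiniBand, so it will be slower,
but the job should run; the hostfile and application names are placeholders):
    mpirun -np 16 --hostfile myhosts --mca btl ^openib ./my_mpi_app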
Best,
Pavel (Pasha) Shamis
---
Dear All
I need your help to solve this cluster-related issue that causes mpirun to
malfunction. I get the following warning for some of the nodes, and then a
route failure message follows, causing mpirun to fail.
*WARNING: There is at least one OpenFabrics device found but there are no
active ports