Re: [OMPI users] Errors for openib, mpirun fails

2014-07-23 Thread Joshua Ladd
Ahsan, This link might be helpful in trying to diagnose and treat IB fabric issues: http://docs.oracle.com/cd/E18476_01/doc.220/e18478/fabric.htm#CIHIHJGD You might try resetting the problematic port, or just use port 2 for your jobs as a quick workaround: -mca btl_openib_if_include mlx4_0:2
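
To find out which port to name in btl_openib_if_include, it can help to pull the logical state of each port out of the ibstatus output. Below is a minimal sketch, assuming the usual ibstatus layout of a "port N status:" header followed by a "state:" line. The function names and the sample text are illustrative only and are not taken from the thread.

```python
import re

# Header printed by ibstatus for each port, e.g.
# "Infiniband device 'mlx4_0' port 1 status:"
PORT_HEADER = re.compile(r"Infiniband device '([^']+)' port (\d+) status:")
# Logical state, e.g. "state: 4: ACTIVE"; the lookbehind skips "phys state:".
STATE_LINE = re.compile(r"(?<!phys )state:\s*\d+:\s*(\w+)")


def parse_ibstatus(text):
    """Return {(device, port): state} parsed from ibstatus output."""
    states = {}
    headers = list(PORT_HEADER.finditer(text))
    for i, header in enumerate(headers):
        # Each port's section runs until the next header (or end of text).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        match = STATE_LINE.search(text, header.end(), end)
        if match:
            key = (header.group(1), int(header.group(2)))
            states[key] = match.group(1)
    return states


def if_include_value(text):
    """Build a btl_openib_if_include value from the ACTIVE ports only."""
    states = parse_ibstatus(text)
    active = [f"{dev}:{port}" for (dev, port), state in sorted(states.items())
              if state == "ACTIVE"]
    return ",".join(active)


# Illustrative sample (hypothetical, not copied from the thread).
SAMPLE = (
    "Infiniband device 'mlx4_0' port 1 status:\n"
    "\tstate:\t\t 1: DOWN\n"
    "\tphys state:\t 2: Polling\n"
    "Infiniband device 'mlx4_0' port 2 status:\n"
    "\tstate:\t\t 4: ACTIVE\n"
    "\tphys state:\t 5: LinkUp\n"
)

if __name__ == "__main__":
    print(if_include_value(SAMPLE))
```

The resulting string is what would be passed after -mca btl_openib_if_include on the mpirun command line. An empty string means no port on that node is ACTIVE.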

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-23 Thread Shamis, Pavel
It seems that the network was not consistently wired. Port DOWN means that the port was not wired (or has a bad cable). Moreover, on some nodes port 1 is connected, on others port 2. My concern is that they are not connected to the same subnet. If you have at least one port on each node connected to the

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-23 Thread Syed Ahsan Ali
Hi Josh, It was my mistake. The status of the error-generating node is pasted below: Infiniband device 'mlx4_0' port 1 status: default gid: fe80::::0018:8b90:97fe:94fe base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state:

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-23 Thread Syed Ahsan Ali
Dear Pasha, The ibstatus output is not from two different machines; it is from the same machine. There are two InfiniBand ports showing up on all nodes. I checked on all the nodes that one of the ports is always in INIT state and the other one is active. Now please see below the ibstatus of the problem-causing node

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-22 Thread Shamis, Pavel
Hmm, this does not make sense. Your copy-and-paste shows that both machines (00 and 01) have the same guid/lid (roughly the equivalent of a MAC address in the Ethernet world). As you can guess, these two cannot be identical for two different machines (unless you moved the card around). Best, Pasha On Jul

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-22 Thread Joshua Ladd
Sayed, You might try this link (or have your sysadmin do it if you do not have admin privileges.) To me it looks like your second port is in the "INIT" state but has not been added by the subnet manager.

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-22 Thread Syed Ahsan Ali
And where can I find run/job/submission? On Mon, Jul 21, 2014 at 6:57 PM, Shamis, Pavel wrote: > > You have to check the port states on *all* nodes in the > run/job/submission. Checking on a single node is not enough. > My guess is that 01-00 tries to connect to 01-01 and the

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-22 Thread Syed Ahsan Ali
Yes, I checked by running mpirun on all nodes one by one to find the problematic one. I had already mentioned that compute-01-01 is causing the problem; when I remove it from the hostlist, mpirun works fine. Here is the ibstatus of compute-01-01. Infiniband device 'mlx4_0' port 1 status: default

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-21 Thread Shamis, Pavel
You have to check the port states on *all* nodes in the run/job/submission. Checking on a single node is not enough. My guess is that 01-00 tries to connect to 01-01 and the ports are down on 01-01. You may disable support for InfiniBand by adding --mca btl ^openib. Best, Pavel (Pasha) Shamis ---
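
The advice above (check every node, and fall back to --mca btl ^openib if InfiniBand is not usable) can be sketched as a small decision helper. This is only a sketch under assumptions: the per-node port states are taken to be already collected into a dictionary, the function names are hypothetical, and the host data is made up for illustration. Note that a port number being ACTIVE on every node does not prove that those ports are on the same subnet.

```python
def hosts_without_active_port(states_by_host):
    """Return the hosts on which no port reports the ACTIVE state.

    states_by_host maps hostname -> {(device, port): state}.
    """
    return sorted(host for host, states in states_by_host.items()
                  if "ACTIVE" not in states.values())


def pick_mca_args(states_by_host):
    """Pick mpirun MCA arguments based on port states across all nodes.

    If some (device, port) is ACTIVE on every node, restrict openib to it.
    Otherwise disable the openib BTL entirely with "^openib".
    Assumption: matching port numbers are wired to the same subnet,
    which this check cannot verify.
    """
    common = None
    for states in states_by_host.values():
        active = {key for key, state in states.items() if state == "ACTIVE"}
        common = active if common is None else common & active
    if common:
        device, port = sorted(common)[0]
        return ["--mca", "btl_openib_if_include", f"{device}:{port}"]
    return ["--mca", "btl", "^openib"]


if __name__ == "__main__":
    # Illustrative data only (hypothetical states, not from the thread).
    cluster = {
        "compute-01-00": {("mlx4_0", 1): "INIT", ("mlx4_0", 2): "ACTIVE"},
        "compute-01-01": {("mlx4_0", 1): "DOWN", ("mlx4_0", 2): "DOWN"},
    }
    print(hosts_without_active_port(cluster))
    print(" ".join(pick_mca_args(cluster)))
```

The returned list is meant to be appended to the mpirun argument list. Hosts reported by the first function are candidates for removal from the hostlist until their ports are fixed.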

[OMPI users] Errors for openib, mpirun fails

2014-07-21 Thread Syed Ahsan Ali
Dear All, I need your help to solve a cluster-related issue that is causing mpirun to malfunction. I get the following warning for some of the nodes, and then a route failure message appears and mpirun fails. *WARNING: There is at least one OpenFabrics device found but there are no active ports