[OMPI users] problem with openmpi-1.8.2rc2r32288 on Solaris 10 Sparc

2014-07-23 Thread Siegmar Gross
Hi, today I installed openmpi-1.8.2rc2r32288 on my machines (Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE Linux 12.1 x86_64) with Sun C 5.12 and gcc-4.9.0. Unfortunately I have problems with both compilers on "Solaris 10 Sparc". My small program works as expected on "Solaris 10 x86_64" and

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-23 Thread Joshua Ladd
Ahsan, This link might be helpful in trying to diagnose and treat IB fabric issues: http://docs.oracle.com/cd/E18476_01/doc.220/e18478/fabric.htm#CIHIHJGD You might try resetting the problematic port, or just use port 2 for your jobs as a quick workaround: -mca btl_openib_if_include mlx4_0:2
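A minimal sketch of that workaround as a full command line (the process count, host file, and executable name here are placeholders, not taken from the thread):

  mpirun -np 16 --hostfile ./hosts \
         --mca btl_openib_if_include mlx4_0:2 \
         ./my_mpi_app

Restricting btl_openib_if_include to mlx4_0:2 keeps the openib BTL from ever selecting the DOWN port 1.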

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-23 Thread Shamis, Pavel
It seems that the network was not consistently wired. A port in the DOWN state means that the port is not wired (or has a bad cable). Moreover, on some nodes port 1 is connected, on others port 2. My concern is that they are not connected to the same subnet. If you have at least one port on each node connected to the
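One way to verify the wiring (a sketch; the node names are placeholders) is to compare the port state and SM LID that ibstat reports on every node, since ports on the same subnet should see the same subnet manager:

  for h in node01 node02 node03; do
      echo "== $h =="
      ssh $h "ibstat mlx4_0 | egrep 'Port [12]|State|SM lid'"
  done

If the ACTIVE ports report different SM LIDs (or zero), they are likely on different subnets or not managed at all.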

Re: [OMPI users] Salloc and mpirun problem

2014-07-23 Thread Ralph Castain
It's supposed to, so it sounds like we have a bug in the connection failover mechanism. I'll address it. On Jul 23, 2014, at 1:21 AM, Timur Ismagilov wrote: > Thanks, Ralph! > When I add --mca oob_tcp_if_include ib0 (where ib0 is the InfiniBand interface > from ifconfig) to

Re: [OMPI users] Salloc and mpirun problem

2014-07-23 Thread Timur Ismagilov
Thanks, Ralph! When I add --mca oob_tcp_if_include ib0 (where ib0 is the InfiniBand interface from ifconfig) to mpirun, it starts working correctly! Why doesn't Open MPI do this itself? Tue, 22 Jul 2014 11:26:16 -0700 from Ralph Castain: >Okay, the problem is that the connection back
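For reference, the working invocation looks roughly like this (node count, process count, and application name are placeholders):

  salloc -N 2 mpirun -np 2 --mca oob_tcp_if_include ib0 ./a.out

oob_tcp_if_include restricts the runtime's out-of-band TCP control connections to the ib0 interface instead of letting them probe every configured interface.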

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-23 Thread Syed Ahsan Ali
Hi Josh, it was my mistake. The status of the error-generating node is pasted below:
  Infiniband device 'mlx4_0' port 1 status:
          default gid:     fe80:0000:0000:0000:0018:8b90:97fe:94fe
          base lid:        0x0
          sm lid:          0x0
          state:           1: DOWN
          phys state:

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-23 Thread Syed Ahsan Ali
Dear Pasha, the ibstatus output is not from two different machines; it is from the same machine. There are two InfiniBand ports showing up on all nodes. I checked on all the nodes that one of the ports is always in INIT status and the other one is active. Now please see below the ibstatus of the problem-causing node
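To see which of the two ports is usable on a particular node, ibstatus can be queried per port (the node name below is a placeholder):

  ssh compute-node "ibstatus mlx4_0:1 mlx4_0:2"

Whichever port reports "state: 4: ACTIVE" is the one to pass to -mca btl_openib_if_include, as suggested earlier in the thread.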