Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-13 Thread Gilles Gouaillardet
My 0.02 US$ first: the root cause of the problem was that a default gateway was configured on the node, but this gateway was unreachable. IMHO, this is an incorrect system setting that can lead to unpredictable results: - openmpi 1.8.1 works (you are lucky, good for you) - openmpi 1.8.3 fails (no luck
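A quick way to confirm this kind of misconfiguration (a sketch, with the gateway address left as a placeholder) is to inspect the routing table and probe the configured default gateway on the affected node:

$ netstat -nr                     # the 0.0.0.0 destination line shows the default gateway
$ ping -c 3 <default-gateway-ip>  # if this times out, the gateway is unreachable as described above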

Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-13 Thread Dave Love
Ralph Castain writes: >> I think there's a problem with documentation at least not being >> explicit, and it would really help to have it clarified unless I'm >> missing some. > > Not quite sure I understand this comment - the problem is that we > aren’t correctly reading the

Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-13 Thread Dave Love
Ralph Castain writes: cn6050 16 par6.q@cn6050 cn6045 16 par6.q@cn6045 >> >> The above looks like the PE_HOSTFILE. So it should be 16 slots per node. > > Hey Reuti > > Is that the standard PE_HOSTFILE format? I’m looking at the ras/gridengine > module, and it
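For reference, the quoted lines correspond to a GridEngine PE_HOSTFILE, whose path is handed to the job in the $PE_HOSTFILE environment variable. A minimal sketch of its contents, using only the columns visible in the quote (host, slot count, queue instance), would be:

cn6050 16 par6.q@cn6050
cn6045 16 par6.q@cn6045

The second column is the slot count, which is why 16 slots per node is expected above.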

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-13 Thread Gus Correa
On 11/13/2014 11:14 AM, Ralph Castain wrote: Hmmm…I’m beginning to grok the issue. It is a tad unusual for people to assign different hostnames to their interfaces - I’ve seen it in the Hadoop world, but not in HPC. Still, no law against it. No, not so unusual. I have clusters from respectable

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-13 Thread Ralph Castain
Hmmm…I’m beginning to grok the issue. It is a tad unusual for people to assign different hostnames to their interfaces - I’ve seen it in the Hadoop world, but not in HPC. Still, no law against it. This will take a little thought to figure out a solution. One problem that immediately occurs is

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-13 Thread Reuti
On 13.11.2014 at 00:34, Ralph Castain wrote: >> On Nov 12, 2014, at 2:45 PM, Reuti wrote: >> >> On 12.11.2014 at 17:27, Reuti wrote: >> >>> On 11.11.2014 at 02:25, Ralph Castain wrote: >>> Another thing you can do is (a) ensure you built with --enable-debug, and then (b) run it with -mca
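A sketch of the suggested debug workflow; the verbosity parameter below is an assumption, since the quoted message is cut off right after "-mca":

$ ./configure --enable-debug ...         # rebuild Open MPI with debugging support
$ mpirun --mca oob_base_verbose 100 ...  # assumed example of a verbose MCA setting; the actual parameter is truncated above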

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-13 Thread Reuti
Gus, On 13.11.2014 at 02:59, Gus Correa wrote: > On 11/12/2014 05:45 PM, Reuti wrote: >> On 12.11.2014 at 17:27, Reuti wrote: >> >>> On 11.11.2014 at 02:25, Ralph Castain wrote: >>> Another thing you can do is (a) ensure you built with --enable-debug, >> and then (b) run it with -mca

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-13 Thread Reuti
On 13.11.2014 at 00:55, Gilles Gouaillardet wrote: > Could you please send the output of netstat -nr on both head and compute node? Head node: annemarie:~ # netstat -nr Kernel IP routing table Destination Gateway Genmask Flags MSS Window irtt Iface 0.0.0.0

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
netstat doesn't show the loopback interface even on the head node, while ifconfig shows the loopback up and running on the compute nodes as well as the master node. [root@pmd ~]# netstat -nr Kernel IP routing table Destination Gateway Genmask Flags MSS Window irtt Iface 192.168.3.0 0.0.0.0
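A minimal cross-check of the loopback state on each node (a sketch; use whichever tool is installed):

$ ifconfig lo      # interface state, as already checked above
$ ip addr show lo  # equivalent check via iproute2

Note that the absence of a loopback route in netstat -nr does not by itself prove the interface is down.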

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Gilles Gouaillardet
but it is running on your head node, isn't it? you might want to double check why there is no loopback interface on your compute nodes. in the meantime, you can disable the lo and ib0 interfaces Cheers, Gilles On 2014/11/13 16:59, Syed Ahsan Ali wrote: > I don't see it running > >
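A sketch of that workaround, reusing the host pair from earlier in the thread and excluding both interfaces from the TCP BTL:

$ mpirun --host compute-01-01,compute-01-06 --mca btl_tcp_if_exclude lo,ib0 ring_c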

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
I don't see it running [pmdtest@compute-01-01 ~]$ netstat -nr Kernel IP routing table Destination Gateway Genmask Flags MSS Window irtt Iface 192.168.108.0 0.0.0.0 255.255.255.0 U 0 0 0 ib0 169.254.0.0 0.0.0.0 255.255.0.0 U

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Gilles Gouaillardet
This is really weird. is the loopback interface up and running on both nodes and with the same ip? can you run netstat -nr on both compute nodes? On 2014/11/13 16:50, Syed Ahsan Ali wrote: > Now it looks through the loopback address > > [pmdtest@pmd ~]$ mpirun --host

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
Ok ok I can disable that as well. Thank you guys. :) On Thu, Nov 13, 2014 at 12:50 PM, Syed Ahsan Ali wrote: > Now it looks through the loopback address > > [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca > btl_tcp_if_exclude ib0 ring_c > Process 0 sending

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
Now it looks through the loopback address [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca btl_tcp_if_exclude ib0 ring_c Process 0 sending 10 to 1, tag 201 (2 processes in ring)

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Gilles Gouaillardet
--mca btl ^openib disables the openib btl, which is native infiniband only. ib0 is treated as any TCP interface and then handled by the tcp btl. another option is for you to use --mca btl_tcp_if_exclude ib0 On 2014/11/13 16:43, Syed Ahsan Ali wrote: > You are right it is running on 10.0.0.0
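The two alternatives described here, sketched against the same host pair used in the thread:

$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c             # disable the native InfiniBand BTL entirely
$ mpirun --mca btl_tcp_if_exclude ib0 --host compute-01-01,compute-01-06 ring_c  # keep the TCP BTL away from the IPoIB interface

Note that setting btl_tcp_if_exclude replaces the built-in exclude list (which normally covers the loopback interface), which would explain why the loopback address shows up elsewhere in the thread once only ib0 is excluded.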

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
You are right, it is running on the 10.0.0.0 interface [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca btl_tcp_if_include 10.0.0.0/8 ring_c Process 0 sending 10 to 1, tag 201 (2 processes in ring) Process 0 sent to 1 Process 0 decremented value: 9 Process 0

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Gilles Gouaillardet
mpirun complains about the 192.168.108.10 ip address, but ping reports a 10.0.0.8 address. is the 192.168.* network a point to point network (for example between a host and a mic), so two nodes cannot ping each other via this address? /* e.g. from compute-01-01 can you ping the 192.168.108.* ip
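A sketch of the connectivity test being asked for, with the target left as a placeholder for the 192.168.108.* address of the other node:

[pmdtest@compute-01-01 ~]$ ping -c 3 <192.168.108.x address of compute-01-06>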

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
Same result in both cases [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c Process 0 sending 10 to 1, tag 201 (2 processes in ring) Process 0 sent to 1 Process 0 decremented value: 9

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Gilles Gouaillardet
Hi, it seems you messed up the command line. could you try $ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c can you also try to run mpirun from a compute node instead of the head node? Cheers, Gilles On 2014/11/13 16:07, Syed Ahsan Ali wrote: > Here is what I see when

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
Here is what I see when disabling openib support. [pmdtest@pmd ~]$ mpirun --host --mca btl ^openib compute-01-01,compute-01-06 ring_c ssh: orted: Temporary failure in name resolution ssh: orted: Temporary failure in name resolution

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
Hi Jeff No firewall is enabled. Running the diagnostics I found that the non-communication mpi job is running, while ring_c remains stuck. There are of course warnings for open fabrics, but in my case I am running the application with openib disabled. Please see below [pmdtest@pmd ~]$ mpirun --host