Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking

2012-02-02 Thread Jeff Squyres
When you run without a hostfile, you're likely only running on a single node 
via shared memory (unless you're running inside a SLURM job, which is unlikely, 
given the context of your mails).  

When you're running in SLURM, I'm guessing that you're running across multiple 
nodes.  Are you using TCP as your MPI transport?

If so, I would still recommend stopping iptables altogether ("/etc/init.d/iptables 
stop").  It might not make a difference, but I've found iptables to be sufficiently 
complex that it's easier to take that variable out entirely -- stop it completely 
so you can really test whether it's the problem.
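
As a quick way to take other transports and interfaces out of the picture, you can 
also force the TCP BTL and pin it to a single interface. A sketch only (assuming 
"eth0" is the interface your nodes share; substitute your own executable name):

    mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -np 2 ./your_program

If that still hangs after the first Send/Recv, the problem is most likely in the 
TCP path between the nodes rather than in the program itself.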



On Feb 2, 2012, at 9:48 AM, adrian sabou wrote:

> Hi,
>  
> I have disabled iptables on all nodes using:
>  
> iptables -F
> iptables -X
> iptables -t nat -F
> iptables -t nat -X
> iptables -t mangle -F
> iptables -t mangle -X
> iptables -P INPUT ACCEPT
> iptables -P FORWARD ACCEPT
> iptables -P OUTPUT ACCEPT
>  
> My problem is still there. I have re-enabled iptables. The current output of 
> the "iptables --list" command is:
>  
> Chain INPUT (policy ACCEPT)
> target     prot opt source            destination
> ACCEPT     udp  --  anywhere          anywhere          udp dpt:domain
> ACCEPT     tcp  --  anywhere          anywhere          tcp dpt:domain
> ACCEPT     udp  --  anywhere          anywhere          udp dpt:bootps
> ACCEPT     tcp  --  anywhere          anywhere          tcp dpt:bootps
>
> Chain FORWARD (policy ACCEPT)
> target     prot opt source            destination
> ACCEPT     all  --  anywhere          192.168.122.0/24  state RELATED,ESTABLISHED
> ACCEPT     all  --  192.168.122.0/24  anywhere
> ACCEPT     all  --  anywhere          anywhere
> REJECT     all  --  anywhere          anywhere          reject-with icmp-port-unreachable
> REJECT     all  --  anywhere          anywhere          reject-with icmp-port-unreachable
>
> Chain OUTPUT (policy ACCEPT)
> target     prot opt source            destination
> 
> I don't think this is it. I have tried to run a simple ping-pong program that 
> I found (it keeps bouncing a value between two processes) and I keep getting the 
> same results: the first Send/Receive pairs (p1 sends to p2, p2 receives 
> and sends back to p1, p1 receives) work, and after that the program just 
> blocks. However, like all other examples, the example works if I launch it 
> with "mpirun -np 2 " and bounces the value 100 times.
>  
> Adrian
> From: Jeff Squyres 
> To: adrian sabou ; Open MPI Users 
>  
> Sent: Thursday, February 2, 2012 3:09 PM
> Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
> 
> Have you disabled iptables (firewalling) on your nodes?
> 
> Or, if you want to leave iptables enabled, set it such that all nodes in your 
> cluster are allowed to open TCP connections from any port to any other port.
> 
> 
> 
> 
> On Feb 2, 2012, at 4:49 AM, adrian sabou wrote:
> 
> > Hi,
> > 
> > The only example that works is hello_c.c. All others (that use MPI_Send and 
> > MPI_Recv)(connectivity_c.c and ring_c.c) block after the first MPI_Send / 
> > MPI_Recv (although the first Send/Receive pair works well for all 
> > processes, subsequent Send/Receive pairs block). My slurm version is 2.1.0. 
> > It is also worth mentioning that all examples work when not using SLURM 
> > (launching with "mpirun -np 5 "). Blocking occurs only when I 
> > try to run on multiple hosts with SLURM ("salloc -N5 mpirun ").
> > 
> > Adrian
> > 
> > From: Jeff Squyres 
> > To: adrian sabou ; Open MPI Users 
> >  
> > Sent: Wednesday, February 1, 2012 10:32 PM
> > Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
> > 
> > On Jan 31, 2012, at 11:16 AM, adrian sabou wrote:
> > 
> > > Like I said, a very simple program.
> > > When launching this application with SLURM (using "salloc -N2 mpirun 
> > > ./"), it hangs at the barrier.
> > 
> > Are you able to run the MPI example programs in examples/ ?
> > 
> > > However, it passes the barrier if I launch it without SLURM (using 
> > > "mpirun -np 2 ./"). I first noticed this problem when my 
> > > application hung if I tried to send two successive messages from a 
> > > process to another. Only the first MPI_Send would work. The second 
> > > MPI_Send would block indefinitely. I was wondering whether any of you 
> > > have encountered a similar problem, or may have an idea as to what is 
> > > causing the Send/Receive pair to block when using SLURM. The exact output 
> > > in my console is as follows:
> > >
> > >     salloc: Granted job allocation 1138
> > >     Process 0 - Sending...
> > >     Process 1 - Receiving...
> > >     Process 1 - Received.
> > >     Process 1 - Barrier reached.
> > >     Process 0 - Sent.
> > >     Process 0 - Barrier reached.
> > >     (it just hangs here)
> 

Re: [OMPI users] Using physical numbering in a rankfile

2012-02-02 Thread Ralph Castain
Actually, that's not true - the 1.5 series technically still supports assignment 
to physical CPUs. However, that capability is almost never tested and very unusual 
for anyone to use, so I suspect it is broken, and I very much doubt anyone will 
fix it.

Also, be aware that physical CPU assignments are not supported in the current 
developer trunk, and that will likely remain the case when it is released as the 
1.7 series and going forward.


On Feb 2, 2012, at 10:17 AM, teng ma wrote:

> Just remove the "p" in your rankfile, like:
> 
> rank 0=host1 slot=0:0
> rank 1=host1 slot=0:2
> rank 2=host1 slot=0:4
> rank 3=host1 slot=0:6
> rank 4=host1 slot=1:1
> rank 5=host1 slot=1:3
> rank 6=host1 slot=1:5
> rank 7=host1 slot=1:7
> 
> Teng
> 
> 2012/2/2 François Tessier 
> Hello,
> 
> I need to use a rankfile with Open MPI 1.5.4 to do some tests on a basic 
> architecture. I'm using a node for which lstopo returns this:
> 
> 
> Machine (24GB)
>   NUMANode L#0 (P#0 12GB)
> Socket L#0 + L3 L#0 (8192KB)
>   L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>   L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#2)
>   L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#4)
>   L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#6)
> HostBridge L#0
>   PCIBridge
> PCI 8086:10c9
>   Net L#0 "eth0"
> PCI 8086:10c9
>   Net L#1 "eth1"
>   PCIBridge
> PCI 15b3:673c
>   Net L#2 "ib0"
>   Net L#3 "ib1"
>   OpenFabrics L#4 "mlx4_0"
>   PCIBridge
> PCI 102b:0522
>   PCI 8086:3a22
> Block L#5 "sda"
> Block L#6 "sdb"
> Block L#7 "sdc"
> Block L#8 "sdd"
>   NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (8192KB)
> L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#1)
> L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#3)
> L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#5)
> L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
> 
> 
> And I would like to use the physical numbering. To do that, I created a 
> rankfile like this:
> 
> rank 0=host1 slot=p0:0
> rank 1=host1 slot=p0:2
> rank 2=host1 slot=p0:4
> rank 3=host1 slot=p0:6
> rank 4=host1 slot=p1:1
> rank 5=host1 slot=p1:3
> rank 6=host1 slot=p1:5
> rank 7=host1 slot=p1:7
> 
> But when I run my job with "mpiexec -np 8 --rankfile rankfile ./foo", I 
> encounter this error:
> 
> Specified slot list: p0:4
> Error: Not found
> 
> This could mean that a non-existent processor was specified, or
> that the specification had improper syntax.
> 
> 
> Do you know what I did wrong?
> 
> Best regards,
> 
> François
> 
> -- 
> ___
> François TESSIER
> PhD Student at University of Bordeaux
> Tel : 0033.5.24.57.41.52
> francois.tess...@inria.fr
> 
> 
> 
> 
> 
> -- 
> | Teng Ma  Univ. of Tennessee |
> | t...@cs.utk.eduKnoxville, TN |
> | http://web.eecs.utk.edu/~tma/   |



Re: [OMPI users] Using physical numbering in a rankfile

2012-02-02 Thread teng ma
I made a mistake in my previous reply. You can write the rankfile in either of two ways:
rank 0=host1 slot=0
rank 1=host1 slot=2
rank 2=host1 slot=4
rank 3=host1 slot=6
rank 4=host1 slot=1
rank 5=host1 slot=3
rank 6=host1 slot=5
rank 7=host1 slot=7

or

rank 0=host1 slot=0:0
rank 1=host1 slot=0:1
rank 2=host1 slot=0:2
rank 3=host1 slot=0:3
rank 4=host1 slot=1:0
rank 5=host1 slot=1:1
rank 6=host1 slot=1:2
rank 7=host1 slot=1:3
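
If your mpirun/mpiexec supports it, adding --report-bindings is a quick way to 
confirm where each rank actually landed. A sketch, reusing the "rankfile" and 
"./foo" names from the command above:

    mpiexec -np 8 --rankfile rankfile --report-bindings ./foo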

Teng


On Thu, Feb 2, 2012 at 12:17 PM, teng ma  wrote:

> Just remove the "p" in your rankfile, like:
>
> rank 0=host1 slot=0:0
> rank 1=host1 slot=0:2
> rank 2=host1 slot=0:4
> rank 3=host1 slot=0:6
> rank 4=host1 slot=1:1
> rank 5=host1 slot=1:3
> rank 6=host1 slot=1:5
> rank 7=host1 slot=1:7
>
> Teng
>
> 2012/2/2 François Tessier 
>
>>  Hello,
>>
>> I need to use a rankfile with Open MPI 1.5.4 to do some tests on a basic
>> architecture. I'm using a node for which lstopo returns this:
>>
>> 
>> Machine (24GB)
>>   NUMANode L#0 (P#0 12GB)
>> Socket L#0 + L3 L#0 (8192KB)
>>   L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>>   L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#2)
>>   L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#4)
>>   L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#6)
>> HostBridge L#0
>>   PCIBridge
>> PCI 8086:10c9
>>   Net L#0 "eth0"
>> PCI 8086:10c9
>>   Net L#1 "eth1"
>>   PCIBridge
>> PCI 15b3:673c
>>   Net L#2 "ib0"
>>   Net L#3 "ib1"
>>   OpenFabrics L#4 "mlx4_0"
>>   PCIBridge
>> PCI 102b:0522
>>   PCI 8086:3a22
>> Block L#5 "sda"
>> Block L#6 "sdb"
>> Block L#7 "sdc"
>> Block L#8 "sdd"
>>   NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (8192KB)
>> L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#1)
>> L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#3)
>> L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#5)
>> L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
>> 
>>
>> And I would like to use the physical numbering. To do that, I created a
>> rankfile like this:
>>
>> rank 0=host1 slot=p0:0
>> rank 1=host1 slot=p0:2
>> rank 2=host1 slot=p0:4
>> rank 3=host1 slot=p0:6
>> rank 4=host1 slot=p1:1
>> rank 5=host1 slot=p1:3
>> rank 6=host1 slot=p1:5
>> rank 7=host1 slot=p1:7
>>
>> But when I run my job with "mpiexec -np 8 --rankfile rankfile ./foo",
>> I encounter this error:
>>
>> Specified slot list: p0:4
>> Error: Not found
>>
>> This could mean that a non-existent processor was specified, or
>> that the specification had improper syntax.
>>
>>
>> Do you know what I did wrong?
>>
>> Best regards,
>>
>> François
>>
>> --
>> ___
>> François TESSIER
>> PhD Student at University of Bordeaux
>> Tel : 0033.5.24.57.41.52
>> francois.tess...@inria.fr
>>
>>
>>
>>
>
>
>
> --
> | Teng Ma  Univ. of Tennessee |
> | t...@cs.utk.eduKnoxville, TN |
> | http://web.eecs.utk.edu/~tma/   |
>



-- 
| Teng Ma  Univ. of Tennessee |
| t...@cs.utk.eduKnoxville, TN |
| http://web.eecs.utk.edu/~tma/   |


Re: [OMPI users] Using physical numbering in a rankfile

2012-02-02 Thread teng ma
Just remove the "p" in your rankfile, like:

rank 0=host1 slot=0:0
rank 1=host1 slot=0:2
rank 2=host1 slot=0:4
rank 3=host1 slot=0:6
rank 4=host1 slot=1:1
rank 5=host1 slot=1:3
rank 6=host1 slot=1:5
rank 7=host1 slot=1:7

Teng

2012/2/2 François Tessier 

>  Hello,
>
> I need to use a rankfile with Open MPI 1.5.4 to do some tests on a basic
> architecture. I'm using a node for which lstopo returns this:
>
> 
> Machine (24GB)
>   NUMANode L#0 (P#0 12GB)
> Socket L#0 + L3 L#0 (8192KB)
>   L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>   L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#2)
>   L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#4)
>   L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#6)
> HostBridge L#0
>   PCIBridge
> PCI 8086:10c9
>   Net L#0 "eth0"
> PCI 8086:10c9
>   Net L#1 "eth1"
>   PCIBridge
> PCI 15b3:673c
>   Net L#2 "ib0"
>   Net L#3 "ib1"
>   OpenFabrics L#4 "mlx4_0"
>   PCIBridge
> PCI 102b:0522
>   PCI 8086:3a22
> Block L#5 "sda"
> Block L#6 "sdb"
> Block L#7 "sdc"
> Block L#8 "sdd"
>   NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (8192KB)
> L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#1)
> L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#3)
> L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#5)
> L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
> 
>
> And I would like to use the physical numbering. To do that, I created a
> rankfile like this:
>
> rank 0=host1 slot=p0:0
> rank 1=host1 slot=p0:2
> rank 2=host1 slot=p0:4
> rank 3=host1 slot=p0:6
> rank 4=host1 slot=p1:1
> rank 5=host1 slot=p1:3
> rank 6=host1 slot=p1:5
> rank 7=host1 slot=p1:7
>
> But when I run my job with "mpiexec -np 8 --rankfile rankfile ./foo", I
> encounter this error:
>
> Specified slot list: p0:4
> Error: Not found
>
> This could mean that a non-existent processor was specified, or
> that the specification had improper syntax.
>
>
> Do you know what I did wrong?
>
> Best regards,
>
> François
>
> --
> ___
> François TESSIER
> PhD Student at University of Bordeaux
> Tel : 0033.5.24.57.41.52
> francois.tess...@inria.fr
>
>
>
>



-- 
| Teng Ma  Univ. of Tennessee |
| t...@cs.utk.eduKnoxville, TN |
| http://web.eecs.utk.edu/~tma/   |


[OMPI users] Using physical numbering in a rankfile

2012-02-02 Thread François Tessier

Hello,

I need to use a rankfile with Open MPI 1.5.4 to do some tests on a basic 
architecture. I'm using a node for which lstopo returns this:



Machine (24GB)
  NUMANode L#0 (P#0 12GB)
Socket L#0 + L3 L#0 (8192KB)
  L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
  L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#2)
  L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#4)
  L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#6)
HostBridge L#0
  PCIBridge
PCI 8086:10c9
  Net L#0 "eth0"
PCI 8086:10c9
  Net L#1 "eth1"
  PCIBridge
PCI 15b3:673c
  Net L#2 "ib0"
  Net L#3 "ib1"
  OpenFabrics L#4 "mlx4_0"
  PCIBridge
PCI 102b:0522
  PCI 8086:3a22
Block L#5 "sda"
Block L#6 "sdb"
Block L#7 "sdc"
Block L#8 "sdd"
  NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (8192KB)
L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#1)
L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#3)
L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#5)
L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)


And I would like to use the physical numbering. To do that, I created a 
rankfile like this:


rank 0=host1 slot=p0:0
rank 1=host1 slot=p0:2
rank 2=host1 slot=p0:4
rank 3=host1 slot=p0:6
rank 4=host1 slot=p1:1
rank 5=host1 slot=p1:3
rank 6=host1 slot=p1:5
rank 7=host1 slot=p1:7

But when I run my job with "mpiexec -np 8 --rankfile rankfile ./foo", 
I encounter this error:


Specified slot list: p0:4
Error: Not found

This could mean that a non-existent processor was specified, or
that the specification had improper syntax.


Do you know what I did wrong?

Best regards,

François

--
___
François TESSIER
PhD Student at University of Bordeaux
Tel : 0033.5.24.57.41.52
francois.tess...@inria.fr




Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking

2012-02-02 Thread adrian sabou
Hi,
 
I have disabled iptables on all nodes using:
 
iptables -F
iptables -X
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -F
iptables -t mangle -X
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
 
My problem is still there. I have re-enabled iptables. The current output of 
the "iptables --list" command is:
 
Chain INPUT (policy ACCEPT)
target     prot opt source            destination
ACCEPT     udp  --  anywhere          anywhere          udp dpt:domain
ACCEPT     tcp  --  anywhere          anywhere          tcp dpt:domain
ACCEPT     udp  --  anywhere          anywhere          udp dpt:bootps
ACCEPT     tcp  --  anywhere          anywhere          tcp dpt:bootps

Chain FORWARD (policy ACCEPT)
target     prot opt source            destination
ACCEPT     all  --  anywhere          192.168.122.0/24  state RELATED,ESTABLISHED
ACCEPT     all  --  192.168.122.0/24  anywhere
ACCEPT     all  --  anywhere          anywhere
REJECT     all  --  anywhere          anywhere          reject-with icmp-port-unreachable
REJECT     all  --  anywhere          anywhere          reject-with icmp-port-unreachable

Chain OUTPUT (policy ACCEPT)
target     prot opt source            destination

I don't think this is it. I have tried to run a simple ping-pong program that I 
found (it keeps bouncing a value between two processes) and I keep getting the 
same results: the first Send/Receive pairs (p1 sends to p2, p2 receives and 
sends back to p1, p1 receives) work, and after that the program just blocks. 
However, like all other examples, the example works if I launch it with "mpirun 
-np 2 " and bounces the value 100 times.
 
Adrian
 


 From: Jeff Squyres 
To: adrian sabou ; Open MPI Users  
Sent: Thursday, February 2, 2012 3:09 PM
Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
  
Have you disabled iptables (firewalling) on your nodes?

Or, if you want to leave iptables enabled, set it such that all nodes in your 
cluster are allowed to open TCP connections from any port to any other port.




On Feb 2, 2012, at 4:49 AM, adrian sabou wrote:

> Hi,
> 
> The only example that works is hello_c.c. All others (that use MPI_Send and 
> MPI_Recv)(connectivity_c.c and ring_c.c) block after the first MPI_Send / 
> MPI_Recv (although the first Send/Receive pair works well for all processes, 
> subsequent Send/Receive pairs block). My slurm version is 2.1.0. It is also 
> worth mentioning that all examples work when not using SLURM (launching with 
> "mpirun -np 5 "). Blocking occurs only when I try to run on 
> multiple hosts with SLURM ("salloc -N5 mpirun ").
> 
> Adrian
> 
> From: Jeff Squyres 
> To: adrian sabou ; Open MPI Users 
>  
> Sent: Wednesday, February 1, 2012 10:32 PM
> Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
> 
> On Jan 31, 2012, at 11:16 AM, adrian sabou wrote:
> 
> > Like I said, a very simple program.
> > When launching this application with SLURM (using "salloc -N2 mpirun 
> > ./"), it hangs at the barrier.
> 
> Are you able to run the MPI example programs in examples/ ?
> 
> > However, it passes the barrier if I launch it without SLURM (using "mpirun 
> > -np 2 ./"). I first noticed this problem when my application hung 
> > if I tried to send two successive messages from a process to another. Only 
> > the first MPI_Send would work. The second MPI_Send would block 
> > indefinitely. I was wondering whether any of you have encountered a similar 
> > problem, or may have an idea as to what is causing the Send/Receive pair 
> > to block when using SLURM. The exact output in my console is as follows:
> >
> >     salloc: Granted job allocation 1138
> >     Process 0 - Sending...
> >     Process 1 - Receiving...
> >     Process 1 - Received.
> >     Process 1 - Barrier reached.
> >     Process 0 - Sent.
> >     Process 0 - Barrier reached.
> >     (it just hangs here)
> >
> > I am new to MPI programming and to OpenMPI and would greatly appreciate any 
> > help. My OpenMPI version is 1.4.4 (although I have also tried it on 1.5.4), 
> > my SLURM version is 0.3.3-1 (slurm-llnl 2.1.0-1),
> 
> I'm not sure what SLURM version that is -- my "srun --version" shows 2.2.4.  
> 0.3.3 would be pretty ancient, no?
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking

2012-02-02 Thread Jeff Squyres
Have you disabled iptables (firewalling) on your nodes?

Or, if you want to leave iptables enabled, set it such that all nodes in your 
cluster are allowed to open TCP connections from any port to any other port.
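
For example, something like the following on each node would accept all TCP 
traffic from the cluster subnet (a sketch only -- substitute your cluster's 
actual subnet for 10.0.0.0/24, and make sure the rule is inserted before any 
REJECT rules in the chain):

    iptables -I INPUT -p tcp -s 10.0.0.0/24 -j ACCEPT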




On Feb 2, 2012, at 4:49 AM, adrian sabou wrote:

> Hi,
> 
> The only example that works is hello_c.c. All others (that use MPI_Send and 
> MPI_Recv)(connectivity_c.c and ring_c.c) block after the first MPI_Send / 
> MPI_Recv (although the first Send/Receive pair works well for all processes, 
> subsequent Send/Receive pairs block). My slurm version is 2.1.0. It is also 
> worth mentioning that all examples work when not using SLURM (launching with 
> "mpirun -np 5 "). Blocking occurs only when I try to run on 
> multiple hosts with SLURM ("salloc -N5 mpirun ").
> 
> Adrian
> 
> From: Jeff Squyres 
> To: adrian sabou ; Open MPI Users 
>  
> Sent: Wednesday, February 1, 2012 10:32 PM
> Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
> 
> On Jan 31, 2012, at 11:16 AM, adrian sabou wrote:
> 
> > Like I said, a very simple program.
> > When launching this application with SLURM (using "salloc -N2 mpirun 
> > ./"), it hangs at the barrier.
> 
> Are you able to run the MPI example programs in examples/ ?
> 
> > However, it passes the barrier if I launch it without SLURM (using "mpirun 
> > -np 2 ./"). I first noticed this problem when my application hung 
> > if I tried to send two successive messages from a process to another. Only 
> > the first MPI_Send would work. The second MPI_Send would block 
> > indefinitely. I was wondering whether any of you have encountered a similar 
> > problem, or may have an idea as to what is causing the Send/Receive pair 
> > to block when using SLURM. The exact output in my console is as follows:
> >
> >     salloc: Granted job allocation 1138
> >     Process 0 - Sending...
> >     Process 1 - Receiving...
> >     Process 1 - Received.
> >     Process 1 - Barrier reached.
> >     Process 0 - Sent.
> >     Process 0 - Barrier reached.
> >     (it just hangs here)
> >
> > I am new to MPI programming and to OpenMPI and would greatly appreciate any 
> > help. My OpenMPI version is 1.4.4 (although I have also tried it on 1.5.4), 
> > my SLURM version is 0.3.3-1 (slurm-llnl 2.1.0-1),
> 
> I'm not sure what SLURM version that is -- my "srun --version" shows 2.2.4.  
> 0.3.3 would be pretty ancient, no?
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Error building Openmpi (configure: error: C compiler cannot create executables)

2012-02-02 Thread Jeff Squyres
Both icc and gcc seem to be broken on your system; they're not creating 
executables.

You can look in config.log for more details about what is failing.  But 
basically, configure is trying to compile a simple "hello world"-like C 
program, and it's failing.

You might want to try compiling a simple "hello world"-like C program yourself 
and see what error messages come out of the compiler.
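
Something along these lines is usually enough to reproduce the failure (a sketch; 
"hello.c" is just a placeholder name):

/* hello.c - trivial test of whether the compiler can create executables */
#include <stdio.h>

int main(void)
{
    printf("hello, world\n");
    return 0;
}

Then:

    icc hello.c -o hello && ./hello
    gcc hello.c -o hello && ./hello

If icc fails, common causes include a compiler environment script (e.g. 
iccvars.sh) that was not sourced, or a licensing problem; if gcc fails too, 
the C development headers/libraries may be missing.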



On Feb 2, 2012, at 5:19 AM, Syed Ahsan Ali wrote:

> Dear All,
> 
> I have been stuck on the installation of Open MPI 1.4.2 on RHEL 5.2 with ifort 
> and icc. I get the following error while configuring; please help.
> 
> 
> [precis@precis2 openmpi-1.4.2]$ ./build.sh 
> checking for a BSD-compatible install... /usr/bin/install -c
> checking whether build environment is sane... yes
> checking for a thread-safe mkdir -p... /bin/mkdir -p
> checking for gawk... gawk
> checking whether make sets $(MAKE)... yes
> checking how to create a ustar tar archive... gnutar
> 
> 
> == Configuring Open MPI
> 
> 
> *** Checking versions
> checking Open MPI version... 1.4.2
> checking Open MPI release date... May 04, 2010
> checking Open MPI Subversion repository version... r23093
> checking Open Run-Time Environment version... 1.4.2
> checking Open Run-Time Environment release date... May 04, 2010
> checking Open Run-Time Environment Subversion repository version... r23093
> checking Open Portable Access Layer version... 1.4.2
> checking Open Portable Access Layer release date... May 04, 2010
> checking Open Portable Access Layer Subversion repository version... r23093
> 
> *** Initialization, setup
> configure: builddir: /home/precis/opemmpi/openmpi-1.4.2
> configure: srcdir: /home/precis/opemmpi/openmpi-1.4.2
> checking build system type... i686-pc-linux-gnu
> checking host system type... i686-pc-linux-gnu
> installing to directory "/home/openmpi"
> 
> *** Configuration options
> checking whether to run code coverage... no
> checking whether to compile with branch probabilities... no
> checking whether to debug memory usage... no
> checking whether to profile memory usage... no
> checking if want developer-level compiler pickyness... no
> checking if want developer-level debugging code... no
> checking if want sparse process groups... no
> checking if want Fortran 77 bindings... yes
> checking if want Fortran 90 bindings... yes
> checking desired Fortran 90 bindings "size"... small
> checking whether to enable PMPI... yes
> checking if want C++ bindings... yes
> checking if want MPI::SEEK_SET support... yes
> checking if want to enable weak symbol support... yes
> checking if want run-time MPI parameter checking... runtime
> checking if want to install OMPI header files... no
> checking if want pretty-print stacktrace... yes
> checking if peruse support is required... no
> checking max supported array dimension in F90 MPI bindings... 4
> checking if pty support should be enabled... yes
> checking if user wants dlopen support... yes
> checking if heterogeneous support should be enabled... no
> checking if want trace file debugging... no
> checking if want full RTE support... yes
> checking if want fault tolerance... Disabled fault tolerance
> checking if want IPv6 support... yes (if underlying system supports it)
> checking if want orterun "--prefix" behavior to be enabled by default... no
> checking for package/brand string... Open MPI pre...@precis2.pakmet.com.pk 
> Distribution
> checking for ident string... 1.4.2
> checking whether to add padding to the openib control header... no
> checking whether to use an alternative checksum algo for messages... no
> 
> 
> == Compiler and preprocessor tests
> 
> 
> *** C compiler and preprocessor
> checking for style of include used by make... GNU
> checking for gcc... icc
> checking for C compiler default output file name... 
> configure: error: in `/home/precis/opemmpi/openmpi-1.4.2':
> configure: error: C compiler cannot create executables
> See `config.log' for more details.
> make: *** No targets specified and no makefile found.  Stop.
> make: *** No rule to make target `install'.  Stop.
> [precis@precis2 openmpi-1.4.2]$ clean all
> bash: clean: command not found
> [precis@precis2 openmpi-1.4.2]$ make clean
> make: *** No rule to make target `clean'.  Stop.
> [precis@precis2 openmpi-1.4.2]$ make
> make: *** No targets specified and no makefile found.  Stop.
> [precis@precis2 openmpi-1.4.2]$ ./configure 
> checking for a BSD-compatible install... /usr/bin/install -c
> checking whether build environment is sane... yes
> checking for a thread-safe mkdir -p... /bin/mkdir -p
> checking for gawk... gawk
> checking whether make sets $(MAKE)... yes
> checking how to create a ustar tar 

[OMPI users] Error building Openmpi (configure: error: C compiler cannot create executables)

2012-02-02 Thread Syed Ahsan Ali
Dear All,

I have been stuck on the installation of Open MPI 1.4.2 on RHEL 5.2 with ifort and
icc. I get the following error while configuring; please help.


[precis@precis2 openmpi-1.4.2]$ ./build.sh
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking how to create a ustar tar archive... gnutar


== Configuring Open MPI


*** Checking versions
checking Open MPI version... 1.4.2
checking Open MPI release date... May 04, 2010
checking Open MPI Subversion repository version... r23093
checking Open Run-Time Environment version... 1.4.2
checking Open Run-Time Environment release date... May 04, 2010
checking Open Run-Time Environment Subversion repository version... r23093
checking Open Portable Access Layer version... 1.4.2
checking Open Portable Access Layer release date... May 04, 2010
checking Open Portable Access Layer Subversion repository version... r23093

*** Initialization, setup
configure: builddir: /home/precis/opemmpi/openmpi-1.4.2
configure: srcdir: /home/precis/opemmpi/openmpi-1.4.2
checking build system type... i686-pc-linux-gnu
checking host system type... i686-pc-linux-gnu
installing to directory "/home/openmpi"

*** Configuration options
checking whether to run code coverage... no
checking whether to compile with branch probabilities... no
checking whether to debug memory usage... no
checking whether to profile memory usage... no
checking if want developer-level compiler pickyness... no
checking if want developer-level debugging code... no
checking if want sparse process groups... no
checking if want Fortran 77 bindings... yes
checking if want Fortran 90 bindings... yes
checking desired Fortran 90 bindings "size"... small
checking whether to enable PMPI... yes
checking if want C++ bindings... yes
checking if want MPI::SEEK_SET support... yes
checking if want to enable weak symbol support... yes
checking if want run-time MPI parameter checking... runtime
checking if want to install OMPI header files... no
checking if want pretty-print stacktrace... yes
checking if peruse support is required... no
checking max supported array dimension in F90 MPI bindings... 4
checking if pty support should be enabled... yes
checking if user wants dlopen support... yes
checking if heterogeneous support should be enabled... no
checking if want trace file debugging... no
checking if want full RTE support... yes
checking if want fault tolerance... Disabled fault tolerance
checking if want IPv6 support... yes (if underlying system supports it)
checking if want orterun "--prefix" behavior to be enabled by default... no
checking for package/brand string... Open MPI pre...@precis2.pakmet.com.pk Distribution
checking for ident string... 1.4.2
checking whether to add padding to the openib control header... no
checking whether to use an alternative checksum algo for messages... no


== Compiler and preprocessor tests


*** C compiler and preprocessor
checking for style of include used by make... GNU
checking for gcc... icc
checking for C compiler default output file name...
configure: error: in `/home/precis/opemmpi/openmpi-1.4.2':
configure: error: C compiler cannot create executables
See `config.log' for more details.
make: *** No targets specified and no makefile found.  Stop.
make: *** No rule to make target `install'.  Stop.
[precis@precis2 openmpi-1.4.2]$ clean all
bash: clean: command not found
[precis@precis2 openmpi-1.4.2]$ make clean
make: *** No rule to make target `clean'.  Stop.
[precis@precis2 openmpi-1.4.2]$ make
make: *** No targets specified and no makefile found.  Stop.
[precis@precis2 openmpi-1.4.2]$ ./configure
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking how to create a ustar tar archive... gnutar


== Configuring Open MPI


*** Checking versions
checking Open MPI version... 1.4.2
checking Open MPI release date... May 04, 2010
checking Open MPI Subversion repository version... r23093
checking Open Run-Time Environment version... 1.4.2
checking Open Run-Time Environment release date... May 04, 2010
checking Open Run-Time Environment Subversion repository version... r23093
checking Open Portable Access Layer version... 1.4.2
checking Open Portable Access Layer release date... May 04, 

Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking

2012-02-02 Thread adrian sabou
Hi,

The only example that works is hello_c.c. All others (that use MPI_Send and 
MPI_Recv)(connectivity_c.c and ring_c.c) block after the first MPI_Send / 
MPI_Recv (although the first Send/Receive pair works well for all 
processes, subsequent Send/Receive pairs block). My slurm version is 
2.1.0. It is also worth mentioning that all examples work when not using SLURM 
(launching with "mpirun -np 5 "). Blocking 
occurs only when I try to run on multiple hosts with SLURM ("salloc -N5 
mpirun ").

Adrian



 From: Jeff Squyres 
To: adrian sabou ; Open MPI Users  
Sent: Wednesday, February 1, 2012 10:32 PM
Subject: Re: [OMPI users]  OpenMPI / SLURM -> Send/Recv blocking
 
On Jan 31, 2012, at 11:16 AM, adrian sabou wrote:

> Like I said, a very simple program.
> When launching this application with SLURM (using "salloc -N2 mpirun 
> ./"), it hangs at the barrier.

Are you able to run the MPI example programs in examples/ ?

> However, it passes the barrier if I launch it without SLURM (using "mpirun 
> -np 2 ./"). I first noticed this problem when my application hung 
> if I tried to send two successive messages from a process to another. Only 
> the first MPI_Send would work. The second MPI_Send would block indefinitely. 
> I was wondering whether any of you have encountered a similar problem, or may 
> have an idea as to what is causing the Send/Receive pair to block when using 
> SLURM. The exact output in my console is as follows:
>
>     salloc: Granted job allocation 1138
>     Process 0 - Sending...
>     Process 1 - Receiving...
>     Process 1 - Received.
>     Process 1 - Barrier reached.
>     Process 0 - Sent.
>     Process 0 - Barrier reached.
>     (it just hangs here)
>
> I am new to MPI programming and to OpenMPI and would greatly appreciate any 
> help. My OpenMPI version is 1.4.4 (although I have also tried it on 1.5.4), 
> my SLURM version is 0.3.3-1 (slurm-llnl 2.1.0-1),

I'm not sure what SLURM version that is -- my "srun --version" shows 2.2.4.  
0.3.3 would be pretty ancient, no?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/