Re: [OMPI users] Hybrid OpenMPI / OpenMP programming

2012-03-02 Thread Ralph Castain

On Mar 2, 2012, at 11:52 AM, Paul Kapinos wrote:

> Hello Ralph,
> I have some questions on placement and -cpus-per-rank.
> 
>> First, use the --cpus-per-rank option to separate the ranks from each other. 
>> In other words, instead of --bind-to-socket -bysocket, you do:
>> -bind-to-core -cpus-per-rank N
>> This will take each rank and bind it to a unique set of N cores, thereby 
>> cleanly separating them on the node.
> 
> Yes, it helps a lot, but the placement arranged in this way is still not 
> optimal, I believe.
> The cores are assigned in incremental order, starting from 0. On a 2-socket, 
> 12-core machine:
> socket 0: cores 0-5  (hypercores 12-17)
> socket 1: cores 6-11 (hypercores 18-23)
> 
> running 2 processes with 5 threads each leads to this:
> 
> 0 <#> linuxbdc07.rz.RWTH-Aachen.DE <#> physcpubind: 0 1 2 3 4
> 1 <#> linuxbdc07.rz.RWTH-Aachen.DE <#> physcpubind: 5 6 7 8 9
> (unused cores: 10, 11; unused hypercores: 12-23)
> That is, one MPI process is bound to core 0 (which is a sweet spot for many 
> kernel tasks), and the threads of the 2nd process are spread over both 
> sockets.

Yeah, the current implementation isn't quite as good as we'd like. We rewrote 
the entire binding system for the trunk/upcoming 1.7 series.

> 
> - is there a way to tell the system to distribute the processes (= slot chunks 
> defined by -cpus-per-rank N) over the sockets in a round-robin fashion?

No, but I should add it.

> - is there a way to say "do not use this core number!" in order to force some 
> alignment in core numbering?

No, but again, I should add it.
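In the meantime, a rankfile lets you hand-place both. A minimal sketch for 
your 2-socket, 12-core node (the slot=socket:cores syntax is what mpirun 
-rf expects; core numbers count within each socket):

rank 0=linuxbdc07 slot=0:1-5
rank 1=linuxbdc07 slot=1:0-4

$ mpiexec -np 2 -rf myrankfile ompi_testpin.sh MPI_FastTest.exe

That gives one rank per socket (round-robin) and keeps rank 0 off core 0, at 
the price of writing the file per machine type.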

> - is there a way to use the hypercores alongside the real cores?

Not in the 1.5 series, but on the trunk you can.
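(On the trunk the options look roughly like this; the names may still change 
before the 1.7 release, so treat them as tentative:

$ mpiexec -np 2 --use-hwthread-cpus --bind-to hwthread MPI_FastTest.exe

--use-hwthread-cpus treats every hardware thread as a schedulable cpu, and 
--bind-to hwthread binds at hwthread rather than core granularity.)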

> 
> And last but not least, I found that starting and running the program on 
> differing hardware is problematic.
> 
> Trying to start a 2-rank, 5-thread job on a 2x6-core computer from my 4-core 
> workstation, I get the error message below.
> 
> It seems that the core-number calculation / pinning determination is part of 
> the *mpiexec* process instead of being run on the target node? *puzzled*

No, it is done on the backend. I suspect there is a bug, though, that is 
causing the number of cores/socket to be *sensed* on the mpiexec node and then 
*passed* back to the daemon. Hasn't surfaced before because the only folks 
using this option are on homogeneous systems.

FWIW: the trunk resolves this problem, but I haven't checked the cpus-per-rank 
support on it yet.

> 
> $ mpiexec -np 1 -H linuxbdc01 -bind-to-core -cpus-per-rank 5 ompi_testpin.sh 
> MPI_FastTest.exe
> --
> Your job has requested more cpus per process(rank) than there
> are cpus in a socket:
> 
>  Cpus/rank: 5
>  #cpus/socket: 4
> 
> Please correct one or both of these values and try again.
> --
> 
> $ ssh linuxbdc01 cat /proc/cpuinfo | grep processor | wc -l
> 24
> 
> $ cat /proc/cpuinfo | grep processor | wc -l
> 4
> 
> 
> 
> Best,
> 
> Paul
> 
> P.S. Using Open MPI 1.5.3, waiting for 1.5.5 :o)
> 
> P.S.2. any update on this? 
> http://www.open-mpi.org/community/lists/users/2012/01/18240.php
> 
> P.S.3. on the same 16-way, 128-core hardware as in P.S.2, -cpus-per-rank also 
> goes crazy:
> 
> $ mpiexec -mca btl_openib_warn_default_gid_prefix=0 -np 2 -H linuxbcsc21 
> -bind-to-core -cpus-per-rank 5 --report-bindings  ompi_testpin.sh 
> MPI_FastTest.exe
> [linuxbcsc21.rz.RWTH-Aachen.DE:106342] [[55934,0],1] odls:default:fork 
> binding child [[55934,1],0] to cpus 1000100010001
> [linuxbcsc21.rz.RWTH-Aachen.DE:106342] [[55934,0],1] odls:default:fork 
> binding child [[55934,1],1] to cpus 20002
> 0 <#> linuxbcsc21.rz.RWTH-Aachen.DE <#> physcpubind: 0 16 32 48
> 1 <#> linuxbcsc21.rz.RWTH-Aachen.DE <#> physcpubind: 1 17
> 
> 
> So, -cpus-per-rank 5, but one process gets 4 cores and the other only two...
> 
>> What you can do is "entice" it away from your processes by leaving 1-2 cores 
>> for its own use. For example:
>> -npernode 2 -bind-to-core -cpus-per-rank 3
>> would run two MPI ranks on each node, each rank exclusively bound to 3 cores.
>> This leaves 2 cores on each node for Linux. When the scheduler sees the 6 
>> cores of your MPI/MP procs working hard, and 2 cores sitting idle, it will 
>> tend to use those 2 cores for everything else - and not be tempted to push 
>> you aside to gain access to "your" cores.
>> HTH
>> Ralph
>> On Feb 29, 2012, at 3:08 AM, Auclair Francis wrote:
>>> Dear Open-MPI users,
>>> 
>>> Our code currently runs Open MPI (1.5.4) with SLURM on a NUMA machine 
>>> (2 sockets per node and 4 cores per socket) with basically two
>>> levels of implementation for Open MPI:
>>> - at the lower level, n "Master" MPI processes (one per socket) are
>>> run simultaneously by classically dividing the physical domain into n
>>> sub-domains
>>> - while at the higher level, 4n MPI processes are spawned to run a sparse 
>>> Poisson solver.

Re: [OMPI users] Hybrid OpenMPI / OpenMP programming

2012-03-02 Thread Paul Kapinos

Hello Ralph,
I have some questions on placement and -cpus-per-rank.

First, use the --cpus-per-rank option to separate the ranks from each other. 
In other words, instead of --bind-to-socket -bysocket, you do:


-bind-to-core -cpus-per-rank N

This will take each rank and bind it to a unique set of N cores, 
thereby cleanly separating them on the node.


Yes, it helps a lot, but the placement arranged in this way is still not 
optimal, I believe.
The cores are assigned in incremental order, starting from 0. On a 2-socket, 
12-core machine:

socket 0: cores 0-5  (hypercores 12-17)
socket 1: cores 6-11 (hypercores 18-23)

running 2 processes with 5 threads each leads to this:

0 <#> linuxbdc07.rz.RWTH-Aachen.DE <#> physcpubind: 0 1 2 3 4
1 <#> linuxbdc07.rz.RWTH-Aachen.DE <#> physcpubind: 5 6 7 8 9
(unused cores: 10, 11; unused hypercores: 12-23)
That is, one MPI process is bound to core 0 (which is a sweet spot for many 
kernel tasks), and the threads of the 2nd process are spread over both sockets.


- is there a way to tell the system to distribute the processes (= slot chunks 
defined by -cpus-per-rank N) over the sockets in a round-robin fashion?
- is there a way to say "do not use this core number!" in order to force some 
alignment in core numbering?

- is there a way to use the hypercores alongside the real cores?

And last but not least, I found that starting and running the program on 
differing hardware is problematic.


Trying to start a 2-rank, 5-thread job on a 2x6-core computer from my 4-core 
workstation, I get the error message below.


It seems that the core-number calculation / pinning determination is part of 
the *mpiexec* process instead of being run on the target node? *puzzled*





$ mpiexec -np 1 -H linuxbdc01 -bind-to-core -cpus-per-rank 5 ompi_testpin.sh 
MPI_FastTest.exe

--
Your job has requested more cpus per process(rank) than there
are cpus in a socket:

  Cpus/rank: 5
  #cpus/socket: 4

Please correct one or both of these values and try again.
--

$ ssh linuxbdc01 cat /proc/cpuinfo | grep processor | wc -l
24

$ cat /proc/cpuinfo | grep processor | wc -l
4



Best,

Paul

P.S. Using Open MPI 1.5.3, waiting for 1.5.5 :o)

P.S.2. any update on this? 
http://www.open-mpi.org/community/lists/users/2012/01/18240.php


P.S.3. on the same 16-way, 128-core hardware as in P.S.2, -cpus-per-rank also 
goes crazy:


$ mpiexec -mca btl_openib_warn_default_gid_prefix=0 -np 2 -H linuxbcsc21 
-bind-to-core -cpus-per-rank 5 --report-bindings  ompi_testpin.sh MPI_FastTest.exe
[linuxbcsc21.rz.RWTH-Aachen.DE:106342] [[55934,0],1] odls:default:fork binding 
child [[55934,1],0] to cpus 1000100010001
[linuxbcsc21.rz.RWTH-Aachen.DE:106342] [[55934,0],1] odls:default:fork binding 
child [[55934,1],1] to cpus 20002

0 <#> linuxbcsc21.rz.RWTH-Aachen.DE <#> physcpubind: 0 16 32 48
1 <#> linuxbcsc21.rz.RWTH-Aachen.DE <#> physcpubind: 1 17


So, -cpus-per-rank 5, but one process gets 4 cores and the other only two... 
(the masks above are hex bitmaps: 0x1000100010001 = cpus 0,16,32,48; 
0x20002 = cpus 1,17).





What you can do is "entice" it away from your processes by leaving 1-2 cores 
for its own use. For example:

-npernode 2 -bind-to-core -cpus-per-rank 3

would run two MPI ranks on each node, each rank exclusively bound to 3 cores.
This leaves 2 cores on each node for Linux. When the scheduler sees the 6 
cores of your MPI/MP procs working hard, and 2 cores sitting idle, 
it will tend to use those 2 cores for everything else - 
and not be tempted to push you aside to gain access to "your" cores.


HTH
Ralph

On Feb 29, 2012, at 3:08 AM, Auclair Francis wrote:


Dear Open-MPI users,

Our code currently runs Open MPI (1.5.4) with SLURM on a NUMA machine (2 
sockets per node and 4 cores per socket) with basically two
levels of implementation for Open MPI:
- at the lower level, n "Master" MPI processes (one per socket) are
run simultaneously by classically dividing the physical domain into n
sub-domains
- while at the higher level, 4n MPI processes are spawned to run a sparse 
Poisson solver.
At each time step, the code is thus going back and forth between these two 
levels of implementation using two MPI communicators. This also means that 
during about half of the computation time, 3n cores are at best sleeping (if 
not 'waiting' at a barrier) when not inside "Solver routines". We consequently 
decided to add OpenMP functionality to our code for when the solver is not 
running (we declare one single "parallel" region and use the omp "master" 
construct when the OpenMP threads are not active). We however face several 
difficulties:

a) It seems that both the 3n MPI processes and the OpenMP threads 'consume 
processor cycles while waiting'. We consequently tried: mpirun
--mca mpi_yield_when_idle 1 , export OMP_WAIT_POLICY=passive or export
KMP_BLOCKTIME=0 ... The last of these finally leads to an interesting 
reduction in computing time but worsens the second problem we have to face 
(see below).

Re: [OMPI users] Hybrid OpenMPI / OpenMP programming

2012-02-29 Thread Ralph Castain
It sounds like you are running into an issue with the Linux scheduler. I have 
an item to add an API "bind-this-thread-to-", but that won't be 
available until sometime in the future.

A couple of things you could try in the meantime. First, use the --cpus-per-rank 
option to separate the ranks from each other. In other words, instead of 
--bind-to-socket -bysocket, you do:

-bind-to-core -cpus-per-rank N

This will take each rank and bind it to a unique set of N cores, thereby 
cleanly separating them on the node.
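For example, on one of your 2-socket, 4-cores-per-socket nodes, a sketch 
(./hybrid_app is a placeholder; --report-bindings just prints what each rank 
actually got):

$ mpiexec -np 2 -bind-to-core -cpus-per-rank 4 --report-bindings ./hybrid_app

Each rank then owns four cores, and its OpenMP threads inherit that affinity 
mask.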

Second, the Linux scheduler tends to become jealous of the way MPI procs "hog" 
the resources. The scheduler needs room to run all those daemons and other 
processes too. So it tends to squeeze you aside a little, just to create some 
room for the rest of the stuff.

What you can do is "entice" it away from your processes by leaving 1-2 cores 
for its own use. For example:

-npernode 2 -bind-to-core -cpus-per-rank 3

would run two MPI ranks on each node, each rank exclusively bound to 3 cores. 
This leaves 2 cores on each node for Linux. When the scheduler sees the 6 cores 
of your MPI/MP procs working hard, and 2 cores sitting idle, it will tend to 
use those 2 cores for everything else - and not be tempted to push you aside to 
gain access to "your" cores.
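And if you match the thread count to the core set (a sketch; this assumes the 
OpenMP runtime honors the inherited affinity mask), each rank's 3 threads have 
nowhere to wander off to:

$ export OMP_NUM_THREADS=3
$ mpiexec -npernode 2 -bind-to-core -cpus-per-rank 3 ./hybrid_app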

HTH
Ralph

On Feb 29, 2012, at 3:08 AM, Auclair Francis wrote:

> Dear Open-MPI users,
> 
> Our code currently runs Open MPI (1.5.4) with SLURM on a NUMA machine 
> (2 sockets per node and 4 cores per socket) with basically two
> levels of implementation for Open MPI:
> - at the lower level, n "Master" MPI processes (one per socket) are
> run simultaneously by classically dividing the physical domain into n
> sub-domains
> - while at the higher level, 4n MPI processes are spawned to run a sparse 
> Poisson solver.
> At each time step, the code is thus going back and forth between these two 
> levels of implementation using two MPI communicators. This also means that 
> during about half of the computation time, 3n cores are at best sleeping (if 
> not 'waiting' at a barrier) when not inside "Solver routines". We 
> consequently decided to add OpenMP functionality to our code for when the 
> solver is not running (we declare one single "parallel" region and use the 
> omp "master" construct when the OpenMP threads are not active). We however 
> face several difficulties:
> 
> a) It seems that both the 3n MPI processes and the OpenMP threads 'consume 
> processor cycles while waiting'. We consequently tried: mpirun
> --mca mpi_yield_when_idle 1 , export OMP_WAIT_POLICY=passive or export
> KMP_BLOCKTIME=0 ... The last of these finally leads to an interesting 
> reduction in computing time but worsens the second problem we have to face 
> (see below).
> 
> b) We managed to get a "correct" (?) placement of our MPI processes
> on our sockets by using: mpirun -bind-to-socket -bysocket -np 4n 
> However, while the OpenMP threads initially seem to scatter over each socket 
> (one thread per core), they slowly migrate to the same core as their 'Master 
> MPI process', or gather on one or two cores per socket.
> We played around with the environment variable KMP_AFFINITY, but the best we 
> could obtain was a pinning of the OpenMP threads to their own cores... 
> disorganizing at the same time the placement of the 4n Level-2 MPI 
> processes. In addition, neither specifying a rankfile nor the mpirun option 
> -x IPATH_NO_CPUAFFINITY=1 seems to change the situation significantly.
> This behavior looks rather inefficient, but so far we have not managed to 
> prevent the migration of the 4 threads to at most a couple of cores!
> 
> Is there something wrong with our "Hybrid" implementation?
> Do you have any advice?
> Thanks for your help,
> Francis
> 




[OMPI users] Hybrid OpenMPI / OpenMP programming

2012-02-29 Thread Auclair Francis

Dear Open-MPI users,

Our code currently runs Open MPI (1.5.4) with SLURM on a NUMA 
machine (2 sockets per node and 4 cores per socket) with basically two
levels of implementation for Open MPI:
- at the lower level, n "Master" MPI processes (one per socket) are
run simultaneously by classically dividing the physical domain into n
sub-domains
- while at the higher level, 4n MPI processes are spawned to run a sparse 
Poisson solver.
At each time step, the code is thus going back and forth between these 
two levels of implementation using two MPI communicators. This also 
means that during about half of the computation time, 3n cores are at 
best sleeping (if not 'waiting' at a barrier) when not inside "Solver 
routines". We consequently decided to add OpenMP functionality to 
our code for when the solver is not running (we declare one single 
"parallel" region and use the omp "master" construct when the OpenMP 
threads are not active). We however face several difficulties:
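A minimal sketch of that skeleton in C (the phase routines are empty 
placeholders standing in for our actual model and solver code):

#include <mpi.h>
#include <omp.h>

#define NSTEPS 100  /* placeholder time-step count */

static void openmp_phase(void) { /* threaded work on the sub-domain */ }
static void solver_phase(void) { /* MPI-level sparse Poisson solve  */ }

int main(int argc, char **argv)
{
    int provided;
    /* Only the master thread makes MPI calls, so FUNNELED suffices. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    #pragma omp parallel        /* the one single "parallel" region */
    {
        for (int step = 0; step < NSTEPS; step++) {
            openmp_phase();     /* all threads active here          */
            #pragma omp barrier
            #pragma omp master  /* other threads idle; master only  */
            solver_phase();
            #pragma omp barrier /* "master" has no implied barrier  */
        }
    }

    MPI_Finalize();
    return 0;
}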


a) It seems that both the 3n MPI processes and the OpenMP threads 
'consume processor cycles while waiting'. We consequently tried: mpirun
--mca mpi_yield_when_idle 1 … , export OMP_WAIT_POLICY=passive or export
KMP_BLOCKTIME=0 ... The last of these finally leads to an interesting 
reduction in computing time but worsens the second problem we have to face 
(see below).
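Put together, the mitigations above look like this (a sketch; ./our_code is a 
placeholder, and KMP_BLOCKTIME only affects the Intel OpenMP runtime):

$ export OMP_WAIT_POLICY=passive   # threads sleep at barriers instead of spinning
$ export KMP_BLOCKTIME=0           # Intel runtime: release the core immediately
$ mpirun --mca mpi_yield_when_idle 1 -np 8 ./our_code   # 8 = 4n for n=2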

b) We managed to get a "correct" (?) placement of our MPI processes
on our sockets by using: mpirun -bind-to-socket -bysocket -np 4n …
However, while the OpenMP threads initially seem to scatter over each 
socket (one thread per core), they slowly migrate to the same core as 
their 'Master MPI process', or gather on one or two cores per socket… 
We played around with the environment variable KMP_AFFINITY, but the 
best we could obtain was a pinning of the OpenMP threads to their own 
cores... disorganizing at the same time the placement of the 4n Level-2 
MPI processes. In addition, neither specifying a rankfile nor the mpirun 
option -x IPATH_NO_CPUAFFINITY=1 seems to change the situation significantly.
This behavior looks rather inefficient, but so far we have not managed 
to prevent the migration of the 4 threads to at most a couple of cores!
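For reference, the KMP_AFFINITY experiments were of this kind (Intel runtime 
only; the explicit proclist is just an example for a rank meant to stay on 
socket 0 of our 2x4-core nodes):

$ export KMP_AFFINITY=verbose,granularity=fine,compact
$ export KMP_AFFINITY="verbose,granularity=fine,proclist=[0,1,2,3],explicit"

The verbose modifier makes the runtime print the binding it applies at 
startup, which makes the initial placement easy to check.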


Is there something wrong with our "Hybrid" implementation?
Do you have any advice?
Thanks for your help,
Francis