On Mar 2, 2012, at 11:52 AM, Paul Kapinos wrote:

> Hello Ralph,
> I have some questions about placement and -cpus-per-rank.
> 
>> First, use the --cpus-per-rank option to separate the ranks from each other. 
>> In other words, instead of --bind-to-socket -bysocket, you do:
>> -bind-to-core -cpus-per-rank N
>> This will take each rank and bind it to a unique set of N cores, thereby 
>> cleanly separating them on the node.
> 
> Yes, it helps a lot, but the placement arranged in this way is still not
> optimal, I believe.
> The cores are assigned starting from 0 in incremental order. On a 2-socket,
> 12-core machine:
> (0,1,2,3,4,  5[,12,13,14,15,16,17])
> (6,7,8,9,10,11[,18,19,20,21,22,23])
> ^cores^         ^hypercores^
> 
> Running 2 processes with 5 threads each leads to this:
> 
> 0 <#> linuxbdc07.rz.RWTH-Aachen.DE <#> physcpubind: 0 1 2 3 4
> 1 <#> linuxbdc07.rz.RWTH-Aachen.DE <#> physcpubind: 5 6 7 8 9
> (unused cores: 10, 11; unused hypercores: 12-23)
> That is, there is an MPI process bound to core 0 (which is a sweet spot for
> many kernel things), and the threads of the 2nd process are spread across
> both sockets.

Yeah, the current implementation isn't quite as good as we'd like. We rewrote 
the entire binding system for the trunk/upcoming 1.7 series.

> 
> - is there a way to tell the system to distribute the processes (= slot chunks
> defined by -cpus-per-rank N) over the sockets in a round-robin fashion?

No, but I should add it

> - is there a way to say "do not use this core number!" in order to add some 
> alignment in core numbering?

No, but again, I should add it

> - is there a way to use the hypercores together with the real cores?

Not in the 1.5 series, but on the trunk you can

> 
> And last but not least, I found out that starting and running the program
> across differing hardware is problematic.
> 
> Trying to start a 2-rank, 5-thread job on a 2x6-core computer from my 4-core
> workstation, I get the error message below.
> 
> It seems that the calculation of core numbers / pinning determination is part
> of the *mpiexec* process instead of being run on the target node? *puzzled*

No, it is done on the backend. I suspect there is a bug, though, that is 
causing the number of cores/socket to be *sensed* on the mpiexec node and then 
*passed* back to the daemon. Hasn't surfaced before because the only folks 
using this option are on homogeneous systems.

FWIW: the trunk resolves this problem, but I haven't checked the cpus-per-rank 
support on it yet.
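
To illustrate what "done on the backend" should mean: each daemon is supposed to
query its own node's topology, roughly along the lines of the hwloc probe
sketched below (a minimal, hypothetical illustration, not the actual ORTE code),
rather than inherit the cores/socket count from the mpiexec node.

#include <hwloc.h>
#include <stdio.h>

/* Minimal local-topology probe: how many sockets and cores does THIS node
   have?  The cpus-per-rank sanity check should be fed by a probe like this
   run on the target node, not on the node where mpiexec was launched. */
int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(&topo);

    int nsockets = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET);
    int ncores   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);

    printf("sockets: %d  cores: %d  cores/socket: %d\n",
           nsockets, ncores, nsockets ? ncores / nsockets : 0);

    hwloc_topology_destroy(&topo);
    return 0;
}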

> 
> 
> 
> 
> $ mpiexec -np 1 -H linuxbdc01 -bind-to-core -cpus-per-rank 5 ompi_testpin.sh 
> MPI_FastTest.exe
> --------------------------------------------------------------------------
> Your job has requested more cpus per process(rank) than there
> are cpus in a socket:
> 
>  Cpus/rank: 5
>  #cpus/socket: 4
> 
> Please correct one or both of these values and try again.
> --------------------------------------------------------------------------
> 
> $ ssh linuxbdc01 cat /proc/cpuinfo | grep processor | wc -l
> 24
> 
> $ cat /proc/cpuinfo | grep processor | wc -l
> 4
> 
> 
> 
> Best,
> 
> Paul
> 
> P.S. Using Open MPI 1.5.3, waiting for 1.5.5 :o)
> 
> P.S.2. Any update on this?
> http://www.open-mpi.org/community/lists/users/2012/01/18240.php
> 
> P.S.3. On the same 16-way, 128-core hardware as in P.S.2, -cpus-per-rank also
> goes crazy:
> 
> $ mpiexec -mca btl_openib_warn_default_gid_prefix=0 -np 2 -H linuxbcsc21 
> -bind-to-core -cpus-per-rank 5 --report-bindings  ompi_testpin.sh 
> MPI_FastTest.exe
> [linuxbcsc21.rz.RWTH-Aachen.DE:106342] [[55934,0],1] odls:default:fork 
> binding child [[55934,1],0] to cpus 1000100010001
> [linuxbcsc21.rz.RWTH-Aachen.DE:106342] [[55934,0],1] odls:default:fork 
> binding child [[55934,1],1] to cpus 20002
> 0 <#> linuxbcsc21.rz.RWTH-Aachen.DE <#> physcpubind: 0 16 32 48
> 1 <#> linuxbcsc21.rz.RWTH-Aachen.DE <#> physcpubind: 1 17
> 
> 
> So, -cpus-per-rank 5, but one process gets 4 cores and the other one only two...
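
FWIW, those odls cpu masks are hexadecimal, and decoding them matches your
physcpubind lines: 0x1000100010001 covers cpus 0, 16, 32, 48 and 0x20002 covers
cpus 1 and 17, i.e. 4 and 2 cpus instead of 5 each. A throwaway decoder (just an
illustration, not Open MPI code) confirms it:

#include <stdio.h>
#include <stdlib.h>

/* Decode a hexadecimal cpu mask, as printed by odls:default:fork above, into
   the logical cpu numbers it covers.  (These particular masks fit into 64
   bits; a full 128-cpu mask would need wider arithmetic.) */
static void decode_mask(const char *hex)
{
    unsigned long long mask = strtoull(hex, NULL, 16);
    int count = 0;

    printf("%s ->", hex);
    for (int cpu = 0; cpu < 64; cpu++) {
        if (mask & (1ULL << cpu)) {
            printf(" %d", cpu);
            count++;
        }
    }
    printf("   (%d cpus)\n", count);
}

int main(void)
{
    decode_mask("1000100010001");  /* rank 0: cpus 0 16 32 48 -> 4 cpus */
    decode_mask("20002");          /* rank 1: cpus 1 17       -> 2 cpus */
    return 0;
}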
> 
> 
> 
> 
>> What you can do is "entice" it away from your processes by leaving 1-2 cores 
>> for its own use. For example:
>> -npernode 2 -bind-to-core -cpus-per-rank 3
>> would run two MPI ranks on each node, each rank exclusively bound to 3 cores.
>> This leaves 2 cores on each node for Linux. When the scheduler sees the 6 
>> cores of your MPI/OpenMP procs working hard, and 2 cores sitting idle, it will
>> tend to use those 2 cores for everything else - and not be tempted to push 
>> you aside to gain access to "your" cores.
>> HTH
>> Ralph
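
As a side note on the arithmetic in that suggestion: with the sequential core
assignment observed above (cores handed out from 0 upward), -npernode 2
-bind-to-core -cpus-per-rank 3 on a 2x4-core node would give rank 0 cores 0-2
and rank 1 cores 3-5, leaving cores 6-7 for the OS. A toy sketch (not Open MPI
code, just an illustration of the layout) is:

#include <stdio.h>

int main(void)
{
    /* Assumed node and job shape, matching the example above. */
    const int npernode = 2, cpus_per_rank = 3, cores_on_node = 8;

    /* Sequential assignment as observed in the 1.5 series:
       rank r gets cores [r*cpus_per_rank, (r+1)*cpus_per_rank). */
    for (int rank = 0; rank < npernode; rank++) {
        printf("rank %d -> cores", rank);
        for (int c = rank * cpus_per_rank; c < (rank + 1) * cpus_per_rank; c++)
            printf(" %d", c);
        printf("\n");
    }

    /* Note that rank 1's cores (3,4,5) straddle the two sockets if socket 0
       holds cores 0-3 and socket 1 holds cores 4-7, which is exactly why a
       round-robin-by-socket option would help. */
    printf("cores left for the OS: %d\n",
           cores_on_node - npernode * cpus_per_rank);
    return 0;
}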
>> On Feb 29, 2012, at 3:08 AM, Auclair Francis wrote:
>>> Dear Open-MPI users,
>>> 
>>> Our code is currently running Open MPI (1.5.4) with SLURM on a NUMA machine
>>> (2 sockets per node and 4 cores per socket) with basically two
>>> levels of implementation for Open MPI:
>>> - at the lower level, n "Master" MPI processes (one per socket) run
>>> simultaneously, classically dividing the physical domain into n
>>> sub-domains
>>> - while at the higher level, 4n MPI processes are spawned to run a sparse
>>> Poisson solver.
>>> At each time step, the code thus goes back and forth between these two
>>> levels of implementation using two MPI communicators. This also means that
>>> during about half of the computation time, 3n cores are at best sleeping
>>> (if not 'waiting' at a barrier) when not inside "Solver routines". We
>>> consequently decided to add OpenMP functionality to our code for the phases
>>> when the solver is not running (we declare one single "parallel" region and
>>> use the omp "master" directive where the OpenMP threads are not meant to be
>>> active). We however face several difficulties:
>>> 
>>> a) It seems that both the 3n MPI processes and the OpenMP threads 'consume
>>> processor cycles while waiting'. We consequently tried: mpirun
>>> -mpi_yield_when_idle 1, export OMP_WAIT_POLICY=passive or export
>>> KMP_BLOCKTIME=0 ... The last of these finally leads to an interesting
>>> reduction in computing time but worsens the second problem we have to face
>>> (see below).
>>> 
>>> b) We managed to get a "correct" (?) placement of our MPI processes
>>> on our sockets by using: mpirun -bind-to-socket -bysocket -np 4n. However,
>>> even though the OpenMP threads initially seem to scatter over each socket
>>> (one thread per core), they slowly migrate to the same core as their
>>> 'Master MPI process' or gather on one or two cores per socket.
>>> We played around with the environment variable KMP_AFFINITY, but the best
>>> we could obtain was a pinning of the OpenMP threads to their own cores...
>>> disorganizing at the same time the placement of the 4n level-2 MPI
>>> processes. In addition, neither the specification of a rankfile nor the
>>> mpirun option -x IPATH_NO_CPUAFFINITY=1 seems to change the situation
>>> significantly.
>>> This behavior looks rather inefficient, but so far we have not managed to
>>> prevent the migration of the 4 threads onto at most a couple of cores!
>>> 
>>> Is there something wrong with our "hybrid" implementation?
>>> Do you have any advice?
>>> Thanks for your help,
>>> Francis
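
As a side note for anyone trying to reproduce this: the single-parallel-region /
omp "master" structure described above would look roughly like the sketch below.
This is only a minimal, hypothetical illustration (the routine names are made
up, and the real code workshares the non-solver part across the threads), not
the actual application:

#include <mpi.h>
#include <omp.h>

/* Hypothetical stand-ins for the real application's routines. */
static void threaded_physics_step(int step) { (void)step; /* OpenMP worksharing here */ }
static void sparse_poisson_solve(int step)  { (void)step; /* MPI-only solver phase   */ }

int main(int argc, char **argv)
{
    int provided;

    /* MPI is called only from the master thread inside the parallel region,
       so MPI_THREAD_FUNNELED is the minimum safe level to request. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    #pragma omp parallel
    {
        for (int step = 0; step < 100; step++) {
            /* All threads are active for the OpenMP part of the time step. */
            threaded_physics_step(step);

            /* Only the master thread runs the solver / MPI exchanges; the
               other threads sit at the explicit barrier below, which is where
               the cycle-burning described in point a) shows up. */
            #pragma omp master
            sparse_poisson_solve(step);
            #pragma omp barrier
        }
    }

    MPI_Finalize();
    return 0;
}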
>>> 
> 
> 
> -- 
> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> Tel: +49 241/80-24915