[OMPI users] Hybrid OpenMPI / OpenMP programming

Auclair Francis Wed, 29 Feb 2012 05:08:14 -0500

Dear Open-MPI users,

Our code is currently running Open-MPI (1.5.4) with SLURM on a NUMA
machine (2 sockets per node and 4 cores per socket) with basically two
levels of implementation for Open-MPI:
- at the lower level, n "Master" MPI processes (one per socket) run
simultaneously, with the physical domain classically divided into n
sub-domains

- while at the higher level, 4n MPI processes are spawned to run a sparse
Poisson solver.

At each time step, the code thus goes back and forth between these two
levels of implementation using two MPI communicators. This also means
that during about half of the computation time, 3n cores are at best
sleeping (if not 'waiting' at a barrier) when not inside "Solver
routines". We consequently decided to implement OpenMP functionality in
our code for the phases when the solver is not running (we declare one
single "parallel" region and use the omp "master" directive when OpenMP
threads are not active). However, we face several difficulties:

a) It seems that both the 3n MPI processes and the OpenMP threads
'consume processor cycles while waiting'. We consequently tried: mpirun
-mpi_yield_when_idle 1, export OMP_WAIT_POLICY=passive, or export
KMP_BLOCKTIME=0 ... The last of these finally leads to an interesting
reduction of computing time but worsens the second problem we have to
face (see below).

b) We managed to obtain a "correct" (?) placement of our MPI processes
on our sockets by using: mpirun -bind-to-socket -bysocket -np 4n
However, although the OpenMP threads initially seem to scatter over each
socket (one thread per core), they slowly migrate to the same core as
their 'Master MPI process' or gather on one or two cores per socket. We
played around with the environment variable KMP_AFFINITY, but the best
we could obtain was a pinning of the OpenMP threads to their own core...
disorganizing at the same time the placement of the 4n Level-2 MPI
processes. When added, neither the specification of a rankfile nor the
mpirun option -x IPATH_NO_CPUAFFINITY=1 seems to change the situation
significantly. This behaviour looks rather inefficient, but so far we
have not managed to prevent the migration of the 4 threads to at most a
couple of cores!
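
To see where the threads actually end up in case b), one option on Linux
is to read each thread's allowed CPU list and last-used CPU from /proc.
The following is only a minimal sketch under that assumption. The helper
names (parse_cpu_list, last_cpu_from_stat, thread_placement) are
hypothetical and belong to neither Open-MPI nor the OpenMP runtime. The
sketch relies on the Cpus_allowed_list entry of
/proc/<pid>/task/<tid>/status and on field 39 (processor) of the matching
stat file, and it does not handle threads that exit while being read.

```python
import os
import sys


def parse_cpu_list(text):
    """Expand a Linux CPU list such as '0-3,8' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def last_cpu_from_stat(stat_line):
    """Return field 39 ('processor') of a /proc/<pid>/task/<tid>/stat line."""
    # The command name (field 2) may contain spaces or parentheses, so split
    # after the last ')' first; the remainder then starts at field 3.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[39 - 3])


def thread_placement(pid):
    """Map each thread id of `pid` to (allowed CPUs, last CPU). Linux only."""
    placement = {}
    task_dir = f"/proc/{pid}/task"
    for tid in sorted(os.listdir(task_dir), key=int):
        allowed = []
        with open(f"{task_dir}/{tid}/status") as status:
            for line in status:
                if line.startswith("Cpus_allowed_list:"):
                    allowed = parse_cpu_list(line.split(":", 1)[1])
        with open(f"{task_dir}/{tid}/stat") as stat:
            last_cpu = last_cpu_from_stat(stat.read())
        placement[int(tid)] = (allowed, last_cpu)
    return placement


if __name__ == "__main__":
    # Inspect the PID given on the command line, or this process by default.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for tid, (allowed, last_cpu) in thread_placement(pid).items():
        print(f"tid {tid}: allowed CPUs {allowed}, last ran on CPU {last_cpu}")
```

Run on a compute node with the PID of one MPI process, a few times during
a run: several threads reporting the same last CPU, or a last CPU that
drifts between calls while the allowed list spans the whole socket, would
be consistent with the migration described above.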


Is there something wrong in our "Hybrid" implementation?
Do you have any advice?
Thanks for your help,
Francis
