On Mar 2, 2012, at 11:52 AM, Paul Kapinos wrote:

> Hello Ralph,
> I have some questions on placement and -cpus-per-rank.
>
>> First, use the --cpus-per-rank option to separate the ranks from each other.
>> In other words, instead of --bind-to-socket -bysocket, you do:
>>
>>     -bind-to-core -cpus-per-rank N
>>
>> This will take each rank and bind it to a unique set of N cores, thereby
>> cleanly separating them on the node.
>
> Yes, it helps a lot, but the placement arranged this way is still not
> optimal, I believe. The cores are assigned from 0 onward in incremental
> order. On a 2-socket, 12-core machine:
>
>     socket 0: cores 0,1,2,3,4,5    (hypercores 12,13,14,15,16,17)
>     socket 1: cores 6,7,8,9,10,11  (hypercores 18,19,20,21,22,23)
>
> Running 2 processes with 5 threads each leads to this:
>
>     0 <#> linuxbdc07.rz.RWTH-Aachen.DE <#> physcpubind: 0 1 2 3 4
>     1 <#> linuxbdc07.rz.RWTH-Aachen.DE <#> physcpubind: 5 6 7 8 9
>
> (unused cores: 10, 11; unused hypercores: 12-23)
>
> That is, there is an MPI process bound to core 0 (which is the sweet spot
> for many kernel activities), and the threads of the 2nd process are spread
> over both sockets.
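(As an aside: one possible manual workaround for this kind of placement on the
1.5 series might be an explicit rankfile, which lets you choose the socket and
core numbers yourself and leave core 0 free. This is only an untested sketch,
reusing the node from the report above; adjust hostnames and core ranges to
your machine:

    $ cat myrankfile
    rank 0=linuxbdc07 slot=0:1-5
    rank 1=linuxbdc07 slot=1:0-4
    $ mpiexec -np 2 -rf myrankfile ompi_testpin.sh MPI_FastTest.exe

Here "slot=0:1-5" means socket 0, cores 1-5, so rank 0 stays off core 0 and
rank 1 sits entirely on socket 1.)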
Yeah, the current implementation isn't quite as good as we'd like. We rewrote
the entire binding system for the trunk/upcoming 1.7 series.

> - is there a way to tell the system to spread the processes (= slot chunks
>   defined by -cpus-per-rank N) over the sockets in a round-robin fashion?

No, but I should add it.

> - is there a way to say "do not use this core number!" in order to add some
>   alignment in core numbering?

No, but again, I should add it.

> - is there a way to use the hypercores together with the real cores?

Not in the 1.5 series, but on the trunk you can.

> And last but not least, I found out that starting and running the program on
> differing hardware is problematic.
>
> Trying to start a 2-rank, 5-thread job on a 2x6-core computer from my 4-core
> workstation, I get the error message below.
>
> It seems that the calculation of core numbers / pinning is done as part of
> the *mpiexec* process instead of being run on the target node? *puzzled*

No, it is done on the backend. I suspect there is a bug, though, that is
causing the number of cores/socket to be *sensed* on the mpiexec node and then
*passed* back to the daemon. It hasn't surfaced before because the only folks
using this option are on homogeneous systems.

FWIW: the trunk resolves this problem, but I haven't checked the cpus-per-rank
support on it yet.

> $ mpiexec -np 1 -H linuxbdc01 -bind-to-core -cpus-per-rank 5 ompi_testpin.sh MPI_FastTest.exe
> --------------------------------------------------------------------------
> Your job has requested more cpus per process(rank) than there
> are cpus in a socket:
>
>   Cpus/rank:    5
>   #cpus/socket: 4
>
> Please correct one or both of these values and try again.
> --------------------------------------------------------------------------
>
> $ ssh linuxbdc01 cat /proc/cpuinfo | grep processor | wc -l
> 24
>
> $ cat /proc/cpuinfo | grep processor | wc -l
> 4
>
> Best,
> Paul
>
> P.S. Using Open MPI 1.5.3, waiting for 1.5.5 :o)
>
> P.S.2. Any update on this?
> http://www.open-mpi.org/community/lists/users/2012/01/18240.php
>
> P.S.3. On the same 16-way, 128-core hardware as in P.S.2, -cpus-per-rank
> also goes crazy:
>
> $ mpiexec -mca btl_openib_warn_default_gid_prefix=0 -np 2 -H linuxbcsc21 -bind-to-core -cpus-per-rank 5 --report-bindings ompi_testpin.sh MPI_FastTest.exe
> [linuxbcsc21.rz.RWTH-Aachen.DE:106342] [[55934,0],1] odls:default:fork binding child [[55934,1],0] to cpus 1000100010001
> [linuxbcsc21.rz.RWTH-Aachen.DE:106342] [[55934,0],1] odls:default:fork binding child [[55934,1],1] to cpus 20002
> 0 <#> linuxbcsc21.rz.RWTH-Aachen.DE <#> physcpubind: 0 16 32 48
> 1 <#> linuxbcsc21.rz.RWTH-Aachen.DE <#> physcpubind: 1 17
>
> So, -cpus-per-rank 5, but one process gets 4 cores and the other only two...

>> What you can do is "entice" it away from your processes by leaving 1-2 cores
>> for its own use. For example:
>>
>>     -npernode 2 -bind-to-core -cpus-per-rank 3
>>
>> would run two MPI ranks on each node, each rank exclusively bound to 3
>> cores. This leaves 2 cores on each node for Linux. When the scheduler sees
>> the 6 cores of your MPI/OpenMP procs working hard, and 2 cores sitting idle,
>> it will tend to use those 2 cores for everything else - and not be tempted
>> to push you aside to gain access to "your" cores.
>>
>> HTH
>> Ralph
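(For reference: the "physcpubind" lines in the reports above look like the
output of a small wrapper around numactl; a minimal sketch of such a wrapper
is shown below - the real ompi_testpin.sh may of course differ, and the name
showbind.sh is just a placeholder:

    #!/bin/sh
    # Print this rank's id, its host, and the cores it is confined to,
    # then exec the real program. OMPI_COMM_WORLD_RANK is set by mpiexec.
    echo "$OMPI_COMM_WORLD_RANK <#> $(hostname) <#> $(numactl --show | grep physcpubind)"
    exec "$@"

Launching with, e.g., "mpiexec ... --report-bindings ./showbind.sh ./a.out"
shows both what Open MPI thinks it bound (the odls lines) and what the kernel
actually enforces (the physcpubind lines), which is how the mismatch in P.S.3
becomes visible.)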
>> On Feb 29, 2012, at 3:08 AM, Auclair Francis wrote:
>>
>>> Dear Open-MPI users,
>>>
>>> Our code currently runs with Open MPI (1.5.4) and SLURM on a NUMA machine
>>> (2 sockets per node and 4 cores per socket), with basically two levels of
>>> implementation for Open MPI:
>>> - at the lower level, n "master" MPI processes (one per socket) run
>>>   simultaneously, dividing the physical domain classically into n
>>>   sub-domains;
>>> - at the higher level, 4n MPI processes are spawned to run a sparse
>>>   Poisson solver.
>>> At each time step the code thus goes back and forth between these two
>>> levels of implementation using two MPI communicators. This also means that
>>> during about half of the computation time, 3n cores are at best sleeping
>>> (if not 'waiting' at a barrier) when not inside the solver routines. We
>>> consequently decided to add OpenMP to our code for the phases when the
>>> solver is not running (we declare one single "parallel" region and use the
>>> omp "master" construct when the OpenMP threads are not active). We however
>>> face several difficulties:
>>>
>>> a) It seems that both the 3n MPI processes and the OpenMP threads consume
>>> processor cycles while waiting. We consequently tried: mpirun
>>> -mpi_yield_when_idle 1, export OMP_WAIT_POLICY=passive, or export
>>> KMP_BLOCKTIME=0 ... The last of these finally leads to an interesting
>>> reduction of computing time, but worsens the second problem we face (see
>>> below).
>>>
>>> b) We managed to get a "correct" (?) placement of our MPI processes on our
>>> sockets by using: mpirun -bind-to-socket -bysocket -np 4n. However, even
>>> if the OpenMP threads initially seem to scatter over each socket (one
>>> thread per core), they slowly migrate to the same core as their 'master
>>> MPI process', or gather on one or two cores per socket.
>>> We played around with the environment variable KMP_AFFINITY, but the best
>>> we could obtain was a pinning of the OpenMP threads to their own cores...
>>> disorganizing at the same time the placement of the 4n level-2 MPI
>>> processes. In addition, neither the specification of a rankfile nor the
>>> mpirun option -x IPATH_NO_CPUAFFINITY=1 seems to change the situation
>>> significantly.
>>> This behaviour looks rather inefficient, but so far we have not managed to
>>> prevent the migration of the 4 threads onto at most a couple of cores!
>>>
>>> Is there something wrong in our "hybrid" implementation?
>>> Do you have any advice?
>>> Thanks for your help,
>>> Francis
>
> --
> Dipl.-Inform. Paul Kapinos - High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23, D 52074 Aachen (Germany)
> Tel: +49 241/80-24915
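(Regarding the waiting-cycles issue in Francis's point (a) above: the pieces he
mentions can be combined into a single launch line. The sketch below is only an
illustration - process count, thread count, and the executable name are
placeholders - and it addresses the busy-waiting, not the thread-migration
problem in (b):

    # placeholders: adjust -np, OMP_NUM_THREADS and the executable name
    export OMP_NUM_THREADS=4          # one OpenMP thread per core of a socket
    export OMP_WAIT_POLICY=passive    # idle OpenMP threads sleep instead of spinning
    export KMP_BLOCKTIME=0            # Intel runtime: release the core immediately
    mpirun -np 8 -bysocket -bind-to-socket \
           --mca mpi_yield_when_idle 1 \
           -x OMP_NUM_THREADS -x OMP_WAIT_POLICY -x KMP_BLOCKTIME \
           --report-bindings ./hybrid_code

Here -x forwards the environment variables to the remote ranks, and
--report-bindings confirms where each rank actually landed.)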