Mark Dixon <m.c.di...@leeds.ac.uk> writes:

> Hi there,
>
> We've started looking at moving to the openmpi 1.8 branch from 1.6 on
> our CentOS6/Son of Grid Engine cluster and noticed an unexpected
> difference when binding multiple cores to each rank.
>
> Has openmpi's definition of 'slot' changed between 1.6 and 1.8?
You wouldn't expect it to be documented if so, of course :-(, but it
doesn't look so.

> It used to mean ranks, but now it appears to mean processing elements
> (see Details, below).

I'm fairly confused by this.  Bizarrely, it happens I was going to ask
whether anyone had a patch or workaround for the problem we see with
1.6.  [I notice there was a previous thread about mpi+openmp, which I
didn't catch at the time, that looked pretty confused.  I suppose I
should follow it up for the archives.]

> Thanks,
>
> Mark
>
> PS Also, the man page for 1.8.3 reports that '--bysocket' is
> deprecated, but it doesn't seem to exist when we try to use it:
>
>   mpirun: Error: unknown option "-bysocket"
>   Type 'mpirun --help' for usage.

[Yes, per mpirun --help.]

> ====== Details ======
>
> On 1.6.5, we launch with the following core binding options:
>
>   mpirun --bind-to-core --cpus-per-proc <n> <program>

That just doesn't work here on multiple nodes (and you forgot the --np
to override $NSLOTS).  It tries to over-allocate the first host.  The
workaround is to use --loadbalance in this case, but it fails in the
normal case if you try to make it the default, sigh.  So the
recommendation for MPI+OpenMP jobs, until I fix it, is a script like

  #$ -l exclusive
  export OMP_NUM_THREADS=2
  exec mpirun --loadbalance --cpus-per-proc $OMP_NUM_THREADS --np $(($NSLOTS/$OMP_NUM_THREADS)) ...

assuming OMP_NUM_THREADS divides cores/socket on the relevant nodes
sensibly, and eliding issues with per-rank OMP affinity.

>   mpirun --bind-to-core --bysocket --cpus-per-proc <n> <program>

Similarly in that case.  (I assume that trying to keep consecutive ranks
adjacent is a good default.)

> where <n> is calculated to maximise the number of cores available to
> use - I guess effectively
> max(1, int(number of cores per node / slots per node requested)).
>
> openmpi reads the file $PE_HOSTFILE and launches a rank for each slot
> defined in it, binding <n> cores per rank.
That's why you need the --np, or is this with a fiddled host file?

> On 1.8.3, we've tried launching with the following core binding
> options (which we hoped were equivalent):
>
>   mpirun -map-by node:PE=<n> <program>
>   mpirun -map-by socket:PE=<n> <program>

With 1.8.3 here, replacing "--loadbalance --cpus-per-proc" with
"--map-by slot:PE=2" works.  I assume you use --report-bindings to check
what's going on (which gave me the hint about --loadbalance).  I've
never seen it lie about the binding the processes actually get.

> openmpi reads the file $PE_HOSTFILE and launches a factor of <n> fewer
> ranks than under 1.6.5. We also notice that, where we wanted a single
> rank on the box and <n> is the number of cores available, openmpi
> refuses to launch and we get the message:
>
>   "There are not enough slots available in the system to satisfy the 1
>   slots that were requested by the application"
>
> I think that error message needs a little work :)
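
For the archives, the rank arithmetic from the job script above can be
written as a small shell helper.  This is only a sketch: the function
names ranks_for and launch_line are made up for illustration, the
defaults of 16 slots and 2 threads are arbitrary, and it prints the
command line rather than running it.  The options are the ones quoted
in this thread, and I have only checked them against 1.6.5 and 1.8.3
as described.

```shell
#!/bin/sh
# Sketch: derive the rank count for a hybrid MPI+OpenMP job from the
# slot count granted by the scheduler, then print (not run) the launch
# line for the 1.6 or 1.8 branch.  Function names are hypothetical.

# ranks_for SLOTS THREADS
# Print the number of ranks such that ranks * threads <= slots.
ranks_for() {
    slots=$1
    threads=$2
    if [ "$threads" -lt 1 ] || [ "$slots" -lt "$threads" ]; then
        echo "ranks_for: need 1 <= threads <= slots" >&2
        return 1
    fi
    echo $(( slots / threads ))
}

# launch_line BRANCH SLOTS THREADS
# Print the mpirun options discussed in this thread for BRANCH (1.6 or 1.8).
launch_line() {
    np=$(ranks_for "$2" "$3") || return 1
    case $1 in
        1.6) echo "mpirun --loadbalance --cpus-per-proc $3 --np $np" ;;
        1.8) echo "mpirun --map-by slot:PE=$3 --np $np" ;;
        *)   echo "launch_line: unknown branch $1" >&2; return 1 ;;
    esac
}

# Illustrative call: fall back to 16 slots and 2 threads when the
# scheduler has not set NSLOTS or OMP_NUM_THREADS.
launch_line 1.8 "${NSLOTS:-16}" "${OMP_NUM_THREADS:-2}"
```

Append the program name to the printed line, and add
--report-bindings when checking that each rank really gets the cores
you expect.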