Hi there,

We've started looking at moving to the openmpi 1.8 branch from 1.6 on our CentOS6/Son of Grid Engine cluster and noticed an unexpected difference when binding multiple cores to each rank.

Has openmpi's definition of 'slot' changed between 1.6 and 1.8? It used to mean ranks, but now it appears to mean processing elements (see Details, below).

Thanks,

Mark

PS Also, the man page for 1.8.3 reports that '--bysocket' is deprecated, but it doesn't seem to exist when we try to use it:

  mpirun: Error: unknown option "-bysocket"
  Type 'mpirun --help' for usage.

====== Details ======

On 1.6.5, we launch with the following core binding options:

  mpirun --bind-to-core --cpus-per-proc <n> <program>
  mpirun --bind-to-core --bysocket --cpus-per-proc <n> <program>

  where <n> is calculated to maximise the number of cores each rank
  can use - I guess effectively
  max(1, int(number of cores per node / slots per node requested)).
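
  As a rough sketch of that calculation (the function name and the
  16-core example node are illustrative only, not taken from our
  actual job scripts):

```python
def cores_per_rank(cores_per_node: int, slots_per_node: int) -> int:
    """Cores to bind to each rank: cores divided by slots, rounded down, but never below 1."""
    # Integer division matches int(cores / slots) for positive counts.
    return max(1, cores_per_node // slots_per_node)


if __name__ == "__main__":
    # Hypothetical 16-core node with 4 slots requested on it.
    print(cores_per_rank(16, 4))
    # Oversubscribed case: more slots than cores still yields 1.
    print(cores_per_rank(16, 32))
```

  The result is what we pass as <n> to --cpus-per-proc.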

  openmpi reads the file $PE_HOSTFILE and launches a rank for each slot
  defined in it, binding <n> cores per rank.

On 1.8.3, we've tried launching with the following core binding options (which we hoped were equivalent):

  mpirun -map-by node:PE=<n> <program>
  mpirun -map-by socket:PE=<n> <program>

  openmpi reads the file $PE_HOSTFILE and launches a factor of <n> fewer
  ranks than under 1.6.5. We also notice that, where we wanted a single
  rank on the box and <n> is the number of cores available, openmpi
  refuses to launch and we get the message:

  "There are not enough slots available in the system to satisfy the 1
  slots that were requested by the application"

  I think that error message needs a little work :)
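
  To make the difference concrete, here is a small model of the rank
  counts we observe. It is only our guess at the behaviour, assuming
  1.8 treats each slot as one processing element; it is not taken
  from the openmpi source, and the function names are made up:

```python
def ranks_observed_1_6(slots: int, n: int) -> int:
    """1.6.5 as we see it: one rank per slot, regardless of cores per rank."""
    return slots


def ranks_guessed_1_8(slots: int, n: int) -> int:
    """1.8.3 as we guess it: each rank consumes n slots (assumption)."""
    return slots // n


if __name__ == "__main__":
    # Hypothetical node: 4 slots requested, 4 cores bound per rank.
    print(ranks_observed_1_6(4, 4), ranks_guessed_1_8(4, 4))
    # Single rank on a 16-core box: the guessed 1.8 count drops to zero,
    # which would be consistent with the "not enough slots" error.
    print(ranks_observed_1_6(1, 16), ranks_guessed_1_8(1, 16))
```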
