Hi All,

I just got the same behaviour with the old Torque we have (2.5, which uses cpusets) and OpenMPI 1.10.0: when --bind-to core is set, it occasionally (not always) fails with the following error:

Open MPI tried to bind a new process, but something went wrong. The
process was killed without launching the target application. Your job
will now abort.

  Local host:        nXXX
  Application name:  /global/software/espresso-5.2.1-intel14-ompi110/bin/pw.x
  Error message:     hwloc_set_cpubind returned "Error" for bitmap "0"
  Location:          ../../../../../openmpi-1.10.0/orte/mca/odls/default/odls_default_module.c:551

-- 
Grigory Shamov
Westgrid/ComputeCanada Site Lead
University of Manitoba
E2-588 EITC Building, (204) 474-9625

On 15-10-02 10:25 AM, "users on behalf of Marcin Krotkiewski"
<users-boun...@open-mpi.org on behalf of marcin.krotkiew...@gmail.com> wrote:

>Hi,
>
>I fail to make OpenMPI bind to cores correctly when running from within
>SLURM-allocated CPU resources spread over a range of compute nodes in an
>otherwise homogeneous cluster. I have found this thread
>
>http://www.open-mpi.org/community/lists/users/2014/06/24682.php
>
>and did try to use what Ralph suggested there (--hetero-nodes), but it
>does not work (v. 1.10.0). When running with --report-bindings I get
>messages like
>
>[compute-9-11.local:27571] MCW rank 10 is not bound (or bound to all
>available processors)
>
>for all ranks outside of my first physical compute node. Moreover,
>everything works as expected if I ask SLURM to assign entire compute
>nodes. So it does look like Ralph's diagnose presented in that thread is
>correct, just the --hetero-nodes switch does not work for me.
>
>I have written a short code that uses sched_getaffinity to print the
>effective bindings: all MPI ranks except of those on the first node are
>bound to all CPU cores allocated by SLURM.
>
>Do I have to do something except of --hetero-nodes, or is this a problem
>that needs further investigation?
>
>Thanks a lot!
>
>Marcin
>
>Link to this post:
>http://www.open-mpi.org/community/lists/users/2015/10/27770.php
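
For anyone who wants to repeat the check described in the quoted message (printing each rank's effective binding via sched_getaffinity), a minimal sketch might look like the one below. This is not Marcin's program, which was not posted. It assumes Linux, Python 3, and that the launcher exports OMPI_COMM_WORLD_RANK, as Open MPI's mpirun does. The helper names format_cpus and describe_binding are made up for illustration.

```python
import os
import socket


def format_cpus(cpus):
    """Collapse a set of CPU ids into a compact range string."""
    ids = sorted(cpus)
    if not ids:
        return ""
    ranges = []
    start = prev = ids[0]
    for cpu in ids[1:]:
        if cpu != prev + 1:
            # Gap found: close the current run and start a new one.
            ranges.append((start, prev))
            start = cpu
        prev = cpu
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def describe_binding():
    """Return one line with host, rank and the CPUs this process may run on."""
    # Assumption: OMPI_COMM_WORLD_RANK is exported by Open MPI's launcher;
    # "?" is shown when the script is run outside mpirun.
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "?")
    # os.sched_getaffinity(0) queries the affinity mask of the calling
    # process (Linux only); it reflects any cpuset/cgroup restriction too.
    cpus = os.sched_getaffinity(0)
    return (f"{socket.gethostname()} rank {rank}: "
            f"{len(cpus)} CPU(s) allowed: {format_cpus(cpus)}")


if __name__ == "__main__":
    print(describe_binding())
```

Saved under a hypothetical name such as show_binding.py, it could be launched as "mpirun --bind-to core --report-bindings python3 show_binding.py". A rank that is really bound to one core should report a single CPU, while an unbound rank should list every core the resource manager allocated on that node.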