Hi Ralph,

> The "slot=N" directive says to "put this proc on core N". In your file, you stipulate that
>
> rank 0 is to be placed solely on core 0
> rank 1 is to be placed solely on core 2
> etc.

That is exactly what I want to achieve, but from the mpirun cmd instead of using a rankfile, and I am failing...

> That is not what you asked for in your mpirun cmd. You asked that each
> proc be mapped to TWO cores (PE=2) or FOUR threads (PE=4 with bind-to HWT).
> If you wanted that same thing in a rankfile, it should have said
>
> rank 0 slots=0-1
> rank 1 slots=2-3
> etc.
>
> Hence the difference. I was simply correcting your mpirun cmd line as you
> said you wanted two CORES, and that isn't guaranteed if you are stipulating
> things in terms of HWTs as not every machine has two HWTs/core.
>
> On Feb 28, 2021, at 7:43 AM, Luis Cebamanos via users <users@lists.open-mpi.org> wrote:
>
> Hi Ralph,
>
> Thanks for this, however --map-by ppr:32:socket:PE=2 --bind-to core
> reports the same binding as --map-by ppr:32:socket:PE=4 --bind-to
> hwthread:
>
> [epsilon104:2861230] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]: [BB/BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
> [epsilon104:2861230] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [../../BB/BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
> [epsilon104:2861230] MCW rank 2 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: [../../../../BB/BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>
> And this is still different from the output produced using the rankfile.
>
> Cheers,
> Luis
>
> On 28/02/2021 14:06, Ralph Castain via users wrote:
>
> Your command line is incorrect:
>
> --map-by ppr:32:socket:PE=4 --bind-to hwthread
>
> should be
>
> --map-by ppr:32:socket:PE=2 --bind-to core
>
> On Feb 28, 2021, at 5:57 AM, Luis Cebamanos via users <users@lists.open-mpi.org> wrote:
>
> I should have said, "I would like to run 128 MPI processes on 2 nodes"
> and not 64 like I initially said...
>
> On Sat, 27 Feb 2021, 15:03 Luis Cebamanos, <luic...@gmail.com> wrote:
>
>> Hello OMPI users,
>>
>> On 128-core nodes, 2 sockets x 64 cores/socket (2 hwthreads/core), I am
>> trying to match the behavior of running with a rankfile with manual
>> mapping/ranking/binding.
>>
>> I would like to run 64 MPI processes on 2 nodes, 1 MPI process every 2
>> cores. That is, I want to run 32 MPI processes per socket on 2 128-core
>> nodes. My mapping should be something like:
>>
>> Node 0
>> =====
>> rank 0 - core 0
>> rank 1 - core 2
>> rank 2 - core 4
>> ...
>> rank 63 - core 126
>>
>>
>> Node 1
>> ====
>> rank 64 - core 0
>> rank 65 - core 2
>> rank 66 - core 4
>> ...
>> rank 127 - core 126
>>
>> If I use a rankfile:
>> rank 0=epsilon102 slot=0
>> rank 1=epsilon102 slot=2
>> rank 2=epsilon102 slot=4
>> rank 3=epsilon102 slot=6
>> rank 4=epsilon102 slot=8
>> rank 5=epsilon102 slot=10
>> ....
>> rank 123=epsilon103 slot=118
>> rank 124=epsilon103 slot=120
>> rank 125=epsilon103 slot=122
>> rank 126=epsilon103 slot=124
>> rank 127=epsilon103 slot=126
>>
>> My --report-bindings output looks like:
>>
>> [epsilon102:2635370] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>> [epsilon102:2635370] MCW rank 1 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>> [epsilon102:2635370] MCW rank 2 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>>
>>
>> However, I cannot match this --report-bindings output by manually using
>> --map-by and --bind-to.
>> I had the impression that this would be the same:
>>
>> mpirun -np $SLURM_NTASKS --report-bindings --map-by ppr:32:socket:PE=4 --bind-to hwthread
>>
>> But this output is not quite the same:
>>
>> [epsilon102:2631529] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]: [BB/BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>> [epsilon102:2631529] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [../../BB/BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>>
>> What am I missing to match the rankfile behavior? Regarding performance,
>> what difference does it make between the first and the second outputs?
>>
>> Thanks for your help!
>> Luis
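
For reference, the two rankfile layouts discussed in this thread differ only in how many cores each rank is given: one core per rank at a stride of two cores (the rankfile quoted in the original message), versus two cores per rank (the "rank 0 slots=0-1" form from the reply). The short Python sketch below generates either layout. It is only an illustration: the function name rankfile_lines and its parameters are made up here, the host names and core counts are taken from the messages above, and the "slot=" spelling follows the rankfile as it was posted.

```python
def rankfile_lines(hosts, cores_per_node=128, stride=2, width=1):
    """Build rankfile lines of the form 'rank R=HOST slot=S'.

    hosts          -- node names, in rank order (hypothetical parameter)
    cores_per_node -- physical cores per node (128 in the thread)
    stride         -- distance in cores between consecutive ranks (2 in the thread)
    width          -- cores given to each rank: 1 reproduces the posted
                      rankfile, 2 gives the 'slot=0-1, slot=2-3, ...' layout
    """
    ranks_per_node = cores_per_node // stride
    lines = []
    for rank in range(len(hosts) * ranks_per_node):
        host = hosts[rank // ranks_per_node]       # fill one node, then the next
        first = (rank % ranks_per_node) * stride   # first core for this rank
        if width == 1:
            slot = str(first)
        else:
            slot = f"{first}-{first + width - 1}"  # inclusive core range
        lines.append(f"rank {rank}={host} slot={slot}")
    return lines


if __name__ == "__main__":
    hosts = ["epsilon102", "epsilon103"]
    # One core per rank, every second core (the layout in the posted rankfile).
    for line in rankfile_lines(hosts)[:3]:
        print(line)
    # Two cores per rank (the layout described for PE=2 with bind-to core).
    for line in rankfile_lines(hosts, width=2)[:3]:
        print(line)
```

Writing the full list to a file gives something that can be used as a rankfile; whether the resulting bindings are what is wanted on a given machine is best confirmed with --report-bindings, as in the outputs above.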