Did you read the documentation on rankfiles? The "slot=N" directive says to 
"put this proc on core N". In your file, you stipulate that

rank 0 is to be placed solely on core 0
rank 1 is to be placed solely on core 2
etc.

That is not what you asked for in your mpirun cmd. You asked that each proc be 
mapped to TWO cores (PE=2) or FOUR threads (PE=4 with bind-to HWT). If you 
wanted that same thing in a rankfile, it should have said

rank 0=epsilon102 slot=0-1
rank 1=epsilon102 slot=2-3
etc.

Hence the difference. I was simply correcting your mpirun cmd line: you said 
you wanted two CORES per proc, and that isn't guaranteed if you stipulate things 
in terms of HWTs, since not every machine has two HWTs/core.
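
To make the rankfile side concrete, here is a small Python sketch that writes out entries giving each proc a range of consecutive cores. It is only an illustration: the function name and its parameters are made up for this example, the host names are taken from the rankfile quoted below, and it assumes the cores are numbered the way the "slot=" directive expects on your machine.

```python
def rankfile_lines(hosts, ranks_per_host, cores_per_rank):
    """Build rankfile lines giving each rank `cores_per_rank` consecutive cores.

    Hypothetical helper for illustration; not part of Open MPI.
    """
    lines = []
    rank = 0
    for host in hosts:
        for local in range(ranks_per_host):
            first = local * cores_per_rank
            last = first + cores_per_rank - 1
            # A single core is written "slot=N", a range of cores "slot=N-M".
            slot = str(first) if first == last else f"{first}-{last}"
            lines.append(f"rank {rank}={host} slot={slot}")
            rank += 1
    return lines


# Two cores per proc, 64 procs on each of the two nodes.
two_core = rankfile_lines(["epsilon102", "epsilon103"],
                          ranks_per_host=64, cores_per_rank=2)
print("\n".join(two_core[:3]))
```

With cores_per_rank=2 the first entries come out as slot=0-1, slot=2-3, and so on, which is the rankfile counterpart of PE=2 with --bind-to core.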



On Feb 28, 2021, at 7:43 AM, Luis Cebamanos via users <users@lists.open-mpi.org> wrote:

 
 Hi Ralph,
 
 Thanks for this. However, --map-by ppr:32:socket:PE=2 --bind-to core reports 
 the same binding as --map-by ppr:32:socket:PE=4 --bind-to hwthread:
 
 [epsilon104:2861230] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]: [BB/BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
 [epsilon104:2861230] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [../../BB/BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
 [epsilon104:2861230] MCW rank 2 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: [../../../../BB/BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
 
 And this is still different from the output produced using the rankfile.
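
One way to compare the two sets of bindings without reading the long maps by eye is to pull the bound core indices out of each --report-bindings line. The following is only a rough Python sketch: the function name is invented, the example line is shortened and made up (4 cores per socket rather than 64), and it assumes each report line has been unwrapped so that the core map is the last whitespace-separated token.

```python
import re


def bound_cores(report_line):
    """Return (rank, list of node-wide core indices with a bound hwthread).

    Hypothetical helper for illustration; not part of Open MPI.
    """
    rank = int(re.search(r"MCW rank (\d+)", report_line).group(1))
    # The map is the final token, e.g. "[BB/../..][../../..]".
    core_map = report_line.split()[-1]
    cores = []
    # Each "[...]" group is one socket; cores inside it are separated by "/".
    for socket in re.findall(r"\[([^\[\]]*)\]", core_map):
        cores.extend(socket.split("/"))
    # A core counts as bound if any of its hwthreads is marked "B".
    return rank, [i for i, c in enumerate(cores) if "B" in c]


# Shortened, made-up example line.
line = ("[host:123] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], "
        "socket 0[core 3[hwt 0-1]]: [../../BB/BB][../../../..]")
print(bound_cores(line))
```

Running this over both outputs and comparing the core lists per rank shows at a glance whether a rank is bound to one core or to two.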

 
 Cheers,
 Luis
 
 
On 28/02/2021 14:06, Ralph Castain via users wrote:
 
 Your command line is incorrect:
 
 --map-by ppr:32:socket:PE=4 --bind-to hwthread
 
 should be
 
 --map-by ppr:32:socket:PE=2 --bind-to core
 
On Feb 28, 2021, at 5:57 AM, Luis Cebamanos via users <users@lists.open-mpi.org> wrote:
 
I should have said, "I would like to run 128 MPI processes on 2 nodes" and not 
64 like I initially said...
 
 
 
On Sat, 27 Feb 2021, 15:03 Luis Cebamanos, <luic...@gmail.com> wrote:
 
 Hello OMPI users,
 
 On 128-core nodes (2 sockets x 64 cores/socket, 2 hwthreads/core), I am 
 trying to reproduce the behavior of a rankfile using manual 
 mapping/ranking/binding options.
 
 I would like to run 64 MPI processes on 2 nodes, 1 MPI process every 2 
 cores. That is, I want to run 32 MPI processes per socket on 2 128-core 
 nodes. My mapping should be something like:
 
 Node 0
 ======
 rank 0   -  core 0
 rank 1   -  core 2
 rank 2   -  core 4
 ...
 rank 63  -  core 126
 
 
 Node 1
 ======
 rank 64  -  core 0
 rank 65  -  core 2
 rank 66  -  core 4
 ...
 rank 127 -  core 126
 
 If I use a rankfile:
 rank 0=epsilon102 slot=0
 rank 1=epsilon102 slot=2
 rank 2=epsilon102 slot=4
 rank 3=epsilon102 slot=6
 rank 4=epsilon102 slot=8
 rank 5=epsilon102 slot=10
 ....
 rank 123=epsilon103 slot=118
 rank 124=epsilon103 slot=120
 rank 125=epsilon103 slot=122
 rank 126=epsilon103 slot=124
 rank 127=epsilon103 slot=126
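
For reference, a rankfile of this shape does not need to be typed by hand. A minimal Python sketch that generates one line per rank, placing each rank on every second core, might look like this (the function name and parameters are hypothetical; the host names are the ones used in this rankfile):

```python
def strided_rankfile(hosts, ranks_per_host, stride):
    """One rankfile line per rank, each on a single core, `stride` cores apart.

    Hypothetical helper for illustration; not part of Open MPI.
    """
    lines = []
    for node, host in enumerate(hosts):
        for local in range(ranks_per_host):
            rank = node * ranks_per_host + local
            # Core numbering restarts at 0 on every node.
            lines.append(f"rank {rank}={host} slot={local * stride}")
    return lines


entries = strided_rankfile(["epsilon102", "epsilon103"],
                           ranks_per_host=64, stride=2)
print("\n".join(entries[:6]))
```

Writing the list to a file with "\n".join(entries) gives the full 128-line rankfile.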
 
 My --report-bindings output looks like:
 
 [epsilon102:2635370] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
 [epsilon102:2635370] MCW rank 1 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
 [epsilon102:2635370] MCW rank 2 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
 
 
 However, I cannot match this --report-bindings output by manually using 
 --map-by and --bind-to. I had the impression that this would be equivalent:
 
 mpirun -np $SLURM_NTASKS  --report-bindings --map-by ppr:32:socket:PE=4 
 --bind-to hwthread
 
 But this output is not quite the same:
 
 [epsilon102:2631529] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]: [BB/BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
 [epsilon102:2631529] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [../../BB/BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
 
 What am I missing to match the rankfile behavior? Regarding performance, 
 what difference is there between the first and the second bindings?
 
 Thanks for your help!
 Luis