Up to this point, I've been running a single MPI rank per physical host
(using multithreading within my application to use all available cores).  I
use this command:
mpirun -N 1 --bind-to none --hostfile hosts.txt
where hosts.txt has one IP address per line.

I've since started running on machines with significant NUMA effects. On a
single one of these machines, I now run a separate rank per NUMA node. On a
machine with 64 CPUs and 4 NUMA nodes, I do this:
mpirun -N 4 --bind-to numa
Watching which processors are active in 'top' has convinced me that this
behaves the way I want it to.

I now want to combine these two: running on, say, 10 physical hosts, each
with 4 NUMA nodes, for a total of 40 ranks.  But the order of the ranks is
important (for efficiency, due to how the application divides up work across
ranks).  So I want ranks 0-3 to be on host 0, one per NUMA node, then ranks
4-7 on host 1, one per NUMA node, etc.
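
To make the target layout concrete, here is a small sketch that just prints
the rank placement I'm after. It does not call mpirun. The function name
expected_layout and its two arguments are made up for illustration, and
numbering the NUMA nodes as rank modulo 4 is my assumption; all I really
need is that each block of 4 consecutive ranks shares a host.

```shell
#!/bin/sh
# Print the desired placement: ranks are numbered host by host, and within
# a host one rank goes to each NUMA node.
# $1 = number of hosts, $2 = NUMA nodes per host (hypothetical arguments).
expected_layout() {
    hosts=$1
    numa_per_host=$2
    rank=0
    while [ "$rank" -lt $((hosts * numa_per_host)) ]; do
        # Integer division picks the host; the remainder picks the NUMA
        # node on that host (an assumed numbering, see above).
        echo "rank $rank -> host $((rank / numa_per_host)), numa $((rank % numa_per_host))"
        rank=$((rank + 1))
    done
}

expected_layout 10 4
```

With 10 hosts and 4 NUMA nodes each this lists 40 ranks, and changing either
argument shows what I'd expect on a different number of machines or NUMA
nodes.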

Some guesses:
mpirun -n 40 --map-by numa --rank-by numa --hostfile hosts.txt
mpirun --map-by ppr:4:node --rank-by numa --hostfile hosts.txt
Where hosts.txt still has a single IP address per line (and doesn't need a

I'd like to make sure I get the syntax right in general, rather than
empirically trying guesses until one looks like it works, and then inevitably
finding that it doesn't work like I thought when I change the # of machines
or run on machines with a different # of NUMA nodes.
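
For checking any candidate command without relying on 'top', here is a rough
sketch of a script that could be launched by mpirun in place of the
application. It assumes Open MPI, which sets OMPI_COMM_WORLD_RANK and
OMPI_COMM_WORLD_LOCAL_RANK in each rank's environment, and Linux, where
/proc/self/status lists the CPUs a process may run on. The function name
describe_rank is made up.

```shell
#!/bin/sh
# Report where one rank landed: its global rank, its rank on the host, the
# host name, and the CPUs it is allowed to run on.
describe_rank() {
    # Set by Open MPI for each launched process; "unknown" if run by hand.
    rank=${OMPI_COMM_WORLD_RANK:-unknown}
    local_rank=${OMPI_COMM_WORLD_LOCAL_RANK:-unknown}
    cpus=unknown
    # Linux only: awk inherits this process's CPU affinity, so its own
    # status file shows the binding that mpirun applied.
    if [ -r /proc/self/status ]; then
        cpus=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)
    fi
    echo "rank $rank (local $local_rank) on $(uname -n): cpus $cpus"
}

describe_rank
```

Each rank prints one line, so consecutive ranks should show the same host in
groups of 4, each with a different CPU list. As far as I know, Open MPI's
--report-bindings option prints similar information directly.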

