Hello,

I'm trying to run a code that uses Open MPI and OpenMP (for threading) on a large cluster that uses LSF for job scheduling and dispatch. The problem with LSF is that it is not very straightforward to allocate and bind the right number of threads to an MPI rank inside a single node. Therefore, I have to create a rankfile myself as soon as the (a priori unknown) resources are allocated.

So, after my job gets dispatched, I run:

mpirun -n "$nslots" -display-allocation -nooversubscribe --map-by core:PE=1 --bind-to core mpi_allocation/show_numactl.sh >mpi_allocation/allocation_files/allocation.txt

where show_numactl.sh consists of just one line (it prints the hostname and the numactl --show output of the calling rank, joined onto a single line):

{ hostname; numactl --show; } | sed ':a;N;s/\n/ /;ba'

If I ask for 16 slots, in blocks of 4 (i.e., bsub -n 16 -R "span[block=4]"), I get something like:

======================   ALLOCATED NODES   ======================
    eu-g1-006-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
    eu-g1-009-2: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
    eu-g1-002-3: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
    eu-g1-005-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
=================================================================
eu-g1-006-1 policy: default preferred node: current physcpubind: 16  cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
eu-g1-006-1 policy: default preferred node: current physcpubind: 24  cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
eu-g1-006-1 policy: default preferred node: current physcpubind: 32  cpubind: 2  nodebind: 2  membind: 0 1 2 3 4 5 6 7
eu-g1-002-3 policy: default preferred node: current physcpubind: 21  cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
eu-g1-002-3 policy: default preferred node: current physcpubind: 22  cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
eu-g1-009-2 policy: default preferred node: current physcpubind: 0  cpubind: 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
eu-g1-009-2 policy: default preferred node: current physcpubind: 1  cpubind: 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
eu-g1-009-2 policy: default preferred node: current physcpubind: 2  cpubind: 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
eu-g1-002-3 policy: default preferred node: current physcpubind: 19  cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
eu-g1-002-3 policy: default preferred node: current physcpubind: 23  cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
eu-g1-006-1 policy: default preferred node: current physcpubind: 52  cpubind: 3  nodebind: 3  membind: 0 1 2 3 4 5 6 7
eu-g1-009-2 policy: default preferred node: current physcpubind: 3  cpubind: 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
eu-g1-005-1 policy: default preferred node: current physcpubind: 90  cpubind: 5  nodebind: 5  membind: 0 1 2 3 4 5 6 7
eu-g1-005-1 policy: default preferred node: current physcpubind: 91  cpubind: 5  nodebind: 5  membind: 0 1 2 3 4 5 6 7
eu-g1-005-1 policy: default preferred node: current physcpubind: 94  cpubind: 5  nodebind: 5  membind: 0 1 2 3 4 5 6 7
eu-g1-005-1 policy: default preferred node: current physcpubind: 95  cpubind: 5  nodebind: 5  membind: 0 1 2 3 4 5 6 7

After that, I parse this allocation file in Python and create a hostfile and a rankfile; a simplified sketch of this parsing step is included after the rankfile below.

The hostfile reads:

eu-g1-006-1
eu-g1-009-2
eu-g1-002-3
eu-g1-005-1

The rankfile:

rank 0=eu-g1-006-1 slot=16,24,32,52
rank 1=eu-g1-009-2 slot=0,1,2,3
rank 2=eu-g1-002-3 slot=21,22,19,23
rank 3=eu-g1-005-1 slot=90,91,94,95
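
For reference, here is a simplified sketch of what the parsing script does (my actual script is more involved, and the hostfile path below is only illustrative; the rankfile path matches the one used in the mpirun command further down):

#!/usr/bin/env python3
# Simplified sketch of the parsing step: collect the physcpubind core ids
# per host from allocation.txt and write a hostfile plus a rankfile with
# threads_per_rank cores per MPI rank.
import re
import sys
from collections import OrderedDict

alloc_file = sys.argv[1]             # e.g. mpi_allocation/allocation_files/allocation.txt
threads_per_rank = int(sys.argv[2])  # e.g. 4

cores = OrderedDict()                # host -> list of physical core ids
pattern = re.compile(r"(\S+) policy: default.*?physcpubind: (\d+)")

with open(alloc_file) as f:
    for host, core in pattern.findall(f.read()):
        cores.setdefault(host, []).append(core)

with open("mpi_allocation/hostfiles/hostfile", "w") as hostfile:
    hostfile.write("\n".join(cores) + "\n")

rank = 0
with open("mpi_allocation/hostfiles/rankfile", "w") as rankfile:
    for host, ids in cores.items():
        # one MPI rank per block of threads_per_rank cores on this host
        for i in range(0, len(ids), threads_per_rank):
            block = ids[i:i + threads_per_rank]
            rankfile.write("rank %d=%s slot=%s\n" % (rank, host, ",".join(block)))
            rank += 1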

Following Open MPI's man pages and FAQ, I then run my application with:

mpirun -n "$nmpiproc" --rankfile mpi_allocation/hostfiles/rankfile --mca rmaps_rank_file_physical 1 ./build/"$executable_name" true "$input_file"

where the bash variables are passed in directly on the bsub command line (I basically run bsub -n 16 -R "span[block=4]" "my_script.sh num_slots num_thread_per_rank executable_name input_file").
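
For clarity, my_script.sh does roughly the following (this is only a sketch: the parser name make_rankfile.py and the OMP_NUM_THREADS export are illustrative placeholders, not the exact contents of my script):

#!/bin/bash
# Rough sketch of my_script.sh; make_rankfile.py and the OMP_NUM_THREADS
# export are placeholders for what my actual script does.
nslots="$1"                          # total slots requested from LSF
nthreads="$2"                        # OpenMP threads per MPI rank
executable_name="$3"
input_file="$4"
nmpiproc=$(( nslots / nthreads ))    # number of MPI ranks

# 1) record the allocation LSF actually gave us
mpirun -n "$nslots" -display-allocation -nooversubscribe --map-by core:PE=1 --bind-to core \
    mpi_allocation/show_numactl.sh > mpi_allocation/allocation_files/allocation.txt

# 2) build the hostfile and rankfile from allocation.txt
python3 mpi_allocation/make_rankfile.py mpi_allocation/allocation_files/allocation.txt "$nthreads"

# 3) launch the application with the generated rankfile
export OMP_NUM_THREADS="$nthreads"
mpirun -n "$nmpiproc" --rankfile mpi_allocation/hostfiles/rankfile \
    --mca rmaps_rank_file_physical 1 ./build/"$executable_name" true "$input_file"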


Now, this procedure sometimes works just fine and sometimes doesn't. When it fails, I don't get any error message (I have noticed that an error inside the rankfile also does not produce any error). Strangely, for 16 slots and four threads per rank (so 4 MPI ranks), it seems to work better when the slots are allocated as 8 slots on each of two nodes than as 4 slots on each of four nodes. My goal is to run the application with 256 slots and 32 threads per rank (the cluster has mainly AMD EPYC based nodes).

The Open MPI information for the nodes that ran a failed job and the rankfile for that job can be found at https://pastebin.com/40f6FigH, and the corresponding allocation file at https://pastebin.com/jeWnkU40.


Do you see any problem with my procedure? Why is it failing seemingly at random? Can I somehow get more information about what is failing from mpirun?


I hope I haven't omitted too much information; if I have, just ask and I'll provide more details.


Cheers,

David

