Hello,
I'm trying to run a code that uses Open MPI and OpenMP (for
threading) on a large cluster that uses LSF for job scheduling and
dispatch. The problem with LSF is that it is not very straightforward to
allocate and bind the right number of threads to an MPI rank inside a
single node. Therefore, I have to create a rankfile myself as soon as
the (a priori unknown) resources are allocated.
So, after my job gets dispatched, I run:
mpirun -n "$nslots" -display-allocation -nooversubscribe --map-by core:PE=1 --bind-to core mpi_allocation/show_numactl.sh > mpi_allocation/allocation_files/allocation.txt
where show_numactl.sh consists of just one line:
{ hostname; numactl --show; } | sed ':a;N;s/\n/ /;ba'
If I ask for 16 slots, in blocks of 4 (i.e., bsub -n 16 -R
"span[block=4]"), I get something like:
====================== ALLOCATED NODES ======================
eu-g1-006-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
eu-g1-009-2: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
eu-g1-002-3: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
eu-g1-005-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
=================================================================
eu-g1-006-1 policy: default preferred node: current physcpubind: 16
cpubind: 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
eu-g1-006-1 policy: default preferred node: current physcpubind: 24
cpubind: 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
eu-g1-006-1 policy: default preferred node: current physcpubind: 32
cpubind: 2 nodebind: 2 membind: 0 1 2 3 4 5 6 7
eu-g1-002-3 policy: default preferred node: current physcpubind: 21
cpubind: 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
eu-g1-002-3 policy: default preferred node: current physcpubind: 22
cpubind: 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
eu-g1-009-2 policy: default preferred node: current physcpubind: 0
cpubind: 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
eu-g1-009-2 policy: default preferred node: current physcpubind: 1
cpubind: 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
eu-g1-009-2 policy: default preferred node: current physcpubind: 2
cpubind: 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
eu-g1-002-3 policy: default preferred node: current physcpubind: 19
cpubind: 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
eu-g1-002-3 policy: default preferred node: current physcpubind: 23
cpubind: 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
eu-g1-006-1 policy: default preferred node: current physcpubind: 52
cpubind: 3 nodebind: 3 membind: 0 1 2 3 4 5 6 7
eu-g1-009-2 policy: default preferred node: current physcpubind: 3
cpubind: 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
eu-g1-005-1 policy: default preferred node: current physcpubind: 90
cpubind: 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7
eu-g1-005-1 policy: default preferred node: current physcpubind: 91
cpubind: 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7
eu-g1-005-1 policy: default preferred node: current physcpubind: 94
cpubind: 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7
eu-g1-005-1 policy: default preferred node: current physcpubind: 95
cpubind: 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7
After that, I parse this allocation file in Python and create a
hostfile and a rankfile.
The hostfile reads:
eu-g1-006-1
eu-g1-009-2
eu-g1-002-3
eu-g1-005-1
The rankfile:
rank 0=eu-g1-006-1 slot=16,24,32,52
rank 1=eu-g1-009-2 slot=0,1,2,3
rank 2=eu-g1-002-3 slot=21,22,19,23
rank 3=eu-g1-005-1 slot=90,91,94,95
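For completeness, here is a minimal sketch of what my Python parsing step does (it assumes one MPI rank per node, as in the 16-slot example above; the function name is just for illustration, not my actual script):

```python
import re
from collections import OrderedDict

def build_rankfile(allocation_text):
    """Group the physical CPU ids reported by show_numactl.sh by
    hostname, then emit hostfile and rankfile contents."""
    cpus_per_host = OrderedDict()  # keep hosts in order of first appearance
    for line in allocation_text.splitlines():
        # Each joined numactl line looks like:
        #   <host> policy: ... physcpubind: <ids> cpubind: ...
        m = re.match(r"(\S+) policy:.*physcpubind: ([\d ]+?) cpubind:", line)
        if not m:
            continue  # skip the ALLOCATED NODES banner and separators
        host, cpus = m.group(1), m.group(2).split()
        cpus_per_host.setdefault(host, []).extend(cpus)
    hostfile = "\n".join(cpus_per_host)
    rankfile = "\n".join(
        f"rank {i}={host} slot={','.join(cpus)}"
        for i, (host, cpus) in enumerate(cpus_per_host.items())
    )
    return hostfile, rankfile
```
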
Following Open MPI's man pages and FAQs, I then run my application with
mpirun -n "$nmpiproc" --rankfile mpi_allocation/hostfiles/rankfile --mca
rmaps_rank_file_physical 1 ./build/"$executable_name" true "$input_file"
where the bash variables are passed in directly on the bsub command
line (I basically run bsub -n 16 -R "span[block=4]" "my_script.sh
num_slots num_thread_per_rank executable_name input_file").
Now, this procedure sometimes works just fine, sometimes not. When it
fails, I don't get any error message (I noticed that even a deliberate
error inside the rankfile produces no error). Strangely, for 16 slots
and four threads per rank (so 4 MPI ranks), it seems to work more
reliably when I get 8 slots on each of two nodes than when I get 4
slots on each of four different nodes. My goal is to run the
application with 256 slots and 32 threads per rank (the cluster has
mainly AMD EPYC-based nodes).
The ompi information for the nodes running a failed job and the
rankfile for that failed job can be found at
https://pastebin.com/40f6FigH, and the allocation file at
https://pastebin.com/jeWnkU40
Do you see any problem with my procedure? Why is it failing seemingly
at random? Can I somehow get more information from mpirun about what's
failing?
I hope I haven't omitted too much information; if I have, just ask and
I'll provide more details.
Cheers,
David