Re: [OMPI users] Error using rankfile to bind multiple cores on the same node for threaded OpenMPI application (users Digest, Vol 4715, Issue 1)

2022-02-07 Thread David Perozzi via users

Hi Bernd,

Thanks for your valuable input! Your suggested approach indeed seems 
like the correct one and is actually what I've always wanted to do. In 
the past, I've also asked our cluster support whether this was 
possible, but they always suggested the following approach:



export OMP_NUM_THREADS=T
bsub -n N -R "span[ptile=T]" "unset LSB_AFFINITY_HOSTFILE ; mpirun -n 
M --map-by node:PE=T ./my_hybrid_program"


where N=M*T (https://scicomp.ethz.ch/wiki/Hybrid_jobs). However, this 
can sometimes be indirectly "penalized" at dispatch time because of the 
ptile constraint (which is not strictly needed, as block would be an 
acceptable, looser constraint). That's why I wanted to define the 
rankfile myself (so that I can use a span[block=] requirement instead).
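
For reference, this is roughly the kind of rankfile generation I have 
in mind. It is only a minimal sketch (the file name "my_rankfile" and 
the threads-per-rank value are placeholders, and the slot ranges are 
logical indices rather than the physical core IDs LSF actually 
reserved), assuming that $LSB_DJOB_HOSTFILE lists one host entry per 
allocated core:

T=4                           # threads (cores) per MPI rank
rank=0
rm -f my_rankfile
# count how many cores LSF allocated on each host
sort "$LSB_DJOB_HOSTFILE" | uniq -c | while read ncores host; do
    nranks_here=$(( ncores / T ))
    for i in $(seq 0 $(( nranks_here - 1 ))); do
        first=$(( i * T ))
        last=$(( first + T - 1 ))
        # Open MPI rankfile syntax: rank <r>=<host> slot=<first>-<last>
        echo "rank $rank=$host slot=$first-$last" >> my_rankfile
        rank=$(( rank + 1 ))
    done
done

The resulting file could then be passed to mpirun with 
--rankfile my_rankfile.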


I now tried your suggested commands and could get what I want with a 
slight variation:


bsub -n 6 -R "span[block=2] affinity[core(4,same=numa,exclusive=(core,injob)):cpubind=numa:membind=localprefer]" \
    "export OMP_NUM_THREADS=4; export OMP_PLACES=cores; export OMP_PROC_BIND=true; mpirun -n 6 -nooversubscribe -map-by slot:PE=4 ./test_dispatch"


Something that seems different between your cluster and the one I am 
using is that I have to set the correct value of OMP_NUM_THREADS 
myself. Also, on our cluster, bsub's -n parameter is interpreted as 
"cores", not as "slots". So I'm not sure about two things:


1. Is the correct number of cores allocated to my job, or can they
   somehow be oversubscribed by other jobs?
2. I think a problem may be that the memory reservation
   (rusage[mem=]) refers to the "cores" as defined by the -n
   parameter. So, for a job that needs many threads and a lot of
   memory, I would have to increase the memory "per core" (i.e., per
   slot, actually), and I'm not sure whether bsub then accounts for it
   correctly (see the sketch after this list).
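
To make point 2 concrete, here is a hypothetical sketch (the numbers 
are invented, and I'm assuming rusage[mem=] is given in MB and reserved 
per -n unit, which is exactly the part I'm unsure about): with the 
bsub -n 6 submission above, each of the 6 units actually runs one rank 
with 4 threads, so a rank needing roughly 16 GB would have to request 
the full amount "per core":

bsub -n 6 -R "rusage[mem=16000] span[block=2] affinity[core(4,same=numa,exclusive=(core,injob)):cpubind=numa:membind=localprefer]" \
    "... same command string as above ..."

and I don't know whether LSF then accounts for the additional affinity 
cores consistently.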


Anyway, I ran my code for about 5 hours once with an affinity 
requirement and 16 slots of 16 cores each, and once with a ptile 
requirement resulting in the same configuration (MPI processes and 
OpenMP threads) as suggested by our support. To my surprise, the latter 
was much more efficient than the former (I expected the former to give 
the same performance or better). I explicitly chose nodes with the same 
architecture.
With affinity, CPU utilization was 22% and the simulation did not get 
past 10% progress; with ptile, utilization was 58% and the simulation 
reached about 35% progress. Overall, the CPU utilization values are so 
low because in the first hour the input files must be read and the 
simulation set up (serially, and involving the reading of many small 
files). As a reference, another simulation has been running for 60 
hours and shows 93% CPU utilization.


I know I should clarify this with my cluster support, but I wanted to 
share my experience. You may also have an idea why I get such poor 
performance. Is it possible that bsub is configured differently from 
yours? (By the way, I'm using IBM Spectrum LSF Standard 10.1.0.7.)



Best regards,
David





Re: [OMPI users] Error using rankfile to bind multiple cores on the same node for threaded OpenMPI application (users Digest, Vol 4715, Issue 1)

2022-02-03 Thread Bernd Dammann via users

Hi David,

On 03/02/2022 00:03, David Perozzi wrote:

Hello,

I'm trying to run a code implemented with OpenMPI and OpenMP (for 
threading) on a large cluster that uses LSF for the job scheduling and 
dispatch. The problem with LSF is that it is not very straightforward to 
allocate and bind the right number of threads to an MPI rank inside a 
single node. Therefore, I have to create a rankfile myself, as soon as 
the (a priori unknown) resources are allocated.


So, after my job gets dispatched, I run:

mpirun -n "$nslots" -display-allocation -nooversubscribe --map-by 
core:PE=1 --bind-to core mpi_allocation/show_numactl.sh 
 >mpi_allocation/allocation_files/allocation.txt


Just out of curiosity: why do you not use the built-in LSF features to 
do this mapping?  Something like


#BSUB -n 4
#BSUB -R "span[block=1] affinity[core(4)]"

mpirun ./MyHybridApplication

This will give you 4 cores for each of your 4 MPI ranks, and it sets 
OMP_NUM_THREADS=4 automatically.  LSF's affinity is even more 
fine-grained, so you can specify that the 4 cores should be on one 
socket (e.g. if your application is memory-bound and you want to make 
use of more memory bandwidth).  Check the LSF documentation for more 
details.
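
For example, an (untested) sketch of such a socket-level request might 
look like this:

#BSUB -n 4
#BSUB -R "span[block=1] affinity[core(4,same=socket):cpubind=socket:membind=localprefer]"

which should keep the 4 cores of each rank on one socket and prefer 
memory local to that socket.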


Examples:

1) with span[block=...] (allow LSF to place resources on one host)

#BSUB -n 4
#BSUB -R "span[block=1] affinity[core(4)]"

export OMP_DISPLAY_AFFINITY=true
export OMP_AFFINITY_FORMAT="host: %H PID: %P TID: %n affinity: %A"
mpirun --tag-output ./hello

gives this output (sorted):

[1,0]:host: node-23-8 PID: 2798 TID: 0 affinity: 0
[1,0]:host: node-23-8 PID: 2798 TID: 1 affinity: 1
[1,0]:host: node-23-8 PID: 2798 TID: 2 affinity: 2
[1,0]:host: node-23-8 PID: 2798 TID: 3 affinity: 3
[1,0]:Hello world from thread 0!
[1,0]:Hello world from thread 1!
[1,0]:Hello world from thread 2!
[1,0]:Hello world from thread 3!
[1,1]:host: node-23-8 PID: 2799 TID: 0 affinity: 4
[1,1]:host: node-23-8 PID: 2799 TID: 1 affinity: 5
[1,1]:host: node-23-8 PID: 2799 TID: 2 affinity: 6
[1,1]:host: node-23-8 PID: 2799 TID: 3 affinity: 7
[1,1]:Hello world from thread 0!
[1,1]:Hello world from thread 1!
[1,1]:Hello world from thread 2!
[1,1]:Hello world from thread 3!
[1,2]:host: node-23-8 PID: 2803 TID: 0 affinity: 10
[1,2]:host: node-23-8 PID: 2803 TID: 1 affinity: 11
[1,2]:host: node-23-8 PID: 2803 TID: 2 affinity: 12
[1,2]:host: node-23-8 PID: 2803 TID: 3 affinity: 13
[1,2]:Hello world from thread 0!
[1,2]:Hello world from thread 1!
[1,2]:Hello world from thread 2!
[1,2]:Hello world from thread 3!
[1,3]:host: node-23-8 PID: 2807 TID: 0 affinity: 14
[1,3]:host: node-23-8 PID: 2807 TID: 1 affinity: 15
[1,3]:host: node-23-8 PID: 2807 TID: 2 affinity: 16
[1,3]:host: node-23-8 PID: 2807 TID: 3 affinity: 17
[1,3]:Hello world from thread 0!
[1,3]:Hello world from thread 1!
[1,3]:Hello world from thread 2!
[1,3]:Hello world from thread 3!

I got 4 groups of 4 cores each, all on the same host!


2) with span[ptile=...] (force LSF to distribute over several hosts)

#BSUB -n 4
#BSUB -R "span[pile=1] affinity[core(4)]"

export OMP_DISPLAY_AFFINITY=true
export OMP_AFFINITY_FORMAT="host: %H PID: %P TID: %n affinity: %A"
mpirun --tag-output ./hello

gives this (sorted):

[1,0]:host: node-23-8 PID: 2438 TID: 0 affinity: 0
[1,0]:host: node-23-8 PID: 2438 TID: 1 affinity: 1
[1,0]:host: node-23-8 PID: 2438 TID: 2 affinity: 2
[1,0]:host: node-23-8 PID: 2438 TID: 3 affinity: 3
[1,0]:Hello world from thread 0!
[1,0]:Hello world from thread 1!
[1,0]:Hello world from thread 2!
[1,0]:Hello world from thread 3!
[1,1]:host: node-23-7 PID: 19425 TID: 0 affinity: 0
[1,1]:host: node-23-7 PID: 19425 TID: 1 affinity: 1
[1,1]:host: node-23-7 PID: 19425 TID: 2 affinity: 2
[1,1]:host: node-23-7 PID: 19425 TID: 3 affinity: 3
[1,1]:Hello world from thread 0!
[1,1]:Hello world from thread 1!
[1,1]:Hello world from thread 2!
[1,1]:Hello world from thread 3!
[1,2]:host: node-23-6 PID: 23940 TID: 0 affinity: 0
[1,2]:host: node-23-6 PID: 23940 TID: 1 affinity: 1
[1,2]:host: node-23-6 PID: 23940 TID: 2 affinity: 2
[1,2]:host: node-23-6 PID: 23940 TID: 3 affinity: 3
[1,2]:Hello world from thread 0!
[1,2]:Hello world from thread 1!
[1,2]:Hello world from thread 2!
[1,2]:Hello world from thread 3!
[1,3]:host: node-23-5 PID: 30341 TID: 0 affinity: 0
[1,3]:host: node-23-5 PID: 30341 TID: 1 affinity: 1
[1,3]:host: node-23-5 PID: 30341 TID: 2 affinity: 2
[1,3]:host: node-23-5 PID: 30341 TID: 3 affinity: 3
[1,3]:Hello world from thread 0!
[1,3]:Hello world from thread 1!
[1,3]:Hello world from thread 2!
[1,3]:Hello world from thread 3!

Here I got 4 groups of 4 cores on different hosts!


Maybe the above can be some kind of inspiration to solve your problem in 
a different way!


/Bernd