Hmmm...okay, I found the code path that fails without an error - not one of the 
ones I was citing. Thanks for that detailed explanation of what you were doing! 
I'll add some code to the master branch to plug that hole, along with the others 
I identified.

Just an FYI: we stopped supporting "physical" cpus a long time ago, so the 
"rmaps_rank_file_physical" MCA param is just being ignored (we don't have a way 
to detect that the param you cited doesn't exist). We only take the input as 
being "logical" cpu designations. You might check, but I suspect the two 
(logical and physical IDs) are the same here.
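
One quick way to compare them, assuming the hwloc utilities are installed on 
that node (they are usually a separate package, so treat this as a suggestion 
rather than a guarantee):

$ lstopo-no-graphics --no-io

Each processing unit is printed with both its logical index (L#) and its 
OS/physical index (P#), so you can see at a glance whether the two differ.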

It also appears from your output that you are using hwthreads as cpus, so the 
slot descriptions are being applied to threads and not cores. At least, it 
appears that way to me - was that expected?



> On Feb 3, 2022, at 12:27 AM, David Perozzi <peroz...@ethz.ch> wrote:
> 
> Thanks for looking into that and sorry if I only included the version in use 
> in the pastebin. I'll ask the cluster support if they could install OMPI 
> master.
> 
> I really am unfamiliar with openmpi's codebase, so I haven't looked into it 
> and am very thankful that you could already identify possible places that I 
> could've "visited". One thing that I can add, however, is that I tried both 
> on the cluster (OMPI 4.0.2) and on my local machine (OMPI 4.0.3) to run a 
> dummy test, which basically consists of launching the following:
> 
> $ mpirun -rf rankfile -report-bindings --mca rmaps_rank_file_physical 1 echo 
> ""
> 
> I report here the results coming from the cluster, where I allocated 6 cores, 
> all on the same node:
> 
> $ numactl --show
> policy: default
> preferred node: current
> physcpubind: 3 11 12 13 21 29
> cpubind: 0 1
> nodebind: 0 1
> membind: 0 1 2 3 4 5 6 7
> 
> $ hostname
> eu-g1-018-1
> 
> 
> $ mpirun -rf rankfile -report-bindings --mca rmaps_rank_file_physical 1 echo 
> ""
> 
> [eu-g1-018-1:37621] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 
> 0[core 1[hwt 0]]: [B/B/./././.][]
> [eu-g1-018-1:37621] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 
> 0[core 4[hwt 0]]: [././B/./B/.][]
> 
> $ cat rankfile
> rank 0=eu-g1-018-1 slot=3,11
> rank 1=eu-g1-018-1 slot=12,21
> 
> However, if I change the rankfile to use an unavailable core location, e.g.
> 
> $ cat rankfile
> rank 0=eu-g1-018-1 slot=3,11
> rank 1=eu-g1-018-1 slot=12,28
> 
> I get no error message in return:
> 
> $ mpirun -rf rankfile -report-bindings --mca rmaps_rank_file_physical 1 echo 
> ""
> $
> 
> So, at least in this version, it is possible to get no error message in 
> return quite easily (but this is maybe one of the errors you said should never 
> happen).
> 
> I'll double (triple) check my python script that generates the rankfile 
> again, but as of now I'm pretty sure no nasty things should happen at that 
> level, especially because in the case reported in my initial message one can 
> manually check that all locations are indeed allocated to my job (by 
> comparing the rankfile and the allocation.txt file).
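> 
> For reference, the logic is essentially the following (a simplified sketch 
> rather than the actual script; the field positions and the hostfile path are 
> assumptions based on the output formats quoted in my original message below):
> 
> import re
> from collections import defaultdict
> 
> # One allocation line per slot, e.g.:
> #   eu-g1-006-1 policy: default preferred node: current physcpubind: 16 ...
> # Collect the physcpubind IDs per host, then emit one rank per group of
> # threads_per_rank cpus on the same host.
> threads_per_rank = 4  # assumed; matches the -R "span[block=4]" example
> slots = defaultdict(list)
> 
> with open("mpi_allocation/allocation_files/allocation.txt") as f:
>     for line in f:
>         m = re.match(r"^(\S+)\s+policy:.*physcpubind:\s+([\d ]+?)\s+cpubind:",
>                      line)
>         if not m:
>             continue  # skips the "ALLOCATED NODES" header block
>         host, cpus = m.group(1), m.group(2).split()
>         slots[host].extend(cpus)
> 
> with open("mpi_allocation/hostfiles/hostfile", "w") as hf:
>     for host in slots:
>         hf.write(host + "\n")
> 
> with open("mpi_allocation/hostfiles/rankfile", "w") as rf:
>     rank = 0
>     for host, cpus in slots.items():
>         for i in range(0, len(cpus), threads_per_rank):
>             group = cpus[i:i + threads_per_rank]
>             if len(group) == threads_per_rank:
>                 rf.write("rank %d=%s slot=%s\n" % (rank, host,
>                                                    ",".join(group)))
>                 rank += 1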
> 
> I was wondering whether mpirun somehow cannot find all the hosts sometimes 
> (but sometimes it can, so it's a mystery to me)?
> 
> Just wanted to point that out. Now I'll get in touch with the cluster support 
> to see if it's possible to test on master.
> 
> Cheers,
> David
> 
> On 03.02.22 01:59, Ralph Castain via users wrote:
>> Are you willing to try this with OMPI master? Asking because it would be 
>> hard to push changes all the way back to 4.0.x every time we want to see if 
>> we fixed something.
>> 
>> Also, few of us have any access to LSF, though I doubt that has much impact 
>> here as it sounds like the issue is in the rank_file mapper.
>> 
>> Glancing over the rank_file mapper in master branch, I only see a couple of 
>> places (both errors that should never happen) that wouldn't result in a 
>> gaudy "show help" message. It would be interesting to know if you are 
>> hitting those.
>> 
>> One way you could get more debug info is to ensure that OMPI is configured 
>> with --enable-debug and then add "--mca rmaps_base_verbose 5" to your cmd 
>> line.
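>> 
>> For example (the install prefix and app name are placeholders, adjust to 
>> your setup):
>> 
>> $ ./configure --prefix=$HOME/ompi-debug --enable-debug && make -j 8 install
>> $ $HOME/ompi-debug/bin/mpirun -rf rankfile -report-bindings \
>>       --mca rmaps_base_verbose 5 ./your_app
>> 
>> The verbose output from the rank_file mapper should then show where the 
>> mapping stops.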
>> 
>> 
>>> On Feb 2, 2022, at 3:46 PM, Christoph Niethammer <nietham...@hlrs.de> wrote:
>>> 
>>> The linked pastebin includes the following version information:
>>> 
>>> [1,0]<stdout>:package:Open MPI spackapps@eu-c7-042-03 Distribution
>>> [1,0]<stdout>:ompi:version:full:4.0.2
>>> [1,0]<stdout>:ompi:version:repo:v4.0.2
>>> [1,0]<stdout>:ompi:version:release_date:Oct 07, 2019
>>> [1,0]<stdout>:orte:version:full:4.0.2
>>> [1,0]<stdout>:orte:version:repo:v4.0.2
>>> [1,0]<stdout>:orte:version:release_date:Oct 07, 2019
>>> [1,0]<stdout>:opal:version:full:4.0.2
>>> [1,0]<stdout>:opal:version:repo:v4.0.2
>>> [1,0]<stdout>:opal:version:release_date:Oct 07, 2019
>>> [1,0]<stdout>:mpi-api:version:full:3.1.0
>>> [1,0]<stdout>:ident:4.0.2
>>> 
>>> Best
>>> Christoph
>>> 
>>> ----- Original Message -----
>>> From: "Open MPI Users" <users@lists.open-mpi.org>
>>> To: "Open MPI Users" <users@lists.open-mpi.org>
>>> Cc: "Ralph Castain" <r...@open-mpi.org>
>>> Sent: Thursday, 3 February, 2022 00:22:30
>>> Subject: Re: [OMPI users] Error using rankfile to bind multiple cores on 
>>> the same node for threaded OpenMPI application
>>> 
>>> Errr...what version OMPI are you using?
>>> 
>>>> On Feb 2, 2022, at 3:03 PM, David Perozzi via users 
>>>> <users@lists.open-mpi.org> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> I'm trying to run a code implemented with OpenMPI and OpenMP (for 
>>>> threading) on a large cluster that uses LSF for the job scheduling and 
>>>> dispatch. The problem with LSF is that it is not very straightforward to 
>>>> allocate and bind the right number of threads to an MPI rank inside a 
>>>> single node. Therefore, I have to create a rankfile myself, as soon as the 
>>>> (a priori unknown) resources are allocated.
>>>> 
>>>> So, after my job gets dispatched, I run:
>>>> 
>>>> mpirun -n "$nslots" -display-allocation -nooversubscribe --map-by 
>>>> core:PE=1 --bind-to core mpi_allocation/show_numactl.sh 
>>>> >mpi_allocation/allocation_files/allocation.txt
>>>> 
>>>> where show_numactl.sh consists of just one line:
>>>> 
>>>> { hostname; numactl --show; } | sed ':a;N;s/\n/ /;ba'
>>>> 
>>>> If I ask for 16 slots, in blocks of 4 (i.e., bsub -n 16 -R 
>>>> "span[block=4]"), I get something like:
>>>> 
>>>> ======================   ALLOCATED NODES   ======================
>>>>    eu-g1-006-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
>>>>    eu-g1-009-2: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
>>>>    eu-g1-002-3: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
>>>>    eu-g1-005-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
>>>> =================================================================
>>>> eu-g1-006-1 policy: default preferred node: current physcpubind: 16  
>>>> cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
>>>> eu-g1-006-1 policy: default preferred node: current physcpubind: 24  
>>>> cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
>>>> eu-g1-006-1 policy: default preferred node: current physcpubind: 32  
>>>> cpubind: 2  nodebind: 2  membind: 0 1 2 3 4 5 6 7
>>>> eu-g1-002-3 policy: default preferred node: current physcpubind: 21  
>>>> cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
>>>> eu-g1-002-3 policy: default preferred node: current physcpubind: 22  
>>>> cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
>>>> eu-g1-009-2 policy: default preferred node: current physcpubind: 0  
>>>> cpubind: 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
>>>> eu-g1-009-2 policy: default preferred node: current physcpubind: 1  
>>>> cpubind: 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
>>>> eu-g1-009-2 policy: default preferred node: current physcpubind: 2  
>>>> cpubind: 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
>>>> eu-g1-002-3 policy: default preferred node: current physcpubind: 19  
>>>> cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
>>>> eu-g1-002-3 policy: default preferred node: current physcpubind: 23  
>>>> cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
>>>> eu-g1-006-1 policy: default preferred node: current physcpubind: 52  
>>>> cpubind: 3  nodebind: 3  membind: 0 1 2 3 4 5 6 7
>>>> eu-g1-009-2 policy: default preferred node: current physcpubind: 3  
>>>> cpubind: 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
>>>> eu-g1-005-1 policy: default preferred node: current physcpubind: 90  
>>>> cpubind: 5  nodebind: 5  membind: 0 1 2 3 4 5 6 7
>>>> eu-g1-005-1 policy: default preferred node: current physcpubind: 91  
>>>> cpubind: 5  nodebind: 5  membind: 0 1 2 3 4 5 6 7
>>>> eu-g1-005-1 policy: default preferred node: current physcpubind: 94  
>>>> cpubind: 5  nodebind: 5  membind: 0 1 2 3 4 5 6 7
>>>> eu-g1-005-1 policy: default preferred node: current physcpubind: 95  
>>>> cpubind: 5  nodebind: 5  membind: 0 1 2 3 4 5 6 7
>>>> 
>>>> After that, I parse this allocation file in python and I create a hostfile 
>>>> and a rankfile.
>>>> 
>>>> The hostfile reads:
>>>> 
>>>> eu-g1-006-1
>>>> eu-g1-009-2
>>>> eu-g1-002-3
>>>> eu-g1-005-1
>>>> 
>>>> The rankfile:
>>>> 
>>>> rank 0=eu-g1-006-1 slot=16,24,32,52
>>>> rank 1=eu-g1-009-2 slot=0,1,2,3
>>>> rank 2=eu-g1-002-3 slot=21,22,19,23
>>>> rank 3=eu-g1-005-1 slot=90,91,94,95
>>>> 
>>>> Following OpenMPI's manpages and FAQs, I then run my application using
>>>> 
>>>> mpirun -n "$nmpiproc" --rankfile mpi_allocation/hostfiles/rankfile --mca 
>>>> rmaps_rank_file_physical 1 ./build/"$executable_name" true "$input_file"
>>>> 
>>>> where the bash variables are passed in directly in the bsub command (I 
>>>> basically run bsub -n 16 -R "span[block=4]" "my_script.sh num_slots 
>>>> num_thread_per_rank executable_name input_file").
>>>> 
>>>> 
>>>> Now, this procedure sometimes works just fine, sometimes not. When it 
>>>> doesn't, the problem is that I don't get any error message (I noticed that 
>>>> if an error is made inside the rankfile, one does not get any error). 
>>>> Strangely, it seems that for 16 slots and four threads (so 4 MPI ranks), 
>>>> it works better if I have 8 slots allocated in two nodes than if I have 4 
>>>> slots in 4 different nodes. My goal is to run the application with 256 
>>>> slots and 32 threads per rank (the cluster has mainly AMD EPYC based 
>>>> nodes).
>>>> 
>>>> The ompi information of the nodes running a failed job and the rankfile 
>>>> for that failed job can be found at https://pastebin.com/40f6FigH and the 
>>>> allocation file at https://pastebin.com/jeWnkU40
>>>> 
>>>> 
>>>> Do you see any problem with my procedure? Why is it failing seemingly 
>>>> randomly? Can I somehow get more information about what's failing from 
>>>> mpirun?
>>>> 
>>>> 
>>>> I hope I haven't omitted too much information but, in case, just ask and 
>>>> I'll provide more details.
>>>> 
>>>> 
>>>> Cheers,
>>>> 
>>>> David
>>>> 
>>>> 
>> 
> 

