I can confirm it works fine with 4.0.6

Thanks for your help Gilles.

On Thu, 2021-08-26 at 06:53 +0000, Broi, Franco via users wrote:
Any chance of rpms for CentOS 8 for newer versions?
________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Gilles Gouaillardet 
via users <users@lists.open-mpi.org>
Sent: 26 August 2021 14:47
To: Broi, Franco via users <users@lists.open-mpi.org>
Cc: Gilles Gouaillardet <gil...@rist.or.jp>
Subject: Re: [OMPI users] OpenMPI-4.0.5 and MPI_spawn

Indeed ...


I am not 100% sure the two errors are unrelated, but anyway,


That example passes with Open MPI 4.0.1 and 4.0.6 but crashes with the
versions in between.

It also passes with the 4.1 and master branches


Bottom line: upgrade Open MPI to the latest version and you should be fine.



Cheers,


Gilles
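For readers landing on this thread later: the attached test programs are not reproduced in the archive, but a minimal sketch of this kind of spawn test looks roughly like the following (a hypothetical stand-in, not the actual spawn_example; the 999 token mirrors the output quoted below). The same binary acts as parent or child depending on MPI_Comm_get_parent.

```c
/* spawn_sketch.c -- minimal MPI_Comm_spawn demo (illustrative sketch only).
 * Build: mpicc spawn_sketch.c -o spawn_sketch
 * Run:   mpirun -c 1 ./spawn_sketch 47
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, inter;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent side: spawn N copies of ourselves. */
        int nchildren = (argc > 1) ? atoi(argv[1]) : 1;
        printf("I'm the parent\nStarting %d children\n", nchildren);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, nchildren,
                       MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter,
                       MPI_ERRCODES_IGNORE);
        /* Broadcast a token over the intercommunicator; the single
         * root process in the parent group passes MPI_ROOT. */
        int token = 999;
        MPI_Bcast(&token, 1, MPI_INT, MPI_ROOT, inter);
    } else {
        /* Spawned side: receive the token from parent rank 0. */
        int rank, size, token;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Bcast(&token, 1, MPI_INT, 0, parent);
        printf("Received %d (rank %d of %d)\n", token, rank, size);
    }

    MPI_Finalize();
    return 0;
}
```

The failure discussed in this thread happens inside MPI_Init of the spawned processes (ompi_dpm_dyn_init), before any user code runs, which is why the workaround is to change the selected pml rather than the test program.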

On 8/26/2021 2:42 PM, Broi, Franco via users wrote:
>
> Thanks Gilles but no go...
>
> /usr/lib64/openmpi/bin/mpirun -c 1 --mca pml ^ucx
> /home/franco/spawn_example 47
>
> I'm the parent on fsc07
> Starting 47 children
>
>   Process 1 ([[48649,2],32]) is on host: fsc08
>   Process 2 ([[48649,1],0]) is on host: unknown!
>   BTLs attempted: vader tcp self
>
> Your MPI job is now going to abort; sorry.
>
> [fsc08:465159] [[45369,2],27] ORTE_ERROR_LOG: Unreachable in file
> dpm/dpm.c at line 493
>
> On Thu, 2021-08-26 at 14:30 +0900, Gilles Gouaillardet via users wrote:
>> Franco,
>>
>> I am surprised UCX gets selected since there is no Infiniband network.
>> There used to be a bug that lead UCX to be selected on shm/tcp
>> systems, but
>> it has been fixed. You might want to give a try to the latest
>> versions of Open MPI
>> (4.0.6 or 4.1.1)
>>
>> Meanwhile, try to
>> mpirun --mca pml ^ucx ...
>> and see if it helps
>>
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thu, Aug 26, 2021 at 2:13 PM Broi, Franco via users
>> <users@lists.open-mpi.org> wrote:
>>> Hi,
>>>
>>> I have 2 example progs that I found on the internet (attached) that
>>> illustrate a problem we are having launching multiple node jobs with
>>> OpenMPI-4.0.5 and MPI_spawn
>>>
>>> CentOS Linux release 8.4.2105
>>> openmpi-4.0.5-3.el8.x86_64
>>> Slurm 20.11.8
>>>
>>> 10Gbit ethernet network, no IB or other networks
>>>
>>> I allocate 2 nodes, each with 24 cores. They are identical systems
>>> with a shared NFS root.
>>>
>>> salloc -p fsc -w fsc07,fsc08 --ntasks-per-node=24
>>>
>>> Running the hello prog with OpenMPI 4.0.5
>>>
>>> /usr/lib64/openmpi/bin/mpirun --version
>>> mpirun (Open MPI) 4.0.5
>>>
>>> /usr/lib64/openmpi/bin/mpirun /home/franco/hello
>>>
>>> MPI_Init(): 307.434000
>>> hello, world (rank 0 of 48 fsc07)
>>> ...
>>> MPI_Init(): 264.714000
>>> hello, world (rank 47 of 48 fsc08)
>>>
>>> All well and good.
>>>
>>> Now running the MPI_spawn example prog with OpenMPI 4.0.1
>>>
>>> /library/mpi/openmpi-4.0.1/bin/mpirun -c 1
>>> /home/franco/spawn_example 47
>>>
>>> I'm the parent on fsc07
>>> Starting 47 children
>>>
>>> I'm the spawned.
>>> hello, world (rank 0 of 47 fsc07)
>>> Received 999 err 0 (rank 0 of 47 fsc07)
>>> I'm the spawned.
>>> hello, world (rank 1 of 47 fsc07)
>>> Received 999 err 0 (rank 1 of 47 fsc07)
>>> ....
>>> I'm the spawned.
>>> hello, world (rank 45 of 47 fsc08)
>>> Received 999 err 0 (rank 45 of 47 fsc08)
>>> I'm the spawned.
>>> hello, world (rank 46 of 47 fsc08)
>>> Received 999 err 0 (rank 46 of 47 fsc08)
>>>
>>> Works fine.
>>>
>>> Now rebuild spawn_example with 4.0.5 and run as before
>>>
>>> ldd /home/franco/spawn_example | grep openmpi
>>>         libmpi.so.40 => /usr/lib64/openmpi/lib/libmpi.so.40
>>> (0x00007fc2c0655000)
>>>         libopen-rte.so.40 =>
>>> /usr/lib64/openmpi/lib/libopen-rte.so.40 (0x00007fc2bfdb6000)
>>>         libopen-pal.so.40 =>
>>> /usr/lib64/openmpi/lib/libopen-pal.so.40 (0x00007fc2bfb08000)
>>>
>>> /usr/lib64/openmpi/bin/mpirun --version
>>> mpirun (Open MPI) 4.0.5
>>>
>>> /usr/lib64/openmpi/bin/mpirun -c 1 /home/franco/spawn_example 47
>>>
>>> I'm the parent on fsc07
>>> Starting 47 children
>>>
>>> [fsc08:463361] pml_ucx.c:178  Error: Failed to receive UCX worker address: 
>>> Not found (-13)
>>> [fsc08:463361] [[42596,2],32] ORTE_ERROR_LOG: Error in file dpm/dpm.c at 
>>> line 493
>>> ....
>>> [fsc08:462917] pml_ucx.c:178  Error: Failed to receive UCX worker address: 
>>> Not found (-13)
>>> [fsc08:462917] [[42416,2],33] ORTE_ERROR_LOG: Error in file dpm/dpm.c at 
>>> line 493
>>>
>>>    ompi_dpm_dyn_init() failed
>>>    --> Returned "Error" (-1) instead of "Success" (0)
>>> --------------------------------------------------------------------------
>>> [fsc08:462926] *** An error occurred in MPI_Init
>>> [fsc08:462926] *** reported by process [2779774978,42]
>>> [fsc08:462926] *** on a NULL communicator
>>> [fsc08:462926] *** Unknown error
>>> [fsc08:462926] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
>>> will now abort,
>>> [fsc08:462926] ***    and potentially your MPI job)
>>> [fsc07:1158342] *** An error occurred in MPI_Comm_spawn_multiple
>>> [fsc07:1158342] *** reported by process [2779774977,0]
>>> [fsc07:1158342] *** on communicator MPI_COMM_WORLD
>>> [fsc07:1158342] *** MPI_ERR_OTHER: known error not in list
>>> [fsc07:1158342] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
>>> will now abort,
>>> [fsc07:1158342] ***    and potentially your MPI job)
>>> [1629952748.688500] [fsc07:1158342:0]           sock.c:244  UCX  ERROR 
>>> connect(fd=64, dest_addr=10.220.6.239:38471) failed: Connection refused
>>>
>>> The IP address is for node fsc08, the program is being run from fsc07
>>>
>>> I see the orted process running on fsc08 for both hello and
>>> spawn_example with the same arguments. I also tried turning on
>>> various debug options but I'm none the wiser.
>>>
>>> If I run the spawn example with 23 children it works fine - because
>>> they are all on fsc07.
>>>
>>> Any idea what might be wrong?
>>>
>>> Cheers,
>>> Franco
>>>
