Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support

2022-02-28 Thread Angel de Vicente via users
Hello,

"Jeff Squyres (jsquyres)"  writes:

> I'd recommend against using Open MPI v3.1.0 -- it's quite old.  If you
> have to use Open MPI v3.1.x, I'd at least suggest using v3.1.6, which
> has all the rolled-up bug fixes on the v3.1.x series.
>
> That being said, Open MPI v4.1.2 is the most current.  Open MPI v4.1.2 does
> restrict which versions of UCX it uses because there are bugs in the older
> versions of UCX.  I am not intimately familiar with UCX -- you'll need to ask
> Nvidia for support there -- but I was under the impression that it's just a
> user-level library, and you could certainly install your own copy of UCX to
> use with your compilation of Open MPI.  I.e., you're not restricted to
> whatever UCX is installed in the cluster system-default locations.

I did follow your advice: I compiled my own version of UCX (1.11.2) and
Open MPI v4.1.1, but for some reason the latency / bandwidth numbers are
really bad compared to the previous ones, so something is wrong, and I am
not sure how to debug it.
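
In case it is useful, this is roughly the kind of sanity check I plan to run,
to see whether this Open MPI build really selects the UCX PML and an
InfiniBand transport (just a sketch; the osu_latency location is an example,
not the actual path):

  # confirm this Open MPI install was built with the UCX PML
  ompi_info | grep -i ucx

  # list the transports/devices UCX detects on a node
  # (one would expect rc/dc/ud entries on the mlx device, not only tcp)
  ucx_info -d | grep -i -e transport -e device

  # re-run a point-to-point benchmark forcing the UCX PML and asking
  # Open MPI to report which PML gets selected
  mpirun -np 2 --mca pml ucx --mca pml_base_verbose 10 \
         -x UCX_LOG_LEVEL=info ./osu_latency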

> I don't know why you're getting MXM-specific error messages; those don't
> appear to be coming from Open MPI (especially since you configured Open MPI
> with --without-mxm).  If you can upgrade to Open MPI v4.1.2 and the latest
> UCX, see if you are still getting those MXM error messages.

In this latest attempt, yes, the MXM error messages are still there.

Cheers,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


[OMPI users] Need help for troubleshooting OpenMPI performance

2022-02-28 Thread Patrick Begou via users

Hi,

I am facing a performance problem with Open MPI on my cluster. In some 
situations my parallel code is really slow (the same binary running on a 
different mesh).


To investigate, the Fortran code is built with profiling options 
(mpifort -p -O3) and launched on 91 cores.


There is one mon.out file per process; they show a maximum CPU time of 20.4 
seconds for each process (32.7 seconds on my old cluster), which is fine.


But running on the new cluster takes nearly 3 minutes of elapsed time, 
instead of 1 minute on the old cluster.


The new cluster is running Open MPI 4.0.5 with HDR-100 connections.

The old cluster is running Open MPI 3.1 with QDR connections.

Running the OSU collective tests on 91 cores shows good latency values, and 
the point-to-point numbers between nodes are correct.
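
(For reference, the inter-node path can be isolated like this, forcing the
two ranks onto different nodes so that shared memory is not used; the
benchmark paths are just examples:)

  mpirun -np 2 --map-by node ./osu_latency
  mpirun -np 2 --map-by node ./osu_bw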


How can I investigate this problem, as it seems related to MPI 
communications in some situations that I can reproduce? Using Scalasca? 
Other tools? Open MPI is not built with special profiling options.
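
If Scalasca is the right tool, I imagine the workflow would be roughly the
following (a sketch, assuming Score-P and Scalasca are available; the source
and binary names are only placeholders):

  # rebuild with Score-P instrumentation (scorep wraps the MPI compiler)
  scorep mpifort -O3 -c mycode.f90
  scorep mpifort -O3 -o mycode mycode.o

  # run the instrumented binary under Scalasca to collect a profile
  scalasca -analyze mpirun -np 91 ./mycode

  # inspect the experiment directory Scalasca creates
  # (reports time spent in MPI calls / waiting, per rank)
  scalasca -examine scorep_mycode_91_sum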


Thanks

Patrick