Hello,

Joshua Ladd <jladd.m...@gmail.com> writes:

> These are very, very old versions of UCX and HCOLL installed in your
> environment. Also, MXM was deprecated years ago in favor of UCX. What
> version of MOFED is installed (run ofed_info -s)? What HCA generation
> is present (run ibstat).

MOFED is: MLNX_OFED_LINUX-4.1-1.0.2.0

As for the HCA generation, the ibstat command does not seem to be
installed here. Is there any other way to get this information? I
*think* they are ConnectX-3.
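
One way to answer the question above without ibstat might be to read the
driver's sysfs entries directly. The sketch below assumes the kernel
exposes /sys/class/infiniband/<device>/ with attributes such as hca_type,
board_id and fw_ver (the usual layout for the mlx4/mlx5 drivers, but not
verified on this cluster). list_hcas is a hypothetical helper name.

```python
from pathlib import Path


def list_hcas(sysfs_root="/sys/class/infiniband"):
    """Return {device_name: {attribute: value}} for each HCA under sysfs_root.

    Assumption: each device directory may contain the plain-text files
    hca_type, board_id and fw_ver; missing attributes are simply skipped.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        # No InfiniBand class directory: driver not loaded or no HCA present.
        return {}
    hcas = {}
    for dev in sorted(root.iterdir()):
        info = {}
        for attr in ("hca_type", "board_id", "fw_ver"):
            attr_file = dev / attr
            if attr_file.is_file():
                info[attr] = attr_file.read_text().strip()
        hcas[dev.name] = info
    return hcas


if __name__ == "__main__":
    for name, info in list_hcas().items():
        print(name, info)
```

The hca_type value is a device part identifier rather than a marketing
name, so it would still need to be looked up to confirm the generation.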


>     > Stupid answer from me. If latency/bandwidth numbers are bad then check
>     > that you are really running over the interface that you think you
>     > should be. You could be falling back to running over Ethernet.

Apparently the problem with my first attempt was that I was installing a
very bare version of UCX. I redid the installation with the following
configuration:

,----
| '--prefix=/storage/projects/can30/angelv/spack/opt/spack/linux-sles12-sandybridge/gcc-9.3.0/ucx-1.11.2-67aihiwsolnad6aqt2ei6j6iaptqgecf'
| '--enable-mt' '--enable-cma' '--disable-params-check' '--with-avx'
| '--enable-optimizations' '--disable-assertions' '--disable-logging'
| '--with-pic' '--with-rc' '--with-ud' '--with-dc' '--without-mlx5-dv'
| '--with-ib-hw-tm' '--with-dm' '--with-cm' '--without-rocm'
| '--without-java' '--without-cuda' '--without-gdrcopy' '--with-knem'
| '--without-xpmem'
`----


and now the numbers are very good, most of the time better than the
"native" OpenMPI provided in the cluster.


So now I wanted to try another combination, using the Intel compiler
instead of the GNU one. Apparently everything compiled OK, and when I
run the OSU Micro-Benchmarks I have no problems with the
point-to-point benchmarks, but I get segmentation faults:

,----
| load intel/2018.2 Set Intel compilers (LICENSE NEEDED! Please, contact support if you have any issue with license)
| /scratch/slurm/job1182830/slurm_script: line 59: unalias: despacktivate: not found
| [s01r2b22:26669] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
| [s01r2b23:20286] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
| [s01r2b22:26681:0] Caught signal 11 (Segmentation fault)
| [s01r2b23:20292:0] Caught signal 11 (Segmentation fault)
| ==== backtrace ====
|  2 0x000000000010000c mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
|  3 0x000000000010055c mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
|  4 0x0000000000034950 killpg()  ??:0
|  5 0x00000000000a7d41 PMPI_Comm_rank()  ??:0
|  6 0x0000000000402e56 main()  ??:0
|  7 0x00000000000206e5 __libc_start_main()  ??:0
|  8 0x0000000000402ca9 _start()  /home/abuild/rpmbuild/BUILD/glibc-2.22/csu/../sysdeps/x86_64/start.S:118
| ===================
| ==== backtrace ====
|  2 0x000000000010000c mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
|  3 0x000000000010055c mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
|  4 0x0000000000034950 killpg()  ??:0
|  5 0x00000000000a7d41 PMPI_Comm_rank()  ??:0
|  6 0x0000000000402e56 main()  ??:0
|  7 0x00000000000206e5 __libc_start_main()  ??:0
|  8 0x0000000000402ca9 _start()  /home/abuild/rpmbuild/BUILD/glibc-2.22/csu/../sysdeps/x86_64/start.S:118
| ===================
`----


Any idea how I could try to debug/solve this?

Thanks,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/
