Hi Mikhail,
On 2022-06-02 02:51, Mikhail Brinskii wrote:
Hi Eric,
Yes, UCX is supposed to be stable for large-sized problems.
Since I am well aware of what it takes to deliver and deploy tested and
verified software, this gives me headaches: the validation toolchain
needed for an MPI library (or a component of one) to exercise use cases
with large-scale computations itself requires large-scale hardware and
large data sets... which not everybody has access to...
So how can very large use cases be tested, nightly or in CI, for
libraries like UCX or MPI itself? And out of curiosity, how is it done for UCX?
Did you see the same crash with both OMPI-4.0.3 + UCX 1.8.0 and
OMPI-4.1.2 + UCX 1.11.2?
Yep! Exactly at the same place. Here is the stack for UCX 1.9.0 and
OpenMPI 4.1.1:
Fri May 27 21:23:44 2022<stdout>:Erreur : MEF++ Signal recu : 11 :
segmentation violation
Fri May 27 21:23:44 2022<stdout>:Erreur :
Fri May 27 21:23:44 2022<stdout>:------------------------------ (Début
des informations destinées aux développeurs C++)
------------------------------
Fri May 27 21:23:44 2022<stdout>:La pile d'appels contient 27 symboles.
Fri May 27 21:23:44 2022<stdout>:# 000:
reqBacktrace(std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >&) >>> probGD.opt
(probGD.opt(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x42)
[0x411942])
Fri May 27 21:23:44 2022<stdout>:# 001: attacheDebugger() >>>
probGD.opt (probGD.opt(_Z15attacheDebuggerv+0x2a1) [0x4137b1])
Fri May 27 21:23:44 2022<stdout>:# 002:
/gpfs/fs0/project/d/deteix/MEF++_petscGIREF_64bits/avx2/bin/../lib/libgiref_opt_Util.so(traitementSignal+0x1fef)
[0x2b1c7a27017f]
Fri May 27 21:23:44 2022<stdout>:# 003:
/cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(+0x38980)
[0x2b1c872fb980]
Fri May 27 21:23:44 2022<stdout>:# 004:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.9.0/lib/libucp.so.0(ucp_dt_pack+0x13e)
[0x2b1c8ae558fe]
Fri May 27 21:23:44 2022<stdout>:# 005:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.9.0/lib/libucp.so.0(+0x2cc10)
[0x2b1c8ae5ec10]
Fri May 27 21:23:44 2022<stdout>:# 006:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.9.0/lib/ucx/libuct_ib.so.0(uct_dc_mlx5_ep_am_bcopy+0xbd)
[0x2b1c8b0c163d]
Fri May 27 21:23:44 2022<stdout>:# 007:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.9.0/lib/libucp.so.0(+0x2c557)
[0x2b1c8ae5e557]
Fri May 27 21:23:44 2022<stdout>:# 008:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.9.0/lib/libucp.so.0(ucp_tag_send_nbx+0x34d)
[0x2b1c8ae696ad]
Fri May 27 21:23:44 2022<stdout>:# 009:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc10/openmpi/4.1.1/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xea)
[0x2b1c8ae2489a]
Fri May 27 21:23:44 2022<stdout>:# 010:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc10/openmpi/4.1.1/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0x94)
[0x2b1c86c86cb4]
Fri May 27 21:23:44 2022<stdout>:# 011:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc10/openmpi/4.1.1/lib/libmpi.so.40(ompi_coll_base_alltoallv_intra_pairwise+0x145)
[0x2b1c86c8ccd5]
Fri May 27 21:23:44 2022<stdout>:# 012:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc10/openmpi/4.1.1/lib/libmpi.so.40(ompi_coll_tuned_alltoallv_intra_dec_fixed+0x42)
[0x2b1c86c976e2]
Fri May 27 21:23:44 2022<stdout>:# 013:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc10/openmpi/4.1.1/lib/libmpi.so.40(MPI_Alltoallv+0x1a3)
[0x2b1c86c3a043]
Fri May 27 21:23:44 2022<stdout>:# 014:
/gpfs/fs0/project/d/deteix/petsc-3.17.1_ompi-4.1.1/lib/libparmetis.so(libparmetis__gkMPI_Alltoallv+0x111)
[0x2b1c86661211]
Fri May 27 21:23:44 2022<stdout>:# 015:
/gpfs/fs0/project/d/deteix/petsc-3.17.1_ompi-4.1.1/lib/libparmetis.so(ParMETIS_V3_Mesh2Dual+0x10b9)
[0x2b1c86674399]
Fri May 27 21:23:44 2022<stdout>:# 016:
/gpfs/fs0/project/d/deteix/petsc-3.17.1_ompi-4.1.1/lib/libparmetis.so(ParMETIS_V3_PartMeshKway+0x100)
[0x2b1c86674e10]
And for OpenMPI 4.0.3 with UCX 1.8.0:
Wed May 25 21:34:02 2022<stdout>:Erreur : MEF++ Signal recu : 11 :
segmentation violation
Wed May 25 21:34:02 2022<stdout>:Erreur :
Wed May 25 21:34:02 2022<stdout>:------------------------------ (Début
des informations destinées aux développeurs C++)
------------------------------
Wed May 25 21:34:02 2022<stdout>:La pile d'appels contient 26 symboles.
Wed May 25 21:34:02 2022<stdout>:# 000:
reqBacktrace(std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >&) >>> probGD.opt
(probGD.opt(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x42)
[0x411a42])
Wed May 25 21:34:02 2022<stdout>:# 001: attacheDebugger() >>>
probGD.opt (probGD.opt(_Z15attacheDebuggerv+0x287) [0x4137b7])
Wed May 25 21:34:02 2022<stdout>:# 002:
/gpfs/fs0/project/d/deteix/MEF++_64bits/avx2/bin/../lib/libgiref_opt_Util.so(traitementSignal+0x1e07)
[0x2aaeaea82cb7]
Wed May 25 21:34:02 2022<stdout>:# 003:
/cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(+0x38980)
[0x2aaeb98e2980]
Wed May 25 21:34:02 2022<stdout>:# 004:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucp.so.0(ucp_dt_pack+0x13b)
[0x2aaebd3d407b]
Wed May 25 21:34:02 2022<stdout>:# 005:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucp.so.0(+0x3872a)
[0x2aaebd3e472a]
Wed May 25 21:34:02 2022<stdout>:# 006:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/ucx/libuct_ib.so.0(uct_dc_mlx5_ep_am_bcopy+0xd3)
[0x2aaebd6a4713]
Wed May 25 21:34:02 2022<stdout>:# 007:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucp.so.0(+0x38ffc)
[0x2aaebd3e4ffc]
Wed May 25 21:34:02 2022<stdout>:# 008:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucp.so.0(ucp_tag_send_nbr+0x511)
[0x2aaebd3f7b91]
Wed May 25 21:34:02 2022<stdout>:# 009:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xbb)
[0x2aaea87132eb]
Wed May 25 21:34:02 2022<stdout>:# 010:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0x8c)
[0x2aaeb955d90c]
Wed May 25 21:34:02 2022<stdout>:# 011:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40(ompi_coll_base_alltoallv_intra_pairwise+0x13f)
[0x2aaeb9562eff]
Wed May 25 21:34:02 2022<stdout>:# 012:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40(MPI_Alltoallv+0x1a3)
[0x2aaeb9511be3]
Wed May 25 21:34:02 2022<stdout>:# 013:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/petsc-pardiso-64bits/3.17.1/lib/libstrumpack.so(libparmetis__gkMPI_Alltoallv+0x108)
[0x2aaeb15b1ca8]
Wed May 25 21:34:02 2022<stdout>:# 014:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/petsc-pardiso-64bits/3.17.1/lib/libpetsc.so.3.17(ParMETIS_V3_Mesh2Dual+0x10af)
[0x2aaeb081c98f]
Wed May 25 21:34:02 2022<stdout>:# 015:
probGD.opt(ParMETIS_V3_PartMeshKway+0x100) [0x432680]
Have you also tried to run large-sized problem tests with OMPI-5.0.x?
Not for large problems, only small ones without UCX... I am not
compiling my own MPI on the Compute Canada clusters, but I do in our lab
for our nightly validation tests.
Regarding the application, at some point it invokes MPI_Alltoallv
sending more than 2GB to some of the ranks (using derived datatypes), right?
I still have to track down the specific call, but I am not sure it is
sending 2GB to a single rank; it may be 2GB divided among many ranks.
The fact is that this part of the code, when it works, does not create
such a bump in memory usage... But I have to dig a bit more...
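For what it's worth, below is roughly the check I have in mind: a small
helper, called with the same send counts and datatype as a given
MPI_Alltoallv, that logs the largest per-destination message so I can see
whether any single destination really exceeds 2 GiB. The helper name and
the way it would be wired into our code are made up here; only the MPI
calls themselves are standard:

#include <limits.h>
#include <stdio.h>
#include <mpi.h>

/* Hypothetical helper: report the largest per-destination message (in bytes)
 * that an MPI_Alltoallv with these send counts and this datatype would emit. */
static void log_max_alltoallv_send(const int *sendcounts, MPI_Datatype sendtype,
                                   MPI_Comm comm)
{
  int rank, nprocs, type_size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &nprocs);
  MPI_Type_size(sendtype, &type_size);

  long long max_bytes = 0, total_bytes = 0;
  int max_dest = -1;
  for (int dest = 0; dest < nprocs; ++dest) {
    long long bytes = (long long)sendcounts[dest] * (long long)type_size;
    total_bytes += bytes;
    if (bytes > max_bytes) { max_bytes = bytes; max_dest = dest; }
  }

  fprintf(stderr, "[rank %03d] largest single message: %lld bytes (to rank %d), "
                  "total sent: %lld bytes%s\n",
          rank, max_bytes, max_dest, total_bytes,
          max_bytes > (long long)INT_MAX ? "  <-- exceeds INT_MAX" : "");
}

Since the crashing MPI_Alltoallv is inside ParMETIS_V3_Mesh2Dual, I would
either have to patch ParMETIS to run such a check on its internal counts,
or interpose on MPI_Alltoallv through the standard PMPI profiling
interface; that should tell me whether we are in the "one message > 2GB"
case or only in the "many messages summing to > 2GB" case.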
Regards,
Eric
//WBR, Mikhail
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Eric
Chamberland via users
Sent: Thursday, June 2, 2022 5:31 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Eric Chamberland <eric.chamberl...@giref.ulaval.ca>; Thomas
Briffard <thomas.briff...@michelin.com>; Vivien Clauzon
<vivien.clau...@michelin.com>; dave.mar...@giref.ulaval.ca; Ramses van
Zon <r...@scinet.utoronto.ca>; charles.coulomb...@ulaval.ca
Subject: [OMPI users] Segfault in ucp_dt_pack function from UCX
library 1.8.0 and 1.11.2 for large sized communications using both
OpenMPI 4.0.3 and 4.1.2
Hi,
In the past, we have successfully launched large-sized (finite element)
computations using ParMETIS as the mesh partitioner.
We first succeeded in 2012 with OpenMPI (v2.?) and then again in March
2019 with OpenMPI 3.1.2.
Today, we have a bunch of nightly (small) tests running nicely and
covering all of OpenMPI (4.0.x, 4.1.x and 5.0.x), MPICH-3.3.2 and
IntelMPI 2021.6.
Preparing to launch the same computation we did in 2012, and even
larger ones, we compiled with both OpenMPI 4.0.3+ucx-1.8.0 and OpenMPI
4.1.2+ucx-1.11.2 and launched computations ranging from small to large
problems (meshes).
For small meshes, it goes fine.
But when we get near 2^31 faces in the 3D mesh we are using and call
ParMETIS_V3_PartMeshKway, we always get a segfault with the same
backtrace pointing into the UCX library:
Wed Jun 1 23:04:54
2022<stdout>:chrono::InterfaceParMetis::ParMETIS_V3_PartMeshKway::debut
VmSize: 1202304 VmRSS: 349456 VmPeak: 1211736 VmData: 500764 VmHWM:
359012 <etiq_18>
Wed Jun 1 23:07:07 2022<stdout>:Erreur : MEF++ Signal recu : 11 :
segmentation violation
Wed Jun 1 23:07:07 2022<stdout>:Erreur :
Wed Jun 1 23:07:07 2022<stdout>:------------------------------ (Début
des informations destinées aux développeurs C++)
------------------------------
Wed Jun 1 23:07:07 2022<stdout>:La pile d'appels contient 27 symboles.
Wed Jun 1 23:07:07 2022<stdout>:# 000:
reqBacktrace(std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >&) >>> probGD.opt
(probGD.opt(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x71)
[0x4119f1])
Wed Jun 1 23:07:07 2022<stdout>:# 001: attacheDebugger() >>>
probGD.opt (probGD.opt(_Z15attacheDebuggerv+0x29a) [0x41386a])
Wed Jun 1 23:07:07 2022<stdout>:# 002:
/gpfs/fs0/project/d/deteix/ericc/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x1f9f)
[0x2ab3aef0e5cf]
Wed Jun 1 23:07:07 2022<stdout>:# 003: /lib64/libc.so.6(+0x36400)
[0x2ab3bd59a400]
Wed Jun 1 23:07:07 2022<stdout>:# 004:
/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(ucp_dt_pack+0x123)
[0x2ab3c966e353]
Wed Jun 1 23:07:07 2022<stdout>:# 005:
/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(+0x536b7)
[0x2ab3c968d6b7]
Wed Jun 1 23:07:07 2022<stdout>:# 006:
/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/ucx/libuct_ib.so.0(uct_dc_mlx5_ep_am_bcopy+0xd7)
[0x2ab3ca712137]
Wed Jun 1 23:07:07 2022<stdout>:# 007:
/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(+0x52d3c)
[0x2ab3c968cd3c]
Wed Jun 1 23:07:07 2022<stdout>:# 008:
/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(ucp_tag_send_nbx+0x5ad)
[0x2ab3c9696dcd]
Wed Jun 1 23:07:07 2022<stdout>:# 009:
/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf2)
[0x2ab3c922e0b2]
Wed Jun 1 23:07:07 2022<stdout>:# 010:
/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0x92)
[0x2ab3bbca5a32]
Wed Jun 1 23:07:07 2022<stdout>:# 011:
/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(ompi_coll_base_alltoallv_intra_pairwise+0x141)
[0x2ab3bbcad941]
Wed Jun 1 23:07:07 2022<stdout>:# 012:
/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_alltoallv_intra_dec_fixed+0x42)
[0x2ab3d4836da2]
Wed Jun 1 23:07:07 2022<stdout>:# 013:
/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(PMPI_Alltoallv+0x29)
[0x2ab3bbc7bdf9]
Wed Jun 1 23:07:07 2022<stdout>:# 014:
/scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/lib/libparmetis.so(libparmetis__gkMPI_Alltoallv+0x106)
[0x2ab3bb0e1c06]
Wed Jun 1 23:07:07 2022<stdout>:# 015:
/scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/lib/libparmetis.so(ParMETIS_V3_Mesh2Dual+0xdd6)
[0x2ab3bb0f10b6]
Wed Jun 1 23:07:07 2022<stdout>:# 016:
/scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/lib/libparmetis.so(ParMETIS_V3_PartMeshKway+0x100)
[0x2ab3bb0f1ac0]
ParMETIS is compiled as part of PETSc 3.17.1 with 64-bit indices. Here
are the PETSc configure options:
--prefix=/scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1
COPTFLAGS=\"-O2 -march=native\"
CXXOPTFLAGS=\"-O2 -march=native\"
FOPTFLAGS=\"-O2 -march=native\"
--download-fftw=1
--download-hdf5=1
--download-hypre=1
--download-metis=1
--download-mumps=1
--download-parmetis=1
--download-plapack=1
--download-prometheus=1
--download-ptscotch=1
--download-scotch=1
--download-sprng=1
--download-superlu_dist=1
--download-triangle=1
--with-avx512-kernels=1
--with-blaslapack-dir=/scinet/intel/oneapi/2021u4/mkl/2021.4.0
--with-cc=mpicc
--with-cxx=mpicxx
--with-cxx-dialect=C++11
--with-debugging=0
--with-fc=mpifort
--with-mkl_pardiso-dir=/scinet/intel/oneapi/2021u4/mkl/2021.4.0
--with-scalapack=1
--with-scalapack-lib=\"[/scinet/intel/oneapi/2021u4/mkl/2021.4.0/lib/intel64/libmkl_scalapack_lp64.so,/scinet/intel/oneapi/2021u4/mkl/2021.4.0/lib/intel64/libmkl_blacs_openmpi_lp64.so]\"
--with-x=0
--with-64-bit-indices=1
--with-memalign=64
and the OpenMPI configure options:
'--prefix=/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2'
'--enable-mpi-cxx'
'--enable-mpi1-compatibility'
'--with-hwloc=internal'
'--with-knem=/opt/knem-1.1.3.90mlnx1'
'--with-libevent=internal'
'--with-platform=contrib/platform/mellanox/optimized'
'--with-pmix=internal'
'--with-slurm=/opt/slurm'
'--with-ucx=/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2'
I am then wondering:
1) Is the UCX library considered "stable" for production use with very
large-sized problems?
2) Is there a way to "bypass" UCX at runtime? (see the sketch just after
these questions)
3) Any idea for debugging this?
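For question 2, what I have in mind is forcing Open MPI to use a PML
other than UCX, along these lines (I am not sure which fallback is
actually available and supported on these clusters, so this is only a
guess):

mpirun --mca pml ^ucx -np <nprocs> ./probGD.opt <args>
mpirun --mca pml ob1 --mca btl tcp,self -np <nprocs> ./probGD.opt <args>

i.e. excluding the ucx PML, or explicitly selecting ob1 with a BTL such
as tcp, so that the communications no longer go through libucp at all.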
Of course, I do not yet have a "minimal reproducer" that triggers the
bug, since it happens only on "large" problems, but I think I could
export the data for a 512-process reproducer with the ParMETIS call only...
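If it helps, the standalone reproducer I have in mind would be a tiny
MPI/C program that reads the exported distributed mesh arrays and does
nothing but the partitioning call; a rough sketch follows. The per-rank
binary file format and the read_idx_array helper are placeholders of my
own, and ncommonnodes = 3 assumes a tetrahedral mesh; only the
ParMETIS_V3_PartMeshKway call itself is the real thing:

#include <mpi.h>
#include <parmetis.h>
#include <stdio.h>
#include <stdlib.h>

/* Placeholder export format: per-rank binary file holding the array length
 * (one idx_t) followed by the array itself. */
static idx_t *read_idx_array(const char *name, int rank, idx_t *len)
{
  char path[256];
  snprintf(path, sizeof(path), "%s.%06d.bin", name, rank);
  FILE *f = fopen(path, "rb");
  if (!f || fread(len, sizeof(idx_t), 1, f) != 1) MPI_Abort(MPI_COMM_WORLD, 1);
  idx_t *a = malloc((size_t)(*len) * sizeof(idx_t));
  if (fread(a, sizeof(idx_t), (size_t)(*len), f) != (size_t)(*len)) MPI_Abort(MPI_COMM_WORLD, 1);
  fclose(f);
  return a;
}

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  MPI_Comm comm = MPI_COMM_WORLD;
  int rank, nprocs;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &nprocs);

  /* Distributed mesh in the ParMETIS format, exported from the real run. */
  idx_t nelmdist, neptr, neind;
  idx_t *elmdist = read_idx_array("elmdist", rank, &nelmdist); /* nprocs+1 entries */
  idx_t *eptr    = read_idx_array("eptr",    rank, &neptr);    /* nlocal+1 entries */
  idx_t *eind    = read_idx_array("eind",    rank, &neind);    /* local connectivity */
  idx_t nlocal   = neptr - 1;

  idx_t wgtflag = 0, numflag = 0, ncon = 1;
  idx_t ncommonnodes = 3;                 /* assumption: tetrahedral elements */
  idx_t nparts = (idx_t)nprocs, edgecut = 0;
  idx_t options[3] = {0, 0, 0};
  real_t ubvec[1] = {1.05};
  real_t *tpwgts = malloc((size_t)(nparts * ncon) * sizeof(real_t));
  idx_t  *part   = malloc((size_t)nlocal * sizeof(idx_t));
  for (idx_t i = 0; i < nparts * ncon; ++i) tpwgts[i] = (real_t)1.0 / nparts;

  /* The call whose internal MPI_Alltoallv segfaults near 2^31 faces. */
  int rc = ParMETIS_V3_PartMeshKway(elmdist, eptr, eind, /*elmwgt=*/NULL,
                                    &wgtflag, &numflag, &ncon, &ncommonnodes,
                                    &nparts, tpwgts, ubvec, options,
                                    &edgecut, part, &comm);
  if (rank == 0)
    printf("ParMETIS_V3_PartMeshKway returned %d, edgecut = %lld\n",
           rc, (long long)edgecut);

  MPI_Finalize();
  return 0;
}

I hope that, run on 512 processes with the exported data, this would be
enough to reproduce the crash without the rest of MEF++.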
Thanks for helping,
Eric
--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42