[OMPI users] CephFS and striping_factor
Hi,

I would like to know whether OpenMPI supports file creation with a "striping_factor" hint on CephFS. Looking at the CephFS library, I *think* it would be possible to honor it at file-creation time with "ceph_open_layout": https://github.com/ceph/ceph/blob/main/src/include/cephfs/libcephfs.h

Is this a possible future enhancement?

Thanks,

Eric

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
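For context, this is the user-facing code such a feature would serve — a minimal sketch only, assuming a hypothetical CephFS-aware MPI-IO backend that would map the standard "striping_factor" hint (the one ROMIO/OMPIO already honor on Lustre) onto ceph_open_layout() under the hood; no such component is confirmed to exist in this thread.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);
    /* Ask for 4 stripes/objects at creation time; a CephFS-aware MPI-IO
     * component could translate this into a ceph_open_layout() call. */
    MPI_Info_set(info, "striping_factor", "4");

    MPI_File_open(MPI_COMM_WORLD, "foo",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_File_close(&fh);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}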
[OMPI users] How to force striping_factor (on lustre or other FS)?
Hi,

In 2012 we wrote and tested functions that use MPI I/O to get good performance while doing I/O on a Lustre filesystem. Everything was fine with the "striping_factor" we passed at file creation.

Now I am trying to investigate a performance degradation we observed, and I am surprised because it looks like I am unable to create a new file with a given "striping_factor" with any MPI flavor. I attached a simple example of file creation with hints and tried it the following way with OpenMPI:

OpenMPI-4.0.3:

which mpicc
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/bin/mpicc

mpicc -o s simple_file_create_with_hint.c

rm -f foo && mpiexec -n 1 --mca io romio321 ./s foo && lfs getstripe foo
Creating the file by MPI_file_open : foo
Informations on file:
Key is 'striping_factor' and worth: '2'
...closing the file foo
Informations on file:
foo
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 49
    obdidx       objid          objid          group
        49   485863591     0x1cf5b0a7              0

# Not forcing romio321:

rm -f foo && mpiexec -n 1 ./s foo && lfs getstripe foo
Creating the file by MPI_file_open : foo
Informations on file:
Key is 'striping_factor' and worth: '2'
...closing the file foo
Informations on file:
foo
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 19
    obdidx       objid          objid          group
        19   482813449     0x1cc72609              0

First, as you can see, even though I ask for a striping_factor of 2, I only get 1! I tried writing some data too, but it changed nothing... Where am I wrong?

Second, I was expecting that when I re-open the file read-only, I would get some information back in the "MPI_Info", but it is empty... is that normal? For example, using mpich-3.2.1 I get the following output:

which mpicc
/cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc7.3/mpich/3.2.1/bin/mpicc

mpicc -o s simple_file_create_with_hint.c

rm -f foo && mpiexec -n 1 ./s foo && lfs getstripe foo
Creating the file by MPI_file_open : foo
Informations on file:
Key is 'cb_buffer_size' and worth: '16777216'
Key is 'romio_cb_read' and worth: 'automatic'
Key is 'romio_cb_write' and worth: 'automatic'
Key is 'cb_nodes' and worth: '1'
Key is 'romio_no_indep_rw' and worth: 'false'
Key is 'romio_cb_pfr' and worth: 'disable'
Key is 'romio_cb_fr_types' and worth: 'aar'
Key is 'romio_cb_fr_alignment' and worth: '1'
Key is 'romio_cb_ds_threshold' and worth: '0'
Key is 'romio_cb_alltoall' and worth: 'automatic'
Key is 'ind_rd_buffer_size' and worth: '4194304'
Key is 'ind_wr_buffer_size' and worth: '524288'
Key is 'romio_ds_read' and worth: 'automatic'
Key is 'romio_ds_write' and worth: 'automatic'
Key is 'cb_config_list' and worth: '*:1'
Key is 'romio_filesystem_type' and worth: 'UFS: Generic ROMIO driver for all UNIX-like file systems'
Key is 'romio_aggregator_list' and worth: '0 '
...closing the file foo
Informations on file:
Key is 'cb_buffer_size' and worth: '16777216'
Key is 'romio_cb_read' and worth: 'automatic'
Key is 'romio_cb_write' and worth: 'automatic'
Key is 'cb_nodes' and worth: '1'
Key is 'romio_no_indep_rw' and worth: 'false'
Key is 'romio_cb_pfr' and worth: 'disable'
Key is 'romio_cb_fr_types' and worth: 'aar'
Key is 'romio_cb_fr_alignment' and worth: '1'
Key is 'romio_cb_ds_threshold' and worth: '0'
Key is 'romio_cb_alltoall' and worth: 'automatic'
Key is 'ind_rd_buffer_size' and worth: '4194304'
Key is 'ind_wr_buffer_size' and worth: '524288'
Key is 'romio_ds_read' and worth: 'automatic'
Key is 'romio_ds_write' and worth: 'automatic'
Key is 'cb_config_list' and worth: '*:1'
Key is 'romio_filesystem_type' and worth: 'UFS: Generic ROMIO driver for all UNIX-like file systems'
Key is 'romio_aggregator_list' and worth: '0 '
foo
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 67
    obdidx       objid          objid          group
        67   357367195     0x154cfd9b              0

but I still get only a striping_factor of 1 on the created file...

Thanks,

Eric

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

/*
 * Simple function to abort execution and print an error based on MPI return values:
 */
void abortOnError(int ierr)
{
  if (ierr != MPI_SUCCESS) {
    printf("ERROR Returned by MPI: %d\n", ierr);
    char* lCharPtr = (char*) malloc(sizeof(char)*MPI_MAX_ERROR_STRING);
    int lLongueur = 0;
    MPI_Error_string(ierr, lCharPtr, &lLongueur);
    printf("ERROR_string Returned by MPI: %s\n", lCharPtr);
    free(lCharPtr);
    MPI_Abort(MPI_COMM_WORLD, ierr);
  }
}

/*
 * Here I use only the first hint for now, but you can try a few to see the difference:
 */
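The attachment is cut off at this point in the archive. The following is a hypothetical, self-contained reconstruction (not the original file) of what such a test typically does: set the "striping_factor" hint, create the file collectively, and dump the hints the library reports back, in the same spirit as the "Key is ... and worth ..." output shown above.

#include "mpi.h"
#include <stdio.h>

static void printInfo(MPI_File lFile)
{
    MPI_Info lInfo;
    int lNbKeys = 0;

    MPI_File_get_info(lFile, &lInfo);
    MPI_Info_get_nkeys(lInfo, &lNbKeys);
    printf("Informations on file:\n");
    for (int k = 0; k < lNbKeys; ++k) {
        char lKey[MPI_MAX_INFO_KEY + 1], lValue[MPI_MAX_INFO_VAL + 1];
        int lFlag = 0;
        MPI_Info_get_nthkey(lInfo, k, lKey);
        MPI_Info_get(lInfo, lKey, MPI_MAX_INFO_VAL, lValue, &lFlag);
        if (lFlag)
            printf("Key is '%s' and worth: '%s'\n", lKey, lValue);
    }
    MPI_Info_free(&lInfo);
}

int main(int argc, char** argv)
{
    MPI_File lFile;
    MPI_Info lHints;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&lHints);
    /* The hint under test: ask Lustre (via ROMIO/OMPIO) for 2 stripes. */
    MPI_Info_set(lHints, "striping_factor", "2");

    printf("Creating the file by MPI_file_open : %s\n", argv[1]);
    MPI_File_open(MPI_COMM_WORLD, argv[1],
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, lHints, &lFile);
    printInfo(lFile);
    printf("...closing the file %s\n", argv[1]);
    MPI_File_close(&lFile);

    MPI_Info_free(&lHints);
    MPI_Finalize();
    return 0;
}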
Re: [OMPI users] MPI I/O, Romio vs Ompio on GPFS
Hi,

I just about found what I wanted with "--mca io_base_verbose 100".

Now I am looking at performance on GPFS, and I must say OpenMPI 4.1.2 performs very poorly when it comes time to write. I launch 512 processes that read+compute (the ghost components of a mesh) and then later write a 79 GB file. Here are the timings (all in seconds):

IO module ; reading+ghost computing ; writing
ompio     ; 24.9 ; 2040+ (job got killed before completion)
romio321  ; 20.8 ; 15.6

I have run the job many times with the ompio module (the default) and with romio321, and the timings are always similar to those given.

I also activated maximum debug output with "--mca mca_base_verbose stdout,level:9 --mca mpi_show_mca_params all --mca io_base_verbose 100" and got a few lines, but nothing relevant for debugging:

Sat Jun 11 20:08:28 2022:chrono::ecritMaillageMPI::debut VmSize: 6530408 VmRSS: 5599604 VmPeak: 7706396 VmData: 5734408 VmHWM: 5699324
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] io:base:delete: deleting file: resultat01_-2.mail
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] io:base:delete: Checking all available modules
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] io:base:delete: component available: ompio, priority: 30
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] io:base:delete: component available: romio321, priority: 10
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] io:base:delete: Selected io component ompio
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] io:base:file_select: new file: resultat01_-2.mail
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] io:base:file_select: Checking all available modules
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] io:base:file_select: component available: ompio, priority: 30
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] io:base:file_select: component available: romio321, priority: 10
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] io:base:file_select: Selected io module ompio

What else can I do to dig into this? Are there parameters ompio is aware of for GPFS?

Thanks,

Eric

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42

On 2022-06-10 16:23, Eric Chamberland via users wrote:

Hi,

I want to try ROMIO with OpenMPI 4.1.2 because I am observing a big performance difference compared with IntelMPI on GPFS. I want to see, at *runtime*, all parameters (default values, names) used by MPI (at least for the "io" framework). I would like to have the same output that "ompi_info --all" gives me...

I have tried this:

mpiexec --mca io romio321 --mca mca_verbose 1 --mca mpi_show_mca_params 1 --mca io_base_verbose 1 ...

But I cannot see anything about io coming out... With "ompi_info" I do... Is it possible?

Thanks,

Eric

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
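For anyone wanting to reproduce this kind of comparison outside the full application, a small collective-write benchmark is usually enough to expose an ompio/romio321 gap on a given filesystem. This is a sketch only: the 64 MiB-per-rank size and the file path are arbitrary, not the sizes from the run above. Run it once with "--mca io ompio" and once with "--mca io romio321".

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Each rank writes 'chunk' bytes at its own offset with one collective call,
 * and rank 0 reports the slowest writer. Example:
 *   mpiexec -n 512 --mca io romio321 ./write_bench /gpfs/.../bench.dat     */
int main(int argc, char **argv)
{
    const MPI_Offset chunk = 64 * 1024 * 1024;   /* 64 MiB per rank (arbitrary) */
    int rank, size;
    MPI_File fh;
    char *buf;
    double t0, t1, local, worst;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf = malloc(chunk);
    for (MPI_Offset i = 0; i < chunk; ++i) buf[i] = (char)(rank + i);

    MPI_File_open(MPI_COMM_WORLD, argv[1],
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_File_write_at_all(fh, rank * chunk, buf, (int)chunk, MPI_BYTE,
                          MPI_STATUS_IGNORE);
    t1 = MPI_Wtime();
    MPI_File_close(&fh);

    local = t1 - t0;
    MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("collective write of %lld MiB/rank on %d ranks: %.1f s (max)\n",
               (long long)(chunk >> 20), size, worst);

    free(buf);
    MPI_Finalize();
    return 0;
}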
[OMPI users] MPI I/O, ROMIO and showing io mca parameters at run-time
Hi,

I want to try ROMIO with OpenMPI 4.1.2 because I am observing a big performance difference compared with IntelMPI on GPFS. I want to see, at *runtime*, all parameters (default values, names) used by MPI (at least for the "io" framework). I would like to have the same output that "ompi_info --all" gives me...

I have tried this:

mpiexec --mca io romio321 --mca mca_verbose 1 --mca mpi_show_mca_params 1 --mca io_base_verbose 1 ...

But I cannot see anything about io coming out... With "ompi_info" I do... Is it possible?

Thanks,

Eric

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
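Independently of mpiexec flags, the MPI_T tools interface can list control variables from inside the application at runtime, and recent OpenMPI releases expose their MCA parameters that way. The following is a sketch under that assumption; the "io" name-prefix filter is just a guess at the naming convention (parameters such as io_base_verbose show up with that prefix).

#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* List MPI_T control variables whose names start with "io", which is how
 * the io-framework MCA parameters are assumed to appear. */
int main(int argc, char **argv)
{
    int provided, ncvar;

    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_cvar_get_num(&ncvar);

    for (int i = 0; i < ncvar; ++i) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        if (MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &binding,
                                &scope) != MPI_SUCCESS)
            continue;
        if (strncmp(name, "io", 2) == 0)
            printf("%s : %s\n", name, desc);
    }

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}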
Re: [OMPI users] Segfault in ucp_dt_pack function from UCX library 1.8.0 and 1.11.2 for large sized communications using both OpenMPI 4.0.3 and 4.1.2
Hi,

to give further information about this problem... it seems not to be related to MPI or UCX at all, but to come from ParMETIS itself...

With ParMETIS installed from Spack with the "+int64" option, I have been able to use both OpenMPI 4.1.2 and IntelMPI 2021.6 successfully! With ParMETIS installed by PETSc with the "--with-64-bit-indices=1" option, none of the MPI implementations listed below work. I've opened an issue with PETSc here: https://gitlab.com/petsc/petsc/-/issues/1204#note_980344101 (a small check of the index width a ParMETIS build actually uses is sketched after the quoted message below).

So, sorry for disturbing the MPI guys here... Thanks for all the suggestions!

Eric

On 2022-06-01 23:31, Eric Chamberland via users wrote:

Hi,

In the past, we have successfully launched large-sized (finite elements) computations using ParMETIS as mesh partitioner. It was first in 2012 with OpenMPI (v2.?) and then in March 2019 with OpenMPI 3.1.2 that we succeeded.

Today, we have a bunch of nightly (small) tests running nicely and testing all of OpenMPI (4.0.x, 4.1.x and 5.0.x), MPICH-3.3.2 and IntelMPI 2021.6.

Preparing to launch the same computation we did in 2012, and even larger ones, we compiled with both OpenMPI 4.0.3+ucx-1.8.0 and OpenMPI 4.1.2+ucx-1.11.2 and launched computations from small to large problems (meshes). For small meshes, it goes fine. But when we reach nearly 2^31 faces in the 3D mesh we are using and call ParMETIS_V3_PartMeshKway, we always get a segfault with the same backtrace pointing into the UCX library:

Wed Jun 1 23:04:54 2022:chrono::InterfaceParMetis::ParMETIS_V3_PartMeshKway::debut VmSize: 1202304 VmRSS: 349456 VmPeak: 1211736 VmData: 500764 VmHWM: 359012
Wed Jun 1 23:07:07 2022:Erreur : MEF++ Signal recu : 11 : segmentation violation
Wed Jun 1 23:07:07 2022:Erreur :
Wed Jun 1 23:07:07 2022:-- (Début des informations destinées aux développeurs C++) --
Wed Jun 1 23:07:07 2022:La pile d'appels contient 27 symboles.
Wed Jun 1 23:07:07 2022:# 000: reqBacktrace(std::__cxx11::basic_string, std::allocator >&) >>> probGD.opt (probGD.opt(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x71) [0x4119f1]) Wed Jun 1 23:07:07 2022:# 001: attacheDebugger() >>> probGD.opt (probGD.opt(_Z15attacheDebuggerv+0x29a) [0x41386a]) Wed Jun 1 23:07:07 2022:# 002: /gpfs/fs0/project/d/deteix/ericc/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x1f9f) [0x2ab3aef0e5cf] Wed Jun 1 23:07:07 2022:# 003: /lib64/libc.so.6(+0x36400) [0x2ab3bd59a400] Wed Jun 1 23:07:07 2022:# 004: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(ucp_dt_pack+0x123) [0x2ab3c966e353] Wed Jun 1 23:07:07 2022:# 005: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(+0x536b7) [0x2ab3c968d6b7] Wed Jun 1 23:07:07 2022:# 006: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/ucx/libuct_ib.so.0(uct_dc_mlx5_ep_am_bcopy+0xd7) [0x2ab3ca712137] Wed Jun 1 23:07:07 2022:# 007: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(+0x52d3c) [0x2ab3c968cd3c] Wed Jun 1 23:07:07 2022:# 008: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(ucp_tag_send_nbx+0x5ad) [0x2ab3c9696dcd] Wed Jun 1 23:07:07 2022:# 009: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf2) [0x2ab3c922e0b2] Wed Jun 1 23:07:07 2022:# 010: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0x92) [0x2ab3bbca5a32] Wed Jun 1 23:07:07 2022:# 011: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(ompi_coll_base_alltoallv_intra_pairwise+0x141) [0x2ab3bbcad941] Wed Jun 1 23:07:07 2022:# 012: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_alltoallv_intra_dec_fixed+0x42) [0x2ab3d4836da2] Wed Jun 1 23:07:07 2022:# 013: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(PMPI_Alltoallv+0x29) [0x2ab3bbc7bdf9] Wed Jun 1 23:07:07 2022:# 014: /scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/lib/libparmetis.so(libparmetis__gkMPI_Alltoallv+0x106) [0x2ab3bb0e1c06] Wed Jun 1 23:07:07 2022:# 015: /scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/lib/libparmetis.so(ParMETIS_V3_Mesh2Dual+0xdd6) [0x2ab3bb0f10b6] Wed Jun 1 23:07:07 2022:# 016: /scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/lib/libparmetis.so(ParMETIS_V3_PartMeshKway+0x100) [0x2ab3bb0f1ac0] PARMetis is compiled as part of PETSc-3.17.1 with 64bit indices. Here are PETSc configure options: --prefix=/scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1 COPTFLAGS=\"-O2
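Following up on the int64 explanation above: a quick way to confirm which index width a given ParMETIS build really uses is a two-line probe compiled against its headers. This assumes the build installs the usual parmetis.h, which pulls in metis.h where idx_t and IDXTYPEWIDTH are defined; compile it with mpicc and the install's include path.

#include <stdio.h>
#include <parmetis.h>  /* includes metis.h, which defines idx_t and IDXTYPEWIDTH */

/* With IDXTYPEWIDTH = 32, meshes approaching 2^31 faces overflow inside
 * ParMETIS long before MPI or UCX ever see the data. */
int main(void)
{
    printf("IDXTYPEWIDTH = %d, sizeof(idx_t) = %zu bytes\n",
           (int)IDXTYPEWIDTH, sizeof(idx_t));
    return 0;
}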
Re: [OMPI users] Segfault in ucp_dt_pack function from UCX library 1.8.0 and 1.11.2 for large sized communications using both OpenMPI 4.0.3 and 4.1.2
Hi Josh,

ok, thanks for the suggestion. We are in the process of testing with IntelMPI right now. I hope to do it with a newer version of OpenMPI too. Do you suggest a minimum version of the UCX lib?

Thanks,

Eric

On 2022-06-02 04:05, Josh Hursey via users wrote:

I would suggest trying OMPI v4.1.4 (or the v5 snapshot)
* https://www.open-mpi.org/software/ompi/v4.1/
* https://www.mail-archive.com/announce@lists.open-mpi.org//msg00152.html

We fixed some large-payload collective issues in that release which might be what you are seeing here with MPI_Alltoallv and the tuned collective component.

On Thu, Jun 2, 2022 at 1:54 AM Mikhail Brinskii via users wrote:

Hi Eric,

Yes, UCX is supposed to be stable for large-sized problems. Did you see the same crash with both OMPI-4.0.3 + UCX 1.8.0 and OMPI-4.1.2 + UCX 1.11.2? Have you also tried to run the large-sized problem test with OMPI-5.0.x?

Regarding the application, at some point it invokes MPI_Alltoallv sending more than 2GB to some of the ranks (using derived dt), right?

//WBR, Mikhail

*From:* users *On Behalf Of* Eric Chamberland via users
*Sent:* Thursday, June 2, 2022 5:31 AM
*To:* Open MPI Users
*Cc:* Eric Chamberland; Thomas Briffard; Vivien Clauzon; dave.mar...@giref.ulaval.ca; Ramses van Zon; charles.coulomb...@ulaval.ca
*Subject:* [OMPI users] Segfault in ucp_dt_pack function from UCX library 1.8.0 and 1.11.2 for large sized communications using both OpenMPI 4.0.3 and 4.1.2

Hi,

In the past, we have successfully launched large-sized (finite elements) computations using ParMETIS as mesh partitioner. It was first in 2012 with OpenMPI (v2.?) and then in March 2019 with OpenMPI 3.1.2 that we succeeded.

Today, we have a bunch of nightly (small) tests running nicely and testing all of OpenMPI (4.0.x, 4.1.x and 5.0.x), MPICH-3.3.2 and IntelMPI 2021.6.

Preparing to launch the same computation we did in 2012, and even larger ones, we compiled with both OpenMPI 4.0.3+ucx-1.8.0 and OpenMPI 4.1.2+ucx-1.11.2 and launched computations from small to large problems (meshes). For small meshes, it goes fine. But when we reach nearly 2^31 faces in the 3D mesh we are using and call ParMETIS_V3_PartMeshKway, we always get a segfault with the same backtrace pointing into the UCX library:

Wed Jun 1 23:04:54 2022:chrono::InterfaceParMetis::ParMETIS_V3_PartMeshKway::debut VmSize: 1202304 VmRSS: 349456 VmPeak: 1211736 VmData: 500764 VmHWM: 359012
Wed Jun 1 23:07:07 2022:Erreur : MEF++ Signal recu : 11 : segmentation violation
Wed Jun 1 23:07:07 2022:Erreur :
Wed Jun 1 23:07:07 2022:-- (Début des informations destinées aux développeurs C++) --
Wed Jun 1 23:07:07 2022:La pile d'appels contient 27 symboles.
Wed Jun 1 23:07:07 2022:# 000: reqBacktrace(std::__cxx11::basic_string, std::allocator >&) >>> probGD.opt (probGD.opt(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x71) [0x4119f1]) Wed Jun 1 23:07:07 2022:# 001: attacheDebugger() >>> probGD.opt (probGD.opt(_Z15attacheDebuggerv+0x29a) [0x41386a]) Wed Jun 1 23:07:07 2022:# 002: /gpfs/fs0/project/d/deteix/ericc/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x1f9f) [0x2ab3aef0e5cf] Wed Jun 1 23:07:07 2022:# 003: /lib64/libc.so.6(+0x36400) [0x2ab3bd59a400] Wed Jun 1 23:07:07 2022:# 004: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(ucp_dt_pack+0x123) [0x2ab3c966e353] Wed Jun 1 23:07:07 2022:# 005: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(+0x536b7) [0x2ab3c968d6b7] Wed Jun 1 23:07:07 2022:# 006: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/ucx/libuct_ib.so.0(uct_dc_mlx5_ep_am_bcopy+0xd7) [0x2ab3ca712137] Wed Jun 1 23:07:07 2022:# 007: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(+0x52d3c) [0x2ab3c968cd3c] Wed Jun 1 23:07:07 2022:# 008: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(ucp_tag_send_nbx+0x5ad) [0x2ab3c9696dcd] Wed Jun 1 23:07:07 2022:# 009: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf2) [0x2ab3c922e0b2] Wed Jun 1 23:07:07 2022:# 010: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0x92) [0x2ab3bbca5a32] Wed Jun 1 23:07:07 2022:# 011: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(ompi_coll_base_alltoallv_intra_pairwise+0x141) [0x2ab3bbcad941] Wed Jun 1 23:07:07 2022:# 012: /scinet/niagara/software/2022
Re: [OMPI users] Segfault in ucp_dt_pack function from UCX library 1.8.0 and 1.11.2 for large sized communications using both OpenMPI 4.0.3 and 4.1.2
6(+0x38980) [0x2aaeb98e2980] Wed May 25 21:34:02 2022:# 004: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucp.so.0(ucp_dt_pack+0x13b) [0x2aaebd3d407b] Wed May 25 21:34:02 2022:# 005: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucp.so.0(+0x3872a) [0x2aaebd3e472a] Wed May 25 21:34:02 2022:# 006: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/ucx/libuct_ib.so.0(uct_dc_mlx5_ep_am_bcopy+0xd3) [0x2aaebd6a4713] Wed May 25 21:34:02 2022:# 007: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucp.so.0(+0x38ffc) [0x2aaebd3e4ffc] Wed May 25 21:34:02 2022:# 008: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucp.so.0(ucp_tag_send_nbr+0x511) [0x2aaebd3f7b91] Wed May 25 21:34:02 2022:# 009: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xbb) [0x2aaea87132eb] Wed May 25 21:34:02 2022:# 010: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0x8c) [0x2aaeb955d90c] Wed May 25 21:34:02 2022:# 011: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40(ompi_coll_base_alltoallv_intra_pairwise+0x13f) [0x2aaeb9562eff] Wed May 25 21:34:02 2022:# 012: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40(MPI_Alltoallv+0x1a3) [0x2aaeb9511be3] Wed May 25 21:34:02 2022:# 013: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/petsc-pardiso-64bits/3.17.1/lib/libstrumpack.so(libparmetis__gkMPI_Alltoallv+0x108) [0x2aaeb15b1ca8] Wed May 25 21:34:02 2022:# 014: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/petsc-pardiso-64bits/3.17.1/lib/libpetsc.so.3.17(ParMETIS_V3_Mesh2Dual+0x10af) [0x2aaeb081c98f] Wed May 25 21:34:02 2022:# 015: probGD.opt(ParMETIS_V3_PartMeshKway+0x100) [0x432680] Have you also tried to run large sized problems test with OMPI-5.0.x? Not for large problems, only small ones without UCX... I am not compiling my own mpi on Compute Canada clusters, but I do in our lab for our nightly validation tests. Regarding the application, at some point it invokes MPI_Alltoallv sending more than 2GB to some of the ranks (using derived dt), right? I have to track down the very specific call, but I am not sure it is sending 2GB to a specific rank but maybe have 2GB divided between many rank. The fact is that this part of the code, when it works, does not create such a bump in memory usage... But I have to dig a bit more... Regards, Eric //WBR, Mikhail *From:* users *On Behalf Of *Eric Chamberland via users *Sent:* Thursday, June 2, 2022 5:31 AM *To:* Open MPI Users *Cc:* Eric Chamberland ; Thomas Briffard ; Vivien Clauzon ; dave.mar...@giref.ulaval.ca; Ramses van Zon ; charles.coulomb...@ulaval.ca *Subject:* [OMPI users] Segfault in ucp_dt_pack function from UCX library 1.8.0 and 1.11.2 for large sized communications using both OpenMPI 4.0.3 and 4.1.2 Hi, In the past, we have successfully launched large sized (finite elements) computations using PARMetis as mesh partitioner. It was first in 2012 with OpenMPI (v2.?) and secondly in March 2019 with OpenMPI 3.1.2 that we succeeded. Today, we have a bunch of nightly (small) tests running nicely and testing all of OpenMPI (4.0.x, 4.1.x and 5.0x), MPICH-3.3.2 and IntelMPI 2021.6. 
Preparing for launching the same computation we did in 2012, and even larger ones, we compiled with bot OpenMPI 4.0.3+ucx-1.8.0 and OpenMPI 4.1.2+ucx-1.11.2 and launched computation from small to large problems (meshes). For small meshes, it goes fine. But when we reach near 2^31 faces into the 3D mesh we are using and call ParMETIS_V3_PartMeshKway, we always get a segfault with the same backtrace pointing into ucx library: Wed Jun 1 23:04:54 2022:chrono::InterfaceParMetis::ParMETIS_V3_PartMeshKway::debut VmSize: 1202304 VmRSS: 349456 VmPeak: 1211736 VmData: 500764 VmHWM: 359012 Wed Jun 1 23:07:07 2022:Erreur : MEF++ Signal recu : 11 : segmentation violation Wed Jun 1 23:07:07 2022:Erreur : Wed Jun 1 23:07:07 2022:-- (Début des informations destinées aux développeurs C++) -- Wed Jun 1 23:07:07 2022:La pile d'appels contient 27 symboles. Wed Jun 1 23:07:07 2022:# 000: reqBacktrace(std::__cxx11::basic_string, std::allocator >&) >>> probGD.opt (probGD.opt(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x71) [0x4119f1]) Wed Jun 1 23:07:07 2022:# 001: attacheDebugger() >>> probGD.opt (probGD.opt(_Z15attacheDebuggerv+0x29a) [0x41386a]) Wed Jun 1 23:07:07 2022:# 002: /gpfs/fs0/project/d/deteix/ericc/GIREF/lib/libgiref_opt_Util.so(t
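To answer Mikhail's question above precisely (does any single rank really hand MPI_Alltoallv more than 2 GB for one destination, or is the 2 GB merely spread across many destinations?), the call site in the application can be instrumented with a few lines. This is a sketch only: the helper name is made up, it is not part of MEF++ or ParMETIS, and the byte math uses the datatype extent, which is only an approximation for non-contiguous derived datatypes.

#include <mpi.h>
#include <limits.h>
#include <stdio.h>

/* Before an MPI_Alltoallv, report the per-destination payload and this
 * rank's total outgoing payload in bytes, flagging anything that no longer
 * fits in a signed 32-bit count. */
static void check_alltoallv_counts(const int *sendcounts, MPI_Datatype dtype,
                                   MPI_Comm comm)
{
    int rank, size;
    MPI_Aint lb, extent;
    long long total = 0;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    MPI_Type_get_extent(dtype, &lb, &extent);

    for (int dest = 0; dest < size; ++dest) {
        long long bytes = (long long)sendcounts[dest] * (long long)extent;
        total += bytes;
        if (bytes > (long long)INT_MAX)
            printf("rank %d -> %d: %lld bytes exceed INT_MAX\n",
                   rank, dest, bytes);
    }
    printf("rank %d: total outgoing payload %lld bytes\n", rank, total);
}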
[OMPI users] Segfault in ucp_dt_pack function from UCX library 1.8.0 and 1.11.2 for large sized communications using both OpenMPI 4.0.3 and 4.1.2
Hi, In the past, we have successfully launched large sized (finite elements) computations using PARMetis as mesh partitioner. It was first in 2012 with OpenMPI (v2.?) and secondly in March 2019 with OpenMPI 3.1.2 that we succeeded. Today, we have a bunch of nightly (small) tests running nicely and testing all of OpenMPI (4.0.x, 4.1.x and 5.0x), MPICH-3.3.2 and IntelMPI 2021.6. Preparing for launching the same computation we did in 2012, and even larger ones, we compiled with bot OpenMPI 4.0.3+ucx-1.8.0 and OpenMPI 4.1.2+ucx-1.11.2 and launched computation from small to large problems (meshes). For small meshes, it goes fine. But when we reach near 2^31 faces into the 3D mesh we are using and call ParMETIS_V3_PartMeshKway, we always get a segfault with the same backtrace pointing into ucx library: Wed Jun 1 23:04:54 2022:chrono::InterfaceParMetis::ParMETIS_V3_PartMeshKway::debut VmSize: 1202304 VmRSS: 349456 VmPeak: 1211736 VmData: 500764 VmHWM: 359012 Wed Jun 1 23:07:07 2022:Erreur : MEF++ Signal recu : 11 : segmentation violation Wed Jun 1 23:07:07 2022:Erreur : Wed Jun 1 23:07:07 2022:-- (Début des informations destinées aux développeurs C++) -- Wed Jun 1 23:07:07 2022:La pile d'appels contient 27 symboles. Wed Jun 1 23:07:07 2022:# 000: reqBacktrace(std::__cxx11::basic_string, std::allocator >&) >>> probGD.opt (probGD.opt(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x71) [0x4119f1]) Wed Jun 1 23:07:07 2022:# 001: attacheDebugger() >>> probGD.opt (probGD.opt(_Z15attacheDebuggerv+0x29a) [0x41386a]) Wed Jun 1 23:07:07 2022:# 002: /gpfs/fs0/project/d/deteix/ericc/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x1f9f) [0x2ab3aef0e5cf] Wed Jun 1 23:07:07 2022:# 003: /lib64/libc.so.6(+0x36400) [0x2ab3bd59a400] Wed Jun 1 23:07:07 2022:# 004: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(ucp_dt_pack+0x123) [0x2ab3c966e353] Wed Jun 1 23:07:07 2022:# 005: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(+0x536b7) [0x2ab3c968d6b7] Wed Jun 1 23:07:07 2022:# 006: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/ucx/libuct_ib.so.0(uct_dc_mlx5_ep_am_bcopy+0xd7) [0x2ab3ca712137] Wed Jun 1 23:07:07 2022:# 007: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(+0x52d3c) [0x2ab3c968cd3c] Wed Jun 1 23:07:07 2022:# 008: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(ucp_tag_send_nbx+0x5ad) [0x2ab3c9696dcd] Wed Jun 1 23:07:07 2022:# 009: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf2) [0x2ab3c922e0b2] Wed Jun 1 23:07:07 2022:# 010: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0x92) [0x2ab3bbca5a32] Wed Jun 1 23:07:07 2022:# 011: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(ompi_coll_base_alltoallv_intra_pairwise+0x141) [0x2ab3bbcad941] Wed Jun 1 23:07:07 2022:# 012: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_alltoallv_intra_dec_fixed+0x42) [0x2ab3d4836da2] Wed Jun 1 23:07:07 2022:# 013: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(PMPI_Alltoallv+0x29) [0x2ab3bbc7bdf9] Wed Jun 1 23:07:07 2022:# 014: /scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/lib/libparmetis.so(libparmetis__gkMPI_Alltoallv+0x106) [0x2ab3bb0e1c06] Wed Jun 1 
23:07:07 2022:# 015: /scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/lib/libparmetis.so(ParMETIS_V3_Mesh2Dual+0xdd6) [0x2ab3bb0f10b6] Wed Jun 1 23:07:07 2022:# 016: /scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/lib/libparmetis.so(ParMETIS_V3_PartMeshKway+0x100) [0x2ab3bb0f1ac0] PARMetis is compiled as part of PETSc-3.17.1 with 64bit indices. Here are PETSc configure options: --prefix=/scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1 COPTFLAGS=\"-O2 -march=native\" CXXOPTFLAGS=\"-O2 -march=native\" FOPTFLAGS=\"-O2 -march=native\" --download-fftw=1 --download-hdf5=1 --download-hypre=1 --download-metis=1 --download-mumps=1 --download-parmetis=1 --download-plapack=1 --download-prometheus=1 --download-ptscotch=1 --download-scotch=1 --download-sprng=1 --download-superlu_dist=1 --download-triangle=1 --with-avx512-kernels=1 --with-blaslapack-dir=/scinet/intel/oneapi/2021u4/mkl/2021.4.0 --with-cc=mpicc --with-cxx=mpicxx --with-cxx-dialect=C++11 --with-debugging=0 --with-fc=mpifort --with-mkl_pardiso-dir=/scinet/intel/oneapi/2021u4/mkl/2021.4.0 --with-scalapack=1 --with-scalapack-lib=\"[/scinet/intel/oneapi/2021u4/mkl/2021.4.0/lib/intel64/libmkl_scal
Re: [OMPI users] Status of pNFS, CephFS and MPI I/O
Thanks for your answer, Edgar!

In fact, we are able to use NFS, and certainly any POSIX file system, on a single-node basis. I should have asked: what are the supported file systems for *multiple-node* read/write access to files?

MPI I/O is known to *not* work on NFS when using multiple nodes... except for NFS v3 with the "noac" mount option (we are about to test with the "actimeo=0" option to see if it works). By the way, does OpenMPI's MPI I/O have some "hidden" (MCA?) options to make a multiple-node NFS cluster work?

Thanks,

Eric

On 2021-09-23 1:57 p.m., Gabriel, Edgar wrote:

Eric, generally speaking, ompio should be able to operate correctly on all file systems that have support for POSIX functions. The generic ufs component is, for example, being used on BeeGFS parallel file systems without problems; we are using that on a daily basis. For GPFS, the only reason we handle that file system separately is because of some custom info objects that can be used to configure the file during file_open. If one did not use these info objects, the generic ufs component would be as good as the GPFS-specific component. Note that the generic ufs component is also being used for NFS; it has logic built in to recognize an NFS file system and handle some operations slightly differently (but still relying on POSIX functions).

The one big exception is Lustre: due to its different file-locking strategy we are required to use a different collective I/O component (dynamic_gen2 vs. vulcan). Generic ufs would work on Lustre, too, but it would be horribly slow.

I cannot comment on CephFS and pNFS since I do not have access to those file systems; it would come down to testing them.

Thanks
Edgar

-----Original Message-----
From: users On Behalf Of Eric Chamberland via users
Sent: Thursday, September 23, 2021 9:28 AM
To: Open MPI Users
Cc: Eric Chamberland; Vivien Clauzon
Subject: [OMPI users] Status of pNFS, CephFS and MPI I/O

Hi,

I am looking around for information about the parallel filesystems supported for MPI I/O. Clearly, GPFS and Lustre are fully supported, but what about others?

- CephFS
- pNFS
- Other?

When I "grep" for "pnfs\|cephfs" in the ompi source code, I find nothing... Otherwise, I found this in ompi/mca/common/ompio/common_ompio.h:

enum ompio_fs_type { NONE = 0, UFS = 1, PVFS2 = 2, LUSTRE = 3, PLFS = 4, IME = 5, GPFS = 6 };

Does that mean that other fs types (pNFS, CephFS) do not need special treatment, or that they are not supported, or not optimally supported?

Thanks,

Eric

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
[OMPI users] Status of pNFS, CephFS and MPI I/O
Hi,

I am looking around for information about the parallel filesystems supported for MPI I/O. Clearly, GPFS and Lustre are fully supported, but what about others?

- CephFS
- pNFS
- Other?

When I "grep" for "pnfs\|cephfs" in the ompi source code, I find nothing... Otherwise, I found this in ompi/mca/common/ompio/common_ompio.h:

enum ompio_fs_type { NONE = 0, UFS = 1, PVFS2 = 2, LUSTRE = 3, PLFS = 4, IME = 5, GPFS = 6 };

Does that mean that other fs types (pNFS, CephFS) do not need special treatment, or that they are not supported, or not optimally supported?

Thanks,

Eric

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
[OMPI users] Error code for I/O operations
Hi,

I have a simple question about the error codes returned by the MPI_File_*_all* and MPI_File_open/close functions: if an error is returned, will it be the same for *all* processes? In other words, are error codes communicated under the hood, so that we, the end users, can avoid adding a "reduce" on those error codes?

Thanks,

Eric

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
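If the answer turns out to be that the codes are purely local, the defensive pattern is cheap. A sketch of that pattern (the helper name is made up); it relies only on the fact that the default error handler on files is MPI_ERRORS_RETURN, so the codes come back instead of aborting, and on combining a failure flag rather than the codes themselves.

#include <mpi.h>
#include <stdio.h>

/* Collectively decide whether a file operation failed on any rank.
 * Returns 1 if at least one rank saw an error, 0 otherwise. */
static int anyRankFailed(int localErrorCode, MPI_Comm comm)
{
    int localFailed = (localErrorCode != MPI_SUCCESS) ? 1 : 0;
    int anyFailed = 0;
    MPI_Allreduce(&localFailed, &anyFailed, 1, MPI_INT, MPI_MAX, comm);
    return anyFailed;
}

int main(int argc, char **argv)
{
    MPI_File fh;
    int ierr;

    MPI_Init(&argc, &argv);
    ierr = MPI_File_open(MPI_COMM_WORLD, "does_not_exist.dat",
                         MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    if (anyRankFailed(ierr, MPI_COMM_WORLD)) {
        /* Every rank takes this branch together, even if only some of them
         * actually received an error code locally. */
        if (ierr == MPI_SUCCESS)
            MPI_File_close(&fh);
    } else {
        MPI_File_close(&fh);
    }
    MPI_Finalize();
    return 0;
}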
Re: [OMPI users] SLURM seems to ignore --output-filename option of OpenMPI
Hi,

ok, I think I just completely missed a default behavior change in 3.x and later: "--output-filename foo" now generates a *directory* named foo. Before 3.x, it produced *files* named "foo.1.rank".

Is there a way to make --output-filename behave as it did in versions 2.x and before?

Thanks,

Eric

On 2019-09-30 3:34 p.m., Eric Chamberland via users wrote:

Hi,

I am using OpenMPI 3.1.2 with Slurm 17.11.12 and it looks like the "--output-filename" option is not taken into account: all my outputs are going into Slurm's output files. Could it be overridden or ignored by a Slurm configuration? How can I bypass that?

Strangely, the "--timestamp-output" option seems to work well...

Thanks,

Eric
[OMPI users] SLURM seems to ignore --output-filename option of OpenMPI
Hi,

I am using OpenMPI 3.1.2 with Slurm 17.11.12 and it looks like the "--output-filename" option is not taken into account: all my outputs are going into Slurm's output files. Could it be overridden or ignored by a Slurm configuration? How can I bypass that?

Strangely, the "--timestamp-output" option seems to work well...

Thanks,

Eric

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42