Hi, please open an issue on GitHub at https://github.com/open-mpi/ompi/issues and provide the requested information.
If the compilation failed when configured with --enable-debug, please share the logs.

The name of the WRF subroutine suggests the crash might occur in MPI_Comm_split(). If so, are you able to craft a reproducer that causes the crash? How many nodes and MPI tasks are needed to trigger the crash?

Cheers,

Gilles

On Wed, Jan 31, 2024 at 10:09 PM afernandez via users <users@lists.open-mpi.org> wrote:

> Hello Joseph,
> Sorry for the delay but I didn't know if I was missing something yesterday evening and wanted to double check everything this morning. This is for WRF but other apps exhibit the same behavior.
> * I had no problem with the serial version (and gdb obviously didn't report any issue).
> * I tried compiling with the --enable-debug flag but it was generating errors during the compilation and never completed.
> * I went back to my standard flags for debugging: -g -fbacktrace -ggdb -fcheck=bounds,do,mem,pointer -ffpe-trap=invalid,zero,overflow. WRF is still crashing with little extra info vs yesterday:
>
> Backtrace for this error:
> #0  0x7f5a4e54451f in ???
>     at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
> #1  0x7f5a4e5a73fe in __GI___libc_free
>     at ./malloc/malloc.c:3368
> #2  0x7f5a4c7aa5c3 in ???
> #3  0x7f5a4e83b048 in ???
> #4  0x7f5a4e7d3ef1 in ???
> #5  0x7f5a4e8dab7b in ???
> #6  0x8f6bbf in __module_dm_MOD_split_communicator
>     at /home/ubuntu/WRF-4.5.2/frame/module_dm.f90:5734
> #7  0x1879ebd in init_modules_
>     at /home/ubuntu/WRF-4.5.2/share/init_modules.f90:63
> #8  0x406fe4 in __module_wrf_top_MOD_wrf_init
>     at ../main/module_wrf_top.f90:130
> #9  0x405ff3 in wrf
>     at /home/ubuntu/WRF-4.5.2/main/wrf.f90:22
> #10  0x40605c in main
>     at /home/ubuntu/WRF-4.5.2/main/wrf.f90:6
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 0 on node ip-172-31-31-163 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Any pointers on what might be going on here, as this never happened with OMPIv4?
> Thanks.
>
> Joseph Schuchart via users wrote:
>
> > Hello,
> >
> > This looks like memory corruption. Do you have more details on what your app is doing? I don't see any MPI calls inside the call stack. Could you rebuild Open MPI with debug information enabled (by adding `--enable-debug` to configure)? If this error occurs on singleton runs (1 process) then you can easily attach gdb to it to get a better stack trace. Also, valgrind may help pin down the problem by telling you which memory block is being free'd here.
> >
> > Thanks
> > Joseph
> >
> > On 1/30/24 07:41, afernandez via users wrote:
> >
> > > Hello,
> > > I upgraded one of the systems to v5.0.1 and have compiled everything exactly as dozens of previous times with v4. I wasn't expecting any issue (and the compilations didn't report anything out of the ordinary) but running several apps has resulted in error messages such as:
> > >
> > > Backtrace for this error:
> > > #0  0x7f7c9571f51f in ???
> > >     at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
> > > #1  0x7f7c957823fe in __GI___libc_free
> > >     at ./malloc/malloc.c:3368
> > > #2  0x7f7c93a635c3 in ???
> > > #3  0x7f7c95f84048 in ???
> > > #4  0x7f7c95f1cef1 in ???
> > > #5  0x7f7c95e34b7b in ???
> > > #6  0x6e05be in ???
> > > #7  0x6e58d7 in ???
> > > #8  0x405d2c in ???
> > > #9  0x7f7c95706d8f in __libc_start_call_main
> > >     at ../sysdeps/nptl/libc_start_call_main.h:58
> > > #10  0x7f7c95706e3f in __libc_start_main_impl
> > >     at ../csu/libc-start.c:392
> > > #11  0x405d64 in ???
> > > #12  0xffffffffffffffff in ???
> > >
> > > OS is Ubuntu 22.04, OpenMPI was built with GCC13.2, and before building OpenMPI, I had previously built the hwloc (2.10.0) library at /usr/lib/x86_64-linux-gnu. Maybe I'm missing something pretty basic, but the problem seems to be related to memory allocation.
> > > Thanks.