Re: [OMPI users] Seg error when using v5.0.1

2024-01-31 Thread Jeff Squyres (jsquyres) via users
No worries – glad you figured it out!

From: users on behalf of afernandez via users
Sent: Wednesday, January 31, 2024 10:56 AM
To: Open MPI Users
Cc: afernandez
Subject: Re: [OMPI users] Seg error when using v5.0.1

Hello,
I'm sorry; I totally messed up here. It turns out the problem was caused by a
previous installation of Open MPI (v4.1.6): the codes compiled against v5 were
being run with the mpirun from v4. I always set up the systems so that the OS
picks up the latest MPI version, but that apparently didn't take effect this
time, which led me to the wrong conclusion. I should have realized this earlier
and not wasted everyone's time. My apologies.
Arturo


Re: [OMPI users] Seg error when using v5.0.1

2024-01-31 Thread afernandez via users
Hello,
I'm sorry; I totally messed up here. It turns out the problem was caused by a
previous installation of Open MPI (v4.1.6): the codes compiled against v5 were
being run with the mpirun from v4. I always set up the systems so that the OS
picks up the latest MPI version, but that apparently didn't take effect this
time, which led me to the wrong conclusion. I should have realized this earlier
and not wasted everyone's time. My apologies.
Arturo
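
For anyone hitting the same mismatch, a quick sanity check (a generic sketch;
wrf.exe and the install locations are just placeholders) is to confirm which
runtime actually gets picked up before launching:

which mpirun                    # should point at the v5.0.1 install
mpirun --version                # Open MPI version of that mpirun
ompi_info | grep "Open MPI:"    # version of the ompi_info on the PATH
ldd ./wrf.exe | grep libmpi     # which libmpi the binary loads at run time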

Re: [OMPI users] Seg error when using v5.0.1

2024-01-31 Thread afernandez via users
Hi Gilles,
I created the ticket (#12296). The crash happened with either 1 or 2 MPI ranks
(I have not tried with more, but I doubt it would make any difference).
Thanks,
Arturo



Re: [OMPI users] Seg error when using v5.0.1

2024-01-31 Thread Gilles Gouaillardet via users
Hi,

Please open an issue on GitHub at https://github.com/open-mpi/ompi/issues
and provide the requested information.

If the compilation failed when configured with --enable-debug, please share
the logs.

The name of the WRF subroutine suggests the crash might occur in
MPI_Comm_split(); if so, are you able to craft a reproducer that triggers the
crash?

How many nodes and MPI tasks are needed to trigger the crash?


Cheers,

Gilles
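
A minimal reproducer along these lines might look like the sketch below (an
assumption, not the actual WRF code path; it only exercises MPI_Comm_split the
way the backtrace hints at). Build and run it with the same wrappers used for
WRF:

cat > split_test.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Split even and odd ranks into separate communicators. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &newcomm);
    printf("rank %d: split OK\n", rank);
    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}
EOF
mpicc split_test.c -o split_test
mpirun -np 2 ./split_test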



Re: [OMPI users] Seg error when using v5.0.1

2024-01-31 Thread afernandez via users

Hello Joseph,
Sorry for the delay, but I wasn't sure whether I was missing something yesterday
evening and wanted to double-check everything this morning. This is for WRF,
but other apps exhibit the same behavior.
* I had no problem with the serial version (and gdb, as expected, didn't report
any issue).
* I tried rebuilding with the --enable-debug flag, but the compilation generated
errors and never completed.
* I went back to my standard debugging flags: -g -fbacktrace -ggdb
-fcheck=bounds,do,mem,pointer -ffpe-trap=invalid,zero,overflow. WRF is still
crashing with little extra information compared to yesterday:
Backtrace for this error:
#0 0x7f5a4e54451f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x7f5a4e5a73fe in __GI___libc_free
at ./malloc/malloc.c:3368
#2 0x7f5a4c7aa5c3 in ???
#3 0x7f5a4e83b048 in ???
#4 0x7f5a4e7d3ef1 in ???
#5 0x7f5a4e8dab7b in ???
#6 0x8f6bbf in __module_dm_MOD_split_communicator
at /home/ubuntu/WRF-4.5.2/frame/module_dm.f90:5734
#7 0x1879ebd in init_modules_
at /home/ubuntu/WRF-4.5.2/share/init_modules.f90:63
#8 0x406fe4 in __module_wrf_top_MOD_wrf_init
at ../main/module_wrf_top.f90:130
#9 0x405ff3 in wrf
at /home/ubuntu/WRF-4.5.2/main/wrf.f90:22
#10 0x40605c in main
at /home/ubuntu/WRF-4.5.2/main/wrf.f90:6
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-31-163
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Any pointers on what might be going on here? This never happened with OMPI v4.
Thanks.
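
If the --enable-debug rebuild keeps failing, one way to capture the logs that
are useful for a report (a generic sketch; the install prefix is a placeholder):

./configure --prefix=$HOME/ompi-5.0.1-dbg --enable-debug 2>&1 | tee configure.out
make V=1 2>&1 | tee make.out
# attach configure.out, make.out, and the generated config.log to the report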

Re: [OMPI users] Seg error when using v5.0.1

2024-01-30 Thread afernandez via users

Hi Joseph,
It's happening with several apps, including WRF. I was hoping to find a quick
answer or fix, but it seems I'll have to recompile it in debug mode. I will
report back with the extra info.
Thanks.


Re: [OMPI users] Seg error when using v5.0.1

2024-01-30 Thread Joseph Schuchart via users

Hello,

This looks like memory corruption. Do you have more details on what your app is
doing? I don't see any MPI calls inside the call stack. Could you rebuild Open
MPI with debug information enabled (by adding `--enable-debug` to configure)?
If this error occurs in singleton runs (1 process), you can easily attach gdb
to the process to get a better stack trace. Also, valgrind may help pin down
the problem by telling you which memory block is being freed here.


Thanks
Joseph
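
In concrete terms, the steps suggested above might look roughly like the sketch
below (the install prefix and the wrf.exe name are placeholders, not taken from
this thread):

# debug rebuild of Open MPI
./configure --prefix=$HOME/ompi-5.0.1-debug --enable-debug
make -j 8 && make install

# singleton run directly under gdb (1 process, no mpirun needed)
gdb --args ./wrf.exe

# memcheck flags the bad free and shows where the block was allocated
valgrind ./wrf.exe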






[OMPI users] Seg error when using v5.0.1

2024-01-30 Thread afernandez via users

Hello,
I upgraded one of the systems to v5.0.1 and compiled everything exactly as in
dozens of previous builds with v4. I wasn't expecting any issues (and the
compilations didn't report anything out of the ordinary), but running several
apps has resulted in error messages such as:
Backtrace for this error:
#0 0x7f7c9571f51f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x7f7c957823fe in __GI___libc_free
at ./malloc/malloc.c:3368
#2 0x7f7c93a635c3 in ???
#3 0x7f7c95f84048 in ???
#4 0x7f7c95f1cef1 in ???
#5 0x7f7c95e34b7b in ???
#6 0x6e05be in ???
#7 0x6e58d7 in ???
#8 0x405d2c in ???
#9 0x7f7c95706d8f in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#10 0x7f7c95706e3f in __libc_start_main_impl
at ../csu/libc-start.c:392
#11 0x405d64 in ???
#12 0x in ???
OS is Ubuntu 22.04, Open MPI was built with GCC 13.2, and before building
Open MPI, I had built the hwloc (2.10.0) library at /usr/lib/x86_64-linux-gnu.
Maybe I'm missing something pretty basic, but the problem seems to be related
to memory allocation.
Thanks.
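
A configure line matching that description might look roughly like the
following (a sketch; the prefix and the versioned compiler names are
assumptions, and --with-hwloc expects the hwloc install prefix rather than the
library directory itself):

./configure --prefix=/opt/openmpi-5.0.1 --with-hwloc=/usr \
            CC=gcc-13 CXX=g++-13 FC=gfortran-13
make -j 8
sudo make install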