Hi,

I'm new to Valgrind. My goal is to investigate a possible memory problem
in a large parallel MPI+OpenMP code.

I've cloned Valgrind from git and built it with GCC 7.3 and Open MPI 3.1
for mpicc (my application is built with the same environment). I'm using
these two configure options:

--enable-only64bit --with-mpicc=$(which mpicc)

"mpirun -np 8 my_application" runs on my fat node (a small process
count just for the test; it uses nearly 60GB of the node's more than
1TB of RAM). It fails after a few tens of iterations.

"mpirun -np 8 valgrind /bin/hostname" works too, so Valgrind seems to
work with Open MPI 3.1 compiled with GCC 7.3.

But "mpirun -np 8 valgrind ./my_application" immediately fails with:

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:
vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0xFD 0x8 0x6F 0x5
0x25 0xA8 0x18 0x0
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==377969== valgrind: Unrecognised instruction at address 0xabf9581.
==377969==    at 0xABF9581: opal_pointer_array_construct (in
/opt/openmpi-GCC73/v3.1.x-20181010/lib/libopen-pal.so.40.10.3)
==377969==    by 0xAC1BA78: mca_base_var_init (in
/opt/openmpi-GCC73/v3.1.x-20181010/lib/libopen-pal.so.40.10.3)
==377969==    by 0xABFDE39: opal_init_util (in
/opt/openmpi-GCC73/v3.1.x-20181010/lib/libopen-pal.so.40.10.3)
==377969==    by 0x911AD60: ompi_mpi_init (in
/opt/openmpi-GCC73/v3.1.x-20181010/lib/libmpi.so.40.10.3)
==377969==    by 0x914BB34: PMPI_Init_thread (in
/opt/openmpi-GCC73/v3.1.x-20181010/lib/libmpi.so.40.10.3)
==377969==    by 0x8E97C1F: MPI_INIT_THREAD (in
/opt/openmpi-GCC73/v3.1.x-20181010/lib/libmpi_mpifh.so.40.11.2)
==377969==    by 0x543066: __mpi_m_MOD_init_mpi (mpi_m.f90:140)
==377969==    by 0x411447: __yales2_m_MOD_init_yales2_env (yales2_m.f90:511)
==377969==    by 0x411595: __yales2_m_MOD_run_yales2 (yales2_m.f90:378)
==377969==    by 0x40B9E0: MAIN__ (3D_cylinder.f90:20)
==377969==    by 0x40B9E0: main (3D_cylinder.f90:8)
==377969== Your program just tried to execute an instruction that Valgrind
==377969== did not recognise.  There are two possible reasons for this.
==377969== 1. Your program has a bug and erroneously jumped to a non-code
==377969==    location.  If you are running Memcheck and you just saw a
==377969==    warning about a bad jump, it's probably your program's fault.
==377969== 2. The instruction is legitimate but Valgrind doesn't handle it,
==377969==    i.e. it's Valgrind's fault.  If you think this is the case or
==377969==    you are not sure, please let us know and we'll try to fix it.
==377969== Either way, Valgrind will now raise a SIGILL signal which will
==377969== probably kill your program.
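
As a side check on the trace above, here is a small sketch of how the reported bytes can be read. It is my own reading, not something Valgrind prints: in 64-bit mode a leading 0x62 byte is the EVEX prefix used by AVX-512 encodings, so this case may simply be an AVX-512 instruction in libopen-pal. The helper name parse_unhandled_bytes is hypothetical.

```python
import re

# Assumption: in 64-bit mode, 0x62 as the first byte marks an EVEX-encoded
# (AVX-512) instruction.
EVEX_PREFIX = 0x62

def parse_unhandled_bytes(log_text):
    """Return the bytes listed after 'unhandled instruction bytes:' as ints."""
    # The byte list may wrap onto the next line, so allow any whitespace
    # (including newlines) between the 0x.. tokens.
    match = re.search(
        r"unhandled instruction bytes:((?:\s+0x[0-9A-Fa-f]+)+)", log_text
    )
    if match is None:
        return []
    return [int(token, 16) for token in match.group(1).split()]

# Excerpt copied from the Valgrind output above.
log = """vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0xFD 0x8 0x6F 0x5
0x25 0xA8 0x18 0x0
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0"""

insn = parse_unhandled_bytes(log)
looks_like_evex = bool(insn) and insn[0] == EVEX_PREFIX
print([hex(b) for b in insn], "EVEX-prefixed:", looks_like_evex)
```

If that reading is right, the instruction would fall under case 2 of Valgrind's message (legitimate but unhandled) rather than a bad jump in my code.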

Maybe I've missed something?

I'm using the master branch. The VALGRIND_3_16_BRANCH branch, which I
also tested, does not build:

make: *** No rule to make target 'exp-sgcheck.supp',
needed by 'default.supp'.  Stop.

Thanks for your help

Patrick



_______________________________________________
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users
