I have tried valgrind 3.17.0 with openmpi 4.0.2, and it works.
Do you know of any reported bugs with that specific version (openmpi 4.0.5)?
Regards,
Federico Tesser
On Wed, 07 Jul 2021 10:25:52 +0200
"TESSER FEDERICO" <federico.tes...@polito.it> wrote:
Good morning.
I have installed valgrind 3.17.0, having previously loaded the module
for openmpi 4.0.5, so it found the "MPI2-compliant mpicc and mpi.h...".
However, trying to run just a simple program like this one:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    int world_size;
    int world_rank;
    int name_len;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    // Query the number of processes, this process's rank, and the host name
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    // Shut down the MPI environment
    MPI_Finalize();
    return 0;
}
will produce the following errors:
==113228== Memcheck, a memory error detector
==113228== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==113228== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==113228== Command: ./pure_mpi_valgrind_try/a.out
==113228==
valgrind MPI wrappers 113228: Active for pid 113228
valgrind MPI wrappers 113228: Try MPIWRAP_DEBUG=help for possible options
vex amd64->IR: unhandled instruction bytes: 0x62 0xF2 0x7D 0x8 0x7C 0xC5 0xC5 0xF9 0xD6 0x43
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
==113228== valgrind: Unrecognised instruction at address 0x5c79318.
==113228== at 0x5C79318: opal_pointer_array_init (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
==113228== by 0x5CA4BDB: mca_base_var_init (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
==113228== by 0x5C82F11: opal_init_util (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
==113228== by 0x5157FD9: ompi_mpi_init (ompi_mpi_init.c:428)
==113228== by 0x50FB3A8: PMPI_Init (pinit.c:69)
==113228== by 0x4E4BC26: PMPI_Init (libmpiwrap.c:2288)
==113228== by 0x10893B: main (main.c:6)
==113228== Your program just tried to execute an instruction that Valgrind
==113228== did not recognise. There are two possible reasons for this.
==113228== 1. Your program has a bug and erroneously jumped to a non-code
==113228== location. If you are running Memcheck and you just saw a
==113228== warning about a bad jump, it's probably your program's fault.
==113228== 2. The instruction is legitimate but Valgrind doesn't handle it,
==113228== i.e. it's Valgrind's fault. If you think this is the case or
==113228== you are not sure, please let us know and we'll try to fix it.
==113228== Either way, Valgrind will now raise a SIGILL signal which will
==113228== probably kill your program.
==113228==
==113228== Process terminating with default action of signal 4 (SIGILL): dumping core
==113228== Illegal opcode at address 0x5C79318
==113228== at 0x5C79318: opal_pointer_array_init (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
==113228== by 0x5CA4BDB: mca_base_var_init (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
==113228== by 0x5C82F11: opal_init_util (in /usr/local/openmpi-4.0.5/lib/libopen-pal.so.40.20.5)
==113228== by 0x5157FD9: ompi_mpi_init (ompi_mpi_init.c:428)
==113228== by 0x50FB3A8: PMPI_Init (pinit.c:69)
==113228== by 0x4E4BC26: PMPI_Init (libmpiwrap.c:2288)
==113228== by 0x10893B: main (main.c:6)
slurmstepd: error: *** JOB 159641 ON node01 CANCELLED AT 2021-07-07T10:21:29 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
slurmstepd: error: *** STEP 159641.0 ON node01 CANCELLED AT 2021-07-07T10:22:48 ***
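One possible lead in the output above: the unhandled instruction bytes start with 0x62, which on amd64 is the first byte of the four-byte EVEX prefix used by AVX-512 encodings. This may mean that this Open MPI 4.0.5 build contains AVX-512 code, which, as far as I know, Valgrind 3.17.0 does not decode. The following is only a minimal sketch for picking such a byte sequence apart; `struct evex_info` and `decode_evex` are hypothetical names invented for this illustration (they are not part of Valgrind or Open MPI), and the field positions assume the original AVX-512 EVEX layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type for this sketch; not a Valgrind or Open MPI API. */
struct evex_info {
    int is_evex;       /* 1 if the bytes start with 0x62 and hold a full prefix + opcode */
    unsigned map;      /* opcode map bits: 1 = 0F, 2 = 0F38, 3 = 0F3A */
    unsigned pp;       /* implied legacy prefix: 0 = none, 1 = 66, 2 = F3, 3 = F2 */
    unsigned w;        /* EVEX.W bit */
    unsigned vec_len;  /* L'L bits: 0 = 128-bit, 1 = 256-bit, 2 = 512-bit */
    unsigned opcode;   /* opcode byte that follows the prefix */
};

/* Decode a few fields of an EVEX-prefixed instruction (64-bit mode assumed).
 * Layout assumed: byte 0 = 0x62, then payload bytes P0, P1, P2, then opcode. */
struct evex_info decode_evex(const unsigned char *bytes, size_t len) {
    struct evex_info info = {0, 0, 0, 0, 0, 0};
    assert(bytes != NULL);

    /* Need 0x62 + three payload bytes + one opcode byte. */
    if (len < 5 || bytes[0] != 0x62) {
        return info;
    }

    info.is_evex = 1;
    info.map     = bytes[1] & 0x03u;         /* P0 bits 1:0 */
    info.pp      = bytes[2] & 0x03u;         /* P1 bits 1:0 */
    info.w       = (bytes[2] >> 7) & 0x01u;  /* P1 bit 7 */
    info.vec_len = (bytes[3] >> 5) & 0x03u;  /* P2 bits 6:5 */
    info.opcode  = bytes[4];
    return info;
}
```

Applied to the ten bytes from the log (with 0x8 written as 0x08), this reports an EVEX prefix with map 2 (0F38), pp 1 (66), W 0, a 128-bit vector length, and opcode 0x7C. If that reading is right, rebuilding Open MPI without AVX-512 code generation might be worth trying, but I have not verified this.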
What am I doing wrong?
Regards,
Federico Tesser
_______________________________________________
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users