Re: [OMPI users] OMPI 4.0.1 valgrind error on simple MPI_Send()

2019-04-30 Thread David Mathog via users

On 30-Apr-2019 11:39, George Bosilca wrote:

> Depending on the alignment of the different types there might be small
> holes in the low-level headers we exchange between processes. It should
> not be a concern for users.
>
> valgrind should not stop on the first detected issue except
> if --exit-on-first-error has been provided (the default value should be
> no), so the SIGTERM might be generated for some other reason. What is
> at jackhmmer.c:1597 ?


Ah, my bad, I didn't notice that the exit was at a different line number
than the error message.

1597 is a coded exit on an error condition (triggered by the omission in
the test of a parameter needed by the modified jackhmmer).  So you are
right, it did not in fact fail at the first error.

This suppression file:

cat >/usr/common/tmp 
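For a send-side report like the one in the log, a Memcheck suppression
entry would look roughly like this (a sketch: the entry name is
arbitrary, and the frame list follows the backtrace from the log):

```
{
   ompi_btl_tcp_send_uninit
   Memcheck:Param
   socketcall.sendto(msg)
   fun:send
   fun:mca_btl_tcp_send_blocking
   fun:mca_btl_tcp_endpoint_send_handler
}
```

A `...` wildcard frame can be added if the stack varies between reports.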

Re: [OMPI users] OMPI 4.0.1 valgrind error on simple MPI_Send()

2019-04-30 Thread George Bosilca via users
Depending on the alignment of the different types there might be small
holes in the low-level headers we exchange between processes. It should
not be a concern for users.

valgrind should not stop on the first detected issue except
if --exit-on-first-error has been provided (the default value should be
no), so the SIGTERM might be generated for some other reason. What is
at jackhmmer.c:1597 ?
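One way to obtain suppression entries for the remaining reports (a
sketch, using standard Memcheck options and the paths from the original
command) is to let valgrind generate them:

```
valgrind --gen-suppressions=all \
  --suppressions=/opt/ompi401/share/openmpi/openmpi-valgrind.supp \
  --log-file=nc.vg.%p \
  /usr/common/tmp/jackhmmer ...
```

Each reported error is then followed by a ready-to-paste suppression
block; the relevant ones can be appended to a local .supp file and
passed with an additional --suppressions= option (valgrind accepts more
than one).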

  George.


On Tue, Apr 30, 2019 at 2:27 PM David Mathog via users <
users@lists.open-mpi.org> wrote:

> Attempting to debug a complex program (99.9% of which is others' code)
> which stops running when run in valgrind as follows:
>
> mpirun -np 10 \
>--hostfile /usr/common/etc/openmpi.machines.LINUX_INTEL_newsaf_rev2 \
>--mca plm_rsh_agent rsh \
>/usr/bin/valgrind \
>  --leak-check=full \
>  --leak-resolution=high \
>  --show-reachable=yes \
>  --log-file=nc.vg.%p \
>  --suppressions=/opt/ompi401/share/openmpi/openmpi-valgrind.supp \
> /usr/common/tmp/jackhmmer  \
>--tformat ncbi \
>-T 150  \
>--chkhmm jackhmmer_test \
>--mpi \
>~safrun/a1hu.pfa \
>/usr/common/tmp/testing/nr_lcl \
>>jackhmmer_test_mpi.out 2>jackhmmer_test_mpi.stderr &
>
> Every one of the nodes has a variant of this in the log file (followed
> by a long list
> of memory allocation errors, since it exits without being able to clean
> anything up):
>
> ==5135== Memcheck, a memory error detector
> ==5135== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
> ==5135== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright
> info
> ==5135== Command: /usr/common/tmp/jackhmmer --tformat ncbi -T 150
> --chkhmm jackhmmer_test --mpi /ulhhmi/safrun/a1hu.pfa
> /usr/common/tmp/testing/nr_lcl
> ==5135== Parent PID: 5119
> ==5135==
> ==5135== Syscall param socketcall.sendto(msg) points to uninitialised
> byte(s)
> ==5135==at 0x5459BFB: send (in /usr/lib64/libpthread-2.17.so)
> ==5135==by 0xF84A282: mca_btl_tcp_send_blocking (in
> /opt/ompi401/lib/openmpi/mca_btl_tcp.so)
> ==5135==by 0xF84E414: mca_btl_tcp_endpoint_send_handler (in
> /opt/ompi401/lib/openmpi/mca_btl_tcp.so)
> ==5135==by 0x5D6E4EF: event_persist_closure (event.c:1321)
> ==5135==by 0x5D6E4EF: event_process_active_single_queue
> (event.c:1365)
> ==5135==by 0x5D6E4EF: event_process_active (event.c:1440)
> ==5135==by 0x5D6E4EF: opal_libevent2022_event_base_loop
> (event.c:1644)
> ==5135==by 0x5D2465F: opal_progress (in
> /opt/ompi401/lib/libopen-pal.so.40.20.1)
> ==5135==by 0xF36A9CC: ompi_request_wait_completion (in
> /opt/ompi401/lib/openmpi/mca_pml_ob1.so)
> ==5135==by 0xF36C30E: mca_pml_ob1_send (in
> /opt/ompi401/lib/openmpi/mca_pml_ob1.so)
> ==5135==by 0x51BC581: PMPI_Send (in
> /opt/ompi401/lib/libmpi.so.40.20.1)
> ==5135==by 0x40B46E: mpi_worker (jackhmmer.c:1560)
> ==5135==by 0x406726: main (jackhmmer.c:413)
> ==5135==  Address 0x1ffefff8d5 is on thread 1's stack
> ==5135==  in frame #2, created by mca_btl_tcp_endpoint_send_handler
> (???:)
> ==5135==
> ==5135==
> ==5135== Process terminating with default action of signal 15 (SIGTERM)
> ==5135==at 0x5459EFD: ??? (in /usr/lib64/libpthread-2.17.so)
> ==5135==by 0x408817: mpi_failure (jackhmmer.c:887)
> ==5135==by 0x40B708: mpi_worker (jackhmmer.c:1597)
> ==5135==by 0x406726: main (jackhmmer.c:413)
>
> jackhmmer line 1560 is just this:
>
>
>  MPI_Send(&status, 1, MPI_INT, 0, HMMER_SETUP_READY_TAG,
> MPI_COMM_WORLD);
>
> preceded at varying distances by:
>
>int  status   = eslOK;
>status = 0;
>
> I can see why MPI might have some uninitialized bytes in that send, for
> instance, if it has a minimum buffer size it will send or something like
> that.  The problem is that it completely breaks valgrind in this
> application because valgrind exits immediately when it sees this error.
> The suppression file supplied with the release does not prevent that.
>
> How do I work around this?
>
> Thank you,
>
> David Mathog
> mat...@caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
