Depending on the alignment of the different types, there might be small
holes in the low-level headers we exchange between processes. It should not
be a concern for users.
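
To illustrate what such a hole looks like, here is a minimal sketch. The
struct and function names (demo_hdr, demo_hdr_padding, demo_hdr_init) are
hypothetical and are not Open MPI's actual header layout. On common ABIs the
compiler inserts padding after a 1-byte field so the following 4-byte field
is aligned; if only the named fields are assigned, those padding bytes stay
uninitialised, which is what Memcheck reports when the whole struct is handed
to send().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header for illustration only; NOT Open MPI's real layout. */
struct demo_hdr {
    uint8_t  type;    /* 1 byte */
    uint32_t length;  /* usually 4-byte aligned, leaving a hole after 'type' */
};

/* Number of bytes in the struct that belong to no member (the "holes"). */
static size_t demo_hdr_padding(void)
{
    return sizeof(struct demo_hdr) - sizeof(uint8_t) - sizeof(uint32_t);
}

/*
 * Zero the whole struct before filling in the fields. In practice this
 * leaves the padding bytes defined, so a tool such as Memcheck has nothing
 * to flag when the raw bytes are written to a socket.
 */
static void demo_hdr_init(struct demo_hdr *h, uint8_t type, uint32_t length)
{
    assert(h != NULL);
    memset(h, 0, sizeof *h);
    h->type   = type;
    h->length = length;
}
```

The padding size is platform dependent, so demo_hdr_padding() may return a
different value (or zero) on an unusual ABI. Zeroing a header like this costs
a few cycles per message, which is one plausible reason a library might skip
it on a fast path.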

Valgrind should not stop at the first detected issue unless
--exit-on-first-error has been provided (the default value should be
no), so the SIGTERM might have been generated for some other reason. What is
at jackhmmer.c:1597?

  George.


On Tue, Apr 30, 2019 at 2:27 PM David Mathog via users <
users@lists.open-mpi.org> wrote:

> Attempting to debug a complex program (99.9% of which is others' code)
> which stops running when run in valgrind as follows:
>
> mpirun -np 10 \
>    --hostfile /usr/common/etc/openmpi.machines.LINUX_INTEL_newsaf_rev2 \
>    --mca plm_rsh_agent rsh \
>    /usr/bin/valgrind \
>      --leak-check=full \
>      --leak-resolution=high \
>      --show-reachable=yes \
>      --log-file=nc.vg.%p \
>      --suppressions=/opt/ompi401/share/openmpi/openmpi-valgrind.supp \
>     /usr/common/tmp/jackhmmer  \
>        --tformat ncbi \
>        -T 150  \
>        --chkhmm jackhmmer_test \
>        --mpi \
>        ~safrun/a1hu.pfa \
>        /usr/common/tmp/testing/nr_lcl \
>        >jackhmmer_test_mpi.out 2>jackhmmer_test_mpi.stderr &
>
> Every one of the nodes has a variant of this in the log file (followed
> by a long list
> of memory allocation errors, since it exits without being able to clean
> anything up):
>
> ==5135== Memcheck, a memory error detector
> ==5135== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
> ==5135== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright
> info
> ==5135== Command: /usr/common/tmp/jackhmmer --tformat ncbi -T 150
> --chkhmm jackhmmer_test --mpi /ulhhmi/safrun
> /a1hu.pfa /usr/common/tmp/testing/nr_lcl
> ==5135== Parent PID: 5119
> ==5135==
> ==5135== Syscall param socketcall.sendto(msg) points to uninitialised
> byte(s)
> ==5135==    at 0x5459BFB: send (in /usr/lib64/libpthread-2.17.so)
> ==5135==    by 0xF84A282: mca_btl_tcp_send_blocking (in
> /opt/ompi401/lib/openmpi/mca_btl_tcp.so)
> ==5135==    by 0xF84E414: mca_btl_tcp_endpoint_send_handler (in
> /opt/ompi401/lib/openmpi/mca_btl_tcp.so)
> ==5135==    by 0x5D6E4EF: event_persist_closure (event.c:1321)
> ==5135==    by 0x5D6E4EF: event_process_active_single_queue
> (event.c:1365)
> ==5135==    by 0x5D6E4EF: event_process_active (event.c:1440)
> ==5135==    by 0x5D6E4EF: opal_libevent2022_event_base_loop
> (event.c:1644)
> ==5135==    by 0x5D2465F: opal_progress (in
> /opt/ompi401/lib/libopen-pal.so.40.20.1)
> ==5135==    by 0xF36A9CC: ompi_request_wait_completion (in
> /opt/ompi401/lib/openmpi/mca_pml_ob1.so)
> ==5135==    by 0xF36C30E: mca_pml_ob1_send (in
> /opt/ompi401/lib/openmpi/mca_pml_ob1.so)
> ==5135==    by 0x51BC581: PMPI_Send (in
> /opt/ompi401/lib/libmpi.so.40.20.1)
> ==5135==    by 0x40B46E: mpi_worker (jackhmmer.c:1560)
> ==5135==    by 0x406726: main (jackhmmer.c:413)
> ==5135==  Address 0x1ffefff8d5 is on thread 1's stack
> ==5135==  in frame #2, created by mca_btl_tcp_endpoint_send_handler
> (???:)
> ==5135==
> ==5135==
> ==5135== Process terminating with default action of signal 15 (SIGTERM)
> ==5135==    at 0x5459EFD: ??? (in /usr/lib64/libpthread-2.17.so)
> ==5135==    by 0x408817: mpi_failure (jackhmmer.c:887)
> ==5135==    by 0x40B708: mpi_worker (jackhmmer.c:1597)
> ==5135==    by 0x406726: main (jackhmmer.c:413)
>
> jackhmmer line 1560 is just this:
>
>
>          MPI_Send(&status, 1, MPI_INT, 0, HMMER_SETUP_READY_TAG, MPI_COMM_WORLD);
>
> preceded at varying distances by:
>
>    int              status   = eslOK;
>    status = 0;
>
> I can see why MPI might have some uninitialized bytes in that send, for
> instance, if it has a minimum buffer size it will send or something like
> that.  The problem is that it completely breaks valgrind in this
> application because valgrind exits immediately when it sees this error.
> The suppression file supplied with the release does not prevent that.
>
> How do I work around this?
>
> Thank you,
>
> David Mathog
> mat...@caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
