I am attempting to debug a complex program (99.9% of which is others' code)
that stops running when run under valgrind as follows:

mpirun -np 10 \
  --hostfile /usr/common/etc/openmpi.machines.LINUX_INTEL_newsaf_rev2 \
  --mca plm_rsh_agent rsh \
  /usr/bin/valgrind \
    --leak-check=full \
    --leak-resolution=high \
    --show-reachable=yes \
    --suppressions=/opt/ompi401/share/openmpi/openmpi-valgrind.supp \
   /usr/common/tmp/jackhmmer  \
      --tformat ncbi \
      -T 150  \
      --chkhmm jackhmmer_test \
      --mpi \
      ~safrun/a1hu.pfa \
      /usr/common/tmp/testing/nr_lcl \
      >jackhmmer_test_mpi.out 2>jackhmmer_test_mpi.stderr &

Every one of the nodes has a variant of this in the log file (followed by a long list of memory allocation errors, since it exits without being able to clean anything up):

==5135== Memcheck, a memory error detector
==5135== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5135== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==5135== Command: /usr/common/tmp/jackhmmer --tformat ncbi -T 150 --chkhmm jackhmmer_test --mpi /ulhhmi/safrun/a1hu.pfa /usr/common/tmp/testing/nr_lcl
==5135== Parent PID: 5119
==5135== Syscall param socketcall.sendto(msg) points to uninitialised byte(s)
==5135==    at 0x5459BFB: send (in /usr/lib64/
==5135==    by 0xF84A282: mca_btl_tcp_send_blocking (in /opt/ompi401/lib/openmpi/
==5135==    by 0xF84E414: mca_btl_tcp_endpoint_send_handler (in /opt/ompi401/lib/openmpi/
==5135==    by 0x5D6E4EF: event_persist_closure (event.c:1321)
==5135==    by 0x5D6E4EF: event_process_active_single_queue (event.c:1365)
==5135==    by 0x5D6E4EF: event_process_active (event.c:1440)
==5135==    by 0x5D6E4EF: opal_libevent2022_event_base_loop (event.c:1644)
==5135==    by 0x5D2465F: opal_progress (in /opt/ompi401/lib/
==5135==    by 0xF36A9CC: ompi_request_wait_completion (in /opt/ompi401/lib/openmpi/
==5135==    by 0xF36C30E: mca_pml_ob1_send (in /opt/ompi401/lib/openmpi/
==5135==    by 0x51BC581: PMPI_Send (in /opt/ompi401/lib/
==5135==    by 0x40B46E: mpi_worker (jackhmmer.c:1560)
==5135==    by 0x406726: main (jackhmmer.c:413)
==5135==  Address 0x1ffefff8d5 is on thread 1's stack
==5135==  in frame #2, created by mca_btl_tcp_endpoint_send_handler (???:)
==5135== Process terminating with default action of signal 15 (SIGTERM)
==5135==    at 0x5459EFD: ??? (in /usr/lib64/
==5135==    by 0x408817: mpi_failure (jackhmmer.c:887)
==5135==    by 0x40B708: mpi_worker (jackhmmer.c:1597)
==5135==    by 0x406726: main (jackhmmer.c:413)

jackhmmer line 1560 is just this:


preceded at varying distances by:

  int              status   = eslOK;
  status = 0;

I can see why MPI might have some uninitialized bytes in that send, for instance if it pads messages up to some minimum buffer size. The problem is that this completely breaks valgrind in this application, because valgrind exits immediately when it sees this error. The suppression file supplied with the release does not prevent that.
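
For reference, a hand-written suppression matching the top frames of the trace above might look roughly like the sketch below. The entry name is arbitrary, the frame names are copied from the trace, and it is only an assumption that silencing this report would have any effect on the SIGTERM that follows.

```
# Hypothetical entry; the name on the next line is made up.
{
   ompi_btl_tcp_sendto_uninit
   Memcheck:Param
   socketcall.sendto(msg)
   fun:send
   fun:mca_btl_tcp_send_blocking
   fun:mca_btl_tcp_endpoint_send_handler
   ...
}
```

An entry like this would go in a separate file (any name) passed with an additional --suppressions= option alongside the Open MPI one. Valgrind can also print ready-made entries for each error it reports when run with --gen-suppressions=all.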

How do I work around this?

Thank you,

David Mathog
Manager, Sequence Analysis Facility, Biology Division, Caltech