Re: [OMPI users] OMPI 4.0.1 valgrind error on simple MPI_Send()
On 30-Apr-2019 11:39, George Bosilca wrote:
> Depending on the alignment of the different types there might be small
> holes in the low-level headers we exchange between processes. It should
> not be a concern for users. valgrind should not stop on the first
> detected issue unless --exit-on-first-error has been provided (the
> default value should be no), so the SIGTERM might be generated for some
> other reason. What is at jackhmmer.c:1597?

Ah, my bad, I didn't notice that the exit was at a different line number
than the error message. Line 1597 is a coded exit on an error condition
(triggered by the omission from the test of a parameter needed by the
modified jackhmmer). So you are right, it did not in fact fail at the
first error.

This suppression file:

cat >/usr/common/tmp
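For reference, a Memcheck suppression matching the backtrace quoted later in this thread would look something like the following sketch (the label on the first line is arbitrary; the `fun:` frames are taken from the posted log, and `...` matches any number of intervening frames):

```
{
   ompi_btl_tcp_header_padding
   Memcheck:Param
   socketcall.sendto(msg)
   fun:send
   fun:mca_btl_tcp_send_blocking
   ...
}
```

Such a file is passed with an additional --suppressions= argument; valgrind accepts that option more than once, so the stock openmpi-valgrind.supp can stay in place.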
Re: [OMPI users] OMPI 4.0.1 valgrind error on simple MPI_Send()
Depending on the alignment of the different types there might be small
holes in the low-level headers we exchange between processes. It should
not be a concern for users.

valgrind should not stop on the first detected issue unless
--exit-on-first-error has been provided (the default value should be no),
so the SIGTERM might be generated for some other reason. What is at
jackhmmer.c:1597?

  George.

On Tue, Apr 30, 2019 at 2:27 PM David Mathog via users
<users@lists.open-mpi.org> wrote:
> [full original message quoted; the message itself appears below]
[OMPI users] OMPI 4.0.1 valgrind error on simple MPI_Send()
Attempting to debug a complex program (99.9% of which is others' code)
which stops running when run in valgrind as follows:

mpirun -np 10 \
    --hostfile /usr/common/etc/openmpi.machines.LINUX_INTEL_newsaf_rev2 \
    --mca plm_rsh_agent rsh \
    /usr/bin/valgrind \
        --leak-check=full \
        --leak-resolution=high \
        --show-reachable=yes \
        --log-file=nc.vg.%p \
        --suppressions=/opt/ompi401/share/openmpi/openmpi-valgrind.supp \
    /usr/common/tmp/jackhmmer \
        --tformat ncbi \
        -T 150 \
        --chkhmm jackhmmer_test \
        --mpi \
        ~safrun/a1hu.pfa \
        /usr/common/tmp/testing/nr_lcl \
    >jackhmmer_test_mpi.out 2>jackhmmer_test_mpi.stderr &

Every one of the nodes has a variant of this in the log file (followed
by a long list of memory allocation errors, since it exits without
being able to clean anything up):

==5135== Memcheck, a memory error detector
==5135== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5135== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==5135== Command: /usr/common/tmp/jackhmmer --tformat ncbi -T 150 --chkhmm jackhmmer_test --mpi /ulhhmi/safrun/a1hu.pfa /usr/common/tmp/testing/nr_lcl
==5135== Parent PID: 5119
==5135==
==5135== Syscall param socketcall.sendto(msg) points to uninitialised byte(s)
==5135==    at 0x5459BFB: send (in /usr/lib64/libpthread-2.17.so)
==5135==    by 0xF84A282: mca_btl_tcp_send_blocking (in /opt/ompi401/lib/openmpi/mca_btl_tcp.so)
==5135==    by 0xF84E414: mca_btl_tcp_endpoint_send_handler (in /opt/ompi401/lib/openmpi/mca_btl_tcp.so)
==5135==    by 0x5D6E4EF: event_persist_closure (event.c:1321)
==5135==    by 0x5D6E4EF: event_process_active_single_queue (event.c:1365)
==5135==    by 0x5D6E4EF: event_process_active (event.c:1440)
==5135==    by 0x5D6E4EF: opal_libevent2022_event_base_loop (event.c:1644)
==5135==    by 0x5D2465F: opal_progress (in /opt/ompi401/lib/libopen-pal.so.40.20.1)
==5135==    by 0xF36A9CC: ompi_request_wait_completion (in /opt/ompi401/lib/openmpi/mca_pml_ob1.so)
==5135==    by 0xF36C30E: mca_pml_ob1_send (in /opt/ompi401/lib/openmpi/mca_pml_ob1.so)
==5135==    by 0x51BC581: PMPI_Send (in /opt/ompi401/lib/libmpi.so.40.20.1)
==5135==    by 0x40B46E: mpi_worker (jackhmmer.c:1560)
==5135==    by 0x406726: main (jackhmmer.c:413)
==5135==  Address 0x1ffefff8d5 is on thread 1's stack
==5135==  in frame #2, created by mca_btl_tcp_endpoint_send_handler (???:)
==5135==
==5135==
==5135== Process terminating with default action of signal 15 (SIGTERM)
==5135==    at 0x5459EFD: ??? (in /usr/lib64/libpthread-2.17.so)
==5135==    by 0x408817: mpi_failure (jackhmmer.c:887)
==5135==    by 0x40B708: mpi_worker (jackhmmer.c:1597)
==5135==    by 0x406726: main (jackhmmer.c:413)

jackhmmer line 1560 is just this:

    MPI_Send(&status, 1, MPI_INT, 0, HMMER_SETUP_READY_TAG, MPI_COMM_WORLD);

preceded at varying distances by:

    int status = eslOK;
    status = 0;

I can see why MPI might have some uninitialized bytes in that send, for
instance, if it has a minimum buffer size it will send or something like
that. The problem is that it completely breaks valgrind in this
application because valgrind exits immediately when it sees this error.
The suppression file supplied with the release does not prevent that.

How do I work around this?

Thank you,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
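One standard way to build a working suppression file (a sketch; the jackhmmer arguments are abbreviated): Memcheck's --gen-suppressions=all option prints a ready-to-paste `{ ... }` suppression block after each error it reports, which can be collected into a local file and passed back with a second --suppressions flag, since valgrind accepts the option more than once:

```shell
valgrind --gen-suppressions=all \
         --suppressions=/opt/ompi401/share/openmpi/openmpi-valgrind.supp \
         /usr/common/tmp/jackhmmer ... 2>vg_with_suppressions.txt
# then copy the emitted { ... } blocks into a local .supp file and add
#   --suppressions=/path/to/local.supp
# to the original valgrind command line
```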