I was not able to reproduce the issue with openib on 4.0; instead, I randomly segfault in MPI_Finalize during the grdma cleanup.
I could, however, reproduce the TCP timeout part with both 4.0 and master, on a pretty sane cluster (only 3 interfaces: lo, eth0 and virbr0). Unsurprisingly, the timeout was triggered by a busted TCP interface selection mechanism. As soon as I exclude the virbr0 interface, everything goes back to normal.

George.

On Wed, Feb 20, 2019 at 5:20 PM Adam LeBlanc <alebl...@iol.unh.edu> wrote:

> Hello Howard,
>
> Thanks for all of the help and suggestions; I will look into them. I also
> realized that my ansible wasn't set up properly for handling tar files, so
> the nightly build didn't even install. I will do it by hand and will give
> you an update tomorrow sometime in the afternoon.
>
> Thanks,
> Adam LeBlanc
>
> On Wed, Feb 20, 2019 at 4:26 PM Howard Pritchard <hpprit...@gmail.com>
> wrote:
>
>> Hello Adam,
>>
>> This helps some. Could you post the first 20 lines of your config.log?
>> This will help in trying to reproduce. The content of your host file
>> (you can use generic names for the nodes if that's an issue to
>> publicize) would also help, as the number of nodes and number of MPI
>> processes/node impacts the way the reduce scatter operation works.
>>
>> One thing to note about the openib BTL - it is on life support. That's
>> why you needed to set btl_openib_allow_ib 1 on the mpirun command line.
>>
>> You may get much better success by installing UCX
>> <https://github.com/openucx/ucx/releases> and rebuilding Open MPI to use
>> UCX. You may actually already have UCX installed on your system if
>> a recent version of MOFED is installed.
>>
>> You can check this by running /usr/bin/ofed_rpm_info. It will show which
>> ucx version has been installed. If UCX is installed, you can add
>> --with-ucx to the Open MPI configure line and it should build in UCX
>> support. If Open MPI is built with UCX support, it will by default use
>> UCX for message transport rather than the openib BTL.
>>
>> thanks,
>>
>> Howard
>>
>>
>> Am Mi., 20. Feb.
2019 um 12:49 Uhr schrieb Adam LeBlanc <alebl...@iol.unh.edu>:
>>
>>> On the tcp side it doesn't segfault anymore but will time out on some
>>> tests; on the openib side it will still segfault. Here is the output:
>>>
>>> [pandora:19256] *** Process received signal ***
>>> [pandora:19256] Signal: Segmentation fault (11)
>>> [pandora:19256] Signal code: Address not mapped (1)
>>> [pandora:19256] Failing at address: 0x7f911c69fff0
>>> [pandora:19255] *** Process received signal ***
>>> [pandora:19255] Signal: Segmentation fault (11)
>>> [pandora:19255] Signal code: Address not mapped (1)
>>> [pandora:19255] Failing at address: 0x7ff09cd3fff0
>>> [pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
>>> [pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
>>> [pandora:19256] [ 2] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
>>> [pandora:19256] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
>>> [pandora:19256] [ 4] [pandora:19255] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
>>> [pandora:19255] [ 1] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
>>> [pandora:19256] [ 5] IMB-MPI1[0x40b83b]
>>> [pandora:19256] [ 6] IMB-MPI1[0x407155]
>>> [pandora:19256] [ 7] IMB-MPI1[0x4022ea]
>>> [pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
>>> [pandora:19255] [ 2] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
>>> [pandora:19256] [ 9] IMB-MPI1[0x401d49]
>>> [pandora:19256] *** End of error message ***
>>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
>>> [pandora:19255] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
>>> [pandora:19255] [ 4] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
>>> [pandora:19255] [ 5] IMB-MPI1[0x40b83b]
>>> [pandora:19255] [ 6] IMB-MPI1[0x407155]
>>> [pandora:19255] [ 7] IMB-MPI1[0x4022ea]
>>> [pandora:19255] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
>>> [pandora:19255] [ 9] IMB-MPI1[0x401d49]
>>> [pandora:19255] *** End of error message ***
>>> [phoebe:12418] *** Process received signal ***
>>> [phoebe:12418] Signal: Segmentation fault (11)
>>> [phoebe:12418] Signal code: Address not mapped (1)
>>> [phoebe:12418] Failing at address: 0x7f5ce27dfff0
>>> [phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
>>> [phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
>>> [phoebe:12418] [ 2] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
>>> [phoebe:12418] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
>>> [phoebe:12418] [ 4] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]
>>> [phoebe:12418] [ 5] IMB-MPI1[0x40b83b]
>>> [phoebe:12418] [ 6] IMB-MPI1[0x407155]
>>> [phoebe:12418] [ 7] IMB-MPI1[0x4022ea]
>>> [phoebe:12418] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5]
>>> [phoebe:12418] [ 9] IMB-MPI1[0x401d49]
>>> [phoebe:12418] *** End of error message ***
>>>
>>> --------------------------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 0 on node pandora exited on
>>> signal 11 (Segmentation fault).
>>> >>> -------------------------------------------------------------------------- >>> >>> - Adam LeBlanc >>> >>> On Wed, Feb 20, 2019 at 2:08 PM Jeff Squyres (jsquyres) via users < >>> users@lists.open-mpi.org> wrote: >>> >>>> Can you try the latest 4.0.x nightly snapshot and see if the problem >>>> still occurs? >>>> >>>> https://www.open-mpi.org/nightly/v4.0.x/ >>>> >>>> >>>> > On Feb 20, 2019, at 1:40 PM, Adam LeBlanc <alebl...@iol.unh.edu> >>>> wrote: >>>> > >>>> > I do here is the output: >>>> > >>>> > 2 total processes killed (some possibly by mpirun during cleanup) >>>> > [pandora:12238] *** Process received signal *** >>>> > [pandora:12238] Signal: Segmentation fault (11) >>>> > [pandora:12238] Signal code: Invalid permissions (2) >>>> > [pandora:12238] Failing at address: 0x7f5c8e31fff0 >>>> > [pandora:12238] [ 0] >>>> /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680] >>>> > [pandora:12238] [ 1] [pandora:12237] *** Process received signal *** >>>> > /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0] >>>> > [pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11) >>>> > [pandora:12237] Signal code: Invalid permissions (2) >>>> > [pandora:12237] Failing at address: 0x7f6c4ab3fff0 >>>> > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55] >>>> > [pandora:12238] [ 3] >>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b] >>>> > [pandora:12238] [ 4] >>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7] >>>> > [pandora:12238] [ 5] IMB-MPI1[0x40b83b] >>>> > [pandora:12238] [ 6] IMB-MPI1[0x407155] >>>> > [pandora:12238] [ 7] IMB-MPI1[0x4022ea] >>>> > [pandora:12238] [ 8] >>>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5ca1ca23d5] >>>> > [pandora:12238] [ 9] IMB-MPI1[0x401d49] >>>> > [pandora:12238] *** End of error message *** >>>> > [pandora:12237] [ 0] >>>> /usr/lib64/libpthread.so.0(+0xf680)[0x7f6c5e73f680] >>>> > [pandora:12237] [ 1] 
/usr/lib64/libc.so.6(+0x14c4a0)[0x7f6c5e4ac4a0] >>>> > [pandora:12237] [ 2] >>>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6c5dddbe55] >>>> > [pandora:12237] [ 3] >>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6c5e9f798b] >>>> > [pandora:12237] [ 4] >>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6c5e9ceda7] >>>> > [pandora:12237] [ 5] IMB-MPI1[0x40b83b] >>>> > [pandora:12237] [ 6] IMB-MPI1[0x407155] >>>> > [pandora:12237] [ 7] IMB-MPI1[0x4022ea] >>>> > [pandora:12237] [ 8] >>>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6c5e3823d5] >>>> > [pandora:12237] [ 9] IMB-MPI1[0x401d49] >>>> > [pandora:12237] *** End of error message *** >>>> > [phoebe:07408] *** Process received signal *** >>>> > [phoebe:07408] Signal: Segmentation fault (11) >>>> > [phoebe:07408] Signal code: Invalid permissions (2) >>>> > [phoebe:07408] Failing at address: 0x7f6b9ca9fff0 >>>> > [titan:07169] *** Process received signal *** >>>> > [titan:07169] Signal: Segmentation fault (11) >>>> > [titan:07169] Signal code: Invalid permissions (2) >>>> > [titan:07169] Failing at address: 0x7fc01295fff0 >>>> > [phoebe:07408] [ 0] >>>> /usr/lib64/libpthread.so.0(+0xf680)[0x7f6bb03b7680] >>>> > [phoebe:07408] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6bb01244a0] >>>> > [phoebe:07408] [ 2] [titan:07169] [ 0] >>>> /usr/lib64/libpthread.so.0(+0xf680)[0x7fc026117680] >>>> > [titan:07169] [ 1] >>>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6bafa53e55] >>>> > [phoebe:07408] [ 3] /usr/lib64/libc.so.6(+0x14c4a0)[0x7fc025e844a0] >>>> > [titan:07169] [ 2] >>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6bb066f98b] >>>> > [phoebe:07408] [ 4] >>>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7fc0257b3e55] >>>> > [titan:07169] [ 3] >>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6bb0646da7] >>>> > [phoebe:07408] [ 5] IMB-MPI1[0x40b83b] >>>> > 
[phoebe:07408] [ 6] IMB-MPI1[0x407155] >>>> > >>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7fc0263cf98b] >>>> > [titan:07169] [ 4] [phoebe:07408] [ 7] IMB-MPI1[0x4022ea] >>>> > [phoebe:07408] [ 8] >>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7fc0263a6da7] >>>> > [titan:07169] [ 5] IMB-MPI1[0x40b83b] >>>> > [titan:07169] [ 6] IMB-MPI1[0x407155] >>>> > [titan:07169] [ 7] >>>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6bafffa3d5] >>>> > [phoebe:07408] [ 9] IMB-MPI1[0x401d49] >>>> > [phoebe:07408] *** End of error message *** >>>> > IMB-MPI1[0x4022ea] >>>> > [titan:07169] [ 8] >>>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc025d5a3d5] >>>> > [titan:07169] [ 9] IMB-MPI1[0x401d49] >>>> > [titan:07169] *** End of error message *** >>>> > >>>> -------------------------------------------------------------------------- >>>> > Primary job terminated normally, but 1 process returned >>>> > a non-zero exit code. Per user-direction, the job has been aborted. >>>> > >>>> -------------------------------------------------------------------------- >>>> > >>>> -------------------------------------------------------------------------- >>>> > mpirun noticed that process rank 0 with PID 0 on node pandora exited >>>> on signal 11 (Segmentation fault). >>>> > >>>> -------------------------------------------------------------------------- >>>> > >>>> > >>>> > - Adam LeBlanc >>>> > >>>> > On Wed, Feb 20, 2019 at 1:20 PM Howard Pritchard <hpprit...@gmail.com> >>>> wrote: >>>> > HI Adam, >>>> > >>>> > As a sanity check, if you try to use --mca btl self,vader,tcp >>>> > >>>> > do you still see the segmentation fault? >>>> > >>>> > Howard >>>> > >>>> > >>>> > Am Mi., 20. Feb. 
2019 um 08:50 Uhr schrieb Adam LeBlanc <alebl...@iol.unh.edu>:
>>>> > Hello,
>>>> >
>>>> > When I do a run with OpenMPI v4.0.0 on Infiniband with this command:
>>>> > mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node
>>>> > --mca orte_base_help_aggregate 0 --mca btl openib,vader,self
>>>> > --mca pml ob1 --mca btl_openib_allow_ib 1 -np 6
>>>> > -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
>>>> >
>>>> > I get this error:
>>>> >
>>>> > #----------------------------------------------------------------
>>>> > # Benchmarking Reduce_scatter
>>>> > # #processes = 4
>>>> > # ( 2 additional processes waiting in MPI_Barrier)
>>>> > #----------------------------------------------------------------
>>>> >        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>>> >             0         1000         0.14         0.15         0.14
>>>> >             4         1000         5.00         7.58         6.28
>>>> >             8         1000         5.13         7.68         6.41
>>>> >            16         1000         5.05         7.74         6.39
>>>> >            32         1000         5.43         7.96         6.75
>>>> >            64         1000         6.78         8.56         7.69
>>>> >           128         1000         7.77         9.55         8.59
>>>> >           256         1000         8.28        10.96         9.66
>>>> >           512         1000         9.19        12.49        10.85
>>>> >          1024         1000        11.78        15.01        13.38
>>>> >          2048         1000        17.41        19.51        18.52
>>>> >          4096         1000        25.73        28.22        26.89
>>>> >          8192         1000        47.75        49.44        48.79
>>>> >         16384         1000        81.10        90.15        84.75
>>>> >         32768         1000       163.01       178.58       173.19
>>>> >         65536          640       315.63       340.51       333.18
>>>> >        131072          320       475.48       528.82       510.85
>>>> >        262144          160       979.70      1063.81      1035.61
>>>> >        524288           80      2070.51      2242.58      2150.15
>>>> >       1048576           40      4177.36      4527.25      4431.65
>>>> >       2097152           20      8738.08      9340.50      9147.89
>>>> > [pandora:04500] *** Process received signal ***
>>>> > [pandora:04500] Signal: Segmentation fault (11)
>>>> > [pandora:04500] Signal code: Address not mapped (1)
>>>> > [pandora:04500] Failing at address: 0x7f310ebffff0
>>>> > [pandora:04499] *** Process received signal ***
>>>> > [pandora:04499] Signal: Segmentation fault (11)
>>>> > [pandora:04499] Signal code: Address not mapped (1)
>>>> > [pandora:04499] Failing at address: 0x7f28b11ffff0
>>>> > [pandora:04500] [ 0]
/usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680] >>>> > [pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0] >>>> > [pandora:04500] [ 2] >>>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55] >>>> > [pandora:04500] [ 3] [pandora:04499] [ 0] >>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b] >>>> > [pandora:04500] [ 4] >>>> /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680] >>>> > [pandora:04499] [ 1] >>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7] >>>> > [pandora:04500] [ 5] IMB-MPI1[0x40b83b] >>>> > [pandora:04500] [ 6] IMB-MPI1[0x407155] >>>> > [pandora:04500] [ 7] IMB-MPI1[0x4022ea] >>>> > [pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0] >>>> > [pandora:04499] [ 2] >>>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5] >>>> > [pandora:04500] [ 9] IMB-MPI1[0x401d49] >>>> > [pandora:04500] *** End of error message *** >>>> > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55] >>>> > [pandora:04499] [ 3] >>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b] >>>> > [pandora:04499] [ 4] >>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7] >>>> > [pandora:04499] [ 5] IMB-MPI1[0x40b83b] >>>> > [pandora:04499] [ 6] IMB-MPI1[0x407155] >>>> > [pandora:04499] [ 7] IMB-MPI1[0x4022ea] >>>> > [pandora:04499] [ 8] >>>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5] >>>> > [pandora:04499] [ 9] IMB-MPI1[0x401d49] >>>> > [pandora:04499] *** End of error message *** >>>> > [phoebe:03779] *** Process received signal *** >>>> > [phoebe:03779] Signal: Segmentation fault (11) >>>> > [phoebe:03779] Signal code: Address not mapped (1) >>>> > [phoebe:03779] Failing at address: 0x7f483d6ffff0 >>>> > [phoebe:03779] [ 0] >>>> /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680] >>>> > [phoebe:03779] [ 1] 
/usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0] >>>> > [phoebe:03779] [ 2] >>>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55] >>>> > [phoebe:03779] [ 3] >>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b] >>>> > [phoebe:03779] [ 4] >>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7] >>>> > [phoebe:03779] [ 5] IMB-MPI1[0x40b83b] >>>> > [phoebe:03779] [ 6] IMB-MPI1[0x407155] >>>> > [phoebe:03779] [ 7] IMB-MPI1[0x4022ea] >>>> > [phoebe:03779] [ 8] >>>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f485530a3d5] >>>> > [phoebe:03779] [ 9] IMB-MPI1[0x401d49] >>>> > [phoebe:03779] *** End of error message *** >>>> > >>>> -------------------------------------------------------------------------- >>>> > Primary job terminated normally, but 1 process returned >>>> > a non-zero exit code. Per user-direction, the job has been aborted. >>>> > >>>> -------------------------------------------------------------------------- >>>> > >>>> -------------------------------------------------------------------------- >>>> > mpirun noticed that process rank 1 with PID 3779 on node phoebe-ib >>>> exited on signal 11 (Segmentation fault). >>>> > >>>> -------------------------------------------------------------------------- >>>> > >>>> > Also if I reinstall 3.1.2 I do not have this issue at all. >>>> > >>>> > Any thoughts on what could be the issue? 
>>>> > >>>> > Thanks, >>>> > Adam LeBlanc >>>> > _______________________________________________ >>>> > users mailing list >>>> > users@lists.open-mpi.org >>>> > https://lists.open-mpi.org/mailman/listinfo/users >>>> > _______________________________________________ >>>> > users mailing list >>>> > users@lists.open-mpi.org >>>> > https://lists.open-mpi.org/mailman/listinfo/users >>>> > _______________________________________________ >>>> > users mailing list >>>> > users@lists.open-mpi.org >>>> > https://lists.open-mpi.org/mailman/listinfo/users >>>> >>>> >>>> -- >>>> Jeff Squyres >>>> jsquy...@cisco.com >>>> >>>> _______________________________________________ >>>> users mailing list >>>> users@lists.open-mpi.org >>>> https://lists.open-mpi.org/mailman/listinfo/users >>>> >>> _______________________________________________ >>> users mailing list >>> users@lists.open-mpi.org >>> https://lists.open-mpi.org/mailman/listinfo/users >> >> _______________________________________________ >> users mailing list >> users@lists.open-mpi.org >> https://lists.open-mpi.org/mailman/listinfo/users > > _______________________________________________ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users
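[Editor's note] Howard's UCX suggestion earlier in the thread boils down to the following sketch. The install prefix is taken from the paths in the backtraces; the grep step is only a convenience around the ofed_rpm_info output he mentions, not a documented interface:

```shell
# Check whether the installed MOFED already ships a UCX build.
/usr/bin/ofed_rpm_info | grep -i ucx

# If it does, rebuild Open MPI against it; with UCX support built in,
# Open MPI defaults to the UCX PML instead of the openib BTL.
./configure --prefix=/opt/openmpi/4.0.0 --with-ucx
make -j all && make install
```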
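[Editor's note] The TCP-side fix George describes at the top of the thread (excluding the libvirt NAT bridge from interface selection) can be applied on the mpirun command line. A sketch, reusing the hostfile path from the thread; interface names are site-specific:

```shell
# Keep the TCP BTL off lo and the libvirt bridge virbr0.
# Note: setting btl_tcp_if_exclude replaces the default exclude list,
# so "lo" must be re-listed explicitly.
mpirun --mca btl tcp,vader,self \
       --mca btl_tcp_if_exclude lo,virbr0 \
       -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
```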