Hello Adam,
During the InfiniBand Plugfest 34 event last October, we found that mpirun hangs
on FDR systems when run with the openib BTL.
Yossi Itigin (@Mellanox) suggested that we run using the following options:
--mca btl self,vader --mca pml ucx -x UCX_RC_PATH_MTU=4096
If you still have trouble, please try the options above (in addition to Howard's
suggestion) and see if that resolves the issue.
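Put together with a typical launch, that would look something like this (the
hostfile path and process count are taken from your earlier command; adjust
for your cluster):

```shell
mpirun --mca btl self,vader --mca pml ucx \
       -x UCX_RC_PATH_MTU=4096 \
       -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
```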
Thanks.
--
Llolsten
From: users <[email protected]> On Behalf Of Adam LeBlanc
Sent: Wednesday, February 20, 2019 5:18 PM
To: Open MPI Users <[email protected]>
Subject: Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)
Hello Howard,
Thanks for all of the help and suggestions; I will look into them. I also
realized that my Ansible setup wasn't handling tar files properly, so the
nightly build never actually installed. I will build it by hand and give you an
update tomorrow afternoon.
Thanks,
Adam LeBlanc
On Wed, Feb 20, 2019 at 4:26 PM Howard Pritchard <[email protected]> wrote:
Hello Adam,
This helps some. Could you post the first 20 lines of your config.log? That
will help in trying to reproduce. The contents of your host file (you can use
generic names for the nodes if publicizing them is an issue) would also help,
since the number of nodes and the number of MPI processes per node affect how
the reduce-scatter operation works.
One thing to note about the openib BTL - it is on life support. That's
why you needed to set btl_openib_allow_ib 1 on the mpirun command line.
You may get much better success by installing UCX
<https://github.com/openucx/ucx/releases> and rebuilding Open MPI to use UCX.
You may actually already have UCX installed on your system if
a recent version of MOFED is installed.
You can check by running /usr/bin/ofed_rpm_info; it will show which UCX version
has been installed.
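For example (a sketch; ofed_rpm_info ships with MOFED, and the rpm query is an
alternative on RPM-based systems):

```shell
# Show what the installed MOFED provides, including any UCX version
/usr/bin/ofed_rpm_info | grep -i ucx

# Or ask the package database directly
rpm -qa | grep -i '^ucx'
```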
If UCX is installed, you can add --with-ucx to the Open MPI configure line and
it should build in UCX support. If Open MPI is built with UCX support, it will
by default use UCX for message transport rather than the openib BTL.
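The rebuild would look roughly like this (the prefix matches the paths in your
backtraces; point --with-ucx at wherever UCX is actually installed):

```shell
./configure --prefix=/opt/openmpi/4.0.0 --with-ucx=/usr
make -j 8 && make install
```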
thanks,
Howard
On Wed, Feb 20, 2019 at 12:49 PM Adam LeBlanc <[email protected]> wrote:
On the TCP side it no longer segfaults, though some tests time out; on the
openib side it still segfaults. Here is the output:
[pandora:19256] *** Process received signal ***
[pandora:19256] Signal: Segmentation fault (11)
[pandora:19256] Signal code: Address not mapped (1)
[pandora:19256] Failing at address: 0x7f911c69fff0
[pandora:19255] *** Process received signal ***
[pandora:19255] Signal: Segmentation fault (11)
[pandora:19255] Signal code: Address not mapped (1)
[pandora:19255] Failing at address: 0x7ff09cd3fff0
[pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
[pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
[pandora:19256] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
[pandora:19256] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
[pandora:19256] [ 4] [pandora:19255] [ 0]
/usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
[pandora:19255] [ 1]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
[pandora:19256] [ 5] IMB-MPI1[0x40b83b]
[pandora:19256] [ 6] IMB-MPI1[0x407155]
[pandora:19256] [ 7] IMB-MPI1[0x4022ea]
[pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
[pandora:19255] [ 2]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
[pandora:19256] [ 9] IMB-MPI1[0x401d49]
[pandora:19256] *** End of error message ***
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
[pandora:19255] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
[pandora:19255] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
[pandora:19255] [ 5] IMB-MPI1[0x40b83b]
[pandora:19255] [ 6] IMB-MPI1[0x407155]
[pandora:19255] [ 7] IMB-MPI1[0x4022ea]
[pandora:19255] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
[pandora:19255] [ 9] IMB-MPI1[0x401d49]
[pandora:19255] *** End of error message ***
[phoebe:12418] *** Process received signal ***
[phoebe:12418] Signal: Segmentation fault (11)
[phoebe:12418] Signal code: Address not mapped (1)
[phoebe:12418] Failing at address: 0x7f5ce27dfff0
[phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
[phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
[phoebe:12418] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
[phoebe:12418] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
[phoebe:12418] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]
[phoebe:12418] [ 5] IMB-MPI1[0x40b83b]
[phoebe:12418] [ 6] IMB-MPI1[0x407155]
[phoebe:12418] [ 7] IMB-MPI1[0x4022ea]
[phoebe:12418] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5]
[phoebe:12418] [ 9] IMB-MPI1[0x401d49]
[phoebe:12418] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node pandora exited on signal
11 (Segmentation fault).
--------------------------------------------------------------------------
- Adam LeBlanc
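To narrow things down further, it may be worth running only the failing
benchmark; IMB-MPI1 accepts benchmark names as arguments (the options below are
the ones from my original command):

```shell
mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node \
       --mca orte_base_help_aggregate 0 --mca btl openib,vader,self \
       --mca pml ob1 --mca btl_openib_allow_ib 1 \
       -np 6 -hostfile /home/aleblanc/ib-mpi-hosts \
       IMB-MPI1 Reduce_scatter
```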
On Wed, Feb 20, 2019 at 2:08 PM Jeff Squyres (jsquyres) via users
<[email protected]> wrote:
Can you try the latest 4.0.x nightly snapshot and see if the problem still
occurs?
https://www.open-mpi.org/nightly/v4.0.x/
> On Feb 20, 2019, at 1:40 PM, Adam LeBlanc <[email protected]> wrote:
>
> I do; here is the output:
>
> 2 total processes killed (some possibly by mpirun during cleanup)
> [pandora:12238] *** Process received signal ***
> [pandora:12238] Signal: Segmentation fault (11)
> [pandora:12238] Signal code: Invalid permissions (2)
> [pandora:12238] Failing at address: 0x7f5c8e31fff0
> [pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680]
> [pandora:12238] [ 1] [pandora:12237] *** Process received signal ***
> /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0]
> [pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11)
> [pandora:12237] Signal code: Invalid permissions (2)
> [pandora:12237] Failing at address: 0x7f6c4ab3fff0
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55]
> [pandora:12238] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b]
> [pandora:12238] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7]
> [pandora:12238] [ 5] IMB-MPI1[0x40b83b]
> [pandora:12238] [ 6] IMB-MPI1[0x407155]
> [pandora:12238] [ 7] IMB-MPI1[0x4022ea]
> [pandora:12238] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5ca1ca23d5]
> [pandora:12238] [ 9] IMB-MPI1[0x401d49]
> [pandora:12238] *** End of error message ***
> [pandora:12237] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6c5e73f680]
> [pandora:12237] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6c5e4ac4a0]
> [pandora:12237] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6c5dddbe55]
> [pandora:12237] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6c5e9f798b]
> [pandora:12237] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6c5e9ceda7]
> [pandora:12237] [ 5] IMB-MPI1[0x40b83b]
> [pandora:12237] [ 6] IMB-MPI1[0x407155]
> [pandora:12237] [ 7] IMB-MPI1[0x4022ea]
> [pandora:12237] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6c5e3823d5]
> [pandora:12237] [ 9] IMB-MPI1[0x401d49]
> [pandora:12237] *** End of error message ***
> [phoebe:07408] *** Process received signal ***
> [phoebe:07408] Signal: Segmentation fault (11)
> [phoebe:07408] Signal code: Invalid permissions (2)
> [phoebe:07408] Failing at address: 0x7f6b9ca9fff0
> [titan:07169] *** Process received signal ***
> [titan:07169] Signal: Segmentation fault (11)
> [titan:07169] Signal code: Invalid permissions (2)
> [titan:07169] Failing at address: 0x7fc01295fff0
> [phoebe:07408] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6bb03b7680]
> [phoebe:07408] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6bb01244a0]
> [phoebe:07408] [ 2] [titan:07169] [ 0]
> /usr/lib64/libpthread.so.0(+0xf680)[0x7fc026117680]
> [titan:07169] [ 1]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6bafa53e55]
> [phoebe:07408] [ 3] /usr/lib64/libc.so.6(+0x14c4a0)[0x7fc025e844a0]
> [titan:07169] [ 2]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6bb066f98b]
> [phoebe:07408] [ 4]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7fc0257b3e55]
> [titan:07169] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6bb0646da7]
> [phoebe:07408] [ 5] IMB-MPI1[0x40b83b]
> [phoebe:07408] [ 6] IMB-MPI1[0x407155]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7fc0263cf98b]
> [titan:07169] [ 4] [phoebe:07408] [ 7] IMB-MPI1[0x4022ea]
> [phoebe:07408] [ 8]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7fc0263a6da7]
> [titan:07169] [ 5] IMB-MPI1[0x40b83b]
> [titan:07169] [ 6] IMB-MPI1[0x407155]
> [titan:07169] [ 7]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6bafffa3d5]
> [phoebe:07408] [ 9] IMB-MPI1[0x401d49]
> [phoebe:07408] *** End of error message ***
> IMB-MPI1[0x4022ea]
> [titan:07169] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc025d5a3d5]
> [titan:07169] [ 9] IMB-MPI1[0x401d49]
> [titan:07169] *** End of error message ***
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 0 on node pandora exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
>
> - Adam LeBlanc
>
> On Wed, Feb 20, 2019 at 1:20 PM Howard Pritchard <[email protected]> wrote:
> Hi Adam,
>
> As a sanity check, if you try to use --mca btl self,vader,tcp
>
> do you still see the segmentation fault?
>
> Howard
>
>
> On Wed, Feb 20, 2019 at 8:50 AM Adam LeBlanc <[email protected]> wrote:
> Hello,
>
> When I do a run with OpenMPI v4.0.0 on Infiniband with this command: mpirun
> --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca
> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
> btl_openib_allow_ib 1 -np 6
> -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
>
> I get this error:
>
> #----------------------------------------------------------------
> # Benchmarking Reduce_scatter
> # #processes = 4
> # ( 2 additional processes waiting in MPI_Barrier)
> #----------------------------------------------------------------
> #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
> 0 1000 0.14 0.15 0.14
> 4 1000 5.00 7.58 6.28
> 8 1000 5.13 7.68 6.41
> 16 1000 5.05 7.74 6.39
> 32 1000 5.43 7.96 6.75
> 64 1000 6.78 8.56 7.69
> 128 1000 7.77 9.55 8.59
> 256 1000 8.28 10.96 9.66
> 512 1000 9.19 12.49 10.85
> 1024 1000 11.78 15.01 13.38
> 2048 1000 17.41 19.51 18.52
> 4096 1000 25.73 28.22 26.89
> 8192 1000 47.75 49.44 48.79
> 16384 1000 81.10 90.15 84.75
> 32768 1000 163.01 178.58 173.19
> 65536 640 315.63 340.51 333.18
> 131072 320 475.48 528.82 510.85
> 262144 160 979.70 1063.81 1035.61
> 524288 80 2070.51 2242.58 2150.15
> 1048576 40 4177.36 4527.25 4431.65
> 2097152 20 8738.08 9340.50 9147.89
> [pandora:04500] *** Process received signal ***
> [pandora:04500] Signal: Segmentation fault (11)
> [pandora:04500] Signal code: Address not mapped (1)
> [pandora:04500] Failing at address: 0x7f310ebffff0
> [pandora:04499] *** Process received signal ***
> [pandora:04499] Signal: Segmentation fault (11)
> [pandora:04499] Signal code: Address not mapped (1)
> [pandora:04499] Failing at address: 0x7f28b11ffff0
> [pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
> [pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
> [pandora:04500] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
> [pandora:04500] [ 3] [pandora:04499] [ 0]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
> [pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
> [pandora:04499] [ 1]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
> [pandora:04500] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04500] [ 6] IMB-MPI1[0x407155]
> [pandora:04500] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
> [pandora:04499] [ 2]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
> [pandora:04500] [ 9] IMB-MPI1[0x401d49]
> [pandora:04500] *** End of error message ***
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
> [pandora:04499] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
> [pandora:04499] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
> [pandora:04499] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04499] [ 6] IMB-MPI1[0x407155]
> [pandora:04499] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04499] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
> [pandora:04499] [ 9] IMB-MPI1[0x401d49]
> [pandora:04499] *** End of error message ***
> [phoebe:03779] *** Process received signal ***
> [phoebe:03779] Signal: Segmentation fault (11)
> [phoebe:03779] Signal code: Address not mapped (1)
> [phoebe:03779] Failing at address: 0x7f483d6ffff0
> [phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
> [phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
> [phoebe:03779] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
> [phoebe:03779] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
> [phoebe:03779] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
> [phoebe:03779] [ 5] IMB-MPI1[0x40b83b]
> [phoebe:03779] [ 6] IMB-MPI1[0x407155]
> [phoebe:03779] [ 7] IMB-MPI1[0x4022ea]
> [phoebe:03779] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f485530a3d5]
> [phoebe:03779] [ 9] IMB-MPI1[0x401d49]
> [phoebe:03779] *** End of error message ***
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 3779 on node phoebe-ib exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Also, if I reinstall 3.1.2, I do not have this issue at all.
>
> Any thoughts on what could be the issue?
>
> Thanks,
> Adam LeBlanc
> _______________________________________________
> users mailing list
> [email protected] <mailto:[email protected]>
> https://lists.open-mpi.org/mailman/listinfo/users
--
Jeff Squyres
[email protected] <mailto:[email protected]>