Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-28 Thread Adam LeBlanc
Hello all,

Thank you all for the suggestions. Takahiro's suggestion has gotten me to a
point where all of the tests will run, but as soon as it gets to the cleanup
step IMB segfaults again. I opened an issue on IMB's GitHub, but I guess I am
not going to get much help from them, so I will have to wait and see what
happens next.

Thanks again for all your help,
Adam LeBlanc

On Thu, Feb 21, 2019 at 7:22 AM Peter Kjellström  wrote:

> On Wed, 20 Feb 2019 10:46:10 -0500
> Adam LeBlanc  wrote:
>
> > Hello,
> >
> > When I do a run with OpenMPI v4.0.0 on Infiniband with this command:
> > mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node
> > --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca
> > pml ob1 --mca btl_openib_allow_ib 1 -np 6
> >  -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
> >
> > I get this error:
> ...
> > # Benchmarking Reduce_scatter
> ...
> >   2097152   20  8738.08  9340.50  9147.89
> > [pandora:04500] *** Process received signal ***
> > [pandora:04500] Signal: Segmentation fault (11)
>
> This is very likely a bug in IMB not in OpenMPI. It's been discussed on
> the list before, thread name:
>
>  MPI_Reduce_Scatter Segmentation Fault with Intel  2019 Update 1
>  Compilers on OPA-1...
>
> You can work around by using an older IMB version (the bug is in the
> newest version).
>
> /Peter K
>

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-21 Thread Peter Kjellström
On Wed, 20 Feb 2019 10:46:10 -0500
Adam LeBlanc  wrote:

> Hello,
> 
> When I do a run with OpenMPI v4.0.0 on Infiniband with this command:
> mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node
> --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca
> pml ob1 --mca btl_openib_allow_ib 1 -np 6
>  -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
> 
> I get this error:
...
> # Benchmarking Reduce_scatter
...
>   2097152   20  8738.08  9340.50  9147.89
> [pandora:04500] *** Process received signal ***
> [pandora:04500] Signal: Segmentation fault (11)

This is very likely a bug in IMB not in OpenMPI. It's been discussed on
the list before, thread name:

 MPI_Reduce_Scatter Segmentation Fault with Intel  2019 Update 1
 Compilers on OPA-1...

You can work around by using an older IMB version (the bug is in the
newest version).
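If you build IMB from the intel/mpi-benchmarks git repo, that would look
roughly like the following (untested sketch; pick a real tag from the list,
and note the build line can differ between releases):

  git clone https://github.com/intel/mpi-benchmarks.git && cd mpi-benchmarks
  git tag                             # list the released versions
  git checkout <some-pre-2019-tag>    # placeholder, use a tag from the list above
  make IMB-MPI1 CC=mpicc CXX=mpicxx   # top-level target in the 2018+ trees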

/Peter K


Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Llolsten Kaonga
Hello Adam,

During the InfiniBand Plugfest 34 event last October, we found that mpirun
would hang on FDR systems if run with the openib btl option.

Yossi Itigin (@Mellanox) suggested that we run using the following options:

--mca btl self,vader --mca pml ucx -x UCX_RC_PATH_MTU=4096

If you still have trouble, please try the above options (& per Howard's
suggestion) and see if that resolves your troubles.
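For example, folded into the command line from your original mail it would
look something like this (untested on our side, and it assumes your Open MPI
build has UCX support compiled in):

  mpirun --mca btl self,vader --mca pml ucx -x UCX_RC_PATH_MTU=4096 \
      --map-by node -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1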

 

Thanks.
--
Llolsten

From: users  On Behalf Of Adam LeBlanc
Sent: Wednesday, February 20, 2019 5:18 PM
To: Open MPI Users 
Subject: Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

 

Hello Howard,

Thanks for all of the help and suggestions; I will look into them. I also
realized that my ansible wasn't set up properly for handling tar files, so the
nightly build didn't even install, but I will do it by hand and will give you
an update tomorrow sometime in the afternoon.

Thanks,

Adam LeBlanc

On Wed, Feb 20, 2019 at 4:26 PM Howard Pritchard <hpprit...@gmail.com> wrote:

Hello Adam,

This helps some.  Could you post the first 20 lines of your config.log?  This
will help in trying to reproduce.  The content of your host file (you can use
generic names for the nodes if that's an issue to publicize) would also help,
as the number of nodes and the number of MPI processes per node impacts the
way the reduce scatter operation works.

One thing to note about the openib BTL - it is on life support.   That's
why you needed to set btl_openib_allow_ib 1 on the mpirun command line.

You may get much better success by installing UCX
<https://github.com/openucx/ucx/releases> and rebuilding Open MPI to use UCX.
You may actually already have UCX installed on your system if a recent
version of MOFED is installed.

You can check this by running /usr/bin/ofed_rpm_info.  It will show which UCX
version has been installed.
If UCX is installed, you can add --with-ucx to the Open MPI configuration
line and it should build in UCX support.  If Open MPI is built with UCX
support, it will by default use UCX for message transport rather than the
OpenIB BTL.

thanks,

Howard

On Wed, Feb 20, 2019 at 12:49 Adam LeBlanc <alebl...@iol.unh.edu> wrote:

On the tcp side it doesn't seg fault anymore, though it will time out on some
tests; on the openib side it will still seg fault. Here is the output:

 

[pandora:19256] *** Process received signal ***
[pandora:19256] Signal: Segmentation fault (11)
[pandora:19256] Signal code: Address not mapped (1)
[pandora:19256] Failing at address: 0x7f911c69fff0
[pandora:19255] *** Process received signal ***
[pandora:19255] Signal: Segmentation fault (11)
[pandora:19255] Signal code: Address not mapped (1)
[pandora:19255] Failing at address: 0x7ff09cd3fff0
[pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
[pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
[pandora:19256] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
[pandora:19256] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
[pandora:19256] [ 4] [pandora:19255] [ 0]
/usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
[pandora:19255] [ 1]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
[pandora:19256] [ 5] IMB-MPI1[0x40b83b]
[pandora:19256] [ 6] IMB-MPI1[0x407155]
[pandora:19256] [ 7] IMB-MPI1[0x4022ea]
[pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
[pandora:19255] [ 2]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
[pandora:19256] [ 9] IMB-MPI1[0x401d49]
[pandora:19256] *** End of error message ***
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
[pandora:19255] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
[pandora:19255] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
[pandora:19255] [ 5] IMB-MPI1[0x40b83b]
[pandora:19255] [ 6] IMB-MPI1[0x407155]
[pandora:19255] [ 7] IMB-MPI1[0x4022ea]
[pandora:19255] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
[pandora:19255] [ 9] IMB-MPI1[0x401d49]
[pandora:19255] *** End of error message ***
[phoebe:12418] *** Process received signal ***
[phoebe:12418] Signal: Segmentation fault (11)
[phoebe:12418] Signal code: Address not mapped (1)
[phoebe:12418] Failing at address: 0x7f5ce27dfff0
[phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
[phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
[phoebe:12418] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
[phoebe:12418] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
[phoebe:12418] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]
[phoeb

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Kawashima, Takahiro
Hello Adam,

IMB had a bug related to Reduce_scatter.

  https://github.com/intel/mpi-benchmarks/pull/11

I'm not sure whether this bug is the cause, but you can try the patch:

  https://github.com/intel/mpi-benchmarks/commit/841446d8cf4ca1f607c0f24b9a424ee39ee1f569
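If your IMB tree predates that commit, applying it could look roughly like
this (the path and the build line are only examples; adjust for your checkout):

  cd /path/to/mpi-benchmarks        # placeholder: your existing IMB source tree
  curl -LO https://github.com/intel/mpi-benchmarks/commit/841446d8cf4ca1f607c0f24b9a424ee39ee1f569.patch
  git apply 841446d8cf4ca1f607c0f24b9a424ee39ee1f569.patch
  make IMB-MPI1 CC=mpicc CXX=mpicxx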

Thanks,
Takahiro Kawashima,
Fujitsu

> Hello,
> 
> When I do a run with OpenMPI v4.0.0 on Infiniband with this command: mpirun
> --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca
> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
> btl_openib_allow_ib 1 -np 6
>  -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
> 
> I get this error:
> 
> #
> # Benchmarking Reduce_scatter
> # #processes = 4
> # ( 2 additional processes waiting in MPI_Barrier)
> #
>        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>             0         1000         0.14         0.15         0.14
>             4         1000         5.00         7.58         6.28
>             8         1000         5.13         7.68         6.41
>            16         1000         5.05         7.74         6.39
>            32         1000         5.43         7.96         6.75
>            64         1000         6.78         8.56         7.69
>           128         1000         7.77         9.55         8.59
>           256         1000         8.28        10.96         9.66
>           512         1000         9.19        12.49        10.85
>          1024         1000        11.78        15.01        13.38
>          2048         1000        17.41        19.51        18.52
>          4096         1000        25.73        28.22        26.89
>          8192         1000        47.75        49.44        48.79
>         16384         1000        81.10        90.15        84.75
>         32768         1000       163.01       178.58       173.19
>         65536          640       315.63       340.51       333.18
>        131072          320       475.48       528.82       510.85
>        262144          160       979.70      1063.81      1035.61
>        524288           80      2070.51      2242.58      2150.15
>       1048576           40      4177.36      4527.25      4431.65
>       2097152           20      8738.08      9340.50      9147.89
> [pandora:04500] *** Process received signal ***
> [pandora:04500] Signal: Segmentation fault (11)
> [pandora:04500] Signal code: Address not mapped (1)
> [pandora:04500] Failing at address: 0x7f310eb0
> [pandora:04499] *** Process received signal ***
> [pandora:04499] Signal: Segmentation fault (11)
> [pandora:04499] Signal code: Address not mapped (1)
> [pandora:04499] Failing at address: 0x7f28b110
> [pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
> [pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
> [pandora:04500] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
> [pandora:04500] [ 3] [pandora:04499] [ 0]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
> [pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
> [pandora:04499] [ 1]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
> [pandora:04500] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04500] [ 6] IMB-MPI1[0x407155]
> [pandora:04500] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
> [pandora:04499] [ 2]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
> [pandora:04500] [ 9] IMB-MPI1[0x401d49]
> [pandora:04500] *** End of error message ***
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
> [pandora:04499] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
> [pandora:04499] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
> [pandora:04499] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04499] [ 6] IMB-MPI1[0x407155]
> [pandora:04499] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04499] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
> [pandora:04499] [ 9] IMB-MPI1[0x401d49]
> [pandora:04499] *** End of error message ***
> [phoebe:03779] *** Process received signal ***
> [phoebe:03779] Signal: Segmentation fault (11)
> [phoebe:03779] Signal code: Address not mapped (1)
> [phoebe:03779] Failing at address: 0x7f483d60
> [phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
> [phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
> [phoebe:03779] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
> [phoebe:03779] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
> [phoebe:03779] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
> 

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread George Bosilca
I was not able to reproduce the issue with openib on 4.0, but instead I get a
random segfault in MPI_Finalize during the grdma cleanup.

I could, however, reproduce the TCP timeout part with both 4.0 and master, on
a pretty sane cluster (only 3 interfaces: lo, eth0 and virbr0). Unsurprisingly,
the timeout was triggered by a busted TCP interface selection mechanism. As
soon as I exclude the virbr0 interface, everything goes back to normal.
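For reference, that exclusion can be done straight on the mpirun command line
with the btl_tcp_if_exclude MCA parameter; note that overriding the default
exclude list means you have to list the loopback interface yourself, e.g.
something like:

  mpirun --mca btl tcp,vader,self --mca btl_tcp_if_exclude lo,virbr0 \
      -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1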

  George.

On Wed, Feb 20, 2019 at 5:20 PM Adam LeBlanc  wrote:

> Hello Howard,
>
> Thanks for all of the help and suggestions; I will look into them. I also
> realized that my ansible wasn't set up properly for handling tar files, so
> the nightly build didn't even install, but I will do it by hand and will give
> you an update tomorrow sometime in the afternoon.
>
> Thanks,
> Adam LeBlanc
>
> On Wed, Feb 20, 2019 at 4:26 PM Howard Pritchard 
> wrote:
>
>> Hello Adam,
>>
>> This helps some.  Could you post the first 20 lines of your config.log?  This
>> will
>> help in trying to reproduce.  The content of your host file (you can use
>> generic
>> names for the nodes if that's an issue to publicize) would also help as
>> the number of nodes and number of MPI processes/node impacts the way
>> the reduce scatter operation works.
>>
>> One thing to note about the openib BTL - it is on life support.   That's
>> why you needed to set btl_openib_allow_ib 1 on the mpirun command line.
>>
>> You may get much better success by installing UCX
>>  and rebuilding Open MPI to use
>> UCX.  You may actually already have UCX installed on your system if
>> a recent version of MOFED is installed.
>>
>> You can check this by running /usr/bin/ofed_rpm_info.  It will show which
>> ucx version has been installed.
>> If UCX is installed, you can add --with-ucx to the Open MPI configuration
>> line and it should build in UCX
>> support.   If Open MPI is built with UCX support, it will by default use
>> UCX for message transport rather than
>> the OpenIB BTL.
>>
>> thanks,
>>
>> Howard
>>
>>
>> Am Mi., 20. Feb. 2019 um 12:49 Uhr schrieb Adam LeBlanc <
>> alebl...@iol.unh.edu>:
>>
>>> On the tcp side it doesn't seg fault anymore, though it will time out on some
>>> tests; on the openib side it will still seg fault. Here is the output:
>>>
>>> [pandora:19256] *** Process received signal ***
>>> [pandora:19256] Signal: Segmentation fault (11)
>>> [pandora:19256] Signal code: Address not mapped (1)
>>> [pandora:19256] Failing at address: 0x7f911c69fff0
>>> [pandora:19255] *** Process received signal ***
>>> [pandora:19255] Signal: Segmentation fault (11)
>>> [pandora:19255] Signal code: Address not mapped (1)
>>> [pandora:19255] Failing at address: 0x7ff09cd3fff0
>>> [pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
>>> [pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
>>> [pandora:19256] [ 2]
>>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
>>> [pandora:19256] [ 3]
>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
>>> [pandora:19256] [ 4] [pandora:19255] [ 0]
>>> /usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
>>> [pandora:19255] [ 1]
>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
>>> [pandora:19256] [ 5] IMB-MPI1[0x40b83b]
>>> [pandora:19256] [ 6] IMB-MPI1[0x407155]
>>> [pandora:19256] [ 7] IMB-MPI1[0x4022ea]
>>> [pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
>>> [pandora:19255] [ 2]
>>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
>>> [pandora:19256] [ 9] IMB-MPI1[0x401d49]
>>> [pandora:19256] *** End of error message ***
>>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
>>> [pandora:19255] [ 3]
>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
>>> [pandora:19255] [ 4]
>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
>>> [pandora:19255] [ 5] IMB-MPI1[0x40b83b]
>>> [pandora:19255] [ 6] IMB-MPI1[0x407155]
>>> [pandora:19255] [ 7] IMB-MPI1[0x4022ea]
>>> [pandora:19255] [ 8]
>>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
>>> [pandora:19255] [ 9] IMB-MPI1[0x401d49]
>>> [pandora:19255] *** End of error message ***
>>> [phoebe:12418] *** Process received signal ***
>>> [phoebe:12418] Signal: Segmentation fault (11)
>>> [phoebe:12418] Signal code: Address not mapped (1)
>>> [phoebe:12418] Failing at address: 0x7f5ce27dfff0
>>> [phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
>>> [phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
>>> [phoebe:12418] [ 2]
>>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
>>> [phoebe:12418] [ 3]
>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
>>> [phoebe:12418] [ 4]
>>> 

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Adam LeBlanc
Hello Howard,

Thanks for all of the help and suggestions; I will look into them. I also
realized that my ansible wasn't set up properly for handling tar files, so
the nightly build didn't even install, but I will do it by hand and will give
you an update tomorrow sometime in the afternoon.

Thanks,
Adam LeBlanc

On Wed, Feb 20, 2019 at 4:26 PM Howard Pritchard 
wrote:

> Hello Adam,
>
> This helps some.  Could you post the first 20 lines of your config.log?  This
> will
> help in trying to reproduce.  The content of your host file (you can use
> generic
> names for the nodes if that's an issue to publicize) would also help as
> the number of nodes and number of MPI processes/node impacts the way
> the reduce scatter operation works.
>
> One thing to note about the openib BTL - it is on life support.   That's
> why you needed to set btl_openib_allow_ib 1 on the mpirun command line.
>
> You may get much better success by installing UCX
>  and rebuilding Open MPI to use
> UCX.  You may actually already have UCX installed on your system if
> a recent version of MOFED is installed.
>
> You can check this by running /usr/bin/ofed_rpm_info.  It will show which
> ucx version has been installed.
> If UCX is installed, you can add --with-ucx to the Open MPI configuration
> line and it should build in UCX
> support.   If Open MPI is built with UCX support, it will by default use
> UCX for message transport rather than
> the OpenIB BTL.
>
> thanks,
>
> Howard
>
>
> Am Mi., 20. Feb. 2019 um 12:49 Uhr schrieb Adam LeBlanc <
> alebl...@iol.unh.edu>:
>
>> On the tcp side it doesn't seg fault anymore, though it will time out on some
>> tests; on the openib side it will still seg fault. Here is the output:
>>
>> [pandora:19256] *** Process received signal ***
>> [pandora:19256] Signal: Segmentation fault (11)
>> [pandora:19256] Signal code: Address not mapped (1)
>> [pandora:19256] Failing at address: 0x7f911c69fff0
>> [pandora:19255] *** Process received signal ***
>> [pandora:19255] Signal: Segmentation fault (11)
>> [pandora:19255] Signal code: Address not mapped (1)
>> [pandora:19255] Failing at address: 0x7ff09cd3fff0
>> [pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
>> [pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
>> [pandora:19256] [ 2]
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
>> [pandora:19256] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
>> [pandora:19256] [ 4] [pandora:19255] [ 0]
>> /usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
>> [pandora:19255] [ 1]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
>> [pandora:19256] [ 5] IMB-MPI1[0x40b83b]
>> [pandora:19256] [ 6] IMB-MPI1[0x407155]
>> [pandora:19256] [ 7] IMB-MPI1[0x4022ea]
>> [pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
>> [pandora:19255] [ 2]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
>> [pandora:19256] [ 9] IMB-MPI1[0x401d49]
>> [pandora:19256] *** End of error message ***
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
>> [pandora:19255] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
>> [pandora:19255] [ 4]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
>> [pandora:19255] [ 5] IMB-MPI1[0x40b83b]
>> [pandora:19255] [ 6] IMB-MPI1[0x407155]
>> [pandora:19255] [ 7] IMB-MPI1[0x4022ea]
>> [pandora:19255] [ 8]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
>> [pandora:19255] [ 9] IMB-MPI1[0x401d49]
>> [pandora:19255] *** End of error message ***
>> [phoebe:12418] *** Process received signal ***
>> [phoebe:12418] Signal: Segmentation fault (11)
>> [phoebe:12418] Signal code: Address not mapped (1)
>> [phoebe:12418] Failing at address: 0x7f5ce27dfff0
>> [phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
>> [phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
>> [phoebe:12418] [ 2]
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
>> [phoebe:12418] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
>> [phoebe:12418] [ 4]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]
>> [phoebe:12418] [ 5] IMB-MPI1[0x40b83b]
>> [phoebe:12418] [ 6] IMB-MPI1[0x407155]
>> [phoebe:12418] [ 7] IMB-MPI1[0x4022ea]
>> [phoebe:12418] [ 8]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5]
>> [phoebe:12418] [ 9] IMB-MPI1[0x401d49]
>> [phoebe:12418] *** End of error message ***
>> --
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> --

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Howard Pritchard
Hello Adam,

This helps some.  Could you post the first 20 lines of your config.log?  This
will help in trying to reproduce.  The content of your host file (you can use
generic names for the nodes if that's an issue to publicize) would also help,
as the number of nodes and the number of MPI processes per node impacts the
way the reduce scatter operation works.

One thing to note about the openib BTL - it is on life support.   That's
why you needed to set btl_openib_allow_ib 1 on the mpirun command line.

You may get much better success by installing UCX
<https://github.com/openucx/ucx/releases> and rebuilding Open MPI to use UCX.
You may actually already have UCX installed on your system if
a recent version of MOFED is installed.

You can check this by running /usr/bin/ofed_rpm_info.  It will show which UCX
version has been installed.
If UCX is installed, you can add --with-ucx to the Open MPI configuration
line and it should build in UCX support.  If Open MPI is built with UCX
support, it will by default use UCX for message transport rather than the
OpenIB BTL.
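In rough terms the steps would be something like the following (the prefix is
only an example taken from the paths in your backtraces, and --with-ucx can
also be given a path, e.g. --with-ucx=/usr, if UCX lives somewhere
non-standard):

  /usr/bin/ofed_rpm_info | grep -i ucx    # see whether MOFED already shipped UCX
  ./configure --prefix=/opt/openmpi/4.0.0 --with-ucx
  make -j 8 all && make install
  mpirun --mca pml ucx -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1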

thanks,

Howard


On Wed, Feb 20, 2019 at 12:49 Adam LeBlanc <alebl...@iol.unh.edu> wrote:

> On the tcp side it doesn't seg fault anymore, though it will time out on some
> tests; on the openib side it will still seg fault. Here is the output:
>
> [pandora:19256] *** Process received signal ***
> [pandora:19256] Signal: Segmentation fault (11)
> [pandora:19256] Signal code: Address not mapped (1)
> [pandora:19256] Failing at address: 0x7f911c69fff0
> [pandora:19255] *** Process received signal ***
> [pandora:19255] Signal: Segmentation fault (11)
> [pandora:19255] Signal code: Address not mapped (1)
> [pandora:19255] Failing at address: 0x7ff09cd3fff0
> [pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
> [pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
> [pandora:19256] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
> [pandora:19256] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
> [pandora:19256] [ 4] [pandora:19255] [ 0]
> /usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
> [pandora:19255] [ 1]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
> [pandora:19256] [ 5] IMB-MPI1[0x40b83b]
> [pandora:19256] [ 6] IMB-MPI1[0x407155]
> [pandora:19256] [ 7] IMB-MPI1[0x4022ea]
> [pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
> [pandora:19255] [ 2]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
> [pandora:19256] [ 9] IMB-MPI1[0x401d49]
> [pandora:19256] *** End of error message ***
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
> [pandora:19255] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
> [pandora:19255] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
> [pandora:19255] [ 5] IMB-MPI1[0x40b83b]
> [pandora:19255] [ 6] IMB-MPI1[0x407155]
> [pandora:19255] [ 7] IMB-MPI1[0x4022ea]
> [pandora:19255] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
> [pandora:19255] [ 9] IMB-MPI1[0x401d49]
> [pandora:19255] *** End of error message ***
> [phoebe:12418] *** Process received signal ***
> [phoebe:12418] Signal: Segmentation fault (11)
> [phoebe:12418] Signal code: Address not mapped (1)
> [phoebe:12418] Failing at address: 0x7f5ce27dfff0
> [phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
> [phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
> [phoebe:12418] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
> [phoebe:12418] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
> [phoebe:12418] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]
> [phoebe:12418] [ 5] IMB-MPI1[0x40b83b]
> [phoebe:12418] [ 6] IMB-MPI1[0x407155]
> [phoebe:12418] [ 7] IMB-MPI1[0x4022ea]
> [phoebe:12418] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5]
> [phoebe:12418] [ 9] IMB-MPI1[0x401d49]
> [phoebe:12418] *** End of error message ***
> --
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --
> --
> mpirun noticed that process rank 0 with PID 0 on node pandora exited on
> signal 11 (Segmentation fault).
> --
>
> - Adam LeBlanc
>
> On Wed, Feb 20, 2019 at 2:08 PM Jeff Squyres (jsquyres) via users <
> users@lists.open-mpi.org> wrote:
>
>> Can you try the latest 4.0.x nightly snapshot and see if the problem
>> still occurs?
>>
>> 

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Adam LeBlanc
On the tcp side it doesn't seg fault anymore, though it will time out on some
tests; on the openib side it will still seg fault. Here is the output:

[pandora:19256] *** Process received signal ***
[pandora:19256] Signal: Segmentation fault (11)
[pandora:19256] Signal code: Address not mapped (1)
[pandora:19256] Failing at address: 0x7f911c69fff0
[pandora:19255] *** Process received signal ***
[pandora:19255] Signal: Segmentation fault (11)
[pandora:19255] Signal code: Address not mapped (1)
[pandora:19255] Failing at address: 0x7ff09cd3fff0
[pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
[pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
[pandora:19256] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
[pandora:19256] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
[pandora:19256] [ 4] [pandora:19255] [ 0]
/usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
[pandora:19255] [ 1]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
[pandora:19256] [ 5] IMB-MPI1[0x40b83b]
[pandora:19256] [ 6] IMB-MPI1[0x407155]
[pandora:19256] [ 7] IMB-MPI1[0x4022ea]
[pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
[pandora:19255] [ 2]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
[pandora:19256] [ 9] IMB-MPI1[0x401d49]
[pandora:19256] *** End of error message ***
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
[pandora:19255] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
[pandora:19255] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
[pandora:19255] [ 5] IMB-MPI1[0x40b83b]
[pandora:19255] [ 6] IMB-MPI1[0x407155]
[pandora:19255] [ 7] IMB-MPI1[0x4022ea]
[pandora:19255] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
[pandora:19255] [ 9] IMB-MPI1[0x401d49]
[pandora:19255] *** End of error message ***
[phoebe:12418] *** Process received signal ***
[phoebe:12418] Signal: Segmentation fault (11)
[phoebe:12418] Signal code: Address not mapped (1)
[phoebe:12418] Failing at address: 0x7f5ce27dfff0
[phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
[phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
[phoebe:12418] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
[phoebe:12418] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
[phoebe:12418] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]
[phoebe:12418] [ 5] IMB-MPI1[0x40b83b]
[phoebe:12418] [ 6] IMB-MPI1[0x407155]
[phoebe:12418] [ 7] IMB-MPI1[0x4022ea]
[phoebe:12418] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5]
[phoebe:12418] [ 9] IMB-MPI1[0x401d49]
[phoebe:12418] *** End of error message ***
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun noticed that process rank 0 with PID 0 on node pandora exited on
signal 11 (Segmentation fault).
--

- Adam LeBlanc

On Wed, Feb 20, 2019 at 2:08 PM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> Can you try the latest 4.0.x nightly snapshot and see if the problem still
> occurs?
>
> https://www.open-mpi.org/nightly/v4.0.x/
>
>
> > On Feb 20, 2019, at 1:40 PM, Adam LeBlanc  wrote:
> >
> > I do; here is the output:
> >
> > 2 total processes killed (some possibly by mpirun during cleanup)
> > [pandora:12238] *** Process received signal ***
> > [pandora:12238] Signal: Segmentation fault (11)
> > [pandora:12238] Signal code: Invalid permissions (2)
> > [pandora:12238] Failing at address: 0x7f5c8e31fff0
> > [pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680]
> > [pandora:12238] [ 1] [pandora:12237] *** Process received signal ***
> > /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0]
> > [pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11)
> > [pandora:12237] Signal code: Invalid permissions (2)
> > [pandora:12237] Failing at address: 0x7f6c4ab3fff0
> > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55]
> > [pandora:12238] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b]
> > [pandora:12238] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7]
> > [pandora:12238] [ 5] IMB-MPI1[0x40b83b]
> > [pandora:12238] [ 6] IMB-MPI1[0x407155]
> > [pandora:12238] [ 7] IMB-MPI1[0x4022ea]
> > [pandora:12238] [ 8]
> 

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Jeff Squyres (jsquyres) via users
Can you try the latest 4.0.x nightly snapshot and see if the problem still 
occurs?

https://www.open-mpi.org/nightly/v4.0.x/
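Roughly (the snapshot file name changes nightly, so substitute whatever
tarball that page currently lists; the prefix is only an example so it
doesn't clobber your existing 4.0.0 install):

  wget https://www.open-mpi.org/nightly/v4.0.x/<latest-snapshot-tarball>
  tar xf <latest-snapshot-tarball> && cd openmpi-v4.0.x-*
  ./configure --prefix=/opt/openmpi/4.0.x-nightly
  make -j 8 all && make install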


> On Feb 20, 2019, at 1:40 PM, Adam LeBlanc  wrote:
> 
> I do; here is the output:
> 
> 2 total processes killed (some possibly by mpirun during cleanup)
> [pandora:12238] *** Process received signal ***
> [pandora:12238] Signal: Segmentation fault (11)
> [pandora:12238] Signal code: Invalid permissions (2)
> [pandora:12238] Failing at address: 0x7f5c8e31fff0
> [pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680]
> [pandora:12238] [ 1] [pandora:12237] *** Process received signal ***
> /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0]
> [pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11)
> [pandora:12237] Signal code: Invalid permissions (2)
> [pandora:12237] Failing at address: 0x7f6c4ab3fff0
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55]
> [pandora:12238] [ 3] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b]
> [pandora:12238] [ 4] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7]
> [pandora:12238] [ 5] IMB-MPI1[0x40b83b]
> [pandora:12238] [ 6] IMB-MPI1[0x407155]
> [pandora:12238] [ 7] IMB-MPI1[0x4022ea]
> [pandora:12238] [ 8] 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5ca1ca23d5]
> [pandora:12238] [ 9] IMB-MPI1[0x401d49]
> [pandora:12238] *** End of error message ***
> [pandora:12237] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6c5e73f680]
> [pandora:12237] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6c5e4ac4a0]
> [pandora:12237] [ 2] 
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6c5dddbe55]
> [pandora:12237] [ 3] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6c5e9f798b]
> [pandora:12237] [ 4] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6c5e9ceda7]
> [pandora:12237] [ 5] IMB-MPI1[0x40b83b]
> [pandora:12237] [ 6] IMB-MPI1[0x407155]
> [pandora:12237] [ 7] IMB-MPI1[0x4022ea]
> [pandora:12237] [ 8] 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6c5e3823d5]
> [pandora:12237] [ 9] IMB-MPI1[0x401d49]
> [pandora:12237] *** End of error message ***
> [phoebe:07408] *** Process received signal ***
> [phoebe:07408] Signal: Segmentation fault (11)
> [phoebe:07408] Signal code: Invalid permissions (2)
> [phoebe:07408] Failing at address: 0x7f6b9ca9fff0
> [titan:07169] *** Process received signal ***
> [titan:07169] Signal: Segmentation fault (11)
> [titan:07169] Signal code: Invalid permissions (2)
> [titan:07169] Failing at address: 0x7fc01295fff0
> [phoebe:07408] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6bb03b7680]
> [phoebe:07408] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6bb01244a0]
> [phoebe:07408] [ 2] [titan:07169] [ 0] 
> /usr/lib64/libpthread.so.0(+0xf680)[0x7fc026117680]
> [titan:07169] [ 1] 
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6bafa53e55]
> [phoebe:07408] [ 3] /usr/lib64/libc.so.6(+0x14c4a0)[0x7fc025e844a0]
> [titan:07169] [ 2] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6bb066f98b]
> [phoebe:07408] [ 4] 
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7fc0257b3e55]
> [titan:07169] [ 3] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6bb0646da7]
> [phoebe:07408] [ 5] IMB-MPI1[0x40b83b]
> [phoebe:07408] [ 6] IMB-MPI1[0x407155]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7fc0263cf98b]
> [titan:07169] [ 4] [phoebe:07408] [ 7] IMB-MPI1[0x4022ea]
> [phoebe:07408] [ 8] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7fc0263a6da7]
> [titan:07169] [ 5] IMB-MPI1[0x40b83b]
> [titan:07169] [ 6] IMB-MPI1[0x407155]
> [titan:07169] [ 7] 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6bafffa3d5]
> [phoebe:07408] [ 9] IMB-MPI1[0x401d49]
> [phoebe:07408] *** End of error message ***
> IMB-MPI1[0x4022ea]
> [titan:07169] [ 8] 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc025d5a3d5]
> [titan:07169] [ 9] IMB-MPI1[0x401d49]
> [titan:07169] *** End of error message ***
> --
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --
> --
> mpirun noticed that process rank 0 with PID 0 on node pandora exited on 
> signal 11 (Segmentation fault).
> --
> 
> 
> - Adam LeBlanc
> 
> On Wed, Feb 20, 2019 at 1:20 PM Howard Pritchard  wrote:
> HI Adam,
> 
> As a sanity check, if you try to use --mca btl self,vader,tcp
> 
> do you still see the segmentation fault?
> 
> Howard
> 
> 
> Am Mi., 20. Feb. 2019 um 08:50 Uhr schrieb Adam LeBlanc 
> :
> 

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Adam LeBlanc
I do; here is the output:

2 total processes killed (some possibly by mpirun during cleanup)
[pandora:12238] *** Process received signal ***
[pandora:12238] Signal: Segmentation fault (11)
[pandora:12238] Signal code: Invalid permissions (2)
[pandora:12238] Failing at address: 0x7f5c8e31fff0
[pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680]
[pandora:12238] [ 1] [pandora:12237] *** Process received signal ***
/usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0]
[pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11)
[pandora:12237] Signal code: Invalid permissions (2)
[pandora:12237] Failing at address: 0x7f6c4ab3fff0
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55]
[pandora:12238] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b]
[pandora:12238] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7]
[pandora:12238] [ 5] IMB-MPI1[0x40b83b]
[pandora:12238] [ 6] IMB-MPI1[0x407155]
[pandora:12238] [ 7] IMB-MPI1[0x4022ea]
[pandora:12238] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5ca1ca23d5]
[pandora:12238] [ 9] IMB-MPI1[0x401d49]
[pandora:12238] *** End of error message ***
[pandora:12237] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6c5e73f680]
[pandora:12237] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6c5e4ac4a0]
[pandora:12237] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6c5dddbe55]
[pandora:12237] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6c5e9f798b]
[pandora:12237] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6c5e9ceda7]
[pandora:12237] [ 5] IMB-MPI1[0x40b83b]
[pandora:12237] [ 6] IMB-MPI1[0x407155]
[pandora:12237] [ 7] IMB-MPI1[0x4022ea]
[pandora:12237] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6c5e3823d5]
[pandora:12237] [ 9] IMB-MPI1[0x401d49]
[pandora:12237] *** End of error message ***
[phoebe:07408] *** Process received signal ***
[phoebe:07408] Signal: Segmentation fault (11)
[phoebe:07408] Signal code: Invalid permissions (2)
[phoebe:07408] Failing at address: 0x7f6b9ca9fff0
[titan:07169] *** Process received signal ***
[titan:07169] Signal: Segmentation fault (11)
[titan:07169] Signal code: Invalid permissions (2)
[titan:07169] Failing at address: 0x7fc01295fff0
[phoebe:07408] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6bb03b7680]
[phoebe:07408] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6bb01244a0]
[phoebe:07408] [ 2] [titan:07169] [ 0]
/usr/lib64/libpthread.so.0(+0xf680)[0x7fc026117680]
[titan:07169] [ 1]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6bafa53e55]
[phoebe:07408] [ 3] /usr/lib64/libc.so.6(+0x14c4a0)[0x7fc025e844a0]
[titan:07169] [ 2]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6bb066f98b]
[phoebe:07408] [ 4]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7fc0257b3e55]
[titan:07169] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6bb0646da7]
[phoebe:07408] [ 5] IMB-MPI1[0x40b83b]
[phoebe:07408] [ 6] IMB-MPI1[0x407155]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7fc0263cf98b]
[titan:07169] [ 4] [phoebe:07408] [ 7] IMB-MPI1[0x4022ea]
[phoebe:07408] [ 8]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7fc0263a6da7]
[titan:07169] [ 5] IMB-MPI1[0x40b83b]
[titan:07169] [ 6] IMB-MPI1[0x407155]
[titan:07169] [ 7]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6bafffa3d5]
[phoebe:07408] [ 9] IMB-MPI1[0x401d49]
[phoebe:07408] *** End of error message ***
IMB-MPI1[0x4022ea]
[titan:07169] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc025d5a3d5]
[titan:07169] [ 9] IMB-MPI1[0x401d49]
[titan:07169] *** End of error message ***
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun noticed that process rank 0 with PID 0 on node pandora exited on
signal 11 (Segmentation fault).
--


- Adam LeBlanc

On Wed, Feb 20, 2019 at 1:20 PM Howard Pritchard 
wrote:

> HI Adam,
>
> As a sanity check, if you try to use --mca btl self,vader,tcp
>
> do you still see the segmentation fault?
>
> Howard
>
>
> Am Mi., 20. Feb. 2019 um 08:50 Uhr schrieb Adam LeBlanc <
> alebl...@iol.unh.edu>:
>
>> Hello,
>>
>> When I do a run with OpenMPI v4.0.0 on Infiniband with this command:
>> mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca
>> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
>> btl_openib_allow_ib 1 -np 6
>>  -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
>>
>> I get this error:
>>
>> 

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Howard Pritchard
Hi Adam,

As a sanity check, if you try to use --mca btl self,vader,tcp

do you still see the segmentation fault?
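That is, your original command line with the IB-specific options dropped, e.g.:

  mpirun --mca btl self,vader,tcp --mca pml ob1 --map-by node -np 6 \
      -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1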

Howard


On Wed, Feb 20, 2019 at 08:50 Adam LeBlanc <alebl...@iol.unh.edu> wrote:

> Hello,
>
> When I do a run with OpenMPI v4.0.0 on Infiniband with this command:
> mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca
> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
> btl_openib_allow_ib 1 -np 6
>  -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
>
> I get this error:
>
> #
> # Benchmarking Reduce_scatter
> # #processes = 4
> # ( 2 additional processes waiting in MPI_Barrier)
> #
>        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>             0         1000         0.14         0.15         0.14
>             4         1000         5.00         7.58         6.28
>             8         1000         5.13         7.68         6.41
>            16         1000         5.05         7.74         6.39
>            32         1000         5.43         7.96         6.75
>            64         1000         6.78         8.56         7.69
>           128         1000         7.77         9.55         8.59
>           256         1000         8.28        10.96         9.66
>           512         1000         9.19        12.49        10.85
>          1024         1000        11.78        15.01        13.38
>          2048         1000        17.41        19.51        18.52
>          4096         1000        25.73        28.22        26.89
>          8192         1000        47.75        49.44        48.79
>         16384         1000        81.10        90.15        84.75
>         32768         1000       163.01       178.58       173.19
>         65536          640       315.63       340.51       333.18
>        131072          320       475.48       528.82       510.85
>        262144          160       979.70      1063.81      1035.61
>        524288           80      2070.51      2242.58      2150.15
>       1048576           40      4177.36      4527.25      4431.65
>       2097152           20      8738.08      9340.50      9147.89
> [pandora:04500] *** Process received signal ***
> [pandora:04500] Signal: Segmentation fault (11)
> [pandora:04500] Signal code: Address not mapped (1)
> [pandora:04500] Failing at address: 0x7f310eb0
> [pandora:04499] *** Process received signal ***
> [pandora:04499] Signal: Segmentation fault (11)
> [pandora:04499] Signal code: Address not mapped (1)
> [pandora:04499] Failing at address: 0x7f28b110
> [pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
> [pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
> [pandora:04500] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
> [pandora:04500] [ 3] [pandora:04499] [ 0]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
> [pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
> [pandora:04499] [ 1]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
> [pandora:04500] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04500] [ 6] IMB-MPI1[0x407155]
> [pandora:04500] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
> [pandora:04499] [ 2]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
> [pandora:04500] [ 9] IMB-MPI1[0x401d49]
> [pandora:04500] *** End of error message ***
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
> [pandora:04499] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
> [pandora:04499] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
> [pandora:04499] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04499] [ 6] IMB-MPI1[0x407155]
> [pandora:04499] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04499] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
> [pandora:04499] [ 9] IMB-MPI1[0x401d49]
> [pandora:04499] *** End of error message ***
> [phoebe:03779] *** Process received signal ***
> [phoebe:03779] Signal: Segmentation fault (11)
> [phoebe:03779] Signal code: Address not mapped (1)
> [phoebe:03779] Failing at address: 0x7f483d60
> [phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
> [phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
> [phoebe:03779] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
> [phoebe:03779] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
> [phoebe:03779] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
> [phoebe:03779] [ 5] IMB-MPI1[0x40b83b]
> [phoebe:03779] [ 6] IMB-MPI1[0x407155]
> [phoebe:03779] [ 7] 

[OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Adam LeBlanc
Hello,

When I do a run with OpenMPI v4.0.0 on Infiniband with this command: mpirun
--mca btl_openib_warn_no_device_params_found 0 --map-by node --mca
orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
btl_openib_allow_ib 1 -np 6
 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1

I get this error:

#
# Benchmarking Reduce_scatter
# #processes = 4
# ( 2 additional processes waiting in MPI_Barrier)
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.14         0.15         0.14
            4         1000         5.00         7.58         6.28
            8         1000         5.13         7.68         6.41
           16         1000         5.05         7.74         6.39
           32         1000         5.43         7.96         6.75
           64         1000         6.78         8.56         7.69
          128         1000         7.77         9.55         8.59
          256         1000         8.28        10.96         9.66
          512         1000         9.19        12.49        10.85
         1024         1000        11.78        15.01        13.38
         2048         1000        17.41        19.51        18.52
         4096         1000        25.73        28.22        26.89
         8192         1000        47.75        49.44        48.79
        16384         1000        81.10        90.15        84.75
        32768         1000       163.01       178.58       173.19
        65536          640       315.63       340.51       333.18
       131072          320       475.48       528.82       510.85
       262144          160       979.70      1063.81      1035.61
       524288           80      2070.51      2242.58      2150.15
      1048576           40      4177.36      4527.25      4431.65
      2097152           20      8738.08      9340.50      9147.89
[pandora:04500] *** Process received signal ***
[pandora:04500] Signal: Segmentation fault (11)
[pandora:04500] Signal code: Address not mapped (1)
[pandora:04500] Failing at address: 0x7f310eb0
[pandora:04499] *** Process received signal ***
[pandora:04499] Signal: Segmentation fault (11)
[pandora:04499] Signal code: Address not mapped (1)
[pandora:04499] Failing at address: 0x7f28b110
[pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
[pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
[pandora:04500] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
[pandora:04500] [ 3] [pandora:04499] [ 0]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
[pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
[pandora:04499] [ 1]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
[pandora:04500] [ 5] IMB-MPI1[0x40b83b]
[pandora:04500] [ 6] IMB-MPI1[0x407155]
[pandora:04500] [ 7] IMB-MPI1[0x4022ea]
[pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
[pandora:04499] [ 2]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
[pandora:04500] [ 9] IMB-MPI1[0x401d49]
[pandora:04500] *** End of error message ***
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
[pandora:04499] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
[pandora:04499] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
[pandora:04499] [ 5] IMB-MPI1[0x40b83b]
[pandora:04499] [ 6] IMB-MPI1[0x407155]
[pandora:04499] [ 7] IMB-MPI1[0x4022ea]
[pandora:04499] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
[pandora:04499] [ 9] IMB-MPI1[0x401d49]
[pandora:04499] *** End of error message ***
[phoebe:03779] *** Process received signal ***
[phoebe:03779] Signal: Segmentation fault (11)
[phoebe:03779] Signal code: Address not mapped (1)
[phoebe:03779] Failing at address: 0x7f483d60
[phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
[phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
[phoebe:03779] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
[phoebe:03779] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
[phoebe:03779] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
[phoebe:03779] [ 5] IMB-MPI1[0x40b83b]
[phoebe:03779] [ 6] IMB-MPI1[0x407155]
[phoebe:03779] [ 7] IMB-MPI1[0x4022ea]
[phoebe:03779] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f485530a3d5]
[phoebe:03779] [ 9] IMB-MPI1[0x401d49]
[phoebe:03779] *** End of error message ***
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been