Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Llolsten Kaonga
Hello Adam,

 

During the InfiniBand Plugfest 34 event last October, we found that mpirun hangs 
on FDR systems when run with the openib BTL option.

 

Yossi Itigin (@Mellanox) suggested that we run using the following options:

--mca btl self,vader --mca pml ucx -x UCX_RC_PATH_MTU=4096
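
For reference, a full command line along those lines might look like the one below. 
This is only a sketch: the process count and host file path are copied from Adam's 
earlier report, and it assumes your Open MPI build has UCX support.

mpirun --mca btl self,vader --mca pml ucx -x UCX_RC_PATH_MTU=4096 \
    -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1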

 

If you are still having trouble, please try the above options (along with Howard’s 
suggestion) and see whether that resolves the issue.

 

Thanks.

--

Llolsten

 

From: users  On Behalf Of Adam LeBlanc
Sent: Wednesday, February 20, 2019 5:18 PM
To: Open MPI Users 
Subject: Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

 

Hello Howard,

 

Thanks for all of the help and suggestions; I will look into them. I also 
realized that my Ansible setup wasn't handling tar files properly, so the 
nightly build didn't even install. I will do the build by hand and give you an 
update tomorrow afternoon.

 

Thanks,

Adam LeBlanc

 

On Wed, Feb 20, 2019 at 4:26 PM Howard Pritchard <hpprit...@gmail.com> wrote:

Hello Adam,

 

This helps some.  Could you post the first 20 lines of your config.log?  This will
help in trying to reproduce.  The content of your host file (you can use generic
names for the nodes if that's an issue to publicize) would also help, as
the number of nodes and the number of MPI processes per node affects the way
the reduce scatter operation works.

One thing to note about the openib BTL: it is on life support.   That's
why you needed to set btl_openib_allow_ib 1 on the mpirun command line.

You may get much better success by installing UCX and rebuilding Open MPI to use UCX.
You may actually already have UCX installed on your system if
a recent version of MOFED is installed.

You can check this by running /usr/bin/ofed_rpm_info.  It will show which UCX 
version has been installed.

If UCX is installed, you can add --with-ucx to the Open MPI configure line 
and it should build in UCX support.   If Open MPI is built with UCX support, 
it will by default use UCX for message transport rather than the openib BTL.
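
Putting those steps together, a minimal sketch of the check-and-rebuild flow 
(prefix, -j value, and the run command are illustrative, not a tested recipe):

# see whether MOFED already shipped a UCX package
/usr/bin/ofed_rpm_info | grep -i ucx

# rebuild Open MPI 4.0.0 against it
cd openmpi-4.0.0
./configure --prefix=/opt/openmpi/4.0.0 --with-ucx
make -j8 && make install

# run with the UCX PML instead of the openib BTL
mpirun --mca pml ucx -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1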

 

thanks,

 

Howard

 

 

On Wed, Feb 20, 2019 at 12:49 PM Adam LeBlanc <alebl...@iol.unh.edu> wrote:

On the TCP side it doesn't segfault anymore, but it will time out on some tests; 
on the openib side it still segfaults. Here is the output:

 

[pandora:19256] *** Process received signal ***

[pandora:19256] Signal: Segmentation fault (11)

[pandora:19256] Signal code: Address not mapped (1)

[pandora:19256] Failing at address: 0x7f911c69fff0

[pandora:19255] *** Process received signal ***

[pandora:19255] Signal: Segmentation fault (11)

[pandora:19255] Signal code: Address not mapped (1)

[pandora:19255] Failing at address: 0x7ff09cd3fff0

[pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]

[pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]

[pandora:19256] [ 2] 
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]

[pandora:19256] [ 3] 
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]

[pandora:19256] [ 4] [pandora:19255] [ 0] 
/usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]

[pandora:19255] [ 1] 
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]

[pandora:19256] [ 5] IMB-MPI1[0x40b83b]

[pandora:19256] [ 6] IMB-MPI1[0x407155]

[pandora:19256] [ 7] IMB-MPI1[0x4022ea]

[pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]

[pandora:19255] [ 2] 
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]

[pandora:19256] [ 9] IMB-MPI1[0x401d49]

[pandora:19256] *** End of error message ***

/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]

[pandora:19255] [ 3] 
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]

[pandora:19255] [ 4] 
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]

[pandora:19255] [ 5] IMB-MPI1[0x40b83b]

[pandora:19255] [ 6] IMB-MPI1[0x407155]

[pandora:19255] [ 7] IMB-MPI1[0x4022ea]

[pandora:19255] [ 8] 
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]

[pandora:19255] [ 9] IMB-MPI1[0x401d49]

[pandora:19255] *** End of error message ***

[phoebe:12418] *** Process received signal ***

[phoebe:12418] Signal: Segmentation fault (11)

[phoebe:12418] Signal code: Address not mapped (1)

[phoebe:12418] Failing at address: 0x7f5ce27dfff0

[phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]

[phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]

[phoebe:12418] [ 2] 
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]

[phoebe:12418] [ 3] 
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]

[phoebe:12418] [ 4] 
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]

[phoebe:12418] [ 5] 

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Kawashima, Takahiro
Hello Adam,

IMB had a bug related to Reduce_scatter.

  https://github.com/intel/mpi-benchmarks/pull/11

I'm not sure this bug is the cause, but you can try the patch:

  https://github.com/intel/mpi-benchmarks/commit/841446d8cf4ca1f607c0f24b9a424ee39ee1f569
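
If your IMB source tree is a git checkout, a hedged sketch of applying that 
commit and rebuilding (the make target and variables may differ across IMB versions):

  # inside the mpi-benchmarks source tree
  curl -L https://github.com/intel/mpi-benchmarks/commit/841446d8cf4ca1f607c0f24b9a424ee39ee1f569.patch | git am
  # rebuild the benchmark with the Open MPI compiler wrappers
  make CC=mpicc CXX=mpicxx IMB-MPI1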

Thanks,
Takahiro Kawashima,
Fujitsu

> Hello,
> 
> When I do a run with OpenMPI v4.0.0 on Infiniband with this command: mpirun
> --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca
> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
> btl_openib_allow_ib 1 -np 6
>  -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
> 
> I get this error:
> 
> #
> # Benchmarking Reduce_scatter
> # #processes = 4
> # ( 2 additional processes waiting in MPI_Barrier)
> #
>       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>            0         1000         0.14         0.15         0.14
>            4         1000         5.00         7.58         6.28
>            8         1000         5.13         7.68         6.41
>           16         1000         5.05         7.74         6.39
>           32         1000         5.43         7.96         6.75
>           64         1000         6.78         8.56         7.69
>          128         1000         7.77         9.55         8.59
>          256         1000         8.28        10.96         9.66
>          512         1000         9.19        12.49        10.85
>         1024         1000        11.78        15.01        13.38
>         2048         1000        17.41        19.51        18.52
>         4096         1000        25.73        28.22        26.89
>         8192         1000        47.75        49.44        48.79
>        16384         1000        81.10        90.15        84.75
>        32768         1000       163.01       178.58       173.19
>        65536          640       315.63       340.51       333.18
>       131072          320       475.48       528.82       510.85
>       262144          160       979.70      1063.81      1035.61
>       524288           80      2070.51      2242.58      2150.15
>      1048576           40      4177.36      4527.25      4431.65
>      2097152           20      8738.08      9340.50      9147.89
> [pandora:04500] *** Process received signal ***
> [pandora:04500] Signal: Segmentation fault (11)
> [pandora:04500] Signal code: Address not mapped (1)
> [pandora:04500] Failing at address: 0x7f310eb0
> [pandora:04499] *** Process received signal ***
> [pandora:04499] Signal: Segmentation fault (11)
> [pandora:04499] Signal code: Address not mapped (1)
> [pandora:04499] Failing at address: 0x7f28b110
> [pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
> [pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
> [pandora:04500] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
> [pandora:04500] [ 3] [pandora:04499] [ 0]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
> [pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
> [pandora:04499] [ 1]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
> [pandora:04500] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04500] [ 6] IMB-MPI1[0x407155]
> [pandora:04500] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
> [pandora:04499] [ 2]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
> [pandora:04500] [ 9] IMB-MPI1[0x401d49]
> [pandora:04500] *** End of error message ***
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
> [pandora:04499] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
> [pandora:04499] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
> [pandora:04499] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04499] [ 6] IMB-MPI1[0x407155]
> [pandora:04499] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04499] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
> [pandora:04499] [ 9] IMB-MPI1[0x401d49]
> [pandora:04499] *** End of error message ***
> [phoebe:03779] *** Process received signal ***
> [phoebe:03779] Signal: Segmentation fault (11)
> [phoebe:03779] Signal code: Address not mapped (1)
> [phoebe:03779] Failing at address: 0x7f483d60
> [phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
> [phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
> [phoebe:03779] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
> [phoebe:03779] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
> [phoebe:03779] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
> 

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-20 Thread Gilles Gouaillardet

Ryan,


as Edgar explained, that could be a compiler issue (fwiw, I am unable to 
reproduce the bug)


You can build Open MPI again and pass --disable-builtin-atomics to the 
configure command line.
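
As an illustration only, reusing the configure line Ryan posted elsewhere in this 
thread with the extra flag added:

../openmpi-3.1.3/configure --prefix=/opt/sw/packages/gcc-4_8/openmpi/3.1.3 \
    --with-pmi --disable-builtin-atomics && \
make -j32 && make install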



That being said, the "Alarm clock" message looks a bit suspicious.

Does it always occur after 20+ minutes of elapsed time?

Is there some mechanism that automatically kills a job if it does not 
write anything to stdout for some time?


A quick way to rule that out is to run

srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1 sleep 1800

and see if that completes or gets killed with the same error message.


You can also use mpirun instead of srun, and even run mpirun outside 
of Slurm


(if your cluster policy allows it, you can for example use mpirun and 
run on the frontend node)
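
For example, a rough mpirun equivalent of the srun test above (the host name and 
slot count are placeholders):

mpirun -np 6 --host node01:6 sleep 1800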



Cheers,


Gilles

On 2/21/2019 3:01 AM, Ryan Novosielski wrote:

Does it make any sense that it seems to work fine when OpenMPI and HDF5 are 
built with GCC 7.4 and GCC 8.2, but /not/ when they are built with 
RHEL-supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5 
build, I did try an XFS filesystem and it didn’t help. GPFS works fine for 
either of the 7.4 and 8.2 builds.

Just as a reminder, since it was reasonably far back in the thread, what I’m 
doing is running the “make check” tests in HDF5 1.10.4, in part because users 
use it, but also because it seems to have a good test suite and I can therefore 
verify the compiler and MPI stack installs. I get very little information, 
apart from it not working and getting that “Alarm clock” message.

I originally suspected I’d somehow built some component of this with a 
host-specific optimization that wasn’t working on some compute nodes. But I 
controlled for that and it didn’t seem to make any difference.

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
  `'


On Feb 18, 2019, at 1:34 PM, Ryan Novosielski  wrote:

It didn’t work any better with XFS, as it happens. Must be something else. I’m 
going to test some more and see if I can narrow it down any, as it seems to me 
that it did work with a different compiler.

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'


On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar  wrote:

While I was working on something else, I let the tests run with Open MPI master 
(which is for parallel I/O equivalent to the upcoming v4.0.1  release), and 
here is what I found for the HDF5 1.10.4 tests on my local desktop:

In the testpar directory, there is in fact one test that fails for both ompio 
and romio321 in exactly the same manner.
I used 6 processes as you did (although I used mpirun directly instead of 
srun...). Of the 13 tests in the testpar directory, 12 pass correctly 
(t_bigio, t_cache, t_cache_image, testphdf5, t_filters_parallel, t_init_term, 
t_mpi, t_pflush2, t_pread, t_prestart, t_pshutdown, t_shapesame).

The one test that officially fails (t_pflush1) actually reports that it 
passed, but then throws a message indicating that MPI_Abort has been called, 
for both ompio and romio. I will try to investigate this test to see what is 
going on.

That being said, your report shows an issue in t_mpi, which passes without 
problems for me. This was however not GPFS; this was an XFS local file system. 
Running the tests on GPFS is on my todo list as well.

Thanks
Edgar




-Original Message-
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
Gabriel, Edgar
Sent: Sunday, February 17, 2019 10:34 AM
To: Open MPI Users 
Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
3.1.3

I will also run our testsuite and the HDF5 testsuite on GPFS, I have access to a
GPFS file system since recently, and will report back on that, but it will take 
a
few days.

Thanks
Edgar


-Original Message-
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
Ryan Novosielski
Sent: Sunday, February 17, 2019 2:37 AM
To: users@lists.open-mpi.org
Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
3.1.3

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

This is on GPFS. I'll try it on XFS to see if it makes any difference.

On 2/16/19 11:57 PM, Gilles Gouaillardet wrote:

Ryan,

What filesystem are you running on ?

Open MPI defaults to the ompio component, except on Lustre
filesystem where ROMIO is used. (if the issue is related to ROMIO,
that can explain why you did not see any difference, in that case,
you might want to try an other filesystem (local filesystem or NFS
for 

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread George Bosilca
I was not able to reproduce the issue with openib on 4.0, but instead I
randomly segfault in MPI_Finalize during the grdma cleanup.

I could however reproduce the TCP timeout part with both 4.0 and master, on
a pretty sane cluster (only 3 interfaces: lo, eth0 and virbr0). Unsurprisingly,
the timeout was triggered by a busted TCP interface selection
mechanism. As soon as I exclude the virbr0 interface, everything goes back
to normal.
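
For anyone hitting the same TCP timeouts, the exclusion can be done with the usual 
MCA parameter, e.g. (interface names differ per system, the host file is a 
placeholder, and lo is normally kept in the exclude list when overriding the default):

mpirun --mca btl tcp,vader,self --mca btl_tcp_if_exclude lo,virbr0 \
    -np 6 -hostfile <your_hostfile> IMB-MPI1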

  George.

On Wed, Feb 20, 2019 at 5:20 PM Adam LeBlanc  wrote:

> Hello Howard,
>
> Thanks for all of the help and suggestions I will look into them. I also
> realized that my ansible wasn't setup properly for handling tar files so
> the nightly build didn't even install, but will do it by hand and will give
> you an update tomorrow somewhere in the afternoon.
>
> Thanks,
> Adam LeBlanc
>
> On Wed, Feb 20, 2019 at 4:26 PM Howard Pritchard 
> wrote:
>
>> Hello Adam,
>>
>> This helps some.  Could you post first 20 lines of you config.log.  This
>> will
>> help in trying to reproduce.  The content of your host file (you can use
>> generic
>> names for the nodes if that'a an issue to publicize) would also help as
>> the number of nodes and number of MPI processes/node impacts the way
>> the reduce scatter operation works.
>>
>> One thing to note about the openib BTL - it is on life support.   That's
>> why you needed to set btl_openib_allow_ib 1 on the mpirun command line.
>>
>> You may get much better success by installing UCX
>>  and rebuilding Open MPI to use
>> UCX.  You may actually already have UCX installed on your system if
>> a recent version of MOFED is installed.
>>
>> You can check this by running /usr/bin/ofed_rpm_info.  It will show which
>> ucx version has been installed.
>> If UCX is installed, you can add --with-ucx to the Open MPi configuration
>> line and it should build in UCX
>> support.   If Open MPI is built with UCX support, it will by default use
>> UCX for message transport rather than
>> the OpenIB BTL.
>>
>> thanks,
>>
>> Howard
>>
>>
>> Am Mi., 20. Feb. 2019 um 12:49 Uhr schrieb Adam LeBlanc <
>> alebl...@iol.unh.edu>:
>>
>>> On tcp side it doesn't seg fault anymore but will timeout on some tests
>>> but on the openib side it will still seg fault, here is the output:
>>>
>>> [pandora:19256] *** Process received signal ***
>>> [pandora:19256] Signal: Segmentation fault (11)
>>> [pandora:19256] Signal code: Address not mapped (1)
>>> [pandora:19256] Failing at address: 0x7f911c69fff0
>>> [pandora:19255] *** Process received signal ***
>>> [pandora:19255] Signal: Segmentation fault (11)
>>> [pandora:19255] Signal code: Address not mapped (1)
>>> [pandora:19255] Failing at address: 0x7ff09cd3fff0
>>> [pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
>>> [pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
>>> [pandora:19256] [ 2]
>>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
>>> [pandora:19256] [ 3]
>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
>>> [pandora:19256] [ 4] [pandora:19255] [ 0]
>>> /usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
>>> [pandora:19255] [ 1]
>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
>>> [pandora:19256] [ 5] IMB-MPI1[0x40b83b]
>>> [pandora:19256] [ 6] IMB-MPI1[0x407155]
>>> [pandora:19256] [ 7] IMB-MPI1[0x4022ea]
>>> [pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
>>> [pandora:19255] [ 2]
>>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
>>> [pandora:19256] [ 9] IMB-MPI1[0x401d49]
>>> [pandora:19256] *** End of error message ***
>>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
>>> [pandora:19255] [ 3]
>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
>>> [pandora:19255] [ 4]
>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
>>> [pandora:19255] [ 5] IMB-MPI1[0x40b83b]
>>> [pandora:19255] [ 6] IMB-MPI1[0x407155]
>>> [pandora:19255] [ 7] IMB-MPI1[0x4022ea]
>>> [pandora:19255] [ 8]
>>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
>>> [pandora:19255] [ 9] IMB-MPI1[0x401d49]
>>> [pandora:19255] *** End of error message ***
>>> [phoebe:12418] *** Process received signal ***
>>> [phoebe:12418] Signal: Segmentation fault (11)
>>> [phoebe:12418] Signal code: Address not mapped (1)
>>> [phoebe:12418] Failing at address: 0x7f5ce27dfff0
>>> [phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
>>> [phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
>>> [phoebe:12418] [ 2]
>>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
>>> [phoebe:12418] [ 3]
>>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
>>> [phoebe:12418] [ 4]
>>> 

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Adam LeBlanc
Hello Howard,

Thanks for all of the help and suggestions; I will look into them. I also
realized that my Ansible setup wasn't handling tar files properly, so
the nightly build didn't even install. I will do the build by hand and give
you an update tomorrow afternoon.

Thanks,
Adam LeBlanc

On Wed, Feb 20, 2019 at 4:26 PM Howard Pritchard 
wrote:

> Hello Adam,
>
> This helps some.  Could you post first 20 lines of you config.log.  This
> will
> help in trying to reproduce.  The content of your host file (you can use
> generic
> names for the nodes if that'a an issue to publicize) would also help as
> the number of nodes and number of MPI processes/node impacts the way
> the reduce scatter operation works.
>
> One thing to note about the openib BTL - it is on life support.   That's
> why you needed to set btl_openib_allow_ib 1 on the mpirun command line.
>
> You may get much better success by installing UCX
>  and rebuilding Open MPI to use
> UCX.  You may actually already have UCX installed on your system if
> a recent version of MOFED is installed.
>
> You can check this by running /usr/bin/ofed_rpm_info.  It will show which
> ucx version has been installed.
> If UCX is installed, you can add --with-ucx to the Open MPi configuration
> line and it should build in UCX
> support.   If Open MPI is built with UCX support, it will by default use
> UCX for message transport rather than
> the OpenIB BTL.
>
> thanks,
>
> Howard
>
>
> Am Mi., 20. Feb. 2019 um 12:49 Uhr schrieb Adam LeBlanc <
> alebl...@iol.unh.edu>:
>
>> On tcp side it doesn't seg fault anymore but will timeout on some tests
>> but on the openib side it will still seg fault, here is the output:
>>
>> [pandora:19256] *** Process received signal ***
>> [pandora:19256] Signal: Segmentation fault (11)
>> [pandora:19256] Signal code: Address not mapped (1)
>> [pandora:19256] Failing at address: 0x7f911c69fff0
>> [pandora:19255] *** Process received signal ***
>> [pandora:19255] Signal: Segmentation fault (11)
>> [pandora:19255] Signal code: Address not mapped (1)
>> [pandora:19255] Failing at address: 0x7ff09cd3fff0
>> [pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
>> [pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
>> [pandora:19256] [ 2]
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
>> [pandora:19256] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
>> [pandora:19256] [ 4] [pandora:19255] [ 0]
>> /usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
>> [pandora:19255] [ 1]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
>> [pandora:19256] [ 5] IMB-MPI1[0x40b83b]
>> [pandora:19256] [ 6] IMB-MPI1[0x407155]
>> [pandora:19256] [ 7] IMB-MPI1[0x4022ea]
>> [pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
>> [pandora:19255] [ 2]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
>> [pandora:19256] [ 9] IMB-MPI1[0x401d49]
>> [pandora:19256] *** End of error message ***
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
>> [pandora:19255] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
>> [pandora:19255] [ 4]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
>> [pandora:19255] [ 5] IMB-MPI1[0x40b83b]
>> [pandora:19255] [ 6] IMB-MPI1[0x407155]
>> [pandora:19255] [ 7] IMB-MPI1[0x4022ea]
>> [pandora:19255] [ 8]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
>> [pandora:19255] [ 9] IMB-MPI1[0x401d49]
>> [pandora:19255] *** End of error message ***
>> [phoebe:12418] *** Process received signal ***
>> [phoebe:12418] Signal: Segmentation fault (11)
>> [phoebe:12418] Signal code: Address not mapped (1)
>> [phoebe:12418] Failing at address: 0x7f5ce27dfff0
>> [phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
>> [phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
>> [phoebe:12418] [ 2]
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
>> [phoebe:12418] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
>> [phoebe:12418] [ 4]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]
>> [phoebe:12418] [ 5] IMB-MPI1[0x40b83b]
>> [phoebe:12418] [ 6] IMB-MPI1[0x407155]
>> [phoebe:12418] [ 7] IMB-MPI1[0x4022ea]
>> [phoebe:12418] [ 8]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5]
>> [phoebe:12418] [ 9] IMB-MPI1[0x401d49]
>> [phoebe:12418] *** End of error message ***
>> --
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> --

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Howard Pritchard
Hello Adam,

This helps some.  Could you post the first 20 lines of your config.log?  This will
help in trying to reproduce.  The content of your host file (you can use generic
names for the nodes if that's an issue to publicize) would also help, as
the number of nodes and the number of MPI processes per node affects the way
the reduce scatter operation works.

One thing to note about the openib BTL - it is on life support.   That's
why you needed to set btl_openib_allow_ib 1 on the mpirun command line.

You may get much better success by installing UCX and rebuilding Open MPI to use
UCX.  You may actually already have UCX installed on your system if
a recent version of MOFED is installed.

You can check this by running /usr/bin/ofed_rpm_info.  It will show which
UCX version has been installed.
If UCX is installed, you can add --with-ucx to the Open MPI configure
line and it should build in UCX
support.   If Open MPI is built with UCX support, it will by default use
UCX for message transport rather than
the openib BTL.

thanks,

Howard


On Wed, Feb 20, 2019 at 12:49 PM Adam LeBlanc <alebl...@iol.unh.edu> wrote:

> On tcp side it doesn't seg fault anymore but will timeout on some tests
> but on the openib side it will still seg fault, here is the output:
>
> [pandora:19256] *** Process received signal ***
> [pandora:19256] Signal: Segmentation fault (11)
> [pandora:19256] Signal code: Address not mapped (1)
> [pandora:19256] Failing at address: 0x7f911c69fff0
> [pandora:19255] *** Process received signal ***
> [pandora:19255] Signal: Segmentation fault (11)
> [pandora:19255] Signal code: Address not mapped (1)
> [pandora:19255] Failing at address: 0x7ff09cd3fff0
> [pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
> [pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
> [pandora:19256] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
> [pandora:19256] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
> [pandora:19256] [ 4] [pandora:19255] [ 0]
> /usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
> [pandora:19255] [ 1]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
> [pandora:19256] [ 5] IMB-MPI1[0x40b83b]
> [pandora:19256] [ 6] IMB-MPI1[0x407155]
> [pandora:19256] [ 7] IMB-MPI1[0x4022ea]
> [pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
> [pandora:19255] [ 2]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
> [pandora:19256] [ 9] IMB-MPI1[0x401d49]
> [pandora:19256] *** End of error message ***
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
> [pandora:19255] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
> [pandora:19255] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
> [pandora:19255] [ 5] IMB-MPI1[0x40b83b]
> [pandora:19255] [ 6] IMB-MPI1[0x407155]
> [pandora:19255] [ 7] IMB-MPI1[0x4022ea]
> [pandora:19255] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
> [pandora:19255] [ 9] IMB-MPI1[0x401d49]
> [pandora:19255] *** End of error message ***
> [phoebe:12418] *** Process received signal ***
> [phoebe:12418] Signal: Segmentation fault (11)
> [phoebe:12418] Signal code: Address not mapped (1)
> [phoebe:12418] Failing at address: 0x7f5ce27dfff0
> [phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
> [phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
> [phoebe:12418] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
> [phoebe:12418] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
> [phoebe:12418] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]
> [phoebe:12418] [ 5] IMB-MPI1[0x40b83b]
> [phoebe:12418] [ 6] IMB-MPI1[0x407155]
> [phoebe:12418] [ 7] IMB-MPI1[0x4022ea]
> [phoebe:12418] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5]
> [phoebe:12418] [ 9] IMB-MPI1[0x401d49]
> [phoebe:12418] *** End of error message ***
> --
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --
> --
> mpirun noticed that process rank 0 with PID 0 on node pandora exited on
> signal 11 (Segmentation fault).
> --
>
> - Adam LeBlanc
>
> On Wed, Feb 20, 2019 at 2:08 PM Jeff Squyres (jsquyres) via users <
> users@lists.open-mpi.org> wrote:
>
>> Can you try the latest 4.0.x nightly snapshot and see if the problem
>> still occurs?
>>
>> 

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Adam LeBlanc
On the TCP side it doesn't segfault anymore, but it will time out on some tests;
on the openib side it still segfaults. Here is the output:

[pandora:19256] *** Process received signal ***
[pandora:19256] Signal: Segmentation fault (11)
[pandora:19256] Signal code: Address not mapped (1)
[pandora:19256] Failing at address: 0x7f911c69fff0
[pandora:19255] *** Process received signal ***
[pandora:19255] Signal: Segmentation fault (11)
[pandora:19255] Signal code: Address not mapped (1)
[pandora:19255] Failing at address: 0x7ff09cd3fff0
[pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
[pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
[pandora:19256] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
[pandora:19256] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
[pandora:19256] [ 4] [pandora:19255] [ 0]
/usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
[pandora:19255] [ 1]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
[pandora:19256] [ 5] IMB-MPI1[0x40b83b]
[pandora:19256] [ 6] IMB-MPI1[0x407155]
[pandora:19256] [ 7] IMB-MPI1[0x4022ea]
[pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
[pandora:19255] [ 2]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
[pandora:19256] [ 9] IMB-MPI1[0x401d49]
[pandora:19256] *** End of error message ***
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
[pandora:19255] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
[pandora:19255] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
[pandora:19255] [ 5] IMB-MPI1[0x40b83b]
[pandora:19255] [ 6] IMB-MPI1[0x407155]
[pandora:19255] [ 7] IMB-MPI1[0x4022ea]
[pandora:19255] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
[pandora:19255] [ 9] IMB-MPI1[0x401d49]
[pandora:19255] *** End of error message ***
[phoebe:12418] *** Process received signal ***
[phoebe:12418] Signal: Segmentation fault (11)
[phoebe:12418] Signal code: Address not mapped (1)
[phoebe:12418] Failing at address: 0x7f5ce27dfff0
[phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
[phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
[phoebe:12418] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
[phoebe:12418] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
[phoebe:12418] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]
[phoebe:12418] [ 5] IMB-MPI1[0x40b83b]
[phoebe:12418] [ 6] IMB-MPI1[0x407155]
[phoebe:12418] [ 7] IMB-MPI1[0x4022ea]
[phoebe:12418] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5]
[phoebe:12418] [ 9] IMB-MPI1[0x401d49]
[phoebe:12418] *** End of error message ***
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun noticed that process rank 0 with PID 0 on node pandora exited on
signal 11 (Segmentation fault).
--

- Adam LeBlanc

On Wed, Feb 20, 2019 at 2:08 PM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> Can you try the latest 4.0.x nightly snapshot and see if the problem still
> occurs?
>
> https://www.open-mpi.org/nightly/v4.0.x/
>
>
> > On Feb 20, 2019, at 1:40 PM, Adam LeBlanc  wrote:
> >
> > I do here is the output:
> >
> > 2 total processes killed (some possibly by mpirun during cleanup)
> > [pandora:12238] *** Process received signal ***
> > [pandora:12238] Signal: Segmentation fault (11)
> > [pandora:12238] Signal code: Invalid permissions (2)
> > [pandora:12238] Failing at address: 0x7f5c8e31fff0
> > [pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680]
> > [pandora:12238] [ 1] [pandora:12237] *** Process received signal ***
> > /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0]
> > [pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11)
> > [pandora:12237] Signal code: Invalid permissions (2)
> > [pandora:12237] Failing at address: 0x7f6c4ab3fff0
> > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55]
> > [pandora:12238] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b]
> > [pandora:12238] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7]
> > [pandora:12238] [ 5] IMB-MPI1[0x40b83b]
> > [pandora:12238] [ 6] IMB-MPI1[0x407155]
> > [pandora:12238] [ 7] IMB-MPI1[0x4022ea]
> > [pandora:12238] [ 8]
> 

Re: [OMPI users] [Request for Cooperation] -- MPI International Survey

2019-02-20 Thread George Bosilca
George,

Thanks for letting us know about this issue; it was a misconfiguration
of the form. I guess we did not realize it, as most of us are
automatically signed in by our browsers. Anyway, thanks for the feedback;
access to the form should now be completely open.

Sorry for the inconvenience,
  George.



On Wed, Feb 20, 2019 at 2:27 PM George Reeke 
wrote:

> On Wed, 2019-02-20 at 13:21 -0500, George Bosilca wrote:
>
> > To obtain representative samples of the MPI community, we have
> > prepared a survey
> >
> >
> https://docs.google.com/forms/d/e/1FAIpQLSd1bDppVODc8nB0BjIXdqSCO_MuEuNAAbBixl4onTchwSQFwg/viewform
> >
> To access the survey, I was asked to create a google login.
> I do not wish to do this and cannot think of any obvious
> reason why this should be connected to the goals of the
> survey.  Can someone explain the purpose of this or
> hopefully change the survey so no login (to anyplace)
> is required.  I do program with open_mpi and would like
> to participate.
> George Reeke
>
>
>

Re: [OMPI users] [Request for Cooperation] -- MPI International Survey

2019-02-20 Thread George Reeke
On Wed, 2019-02-20 at 13:21 -0500, George Bosilca wrote:

> To obtain representative samples of the MPI community, we have
> prepared a survey 
> 
> https://docs.google.com/forms/d/e/1FAIpQLSd1bDppVODc8nB0BjIXdqSCO_MuEuNAAbBixl4onTchwSQFwg/viewform
> 
To access the survey, I was asked to create a Google login.
I do not wish to do this and cannot think of any obvious
reason why this should be connected to the goals of the
survey.  Can someone explain the purpose of this, or
hopefully change the survey so that no login (to any place)
is required?  I do program with Open MPI and would like
to participate.
George Reeke





Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Jeff Squyres (jsquyres) via users
Can you try the latest 4.0.x nightly snapshot and see if the problem still 
occurs?

https://www.open-mpi.org/nightly/v4.0.x/
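
A rough sketch of doing that build by hand (the tarball name is a placeholder; 
pick the current one from the index page, and the prefix is illustrative):

wget https://www.open-mpi.org/nightly/v4.0.x/openmpi-v4.0.x-<date>.tar.bz2
tar xjf openmpi-v4.0.x-<date>.tar.bz2
cd openmpi-v4.0.x-<date>
./configure --prefix=$HOME/openmpi-4.0.x-nightly
make -j8 && make install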


> On Feb 20, 2019, at 1:40 PM, Adam LeBlanc  wrote:
> 
> I do here is the output:
> 
> 2 total processes killed (some possibly by mpirun during cleanup)
> [pandora:12238] *** Process received signal ***
> [pandora:12238] Signal: Segmentation fault (11)
> [pandora:12238] Signal code: Invalid permissions (2)
> [pandora:12238] Failing at address: 0x7f5c8e31fff0
> [pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680]
> [pandora:12238] [ 1] [pandora:12237] *** Process received signal ***
> /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0]
> [pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11)
> [pandora:12237] Signal code: Invalid permissions (2)
> [pandora:12237] Failing at address: 0x7f6c4ab3fff0
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55]
> [pandora:12238] [ 3] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b]
> [pandora:12238] [ 4] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7]
> [pandora:12238] [ 5] IMB-MPI1[0x40b83b]
> [pandora:12238] [ 6] IMB-MPI1[0x407155]
> [pandora:12238] [ 7] IMB-MPI1[0x4022ea]
> [pandora:12238] [ 8] 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5ca1ca23d5]
> [pandora:12238] [ 9] IMB-MPI1[0x401d49]
> [pandora:12238] *** End of error message ***
> [pandora:12237] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6c5e73f680]
> [pandora:12237] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6c5e4ac4a0]
> [pandora:12237] [ 2] 
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6c5dddbe55]
> [pandora:12237] [ 3] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6c5e9f798b]
> [pandora:12237] [ 4] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6c5e9ceda7]
> [pandora:12237] [ 5] IMB-MPI1[0x40b83b]
> [pandora:12237] [ 6] IMB-MPI1[0x407155]
> [pandora:12237] [ 7] IMB-MPI1[0x4022ea]
> [pandora:12237] [ 8] 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6c5e3823d5]
> [pandora:12237] [ 9] IMB-MPI1[0x401d49]
> [pandora:12237] *** End of error message ***
> [phoebe:07408] *** Process received signal ***
> [phoebe:07408] Signal: Segmentation fault (11)
> [phoebe:07408] Signal code: Invalid permissions (2)
> [phoebe:07408] Failing at address: 0x7f6b9ca9fff0
> [titan:07169] *** Process received signal ***
> [titan:07169] Signal: Segmentation fault (11)
> [titan:07169] Signal code: Invalid permissions (2)
> [titan:07169] Failing at address: 0x7fc01295fff0
> [phoebe:07408] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6bb03b7680]
> [phoebe:07408] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6bb01244a0]
> [phoebe:07408] [ 2] [titan:07169] [ 0] 
> /usr/lib64/libpthread.so.0(+0xf680)[0x7fc026117680]
> [titan:07169] [ 1] 
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6bafa53e55]
> [phoebe:07408] [ 3] /usr/lib64/libc.so.6(+0x14c4a0)[0x7fc025e844a0]
> [titan:07169] [ 2] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6bb066f98b]
> [phoebe:07408] [ 4] 
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7fc0257b3e55]
> [titan:07169] [ 3] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6bb0646da7]
> [phoebe:07408] [ 5] IMB-MPI1[0x40b83b]
> [phoebe:07408] [ 6] IMB-MPI1[0x407155]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7fc0263cf98b]
> [titan:07169] [ 4] [phoebe:07408] [ 7] IMB-MPI1[0x4022ea]
> [phoebe:07408] [ 8] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7fc0263a6da7]
> [titan:07169] [ 5] IMB-MPI1[0x40b83b]
> [titan:07169] [ 6] IMB-MPI1[0x407155]
> [titan:07169] [ 7] 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6bafffa3d5]
> [phoebe:07408] [ 9] IMB-MPI1[0x401d49]
> [phoebe:07408] *** End of error message ***
> IMB-MPI1[0x4022ea]
> [titan:07169] [ 8] 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc025d5a3d5]
> [titan:07169] [ 9] IMB-MPI1[0x401d49]
> [titan:07169] *** End of error message ***
> --
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --
> --
> mpirun noticed that process rank 0 with PID 0 on node pandora exited on 
> signal 11 (Segmentation fault).
> --
> 
> 
> - Adam LeBlanc
> 
> On Wed, Feb 20, 2019 at 1:20 PM Howard Pritchard  wrote:
> HI Adam,
> 
> As a sanity check, if you try to use --mca btl self,vader,tcp
> 
> do you still see the segmentation fault?
> 
> Howard
> 
> 
> Am Mi., 20. Feb. 2019 um 08:50 Uhr schrieb Adam LeBlanc 
> :
> 

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-20 Thread Ryan Novosielski
This is what I did for my build — not much going on there:

../openmpi-3.1.3/configure --prefix=/opt/sw/packages/gcc-4_8/openmpi/3.1.3 
--with-pmi && \
make -j32

We have a mixture of types of Infiniband, using the RHEL-supplied Infiniband 
packages.

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Feb 20, 2019, at 1:46 PM, Gabriel, Edgar  wrote:
> 
> Well, the way you describe it, it sounds to me like maybe an atomic issue 
> with this compiler version. What was your configure line of Open MPI, and 
> what network interconnect are you using?
> 
> An easy way to test this theory would be to force OpenMPI to use the tcp 
> interfaces (everything will be slow however). You can do that by creating in 
> your home directory a directory called .openmpi, and add there a file called 
> mca-params.conf
> 
> The file should look something like this:
> 
> btl = tcp,self
> 
> 
> 
> Thanks
> Edgar
> 
> 
> 
>> -Original Message-
>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ryan
>> Novosielski
>> Sent: Wednesday, February 20, 2019 12:02 PM
>> To: Open MPI Users 
>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
>> 3.1.3
>> 
>> Does it make any sense that it seems to work fine when OpenMPI and HDF5
>> are built with GCC 7.4 and GCC 8.2, but /not/ when they are built with RHEL-
>> supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5 build,
>> I did try an XFS filesystem and it didn’t help. GPFS works fine for either 
>> of the
>> 7.4 and 8.2 builds.
>> 
>> Just as a reminder, since it was reasonably far back in the thread, what I’m
>> doing is running the “make check” tests in HDF5 1.10.4, in part because users
>> use it, but also because it seems to have a good test suite and I can 
>> therefore
>> verify the compiler and MPI stack installs. I get very little information, 
>> apart
>> from it not working and getting that “Alarm clock” message.
>> 
>> I originally suspected I’d somehow built some component of this with a host-
>> specific optimization that wasn’t working on some compute nodes. But I
>> controlled for that and it didn’t seem to make any difference.
>> 
>> --
>> 
>> || \\UTGERS,  
>> |---*O*---
>> ||_// the State   | Ryan Novosielski - novos...@rutgers.edu
>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> ||  \\of NJ   | Office of Advanced Research Computing - MSB C630,
>> Newark
>> `'
>> 
>>> On Feb 18, 2019, at 1:34 PM, Ryan Novosielski 
>> wrote:
>>> 
>>> It didn’t work any better with XFS, as it happens. Must be something else.
>> I’m going to test some more and see if I can narrow it down any, as it seems
>> to me that it did work with a different compiler.
>>> 
>>> --
>>> 
>>> || \\UTGERS, 
>>> |---*O*---
>>> ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
>>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS
>> Campus
>>> ||  \\of NJ  | Office of Advanced Research Computing - MSB C630,
>> Newark
>>>`'
>>> 
 On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar 
>> wrote:
 
 While I was working on something else, I let the tests run with Open MPI
>> master (which is for parallel I/O equivalent to the upcoming v4.0.1  
>> release),
>> and here is what I found for the HDF5 1.10.4 tests on my local desktop:
 
 In the testpar directory, there is in fact one test that fails for both 
 ompio
>> and romio321 in exactly the same manner.
 I used 6 processes as you did (although I used mpirun directly  instead of
>> srun...) From the 13 tests in the testpar directory, 12 pass correctly 
>> (t_bigio,
>> t_cache, t_cache_image, testphdf5, t_filters_parallel, t_init_term, t_mpi,
>> t_pflush2, t_pread, t_prestart, t_pshutdown, t_shapesame).
 
 The one tests that officially fails ( t_pflush1) actually reports that it 
 passed,
>> but then throws message that indicates that MPI_Abort has been called, for
>> both ompio and romio. I will try to investigate this test to see what is 
>> going
>> on.
 
 That being said, your report shows an issue in t_mpi, which passes
>> without problems for me. This is however not GPFS, this was an XFS local file
>> system. Running the tests on GPFS are on my todo list as well.
 
 Thanks
 Edgar
 
 
 
> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
> Gabriel, Edgar
> Sent: Sunday, February 17, 2019 10:34 AM
> To: Open MPI Users 
> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" 

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-20 Thread Gabriel, Edgar
Well, the way you describe it, it sounds to me like maybe an atomics issue with 
this compiler version. What was your configure line for Open MPI, and what 
network interconnect are you using?

An easy way to test this theory would be to force Open MPI to use the tcp 
interfaces (everything will be slow, however). You can do that by creating a 
directory called .openmpi in your home directory, and adding there a file called 
mca-params.conf

The file should look something like this:

btl = tcp,self
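
A quick way to put that file in place (a sketch):

mkdir -p ~/.openmpi
printf 'btl = tcp,self\n' > ~/.openmpi/mca-params.conf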



Thanks
Edgar



> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ryan
> Novosielski
> Sent: Wednesday, February 20, 2019 12:02 PM
> To: Open MPI Users 
> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
> 3.1.3
> 
> Does it make any sense that it seems to work fine when OpenMPI and HDF5
> are built with GCC 7.4 and GCC 8.2, but /not/ when they are built with RHEL-
> supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5 build,
> I did try an XFS filesystem and it didn’t help. GPFS works fine for either of 
> the
> 7.4 and 8.2 builds.
> 
> Just as a reminder, since it was reasonably far back in the thread, what I’m
> doing is running the “make check” tests in HDF5 1.10.4, in part because users
> use it, but also because it seems to have a good test suite and I can 
> therefore
> verify the compiler and MPI stack installs. I get very little information, 
> apart
> from it not working and getting that “Alarm clock” message.
> 
> I originally suspected I’d somehow built some component of this with a host-
> specific optimization that wasn’t working on some compute nodes. But I
> controlled for that and it didn’t seem to make any difference.
> 
> --
> 
> || \\UTGERS,   
> |---*O*---
> ||_// the State| Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\of NJ| Office of Advanced Research Computing - MSB C630,
> Newark
>  `'
> 
> > On Feb 18, 2019, at 1:34 PM, Ryan Novosielski 
> wrote:
> >
> > It didn’t work any better with XFS, as it happens. Must be something else.
> I’m going to test some more and see if I can narrow it down any, as it seems
> to me that it did work with a different compiler.
> >
> > --
> > 
> > || \\UTGERS, 
> > |---*O*---
> > ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
> > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS
> Campus
> > ||  \\of NJ  | Office of Advanced Research Computing - MSB C630,
> Newark
> > `'
> >
> >> On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar 
> wrote:
> >>
> >> While I was working on something else, I let the tests run with Open MPI
> master (which is for parallel I/O equivalent to the upcoming v4.0.1  release),
> and here is what I found for the HDF5 1.10.4 tests on my local desktop:
> >>
> >> In the testpar directory, there is in fact one test that fails for both 
> >> ompio
> and romio321 in exactly the same manner.
> >> I used 6 processes as you did (although I used mpirun directly  instead of
> srun...) From the 13 tests in the testpar directory, 12 pass correctly 
> (t_bigio,
> t_cache, t_cache_image, testphdf5, t_filters_parallel, t_init_term, t_mpi,
> t_pflush2, t_pread, t_prestart, t_pshutdown, t_shapesame).
> >>
> >> The one tests that officially fails ( t_pflush1) actually reports that it 
> >> passed,
> but then throws message that indicates that MPI_Abort has been called, for
> both ompio and romio. I will try to investigate this test to see what is going
> on.
> >>
> >> That being said, your report shows an issue in t_mpi, which passes
> without problems for me. This is however not GPFS, this was an XFS local file
> system. Running the tests on GPFS are on my todo list as well.
> >>
> >> Thanks
> >> Edgar
> >>
> >>
> >>
> >>> -Original Message-
> >>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
> >>> Gabriel, Edgar
> >>> Sent: Sunday, February 17, 2019 10:34 AM
> >>> To: Open MPI Users 
> >>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems
> >>> w/OpenMPI
> >>> 3.1.3
> >>>
> >>> I will also run our testsuite and the HDF5 testsuite on GPFS, I have
> >>> access to a GPFS file system since recently, and will report back on
> >>> that, but it will take a few days.
> >>>
> >>> Thanks
> >>> Edgar
> >>>
>  -Original Message-
>  From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
>  Ryan Novosielski
>  Sent: Sunday, February 17, 2019 2:37 AM
>  To: users@lists.open-mpi.org
>  Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems
>  w/OpenMPI
>  3.1.3
> 
>  -BEGIN PGP SIGNED MESSAGE-
>  Hash: SHA1
> 
>  This is on GPFS. I'll try it on XFS to see if it makes any difference.
> 

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Adam LeBlanc
I do; here is the output:

2 total processes killed (some possibly by mpirun during cleanup)
[pandora:12238] *** Process received signal ***
[pandora:12238] Signal: Segmentation fault (11)
[pandora:12238] Signal code: Invalid permissions (2)
[pandora:12238] Failing at address: 0x7f5c8e31fff0
[pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680]
[pandora:12238] [ 1] [pandora:12237] *** Process received signal ***
/usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0]
[pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11)
[pandora:12237] Signal code: Invalid permissions (2)
[pandora:12237] Failing at address: 0x7f6c4ab3fff0
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55]
[pandora:12238] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b]
[pandora:12238] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7]
[pandora:12238] [ 5] IMB-MPI1[0x40b83b]
[pandora:12238] [ 6] IMB-MPI1[0x407155]
[pandora:12238] [ 7] IMB-MPI1[0x4022ea]
[pandora:12238] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5ca1ca23d5]
[pandora:12238] [ 9] IMB-MPI1[0x401d49]
[pandora:12238] *** End of error message ***
[pandora:12237] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6c5e73f680]
[pandora:12237] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6c5e4ac4a0]
[pandora:12237] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6c5dddbe55]
[pandora:12237] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6c5e9f798b]
[pandora:12237] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6c5e9ceda7]
[pandora:12237] [ 5] IMB-MPI1[0x40b83b]
[pandora:12237] [ 6] IMB-MPI1[0x407155]
[pandora:12237] [ 7] IMB-MPI1[0x4022ea]
[pandora:12237] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6c5e3823d5]
[pandora:12237] [ 9] IMB-MPI1[0x401d49]
[pandora:12237] *** End of error message ***
[phoebe:07408] *** Process received signal ***
[phoebe:07408] Signal: Segmentation fault (11)
[phoebe:07408] Signal code: Invalid permissions (2)
[phoebe:07408] Failing at address: 0x7f6b9ca9fff0
[titan:07169] *** Process received signal ***
[titan:07169] Signal: Segmentation fault (11)
[titan:07169] Signal code: Invalid permissions (2)
[titan:07169] Failing at address: 0x7fc01295fff0
[phoebe:07408] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6bb03b7680]
[phoebe:07408] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6bb01244a0]
[phoebe:07408] [ 2] [titan:07169] [ 0]
/usr/lib64/libpthread.so.0(+0xf680)[0x7fc026117680]
[titan:07169] [ 1]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6bafa53e55]
[phoebe:07408] [ 3] /usr/lib64/libc.so.6(+0x14c4a0)[0x7fc025e844a0]
[titan:07169] [ 2]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6bb066f98b]
[phoebe:07408] [ 4]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7fc0257b3e55]
[titan:07169] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6bb0646da7]
[phoebe:07408] [ 5] IMB-MPI1[0x40b83b]
[phoebe:07408] [ 6] IMB-MPI1[0x407155]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7fc0263cf98b]
[titan:07169] [ 4] [phoebe:07408] [ 7] IMB-MPI1[0x4022ea]
[phoebe:07408] [ 8]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7fc0263a6da7]
[titan:07169] [ 5] IMB-MPI1[0x40b83b]
[titan:07169] [ 6] IMB-MPI1[0x407155]
[titan:07169] [ 7]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6bafffa3d5]
[phoebe:07408] [ 9] IMB-MPI1[0x401d49]
[phoebe:07408] *** End of error message ***
IMB-MPI1[0x4022ea]
[titan:07169] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc025d5a3d5]
[titan:07169] [ 9] IMB-MPI1[0x401d49]
[titan:07169] *** End of error message ***
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun noticed that process rank 0 with PID 0 on node pandora exited on
signal 11 (Segmentation fault).
--


- Adam LeBlanc

On Wed, Feb 20, 2019 at 1:20 PM Howard Pritchard 
wrote:

> HI Adam,
>
> As a sanity check, if you try to use --mca btl self,vader,tcp
>
> do you still see the segmentation fault?
>
> Howard
>
>
> Am Mi., 20. Feb. 2019 um 08:50 Uhr schrieb Adam LeBlanc <
> alebl...@iol.unh.edu>:
>
>> Hello,
>>
>> When I do a run with OpenMPI v4.0.0 on Infiniband with this command:
>> mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca
>> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
>> btl_openib_allow_ib 1 -np 6
>>  -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
>>
>> I get this error:
>>
>> 

[OMPI users] [Request for Cooperation] -- MPI International Survey

2019-02-20 Thread George Bosilca
Dear colleagues,

As part of a wide-ranging effort to understand the current usage of the
Message Passing Interface (MPI) in the development of parallel applications
and to drive future additions to the MPI standard, an international team is
seeking feedback from the largest possible MPI audience (past, current, and
potential users on the globe) to obtain a better understanding of their
needs and to understand the impact of different MPI capabilities on the
development of distributed applications.

To obtain representative samples of the MPI community, we have prepared a
survey

https://docs.google.com/forms/d/e/1FAIpQLSd1bDppVODc8nB0BjIXdqSCO_MuEuNAAbBixl4onTchwSQFwg/viewform

that specifically targets all potential MPI users, including those in the
public, education, research and engineering domains--from undergraduate and
graduate students and postdocs to seasoned researchers and engineers.

The information gathered will be used to publish a comprehensive report of
the different use cases and potential areas of opportunity. These results
will be made freely available, and the raw data, the scripts to manipulate
it, as well as the resulting analysis will be, in time, posted on github [1],
while the curated results will be available via github pages [2].

For anyone interested in participating in the survey, we sincerely
appreciate your feedback.  The survey is rather short (about 30 easy
questions), and should not take more than 15 minutes to complete.  In
addition to your participation, we would appreciate if you re-distribute
this e-mail to your domestic/local communities.

Important Date -- This survey will be closed by the end of February 2019.

Questions? -- Please send any queries about this MPI survey to the sender
of this email.


Thank you on behalf of the International MPI Survey,
George Bosilca (UT/ICL)
Geoffroy Vallee (ORNL)
Emmanuel Jeannot (Inria)
Atsushi Hori (RIKEN)
Takahiro Ogura (RIKEN)

[1] https://github.com/bosilca/MPIsurvey/
[2] https://bosilca.github.io/MPIsurvey/
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Howard Pritchard
HI Adam,

As a sanity check, if you try to use --mca btl self,vader,tcp

do you still see the segmentation fault?

Howard
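
(Spelled out against the invocation quoted below, that sanity check would look something like the following; a sketch only: tcp replaces openib in the BTL list, so the openib-specific MCA flags are dropped.

mpirun --map-by node --mca orte_base_help_aggregate 0 \
    --mca btl self,vader,tcp --mca pml ob1 -np 6 \
    -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1 )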


On Wed, Feb 20, 2019 at 08:50, Adam LeBlanc <
alebl...@iol.unh.edu> wrote:

> Hello,
>
> When I do a run with OpenMPI v4.0.0 on Infiniband with this command:
> mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca
> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
> btl_openib_allow_ib 1 -np 6
>  -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
>
> I get this error:
>
> #
> # Benchmarking Reduce_scatter
> # #processes = 4
> # ( 2 additional processes waiting in MPI_Barrier)
> #
>        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>             0         1000         0.14         0.15         0.14
>             4         1000         5.00         7.58         6.28
>             8         1000         5.13         7.68         6.41
>            16         1000         5.05         7.74         6.39
>            32         1000         5.43         7.96         6.75
>            64         1000         6.78         8.56         7.69
>           128         1000         7.77         9.55         8.59
>           256         1000         8.28        10.96         9.66
>           512         1000         9.19        12.49        10.85
>          1024         1000        11.78        15.01        13.38
>          2048         1000        17.41        19.51        18.52
>          4096         1000        25.73        28.22        26.89
>          8192         1000        47.75        49.44        48.79
>         16384         1000        81.10        90.15        84.75
>         32768         1000       163.01       178.58       173.19
>         65536          640       315.63       340.51       333.18
>        131072          320       475.48       528.82       510.85
>        262144          160       979.70      1063.81      1035.61
>        524288           80      2070.51      2242.58      2150.15
>       1048576           40      4177.36      4527.25      4431.65
>       2097152           20      8738.08      9340.50      9147.89
> [pandora:04500] *** Process received signal ***
> [pandora:04500] Signal: Segmentation fault (11)
> [pandora:04500] Signal code: Address not mapped (1)
> [pandora:04500] Failing at address: 0x7f310eb0
> [pandora:04499] *** Process received signal ***
> [pandora:04499] Signal: Segmentation fault (11)
> [pandora:04499] Signal code: Address not mapped (1)
> [pandora:04499] Failing at address: 0x7f28b110
> [pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
> [pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
> [pandora:04500] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
> [pandora:04500] [ 3] [pandora:04499] [ 0]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
> [pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
> [pandora:04499] [ 1]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
> [pandora:04500] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04500] [ 6] IMB-MPI1[0x407155]
> [pandora:04500] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
> [pandora:04499] [ 2]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
> [pandora:04500] [ 9] IMB-MPI1[0x401d49]
> [pandora:04500] *** End of error message ***
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
> [pandora:04499] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
> [pandora:04499] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
> [pandora:04499] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04499] [ 6] IMB-MPI1[0x407155]
> [pandora:04499] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04499] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
> [pandora:04499] [ 9] IMB-MPI1[0x401d49]
> [pandora:04499] *** End of error message ***
> [phoebe:03779] *** Process received signal ***
> [phoebe:03779] Signal: Segmentation fault (11)
> [phoebe:03779] Signal code: Address not mapped (1)
> [phoebe:03779] Failing at address: 0x7f483d60
> [phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
> [phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
> [phoebe:03779] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
> [phoebe:03779] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
> [phoebe:03779] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
> [phoebe:03779] [ 5] IMB-MPI1[0x40b83b]
> [phoebe:03779] [ 6] IMB-MPI1[0x407155]
> [phoebe:03779] [ 7] 

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-20 Thread Ryan Novosielski
Does it make any sense that it seems to work fine when OpenMPI and HDF5 are
built with GCC 7.4 and GCC 8.2, but /not/ when they are built with the
RHEL-supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5
build, I did try an XFS filesystem and it didn't help. GPFS works fine for
either the 7.4 or the 8.2 build.

Just as a reminder, since it was reasonably far back in the thread, what I'm
doing is running the "make check" tests in HDF5 1.10.4, partly because users
use it, but also because it seems to have a good test suite and so lets me
verify the compiler and MPI stack installs. I get very little information,
apart from the failure itself and that "Alarm clock" message.

I originally suspected I’d somehow built some component of this with a 
host-specific optimization that wasn’t working on some compute nodes. But I 
controlled for that and it didn’t seem to make any difference.

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'
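
For concreteness, a minimal sketch of the kind of build-and-check cycle described above; the install prefix, process count, and the RUNPARALLEL launcher setting are illustrative assumptions, not settings taken from this thread:

# configure HDF5 1.10.4 against the compiler/MPI pair under test;
# RUNPARALLEL is assumed here to be the launcher HDF5's parallel tests use
CC=mpicc RUNPARALLEL="mpirun -np 6" ./configure --enable-parallel \
    --prefix=$HOME/hdf5-1.10.4-check
make -j 8
# run only the parallel tests
cd testpar
make check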

> On Feb 18, 2019, at 1:34 PM, Ryan Novosielski  wrote:
> 
> It didn’t work any better with XFS, as it happens. Must be something else. 
> I’m going to test some more and see if I can narrow it down any, as it seems 
> to me that it did work with a different compiler.
> 
> --
> 
> || \\UTGERS,   
> |---*O*---
> ||_// the State| Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\of NJ| Office of Advanced Research Computing - MSB C630, 
> Newark
> `'
> 
>> On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar  wrote:
>> 
>> While I was working on something else, I let the tests run with Open MPI 
>> master (which is for parallel I/O equivalent to the upcoming v4.0.1  
>> release), and here is what I found for the HDF5 1.10.4 tests on my local 
>> desktop:
>> 
>> In the testpar directory, there is in fact one test that fails for both 
>> ompio and romio321 in exactly the same manner.
>> I used 6 processes as you did (although I used mpirun directly instead of
>> srun...). Of the 13 tests in the testpar directory, 12 pass correctly
>> (t_bigio, t_cache, t_cache_image, testphdf5, t_filters_parallel,
>> t_init_term, t_mpi, t_pflush2, t_pread, t_prestart, t_pshutdown,
>> t_shapesame).
>> 
>> The one test that officially fails (t_pflush1) actually reports that it
>> passed, but then throws a message indicating that MPI_Abort has been
>> called, for both ompio and romio. I will try to investigate this test to see
>> what is going on.
>> 
>> That being said, your report shows an issue in t_mpi, which passes without
>> problems for me. This was, however, not on GPFS but on a local XFS file
>> system. Running the tests on GPFS is on my todo list as well.
>> 
>> Thanks
>> Edgar
>> 
>> 
>> 
>>> -Original Message-
>>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
>>> Gabriel, Edgar
>>> Sent: Sunday, February 17, 2019 10:34 AM
>>> To: Open MPI Users 
>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
>>> 3.1.3
>>> 
>>> I will also run our testsuite and the HDF5 testsuite on GPFS, I have access 
>>> to a
>>> GPFS file system since recently, and will report back on that, but it will 
>>> take a
>>> few days.
>>> 
>>> Thanks
>>> Edgar
>>> 
 -Original Message-
 From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
 Ryan Novosielski
 Sent: Sunday, February 17, 2019 2:37 AM
 To: users@lists.open-mpi.org
 Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
 3.1.3
 
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 This is on GPFS. I'll try it on XFS to see if it makes any difference.
 
 On 2/16/19 11:57 PM, Gilles Gouaillardet wrote:
> Ryan,
> 
> What filesystem are you running on?
> 
> Open MPI defaults to the ompio component, except on Lustre
> filesystems, where ROMIO is used. (If the issue is related to ROMIO,
> that could explain why you did not see any difference; in that case,
> you might want to try another filesystem, such as a local filesystem
> or NFS.)
> 
> 
> Cheers,
> 
> Gilles
> 
> On Sun, Feb 17, 2019 at 3:08 AM Ryan Novosielski
>  wrote:
>> 
>> I verified that it makes it through to a bash prompt, but I'm a
>> little less confident that nothing "make test" does clears it along the way.
>> Any recommendation for a way to verify?
>> 
>> In any case, no change, unfortunately.
>> 
>> Sent from my iPhone
>> 
>>> On Feb 16, 2019, at 08:13, Gabriel, Edgar
>>> 
>>> wrote:
>>> 
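
To follow up on Edgar's and Gilles's notes above, the io component can be pinned explicitly so that ompio and ROMIO runs of a single failing test can be compared directly. A sketch only; the ROMIO component name differs between Open MPI releases (e.g. romio314 in the 3.1.x series, romio321 on current master per Edgar's message), so list what your build actually ships first:

# show which MCA io components this Open MPI build provides
ompi_info | grep "MCA io"

# from the HDF5 testpar directory, run one test under each component
# (component names below are examples; use the ones ompi_info reports)
mpirun --mca io ompio    -np 6 ./t_mpi
mpirun --mca io romio314 -np 6 ./t_mpi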

[OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Adam LeBlanc
Hello,

When I do a run with OpenMPI v4.0.0 on Infiniband with this command: mpirun
--mca btl_openib_warn_no_device_params_found 0 --map-by node --mca
orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
btl_openib_allow_ib 1 -np 6
 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1

I get this error:

#
# Benchmarking Reduce_scatter
# #processes = 4
# ( 2 additional processes waiting in MPI_Barrier)
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.14         0.15         0.14
            4         1000         5.00         7.58         6.28
            8         1000         5.13         7.68         6.41
           16         1000         5.05         7.74         6.39
           32         1000         5.43         7.96         6.75
           64         1000         6.78         8.56         7.69
          128         1000         7.77         9.55         8.59
          256         1000         8.28        10.96         9.66
          512         1000         9.19        12.49        10.85
         1024         1000        11.78        15.01        13.38
         2048         1000        17.41        19.51        18.52
         4096         1000        25.73        28.22        26.89
         8192         1000        47.75        49.44        48.79
        16384         1000        81.10        90.15        84.75
        32768         1000       163.01       178.58       173.19
        65536          640       315.63       340.51       333.18
       131072          320       475.48       528.82       510.85
       262144          160       979.70      1063.81      1035.61
       524288           80      2070.51      2242.58      2150.15
      1048576           40      4177.36      4527.25      4431.65
      2097152           20      8738.08      9340.50      9147.89
[pandora:04500] *** Process received signal ***
[pandora:04500] Signal: Segmentation fault (11)
[pandora:04500] Signal code: Address not mapped (1)
[pandora:04500] Failing at address: 0x7f310eb0
[pandora:04499] *** Process received signal ***
[pandora:04499] Signal: Segmentation fault (11)
[pandora:04499] Signal code: Address not mapped (1)
[pandora:04499] Failing at address: 0x7f28b110
[pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
[pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
[pandora:04500] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
[pandora:04500] [ 3] [pandora:04499] [ 0]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
[pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
[pandora:04499] [ 1]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
[pandora:04500] [ 5] IMB-MPI1[0x40b83b]
[pandora:04500] [ 6] IMB-MPI1[0x407155]
[pandora:04500] [ 7] IMB-MPI1[0x4022ea]
[pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
[pandora:04499] [ 2]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
[pandora:04500] [ 9] IMB-MPI1[0x401d49]
[pandora:04500] *** End of error message ***
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
[pandora:04499] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
[pandora:04499] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
[pandora:04499] [ 5] IMB-MPI1[0x40b83b]
[pandora:04499] [ 6] IMB-MPI1[0x407155]
[pandora:04499] [ 7] IMB-MPI1[0x4022ea]
[pandora:04499] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
[pandora:04499] [ 9] IMB-MPI1[0x401d49]
[pandora:04499] *** End of error message ***
[phoebe:03779] *** Process received signal ***
[phoebe:03779] Signal: Segmentation fault (11)
[phoebe:03779] Signal code: Address not mapped (1)
[phoebe:03779] Failing at address: 0x7f483d60
[phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
[phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
[phoebe:03779] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
[phoebe:03779] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
[phoebe:03779] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
[phoebe:03779] [ 5] IMB-MPI1[0x40b83b]
[phoebe:03779] [ 6] IMB-MPI1[0x407155]
[phoebe:03779] [ 7] IMB-MPI1[0x4022ea]
[phoebe:03779] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f485530a3d5]
[phoebe:03779] [ 9] IMB-MPI1[0x401d49]
[phoebe:03779] *** End of error message ***
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been