Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-28 Thread Adam LeBlanc
Hello all,

Thank you all for the suggestions. Takahiro's suggestion has gotten me to a
point where all of the tests will run, but as soon as it gets to the cleanup
step IMB will seg fault again. I opened an issue on IMB's GitHub, but I guess
I am not going to be able to get much help from them. So I will have to wait
and see what happens next.

Thanks again for all your help,
Adam LeBlanc
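
For anyone else who lands here, Peter's workaround below (dropping back to an
older IMB) can be applied roughly like this (a sketch only; the repository is
the public Intel MPI Benchmarks mirror, and the tag name and build invocation
are assumptions to be checked against that release's README):

# fetch the IMB sources and switch to a release that predates the 2019 rewrite
git clone https://github.com/intel/mpi-benchmarks.git
cd mpi-benchmarks
git tag                      # list available release tags
git checkout IMB-v2018.1     # tag name assumed; pick whichever older release is wanted
# build IMB-MPI1 against the Open MPI install in use
make CC=mpicc IMB-MPI1       # exact target/variables may differ per release; see its README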

On Thu, Feb 21, 2019 at 7:22 AM Peter Kjellström  wrote:

> On Wed, 20 Feb 2019 10:46:10 -0500
> Adam LeBlanc  wrote:
>
> > Hello,
> >
> > When I do a run with OpenMPI v4.0.0 on Infiniband with this command:
> > mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node
> > --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca
> > pml ob1 --mca btl_openib_allow_ib 1 -np 6
> >  -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
> >
> > I get this error:
> ...
> > # Benchmarking Reduce_scatter
> ...
> >   2097152   20  8738.08  9340.50  9147.89
> > [pandora:04500] *** Process received signal ***
> > [pandora:04500] Signal: Segmentation fault (11)
>
> This is very likely a bug in IMB, not in Open MPI. It's been discussed on
> the list before, thread name:
>
>  MPI_Reduce_Scatter Segmentation Fault with Intel  2019 Update 1
>  Compilers on OPA-1...
>
> You can work around it by using an older IMB version (the bug is in the
> newest version).
>
> /Peter K
>

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Adam LeBlanc
Hello Howard,

Thanks for all of the help and suggestions; I will look into them. I also
realized that my Ansible setup wasn't handling tar files properly, so the
nightly build didn't even install. I will build it by hand and will give you
an update tomorrow somewhere in the afternoon.

Thanks,
Adam LeBlanc
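
In case it is useful while waiting on the nightly build, Howard's UCX
suggestion below boils down to something like the following (a sketch only;
the install prefix and UCX location are assumptions, and the -np count and
hostfile path are simply the ones used earlier in this thread):

# check whether the installed MOFED already ships a UCX version
/usr/bin/ofed_rpm_info | grep -i ucx

# rebuild Open MPI 4.0.0 with UCX support
./configure --prefix=/opt/openmpi/4.0.0 --with-ucx=/usr
make -j && make install

# run over UCX instead of the openib BTL
mpirun --mca pml ucx --map-by node -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1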

On Wed, Feb 20, 2019 at 4:26 PM Howard Pritchard 
wrote:

> Hello Adam,
>
> This helps some.  Could you post the first 20 lines of your config.log?
> This will help in trying to reproduce.  The content of your host file (you
> can use generic names for the nodes if that's an issue to publicize) would
> also help, as the number of nodes and the number of MPI processes per node
> impact the way the reduce scatter operation works.
>
> One thing to note about the openib BTL - it is on life support.   That's
> why you needed to set btl_openib_allow_ib 1 on the mpirun command line.
>
> You may get much better success by installing UCX
> <https://github.com/openucx/ucx/releases> and rebuilding Open MPI to use
> UCX.  You may actually already have UCX installed on your system if
> a recent version of MOFED is installed.
>
> You can check this by running /usr/bin/ofed_rpm_info.  It will show which
> ucx version has been installed.
> If UCX is installed, you can add --with-ucx to the Open MPI configuration
> line and it should build in UCX
> support.   If Open MPI is built with UCX support, it will by default use
> UCX for message transport rather than
> the OpenIB BTL.
>
> thanks,
>
> Howard
>
>
> Am Mi., 20. Feb. 2019 um 12:49 Uhr schrieb Adam LeBlanc <
> alebl...@iol.unh.edu>:
>
>> On the TCP side it doesn't seg fault anymore, but it will time out on some
>> tests; on the openib side it will still seg fault. Here is the output:
>>
>> [pandora:19256] *** Process received signal ***
>> [pandora:19256] Signal: Segmentation fault (11)
>> [pandora:19256] Signal code: Address not mapped (1)
>> [pandora:19256] Failing at address: 0x7f911c69fff0
>> [pandora:19255] *** Process received signal ***
>> [pandora:19255] Signal: Segmentation fault (11)
>> [pandora:19255] Signal code: Address not mapped (1)
>> [pandora:19255] Failing at address: 0x7ff09cd3fff0
>> [pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
>> [pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
>> [pandora:19256] [ 2]
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
>> [pandora:19256] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
>> [pandora:19256] [ 4] [pandora:19255] [ 0]
>> /usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
>> [pandora:19255] [ 1]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
>> [pandora:19256] [ 5] IMB-MPI1[0x40b83b]
>> [pandora:19256] [ 6] IMB-MPI1[0x407155]
>> [pandora:19256] [ 7] IMB-MPI1[0x4022ea]
>> [pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
>> [pandora:19255] [ 2]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
>> [pandora:19256] [ 9] IMB-MPI1[0x401d49]
>> [pandora:19256] *** End of error message ***
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
>> [pandora:19255] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
>> [pandora:19255] [ 4]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
>> [pandora:19255] [ 5] IMB-MPI1[0x40b83b]
>> [pandora:19255] [ 6] IMB-MPI1[0x407155]
>> [pandora:19255] [ 7] IMB-MPI1[0x4022ea]
>> [pandora:19255] [ 8]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
>> [pandora:19255] [ 9] IMB-MPI1[0x401d49]
>> [pandora:19255] *** End of error message ***
>> [phoebe:12418] *** Process received signal ***
>> [phoebe:12418] Signal: Segmentation fault (11)
>> [phoebe:12418] Signal code: Address not mapped (1)
>> [phoebe:12418] Failing at address: 0x7f5ce27dfff0
>> [phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
>> [phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
>> [phoebe:12418] [ 2]
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
>> [phoebe:12418] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
>> [phoebe:12418] [ 4]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]
>> [phoebe:12418] [ 5] IMB-MPI1[0x40b83b]
>> [phoebe:12418] [ 6] 

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Adam LeBlanc
On the TCP side it doesn't seg fault anymore, but it will time out on some
tests; on the openib side it will still seg fault. Here is the output:

[pandora:19256] *** Process received signal ***
[pandora:19256] Signal: Segmentation fault (11)
[pandora:19256] Signal code: Address not mapped (1)
[pandora:19256] Failing at address: 0x7f911c69fff0
[pandora:19255] *** Process received signal ***
[pandora:19255] Signal: Segmentation fault (11)
[pandora:19255] Signal code: Address not mapped (1)
[pandora:19255] Failing at address: 0x7ff09cd3fff0
[pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
[pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
[pandora:19256] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
[pandora:19256] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
[pandora:19256] [ 4] [pandora:19255] [ 0]
/usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
[pandora:19255] [ 1]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
[pandora:19256] [ 5] IMB-MPI1[0x40b83b]
[pandora:19256] [ 6] IMB-MPI1[0x407155]
[pandora:19256] [ 7] IMB-MPI1[0x4022ea]
[pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
[pandora:19255] [ 2]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
[pandora:19256] [ 9] IMB-MPI1[0x401d49]
[pandora:19256] *** End of error message ***
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
[pandora:19255] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
[pandora:19255] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
[pandora:19255] [ 5] IMB-MPI1[0x40b83b]
[pandora:19255] [ 6] IMB-MPI1[0x407155]
[pandora:19255] [ 7] IMB-MPI1[0x4022ea]
[pandora:19255] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
[pandora:19255] [ 9] IMB-MPI1[0x401d49]
[pandora:19255] *** End of error message ***
[phoebe:12418] *** Process received signal ***
[phoebe:12418] Signal: Segmentation fault (11)
[phoebe:12418] Signal code: Address not mapped (1)
[phoebe:12418] Failing at address: 0x7f5ce27dfff0
[phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
[phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
[phoebe:12418] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
[phoebe:12418] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
[phoebe:12418] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]
[phoebe:12418] [ 5] IMB-MPI1[0x40b83b]
[phoebe:12418] [ 6] IMB-MPI1[0x407155]
[phoebe:12418] [ 7] IMB-MPI1[0x4022ea]
[phoebe:12418] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5]
[phoebe:12418] [ 9] IMB-MPI1[0x401d49]
[phoebe:12418] *** End of error message ***
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun noticed that process rank 0 with PID 0 on node pandora exited on
signal 11 (Segmentation fault).
--

- Adam LeBlanc
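
Since every backtrace above ends up in ompi_coll_base_reduce_scatter_intra_ring,
one more data point would be to pin the tuned collective component to a
different reduce_scatter algorithm and see whether the crash follows the ring
implementation (a sketch; the algorithm number used here is only an example,
so check the ompi_info listing first):

# list the reduce_scatter algorithms the tuned component knows about
ompi_info --param coll tuned --level 9 | grep reduce_scatter

# rerun with a forced, non-default algorithm (the value 2 is an assumption)
mpirun --mca pml ob1 --mca btl openib,vader,self --mca btl_openib_allow_ib 1 \
  --mca coll_tuned_use_dynamic_rules 1 \
  --mca coll_tuned_reduce_scatter_algorithm 2 \
  -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1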

On Wed, Feb 20, 2019 at 2:08 PM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> Can you try the latest 4.0.x nightly snapshot and see if the problem still
> occurs?
>
> https://www.open-mpi.org/nightly/v4.0.x/
>
>
> > On Feb 20, 2019, at 1:40 PM, Adam LeBlanc  wrote:
> >
> > I do; here is the output:
> >
> > 2 total processes killed (some possibly by mpirun during cleanup)
> > [pandora:12238] *** Process received signal ***
> > [pandora:12238] Signal: Segmentation fault (11)
> > [pandora:12238] Signal code: Invalid permissions (2)
> > [pandora:12238] Failing at address: 0x7f5c8e31fff0
> > [pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680]
> > [pandora:12238] [ 1] [pandora:12237] *** Process received signal ***
> > /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0]
> > [pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11)
> > [pandora:12237] Signal code: Invalid permissions (2)
> > [pandora:12237] Failing at address: 0x7f6c4ab3fff0
> > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55]
> > [pandora:12238] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b]
> > [pandora:12238] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7]
> > [pandora:12238] [ 5] IMB-MPI1[0x40b83b]
> > [pan

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Adam LeBlanc
I do; here is the output:

2 total processes killed (some possibly by mpirun during cleanup)
[pandora:12238] *** Process received signal ***
[pandora:12238] Signal: Segmentation fault (11)
[pandora:12238] Signal code: Invalid permissions (2)
[pandora:12238] Failing at address: 0x7f5c8e31fff0
[pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680]
[pandora:12238] [ 1] [pandora:12237] *** Process received signal ***
/usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0]
[pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11)
[pandora:12237] Signal code: Invalid permissions (2)
[pandora:12237] Failing at address: 0x7f6c4ab3fff0
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55]
[pandora:12238] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b]
[pandora:12238] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7]
[pandora:12238] [ 5] IMB-MPI1[0x40b83b]
[pandora:12238] [ 6] IMB-MPI1[0x407155]
[pandora:12238] [ 7] IMB-MPI1[0x4022ea]
[pandora:12238] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5ca1ca23d5]
[pandora:12238] [ 9] IMB-MPI1[0x401d49]
[pandora:12238] *** End of error message ***
[pandora:12237] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6c5e73f680]
[pandora:12237] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6c5e4ac4a0]
[pandora:12237] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6c5dddbe55]
[pandora:12237] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6c5e9f798b]
[pandora:12237] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6c5e9ceda7]
[pandora:12237] [ 5] IMB-MPI1[0x40b83b]
[pandora:12237] [ 6] IMB-MPI1[0x407155]
[pandora:12237] [ 7] IMB-MPI1[0x4022ea]
[pandora:12237] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6c5e3823d5]
[pandora:12237] [ 9] IMB-MPI1[0x401d49]
[pandora:12237] *** End of error message ***
[phoebe:07408] *** Process received signal ***
[phoebe:07408] Signal: Segmentation fault (11)
[phoebe:07408] Signal code: Invalid permissions (2)
[phoebe:07408] Failing at address: 0x7f6b9ca9fff0
[titan:07169] *** Process received signal ***
[titan:07169] Signal: Segmentation fault (11)
[titan:07169] Signal code: Invalid permissions (2)
[titan:07169] Failing at address: 0x7fc01295fff0
[phoebe:07408] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6bb03b7680]
[phoebe:07408] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6bb01244a0]
[phoebe:07408] [ 2] [titan:07169] [ 0]
/usr/lib64/libpthread.so.0(+0xf680)[0x7fc026117680]
[titan:07169] [ 1]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6bafa53e55]
[phoebe:07408] [ 3] /usr/lib64/libc.so.6(+0x14c4a0)[0x7fc025e844a0]
[titan:07169] [ 2]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6bb066f98b]
[phoebe:07408] [ 4]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7fc0257b3e55]
[titan:07169] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6bb0646da7]
[phoebe:07408] [ 5] IMB-MPI1[0x40b83b]
[phoebe:07408] [ 6] IMB-MPI1[0x407155]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7fc0263cf98b]
[titan:07169] [ 4] [phoebe:07408] [ 7] IMB-MPI1[0x4022ea]
[phoebe:07408] [ 8]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7fc0263a6da7]
[titan:07169] [ 5] IMB-MPI1[0x40b83b]
[titan:07169] [ 6] IMB-MPI1[0x407155]
[titan:07169] [ 7]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6bafffa3d5]
[phoebe:07408] [ 9] IMB-MPI1[0x401d49]
[phoebe:07408] *** End of error message ***
IMB-MPI1[0x4022ea]
[titan:07169] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc025d5a3d5]
[titan:07169] [ 9] IMB-MPI1[0x401d49]
[titan:07169] *** End of error message ***
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun noticed that process rank 0 with PID 0 on node pandora exited on
signal 11 (Segmentation fault).
--


- Adam LeBlanc
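
For completeness, Howard's sanity check below amounts to rerunning the same
job with the openib BTL swapped out for tcp, roughly as follows (a sketch
assembled from the flags already used earlier in this thread):

mpirun --mca orte_base_help_aggregate 0 --map-by node \
  --mca btl self,vader,tcp --mca pml ob1 \
  -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1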

On Wed, Feb 20, 2019 at 1:20 PM Howard Pritchard 
wrote:

> HI Adam,
>
> As a sanity check, if you try to use --mca btl self,vader,tcp
>
> do you still see the segmentation fault?
>
> Howard
>
>
> Am Mi., 20. Feb. 2019 um 08:50 Uhr schrieb Adam LeBlanc <
> alebl...@iol.unh.edu>:
>
>> Hello,
>>
>> When I do a run with OpenMPI v4.0.0 on Infiniband with this command:
>> mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca
>> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
>> btl_openib_allow_ib 1 -

[OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Adam LeBlanc
.
--
--
mpirun noticed that process rank 1 with PID 3779 on node phoebe-ib exited
on signal 11 (Segmentation fault).
--

Also if I reinstall 3.1.2 I do not have this issue at all.

Any thoughts on what could be the issue?

Thanks,
Adam LeBlanc

Re: [OMPI users] Bug with Open-MPI Processor Count

2018-11-08 Thread Adam LeBlanc
Hello Ralph,

Is there any update on this?

Thanks,
Adam LeBlanc

On Fri, Nov 2, 2018 at 11:06 AM Adam LeBlanc  wrote:

> Hello Ralph,
>
> When I do -np 7 it still fails with "There are not enough slots available
> in the system to satisfy the 7 slots that were requested by the
> application", but when I do -np 2 it will actually run from a machine that
> was failing. However, it will only run on one other machine, and in this
> case it ran from a machine with 2 processors to a machine with only 1
> processor. If I try to make -np higher than 2 it will also fail.
>
> -Adam LeBlanc
>
> On Thu, Nov 1, 2018 at 3:53 PM Ralph H Castain  wrote:
>
>> Hmmm - try adding a value for nprocs instead of leaving it blank. Say
>> “-np 7”
>>
>> Sent from my iPhone
>>
>> On Nov 1, 2018, at 11:56 AM, Adam LeBlanc  wrote:
>>
>> Hello Ralph,
>>
>> Here is the output for a failing machine:
>>
>> [130_02:44:13_aleblanc@farbauti]{~}$ > mpirun --mca
>> btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0
>> --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues
>> P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca
>> ras_base_verbose 5 IMB-MPI1
>>
>> ==   ALLOCATED NODES   ==
>> farbauti: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
>> hyperion-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> =
>> --
>> There are not enough slots available in the system to satisfy the 7 slots
>> that were requested by the application:
>>   10
>>
>> Either request fewer slots for your application, or make more slots
>> available
>> for use.
>> --
>>
>>
>> Here is an output of a passing machine:
>>
>> [1_02:54:26_aleblanc@hyperion]{~}$ > mpirun --mca
>> btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0
>> --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues
>> P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca
>> ras_base_verbose 5 IMB-MPI1
>>
>> ==   ALLOCATED NODES   ==
>> hyperion: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
>> farbauti-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> =
>>
>>
>> Yes the hostfile is available on all nodes through an NFS mount for all
>> of our home directories.
>>
>> On Thu, Nov 1, 2018 at 2:44 PM Adam LeBlanc  wrote:
>>
>>>
>>>
>>> -- Forwarded message -
>>> From: Ralph H Castain 
>>> Date: Thu, Nov 1, 2018 at 2:34 PM
>>> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
>>> To: Open MPI Users 
>>>
>>>
>>> I’m a little under the weather and so will only be able to help a bit at
>>> a time. However, a couple of things to check:
>>>
>>> * add -mca ras_base_verbose 5 to the cmd line to see what mpirun thought
>>> the allocation was
>>>
>>> * is the hostfile available on every node?
>>>
>>> Ralph
>>>
>>> On Nov 1, 2018, at 10:55 AM, Adam LeBlanc  wrote:
>>>
>>> Hello Ralph,
>>>
>>> Attached below is the verbose output for a failing machine and a passing
>>> machine.
>>>
>>> Thanks,
>>> Adam LeBlanc
>>>
>>> On Thu, Nov 1, 2018 at 1:41 PM Adam LeBlanc 
>>> wrote:
>>>
>>>>
>>>>
>>>> -- Forwarded message -
>>>> From: R

Re: [OMPI users] Bug with Open-MPI Processor Count

2018-11-02 Thread Adam LeBlanc
Hello Ralph,

When I do -np 7 it still fails with "There are not enough slots available
in the system to satisfy the 7 slots that were requested by the
application", but when I do -np 2 it will actually run from a machine that
was failing. However, it will only run on one other machine, and in this
case it ran from a machine with 2 processors to a machine with only 1
processor. If I try to make -np higher than 2 it will also fail.

-Adam LeBlanc
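
As the error text itself suggests, the run can also be pushed through by
making more slots available, either by raising the slots= values in the
hostfile or by letting mpirun oversubscribe (a sketch; whether oversubscribing
is acceptable here depends on what the benchmark run is meant to measure):

# option 1: raise the per-node slot counts in the hostfile, e.g.
#   farbauti-ce.ofa.iol.unh.edu slots=2
# option 2: allow more ranks than detected slots
mpirun --oversubscribe -np 7 -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1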

On Thu, Nov 1, 2018 at 3:53 PM Ralph H Castain  wrote:

> Hmmm - try adding a value for nprocs instead of leaving it blank. Say “-np
> 7”
>
> Sent from my iPhone
>
> On Nov 1, 2018, at 11:56 AM, Adam LeBlanc  wrote:
>
> Hello Ralph,
>
> Here is the output for a failing machine:
>
> [130_02:44:13_aleblanc@farbauti]{~}$ > mpirun --mca
> btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0
> --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues
> P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca
> ras_base_verbose 5 IMB-MPI1
>
> ==   ALLOCATED NODES   ==
> farbauti: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
> hyperion-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> =
> --
> There are not enough slots available in the system to satisfy the 7 slots
> that were requested by the application:
>   10
>
> Either request fewer slots for your application, or make more slots
> available
> for use.
> --
>
>
> Here is an output of a passing machine:
>
> [1_02:54:26_aleblanc@hyperion]{~}$ > mpirun --mca
> btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0
> --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues
> P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca
> ras_base_verbose 5 IMB-MPI1
>
> ==   ALLOCATED NODES   ==
> hyperion: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
> farbauti-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> =====
>
>
> Yes the hostfile is available on all nodes through an NFS mount for all of
> our home directories.
>
> On Thu, Nov 1, 2018 at 2:44 PM Adam LeBlanc  wrote:
>
>>
>>
>> -- Forwarded message -
>> From: Ralph H Castain 
>> Date: Thu, Nov 1, 2018 at 2:34 PM
>> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
>> To: Open MPI Users 
>>
>>
>> I’m a little under the weather and so will only be able to help a bit at
>> a time. However, a couple of things to check:
>>
>> * add -mca ras_base_verbose 5 to the cmd line to see what mpirun thought
>> the allocation was
>>
>> * is the hostfile available on every node?
>>
>> Ralph
>>
>> On Nov 1, 2018, at 10:55 AM, Adam LeBlanc  wrote:
>>
>> Hello Ralph,
>>
>> Attached below is the verbose output for a failing machine and a passing
>> machine.
>>
>> Thanks,
>> Adam LeBlanc
>>
>> On Thu, Nov 1, 2018 at 1:41 PM Adam LeBlanc  wrote:
>>
>>>
>>>
>>> -- Forwarded message -
>>> From: Ralph H Castain 
>>> Date: Thu, Nov 1, 2018 at 1:07 PM
>>> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
>>> To: Open MPI Users 
>>>
>>>
>>> Set rmaps_base_verbose=10 for debugging output
>>>
>>> Sent from my iPhone
>>>
>>> On Nov 1, 2018, at 9:31 AM, Adam LeBlanc  wrote:
>>>
>>> The version by the way for Open-MPI is 3.1.2.
>>>
>>> -Adam LeBlanc
&

Re: [OMPI users] Bug with Open-MPI Processor Count

2018-11-01 Thread Adam LeBlanc
Hello Ralph,

Here is the output for a failing machine:

[130_02:44:13_aleblanc@farbauti]{~}$ > mpirun --mca
btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0
--mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues
P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca
ras_base_verbose 5 IMB-MPI1

==   ALLOCATED NODES   ==
farbauti: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
hyperion-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
=
--
There are not enough slots available in the system to satisfy the 7 slots
that were requested by the application:
  10

Either request fewer slots for your application, or make more slots
available
for use.
--


Here is an output of a passing machine:

[1_02:54:26_aleblanc@hyperion]{~}$ > mpirun --mca
btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0
--mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues
P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca
ras_base_verbose 5 IMB-MPI1

==   ALLOCATED NODES   ==
hyperion: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
farbauti-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
=


Yes the hostfile is available on all nodes through an NFS mount for all of
our home directories.

On Thu, Nov 1, 2018 at 2:44 PM Adam LeBlanc  wrote:

>
>
> -- Forwarded message -
> From: Ralph H Castain 
> Date: Thu, Nov 1, 2018 at 2:34 PM
> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
> To: Open MPI Users 
>
>
> I’m a little under the weather and so will only be able to help a bit at a
> time. However, a couple of things to check:
>
> * add -mca ras_base_verbose 5 to the cmd line to see what mpirun thought
> the allocation was
>
> * is the hostfile available on every node?
>
> Ralph
>
> On Nov 1, 2018, at 10:55 AM, Adam LeBlanc  wrote:
>
> Hello Ralph,
>
> Attached below is the verbose output for a failing machine and a passing
> machine.
>
> Thanks,
> Adam LeBlanc
>
> On Thu, Nov 1, 2018 at 1:41 PM Adam LeBlanc  wrote:
>
>>
>>
>> -- Forwarded message -
>> From: Ralph H Castain 
>> Date: Thu, Nov 1, 2018 at 1:07 PM
>> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
>> To: Open MPI Users 
>>
>>
>> Set rmaps_base_verbose=10 for debugging output
>>
>> Sent from my iPhone
>>
>> On Nov 1, 2018, at 9:31 AM, Adam LeBlanc  wrote:
>>
>> The version by the way for Open-MPI is 3.1.2.
>>
>> -Adam LeBlanc
>>
>> On Thu, Nov 1, 2018 at 12:05 PM Adam LeBlanc 
>> wrote:
>>
>>> Hello, I am an employee of the UNH InterOperability Lab, and we are in
>>> the process of testing OFED-4.17-RC1 for the OpenFabrics Alliance. We have
>>> purchased some new hardware that has one processor, and noticed an issue
>>> when running mpi jobs on nodes that do not have similar processor counts.
>>> If we launch the MPI job from a node that has 2 processors, it fails,
>>> stating that there are not enough resources, and will not start the run,
>>> like so:
>>> --
>>> There are not enough slots available in the system to satisfy the 14 slots
>>> that were requested by the application:
>>>   IMB-MPI1
>>> Either request fewer slots for your application, or make more slots
>>> available for use.
>>> --
>>> If we launch the MPI job from the node with one processor, without changing
>>> the mpirun command at all, it runs as expected. Here is the command being
>>&

Re: [OMPI users] Bug with Open-MPI Processor Count

2018-11-01 Thread Adam LeBlanc
Hello Ralph,

Attached below is the verbose output for a failing machine and a passing
machine.

Thanks,
Adam LeBlanc

On Thu, Nov 1, 2018 at 1:41 PM Adam LeBlanc  wrote:

>
>
> -- Forwarded message -
> From: Ralph H Castain 
> Date: Thu, Nov 1, 2018 at 1:07 PM
> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
> To: Open MPI Users 
>
>
> Set rmaps_base_verbose=10 for debugging output
>
> Sent from my iPhone
>
> On Nov 1, 2018, at 9:31 AM, Adam LeBlanc  wrote:
>
> The version by the way for Open-MPI is 3.1.2.
>
> -Adam LeBlanc
>
> On Thu, Nov 1, 2018 at 12:05 PM Adam LeBlanc  wrote:
>
>> Hello, I am an employee of the UNH InterOperability Lab, and we are in
>> the process of testing OFED-4.17-RC1 for the OpenFabrics Alliance. We have
>> purchased some new hardware that has one processor, and noticed an issue
>> when running mpi jobs on nodes that do not have similar processor counts.
>> If we launch the MPI job from a node that has 2 processors, it fails,
>> stating that there are not enough resources, and will not start the run,
>> like so:
>> --
>> There are not enough slots available in the system to satisfy the 14 slots
>> that were requested by the application:
>>   IMB-MPI1
>> Either request fewer slots for your application, or make more slots
>> available for use.
>> --
>> If we launch the MPI job from the node with one processor, without changing
>> the mpirun command at all, it runs as expected. Here is the command being
>> run:
>>
>> mpirun --mca btl_openib_warn_no_device_params_found 0 --mca
>> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
>> btl_openib_receive_queues P,65536,120,64,32 -hostfile
>> /home/soesterreich/ce-mpi-hosts IMB-MPI1
>>
>> Here is the hostfile being used:
>>
>> farbauti-ce.ofa.iol.unh.edu slots=1
>> hyperion-ce.ofa.iol.unh.edu slots=1
>> io-ce.ofa.iol.unh.edu slots=1
>> jarnsaxa-ce.ofa.iol.unh.edu slots=1
>> rhea-ce.ofa.iol.unh.edu slots=1
>> tarqeq-ce.ofa.iol.unh.edu slots=1
>> tarvos-ce.ofa.iol.unh.edu slots=1
>>
>> This seems like a bug, and we would like some help to explain and fix what
>> is happening. The IBTA plugfest saw similar behaviours, so this should be
>> reproducible. Thanks, Adam LeBlanc
>>
[0_01:49:28_aleblanc@hyperion]{~}$ > mpirun --mca 
btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 --mca 
btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues 
P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca 
rmaps_base_verbose 10 IMB-MPI1
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: registering 
framework rmaps components
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: found loaded 
component resilient
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: component 
resilient register function successful
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: found loaded 
component seq
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: component seq 
register function successful
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: found loaded 
component ppr
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: component ppr 
register function successful
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: found loaded 
component mindist
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: component 
mindist register function successful
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: found loaded 
component round_robin
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: component 
round_robin register function successful
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: found loaded 
component rank_file
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: component 
rank_file register function successful
[hyperion.ofa.iol.unh.edu:05190] [[63394,0],0] rmaps:base set policy with NULL 
device NONNULL
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_open: opening rmaps 
components
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_open: found loaded 
component resilient
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_open: component 
resilient open function successful
[hyperion.ofa.iol.unh.edu:05190]

Re: [OMPI users] Bug with Open-MPI Processor Count

2018-11-01 Thread Adam LeBlanc
The version by the way for Open-MPI is 3.1.2.

-Adam LeBlanc

On Thu, Nov 1, 2018 at 12:05 PM Adam LeBlanc  wrote:

> Hello, I am an employee of the UNH InterOperability Lab, and we are in the
> process of testing OFED-4.17-RC1 for the OpenFabrics Alliance. We have
> purchased some new hardware that has one processor, and noticed an issue
> when running mpi jobs on nodes that do not have similar processor counts.
> If we launch the MPI job from a node that has 2 processors, it fails,
> stating that there are not enough resources, and will not start the run,
> like so:
> --
> There are not enough slots available in the system to satisfy the 14 slots
> that were requested by the application:
>   IMB-MPI1
> Either request fewer slots for your application, or make more slots
> available for use.
> --
> If we launch the MPI job from the node with one processor, without changing
> the mpirun command at all, it runs as expected. Here is the command being
> run:
>
> mpirun --mca btl_openib_warn_no_device_params_found 0 --mca
> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
> btl_openib_receive_queues P,65536,120,64,32 -hostfile
> /home/soesterreich/ce-mpi-hosts IMB-MPI1
>
> Here is the hostfile being used:
>
> farbauti-ce.ofa.iol.unh.edu slots=1
> hyperion-ce.ofa.iol.unh.edu slots=1
> io-ce.ofa.iol.unh.edu slots=1
> jarnsaxa-ce.ofa.iol.unh.edu slots=1
> rhea-ce.ofa.iol.unh.edu slots=1
> tarqeq-ce.ofa.iol.unh.edu slots=1
> tarvos-ce.ofa.iol.unh.edu slots=1
>
> This seems like a bug, and we would like some help to explain and fix what
> is happening. The IBTA plugfest saw similar behaviours, so this should be
> reproducible. Thanks, Adam LeBlanc
>

[OMPI users] Bug with Open-MPI Processor Count

2018-11-01 Thread Adam LeBlanc
Hello, I am an employee of the UNH InterOperability Lab, and we are in the
process of testing OFED-4.17-RC1 for the OpenFabrics Alliance. We have
purchased some new hardware that has one processor, and noticed an issue
when running mpi jobs on nodes that do not have similar processor counts.
If we launch the MPI job from a node that has 2 processors, it fails,
stating that there are not enough resources, and will not start the run,
like so:
--
There are not enough slots available in the system to satisfy the 14 slots
that were requested by the application:
  IMB-MPI1
Either request fewer slots for your application, or make more slots
available for use.
--
If we launch the MPI job from the node with one processor, without changing
the mpirun command at all, it runs as expected. Here is the command being
run:

mpirun --mca btl_openib_warn_no_device_params_found 0 --mca
orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
btl_openib_receive_queues P,65536,120,64,32 -hostfile
/home/soesterreich/ce-mpi-hosts IMB-MPI1

Here is the hostfile being used:

farbauti-ce.ofa.iol.unh.edu slots=1
hyperion-ce.ofa.iol.unh.edu slots=1
io-ce.ofa.iol.unh.edu slots=1
jarnsaxa-ce.ofa.iol.unh.edu slots=1
rhea-ce.ofa.iol.unh.edu slots=1
tarqeq-ce.ofa.iol.unh.edu slots=1
tarvos-ce.ofa.iol.unh.edu slots=1

This seems like a bug, and we would like some help to explain and fix what
is happening. The IBTA plugfest saw similar behaviours, so this should be
reproducible.

Thanks,
Adam LeBlanc
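
One more quick way to compare what mpirun on each launch node believes the
allocation to be, independent of the ras/rmaps verbose output shown elsewhere
in this thread, is to have it print the allocation and the resulting process
map (a sketch; --display-allocation and --display-map are standard mpirun
options in this series):

mpirun --display-allocation --display-map \
  -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1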