Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

2018-09-20 Thread Pavel Shamis
More UCX packages:
Fedora: http://rpms.famillecollet.com/rpmphp/zoom.php?rpm=ucx
OpenSUSE: https://software.opensuse.org/package/openucx


On Thu, Sep 20, 2018 at 7:53 AM Yossi Itigin  wrote:

> Currently the target is RH 8
> And yes, UCX is also available on EPEL, for example:
> https://centos.pkgs.org/7/epel-x86_64/ucx-1.3.1-1.el7.x86_64.rpm.html
>
> -Original Message-
> From: Peter Kjellström 
> Sent: Thursday, September 20, 2018 2:11 PM
> To: Yossi Itigin 
> Cc: Open MPI Developers 
> Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1
>
> On Thu, 20 Sep 2018 09:03:44 +
> Yossi Itigin  wrote:
>
> > Hi,
> >
> > UCX is on its way into the RHEL distribution and will be available and
> > ON by default (auto-detected by the OMPI build process) in the
> > near future.
>
> That is good to hear, already in the upcoming 7.6?
>
> > Meanwhile, users can enable UCX in two ways:
> > 1. Download and install UCX from openucx.org and build Open MPI with it.
> > 2. Download HPC-X from the Mellanox site (Open MPI pre-compiled and
> > packaged with UCX) for the distro of interest (users can re-compile the
> > package with site defaults as well)
>
> There's even:
>  3. enable epel and install it from there.
>
> /Peter K
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

2018-09-20 Thread Yossi Itigin
Currently the target is RH 8
And yes, UCX is also available on EPEL, for example: 
https://centos.pkgs.org/7/epel-x86_64/ucx-1.3.1-1.el7.x86_64.rpm.html

-Original Message-
From: Peter Kjellström  
Sent: Thursday, September 20, 2018 2:11 PM
To: Yossi Itigin 
Cc: Open MPI Developers 
Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

On Thu, 20 Sep 2018 09:03:44 +
Yossi Itigin  wrote:

> Hi,
> 
> UCX is on its way into the RHEL distribution and will be available and
> ON by default (auto-detected by the OMPI build process) in the
> near future.

That is good to hear, already in the upcoming 7.6?
 
> Meanwhile, users can enable UCX in two ways:
> 1. Download and install UCX from openucx.org and build Open MPI with it.
> 2. Download HPC-X from the Mellanox site (Open MPI pre-compiled and
> packaged with UCX) for the distro of interest (users can re-compile the
> package with site defaults as well)

There's even:
 3. enable epel and install it from there.

/Peter K
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

2018-09-20 Thread Peter Kjellström
On Thu, 20 Sep 2018 14:18:35 +0200
Peter Kjellström  wrote:

> On Wed, 19 Sep 2018 16:24:53 +
> "Gabriel, Edgar"  wrote:
...
> > So bottom line, if I do
> > 
> > mpirun -mca btl ^openib -mca mtl ^ofi ...
> > 
> > my tests finish correctly, although mpirun will still return an
> > error.  
> 
> I get some things to work with this approach (two ranks on two nodes
> for example). But a lot of things crash rather hard:
> 
>  $ mpirun -mca btl ^openib -mca mtl
> ^ofi ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1

Turns out I had an ofi BTL too. Disabling this gets me a stable IMB run at
128 ranks. That is, this works for me:

mpirun -mca btl ^openib,ofi -mca mtl ^ofi ./imb
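For reference, the "^" prefix in an MCA component list excludes the listed components and keeps everything else. A rough sketch of that selection logic (illustrative Python, not the real MCA parser):

```python
def select_components(available, spec):
    """Loosely mimic how an Open MPI MCA component list is interpreted.

    spec like "^openib,ofi" -> all available components except those listed
    spec like "self,tcp"    -> only the listed components
    Illustrative sketch only, not Open MPI's actual parsing code.
    """
    names = spec.lstrip("^").split(",")
    if spec.startswith("^"):
        return [c for c in available if c not in names]
    return [c for c in available if c in names]

btls = ["self", "vader", "tcp", "openib", "ofi"]
print(select_components(btls, "^openib,ofi"))  # ['self', 'vader', 'tcp']
```

So "-mca btl ^openib,ofi" leaves self/vader/tcp in play while keeping both problem BTLs out.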

/Peter K
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

2018-09-20 Thread Peter Kjellström
On Wed, 19 Sep 2018 16:24:53 +
"Gabriel, Edgar"  wrote:

> I performed some tests on our Omnipath cluster, and I have a mixed
> bag of results with 4.0.0rc1

I've also tried it on our OPA cluster (skylake+centos-7+inbox) with
very similar results.

> compute-1-1.local.4351PSM2 has not been initialized
> compute-1-0.local.3826PSM2 has not been initialized

yup I too see these.
 
> mpirun detected that one or more processes exited with non-zero
> status, thus causing the job to be terminated. The first process to
> do so was:
> 
>   Process name: [[38418,1],1]
>   Exit code:255
>   
> 

yup.
 
> 
> 2.   The ofi mtl does not work at all on our Omnipath cluster. If
> I try to force it using 'mpirun -mca mtl ofi ...' I get the following
> error message.

Yes, ofi seems broken. But not even disabling it helps me completely (I
see "mca_btl_ofi.so   [.] mca_btl_ofi_component_progress" in my
perf top...)

> 3.   The openib btl component is always getting in the way with
> annoying warnings. It is not really used, but constantly complains:
...
> [sabine.cacds.uh.edu:25996] 1 more process has sent help message
> help-mpi-btl-openib.txt / ib port not selected

Yup.

...
> So bottom line, if I do
> 
> mpirun -mca btl ^openib -mca mtl ^ofi ...
> 
> my tests finish correctly, although mpirun will still return an error.

I get some things to work with this approach (two ranks on two nodes
for example). But a lot of things crash rather hard:

 $ mpirun -mca btl ^openib -mca mtl
^ofi ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1
--
PSM2 was unable to open an endpoint. Please make sure that the network
link is active on the node and the hardware is functioning.

  Error: Failure in initializing endpoint
--
n909.279895hfi_userinit: assign_context command failed: Device or resource busy
n909.279895psmi_context_open: hfi_userinit: failed, trying again (1/3)
...
  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--
[n908:298761] *** An error occurred in MPI_Init
[n908:298761] *** reported by process [4092002305,59]
[n908:298761] *** on a NULL communicator
[n908:298761] *** Unknown error
[n908:298761] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n908:298761] ***and potentially your MPI job)
[n907:407748] 255 more processes have sent help message help-mtl-psm2.txt / unable to open endpoint
[n907:407748] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[n907:407748] 127 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[n907:407748] 56 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle

If I disable psm2 too, I get it to run (apparently on vader?)

/Peter K
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

2018-09-20 Thread Peter Kjellström
On Thu, 20 Sep 2018 09:03:44 +
Yossi Itigin  wrote:

> Hi,
> 
> UCX is on its way into the RHEL distribution and will be available and
> ON by default (auto-detected by the OMPI build process) in the
> near future.

That is good to hear, already in the upcoming 7.6?
 
> Meanwhile, users can enable UCX in two ways:
> 1. Download and install UCX from openucx.org and build Open MPI with it.
> 2. Download HPC-X from the Mellanox site (Open MPI pre-compiled and
> packaged with UCX) for the distro of interest (users can re-compile the
> package with site defaults as well)

There's even:
 3. enable epel and install it from there.

/Peter K
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

2018-09-20 Thread Peter Kjellström
On Tue, 18 Sep 2018 20:49:52 +
"Jeff Squyres \(jsquyres\) via devel"  wrote:

> On Sep 18, 2018, at 3:46 PM, Thananon Patinyasakdikul
>  wrote:
> > 
> > I tested on our cluster (UTK). I will give it a thumbs up, but I have
> > some comments.
> > 
> > What I understand with 4.0.
> > - openib btl is disabled by default (can be turned on by mca)  
...
> > My question is, what if the user does not have UCX installed (but
> > they have InfiniBand hardware)? The user will then have no fast
> > transport for their hardware. In my testing, this release falls
> > back to btl/tcp if I don't specify the MCA parameter to use uct or
> > force openib. Will this be a problem?
> 
> This is a question for Mellanox.
 
It may be worth noting here that quite a few clusters with Mellanox IB
run the RHEL IB stack, and RHEL does not package UCX. Not picking up
and using verbs on such a cluster by default seems like strange
behavior to me.

/Peter K
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

2018-09-20 Thread Yossi Itigin
Hi,

UCX is on its way into the RHEL distribution and will be available and ON by
default (auto-detected by the OMPI build process) in the near future.

Meanwhile, users can enable UCX in two ways:
1. Download and install UCX from openucx.org and build Open MPI with it.
2. Download HPC-X from the Mellanox site (Open MPI pre-compiled and packaged
with UCX) for the distro of interest (users can re-compile the package with
site defaults as well)

Also, for users running the latest MLNX_OFED, UCX is already included.


--Yossi

-Original Message-
From: devel  On Behalf Of Jeff Squyres 
(jsquyres) via devel
Sent: Tuesday, September 18, 2018 11:50 PM
To: Open MPI Developers List 
Cc: Jeff Squyres (jsquyres) 
Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

On Sep 18, 2018, at 3:46 PM, Thananon Patinyasakdikul  
wrote:
> 
> I tested on our cluster (UTK). I will give it a thumbs up, but I have some
> comments.
> 
> What I understand with 4.0.
> - openib btl is disabled by default (can be turned on by mca)

It is disabled by default *for InfiniBand*.  It is still enabled by default for 
RoCE and iWARP.

> - pml ucx will be the default for infiniband hardware.
> - btl uct is for one-sided but can also be used for two-sided (needs an
> explicit mca).
> 
> My question is, what if the user does not have UCX installed (but they have
> InfiniBand hardware)? The user will then have no fast transport for their
> hardware. In my testing, this release falls back to btl/tcp if I don't
> specify the MCA parameter to use uct or force openib. Will this be a problem?

This is a question for Mellanox.

-- 
Jeff Squyres
jsquy...@cisco.com

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

2018-09-19 Thread Cabral, Matias A
Hi Arm,

> IIRC, the OFI BTL only creates one EP
Correct. But only one is needed to trigger the issues below. The symptoms
manifest differently depending on the combination of MTL (OFI or PSM2), the
libpsm2 version, and whether OFI scalable EPs are supported.

> Do you think moving EP creation from component_init to component_open will 
> solve the problem?
If component_open is only called when the component will actually be
used, it may work. Let me check.

Thanks,

_MAC

From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Thananon 
Patinyasakdikul
Sent: Wednesday, September 19, 2018 10:15 AM
To: Open MPI Developers 
Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

Matias,

IIRC, the OFI BTL only creates one EP. If you move it to add_procs, you might
need to add a check to avoid re-creating the EP over and over. Do you think
moving EP creation from component_init to component_open will solve the
problem?

Arm


On Sep 19, 2018, at 1:08 PM, Cabral, Matias A
<matias.a.cab...@intel.com> wrote:

Hi Edgar,

I also saw some similar issues, not exactly the same, but they look very
similar (maybe because of a different libpsm2 version). 1 and 2 are related to
the introduction of the OFI BTL and the fact that it opens an OFI EP in its
init function. I see that all BTLs call the init function during transport
selection time. Moreover, this happens even when you explicitly ask for a
different one (-mca pml cm -mca mtl psm2).  Workaround: -mca btl ^ofi.  My
current idea is to update the OFI BTL and move the EP opening to add_procs.
Feedback?

Number 3 goes beyond me.

Thanks,

_MAC

From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Gabriel, 
Edgar
Sent: Wednesday, September 19, 2018 9:25 AM
To: Open MPI Developers <devel@lists.open-mpi.org>
Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

I performed some tests on our Omnipath cluster, and I have a mixed bag of 
results with 4.0.0rc1

1.   Good news, the problems with the psm2 mtl that I reported in June/July 
seem to be fixed. I still get however a warning every time I run a job with 
4.0.0, e.g.

compute-1-1.local.4351PSM2 has not been initialized
compute-1-0.local.3826PSM2 has not been initialized

although based on the performance, it is very clear that psm2 is being used. I 
double checked with 3.0 series, I do not get the same warnings on the same
set of nodes. The unfortunate part about this error message is that
applications appear to return an error (although tests and applications
otherwise finish correctly).

--
mpirun detected that one or more processes exited with non-zero status, thus 
causing
the job to be terminated. The first process to do so was:

  Process name: [[38418,1],1]
  Exit code:255
  


2.   The ofi mtl does not work at all on our Omnipath cluster. If I try to 
force it using 'mpirun -mca mtl ofi ...' I get the following error message.

[compute-1-0:03988] *** An error occurred in MPI_Barrier
[compute-1-0:03988] *** reported by process [2712141825,0]
[compute-1-0:03988] *** on communicator MPI_COMM_WORLD
[compute-1-0:03988] *** MPI_ERR_OTHER: known error not in list
[compute-1-0:03988] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,
[compute-1-0:03988] ***and potentially your MPI job)
[sabine.cacds.uh.edu:21046] 1 more process has sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
[sabine.cacds.uh.edu:21046] Set MCA parameter "orte_base_help_aggregate"
to 0 to see all help / error messages

I once again double checked that this works correctly in the 3.0 (and 3.1, 
although I did not run that test this time).

3.   The openib btl component is always getting in the way with annoying 
warnings. It is not really used, but constantly complains:


[sabine.cacds.uh.edu:25996] 1 more process has sent help message
help-mpi-btl-openib.txt / ib port not selected
[sabine.cacds.uh.edu:25996] Set MCA parameter "orte_base_help_aggregate"
to 0 to see all help / error messages
[sabine.cacds.uh.edu:25996] 1 more process has sent help message
help-mpi-btl-openib.txt / error in device init

So bottom line, if I do

mpirun -mca btl ^openib -mca mtl ^ofi ...

my tests finish correctly, although mpirun will still return an error.

Thanks
Edgar


From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Geoffrey 
Paulsen
Sent: Sunday, September 16, 2018 2:31 PM
To: devel@lists.open-mpi.org
Subject: [OMPI devel] Announcing Open MPI v4.0.0rc1


Th

Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

2018-09-19 Thread Thananon Patinyasakdikul
Matias,

IIRC, the OFI BTL only creates one EP. If you move it to add_procs, you might
need to add a check to avoid re-creating the EP over and over. Do you think
moving EP creation from component_init to component_open will solve the
problem?
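A minimal sketch of the guarded, lazy EP creation being discussed — illustrative Python, not the actual OFI BTL source; all names here are made up for the example:

```python
# Defer endpoint creation from component_init to add_procs, with a guard
# so the EP is created only once no matter how often add_procs runs.
class OfiBtlSketch:
    def __init__(self):
        # component_init equivalent: do NOT open the OFI endpoint here,
        # so merely probing the component during selection has no side effects
        self.endpoint = None

    def add_procs(self, procs):
        # open the endpoint lazily, on first actual use
        if self.endpoint is None:
            self.endpoint = object()   # stands in for the real fi_endpoint() call
        return self.endpoint

btl = OfiBtlSketch()
ep1 = btl.add_procs(["rank0"])
ep2 = btl.add_procs(["rank1"])
print(ep1 is ep2)  # True: the EP is reused, not re-created
```

The point of the guard is exactly the re-creation concern above: add_procs can run more than once, so the open must be idempotent.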

Arm

> On Sep 19, 2018, at 1:08 PM, Cabral, Matias A  
> wrote:
> 
> Hi Edgar,
>  
> I also saw some similar issues, not exactly the same, but they look very
> similar (maybe because of a different libpsm2 version). 1 and 2 are related
> to the introduction of the OFI BTL and the fact that it opens an OFI EP in
> its init function. I see that all BTLs call the init function during
> transport selection time. Moreover, this happens even when you explicitly
> ask for a different one (-mca pml cm -mca mtl psm2).  Workaround: -mca btl
> ^ofi.  My current idea is to update the OFI BTL and move the EP opening to
> add_procs. Feedback?
>  
> Number 3 goes beyond me.
>  
> Thanks,
>  
> _MAC
>  
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of
> Gabriel, Edgar
> Sent: Wednesday, September 19, 2018 9:25 AM
> To: Open MPI Developers 
> Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1
>  
> I performed some tests on our Omnipath cluster, and I have a mixed bag of 
> results with 4.0.0rc1
>  
> 1.   Good news, the problems with the psm2 mtl that I reported in 
> June/July seem to be fixed. I still get however a warning every time I run a 
> job with 4.0.0, e.g.
>  
> compute-1-1.local.4351PSM2 has not been initialized
> compute-1-0.local.3826PSM2 has not been initialized
>  
> although based on the performance, it is very clear that psm2 is being used. 
> I double checked with 3.0 series, I do not get the same warnings on the same 
> set of nodes. The unfortunate part about this error message is that
> applications appear to return an error (although tests and applications
> otherwise finish correctly).
>  
> --
> mpirun detected that one or more processes exited with non-zero status, thus 
> causing
> the job to be terminated. The first process to do so was:
>  
>   Process name: [[38418,1],1]
>   Exit code:255
>   
> 
>  
> 2.   The ofi mtl does not work at all on our Omnipath cluster. If I try 
> to force it using 'mpirun -mca mtl ofi ...' I get the following error message.
>  
> [compute-1-0:03988] *** An error occurred in MPI_Barrier
> [compute-1-0:03988] *** reported by process [2712141825,0]
> [compute-1-0:03988] *** on communicator MPI_COMM_WORLD
> [compute-1-0:03988] *** MPI_ERR_OTHER: known error not in list
> [compute-1-0:03988] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
> will now abort,
> [compute-1-0:03988] ***and potentially your MPI job)
> [sabine.cacds.uh.edu:21046] 1 more process has sent help message 
> help-mpi-errors.txt / mpi_errors_are_fatal
> [sabine.cacds.uh.edu:21046] Set MCA parameter "orte_base_help_aggregate" to 0 
> to see all help / error messages
>  
> I once again double checked that this works correctly in the 3.0 (and 3.1, 
> although I did not run that test this time).
>  
> 3.   The openib btl component is always getting in the way with annoying 
> warnings. It is not really used, but constantly complains:
>  
>  
> [sabine.cacds.uh.edu:25996] 1 more process has sent help message 
> help-mpi-btl-openib.txt / ib port not selected
> [sabine.cacds.uh.edu:25996] Set MCA parameter "orte_base_help_aggregate" to 0 
> to see all help / error messages
> [sabine.cacds.uh.edu:25996] 1 more process has sent help message 
> help-mpi-btl-openib.txt / error in device init
>  
> So bottom line, if I do
>  
> mpirun -mca btl ^openib -mca mtl ^ofi ...
>  
> my tests finish correctly, although mpirun will still return an error.
>  
> Thanks
> Edgar
>  
>  
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Geoffrey Paulsen
> Sent: Sunday, September 16, 2018 2:31 PM
> To: devel@lists.open-mpi.org
> Subject: [OMPI devel] Announcing Open MPI v4.0.0rc1
>  
> The first release candidate for the Open MPI v4.0.0 release is posted at
> https://www.open-mpi.org/software/ompi/v4.0/
> Major changes include:
>  
> 4.0.0 -- September, 2018
> 
>  
> - OSHMEM updated to the OpenSHMEM 1.4 API.
> - Do not build Open SHMEM layer when there are no SPMLs available.
>

Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

2018-09-19 Thread Cabral, Matias A
Hi Edgar,

I also saw some similar issues, not exactly the same, but they look very
similar (maybe because of a different libpsm2 version). 1 and 2 are related to
the introduction of the OFI BTL and the fact that it opens an OFI EP in its
init function. I see that all BTLs call the init function during transport
selection time. Moreover, this happens even when you explicitly ask for a
different one (-mca pml cm -mca mtl psm2).  Workaround: -mca btl ^ofi.  My
current idea is to update the OFI BTL and move the EP opening to add_procs.
Feedback?

Number 3 goes beyond me.

Thanks,

_MAC

From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Gabriel, 
Edgar
Sent: Wednesday, September 19, 2018 9:25 AM
To: Open MPI Developers 
Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

I performed some tests on our Omnipath cluster, and I have a mixed bag of 
results with 4.0.0rc1


1.   Good news, the problems with the psm2 mtl that I reported in June/July 
seem to be fixed. I still get however a warning every time I run a job with 
4.0.0, e.g.



compute-1-1.local.4351PSM2 has not been initialized

compute-1-0.local.3826PSM2 has not been initialized

although based on the performance, it is very clear that psm2 is being used. I 
double checked with 3.0 series, I do not get the same warnings on the same
set of nodes. The unfortunate part about this error message is that
applications appear to return an error (although tests and applications
otherwise finish correctly).

--
mpirun detected that one or more processes exited with non-zero status, thus 
causing
the job to be terminated. The first process to do so was:

  Process name: [[38418,1],1]
  Exit code:255
  



2.   The ofi mtl does not work at all on our Omnipath cluster. If I try to 
force it using 'mpirun -mca mtl ofi ...' I get the following error message.



[compute-1-0:03988] *** An error occurred in MPI_Barrier

[compute-1-0:03988] *** reported by process [2712141825,0]

[compute-1-0:03988] *** on communicator MPI_COMM_WORLD

[compute-1-0:03988] *** MPI_ERR_OTHER: known error not in list

[compute-1-0:03988] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,

[compute-1-0:03988] ***and potentially your MPI job)

[sabine.cacds.uh.edu:21046] 1 more process has sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal

[sabine.cacds.uh.edu:21046] Set MCA parameter "orte_base_help_aggregate" to 0 
to see all help / error messages



I once again double checked that this works correctly in the 3.0 (and 3.1, 
although I did not run that test this time).



3.   The openib btl component is always getting in the way with annoying 
warnings. It is not really used, but constantly complains:



[sabine.cacds.uh.edu:25996] 1 more process has sent help message 
help-mpi-btl-openib.txt / ib port not selected
[sabine.cacds.uh.edu:25996] Set MCA parameter "orte_base_help_aggregate" to 0 
to see all help / error messages
[sabine.cacds.uh.edu:25996] 1 more process has sent help message 
help-mpi-btl-openib.txt / error in device init

So bottom line, if I do

mpirun -mca btl ^openib -mca mtl ^ofi ...

my tests finish correctly, although mpirun will still return an error.

Thanks
Edgar


From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Geoffrey 
Paulsen
Sent: Sunday, September 16, 2018 2:31 PM
To: devel@lists.open-mpi.org
Subject: [OMPI devel] Announcing Open MPI v4.0.0rc1


The first release candidate for the Open MPI v4.0.0 release is posted at

https://www.open-mpi.org/software/ompi/v4.0/

Major changes include:



4.0.0 -- September, 2018





- OSHMEM updated to the OpenSHMEM 1.4 API.

- Do not build Open SHMEM layer when there are no SPMLs available.

  Currently, this means the Open SHMEM layer will only build if

  a MXM or UCX library is found.

- A UCX BTL was added for enhanced MPI RMA support using UCX

- With this release,  OpenIB BTL now only supports iWarp and RoCE by default.

- Updated internal HWLOC to 2.0.1

- Updated internal PMIx to 3.0.1

- Change the priority for selecting external versus internal HWLOC

  and PMIx packages to build.  Starting with this release, configure

  by default selects available external HWLOC and PMIx packages over

  the internal ones.

- Updated internal ROMIO to 3.2.1.

- Removed support for the MXM MTL.

- Improved CUDA support when using UCX.

- Improved support for two phase MPI I/O operations when using OMPIO.

- Added support for Software-based Performance Counters, see
  https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI

- Various improvements to MPI RMA performance when using RDMA
  capable interconnects.

Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

2018-09-19 Thread Gabriel, Edgar
I performed some tests on our Omnipath cluster, and I have a mixed bag of 
results with 4.0.0rc1


1.   Good news, the problems with the psm2 mtl that I reported in June/July 
seem to be fixed. I still get however a warning every time I run a job with 
4.0.0, e.g.



compute-1-1.local.4351PSM2 has not been initialized

compute-1-0.local.3826PSM2 has not been initialized

although based on the performance, it is very clear that psm2 is being used. I 
double checked with 3.0 series, I do not get the same warnings on the same
set of nodes. The unfortunate part about this error message is that
applications appear to return an error (although tests and applications
otherwise finish correctly).

--
mpirun detected that one or more processes exited with non-zero status, thus 
causing
the job to be terminated. The first process to do so was:

  Process name: [[38418,1],1]
  Exit code:255
  



2.   The ofi mtl does not work at all on our Omnipath cluster. If I try to 
force it using 'mpirun -mca mtl ofi ...' I get the following error message.



[compute-1-0:03988] *** An error occurred in MPI_Barrier

[compute-1-0:03988] *** reported by process [2712141825,0]

[compute-1-0:03988] *** on communicator MPI_COMM_WORLD

[compute-1-0:03988] *** MPI_ERR_OTHER: known error not in list

[compute-1-0:03988] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,

[compute-1-0:03988] ***and potentially your MPI job)

[sabine.cacds.uh.edu:21046] 1 more process has sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal

[sabine.cacds.uh.edu:21046] Set MCA parameter "orte_base_help_aggregate" to 0 
to see all help / error messages



I once again double checked that this works correctly in the 3.0 (and 3.1, 
although I did not run that test this time).



3.   The openib btl component is always getting in the way with annoying 
warnings. It is not really used, but constantly complains:



[sabine.cacds.uh.edu:25996] 1 more process has sent help message 
help-mpi-btl-openib.txt / ib port not selected
[sabine.cacds.uh.edu:25996] Set MCA parameter "orte_base_help_aggregate" to 0 
to see all help / error messages
[sabine.cacds.uh.edu:25996] 1 more process has sent help message 
help-mpi-btl-openib.txt / error in device init

So bottom line, if I do

mpirun -mca btl ^openib -mca mtl ^ofi ...

my tests finish correctly, although mpirun will still return an error.

Thanks
Edgar


From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Geoffrey 
Paulsen
Sent: Sunday, September 16, 2018 2:31 PM
To: devel@lists.open-mpi.org
Subject: [OMPI devel] Announcing Open MPI v4.0.0rc1


The first release candidate for the Open MPI v4.0.0 release is posted at

https://www.open-mpi.org/software/ompi/v4.0/

Major changes include:



4.0.0 -- September, 2018





- OSHMEM updated to the OpenSHMEM 1.4 API.

- Do not build Open SHMEM layer when there are no SPMLs available.

  Currently, this means the Open SHMEM layer will only build if

  a MXM or UCX library is found.

- A UCX BTL was added for enhanced MPI RMA support using UCX

- With this release,  OpenIB BTL now only supports iWarp and RoCE by default.

- Updated internal HWLOC to 2.0.1

- Updated internal PMIx to 3.0.1

- Change the priority for selecting external versus internal HWLOC

  and PMIx packages to build.  Starting with this release, configure

  by default selects available external HWLOC and PMIx packages over

  the internal ones.

- Updated internal ROMIO to 3.2.1.

- Removed support for the MXM MTL.

- Improved CUDA support when using UCX.

- Improved support for two phase MPI I/O operations when using OMPIO.

- Added support for Software-based Performance Counters, see
  https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI

- Various improvements to MPI RMA performance when using RDMA
  capable interconnects.

- Update memkind component to use the memkind 1.6 public API.

- Fix problems with use of newer map-by mpirun options.  Thanks to

  Tony Reina for reporting.

- Fix rank-by algorithms to properly rank by object and span

- Allow for running as root if two environment variables are set.

  Requested by Axel Huebl.

- Fix a problem with building the Java bindings when using Java 10.

  Thanks to Bryce Glover for reporting.

Our goal is to release 4.0.0 by mid Oct, so any testing is appreciated.


___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

2018-09-18 Thread Jeff Squyres (jsquyres) via devel
On Sep 18, 2018, at 3:46 PM, Thananon Patinyasakdikul  
wrote:
> 
> I tested on our cluster (UTK). I will give it a thumbs up, but I have some
> comments.
> 
> What I understand with 4.0.
> - openib btl is disabled by default (can be turned on by mca)

It is disabled by default *for InfiniBand*.  It is still enabled by default for 
RoCE and iWARP.
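In other words, the 4.0 default keys off the port's link layer: openib stays on only where the link layer is Ethernet (RoCE and iWARP), and plain InfiniBand ports are skipped. An illustrative sketch of that decision, not Open MPI source:

```python
# Toy model of the v4.0 openib BTL default: RoCE and iWARP run over an
# Ethernet link layer, so only Ethernet ports keep openib enabled by default.
def openib_used_by_default(link_layer):
    return link_layer == "ethernet"   # RoCE/iWARP ports

print(openib_used_by_default("infiniband"))  # False: expected to use UCX instead
print(openib_used_by_default("ethernet"))    # True: RoCE/iWARP keep openib
```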

> - pml ucx will be the default for infiniband hardware.
> - btl uct is for one-sided but can also be used for two-sided (needs an
> explicit mca).
> 
> My question is, what if the user does not have UCX installed (but they have
> InfiniBand hardware)? The user will then have no fast transport for their
> hardware. In my testing, this release falls back to btl/tcp if I don't
> specify the MCA parameter to use uct or force openib. Will this be a problem?

This is a question for Mellanox.

-- 
Jeff Squyres
jsquy...@cisco.com

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

2018-09-18 Thread Thananon Patinyasakdikul
Hi,

I tested on our cluster (UTK). I will give it a thumbs up, but I have some comments.

What I understand with 4.0.
- openib btl is disabled by default (can be turned on by mca)
- pml ucx will be the default for infiniband hardware.
- btl uct is for one-sided but can also be used for two-sided (needs an
explicit mca).

My question is, what if the user does not have UCX installed (but they have
InfiniBand hardware)? The user will then have no fast transport for their
hardware. In my testing, this release falls back to btl/tcp if I don't specify
the MCA parameter to use uct or force openib. Will this be a problem?
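The fallback I'm seeing can be pictured with a toy priority-selection sketch — the component names are real, but the numbers and logic are an illustration, not Open MPI's actual selection code:

```python
# Each PML reports whether it can run and a priority; the highest-priority
# available component wins. With no UCX library and no usable MTL, the
# generic ob1 PML is selected and ends up running over btl/tcp.
pml_candidates = {
    "ucx": {"available": False, "priority": 51},  # UCX library not installed
    "cm":  {"available": False, "priority": 40},  # no usable MTL on this fabric
    "ob1": {"available": True,  "priority": 20},  # generic PML over the BTLs
}
selected = max(
    (name for name, c in pml_candidates.items() if c["available"]),
    key=lambda n: pml_candidates[n]["priority"],
)
print(selected)  # "ob1" -> with openib off by default, that means btl/tcp
```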

Arm

> On Sep 16, 2018, at 3:31 PM, Geoffrey Paulsen  wrote:
> 
> The first release candidate for the Open MPI v4.0.0 release is posted at 
> https://www.open-mpi.org/software/ompi/v4.0/ 
> 
> Major changes include:
> 
> 4.0.0 -- September, 2018
> 
> 
> - OSHMEM updated to the OpenSHMEM 1.4 API.
> - Do not build Open SHMEM layer when there are no SPMLs available.
>   Currently, this means the Open SHMEM layer will only build if
>   a MXM or UCX library is found.
> - A UCX BTL was added for enhanced MPI RMA support using UCX
> - With this release,  OpenIB BTL now only supports iWarp and RoCE by default.
> - Updated internal HWLOC to 2.0.1
> - Updated internal PMIx to 3.0.1
> - Change the priority for selecting external versus internal HWLOC
>   and PMIx packages to build.  Starting with this release, configure
>   by default selects available external HWLOC and PMIx packages over
>   the internal ones.
> - Updated internal ROMIO to 3.2.1.
> - Removed support for the MXM MTL.
> - Improved CUDA support when using UCX.
> - Improved support for two phase MPI I/O operations when using OMPIO.
> - Added support for Software-based Performance Counters, see
>   https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI
> - Various improvements to MPI RMA performance when using RDMA
>   capable interconnects.
> - Update memkind component to use the memkind 1.6 public API.
> - Fix problems with use of newer map-by mpirun options.  Thanks to
>   Tony Reina for reporting.
> - Fix rank-by algorithms to properly rank by object and span
> - Allow for running as root if two environment variables are set.
>   Requested by Axel Huebl.
> - Fix a problem with building the Java bindings when using Java 10.
>   Thanks to Bryce Glover for reporting.
> Our goal is to release 4.0.0 by mid Oct, so any testing is appreciated.
>  
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel