Re: [OMPI devel] MPI ABI effort

2023-08-29 Thread Howard Pritchard via devel
LANL would be interested in supporting this feature as well.

Howard

On Mon, Aug 28, 2023 at 9:58 AM Jeff Squyres (jsquyres) via devel <
devel@lists.open-mpi.org> wrote:

> We got a presentation from the ABI WG (proxied via Quincey from AWS) a few
> months ago.
>
> The proposal looked reasonable.
>
> No one has signed up to do the work yet, but based on what we saw in that
> presentation, the general consensus was "sure, we could probably get on
> board with that."
>
> There are definitely going to be issues to be worked out (e.g., are we going
> to break Open MPI ABI? Maybe offer 2 flavors of ABI? Is this a
> configure-time option, or do we build "both" ways?  ...etc.), but it
> sounded like the community members who heard this proposal were generally
> in favor of moving in this direction.
> --
> *From:* devel  on behalf of Gilles
> Gouaillardet via devel 
> *Sent:* Saturday, August 26, 2023 2:20 AM
> *To:* Open MPI Developers 
> *Cc:* Gilles Gouaillardet 
> *Subject:* [OMPI devel] MPI ABI effort
>
> Folks,
>
> Jeff Hammond et al. published "MPI Application Binary Interface
> Standardization" last week:
> https://arxiv.org/abs/2308.11214
>
> The paper notes that the (C) ABI has already been prototyped natively in MPICH.
>
> Is there any current interest in prototyping this ABI in Open MPI?
>
>
> Cheers,
>
> Gilles
>
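For context, a minimal C sketch of the kind of handle-representation question
such a standardized ABI has to settle. The type names below are invented for
illustration and are not taken from the paper or the WG proposal:

/* Illustration only: these type names are invented for this sketch and are
 * not taken from the ABI proposal. */
#include <stdio.h>

/* Today, Open MPI defines handles as pointers to incomplete structs ... */
typedef struct sketch_ompi_comm_s *pointer_style_comm;

/* ... while MPICH defines them as integer tokens. */
typedef int integer_style_comm;

/* A standardized ABI has to pick one fixed representation so that a binary
 * compiled against one vendor's mpi.h can be relinked against another
 * vendor's libmpi. */
typedef struct sketch_abi_comm_s *abi_style_comm;

int main(void)
{
    printf("pointer-style handle size: %zu bytes\n", sizeof(pointer_style_comm));
    printf("integer-style handle size: %zu bytes\n", sizeof(integer_style_comm));
    printf("standard ABI handle size would be fixed: %zu bytes here\n",
           sizeof(abi_style_comm));
    return 0;
}

Because the existing representations differ, adopting a standard ABI raises
exactly the questions discussed above: keep the current Open MPI ABI, offer
two flavors, or select one at configure time.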


[OMPI devel] Open MPI 4.0.6rc4

2021-06-04 Thread Howard Pritchard via devel
Hi All,

Open MPI v4.0.6rc4 (we messed up and had to skip rc3) is now available at

https://www.open-mpi.org/software/ompi/v4.0/

Changes since the 4.0.5 release include:

- Update embedded PMIx to 3.2.3.  This update addresses several

  MPI_COMM_SPAWN problems.

- Fix an issue with MPI_FILE_GET_BYTE_OFFSET when supplying a

  zero size file view.  Thanks to @shanedsnyder for reporting.

- Fix an issue with MPI_COMM_SPLIT_TYPE not observing the key argument
  correctly (see the sketch after this list).

  Thanks to Wolfgang Bangerth for reporting.

- Fix a derived datatype issue that could lead to potential data

  corruption when using UCX.  Thanks to @jayeshkrishna for reporting.

- Fix a problem with shared memory transport file name collisions.

  Thanks to Moritz Kreutzer for reporting.

- Fix a problem when using Flux PMI and UCX.  Thanks to Sami Ilvonen

  for reporting and supplying a fix.

- Fix a problem with MPIR breakpoint being compiled out using PGI

  compilers.  Thanks to @louisespellacy-arm for reporting.

- Fix some ROMIO issues when using Lustre.  Thanks to Mark Dixon for

  reporting.

- Fix a problem using an external PMIx 4 to build Open MPI 4.0.x.

- Fix a compile problem when using the enable-timing configure option

  and UCX.  Thanks to Jan Bierbaum for reporting.

- Fix a symbol name collision when using the Cray compiler to build

  Open SHMEM.  Thanks to Pak Lui for reporting and fixing.

- Correct an issue encountered when building Open MPI under OSX Big Sur.

  Thanks to FX Coudert for reporting.

- Various fixes to the OFI MTL.

- Fix an issue with allocation of sufficient memory for parsing long

  environment variable values.  Thanks to @zrss for reporting.

- Improve reproducibility of builds to assist Open MPI packages.

  Thanks to Bernhard Wiedmann for bringing this to our attention.
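
For the MPI_COMM_SPLIT_TYPE item above, a minimal sketch (not the reporter's
test case) of what observing the key argument means: the key controls rank
ordering inside the new communicator.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, world_size, node_rank, key;
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Reverse the ordering on purpose: with the fix, rank 0 of
     * MPI_COMM_WORLD should come out last within its node communicator. */
    key = world_size - world_rank;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, key,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    printf("world rank %d -> node rank %d (key %d)\n",
           world_rank, node_rank, key);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}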


Your Open MPI release team.


[OMPI devel] Open MPI 4.0.4rc3 available for testing

2020-06-09 Thread Howard Pritchard via devel
Open MPI v4.0.4rc3 has been posted to
https://www.open-mpi.org/software/ompi/v4.0/

This rc includes a fix for a problem discovered with the memory patcher
code.
As described in the README:

- Open MPI v4.0.4 fixed an issue with the memory patcher's ability to

  intercept shmat and shmdt that could cause wrong answers.  This was

  observed on RHEL8.1 running on ppc64le, but it may affect other systems.


  For more information, please see:

https://github.com/open-mpi/ompi/pull/7778



4.0.4 -- June, 2020

---

- Fix a memory patcher issue intercepting shmat and shmdt.  This was

  observed on RHEL 8.x ppc64le (see README for more info).

- Fix an illegal access issue caught using gcc's address sanitizer.

  Thanks to  Georg Geiser for reporting.

- Add checks to avoid conflicts with a libevent library shipped with LSF.

- Switch to linking against libevent_core rather than libevent, if present.

- Add improved support for UCX 1.9 and later.

- Fix an ABI compatibility issue with the Fortran 2008 bindings.

  Thanks to Alastair McKinstry for reporting.

- Fix an issue with rpath of /usr/lib64 when building OMPI on

  systems with Lustre.  Thanks to David Shrader for reporting.

- Fix a memory leak occurring with certain MPI RMA operations.

- Fix an issue with ORTE's mapping of MPI processes to resources.

  Thanks to Alex Margolin for reporting and providing a fix.

- Correct a problem with incorrect error codes being returned

  by OMPI MPI_T functions.

- Fix an issue with debugger tools not being able to attach

  to mpirun more than once.  Thanks to Gregory Lee for reporting.

- Fix an issue with the Fortran compiler wrappers when using

  NAG compilers.  Thanks to Peter Brady for reporting.

- Fix an issue with the ORTE ssh based process launcher at scale.

  Thanks to Benjamín Hernández for reporting.

- Address an issue when using shared MPI I/O operations.  OMPIO will

  now successfully return from the file open statement but will

  raise an error if the file system does not support shared I/O

  operations (see the sketch after this list).  Thanks to Romain Hild for reporting.

- Fix an issue with MPI_WIN_DETACH.  Thanks to Thomas Naughton for
reporting.
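
For the shared MPI I/O item above, a minimal sketch (not the reporter's code)
of the documented behavior: the open returns successfully, and the
shared-file-pointer call itself reports the error on file systems without
shared I/O support.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    int rc, rank;
    char buf[8] = "hello";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* With the fix described above, this open succeeds even on file
     * systems that cannot support shared file pointers. */
    rc = MPI_File_open(MPI_COMM_WORLD, "out.dat",
                       MPI_MODE_CREATE | MPI_MODE_WRONLY,
                       MPI_INFO_NULL, &fh);
    if (rc == MPI_SUCCESS) {
        /* The error, if any, is reported by the shared-pointer call. */
        MPI_File_set_errhandler(fh, MPI_ERRORS_RETURN);
        rc = MPI_File_write_shared(fh, buf, 5, MPI_CHAR, MPI_STATUS_IGNORE);
        if (rc != MPI_SUCCESS && rank == 0) {
            fprintf(stderr, "shared I/O not supported on this file system\n");
        }
        MPI_File_close(&fh);
    }

    MPI_Finalize();
    return 0;
}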


Note this release addresses an ABI compatibility issue for the Fortran 2008
bindings.

It will not be backward compatible with releases 4.0.0 through 4.0.3 for
applications making use of the Fortran 2008 bindings.


[OMPI devel] Please test Open MPI v4.0.4rc1

2020-05-09 Thread Howard Pritchard via devel
Open MPI v4.0.4rc1 has been posted to
https://www.open-mpi.org/software/ompi/v4.0/

4.0.4 -- May, 2020

---

- Fix an ABI compatibility issue with the Fortran 2008 bindings.

  Thanks to Alastair McKinstry for reporting.

- Fix an issue with rpath of /usr/lib64 when building OMPI on

  systems with Lustre.  Thanks to David Shrader for reporting.

- Fix a memory leak occurring with certain MPI RMA operations.

- Fix an issue with ORTE's mapping of MPI processes to resources.

  Thanks to Alex Margolin for reporting and providing a fix.

- Correct a problem with incorrect error codes being returned

  by OMPI MPI_T functions.

- Fix an issue with debugger tools not being able to attach

  to mpirun more than once.  Thanks to Gregory Lee for reporting.

- Fix an issue with the Fortran compiler wrappers when using

  NAG compilers.  Thanks to Peter Brady for reporting.

- Fix an issue with the ORTE ssh based process launcher at scale.

  Thanks to Benjamín Hernández for reporting.

- Address an issue when using shared MPI I/O operations.  OMPIO will

  now successfully return from the file open statement but will

  raise an error if the file system does not support shared I/O

  operations.  Thanks to Romain Hild for reporting.

- Fix an issue with MPI_WIN_DETACH.  Thanks to Thomas Naughton for
reporting.


Note this release addresses an ABI compatibility issue for the Fortran 2008
bindings.

It will not be backward compatible with releases 4.0.0 through 4.0.3 for
applications making use of the Fortran 2008 bindings.


[OMPI devel] Open MPI 4.0.1rc3 available for testing

2019-03-22 Thread Howard Pritchard
A third release candidate for the Open MPI v4.0.1 release is posted at
https://www.open-mpi.org/software/ompi/v4.0/

Fixes since 4.0.1rc2 include

- Add acquire semantics to an Open MPI internal lock acquire function.

Our goal is to release 4.0.1 by the end of March, so any testing is
appreciated.

The following updates and bug fixes are included in the 4.0.1 release.

4.0.1 -- March, 2019


- Update embedded PMIx to 3.1.2.
- Fix an issue with Vader (shared-memory) transport on OS-X. Thanks
  to Daniel Vollmer for reporting.
- Fix a problem with the usNIC BTL Makefile.  Thanks to George Marselis
  for reporting.
- Fix an issue when using --enable-visibility configure option
  and older versions of hwloc.  Thanks to Ben Menadue for reporting
  and providing a fix.
- Fix an issue with MPI_WIN_CREATE_DYNAMIC and MPI_GET from self
  (see the sketch after this list).  Thanks to Bart Janssens for reporting.
- Fix an issue of excessive compiler warning messages from mpi.h
  when using newer C++ compilers.  Thanks to @Shadow-fax for
  reporting.
- Fix a problem when building Open MPI using clang 5.0.
- Fix a problem with MPI_WIN_CREATE when using UCX.  Thanks
  to Adam Simpson for reporting.
- Fix a memory leak encountered for certain MPI datatype
  destructor operations.  Thanks to Axel Huebl for reporting.
- Fix several problems with MPI RMA accumulate operations.
  Thanks to Jeff Hammond for reporting.
- Fix possible race condition in closing some file descriptors
  during job launch using mpirun.  Thanks to Jason Williams
  for reporting and providing a fix.
- Fix a problem in OMPIO for large individual write operations.
  Thanks to Axel Huebl for reporting.
- Fix a problem with parsing of map-by ppr options to mpirun.
  Thanks to David Rich for reporting.
- Fix a problem observed when using the mpool hugepage component.  Thanks
  to Hunter Easterday for reporting and fixing.
- Fix valgrind warning generated when invoking certain MPI Fortran
  data type creation functions.  Thanks to @rtoijala for reporting.
- Fix a problem when trying to build with a PMIX 3.1 or newer
  release.  Thanks to Alastair McKinstry for reporting.
- Fix a problem encountered with building MPI F08 module files.
  Thanks to Igor Andriyash and Axel Huebl for reporting.
- Fix two memory leaks encountered for certain MPI-RMA usage patterns.
  Thanks to Joseph Schuchart for reporting and fixing.
- Fix a problem with the ORTE rmaps_base_oversubscribe MCA parameter.
  Thanks to @iassiour for reporting.
- Fix a problem with UCX PML default error handler for MPI communicators.
  Thanks to Marcin Krotkiewski for reporting.
- Fix various issues with OMPIO uncovered by the testmpio test suite.
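
For the MPI_WIN_CREATE_DYNAMIC item above, a minimal sketch (not the
reporter's code) of the fixed pattern: attach local memory to a dynamic
window and read it back from self with MPI_GET.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Win win;
    MPI_Aint disp;
    int src = 42, dst = 0, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_attach(win, &src, sizeof(src));

    /* For a dynamic window the target displacement is the absolute
     * address of the attached memory, obtained with MPI_Get_address. */
    MPI_Get_address(&src, &disp);

    MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);
    MPI_Get(&dst, 1, MPI_INT, rank, disp, 1, MPI_INT, win);
    MPI_Win_unlock(rank, win);

    printf("rank %d read %d from itself\n", rank, dst);

    MPI_Win_detach(win, &src);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}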

Thanks very much,

Your Open MPI release team

[OMPI devel] Open MPI 4.0.1rc2 available for testing

2019-03-19 Thread Howard Pritchard
A second release candidate for the Open MPI v4.0.1 release is posted at
https://www.open-mpi.org/software/ompi/v4.0/

Fixes since 4.0.1rc1 include

- Fix an issue with Vader (shared-memory) transport on OS-X. Thanks
  to Daniel Vollmer for reporting.
- Fix a problem with the usNIC BTL Makefile.  Thanks to
  George Marselis for reporting.

Our goal is to release 4.0.1 before the end of March, so any testing is
appreciated.

Thanks,

your Open MPI release team

[OMPI devel] Open MPI 4.0.1rc1 available for testing

2019-03-01 Thread Howard Pritchard
The first release candidate for the Open MPI v4.0.1 release is posted at
https://www.open-mpi.org/software/ompi/v4.0/


Major changes include:


- Update embedded PMIx to 3.1.2.

- Fix an issue when using --enable-visibility configure option

  and older versions of hwloc.  Thanks to Ben Menadue for reporting

  and providing a fix.

- Fix an issue with MPI_WIN_CREATE_DYNAMIC and MPI_GET from self.

  Thanks to Bart Janssens for reporting.

- Fix an issue of excessive compiler warning messages from mpi.h

  when using newer C++ compilers.  Thanks to @Shadow-fax for

  reporting.

- Fix a problem when building Open MPI using clang 5.0.

- Fix a problem with MPI_WIN_CREATE when using UCX.  Thanks

  to Adam Simpson for reporting.

- Fix a memory leak encountered for certain MPI datatype

  destructor operations.  Thanks to Axel Huebl for reporting.

- Fix several problems with MPI RMA accumulate operations.

  Thanks to Jeff Hammond for reporting.

- Fix possible race condition in closing some file descriptors

  during job launch using mpirun.  Thanks to Jason Williams

  for reporting and providing a fix.

- Fix a problem in OMPIO for large individual write operations.

  Thanks to Axel Huebl for reporting.

- Fix a problem with parsing of map-by ppr options to mpirun.

  Thanks to David Rich for reporting.

- Fix a problem observed when using the mpool hugepage component.  Thanks

  to Hunter Easterday for reporting and fixing.

- Fix valgrind warning generated when invoking certain MPI Fortran

  data type creation functions.  Thanks to @rtoijala for reporting.

- Fix a problem when trying to build with a PMIX 3.1 or newer

  release.  Thanks to Alastair McKinstry for reporting.

- Fix a problem encountered with building MPI F08 module files.

  Thanks to Igor Andriyash and Axel Huebl for reporting.

- Fix two memory leaks encountered for certain MPI-RMA usage patterns.

  Thanks to Joseph Schuchart for reporting and fixing.

- Fix a problem with the ORTE rmaps_base_oversubscribe MCA parameter.

  Thanks to @iassiour for reporting.

- Fix a problem with UCX PML default error handler for MPI communicators.

  Thanks to Marcin Krotkiewski for reporting.

- Fix various issues with OMPIO uncovered by the testmpio test suite.


Our goal is to release 4.0.1 by mid March, so any testing is appreciated.


Thanks,


your Open MPI release team

Re: [OMPI devel] Entry in mca-btl-openib-device-params.ini

2018-10-15 Thread Howard Pritchard
Hello Sindhu,

Open a GitHub PR with your changes.  See
https://github.com/open-mpi/ompi/wiki/SubmittingPullRequests

Howard


On Mon, Oct 15, 2018 at 13:26, Devale, Sindhu <
sindhu.dev...@intel.com> wrote:

> Hi,
>
>
>
> I need to add an entry to the mca-btl-openib-device-params.ini file.
>
>
>
> What is the procedure to do that? Do I need to send out a patch?
>
>
>
> Thank you,
>
> Sindhu
>
>
>
>
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] testing again (EOM)

2018-09-01 Thread Howard Pritchard


[OMPI devel] Open MPI website borked up?

2018-09-01 Thread Howard Pritchard
Hi Folks,

Something seems to be borked up about the OMPI website.  Go to the website
and you'll get some odd parsing error appearing.

Howard

[OMPI devel] testing if NMC mail server working again

2018-08-28 Thread Howard Pritchard


[OMPI devel] Open MPI 2.1.3rc3 available for testing

2018-03-14 Thread Howard Pritchard
HI Folks,

A few MPI I/O bugs (both in OMPIO and the ROMIO glue layer) were found in
rc2, so we're doing an rc3.


Open MPI 2.1.3rc3 tarballs are available for testing at the usual place:

https://www.open-mpi.org/software/ompi/v2.1/

This is a bug fix release for the Open MPI 2.1.x release stream.
Items fixed in this release include the following:

- Update internal PMIx version to 1.2.5.
- Fix a problem with ompi_info reporting using param option.
  Thanks to Alexander Pozdneev for reporting.
- Correct PMPI_Aint_{add|diff} to be functions (not subroutines)
  in the Fortran mpi_f08 module.
- Fix a problem when doing MPI I/O using data types with large
  extents in conjunction with MPI_TYPE_CREATE_SUBARRAY (see the
  sketch after this list).  Thanks to Christopher Brady for reporting.
- Fix a problem when opening many files using MPI_FILE_OPEN.
  Thanks to William Dawson for reporting.
- Fix a problem with debuggers failing to attach to a running job.
  Thanks to Dirk Schubert for reporting.
- Fix a problem when using madvise and the OpenIB BTL.  Thanks to
  Timo Bingmann for reporting.
- Fix a problem in the Vader BTL that resulted in failures of
  IMB under certain circumstances.  Thanks to Nicolas Morey-
  Chaisemartin for reporting.
- Fix a problem preventing Open MPI from working under Cygwin.
  Thanks to Marco Atzeri for reporting.
- Reduce some verbosity being emitted by the USNIC BTL under certain
  circumstances.  Thanks to Peter Forai for reporting.
- Fix a problem with misdirection of SIGKILL.  Thanks to Michael Fern
  for reporting.
- Replace use of posix_memalign with malloc for small allocations.  Thanks
  to Ben Menaude for reporting.
- Fix a problem with Open MPI's out of band TCP network for file descriptors
  greater than 32767.  Thanks to Wojtek Wasko for reporting and fixing.
- Plug a memory leak in MPI_Mem_free().  Thanks to Philip Blakely for
reporting.
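
For the MPI_TYPE_CREATE_SUBARRAY item above, a minimal sketch (not the
reporter's code) of the affected pattern: a subarray file view used for a
collective write.  The fix concerns such types when their extents become
very large; this sketch assumes a run with four ranks.

#include <mpi.h>

int main(int argc, char **argv)
{
    int gsizes[2] = {1024, 1024};   /* global 2-D array of ints */
    int lsizes[2] = {1024, 256};    /* this rank's column block  */
    int starts[2];
    int rank, rc;
    MPI_Datatype filetype;
    MPI_File fh;
    static int block[1024][256];    /* zero-initialized local data */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    starts[0] = 0;
    starts[1] = (rank % 4) * 256;   /* assumes 4 ranks, one block each */

    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    rc = MPI_File_open(MPI_COMM_WORLD, "array.dat",
                       MPI_MODE_CREATE | MPI_MODE_WRONLY,
                       MPI_INFO_NULL, &fh);
    if (rc == MPI_SUCCESS) {
        MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
        MPI_File_write_all(fh, block, 1024 * 256, MPI_INT, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }

    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}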

Thanks,

Your Open MPI release team

[OMPI devel] Open MPI 2.1.3 rc2 available for testing

2018-02-22 Thread Howard Pritchard
Hello Folks,

We discovered a bug in the osc/rdma component that we wanted to fix in this
release, hence an rc2.

Open MPI 2.1.3rc2 tarballs are available for testing at the usual place:

https://www.open-mpi.org/software/ompi/v2.1/

This is a bug fix release for the Open MPI 2.1.x release stream.
Items fixed in this release include the following:

- Update internal PMIx version to 1.2.5.
- Fix a problem with ompi_info reporting using param option.
  Thanks to Alexander Pozdneev for reporting.
- Correct PMPI_Aint_{add|diff} to be functions (not subroutines)
  in the Fortran mpi_f08 module.
- Fix a problem when doing MPI I/O using data types with large
  extents in conjunction with MPI_TYPE_CREATE_SUBARRAY.  Thanks to
  Christopher Brady for reporting.
- Fix a problem when opening many files using MPI_FILE_OPEN.
  Thanks to William Dawson for reporting.
- Fix a problem with debuggers failing to attach to a running job.
  Thanks to Dirk Schubert for reporting.
- Fix a problem when using madvise and the OpenIB BTL.  Thanks to
  Timo Bingmann for reporting.
- Fix a problem in the Vader BTL that resulted in failures of
  IMB under certain circumstances.  Thanks to Nicolas Morey-
  Chaisemartin for reporting.
- Fix a problem preventing Open MPI from working under Cygwin.
  Thanks to Marco Atzeri for reporting.
- Reduce some verbosity being emitted by the USNIC BTL under certain
  circumstances.  Thanks to Peter Forai for reporting.
- Fix a problem with misdirection of SIGKILL.  Thanks to Michael Fern
  for reporting.
- Replace use of posix_memalign with malloc for small allocations.  Thanks
  to Ben Menaude for reporting.
- Fix a problem with Open MPI's out of band TCP network for file descriptors
  greater than 32767.  Thanks to Wojtek Wasko for reporting and fixing.
- Plug a memory leak in MPI_Mem_free().  Thanks to Philip Blakely for
reporting.

Thanks,

Your Open MPI release team

[OMPI devel] Open MPI 2.1.3 rc1 available for testing

2018-02-15 Thread Howard Pritchard
Hello Folks,

Open MPI 2.1.3rc1 tarballs are available for testing at the usual place:

https://www.open-mpi.org/software/ompi/v2.1/

This is a bug fix release for the Open MPI 2.1.x release stream.
Items fixed in this release include the following:

- Update internal PMIx version to 1.2.5.
- Fix a problem with ompi_info reporting using param option.
  Thanks to Alexander Pozdneev for reporting.
- Correct PMPI_Aint_{add|diff} to be functions (not subroutines)
  in the Fortran mpi_f08 module.
- Fix a problem when doing MPI I/O using data types with large
  extents in conjunction with MPI_TYPE_CREATE_SUBARRAY.  Thanks to
  Christopher Brady for reporting.
- Fix a problem when opening many files using MPI_FILE_OPEN.
  Thanks to William Dawson for reporting.
- Fix a problem with debuggers failing to attach to a running job.
  Thanks to Dirk Schubert for reporting.
- Fix a problem when using madvise and the OpenIB BTL.  Thanks to
  Timo Bingmann for reporting.
- Fix a problem in the Vader BTL that resulted in failures of
  IMB under certain circumstances.  Thanks to Nicolas Morey-
  Chaisemartin for reporting.
- Fix a problem preventing Open MPI from working under Cygwin.
  Thanks to Marco Atzeri for reporting.
- Reduce some verbosity being emitted by the USNIC BTL under certain
  circumstances.  Thanks to Peter Forai for reporting.
- Fix a problem with misdirection of SIGKILL.  Thanks to Michael Fern
  for reporting.
- Replace use of posix_memalign with malloc for small allocations.  Thanks
  to Ben Menaude for reporting.
- Fix a problem with Open MPI's out of band TCP network for file descriptors
  greater than 32767.  Thanks to Wojtek Wasko for reporting and fixing.
- Plug a memory leak in MPI_Mem_free().  Thanks to Philip Blakely for
reporting.

Thanks,

Your Open MPI release team

Re: [OMPI devel] hwloc2 and cuda and non-default cudatoolkit install location

2017-12-22 Thread Howard Pritchard
Okay.

I'll wait till we've had the discussion about removing embedded versions.

I appreciate the use of pkg-config, but it doesn't look like the cudatoolkit
8.0 installed on our systems includes *.pc files.

Howard


2017-12-20 14:55 GMT-07:00 r...@open-mpi.org :

> FWIW: what we do in PMIx (where we also have some overlapping options) is
> to add in OMPI a new --enable-pmix-foo option and then have the configury
> in the corresponding OMPI component convert it to use inside of the
> embedded PMIx itself. It isn’t a big deal - just have to do a little code
> to save the OMPI settings where they overlap, reset those, and then check
> for the pmix-specific values to re-enable those that are specified.
>
> Frankly, I prefer that to modifying the non-embedded options - after all,
> we hope to remove the embedded versions in the near future anyway.
>
>
> On Dec 20, 2017, at 1:45 PM, Brice Goglin  wrote:
>
> On 20/12/2017 at 22:01, Howard Pritchard wrote:
>
> I can think of several ways to fix it.  Easiest would be to modify the
> opal/mca/hwloc/hwloc2a/configure.m4
> to not set --enable-cuda if --with-cuda is evaluated to something other
> than yes.
>
> Optionally, I could fix the hwloc configury to use a --with-cuda argument
> rather than an --enable-cuda configury argument.  Would
> such a configury argument change be traumatic for the hwloc community?
> I think it would be weird to have both an --enable-cuda and a --with-cuda
> configury argument for hwloc.
>
>
> Hello
>
> hwloc currently only has --enable-foo configure options, but very few
> --with-foo. We rely on pkg-config and variables for setting dependency
> paths.
>
> OMPI seems to use --enable for enabling features, and --with for enabling
> dependencies and setting dependency paths. If that's the official
> recommended way to choose between --enable and --with, maybe hwloc should
> just replace many --enable-foo with --with-foo ? But I tend to think we
> should support both to ease the transition?
>
> Brice
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
>
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
>

[OMPI devel] hwloc2 and cuda and non-default cudatoolkit install location

2017-12-20 Thread Howard Pritchard
Hi Folks,

I've a question about where to fix a problem I'm having building Open MPI
master and its
embedded hwloc2a on a cluster we have that sports Nvidia GPUs.

Here's the problem: I want Open MPI to be smart about CUDA, so I try to
configure with:

./configure --prefix=/usr/projects/hpctools/hpp/ompi/install
--with-cuda=/opt/cudatoolkit/8.0


Well that doesn't work because of a hwloc thingy:


--- MCA component hwloc:hwloc2a (m4 configuration macro, priority 90)

checking for MCA component hwloc:hwloc2a compile mode... static

checking for hwloc --enable-cuda value... "yes"

checking hwloc building mode... embedded

configure: hwloc builddir:
/usr/projects/hpctools/hpp/ompi/opal/mca/hwloc/hwloc2a/hwloc

configure: hwloc srcdir:
/usr/projects/hpctools/hpp/ompi/opal/mca/hwloc/hwloc2a/hwloc

checking for hwloc version... shmem-20170815.1857.git2478ce8

checking if want hwloc maintainer support... disabled (embedded mode)

checking for hwloc directory prefix... opal/mca/hwloc/hwloc2a/hwloc/

checking for hwloc symbol prefix... opal_hwloc2a_

checking size of void *... (cached) 8

checking which OS support to include... Linux

checking which CPU support to include... x86_64

checking size of unsigned long... 8

checking size of unsigned int... 4

checking for the C compiler vendor... gnu

checking for __attribute__... yes

checking for __attribute__(aligned)... yes

checking for __attribute__(always_inline)... yes

checking for __attribute__(cold)... yes

checking for __attribute__(const)... yes

checking for __attribute__(deprecated)... yes

checking for __attribute__(format)... yes

checking for __attribute__(hot)... yes

checking for __attribute__(malloc)... yes

checking for __attribute__(may_alias)... yes

checking for __attribute__(no_instrument_function)... yes

checking for __attribute__(nonnull)... yes

checking for __attribute__(noreturn)... yes

checking for __attribute__(packed)... yes

checking for __attribute__(pure)... yes

checking for __attribute__(sentinel)... yes

checking for __attribute__(unused)... yes

checking for __attribute__(warn_unused_result)... yes

checking for __attribute__(weak_alias)... yes

checking if gcc -std=gnu99 supports -fvisibility=hidden... yes

checking whether to enable symbol visibility... yes (via
-fvisibility=hidden)

configure: WARNING: "-fvisibility=hidden" has been added to the hwloc CFLAGS

checking whether the C compiler rejects function calls with too many
arguments... yes

checking whether the C compiler rejects function calls with too few
arguments... yes

checking for unistd.h... (cached) yes

checking for dirent.h... (cached) yes

checking for strings.h... (cached) yes

checking ctype.h usability... yes

checking ctype.h presence... yes

checking for ctype.h... yes

checking for strncasecmp... yes

checking whether strncasecmp is declared... yes

checking whether function strncasecmp has a complete prototype... yes

checking for strftime... yes

checking for setlocale... yes

checking for stdint.h... (cached) yes

checking for sys/mman.h... (cached) yes

checking for KAFFINITY... no

checking for PROCESSOR_CACHE_TYPE... no

checking for CACHE_DESCRIPTOR... no

checking for LOGICAL_PROCESSOR_RELATIONSHIP... no

checking for RelationProcessorPackage... no

checking for SYSTEM_LOGICAL_PROCESSOR_INFORMATION... no

checking for GROUP_AFFINITY... no

checking for PROCESSOR_RELATIONSHIP... no

checking for NUMA_NODE_RELATIONSHIP... no

checking for CACHE_RELATIONSHIP... no

checking for PROCESSOR_GROUP_INFO... no

checking for GROUP_RELATIONSHIP... no

checking for SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX... no

checking for PSAPI_WORKING_SET_EX_BLOCK... no

checking for PSAPI_WORKING_SET_EX_INFORMATION... no

checking for PROCESSOR_NUMBER... no

checking for main in -lgdi32... no

checking for PostQuitMessage in -luser32... no

checking windows.h usability... no

checking windows.h presence... no

checking for windows.h... no

checking sys/lgrp_user.h usability... no

checking sys/lgrp_user.h presence... no

checking for sys/lgrp_user.h... no

checking kstat.h usability... no

checking kstat.h presence... no

checking for kstat.h... no

checking whether fabsf is declared... yes

checking for fabsf in -lm... yes

checking picl.h usability... no

checking picl.h presence... no

checking for picl.h... no

checking whether _SC_NPROCESSORS_ONLN is declared... yes

checking whether _SC_NPROCESSORS_CONF is declared... yes

checking whether _SC_NPROC_ONLN is declared... no

checking whether _SC_NPROC_CONF is declared... no

checking whether _SC_PAGESIZE is declared... yes

checking whether _SC_PAGE_SIZE is declared... yes

checking whether _SC_LARGE_PAGESIZE is declared... no

checking mach/mach_host.h usability... no

checking mach/mach_host.h presence... no

checking for mach/mach_host.h... no

checking mach/mach_init.h usability... no

checking mach/mach_init.h presence... no

checking for mach/mach_init.h... no

checking for sys/param.h... (cached) yes

che

[OMPI devel] 2.0.4rc3 is available for testing

2017-11-07 Thread Howard Pritchard
Hi Folks,

We fixed one more thing for the 2.0.4 release, so there's another rc, now
rc3.
The fixed item was a problem with neighbor collectives.  Thanks to Lisandro
Dalcin for reporting.

Tarballs are at the usual place,

https://www.open-mpi.org/software/ompi/v2.0/

Thanks,

Open MPI release team

[OMPI devel] Open MPI 2.0.4rc2 available for testing

2017-11-01 Thread Howard Pritchard
HI Folks,

We decided to roll an rc2 to pick up a PMIx fix:

- Fix an issue with visibility of functions defined in the built-in PMIx.
  Thanks to Siegmar Gross for reporting this issue.

Tarballs can be found at the usual place

https://www.open-mpi.org/software/ompi/v2.0/

Thanks,

Your Open MPI release team

[OMPI devel] Open MPI 2.0.4rc1 available for testing

2017-10-29 Thread Howard Pritchard
HI Folks,

Open MPI 2.0.4rc1 is available for download and testing at

https://www.open-mpi.org/software/ompi/v2.0/


Fixes in this release include:

2.0.4 -- October, 2017
--

Bug fixes/minor improvements:
- Add configure check to prevent trying to build this release of
  Open MPI with an external hwloc 2.0 or newer release.
- Add ability to specify layered providers for OFI MTL.
- Fix a correctness issue with Open MPI's memory manager code
  that could result in corrupted message data.  Thanks to
  Valentin Petrov for reporting.
- Fix issues encountered when using newer versions of PBS Pro.
  Thanks to Petr Hanousek for reporting.
- Fix a problem with MPI_GET when using the vader BTL.  Thanks
  to Dahai Guo for reporting.
- Fix a problem when using MPI_ANY_SOURCE with MPI_SENDRECV_REPLACE.
  Thanks to Dahai Guo for reporting.
- Fix a problem using MPI_FILE_OPEN with a communicator with an
  attached cartesian topology.  Thanks to Wei-keng Liao for reporting.
- Remove IB XRC support from the OpenIB BTL due to lack of support.
- Remove support for big endian PowerPC.
- Remove support for XL compilers older than v13.1

Thanks,

Your MPI release team

[OMPI devel] MTT database

2017-10-12 Thread Howard Pritchard
Is anyone seeing issues with MTT today?
When I go to the website and click on summary I get this back in my browser
window:

MTTDatabase abort: Could not connect to the ompidb database; submit this
run later.

Howard

[OMPI devel] Open MPI 2.1.2rc4 available for testing

2017-09-13 Thread Howard Pritchard
Hello Folks,

Open MPI 2.1.2 rc4 is uploaded to the usual place:

https://www.open-mpi.org/software/ompi/v2.1/

Issues addressed since the last release candidate:

- Fix a configury problem with the embedded PMIx 1.2.3 package.
- Add an option when using SLURM to launch processes on the head
  node via the slurmd daemon rather than mpirun.
- Fix a problem with one of Open MPI's opal_path_nfs make check tests.

Thanks,

Howard and Jeff

[OMPI devel] KNL/hwloc funny message question

2017-09-01 Thread Howard Pritchard
Hi Folks,

I just now subscribed to the hwloc user mail list, but I suspect that
requires human intervention to get on, so that might not happen until
next week.

Alas, Google has failed to help me understand the message.

So, I decided to post to Open MPI devel list and see if I get a response.

Here's what I see with Open MPI 2.1.1:

srun -n 4 ./hello_c

Invalid knl_memoryside_cache header, expected "version: 1".

Invalid knl_memoryside_cache header, expected "version: 1".

Invalid knl_memoryside_cache header, expected "version: 1".

Invalid knl_memoryside_cache header, expected "version: 1".

Hello, world, I am 0 of 4, (Open MPI v2.1.1rc1, package: Open MPI
dshrader@tt-fey1 Distribution, ident: 2.1.1rc1, repo rev:
v2.1.1-4-g5ded3a2d, Unreleased developer copy, 142)

Hello, world, I am 1 of 4, (Open MPI v2.1.1rc1, package: Open MPI
dshrader@tt-fey1 Distribution, ident: 2.1.1rc1, repo rev:
v2.1.1-4-g5ded3a2d, Unreleased developer copy, 142)

Hello, world, I am 2 of 4, (Open MPI v2.1.1rc1, package: Open MPI
dshrader@tt-fey1 Distribution, ident: 2.1.1rc1, repo rev:
v2.1.1-4-g5ded3a2d, Unreleased developer copy, 142)

Hello, world, I am 3 of 4, (Open MPI v2.1.1rc1, package: Open MPI
dshrader@tt-fey1 Distribution, ident: 2.1.1rc1, repo rev:
v2.1.1-4-g5ded3a2d, Unreleased developer copy, 142)
Anyone know what might be causing hwloc to report this invalid
knl_memoryside_cache header thingy?

Thanks for any help,

Howard

[OMPI devel] Open MPI 2.1.2rc3 available for testing

2017-08-30 Thread Howard Pritchard
Hi Folks,

Open MPI 2.1.2rc3 tarballs are available for testing at the usual place:

https://www.open-mpi.org/software/ompi/v2.1/

Fixes since rc2:

Issue #4122: CMA compilation error in SM BTL.  Thanks to Paul Hargrove
for catching this.
Issue #4034: NAG Fortran compiler -rpath configuration error.  Thanks to
Neil Carlson for
reporting.

Also, removed support for big endian PPC and XL compilers older than 13.1.

Thanks,

Jeff and Howard

[OMPI devel] Open MPI 2.1.2rc2 available

2017-08-17 Thread Howard Pritchard
Hello Folks,

Open MPI 2.1.2rc2 tarballs are available for testing:

https://www.open-mpi.org/software/ompi/v2.1/

Fixes since rc1:

Issue #4069 - PMIx visibility problem.  Thanks to Siegmar Gross for
reporting.
Issue #2324 - Fix a problem with neighbor collectives.  Thanks to Lisandro
Dalcin for reporting.

Thanks,

Howard and Jeff

[OMPI devel] Open MPI v2.1.2rc1 available

2017-08-10 Thread Howard Pritchard
Hi Folks,


Open MPI v2.1.2rc1 tarballs are available for testing at the usual

place:

https://www.open-mpi.org/software/ompi/v2.1/


There is an outstanding issue which will be fixed before the final release:


https://github.com/open-mpi/ompi/issues/4069


but we wanted to get an rc1 out to see what else we may need

to fix.


Bug fixes/changes in this release include:


- Remove IB XRC support from the OpenIB BTL due to loss of maintainer.
- Fix a problem with MPI_IALLTOALLW when using zero-length messages.
  Thanks to Dahai Guo for reporting.
- Fix a problem with C11 generic type interface for SHMEM_G.  Thanks
  to Nick Park for reporting.
- Switch to using the lustreapi.h include file when building Open MPI
  with Lustre support.
- Fix a problem in the OB1 PML that led to hangs with OSU collective tests.
- Fix a progression issue with MPI_WIN_FLUSH_LOCAL.  Thanks to
  Joseph Schuchart for reporting.
- Fix an issue with recent versions of PBSPro requiring libcrypto.
  Thanks to Petr Hanousek for reporting.
- Fix a problem when using MPI_ANY_SOURCE with MPI_SENDRECV.
- Fix an issue that prevented signals from being propagated to ORTE
  daemons.
- Ensure that signals are forwarded from ORTE daemons to all processes
  in the process group created by the daemons.  Thanks to Ted Sussman
  for reporting.
- Fix a problem with launching a job under a debugger. Thanks to
  Greg Lee for reporting.
- Fix a problem with Open MPI native I/O MPI_FILE_OPEN when using
  a communicator having an associated topology.  Thanks to
  Wei-keng Liao for reporting.
- Fix an issue when using MPI_ACCUMULATE with derived datatypes.
- Fix a problem with Fortran bindings that led to compilation errors
  for user defined reduction operations.  Thanks to Nathan Weeks for
  reporting.
- Fix ROMIO issues with large writes/reads when using NFS file systems.
- Fix definition of Fortran MPI_ARGV_NULL and MPI_ARGVS_NULL.
- Enable use of the head node of a SLURM allocation on Cray XC systems.
- Fix a problem with synchronous sends when using the UCX PML.
- Use default socket buffer size to improve TCP BTL performance.

[OMPI devel] hwloc 2 thing

2017-07-20 Thread Howard Pritchard
Hi Folks,

I'm noticing that if I pull a recent version of master with hwloc 2 support
into my local repo, my autogen.pl run fails unless I do the following:

mkdir $PWD/opal/mca/hwloc/hwloc2x/hwloc/include/private/autogen

where PWD is the top level of my work area.

I did a

git clean -df

but that did not help.

Is anyone else seeing this?

Just curious,

Howard

Re: [OMPI devel] Open MPI 3.0.0 first release candidate posted

2017-06-29 Thread Howard Pritchard
Brian,

Things look much better with this patch.  We need it for the 3.0.0 release.
The patch from PR 3794 applied cleanly from master.

Howard


2017-06-29 16:51 GMT-06:00 r...@open-mpi.org :

> I tracked down a possible source of the oob/tcp error - this should
> address it, I think: https://github.com/open-mpi/ompi/pull/3794
>
> On Jun 29, 2017, at 3:14 PM, Howard Pritchard  wrote:
>
> Hi Brian,
>
> I tested this rc using both srun native launch and mpirun on the following
> systems:
> - LANL CTS-1 systems (haswell + Intel OPA/PSM2)
> - LANL network testbed system (haswell  + connectX5/UCX and OB1)
> - LANL Cray XC
>
> I am finding some problems with mpirun on the network testbed system.
>
> For example, for spawn_with_env_vars from IBM tests:
>
> *** Error in `mpirun': corrupted double-linked list: 0x006e75b0 ***
>
> === Backtrace: =
>
> /usr/lib64/libc.so.6(+0x7bea2)[0x76597ea2]
>
> /usr/lib64/libc.so.6(+0x7cec6)[0x76598ec6]
>
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(
> opal_proc_table_remove_all+0x91)[0x77855851]
>
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_
> ud.so(+0x5e09)[0x73cc0e09]
>
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_
> ud.so(+0x5952)[0x73cc0952]
>
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(
> +0x6b032)[0x77b94032]
>
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(
> mca_base_framework_close+0x7d)[0x7788592d]
>
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_ess_hnp.so(+0x3e4d)[
> 0x75b04e4d]
>
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(
> orte_finalize+0x79)[0x77b43bf9]
>
> mpirun[0x4014f1]
>
> mpirun[0x401018]
>
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7653db15]
>
> mpirun[0x400f29]
>
> and another like
>
> [hpp@hi-master dynamic (master *)]$mpirun -np 1 ./spawn_with_env_vars
>
> Spawning...
>
> Spawned
>
> Child got foo and baz env variables -- yay!
>
> *** Error in `mpirun': corrupted double-linked list: 0x006eb350 ***
>
> === Backtrace: =
>
> /usr/lib64/libc.so.6(+0x7b184)[0x76597184]
>
> /usr/lib64/libc.so.6(+0x7d1ec)[0x765991ec]
>
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_tcp.so(+0x57a2)[
> 0x732297a2]
>
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_tcp.so(+0x5a87)[
> 0x73229a87]
>
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(
> +0x6b032)[0x77b94032]
>
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(
> mca_base_framework_close+0x7d)[0x7788592d]
>
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_ess_hnp.so(+0x3e4d)[
> 0x75b04e4d]
>
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(
> orte_finalize+0x79)[0x77b43bf9]
>
> mpirun[0x4014f1]
>
> mpirun[0x401018]
>
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7653db15]
>
> mpirun[0x400f29]
> It doesn't happen on every run though.
>
> I'll do some more investigating, but probably not till next week.
>
> Howard
>
>
> 2017-06-28 11:50 GMT-06:00 Barrett, Brian via devel <
> devel@lists.open-mpi.org>:
>
>> The first release candidate of Open MPI 3.0.0 is now available (
>> https://www.open-mpi.org/software/ompi/v3.0/).  We expect to have at
>> least one more release candidate, as there are still outstanding MPI-layer
>> issues to be resolved (particularly around one-sided).  We are posting
>> 3.0.0rc1 to get feedback on run-time stability, as one of the big features
>> of Open MPI 3.0 is the update to the PMIx 2 runtime environment.  We would
>> appreciate any and all testing you can do,  around run-time behaviors.
>>
>> Thank you,
>>
>> Brian & Howard
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>

Re: [OMPI devel] Open MPI 3.0.0 first release candidate posted

2017-06-29 Thread Howard Pritchard
Hi Brian,

I tested this rc using both srun native launch and mpirun on the following
systems:
- LANL CTS-1 systems (haswell + Intel OPA/PSM2)
- LANL network testbed system (haswell  + connectX5/UCX and OB1)
- LANL Cray XC

I am finding some problems with mpirun on the network testbed system.

For example, for spawn_with_env_vars from IBM tests:

*** Error in `mpirun': corrupted double-linked list: 0x006e75b0 ***

=== Backtrace: =

/usr/lib64/libc.so.6(+0x7bea2)[0x76597ea2]

/usr/lib64/libc.so.6(+0x7cec6)[0x76598ec6]

/home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(opal_proc_table_remove_all+0x91)[0x77855851]

/home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_ud.so(+0x5e09)[0x73cc0e09]

/home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_ud.so(+0x5952)[0x73cc0952]

/home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(+0x6b032)[0x77b94032]

/home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(mca_base_framework_close+0x7d)[0x7788592d]

/home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_ess_hnp.so(+0x3e4d)[0x75b04e4d]

/home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(orte_finalize+0x79)[0x77b43bf9]

mpirun[0x4014f1]

mpirun[0x401018]

/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7653db15]

mpirun[0x400f29]

and another like

[hpp@hi-master dynamic (master *)]$mpirun -np 1 ./spawn_with_env_vars

Spawning...

Spawned

Child got foo and baz env variables -- yay!

*** Error in `mpirun': corrupted double-linked list: 0x006eb350 ***

=== Backtrace: =

/usr/lib64/libc.so.6(+0x7b184)[0x76597184]

/usr/lib64/libc.so.6(+0x7d1ec)[0x765991ec]

/home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_tcp.so(+0x57a2)[0x732297a2]

/home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_tcp.so(+0x5a87)[0x73229a87]

/home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(+0x6b032)[0x77b94032]

/home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(mca_base_framework_close+0x7d)[0x7788592d]

/home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_ess_hnp.so(+0x3e4d)[0x75b04e4d]

/home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(orte_finalize+0x79)[0x77b43bf9]

mpirun[0x4014f1]

mpirun[0x401018]

/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7653db15]

mpirun[0x400f29]
It doesn't happen on every run though.

I'll do some more investigating, but probably not till next week.

Howard


2017-06-28 11:50 GMT-06:00 Barrett, Brian via devel <
devel@lists.open-mpi.org>:

> The first release candidate of Open MPI 3.0.0 is now available (
> https://www.open-mpi.org/software/ompi/v3.0/).  We expect to have at
> least one more release candidate, as there are still outstanding MPI-layer
> issues to be resolved (particularly around one-sided).  We are posting
> 3.0.0rc1 to get feedback on run-time stability, as one of the big features
> of Open MPI 3.0 is the update to the PMIx 2 runtime environment.  We would
> appreciate any and all testing you can do,  around run-time behaviors.
>
> Thank you,
>
> Brian & Howard
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>

[OMPI devel] libtool guru help needed (Fortran problem)

2017-06-22 Thread Howard Pritchard
Hi Folks,

I'm trying to do some experiments with clang/llvm and its openmp runtime.
To add to this mix, the application I'm wanting to use for testing is
written in F08, so I'm having to also use flang:

https://github.com/flang-compiler/flang

Now when I try to build Open MPI, as long as I disable fortran builds,
things are great, at least as far as building.  But if I try to use flang,
it looks like the libtool that is generated can't figure out what to do
with linker arguments for tag == FC.  So there's nothing for the $wl
variable.  And also none for the $pic_flag variable.  If I manually modify
libtool to define these variables things work for the fortran linking and
building the fortran examples using mpifort.

In the hopes that someone may already know configury magic addition to fix
this, I decided to post here before diving into solving the problem myself.

F08 rocks!

Howard

Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-06-22 Thread Howard Pritchard
Hi Chris

Please go ahead and open a PR for master and I'll open corresponding ones
for the release branches.

Howard

Christoph Niethammer  wrote on Thu, 22 June 2017 at
01:10:

> Hi Howard,
>
> Sorry, missed the new license policy. I added a Sign-off now.
> Shall I open a pull request?
>
> Best
> Christoph
>
> - Original Message -
> From: "Howard Pritchard" 
> To: "Open MPI Developers" 
> Sent: Wednesday, June 21, 2017 5:57:05 PM
> Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary I/O
> files in /tmp
>
> Hi Chris,
>
> Sorry for being a bit picky, but could you add a sign-off to the commit
> message?
> I'm not supposed to manually add it for you.
>
> Thanks,
>
> Howard
>
>
> 2017-06-21 9:45 GMT-06:00 Howard Pritchard < [ mailto:hpprit...@gmail.com
> | hpprit...@gmail.com ] > :
>
>
>
> Hi Chris,
>
> Thanks very much for the patch!
>
> Howard
>
>
> 2017-06-21 9:43 GMT-06:00 Christoph Niethammer < [ mailto:
> nietham...@hlrs.de | nietham...@hlrs.de ] > :
>
>
> Hello Ralph,
>
> Thanks for the update on this issue.
>
> I used the latest master (c38866eb3929339147259a3a46c6fc815720afdb).
>
> The behaviour is still the same: aborting before MPI_File_close leaves
> /tmp/OMPI_*.sm files.
> These are not removed by your updated orte-clean.
>
> I have now tracked down the origin of these files; it seems to be in
> ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:154
> where a leftover TODO note a few lines above also mentions the need
> for a correct directory.
>
> I would suggest updating the path there to be under the
>  directory which is cleaned by
> orte-clean, see
>
> [
> https://github.com/cniethammer/ompi/commit/2aedf6134813299803628e7d6856a3b781542c02
> |
> https://github.com/cniethammer/ompi/commit/2aedf6134813299803628e7d6856a3b781542c02
> ]
>
> Best
> Christoph
>
> - Original Message -
> From: "Ralph Castain" < [ mailto:r...@open-mpi.org | r...@open-mpi.org ] >
> To: "Open MPI Developers" < [ mailto:devel@lists.open-mpi.org |
> devel@lists.open-mpi.org ] >
> Sent: Wednesday, June 21, 2017 4:33:29 AM
> Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary I/O
> files in /tmp
>
> I updated orte-clean in master, and for v3.0, so it cleans up both
> current and legacy session directory files as well as any pmix artifacts. I
> don’t see any files named OMPI_*.sm, though that might be something from
> v2.x? I don’t recall us ever making files of that name before - anything we
> make should be under the session directory, not directly in /tmp.
>
> > On May 9, 2017, at 2:10 AM, Christoph Niethammer < [ mailto:
> nietham...@hlrs.de | nietham...@hlrs.de ] > wrote:
> >
> > Hi,
> >
> > I am using Open MPI 2.1.0.
> >
> > Best
> > Christoph
> >
> > - Original Message -
> > From: "Ralph Castain" < [ mailto:r...@open-mpi.org | r...@open-mpi.org ] >
> > To: "Open MPI Developers" < [ mailto:devel@lists.open-mpi.org |
> devel@lists.open-mpi.org ] >
> > Sent: Monday, May 8, 2017 6:28:42 PM
> > Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary
> I/O files in /tmp
> >
> > What version of OMPI are you using?
> >
> >> On May 8, 2017, at 8:56 AM, Christoph Niethammer < [ mailto:
> nietham...@hlrs.de | nietham...@hlrs.de ] > wrote:
> >>
> >> Hello
> >>
> >> According to the manpage "...orte-clean attempts to clean up any
> processes and files left over from Open MPI jobs that were run in the past
> as well as any currently running jobs. This includes OMPI infrastructure
> and helper commands, any processes that were spawned as part of the job,
> and any temporary files...".
> >>
> >> If I now have a program which calls MPI_File_open, MPI_File_write and
> MPI_Abort() in order, I get left over files /tmp/OMPI_*.sm.
> >> Running orte-clean does not remove them.
> >>
> >> Is this a bug or a feature?
> >>
> >> Best
> >> Christoph Niethammer
> >>
> >> --
> >>
> >> Christoph Niethammer
> >> High Performance Computing Center Stuttgart (HLRS)
> >> Nobelstrasse 19
> >> 70569 Stuttgart
> >>
> >> Tel: [ tel:%2B%2B49%280%29711-685-87203 | ++49(0)711-685-87203 ]
> >> email: [ mailto:nietham...@hlrs.de | nietham...@hlrs.de ]
> >> [ http://www.hlrs.de/people/niethammer |
> http://www.hlrs.de/people/niethammer ]
> >> 
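
For reference, a minimal reproducer-style sketch of the scenario discussed
in this thread (not Christoph's actual program): write to a file and abort
before MPI_File_close, then look for leftover /tmp/OMPI_*.sm files.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    int data = 1;

    MPI_Init(&argc, &argv);
    MPI_File_open(MPI_COMM_WORLD, "leftover.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_write(fh, &data, 1, MPI_INT, MPI_STATUS_IGNORE);

    /* Abort here: MPI_File_close is never reached, so the shared-file-pointer
     * component's backing files under /tmp are never unlinked. */
    MPI_Abort(MPI_COMM_WORLD, 1);

    /* Not reached. */
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}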

Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-06-21 Thread Howard Pritchard
Hi Chris,

Sorry for being a bit picky, but could you add a sign-off to the commit
message?
I'm not supposed to manually add it for you.

Thanks,

Howard


2017-06-21 9:45 GMT-06:00 Howard Pritchard :

> Hi Chris,
>
> Thanks very much for the patch!
>
> Howard
>
>
> 2017-06-21 9:43 GMT-06:00 Christoph Niethammer :
>
>> Hello Ralph,
>>
>> Thanks for the update on this issue.
>>
>> I used the latest master (c38866eb3929339147259a3a46c6fc815720afdb).
>>
>> The behaviour is still the same: aborting before MPI_File_close leaves
>> /tmp/OMPI_*.sm files.
>> These are not removed by your updated orte-clean.
>>
>> I have now tracked down the origin of these files; it seems to be in
>> ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:154
>> where a leftover TODO note a few lines above also mentions the need
>> for a correct directory.
>>
>> I would suggest updating the path there to be under the
>>  directory which is cleaned by
>> orte-clean, see
>>
>>  https://github.com/cniethammer/ompi/commit/2aedf61348132998
>> 03628e7d6856a3b781542c02
>>
>> Best
>> Christoph
>>
>> - Original Message -
>> From: "Ralph Castain" 
>> To: "Open MPI Developers" 
>> Sent: Wednesday, June 21, 2017 4:33:29 AM
>> Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary I/O
>> files in /tmp
>>
>> I updated orte-clean in master, and for v3.0, so it cleans up both
>> current and legacy session directory files as well as any pmix artifacts. I
>> don’t see any files named OMPI_*.sm, though that might be something from
>> v2.x? I don’t recall us ever making files of that name before - anything we
>> make should be under the session directory, not directly in /tmp.
>>
>> > On May 9, 2017, at 2:10 AM, Christoph Niethammer 
>> wrote:
>> >
>> > Hi,
>> >
>> > I am using Open MPI 2.1.0.
>> >
>> > Best
>> > Christoph
>> >
>> > - Original Message -
>> > From: "Ralph Castain" 
>> > To: "Open MPI Developers" 
>> > Sent: Monday, May 8, 2017 6:28:42 PM
>> > Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary
>> I/O files in /tmp
>> >
>> > What version of OMPI are you using?
>> >
>> >> On May 8, 2017, at 8:56 AM, Christoph Niethammer 
>> wrote:
>> >>
>> >> Hello
>> >>
>> >> According to the manpage "...orte-clean attempts to clean up any
>> processes and files left over from Open MPI jobs that were run in the past
>> as well as any currently running jobs. This includes OMPI infrastructure
>> and helper commands, any processes that were spawned as part of the job,
>> and any temporary files...".
>> >>
>> >> If I now have a program which calls MPI_File_open, MPI_File_write and
>> MPI_Abort() in order, I get left over files /tmp/OMPI_*.sm.
>> >> Running orte-clean does not remove them.
>> >>
>> >> Is this a bug or a feature?
>> >>
>> >> Best
>> >> Christoph Niethammer
>> >>
>> >> --
>> >>
>> >> Christoph Niethammer
>> >> High Performance Computing Center Stuttgart (HLRS)
>> >> Nobelstrasse 19
>> >> 70569 Stuttgart
>> >>
>> >> Tel: ++49(0)711-685-87203
>> >> email: nietham...@hlrs.de
>> >> http://www.hlrs.de/people/niethammer
>> >> ___
>> >> devel mailing list
>> >> devel@lists.open-mpi.org
>> >> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>> >
>> > ___
>> > devel mailing list
>> > devel@lists.open-mpi.org
>> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>> > ___
>> > devel mailing list
>> > devel@lists.open-mpi.org
>> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>
>

Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-06-21 Thread Howard Pritchard
Hi Chris,

Thanks very much for the patch!

Howard


2017-06-21 9:43 GMT-06:00 Christoph Niethammer :

> Hello Ralph,
>
> Thanks for the update on this issue.
>
> I used the latest master (c38866eb3929339147259a3a46c6fc815720afdb).
>
> The behaviour is still the same: aborting before MPI_File_close leaves
> /tmp/OMPI_*.sm files.
> These are not removed by your updated orte-clean.
>
> I have now tracked down the origin of these files; it seems to be in
> ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:154
> where a leftover TODO note a few lines above also mentions the need
> for a correct directory.
>
> I would suggest updating the path there to be under the
>  directory which is cleaned by
> orte-clean, see
>
>  https://github.com/cniethammer/ompi/commit/2aedf6134813299803628e7d6856a3
> b781542c02
>
> Best
> Christoph
>
> - Original Message -
> From: "Ralph Castain" 
> To: "Open MPI Developers" 
> Sent: Wednesday, June 21, 2017 4:33:29 AM
> Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary I/O
> files in /tmp
>
> I updated orte-clean in master, and for v3.0, so it cleans up both
> current and legacy session directory files as well as any pmix artifacts. I
> don’t see any files named OMPI_*.sm, though that might be something from
> v2.x? I don’t recall us ever making files of that name before - anything we
> make should be under the session directory, not directly in /tmp.
>
> > On May 9, 2017, at 2:10 AM, Christoph Niethammer 
> wrote:
> >
> > Hi,
> >
> > I am using Open MPI 2.1.0.
> >
> > Best
> > Christoph
> >
> > - Original Message -
> > From: "Ralph Castain" 
> > To: "Open MPI Developers" 
> > Sent: Monday, May 8, 2017 6:28:42 PM
> > Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary
> I/O files in /tmp
> >
> > What version of OMPI are you using?
> >
> >> On May 8, 2017, at 8:56 AM, Christoph Niethammer 
> wrote:
> >>
> >> Hello
> >>
> >> According to the manpage "...orte-clean attempts to clean up any
> processes and files left over from Open MPI jobs that were run in the past
> as well as any currently running jobs. This includes OMPI infrastructure
> and helper commands, any processes that were spawned as part of the job,
> and any temporary files...".
> >>
> >> If I now have a program which calls MPI_File_open, MPI_File_write and
> MPI_Abort() in order, I get left over files /tmp/OMPI_*.sm.
> >> Running orte-clean does not remove them.
> >>
> >> Is this a bug or a feature?
> >>
> >> Best
> >> Christoph Niethammer
> >>
> >> --
> >>
> >> Christoph Niethammer
> >> High Performance Computing Center Stuttgart (HLRS)
> >> Nobelstrasse 19
> >> 70569 Stuttgart
> >>
> >> Tel: ++49(0)711-685-87203
> >> email: nietham...@hlrs.de
> >> http://www.hlrs.de/people/niethammer
> >> ___
> >> devel mailing list
> >> devel@lists.open-mpi.org
> >> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> >
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] SLURM 17.02 support

2017-06-19 Thread Howard Pritchard
Hi Ralph

I think the alternative you mention below should suffice.

Howard
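
(For reference, the configure step Ralph describes below would look roughly
like this sketch; the install prefixes are placeholders, and /opt/slurm is an
assumed SLURM install location.)

# build Open MPI with SLURM PMI support so "srun ./hello" works
./configure --prefix=/opt/openmpi --with-slurm --with-pmi=/opt/slurm
make -j 8 && make install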

r...@open-mpi.org  wrote on Mon., June 19, 2017 at 07:24:

> So what you guys want is for me to detect that no opal/pmix framework
> components could run, detect that we are in a slurm job, and so print out
> an error message saying “hey dummy - you didn’t configure us with slurm pmi
> support”?
>
> It means embedding slurm job detection code in the heart of ORTE (as
> opposed to in a component), which bothers me a bit.
>
> As an alternative, what if I print out a generic “you didn’t configure us
> with pmi support for this environment” instead of the “pmix select failed”
> message? I can mention how to configure the support in a general way, but
> it avoids having to embed slurm detection into ORTE outside of a component.
>
> > On Jun 16, 2017, at 8:39 AM, Jeff Squyres (jsquyres) 
> wrote:
> >
> > +1 on the error message.
> >
> >
> >
> >> On Jun 16, 2017, at 10:06 AM, Howard Pritchard 
> wrote:
> >>
> >> Hi Ralph
> >>
> >> I think a helpful  error message would suffice.
> >>
> >> Howard
> >>
>> r...@open-mpi.org  wrote on Tue., June 13, 2017 at
>> 11:15:
> >> Hey folks
> >>
> >> Brian brought this up today on the call, so I spent a little time
> investigating. After installing SLURM 17.02 (with just --prefix as config
> args), I configured OMPI with just --prefix config args. Getting an
> allocation and then executing “srun ./hello” failed, as expected.
> >>
> >> However, configuring OMPI --with-pmi= resolved the
> problem. SLURM continues to default to PMI-1, and so we pick that option up
> and use it. Everything works fine.
> >>
> >> FWIW: I also went back and checked using SLURM 15.08 and got the
> identical behavior.
> >>
> >> So the issue is: we don’t pick up PMI support by default, and never
> have due to the SLURM license issue. Thus, we have always required that the
> user explicitly configure --with-pmi so they take responsibility for the
> license. This is an acknowledged way of avoiding having GPL pull OMPI under
> its umbrella as it is the user, and not the OMPI community, that is making
> the link.
> >>
> >> I’m not sure there is anything we can or should do about this, other
> than perhaps providing a nicer error message. Thoughts?
> >> Ralph
> >>
> >> ___
> >> devel mailing list
> >> devel@lists.open-mpi.org
> >> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> >> ___
> >> devel mailing list
> >> devel@lists.open-mpi.org
> >> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] SLURM 17.02 support

2017-06-16 Thread Howard Pritchard
Hi Ralph

I think a helpful  error message would suffice.

Howard

r...@open-mpi.org  wrote on Tue., June 13, 2017 at 11:15:

> Hey folks
>
> Brian brought this up today on the call, so I spent a little time
> investigating. After installing SLURM 17.02 (with just --prefix as config
> args), I configured OMPI with just --prefix config args. Getting an
> allocation and then executing “srun ./hello” failed, as expected.
>
> However, configuring OMPI --with-pmi= resolved the problem.
> SLURM continues to default to PMI-1, and so we pick that option up and use
> it. Everything works fine.
>
> FWIW: I also went back and checked using SLURM 15.08 and got the identical
> behavior.
>
> So the issue is: we don’t pick up PMI support by default, and never have
> due to the SLURM license issue. Thus, we have always required that the user
> explicitly configure --with-pmi so they take responsibility for the
> license. This is an acknowledged way of avoiding having GPL pull OMPI under
> its umbrella as it is the user, and not the OMPI community, that is making
> the link.
>
> I’m not sure there is anything we can or should do about this, other than
> perhaps providing a nicer error message. Thoughts?
> Ralph
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Time to remove Travis?

2017-06-01 Thread Howard Pritchard
I vote for removal too.

Howard
r...@open-mpi.org  wrote on Thu., June 1, 2017 at 08:10:

> I’d vote to remove it - it’s too unreliable anyway
>
> > On Jun 1, 2017, at 6:30 AM, Jeff Squyres (jsquyres) 
> wrote:
> >
> > Is it time to remove Travis?
> >
> > I believe that the Open MPI PRB now covers all the modern platforms that
> Travis covers, and we have people actively maintaining all of the machines
> / configurations being used for CI.
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] Open MPI v2.0.3rc1 available for testing

2017-05-26 Thread Howard Pritchard
Hi Folks,

Open MPI v2.0.3rc1 tarballs are available on the download site for testing:

https://www.open-mpi.org/software/ompi/v2.0/

Fixes in this bug fix release include:

 - Fix a problem with MPI_IALLTOALLW when zero size messages are present.
   Thanks to @mathbird for reporting.
- Add missing MPI_USER_FUNCTION definition to the mpi_f08 module.
  Thanks to Nathan Weeks for reporting this issue.
- Fix a problem with MPI_WIN_LOCK not returning an error code when
   a negative rank is supplied.  Thanks to Jeff Hammond for reporting and
   providing a fix.
- Fix a problem with make check that could lead to hangs.  Thanks to
  Nicolas Morey-Chaisemartin for reporting.
- Resolve a symbol conflict problem with PMI-1 and PMI-2 PMIx components.
   Thanks to Kilian Cavalotti for reporting this issue.
- Insure that memory allocations returned from MPI_WIN_ALLOCATE_SHARED are
   64 byte aligned.  Thanks to Joseph Schuchart for reporting this issue.
- Make use of DOUBLE_COMPLEX, if available, for Fortran bindings.  Thanks
  to Alexander Klein for reporting this issue.
- Add missing MPI_T_PVAR_SESSION_NULL definition to Open MPI mpi.h include
  file.  Thanks to Omri Mor for reporting and fixing.
- Fix a problem with use of MPI shared file pointers when accessing
   a file from independent jobs.  Thanks to Nicolas Joly for reporting
   this issue.
- Optimize zero size MPI_IALLTOALL{V,W} with MPI_IN_PLACE.  Thanks to
   Lisandro Dalcin for the report.
- Fix a ROMIO buffer overflow problem for large transfers when using NFS
   filesystems.
- Fix type of MPI_ARGV[S]_NULL which prevented it from being used
  properly with MPI_COMM_SPAWN[_MULTIPLE] in the mpi_f08 module.
- Ensure to add proper linker flags to the wrapper compilers for
   dynamic libraries on platforms that need it (e.g., RHEL 7.3 and
   later).
- Get better performance on TCP-based networks 10Gbps and higher by
   using OS defaults for buffer sizing.
- Fix a bug with MPI_[R][GET_]ACCUMULATE when using DARRAY datatypes.
 - Fix handling of --with-lustre configure command line argument.
   Thanks to Prentice Bisbal and Tim Mattox for reporting the issue.
- Added MPI_AINT_ADD and MPI_AINT_DIFF declarations to mpif.h.  Thanks
   to Aboorva Devarajan (@AboorvaDevarajan) for the bug report.
- Fix a problem in the TCP BTL when Open MPI is initialized with
   MPI_THREAD_MULTIPLE support.  Thanks to Evgueni Petro for analyzing and
   reporting this issue.
- Fix yalla PML to properly handle underflow errors, and fixed a
   memory leak with blocking non-contiguous sends.
- Restored ability to run autogen.pl on official distribution tarballs
   (although this is still not recommended for most users!).
- Fix accuracy problems with MPI_WTIME on some systems by always using
   either clock_gettime(3) or gettimeofday(3).
- Fix a problem where MPI_WTICK was not returning a higher time resolution
   when available.  Thanks to Mark Dixon for reporting this issue.
- Restore SGE functionality.  Thanks to Kevin Buckley for the initial
   report.
- Fix external hwloc compilation issues, and extend support to allow
  using external hwloc installations as far back as v1.5.0.  Thanks to
   Orion Poplawski for raising the issue.
- Added latest Mellanox Connect-X and Chelsio T-6 adapter part IDs to
   the openib list of default values.
- Do a better job of cleaning up session directories (e.g., in /tmp).
- Update a help message to indicate how to suppress a warning about
   no high performance networks being detected by Open MPI.  Thanks to
   Susan Schwarz for reporting this issue.
- Fix a problem with mangling of custom CFLAGS when configuring Open MPI.
  Thanks to Phil Tooley for reporting.
- Fix some minor memory leaks and remove some unused variables.
   Thanks to Joshua Gerrard for reporting.
- Fix MPI_ALLGATHERV bug with MPI_IN_PLACE.

Thanks,

Howard
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] Open MPI 2.1.1rc1 is up

2017-04-27 Thread Howard Pritchard
Hi Open MPI developers,

Open MPI 2.1.1rc1 is available for testing at the usual place:

https://www.open-mpi.org/software/ompi/v2.1/

Bug fixes in this release:

- Add missing MPI_AINT_ADD/MPI_AINT_DIFF function definitions to mpif.h.
  Thanks to Aboorva Devarajan for reporting.

- Fix the error return from MPI_WIN_LOCK when rank argument is invalid.
  Thanks to Jeff Hammond for reporting and fixing this issue.

- Fix a problem with mpirun/orterun when started under a debugger. Thanks
  to Gregory Leff for reporting.

- Add configury option to disable use of CMA by the vader BTL.  Thanks
   to Sascha Hunold for reporting.

- Add configury check for MPI_DOUBLE_COMPLEX datatype support.
   Thanks to Alexander Klein for reporting.

- Fix memory allocated by MPI_WIN_ALLOCATE_SHARED to
   be 64 byte aligned.  Thanks to Joseph Schuchart for reporting.

- Update MPI_WTICK man page to reflect possibly higher
  resolution than 10e-6.  Thanks to Mark Dixon for
  reporting

- Add missing MPI_T_PVAR_SESSION_NULL definition to mpi.h
  include file.  Thanks to Omri Mor for this contribution.

- Enhance the Open MPI spec file to install modulefile in /opt
  if installed in a non-default location.  Thanks to Kevin
  Buckley for reporting and supplying a fix.

- Fix a problem with conflicting PMI symbols when linking statically.
   Thanks to Kilian Cavalotti for reporting.

Please try it out if you have time.

Thanks,

Howard and Jeff
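
(A minimal test recipe, in case it helps anyone get started; the prefix is a
placeholder and the tarball name assumes the usual openmpi-<version>.tar.bz2
convention.)

# after downloading the rc tarball from the URL above
tar xjf openmpi-2.1.1rc1.tar.bz2
cd openmpi-2.1.1rc1
./configure --prefix=$HOME/ompi-2.1.1rc1
make -j 8 && make check && make install
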
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] Open MPI v2.1.1 release reminder - public service announcement

2017-04-19 Thread Howard Pritchard
Hi Folks,

Reminder that we are planning to do a v2.1.1 bug fix release next Tuesday
(4/25/17), as discussed in yesterday's con-call.

If you have bug fixes you'd like to get into v2.1.1, please open PRs this
week so there will be time for review and testing in MTT.

Thanks,

Howard and Jeff
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] OS-X specific jenkins/PR retest

2017-04-07 Thread Howard Pritchard
Hi Folks,

I added an OS-X specific bot retest command for jenkins CI:

bot:osx:retest

Also added a blurb to the related wiki page:

https://github.com/open-mpi/ompi/wiki/PRJenkins

Hope this helps folks who encounter os-x specific problems with
their PRs.

Howard
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Pull request: LANL-XXX tests failing

2017-03-30 Thread Howard Pritchard
Well, not sure what's going on.  There was an upgrade of Jenkins and a bunch
of functionality seems to have gotten lost.


2017-03-30 9:37 GMT-06:00 Howard Pritchard :

> Actually it looks like we're running out of disk space at AWS.
>
>
> 2017-03-30 9:28 GMT-06:00 r...@open-mpi.org :
>
>> You didn’t do anything wrong - the Jenkins test server at LANL is having
>> a problem.
>>
>> On Mar 30, 2017, at 8:22 AM, DERBEY, NADIA  wrote:
>>
>> Hi,
>>
>> I just created a pull request and I have a failure in 4/8 checks, all
>> related to LANL:
>>
>> LANL-CI
>> LANL-OS-X
>> LANL-disable-dlopen
>> LANL-distcheck
>>
>> When I ask for the failure details, I respectively get the following
>> messages:
>> ==
>>
>> ...
>> libtoolize: copying file `config/lt~obsolete.m4'
>> Build was aborted
>> ERROR: Step ‘Set build status on GitHub commit [deprecated]’ failed: no
>> workspace for ompi_master_PR_CLE #1447
>> Setting status of b6de94e4490b91aaa6c5df70954fbabb8794528b to FAILURE with
>> url https://jenkins.open-mpi.org/jenkins/job/ompi_master_PR_CLE/1447/
>> and message: ' '
>> Using context: LANL-CI
>> Finished: ABORTED
>>
>> ==
>>
>> ...
>> config.status: creating opal/mca/mpool/hugepage/Makefile
>> Build was aborted
>> ERROR: Step ‘Set build status on GitHub commit [deprecated]’ failed: no 
>> workspace for ompi_pr_osx #557
>> config.status: creating opal/mca/mpool/memkind/Makefile
>> Setting status of b6de94e4490b91aaa6c5df70954fbabb8794528b to FAILURE with 
>> url https://jenkins.open-mpi.org/jenkins/job/ompi_pr_osx/557/ and message: ' 
>> '
>> Using context: LANL-OS-X
>> Finished: ABORTED
>>
>> ==
>>
>> ...
>> parallel-tests: installing 'config/test-driver'
>> Build was aborted
>> ERROR: Step ‘Set build status on GitHub commit [deprecated]’ failed: no 
>> workspace for ompi_master_pr_disable_dlopen #1464
>> Setting status of b6de94e4490b91aaa6c5df70954fbabb8794528b to FAILURE with
>> url https://jenkins.open-mpi.org/jenkins/job/ompi_master_pr_disable_dlopen/1464/
>> and message: ' '
>> Using context: LANL-disable-dlopen
>> Finished: ABORTED
>>
>> ==
>>
>> ...
>> parallel-tests: installing 'config/test-driver'
>> Build was aborted
>> ERROR: Step ‘Set build status on GitHub commit [deprecated]’ failed: no 
>> workspace for ompi_master_pr_distcheck #1457
>> Setting status of b6de94e4490b91aaa6c5df70954fbabb8794528b to FAILURE with
>> url https://jenkins.open-mpi.org/jenkins/job/ompi_master_pr_distcheck/1457/
>> and message: ' '
>> Using context: LANL-distcheck
>> Finished: ABORTED
>>
>> ==
>>
>> Can somebody tell me what I did wrong? (Compilation went fine
>> on my side.)
>>
>> Regards,
>>
>> --
>> Nadia Derbey - B1-387
>> HPC R&D - MPI
>> Tel: +33 4 76 29 77 62 <+33%204%2076%2029%2077%2062>nadia.der...@atos.net
>> 1 Rue de Provence BP 208
>> 38130 Echirolles Cedex, Francewww.atos.com
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>>
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Pull request: LANL-XXX tests failing

2017-03-30 Thread Howard Pritchard
Actually it looks like we're running out of disk space at AWS.


2017-03-30 9:28 GMT-06:00 r...@open-mpi.org :

> You didn’t do anything wrong - the Jenkins test server at LANL is having a
> problem.
>
> On Mar 30, 2017, at 8:22 AM, DERBEY, NADIA  wrote:
>
> Hi,
>
> I just created a pull request and I have a failure in 4/8 checks, all
> related to LANL:
>
> LANL-CI
> LANL-OS-X
> LANL-disable-dlopen
> LANL-distcheck
>
> When I ask for the failure details, I respectively get the following
> messages:
> ==
>
> ...
> libtoolize: copying file `config/lt~obsolete.m4'
> Build was aborted
> ERROR: Step ‘Set build status on GitHub commit [deprecated]’ failed: no
> workspace for ompi_master_PR_CLE #1447
> Setting status of b6de94e4490b91aaa6c5df70954fbabb8794528b to FAILURE with
> url https://jenkins.open-mpi.org/jenkins/job/ompi_master_PR_CLE/1447/
> and message: ' '
> Using context: LANL-CI
> Finished: ABORTED
>
> ==
>
> ...
> config.status: creating opal/mca/mpool/hugepage/Makefile
> Build was aborted
> ERROR: Step ‘Set build status on GitHub commit [deprecated]’ failed: no 
> workspace for ompi_pr_osx #557
> config.status: creating opal/mca/mpool/memkind/Makefile
> Setting status of b6de94e4490b91aaa6c5df70954fbabb8794528b to FAILURE with 
> url https://jenkins.open-mpi.org/jenkins/job/ompi_pr_osx/557/ and message: ' '
> Using context: LANL-OS-X
> Finished: ABORTED
>
> ==
>
> ...
> parallel-tests: installing 'config/test-driver'
> Build was aborted
> ERROR: Step ‘Set build status on GitHub commit [deprecated]’ failed: no 
> workspace for ompi_master_pr_disable_dlopen #1464
> Setting status of b6de94e4490b91aaa6c5df70954fbabb8794528b to FAILURE with
> url https://jenkins.open-mpi.org/jenkins/job/ompi_master_pr_disable_dlopen/1464/
> and message: ' '
> Using context: LANL-disable-dlopen
> Finished: ABORTED
>
> ==
>
> ...
> parallel-tests: installing 'config/test-driver'
> Build was aborted
> ERROR: Step ‘Set build status on GitHub commit [deprecated]’ failed: no 
> workspace for ompi_master_pr_distcheck #1457
> Setting status of b6de94e4490b91aaa6c5df70954fbabb8794528b to FAILURE with
> url https://jenkins.open-mpi.org/jenkins/job/ompi_master_pr_distcheck/1457/
> and message: ' '
> Using context: LANL-distcheck
> Finished: ABORTED
>
> ==
>
> Can somebody tell me what I did wrong? (Compilation went fine on
> my side.)
>
> Regards,
>
> --
> Nadia Derbey - B1-387
> HPC R&D - MPI
> Tel: +33 4 76 29 77 62 <+33%204%2076%2029%2077%2062>nadia.der...@atos.net
> 1 Rue de Provence BP 208
> 38130 Echirolles Cedex, Francewww.atos.com
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Segfault during a free in reduce_scatter using basic component

2017-03-28 Thread Howard Pritchard
Hello Emmanuel,

Which version of Open MPI are you using?

Howard
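
(Side note for anyone else trying to reproduce: the exports in the report
below can equivalently be passed on the mpirun command line; this is the same
component selection expressed with --mca flags.)

mpirun --mca mtl ^portals4 --mca btl ^portals4 \
       --mca coll basic,libnbc,self,tuned \
       --mca osc ^portals4 --mca pml ob1 \
       -n 97 osu_reduce_scatter -m 524288: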


2017-03-28 3:38 GMT-06:00 BRELLE, EMMANUEL :

> Hi,
>
> We are working on a portals4 component and we have found a bug (causing
> a segmentation fault) which must be related to the coll/basic component.
> Due to a lack of time, we cannot investigate further, but this seems to be
> caused by a "free(disps);" (around line 300 in coll_basic_reduce_scatter)
> in some specific situations. In our case it happens on
> osu_reduce_scatter (from the OSU microbenchmarks) with at least 97 procs
> for sizes bigger than 512 KB
>
> Steps to reproduce:
> export OMPI_MCA_mtl=^portals4
> export OMPI_MCA_btl=^portals4
> export OMPI_MCA_coll=basic,libnbc,self,tuned
> export OMPI_MCA_osc=^portals4
> export OMPI_MCA_pml=ob1
> mpirun -n 97 osu_reduce_scatter -m 524288:
>
> (reducing the number of iterations with -i 1 -x 0 should keep the bug)
> Our git branch is based on the v2.x branch and the files differ almost
> only on portals4 parts.
>
> Could someone confirm this bug ?
>
> Emmanuel BRELLE
>
>
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] [2.1.0rc2] stupid run failure on Mac OS X Sierra

2017-03-07 Thread Howard Pritchard
Hi Paul

There is an entry (number 8) under the OS-X FAQ which describes this problem.

Adding the max allowable length to the message is a good idea.

Howard
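
(Workaround sketch until the message is improved; per the quoted error text,
TMPDIR just needs to point somewhere short enough for the Unix domain socket
path, and /tmp is only an example.)

export TMPDIR=/tmp
mpirun -mca btl sm,self -np 2 examples/ring_c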

Paul Hargrove  wrote on Tue., March 7, 2017 at 08:04:

> The following is fairly annoying (though I understand the problem is real):
>
> $ [full-path-to]/mpirun -mca btl sm,self -np 2 examples/ring_c
> PMIx has detected a temporary directory name that results
> in a path that is too long for the Unix domain socket:
>
> Temp dir:
> /var/folders/mg/q0_5yv791yz65cdnbglcqjvcgp/T/openmpi-sessions-502@anlextwls026-173_0
> /53422
>
> Try setting your TMPDIR environmental variable to point to
> something shorter in length
>
> Of course this comes from the fact that something outside my control has
> set TMPDIR to a session-specific directory (same value as $XDG_RUNTIME_DIR)
>TMPDIR=/var/folders/mg/q0_5yv791yz65cdnbglcqjvcgp/T/
>
> I am just reporting this for three minor reasons
> 1) Just in case nobody was aware of this problem
> 2) To request that an FAQ entry related to this be added
> 3) Yes, the message is clear, but it could be improved by indicating the
> allowable length of $TMPDIR
>
> -Paul
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] No Preset Parameters found

2017-02-20 Thread Howard Pritchard
Hello Amit

which version of Open MPI are you using?

Howard

--

sent from my smart phone so no good typing.

Howard
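
(In the meantime, the quoted warning below already names the knob that
silences it; a sketch, where the process count and application name are
placeholders:)

mpirun --mca btl_openib_warn_no_device_params_found 0 -np 4 ./your_app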

On Feb 20, 2017 12:09 PM, "Kumar, Amit"  wrote:

> Dear OpenMPI,
>
>
>
> Wondering what preset parameters this warning is indicating?
>
>
>
> Thank you,
>
> Amit
>
>
>
> WARNING: No preset parameters were found for the device that Open MPI
>
> detected:
>
>
>
>   Local host:n001
>
>   Device name:   mlx5_0
>
>   Device vendor ID:  0x02c9
>
>   Device vendor part ID: 4119
>
>
>
> Default device parameters will be used, which may result in lower
>
> performance.  You can edit any of the files specified by the
>
> btl_openib_device_param_files MCA parameter to set values for your
>
> device.
>
>
>
> NOTE: You can turn off this warning by setting the MCA parameter
>
>   btl_openib_warn_no_device_params_found to 0.
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] [2.0.2rc4] "make install" failure on NetBSD/i386 (libtool?)

2017-01-28 Thread Howard Pritchard
Hi Paul,

This might be a result of building the tarball on a new system.
Would you mind trying the rc3 tarball and seeing if that builds on the
system?


Howard



2017-01-27 15:12 GMT-07:00 Paul Hargrove :

> I had no problem with 2.0.2rc3 on NetBSD, but with 2.0.2rc4 I am seeing a
> "make install" failure (below).
> This is seen on an x86 (32-bit) platform, but not x86_64.
> I cannot say for certain that this is an Open MPI regression, since there
> *have* been s/w updates on this system since I last tested.
>
> Configured with only --prefix and --disable-mpi-fortran (due to
> https://github.com/open-mpi/ompi/issues/184)
>
> -Paul
>
> $ env LANG=C make install
> [...]
> Making install in mca/btl/sm
>  
> /home/phargrov/OMPI/openmpi-2.0.2rc4-netbsd7-i386/openmpi-2.0.2rc4/config/install-sh
> -c -d '/home/phargrov/OMPI/openmpi-2.0.2rc4-netbsd7-i386/INST/
> share/openmpi' /usr/bin/install -c -m 644 /home/phargrov/OMPI/openmpi-2.
> 0.2rc4-netbsd7-i386/openmpi-2.0.2rc4/opal/mca/btl/sm/help-mpi-btl-sm.txt
> '/home/phargrov/OMPI/openmpi-2.0.2rc4-netbsd7-i386/INST/share/openmpi'
>  
> /home/phargrov/OMPI/openmpi-2.0.2rc4-netbsd7-i386/openmpi-2.0.2rc4/config/install-sh
> -c -d '/home/phargrov/OMPI/openmpi-2.0.2rc4-netbsd7-i386/INST/lib/openmpi'
>  /bin/sh ../../../../libtool   --mode=install /usr/bin/install -c
> mca_btl_sm.la '/home/phargrov/OMPI/openmpi-2.0.2rc4-netbsd7-i386/INST/
> lib/openmpi'
> libtool: warning: relinking 'mca_btl_sm.la'
> libtool: install: (cd /home/phargrov/OMPI/openmpi-2.
> 0.2rc4-netbsd7-i386/BLD/opal/mca/btl/sm; /bin/sh "/home/ph
> argrov/OMPI/openmpi-2.0.2rc4-netbsd7-i386/BLD/libtool"  --tag CC
> --mode=relink gcc -std=gnu99 -O3 -DNDEBUG -finline-functions
> -fno-strict-aliasing -pthread -module -avoid-version -o mca_btl_sm.la
> -rpath /home/phargrov/OMPI/openmpi-2.0.2rc4-netbsd7-i386/INST/lib/openmpi
> mca_btl_sm_la-btl_sm.lo mca_btl_sm_la-btl_sm_component.lo 
> mca_btl_sm_la-btl_sm_frag.lo
> /home/phargrov/OMPI/openmpi-2.0.2rc4-netbsd7-i386/BLD/opal/
> mca/common/sm/libmca_common_sm.la -lrt -lexecinfo -lm -lutil )
>
> *** Warning: linker path does not have real file for library
> -lmca_common_sm.
> *** I have the capability to make that library automatically link in when
> *** you link to this library.  But I can only do this if you have a
> *** shared version of the library, which you do not appear to have
> *** because I did check the linker path looking for a file starting
> *** with libmca_common_sm and none of the candidates passed a file format
> test
> *** using a regex pattern. Last file checked:
> /home/phargrov/OMPI/openmpi-2.0.2rc4-netbsd7-i386/BLD/opal/mca/c
> ommon/sm/.libs/libmca_common_sm.so.20.0
>
> *** Warning: libtool could not satisfy all declared inter-library
> *** dependencies of module mca_btl_sm.  Therefore, libtool will create
> *** a static module, that should work as long as the dlopening
> *** application is linked with the -dlopen flag.
> libtool: relink: ar cru .libs/mca_btl_sm.a .libs/mca_btl_sm_la-btl_sm.o
> .libs/mca_btl_sm_la-btl_sm_component.o
>  .libs/mca_btl_sm_la-btl_sm_frag.o
> libtool: relink: ranlib .libs/mca_btl_sm.a
> libtool: relink: ( cd ".libs" && rm -f "mca_btl_sm.la" && ln -s "../
> mca_btl_sm.la" "mca_btl_sm.la" )
> libtool: install: /usr/bin/install -c .libs/mca_btl_sm.soT
> /home/phargrov/OMPI/openmpi-2.0.2rc4-netbsd7-i386/I
> NST/lib/openmpi/mca_btl_sm.so
> install: .libs/mca_btl_sm.soT: stat: No such file or directory
> *** Error code 1
>
> Stop.
>
>
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> <(510)%20495-2352>
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> <(510)%20486-6900>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux

2017-01-13 Thread Howard Pritchard
Thanks Siegmar.  I just wanted to confirm you weren't having some other
issue besides the host and slot-list problems.

Howard


2017-01-12 23:50 GMT-07:00 Siegmar Gross <
siegmar.gr...@informatik.hs-fulda.de>:

> Hi Howard and Gilles,
>
> thank you very much for your help. All commands that work for
> Gilles work also on my machine as expected and the commands that
> don't work on his machine don't work on my neither. The first one
> that works with both --slot-list and --host is the following
> command so that it seems that the value depends on the number of
> processes in the remote group.
>
> loki spawn 122 mpirun -np 1 --slot-list 0:0-5,1:0-5 --host loki:3
> spawn_master
>
> Parent process 0 running on loki
>   I create 4 slave processes
>
> Parent process 0: tasks in MPI_COMM_WORLD:1
>   tasks in COMM_CHILD_PROCESSES local group:  1
>   tasks in COMM_CHILD_PROCESSES remote group: 3
>
> Slave process 0 of 3 running on loki
> spawn_slave 0: argv[0]: spawn_slave
> Slave process 1 of 3 running on loki
> spawn_slave 1: argv[0]: spawn_slave
> Slave process 2 of 3 running on loki
> spawn_slave 2: argv[0]: spawn_slave
> loki spawn 123
>
>
> Here is the output from the other commands.
>
> loki spawn 112 mpirun -np 1 spawn_master
>
> Parent process 0 running on loki
>   I create 4 slave processes
>
> Parent process 0: tasks in MPI_COMM_WORLD:1
>   tasks in COMM_CHILD_PROCESSES local group:  1
>   tasks in COMM_CHILD_PROCESSES remote group: 4
>
> Slave process 1 of 4 running on loki
> Slave process 2 of 4 running on loki
> Slave process 3 of 4 running on loki
> Slave process 0 of 4 running on loki
> spawn_slave 3: argv[0]: spawn_slave
> spawn_slave 1: argv[0]: spawn_slave
> spawn_slave 2: argv[0]: spawn_slave
> spawn_slave 0: argv[0]: spawn_slave
> loki spawn 113 mpirun -np 1 --slot-list 0:0-5,1:0-5 spawn_master
>
> Parent process 0 running on loki
>   I create 4 slave processes
>
> Slave process 0 of 4 running on loki
> Slave process 1 of 4 running on loki
> Slave process 2 of 4 running on loki
> spawn_slave 2: argv[0]: spawn_slave
> Slave process 3 of 4 running on loki
> spawn_slave 3: argv[0]: spawn_slave
> spawn_slave 0: argv[0]: spawn_slave
> spawn_slave 1: argv[0]: spawn_slave
> Parent process 0: tasks in MPI_COMM_WORLD:1
>   tasks in COMM_CHILD_PROCESSES local group:  1
>   tasks in COMM_CHILD_PROCESSES remote group: 4
>
> loki spawn 114 mpirun -np 1 --host loki --oversubscribe spawn_master
>
> Parent process 0 running on loki
>   I create 4 slave processes
>
> Slave process 0 of 4 running on loki
> Slave process 1 of 4 running on loki
> Slave process 2 of 4 running on loki
> spawn_slave 2: argv[0]: spawn_slave
> Slave process 3 of 4 running on loki
> spawn_slave 3: argv[0]: spawn_slave
> spawn_slave 0: argv[0]: spawn_slave
> spawn_slave 1: argv[0]: spawn_slave
> Parent process 0: tasks in MPI_COMM_WORLD:1
>   tasks in COMM_CHILD_PROCESSES local group:  1
>   tasks in COMM_CHILD_PROCESSES remote group: 4
>
> loki spawn 115 mpirun -np 1 --slot-list 0:0-5,1:0-5 --host loki:12
> spawn_master
>
> Parent process 0 running on loki
>   I create 4 slave processes
>
> Slave process 0 of 4 running on loki
> Slave process 2 of 4 running on loki
> Slave process 1 of 4 running on loki
> Slave process 3 of 4 running on loki
> Parent process 0: tasks in MPI_COMM_WORLD:1
>   tasks in COMM_CHILD_PROCESSES local group:  1
>   tasks in COMM_CHILD_PROCESSES remote group: 4
>
> spawn_slave 2: argv[0]: spawn_slave
> spawn_slave 0: argv[0]: spawn_slave
> spawn_slave 1: argv[0]: spawn_slave
> spawn_slave 3: argv[0]: spawn_slave
> loki spawn 116 mpirun -np 1 --host loki:12 --slot-list 0:0-5,1:0-5
> spawn_master
>
> Parent process 0 running on loki
>   I create 4 slave processes
>
> Slave process 0 of 4 running on loki
> Slave process 1 of 4 running on loki
> Slave process 2 of 4 running on loki
> spawn_slave 2: argv[0]: spawn_slave
> Slave process 3 of 4 running on loki
> spawn_slave 3: argv[0]: spawn_slave
> spawn_slave 0: argv[0]: spawn_slave
> spawn_slave 1: argv[0]: spawn_slave
> Parent process 0: tasks in MPI_COMM_WORLD:    1
>   tasks in COMM_CHILD_PROCESSES local group:  1
>   tasks in COMM_CHILD_PROCESSES remote group: 4
>
> loki spawn 117
>
>
> Kind regards
>
> Siegmar
>
> Am 12.01.2017 um 22:25 schrieb r...@open-mp

Re: [OMPI devel] Fwd: Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux

2017-01-12 Thread Howard Pritchard
Siegmar,

Could you confirm that if you use one of the mpirun arg lists that works
for Gilles, your test case passes?  Something simple like

mpirun -np 1 ./spawn_master

?

Howard




2017-01-11 18:27 GMT-07:00 Gilles Gouaillardet :

> Ralph,
>
>
> so it seems the root cause is a kind of incompatibility between the --host
> and the --slot-list options
>
>
> on a single node with two six cores sockets,
> this works :
>
> mpirun -np 1 ./spawn_master
> mpirun -np 1 --slot-list 0:0-5,1:0-5 ./spawn_master
> mpirun -np 1 --host motomachi --oversubscribe ./spawn_master
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:12 ./spawn_master
>
>
> this does not work
>
> mpirun -np 1 --host motomachi ./spawn_master # not enough slots available,
> aborts with a user friendly error message
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi ./spawn_master #
> various errors sm_segment_attach() fails, a task crashes
> and this ends up with the following error message
>
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[15519,2],0]) is on host: motomachi
>   Process 2 ([[15519,2],1]) is on host: unknown!
>   BTLs attempted: self tcp
>
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:1 ./spawn_master #
> same error as above
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:2 ./spawn_master #
> same error as above
>
>
> for the record, the following command surprisingly works
>
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:3 --mca btl tcp,self
> ./spawn_master
>
>
>
> bottom line, my guess is that when the user specifies the --slot-list and
> the --host options
> *and* there are no default slot numbers to hosts, we should default to
> using the number
> of slots from the slot list.
> (e.g. in this case, defaults to --host motomachi:12 instead of (i guess)
> --host motomachi:1)
>
>
> /* fwiw, i made
>
> https://github.com/open-mpi/ompi/pull/2715
>
> https://github.com/open-mpi/ompi/pull/2715
>
> but these are not the root cause */
>
>
> Cheers,
>
>
> Gilles
>
>
>
>  Forwarded Message 
> Subject: Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3
> on Linux
> Date: Wed, 11 Jan 2017 20:39:02 +0900
> From: Gilles Gouaillardet 
> 
> Reply-To: Open MPI Users 
> 
> To: Open MPI Users  
>
>
> Siegmar,
>
> Your slot list is correct.
> An invalid slot list for your node would be 0:1-7,1:0-7
>
> /* and since the test requires only 5 tasks, that could even work with
> such an invalid list.
> My vm is single socket with 4 cores, so a 0:0-4 slot list results in an
> unfriendly pmix error */
>
> Bottom line, your test is correct, and there is a bug in v2.0.x that I
> will investigate from tomorrow
>
> Cheers,
>
> Gilles
>
> On Wednesday, January 11, 2017, Siegmar Gross <
> siegmar.gr...@informatik.hs-fulda.de> wrote:
>
>> Hi Gilles,
>>
>> thank you very much for your help. What does incorrect slot list
>> mean? My machine has two 6-core processors so that I specified
>> "--slot-list 0:0-5,1:0-5". Does incorrect mean that it isn't
>> allowed to specify more slots than available, to specify fewer
>> slots than available, or to specify more slots than needed for
>> the processes?
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>> Am 11.01.2017 um 10:04 schrieb Gilles Gouaillardet:
>>
>>> Siegmar,
>>>
>>> I was able to reproduce the issue on my vm
>>> (No need for a real heterogeneous cluster here)
>>>
>>> I will keep digging tomorrow.
>>> Note that if you specify an incorrect slot list, MPI_Comm_spawn fails
>>> with a very unfriendly error message.
>>> Right now, the 4th spawn'ed task crashes, so this is a different issue
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> r...@open-mpi.org wrote:
>>> I think there is some relevant discussion here:
>>> https://github.com/open-mpi/ompi/issues/1569
>>>
>>> It looks like Gilles had (at least at one point) a fix for master when
>>> enable-heterogeneous, but I don’t know if that was committed.
>>>
>>> On Jan 9, 2017, at 8:23 AM, Howard Pritchard >>> <mailto:hpprit..

Re: [OMPI devel] [2.0.2rc3] build failure ppc64/-m32 and bultin-atomics

2017-01-06 Thread Howard Pritchard
Hi Paul,

https://github.com/open-mpi/ompi/issues/2677

It seems we have a bunch of problems with PPC64 atomics and I'd like to
see if we can get at least some of these issues resolved for 2.0.2, so
I've set this as a blocker along with 2610.

Howard
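
(Until that is resolved, a workaround sketch based on Paul's report below:
configure without --enable-builtin-atomics so Open MPI falls back to its own
assembly atomics; the prefix is a placeholder.)

./configure --prefix=$HOME/ompi-inst \
    CFLAGS=-m32 --with-wrapper-cflags=-m32 \
    CXXFLAGS=-m32 --with-wrapper-cxxflags=-m32 \
    --disable-mpi-fortran
make -j 8 && make check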


2017-01-06 9:48 GMT-07:00 Howard Pritchard :

> Hi Paul,
>
> Sorry for the confusion.  This is a different problem.
> I'll open an issue for this one too.
>
> Howard
>
>
>
> 2017-01-06 9:18 GMT-07:00 Howard Pritchard :
>
>> Hi Paul,
>>
>> Thanks for checking this.
>>
>> This problem was previously reported and there's an issue:
>>
>> https://github.com/open-mpi/ompi/issues/2610
>>
>> tracking it.
>>
>> Howard
>>
>>
>> 2017-01-05 21:19 GMT-07:00 Paul Hargrove :
>>
>>> I have a standard Linux/ppc64 system with gcc-4.8.3
>>> I have configured the 2.0.2rc3 tarball with
>>>
>>> --prefix=... --enable-builtin-atomics \
>>> CFLAGS=-m32 --with-wrapper-cflags=-m32 \
>>> CXXFLAGS=-m32 --with-wrapper-cxxflags=-m32 \
>>> FCFLAGS=-m32 --with-wrapper-fcflags=-m32 --disable-mpi-fortran
>>>
>>> (Yes, I know the FCFLAGS are unnecessary).
>>>
>>> I get a "make check" failure:
>>>
>>> make[3]: Entering directory `/home/phargrov/OMPI/openmpi-2
>>> .0.2rc3-linux-ppc32-gcc/BLD/test/asm'
>>>   CC   atomic_barrier.o
>>>   CCLD atomic_barrier
>>>   CC   atomic_barrier_noinline-atomic_barrier_noinline.o
>>>   CCLD atomic_barrier_noinline
>>>   CC   atomic_spinlock.o
>>>   CCLD atomic_spinlock
>>>   CC   atomic_spinlock_noinline-atomic_spinlock_noinline.o
>>>   CCLD atomic_spinlock_noinline
>>>   CC   atomic_math.o
>>>   CCLD atomic_math
>>> atomic_math.o: In function `atomic_math_test':
>>> atomic_math.c:(.text+0x78): undefined reference to
>>> `__sync_add_and_fetch_8'
>>> collect2: error: ld returned 1 exit status
>>> make[3]: *** [atomic_math] Error 1
>>>
>>>
>>> It looks like there is an (incorrect) assumption that 8-byte atomics are
>>> available.
>>> Removing --enable-builtin-atomics resolves this issue.
>>>
>>> -Paul
>>>
>>> --
>>> Paul H. Hargrove  phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department   Tel: +1-510-495-2352
>>> <(510)%20495-2352>
>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>> <(510)%20486-6900>
>>>
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>>
>>
>>
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] [2.0.2rc3] build failure ppc64/-m32 and bultin-atomics

2017-01-06 Thread Howard Pritchard
Hi Paul,

Sorry for the confusion.  This is a different problem.
I'll open an issue for this one too.

Howard



2017-01-06 9:18 GMT-07:00 Howard Pritchard :

> Hi Paul,
>
> Thanks for checking this.
>
> This problem was previously reported and there's an issue:
>
> https://github.com/open-mpi/ompi/issues/2610
>
> tracking it.
>
> Howard
>
>
> 2017-01-05 21:19 GMT-07:00 Paul Hargrove :
>
>> I have a standard Linux/ppc64 system with gcc-4.8.3
>> I have configured the 2.0.2rc3 tarball with
>>
>> --prefix=... --enable-builtin-atomics \
>> CFLAGS=-m32 --with-wrapper-cflags=-m32 \
>> CXXFLAGS=-m32 --with-wrapper-cxxflags=-m32 \
>> FCFLAGS=-m32 --with-wrapper-fcflags=-m32 --disable-mpi-fortran
>>
>> (Yes, I know the FCFLAGS are unnecessary).
>>
>> I get a "make check" failure:
>>
>> make[3]: Entering directory `/home/phargrov/OMPI/openmpi-2
>> .0.2rc3-linux-ppc32-gcc/BLD/test/asm'
>>   CC   atomic_barrier.o
>>   CCLD atomic_barrier
>>   CC   atomic_barrier_noinline-atomic_barrier_noinline.o
>>   CCLD atomic_barrier_noinline
>>   CC   atomic_spinlock.o
>>   CCLD atomic_spinlock
>>   CC   atomic_spinlock_noinline-atomic_spinlock_noinline.o
>>   CCLD atomic_spinlock_noinline
>>   CC   atomic_math.o
>>   CCLD atomic_math
>> atomic_math.o: In function `atomic_math_test':
>> atomic_math.c:(.text+0x78): undefined reference to
>> `__sync_add_and_fetch_8'
>> collect2: error: ld returned 1 exit status
>> make[3]: *** [atomic_math] Error 1
>>
>>
>> It looks like there is an (incorrect) assumption that 8-byte atomics are
>> available.
>> Removing --enable-builtin-atomics resolves this issue.
>>
>> -Paul
>>
>> --
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> <(510)%20495-2352>
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> <(510)%20486-6900>
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] [2.0.2rc3] build failure ppc64/-m32 and bultin-atomics

2017-01-06 Thread Howard Pritchard
Hi Paul,

Thanks for checking this.

This problem was previously reported and there's an issue:

https://github.com/open-mpi/ompi/issues/2610

tracking it.

Howard


2017-01-05 21:19 GMT-07:00 Paul Hargrove :

> I have a standard Linux/ppc64 system with gcc-4.8.3
> I have configured the 2.0.2rc3 tarball with
>
> --prefix=... --enable-builtin-atomics \
> CFLAGS=-m32 --with-wrapper-cflags=-m32 \
> CXXFLAGS=-m32 --with-wrapper-cxxflags=-m32 \
> FCFLAGS=-m32 --with-wrapper-fcflags=-m32 --disable-mpi-fortran
>
> (Yes, I know the FCFLAGS are unnecessary).
>
> I get a "make check" failure:
>
> make[3]: Entering directory `/home/phargrov/OMPI/openmpi-
> 2.0.2rc3-linux-ppc32-gcc/BLD/test/asm'
>   CC   atomic_barrier.o
>   CCLD atomic_barrier
>   CC   atomic_barrier_noinline-atomic_barrier_noinline.o
>   CCLD atomic_barrier_noinline
>   CC   atomic_spinlock.o
>   CCLD atomic_spinlock
>   CC   atomic_spinlock_noinline-atomic_spinlock_noinline.o
>   CCLD atomic_spinlock_noinline
>   CC   atomic_math.o
>   CCLD atomic_math
> atomic_math.o: In function `atomic_math_test':
> atomic_math.c:(.text+0x78): undefined reference to `__sync_add_and_fetch_8'
> collect2: error: ld returned 1 exit status
> make[3]: *** [atomic_math] Error 1
>
>
> It looks like there is an (incorrect) assumption that 8-byte atomics are
> available.
> Removing --enable-builtin-atomics resolves this issue.
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> <(510)%20495-2352>
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> <(510)%20486-6900>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] [2.0.2rc2] opal_fifo hang w/ --enable-osx-builtin-atomics

2017-01-05 Thread Howard Pritchard
Hi Paul,

I opened issue 2666  to track
this.

Howard

2017-01-05 0:23 GMT-07:00 Paul Hargrove :

> On Macs running Yosemite (OS X 10.10 w/ Xcode 7.1) and El Capitan (OS X
> 10.11 w/ Xcode 8.1) I have configured with
> CC=cc CXX=c++ FC=/sw/bin/gfortran --prefix=...
> --enable-osx-builtin-atomics
>
> Upon running "make check", the test "opal_fifo" hangs on both systems.
> Without the --enable-osx-builtin-atomics things are fine.
>
> I don't have data for Sierra (10.12).
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> <(510)%20495-2352>
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> <(510)%20486-6900>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] rdmacm and udcm for 2.0.1 and RoCE

2017-01-05 Thread Howard Pritchard
Hi Dave,

Sorry for the delayed response.

Anyway, you have to use rdmacm for connection management when using RoCE.
However, with 2.0.1 and later, you have to specify per-peer QP info manually
on the mpirun command line.

Could you try rerunning with

mpirun --mca btl_openib_receive_queues P,128,64,32,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32 (all the rest of the command line args)

and see if it then works?

Howard
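
(The same setting can also be exported as an environment variable if that is
easier to wire into a job script; a sketch, with the application line being a
placeholder and the value kept on a single line:)

export OMPI_MCA_btl_openib_receive_queues=P,128,64,32,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32
mpirun -np 2 ./your_app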


2017-01-04 16:37 GMT-07:00 Dave Turner :

> --
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
>
>   Local host:   elf22
>   Local device: mlx4_2
>   Local port:   1
>   CPCs attempted:   rdmacm, udcm
> --
>
> I posted this to the user list but got no answer so I'm reposting to
> the devel list.
>
> We recently upgraded to OpenMPI 2.0.1.  Everything works fine
> on our QDR connections but we get the error above for our
> 40 GbE connections running RoCE.  I traced through the code and
> it looks like udcm cannot be used with RoCE.  I've also read that
> there are currently some problems with rdmacm under 2.0.1, which
> would mean 2.0.1 does not currently work on RoCE.  We've tested
> 10.4 using rdmacm and that works fine so I don't think we have anything
> configured wrong on the RoCE side.
>  Could someone please verify whether this information is correct that
> RoCE requires rdmacm only and not udcm, and that rdmacm is currently
> not working?  If so, is it being worked on?
>
>  Dave
>
>
> --
> Work: davetur...@ksu.edu (785) 532-7791
>  2219 Engineering Hall, Manhattan KS  66506
> Home:drdavetur...@gmail.com
>   cell: (785) 770-5929
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] [2.0.2rc2] FreeBSD-11 run failure

2017-01-05 Thread Howard Pritchard
Hi Paul,

I opened

https://github.com/open-mpi/ompi/issues/2665

to track this.

Thanks for reporting this.

Howard
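
(Until the issue is fixed, a workaround sketch based on your observation below
that --disable-dlopen avoids the failure; the prefix is a placeholder and the
compilers are copied from your configure line.)

./configure --prefix=$HOME/ompi-inst CC=clang CXX=clang++ \
    --disable-mpi-fortran --disable-dlopen
make -j 8 && make install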



2017-01-04 14:43 GMT-07:00 Paul Hargrove :

> With the 2.0.2rc2 tarball on FreeBSD-11 (i386 or amd64) I am configuring
> with:
>  --prefix=... CC=clang CXX=clang++ --disable-mpi-fortran
>
> I get a failure running ring_c:
>
> mpirun -mca btl sm,self -np 2 examples/ring_c'
> --
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   opal_shmem_base_select failed
>   --> Returned value -1 instead of OPAL_SUCCESS
> --
> + exit 1
>
> When I configure with either "--disable-dlopen" OR "--enable-static
> --disable-shared" the problem vanishes.
> So, I suspect a dlopen-related issue.
>
> I will
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> <(510)%20495-2352>
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> <(510)%20486-6900>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] Open MPI 2.0.2rc2 is up

2016-12-23 Thread Howard Pritchard
Just in time for the holidays, an rc2 for the v2.0.x series is available.

https://www.open-mpi.org/software/ompi/v2.0

Changes since rc1 are:

- Fix a problem with building Open MPI against an external hwloc installation.
  Thanks to Orion Poplawski for reporting this issue.
- Remove use of DATE in the message queue version string reported to debuggers
  to insure bit-wise reproducibility of binaries.  Thanks to Alastair McKinstry
  for help in fixing this problem.
- Fix some problems found with MPI RMA over MPI point-to-point.
- Address a Xeon processor specific performance regression.

Thanks,

Howard

-- 

Howard Pritchard
HPC-DES
Los Alamos National Laboratory
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Open MPI v2.0.2rc1 is up

2016-12-20 Thread Howard Pritchard
Hi Orion,

Opened issue 2610 <https://github.com/open-mpi/ompi/issues/2610>.

Thanks,

Howard


2016-12-20 11:27 GMT-07:00 Howard Pritchard :

> Hi Orion,
>
> Thanks for trying out the rc.  Which compiler/version of compiler are you
> using?
>
> Howard
>
>
> 2016-12-20 10:50 GMT-07:00 Orion Poplawski :
>
>> On 12/14/2016 07:58 PM, Jeff Squyres (jsquyres) wrote:
>> > Please test!
>> >
>> > https://www.open-mpi.org/software/ompi/v2.0/
>>
>> I appear to e getting a new test failure on ppc64le on Fedora Rawhide:
>>
>> make[4]: Entering directory '/builddir/build/BUILD/openmpi
>> -2.0.2rc1/test/asm'
>> basename: extra operand '--test-name'
>> Try 'basename --help' for more information.
>> --> Testing
>> PASS: atomic_barrier
>> - 1 threads: Passed
>> PASS: atomic_barrier
>> - 2 threads: Passed
>> PASS: atomic_barrier
>> - 4 threads: Passed
>> PASS: atomic_barrier
>> - 5 threads: Passed
>> PASS: atomic_barrier
>> - 8 threads: Passed
>> basename: extra operand '--test-name'
>> Try 'basename --help' for more information.
>> --> Testing
>> PASS: atomic_barrier_noinline
>> - 1 threads: Passed
>> PASS: atomic_barrier_noinline
>> - 2 threads: Passed
>> PASS: atomic_barrier_noinline
>> - 4 threads: Passed
>> PASS: atomic_barrier_noinline
>> - 5 threads: Passed
>> PASS: atomic_barrier_noinline
>> - 8 threads: Passed
>> basename: extra operand '--test-name'
>> Try 'basename --help' for more information.
>> --> Testing
>> PASS: atomic_spinlock
>> - 1 threads: Passed
>> PASS: atomic_spinlock
>> - 2 threads: Passed
>> PASS: atomic_spinlock
>> - 4 threads: Passed
>> PASS: atomic_spinlock
>> - 5 threads: Passed
>> PASS: atomic_spinlock
>> - 8 threads: Passed
>> basename: extra operand '--test-name'
>> Try 'basename --help' for more information.
>> --> Testing
>> PASS: atomic_spinlock_noinline
>> - 1 threads: Passed
>> PASS: atomic_spinlock_noinline
>> - 2 threads: Passed
>> PASS: atomic_spinlock_noinline
>> - 4 threads: Passed
>> PASS: atomic_spinlock_noinline
>> - 5 threads: Passed
>> PASS: atomic_spinlock_noinline
>> - 8 threads: Passed
>> basename: extra operand '--test-name'
>> Try 'basename --help' for more information.
>> --> Testing
>> PASS: atomic_math
>> - 1 threads: Passed
>> PASS: atomic_math
>> - 2 threads: Passed
>> PASS: atomic_math
>> - 4 threads: Passed
>> PASS: atomic_math
>> - 5 threads: Passed
>> PASS: atomic_math
>> - 8 threads: Passed
>> basename: extra operand '--test-name'
>> Try 'basename --help' for more information.
>> --> Testing
>> PASS: atomic_math_noinline
>> - 1 threads: Passed
>> PASS: atomic_math_noinline
>> - 2 threads: Passed
>> PASS: atomic_math_noinline
>> - 4 threads: Passed
>> PASS: atomic_math_noinline
>> - 5 threads: Passed
>> PASS: atomic_math_noinline
>> - 8 threads: Passed
>> basename: extra operand '--test-name'
>> Try 'basename --help' for more information.
>> --> Testing
>> FAIL: atomic_cmpset
>> ../../config/test-driver: line 107: 23509 Aborted (core
>> dumped) "$@" > $log_file 2>&1
>> - 1 threads: Passed
>> ../../config/test-driver: line 107: 23513 Aborted (core
>> dumped) "$@" > $log_file 2>&1
>> FAIL: atomic_cmpset
>> - 2 threads: Passed
>> ../../config/test-driver: line 107: 23520 Aborted (core
>> dumped) "$@" > $log_file 2>&1
>> FAIL: atomic_cmpset
>> - 4 threads: Passed
>> ../../config/test-driver: line 107: 23528 Aborted (core
>> dumped) "$@" > $log_file 2>&1
>> FAIL: atomic_cmpset
>> - 5 threads: Passed
>> FAIL: atomic_cmpset
>> ../../config/test-driver: line 107: 23538 Aborted (core
>> dumped) "$@" > $log_file 2>&1
>> - 8 threads: Passed
>> basename: extra operand '--test-name'
>> Try 'basename --help' for more information.
>> --> Testing
>> FAIL: atomic_cmpset_noinline
>> - 1 threads: Passed
>> ../../config/test-driver: line 1

Re: [OMPI devel] Open MPI v2.0.2rc1 is up

2016-12-20 Thread Howard Pritchard
Hi Orion,

Thanks for trying out the rc.  Which compiler/version of compiler are you
using?

Howard


2016-12-20 10:50 GMT-07:00 Orion Poplawski :

> On 12/14/2016 07:58 PM, Jeff Squyres (jsquyres) wrote:
> > Please test!
> >
> > https://www.open-mpi.org/software/ompi/v2.0/
>
> I appear to e getting a new test failure on ppc64le on Fedora Rawhide:
>
> make[4]: Entering directory '/builddir/build/BUILD/
> openmpi-2.0.2rc1/test/asm'
> basename: extra operand '--test-name'
> Try 'basename --help' for more information.
> --> Testing
> PASS: atomic_barrier
> - 1 threads: Passed
> PASS: atomic_barrier
> - 2 threads: Passed
> PASS: atomic_barrier
> - 4 threads: Passed
> PASS: atomic_barrier
> - 5 threads: Passed
> PASS: atomic_barrier
> - 8 threads: Passed
> basename: extra operand '--test-name'
> Try 'basename --help' for more information.
> --> Testing
> PASS: atomic_barrier_noinline
> - 1 threads: Passed
> PASS: atomic_barrier_noinline
> - 2 threads: Passed
> PASS: atomic_barrier_noinline
> - 4 threads: Passed
> PASS: atomic_barrier_noinline
> - 5 threads: Passed
> PASS: atomic_barrier_noinline
> - 8 threads: Passed
> basename: extra operand '--test-name'
> Try 'basename --help' for more information.
> --> Testing
> PASS: atomic_spinlock
> - 1 threads: Passed
> PASS: atomic_spinlock
> - 2 threads: Passed
> PASS: atomic_spinlock
> - 4 threads: Passed
> PASS: atomic_spinlock
> - 5 threads: Passed
> PASS: atomic_spinlock
> - 8 threads: Passed
> basename: extra operand '--test-name'
> Try 'basename --help' for more information.
> --> Testing
> PASS: atomic_spinlock_noinline
> - 1 threads: Passed
> PASS: atomic_spinlock_noinline
> - 2 threads: Passed
> PASS: atomic_spinlock_noinline
> - 4 threads: Passed
> PASS: atomic_spinlock_noinline
> - 5 threads: Passed
> PASS: atomic_spinlock_noinline
> - 8 threads: Passed
> basename: extra operand '--test-name'
> Try 'basename --help' for more information.
> --> Testing
> PASS: atomic_math
> - 1 threads: Passed
> PASS: atomic_math
> - 2 threads: Passed
> PASS: atomic_math
> - 4 threads: Passed
> PASS: atomic_math
> - 5 threads: Passed
> PASS: atomic_math
> - 8 threads: Passed
> basename: extra operand '--test-name'
> Try 'basename --help' for more information.
> --> Testing
> PASS: atomic_math_noinline
> - 1 threads: Passed
> PASS: atomic_math_noinline
> - 2 threads: Passed
> PASS: atomic_math_noinline
> - 4 threads: Passed
> PASS: atomic_math_noinline
> - 5 threads: Passed
> PASS: atomic_math_noinline
> - 8 threads: Passed
> basename: extra operand '--test-name'
> Try 'basename --help' for more information.
> --> Testing
> FAIL: atomic_cmpset
> ../../config/test-driver: line 107: 23509 Aborted (core
> dumped) "$@" > $log_file 2>&1
> - 1 threads: Passed
> ../../config/test-driver: line 107: 23513 Aborted (core
> dumped) "$@" > $log_file 2>&1
> FAIL: atomic_cmpset
> - 2 threads: Passed
> ../../config/test-driver: line 107: 23520 Aborted (core
> dumped) "$@" > $log_file 2>&1
> FAIL: atomic_cmpset
> - 4 threads: Passed
> ../../config/test-driver: line 107: 23528 Aborted (core
> dumped) "$@" > $log_file 2>&1
> FAIL: atomic_cmpset
> - 5 threads: Passed
> FAIL: atomic_cmpset
> ../../config/test-driver: line 107: 23538 Aborted (core
> dumped) "$@" > $log_file 2>&1
> - 8 threads: Passed
> basename: extra operand '--test-name'
> Try 'basename --help' for more information.
> --> Testing
> FAIL: atomic_cmpset_noinline
> - 1 threads: Passed
> ../../config/test-driver: line 107: 23551 Aborted (core
> dumped) "$@" > $log_file 2>&1
> FAIL: atomic_cmpset_noinline
> ../../config/test-driver: line 107: 23557 Aborted (core
> dumped) "$@" > $log_file 2>&1
> - 2 threads: Passed
> ../../config/test-driver: line 107: 23563 Aborted (core
> dumped) "$@" > $log_file 2>&1
> FAIL: atomic_cmpset_noinline
> - 4 threads: Passed
> ../../config/test-driver: line 107: 23570 Aborted (core
> dumped) "$@" > $log_file 2>&1
> FAIL: atomic_cmpset_noinline
> - 5 threads: Passed
> ../../config/test-driver: line 107: 23578 Aborted (core
> dumped) "$@" > $log_file 2>&1
> FAIL: atomic_cmpset_noinline
> - 8 threads: Passed
> 
> 
> Testsuite summary for Open MPI 2.0.2rc1
> 
> 
> # TOTAL: 8
> # PASS:  6
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  2
> # XPASS: 0
> # ERROR: 0
> 
> 
>
> 2.0.1 is still building fine.
>
> --
> Orion Poplawski
> Technical Manager 303-415-9701 x222
> NWRA, Boulder/CoRA Office FAX: 303-415-9702
> 3380 Mitchell Lane

Re: [OMPI devel] Open MPI v2.0.2rc1 is up

2016-12-19 Thread Howard Pritchard
HI Paul,

Would you mind resending the  "runtime error w/ PGI usempif08 on OpenPOWER"
email without the config.log attached?

Thanks,

Howard


2016-12-16 12:17 GMT-07:00 Howard Pritchard :

> HI Paul,
>
> Thanks for checking the rc out.  And for noting the grammar
> mistake.
>
> Howard
>
>
> 2016-12-16 1:00 GMT-07:00 Paul Hargrove :
>
>> My testing is complete.
>>
>> The only problems not already known are related to PGI's recent
>> "Community Edition" compilers and have been reported in three separate
>> emails:
>>
>> [2.0.2rc1] Fortran link failure with PGI fortran on MacOSX
>> <https://mail-archive.com/devel@lists.open-mpi.org/msg19823.html>
>> [2.0.2rc1] Build failure w/ PGI compilers on Mac OS X
>> <https://mail-archive.com/devel@lists.open-mpi.org/msg19824.html>
>> [2.0.2rc1] runtime error w/ PGI usempif08 on OpenPOWER
>>
>> For some reason the last one does not appear in the archive!
>> Perhaps the config.log.bz2 I attached was too large?
>> Let me know if I should resend it.
>>
>> BTW:  a typo in the ChangeLog of the announcement email:
>>
>> - Fix a problem with early exit of a MPI process without calling
>> MPI_FINALIZE
>>   of MPI_ABORT that could lead to job hangs.  Thanks to Christof Koehler
>> for
>>   reporting.
>>
>> The "of" that begins the second line was almost certainly intended to be
>> "or".
>>
>> -Paul
>>
>> On Wed, Dec 14, 2016 at 6:58 PM, Jeff Squyres (jsquyres) <
>> jsquy...@cisco.com> wrote:
>>
>>> Please test!
>>>
>>> https://www.open-mpi.org/software/ompi/v2.0/
>>>
>>> Changes since v2.0.1:
>>>
>>> - Remove use of DATE in the message queue version string reported to
>>> debuggers to
>>>   insure bit-wise reproducibility of binaries.  Thanks to Alastair
>>> McKinstry
>>>   for help in fixing this problem.
>>> - Fix a problem with early exit of a MPI process without calling
>>> MPI_FINALIZE
>>>   of MPI_ABORT that could lead to job hangs.  Thanks to Christof Koehler
>>> for
>>>   reporting.
>>> - Fix a problem with forward of SIGTERM signal from mpirun to MPI
>>> processes
>>>   in a job.  Thanks to Noel Rycroft for reporting this problem
>>> - Plug some memory leaks in MPI_WIN_FREE discovered using Valgrind.
>>> Thanks
>>>   to Joseph Schuchart for reporting.
>>> - Fix a problem with MPI_NEIGHBOR_ALLTOALL when using a communicator with an
>>> empty topology
>>>   graph.  Thanks to Daniel Ibanez for reporting.
>>> - Fix a typo in a PMIx component help file.  Thanks to @njoly for
>>> reporting this.
>>> - Fix a problem with Valgrind false positives when using Open MPI's
>>> internal memchecker.
>>>   Thanks to Yvan Fournier for reporting.
>>> - Fix a problem with MPI_FILE_DELETE returning MPI_SUCCESS when
>>>   deleting a non-existent file. Thanks to Wei-keng Liao for reporting.
>>> - Fix a problem with MPI_IMPROBE that could lead to hangs in subsequent
>>> MPI
>>>   point to point or collective calls.  Thanks to Chris Pattison for
>>> reporting.
>>> - Fix a problem when configuring Open MPI for powerpc with --enable-mpi-cxx
>>>   enabled.  Thanks to amckinstry for reporting.
>>> - Fix a problem using MPI_IALLTOALL with MPI_IN_PLACE argument.  Thanks
>>> to
>>>   Chris Ward for reporting.
>>> - Fix a problem using MPI_RACCUMULATE with the Portals4 transport.
>>> Thanks to
>>>   @PDeveze for reporting.
>>> - Fix an issue with static linking and duplicate symbols arising from
>>> PMIx
>>>   Slurm components.  Thanks to Limin Gu for reporting.
>>> - Fix a problem when using MPI dynamic memory windows.  Thanks to
>>>   Christoph Niethammer for reporting.
>>> - Fix a problem with Open MPI's pkgconfig files.  Thanks to Alastair
>>> McKinstry
>>>   for reporting.
>>> - Fix a problem with MPI_IREDUCE when the same buffer is supplied for the
>>>   send and recv buffer arguments.  Thanks to Valentin Petrov for
>>> reporting.
>>> - Fix a problem with atomic operations on PowerPC.  Thanks to Paul
>>>   Hargrove for reporting.
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: http://www.cisco.com/web/about
>>> /doing_business/legal/cri/
>>>
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>>
>>
>>
>>
>> --
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> <(510)%20495-2352>
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> <(510)%20486-6900>
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Open MPI v2.0.2rc1 is up

2016-12-16 Thread Howard Pritchard
HI Paul,

Thanks for checking the rc out.  And for noting the grammar
mistake.

Howard


2016-12-16 1:00 GMT-07:00 Paul Hargrove :

> My testing is complete.
>
> The only problems not already known are related to PGI's recent "Community
> Edition" compilers and have been reported in three separate emails:
>
> [2.0.2rc1] Fortran link failure with PGI fortran on MacOSX
> 
> [2.0.2rc1] Build failure w/ PGI compilers on Mac OS X
> 
> [2.0.2rc1] runtime error w/ PGI usempif08 on OpenPOWER
>
> For some reason the last one does not appear in the archive!
> Perhaps the config.log.bz2 I attached was too large?
> Let me know if I should resend it.
>
> BTW:  a typo in the ChangeLog of the announcement email:
>
> - Fix a problem with early exit of a MPI process without calling
> MPI_FINALIZE
>   of MPI_ABORT that could lead to job hangs.  Thanks to Christof Koehler
> for
>   reporting.
>
> The "of" that begins the second line was almost certainly intended to be
> "or".
>
> -Paul
>
> On Wed, Dec 14, 2016 at 6:58 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> Please test!
>>
>> https://www.open-mpi.org/software/ompi/v2.0/
>>
>> Changes since v2.0.1:
>>
>> - Remove use of DATE in the message queue version string reported to
>> debuggers to
>>   insure bit-wise reproducibility of binaries.  Thanks to Alastair
>> McKinstry
>>   for help in fixing this problem.
>> - Fix a problem with early exit of a MPI process without calling
>> MPI_FINALIZE
>>   of MPI_ABORT that could lead to job hangs.  Thanks to Christof Koehler
>> for
>>   reporting.
>> - Fix a problem with forward of SIGTERM signal from mpirun to MPI
>> processes
>>   in a job.  Thanks to Noel Rycroft for reporting this problem
>> - Plug some memory leaks in MPI_WIN_FREE discovered using Valgrind.
>> Thanks
>>   to Joseph Schuchart for reporting.
>> - Fix a problem with MPI_NEIGHBOR_ALLTOALL when using a communicator with an
>> empty topology
>>   graph.  Thanks to Daniel Ibanez for reporting.
>> - Fix a typo in a PMIx component help file.  Thanks to @njoly for
>> reporting this.
>> - Fix a problem with Valgrind false positives when using Open MPI's
>> internal memchecker.
>>   Thanks to Yvan Fournier for reporting.
>> - Fix a problem with MPI_FILE_DELETE returning MPI_SUCCESS when
>>   deleting a non-existent file. Thanks to Wei-keng Liao for reporting.
>> - Fix a problem with MPI_IMPROBE that could lead to hangs in subsequent
>> MPI
>>   point to point or collective calls.  Thanks to Chris Pattison for
>> reporting.
>> - Fix a problem when configuring Open MPI for powerpc with --enable-mpi-cxx
>>   enabled.  Thanks to amckinstry for reporting.
>> - Fix a problem using MPI_IALLTOALL with MPI_IN_PLACE argument.  Thanks to
>>   Chris Ward for reporting.
>> - Fix a problem using MPI_RACCUMULATE with the Portals4 transport.
>> Thanks to
>>   @PDeveze for reporting.
>> - Fix an issue with static linking and duplicate symbols arising from PMIx
>>   Slurm components.  Thanks to Limin Gu for reporting.
>> - Fix a problem when using MPI dynamic memory windows.  Thanks to
>>   Christoph Niethammer for reporting.
>> - Fix a problem with Open MPI's pkgconfig files.  Thanks to Alastair
>> McKinstry
>>   for reporting.
>> - Fix a problem with MPI_IREDUCE when the same buffer is supplied for the
>>   send and recv buffer arguments.  Thanks to Valentin Petrov for
>> reporting.
>> - Fix a problem with atomic operations on PowerPC.  Thanks to Paul
>>   Hargrove for reporting.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: http://www.cisco.com/web/about
>> /doing_business/legal/cri/
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> <(510)%20495-2352>
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> <(510)%20486-6900>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] heads up about OMPI/master

2016-12-01 Thread Howard Pritchard
Ralph,

I don't know how it happened, but if you do

git log --oneline --topo-order

you don't see a "Merge pull request #2488" commit in the history for master.
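
(A more direct check is to list only the merge commits, something along the
lines of

git log --oneline --merges | grep -i '#2488'

and see whether that PR's merge commit shows up at all.)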

Howard


2016-12-01 16:59 GMT-07:00 r...@open-mpi.org :

> Ummm...guys, it was done via PR. I saw it go by, and it was all done to
> procedure:
>
> https://github.com/open-mpi/ompi/pull/2488
>
> So please don’t jump to conclusions
>
> On Dec 1, 2016, at 3:49 PM, Artem Polyakov  wrote:
>
> But I guess that we can verify that things are not broken using other PR's.
> Looks that all is good: https://github.com/open-mpi/ompi/pull/2493
>
> 2016-12-01 15:38 GMT-08:00 Artem Polyakov :
>
>> All systems are different and it is hard to compete in coverage with our
>> set of Jenkins' :).
>>
>> 2016-12-01 14:51 GMT-08:00 r...@open-mpi.org :
>>
>>> FWIW: I verified it myself, and it was fine on my systems
>>>
>>> On Dec 1, 2016, at 2:46 PM, Gilles Gouaillardet <
>>> gilles.gouaillar...@gmail.com> wrote:
>>>
>>> fwiw, the major change is in https://github.com/open-mpi
>>> /ompi/commit/c9aeccb84e4626c350af4daa974d37775db5b25e
>>> and i did spend quite some time testing it.
>>>
>>> if you are facing some unexpected issues with the latest master, you
>>> might try to revert this locally and see if it helps
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Friday, December 2, 2016, Gilles Gouaillardet <
>>> gilles.gouaillar...@gmail.com> wrote:
>>>
>>>> Howard,
>>>>
>>>> i pushed a bunch of commits yesterday, and that was not an accident.
>>>> you might be referring https://github.com/o
>>>> pen-mpi/ompi/commit/cb55c88a8b7817d5891ff06a447ea190b0e77479 but it
>>>> has already been reverted 9 days ago with https://github.com/open-m
>>>> pi/ompi/commit/1e2019ce2a903be24361b3424d8e98d27e941c6c
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Friday, December 2, 2016, r...@open-mpi.org  wrote:
>>>>
>>>>> I should add, FWIW: I’m working with the HEAD of master right now, and
>>>>> not seeing any problems.
>>>>>
>>>>> On Dec 1, 2016, at 2:10 PM, r...@open-mpi.org wrote:
>>>>>
>>>>> ?? I see a bunch of commits that were all collected in a single PR
>>>>> from Gilles yesterday - is that what you are referring to?
>>>>>
>>>>> On Dec 1, 2016, at 1:58 PM, Howard Pritchard 
>>>>> wrote:
>>>>>
>>>>> Hi Folks,
>>>>>
>>>>> Just an FYI it looks like a bunch of commits may have been
>>>>> accidentally pushed to
>>>>> master sometime in the past day.  You may not want to merge
>>>>> origin/master (if origin is
>>>>> how you reference https://github.com/open-mpi/omp) into your master
>>>>> or rebase off
>>>>> of it until we get some clarity on what has happened.
>>>>>
>>>>> Howard
>>>>>
>>>>> ___
>>>>> devel mailing list
>>>>> devel@lists.open-mpi.org
>>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>>>>
>>>>>
>>>>>
>>>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>>
>>>
>>>
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>>
>>
>>
>>
>> --
>> С Уважением, Поляков Артем Юрьевич
>> Best regards, Artem Y. Polyakov
>>
>
>
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] heads up about OMPI/master

2016-12-01 Thread Howard Pritchard
Hi Gilles

I didn't see a merge commit for all these commits,
hence my concern that it was a mistake.

In general it's better to pull in commits via the PR process.

Howard

On Thursday, December 1, 2016, Gilles Gouaillardet  wrote:

> fwiw, the major change is in https://github.com/open-mpi/ompi/commit/
> c9aeccb84e4626c350af4daa974d37775db5b25e
> and i did spend quite some time testing it.
>
> if you are facing some unexpected issues with the latest master, you might
> try to revert this locally and see if it helps
>
> Cheers,
>
> Gilles
>
> On Friday, December 2, 2016, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com
> > wrote:
>
>> Howard,
>>
>> i pushed a bunch of commits yesterday, and that was not an accident.
>> you might be referring https://github.com/open-mpi/ompi/commit/cb55c88a8b
>> 7817d5891ff06a447ea190b0e77479 but it has already been reverted 9 days
>> ago with https://github.com/open-mpi/ompi/commit/1e2019ce2a903be
>> 24361b3424d8e98d27e941c6c
>>
>> Cheers,
>>
>> Gilles
>>
>> On Friday, December 2, 2016, r...@open-mpi.org  wrote:
>>
>>> I should add, FWIW: I’m working with the HEAD of master right now, and
>>> not seeing any problems.
>>>
>>> On Dec 1, 2016, at 2:10 PM, r...@open-mpi.org wrote:
>>>
>>> ?? I see a bunch of commits that were all collected in a single PR from
>>> Gilles yesterday - is that what you are referring to?
>>>
>>> On Dec 1, 2016, at 1:58 PM, Howard Pritchard 
>>> wrote:
>>>
>>> Hi Folks,
>>>
>>> Just an FYI it looks like a bunch of commits may have been accidentally
>>> pushed to
>>> master sometime in the past day.  You may not want to merge
>>> origin/master (if origin is
>>> how you reference https://github.com/open-mpi/omp) into your master or
>>> rebase off
>>> of it until we get some clarity on what has happened.
>>>
>>> Howard
>>>
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>>
>>>
>>>
>>>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] heads up about OMPI/master

2016-12-01 Thread Howard Pritchard
Hi Folks,

Just an FYI it looks like a bunch of commits may have been accidentally
pushed to
master sometime in the past day.  You may not want to merge origin/master
(if origin is
how you reference https://github.com/open-mpi/omp) into your master or
rebase off
of it until we get some clarity on what has happened.

Howard
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] sm BTL performace of the openmpi-2.0.0

2016-08-05 Thread Howard Pritchard
Hello Christoph

The rdmacm messages, while annoying, are not causing the problem.

If you specify the tcp BTL, does the BW drop disappear?
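(e.g., something along the lines of "mpirun -np 2 --mca btl self,tcp osu_bw",
to force tcp instead of openib for the on-node path)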

Also, could you post your configure options to the mailing list?

Thanks

Howard

On Friday, August 5, 2016, Christoph Niethammer  wrote:

> Hello,
>
> We see the same problem here on various machines with Open MPI 2.0.0.
> To us it seems that enabling the openib btl triggers bad performance for
> the sm AND vader btls!
> --mca btl_base_verbose 10 reports in both cases the correct use of sm and
> vader between MPI ranks - only performance differs?!
>
> One irritating thing I see in the log output is the following:
>   openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped
>   [rank=1] openib: using port mlx4_0:1
>   select: init of component openib returned success
>
> Did not look into the "Skipped" code part yet, but maybe there is a
> problem not skipping "as intended" confusing interfaces later on?
>
> Results see below.
>
> Best regards
> Christoph Niethammer
>
> --
>
> Christoph Niethammer
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstrasse 19
> 70569 Stuttgart
>
> Tel: ++49(0)711-685-87203
> email: nietham...@hlrs.de 
> http://www.hlrs.de/people/niethammer
>
>
>
> mpirun -np 2 --mca btl self,vader  osu_bw
> # OSU MPI Bandwidth Test
> # SizeBandwidth (MB/s)
> 1 4.83
> 210.30
> 424.68
> 849.27
> 16   95.80
> 32  187.52
> 64  270.82
> 128 405.00
> 256 659.26
> 5121165.14
> 1024   2372.83
> 2048   3592.85
> 4096   4283.51
> 8192   5523.55
> 16384  7388.92
> 32768  7024.37
> 65536  7353.79
> 131072 7465.96
> 262144 8597.56
> 524288 9292.86
> 10485769168.01
> 20971529009.62
> 41943049013.02
>
> mpirun -np 2 --mca btl self,vader,openib  osu_bw
> # OSU MPI Bandwidth Test
> # SizeBandwidth (MB/s)
> 1 5.32
> 211.14
> 420.88
> 849.26
> 16   99.11
> 32  197.42
> 64  301.08
> 128 413.64
> 256 651.15
> 5121161.12
> 1024   2460.99
> 2048   3627.36
> 4096   2191.06
> 8192   3118.36
> 16384  3428.45
> 32768  3676.96
> 65536  3709.65
> 131072 3748.64
> 262144 3764.88
> 524288 3764.61
> 10485763772.45
> 20971523757.37
> 41943043746.45
>
> mpirun -np 2 --mca btl self,sm  osu_bw
> # OSU MPI Bandwidth Test
> # SizeBandwidth (MB/s)
> 1 2.98
> 2 5.97
> 411.99
> 823.47
> 16   50.64
> 32   99.91
> 64  197.87
> 128 343.32
> 256 667.48
> 5121200.86
> 1024   2050.05
> 2048   3578.52
> 4096   3966.92
> 8192   5687.96
> 16384  7395.88
> 32768  7101.41
> 65536  7619.49
> 131072 7978.09
> 262144 8648.87
> 524288 9129.18
> 1048576   10525.31
> 2097152   10511.63
> 4194304   10489.66
>
> mpirun -np 2 --mca btl self,sm,openib  osu_bw
> # OSU MPI Bandwidth Test
> # SizeBandwidth (MB/s)
> 1 2.02
> 2 3.00
> 4 9.99
> 819.96
> 16   40.10
> 32   70.63
> 64  144.08
> 128 282.21
> 256 543.55
> 5121032.61
> 1024   1871.09
> 2048   3294.07
> 4096   2336.48
> 8192   3142.22
> 16384  3419.93
> 32768  3647.30
> 65536  3725.40
> 131072 3749.43
> 262144 3765.31
> 524288 3771.06
> 10485763772.54
> 20971523760.93
> 41943043745.37
>
> - Original Message -
> From: tmish...@jcity.maeda.co.jp 
> To: "Open MPI Developers" >
> Sent: Wednesday, July 27, 2016 6:04:48 AM
> Subject: Re: [OMPI devel] sm BTL performace of the openmpi-2.0.0
>
> Hi Nathan,
>
> I applied those commits and ran again without any BTL specifi

[OMPI devel] tcp btl rendezvous performance question

2016-07-18 Thread Howard Pritchard
Hi Folks,

I have a cluster with some 100 Gb ethernet cards
installed.  What we are noticing is that if we force Open MPI 1.10.3
to go through the TCP BTL (rather than yalla), the performance of
osu_bw falls off a cliff once the TCP BTL switches from eager to
rendezvous (> 32KB), going from about 1.6 GB/sec to 233 MB/sec,
and it stays that way out to 4 MB message lengths.

There's nothing wrong with the IP stack (iperf -P4 gives
63 Gb/sec).

So, my questions are

1) is this performance expected for the TCP BTL when in
rendezvous mode?
2) is there some way to get closer to the single-socket
performance obtained with iperf for large messages (~16 Gb/sec)?

We tried adjusting the tcp_btl_rendezvous threshold but that doesn't
appear to actually be adjustable from the mpirun command line.
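
For reference, the sort of invocation we have been experimenting with looks
roughly like the lines below; the btl_tcp_* parameter names are my best
recollection from ompi_info output, so treat them as a guess rather than as
the definitive knob names:

mpirun -np 2 --mca btl self,tcp \
    --mca btl_tcp_eager_limit 65536 \
    --mca btl_tcp_rndv_eager_limit 65536 ./osu_bw

ompi_info --param btl tcp --level 9 | grep limit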

Thanks for any suggestions,

Howard


Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end

2016-07-13 Thread Howard Pritchard
Hi Eric,

Thanks very much for finding this problem.  In order to have a reasonably
timely release, we decided that we'd triage issues and only turn around a
new RC if something drastic appeared.  We want to fix this issue (and it
will be fixed), but we've decided to defer the fix to a 2.0.1 bug fix
release.

Howard
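
P.S. For anyone who wants to poke at this before 2.0.1 ships, the pattern
Eric describes boils down to something like the two-rank sketch below.  This
is my reconstruction, not his actual test code, so treat it as an
approximation (build with mpicc, run with "mpirun -np 2 ./a.out"):

/* rank 0 contributes data to the split collective write, rank 1 does not */
#include <mpi.h>
int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Status status;
    int rank, val = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    if (rank == 0) {
        MPI_File_write_all_begin(fh, &val, 1, MPI_INT);
        MPI_File_write_all_end(fh, &val, &status);
    } else {
        /* zero-length contribution with a NULL buffer */
        MPI_File_write_all_begin(fh, NULL, 0, MPI_INT);
        MPI_File_write_all_end(fh, NULL, &status);
    }
    /* Eric's code then starts another write_all_begin on the same file
       with a different datatype, without waiting for the other rank;
       that step is omitted here. */
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}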



2016-07-12 13:51 GMT-06:00 Eric Chamberland <
eric.chamberl...@giref.ulaval.ca>:

> Hi Edgard,
>
> I just saw that your patch got into ompi/master... any chances it goes
> into ompi-release/v2.x before rc5?
>
> thanks,
>
> Eric
>
>
> On 08/07/16 03:14 PM, Edgar Gabriel wrote:
>
>> I think I found the problem, I filed a pr towards master, and if that
>> passes I will file a pr for the 2.x branch.
>>
>> Thanks!
>> Edgar
>>
>>
>> On 7/8/2016 1:14 PM, Eric Chamberland wrote:
>>
>>>
>>> On 08/07/16 01:44 PM, Edgar Gabriel wrote:
>>>
 ok, but just to be able to construct a test case, basically what you are
 doing is

 MPI_File_write_all_begin (fh, NULL, 0, some datatype);

 MPI_File_write_all_end (fh, NULL, &status),

 is this correct?

>>> Yes, but with 2 processes:
>>>
>>> rank 0 writes something, but not rank 1...
>>>
>>> other info: rank 0 didn't wait for rank1 after MPI_File_write_all_end so
>>> it continued to the next MPI_File_write_all_begin with a different
>>> datatype but on the same file...
>>>
>>> thanks!
>>>
>>> Eric
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2016/07/19173.php
>>>
>>
>> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/07/19192.php
>


Re: [OMPI devel] [2.0.0rc4] non-critical faulres report

2016-07-12 Thread Howard Pritchard
Paul,

Could you narrow down the versions of the PGCC where you get the ICE when
using the -m32 option?

Thanks,

Howard


2016-07-06 15:29 GMT-06:00 Paul Hargrove :

> The following are previously reported issues that I am *not* expecting to
> be resolved in 2.0.0.
> However, I am listing them here for completeness.
>
> Known, but with later target:
>
> OpenBSD fails to build ROMIO - PR1178 exists with v2.0.1 target
> NAG Fortran support - PR1215 exists with v2.0.1 target
>
> Known, but *not* suspected to be the fault of Open MPI or it embedded
> components:
>
> Pathcc gets ICE - versions 5.0.5 and 6.0.527 get compiler crashes building
> Open MPI
> Pgcc -m32 gets ICE - versions 12.x and 13.x (the only ones I can test w/
> -m32) crash compiling hwloc
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/07/19155.php
>


Re: [OMPI devel] Issue with 2.0.0rc3, singleton init

2016-06-16 Thread Howard Pritchard
Hi Lisandro,

Thanks for giving the rc3 a try.  Could you post the output of ompi_info
from your
install to the list?

Thanks,

Howard


2016-06-16 7:55 GMT-06:00 Lisandro Dalcin :

> ./configure --prefix=/home/devel/mpi/openmpi/2.0.0rc3 --enable-debug
> --enable-mem-debug
>
> https://bitbucket.org/mpi4py/mpi4py/src/master/demo/helloworld.c
>
> $ mpicc helloworld.c
>
> $ mpiexec -n 1 ./a.out
> Hello, World! I am process 0 of 1 on kw14821.
>
> $ ./a.out
> [kw14821:31370] *** Process received signal ***
> [kw14821:31370] Signal: Segmentation fault (11)
> [kw14821:31370] Signal code: Address not mapped (1)
> [kw14821:31370] Failing at address: 0xf8
> [kw14821:31370] [ 0] /lib64/libpthread.so.0(+0x10a00)[0x7fc816196a00]
> [kw14821:31370] [ 1]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/libopen-pal.so.20(opal_libevent2022_event_priority_set+0xcb)[0x7fc81584c7db]
> [kw14821:31370] [ 2]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/openmpi/mca_rml_oob.so(orte_rml_oob_recv_buffer_nb+0x154)[0x7fc81277f95f]
> [kw14821:31370] [ 3]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/openmpi/mca_grpcomm_direct.so(+0x17c2)[0x7fc81469f7c2]
> [kw14821:31370] [ 4]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/libopen-rte.so.20(orte_grpcomm_base_select+0x17b)[0x7fc815b522e9]
> [kw14821:31370] [ 5]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/libopen-rte.so.20(orte_ess_base_app_setup+0x985)[0x7fc815b4cafe]
> [kw14821:31370] [ 6]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/openmpi/mca_ess_singleton.so(+0x37e2)[0x7fc81429c7e2]
> [kw14821:31370] [ 7]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/libopen-rte.so.20(orte_init+0x2d2)[0x7fc815b05b27]
> [kw14821:31370] [ 8]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/libmpi.so.20(ompi_mpi_init+0x31b)[0x7fc8163fbecf]
> [kw14821:31370] [ 9]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/libmpi.so.20(PMPI_Init_thread+0x7f)[0x7fc81642feae]
> [kw14821:31370] [10] ./a.out[0x4008f3]
> [kw14821:31370] [11]
> /lib64/libc.so.6(__libc_start_main+0xf0)[0x7fc815de5580]
> [kw14821:31370] [12] ./a.out[0x4007e9]
> [kw14821:31370] *** End of error message ***
> Segmentation fault (core dumped)
>
>
> $ valgrind -q ./a.out
> ==31396== Conditional jump or move depends on uninitialised value(s)
> ==31396==at 0x5A9D4CA: opal_value_unload (dss_load_unload.c:291)
> ==31396==by 0x74B6378: rte_init (ess_singleton_module.c:260)
> ==31396==by 0x57A2B26: orte_init (orte_init.c:226)
> ==31396==by 0x4E8CECE: ompi_mpi_init (ompi_mpi_init.c:501)
> ==31396==by 0x4EC0EAD: PMPI_Init_thread (pinit_thread.c:69)
> ==31396==by 0x4008F2: main (in
> /home/dalcinl/Devel/mpi4py-dev/demo/a.out)
> ==31396==
> ==31396== Invalid read of size 4
> ==31396==at 0x5AEE7DB: opal_libevent2022_event_priority_set
> (event.c:1859)
> ==31396==by 0x8FD195E: orte_rml_oob_recv_buffer_nb (rml_oob_recv.c:74)
> ==31396==by 0x70AE7C1: init (grpcomm_direct.c:78)
> ==31396==by 0x57EF2E8: orte_grpcomm_base_select
> (grpcomm_base_select.c:87)
> ==31396==by 0x57E9AFD: orte_ess_base_app_setup (ess_base_std_app.c:223)
> ==31396==by 0x74B67E1: rte_init (ess_singleton_module.c:323)
> ==31396==by 0x57A2B26: orte_init (orte_init.c:226)
> ==31396==by 0x4E8CECE: ompi_mpi_init (ompi_mpi_init.c:501)
> ==31396==by 0x4EC0EAD: PMPI_Init_thread (pinit_thread.c:69)
> ==31396==by 0x4008F2: main (in
> /home/dalcinl/Devel/mpi4py-dev/demo/a.out)
> ==31396==  Address 0xf8 is not stack'd, malloc'd or (recently) free'd
> ==31396==
> [kw14821:31396] *** Process received signal ***
> [kw14821:31396] Signal: Segmentation fault (11)
> [kw14821:31396] Signal code: Address not mapped (1)
> [kw14821:31396] Failing at address: 0xf8
> [kw14821:31396] [ 0] /lib64/libpthread.so.0(+0x10a00)[0x51bea00]
> [kw14821:31396] [ 1]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/libopen-pal.so.20(opal_libevent2022_event_priority_set+0xcb)[0x5aee7db]
> [kw14821:31396] [ 2]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/openmpi/mca_rml_oob.so(orte_rml_oob_recv_buffer_nb+0x154)[0x8fd195f]
> [kw14821:31396] [ 3]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/openmpi/mca_grpcomm_direct.so(+0x17c2)[0x70ae7c2]
> [kw14821:31396] [ 4]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/libopen-rte.so.20(orte_grpcomm_base_select+0x17b)[0x57ef2e9]
> [kw14821:31396] [ 5]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/libopen-rte.so.20(orte_ess_base_app_setup+0x985)[0x57e9afe]
> [kw14821:31396] [ 6]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/openmpi/mca_ess_singleton.so(+0x37e2)[0x74b67e2]
> [kw14821:31396] [ 7]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/libopen-rte.so.20(orte_init+0x2d2)[0x57a2b27]
> [kw14821:31396] [ 8]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/libmpi.so.20(ompi_mpi_init+0x31b)[0x4e8cecf]
> [kw14821:31396] [ 9]
>
> /home/devel/mpi/openmpi/2.0.0rc3/lib/libmpi.so.20(PMPI_Init_thread+0x7f)[0x4ec0eae]
> [kw14821:31396] [10] ./a.out[0x4008f3]
> [kw14821:31396] [11] /lib64/libc.so.6(__libc_start_main+0xf0)[0x53ec580]
> [kw14821:31396] [12] ./a.out[0x4007e9]
> [kw14821:31396] *** End of error message ***
> ==31396==

[OMPI devel] Open MPI v2.0.0rc3 now available

2016-06-15 Thread Howard Pritchard
We are now feature complete for v2.0.0 and would appreciate testing by
developers and end users before we finalize the v2.0.0 release.  In that
light, v2.0.0rc3 is now available:

https://www.open-mpi.org/software/ompi/v2.x/

Here are the changes since 2.0.0rc2:

- The MPI C++ bindings -- which were removed from the MPI standard in
   v3.0 -- are no longer built by default and will be removed in some
   future version of Open MPI.  Use the --enable-mpi-cxx-bindings
   configure option to build the deprecated/removed MPI C++ bindings.

--> NOTE: this is not new, actually -- but we just added it to the NEWS
file.

- In environments where mpirun cannot automatically determine the
  number of slots available (e.g., when using a hostfile that does not
  specify "slots", or when using --host without specifying a ":N"
  suffix to hostnames), mpirun now requires the use of "-np N" to
  specify how many MPI processes to launch.

- Many updates and fixes to the revamped memory hooks infrastructure

- Various configure-related compatibility updates for newer versions
   of libibverbs and OFED.

- Properly detect Intel TrueScale and OmniPath devices in the ACTIVE
  state.  Thanks to Durga Choudhury for reporting the issue.

- Fix MPI_IREDUCE_SCATTER_BLOCK for a one-process communicator. Thanks
   to Lisandro Dalcin for reporting.

- Fix detection and use of Solaris Studio 12.5 (beta) compilers.
   Thanks to Paul Hargrove for reporting and debugging.

- Allow NULL arrays when creating empty MPI datatypes.

- Miscellaneous minor bug fixes in the hcoll component.

- Miscellaneous minor bug fixes in the ugni component.

- Fix various small memory leaks.

- Notable new MCA parameters:

   -  opal_progress_lp_call_ratio: Control how often low-priority
  callbacks are made during Open MPI's main progress loop.

- Disable backtrace support by default in the PSM/PSM2 libraries to
  prevent unintentional conflicting behavior.

Thanks,

Howard

-- 

Howard Pritchard
HPC-DES
Los Alamos National Laboratory


Re: [OMPI devel] Jenkins testing - what purpose are we striving to achieve?

2016-06-07 Thread Howard Pritchard
HI Ralph,

We briefly discussed this some today.  I would like to avoid the mini-MTT
approach for PR checking.
At the same time, one can also see why it might be useful from time to time
to make changes to
the script a given jenkins project runs on PRs.

An idea we discussed was to have jenkins folks support a "stable" version
of their jenkins script.  If they would
like to make changes,  they would create an experimental, temporary jenkins
project to run the new script.
If the new project's script runs clean against open PRs, the new script can
be swapped in to the
original jenkins project.  The experimental project could then be
deactivated.  If the new script showed failures in the
open PRs, or against master or another branch, issues can be opened to track
the problem(s) found by the
script.  The experimental, temporary jenkins project can continue to run,
but its  "failure" status can be ignored
until the underlying bug(s) is fixed.

I don't think it makes much sense to run a jenkins script against PRs if it
fails when run against master.
The purpose of jenkins PR testing is to trap new problems, not to keep
reminding us there are problems
with the underlying branch which the PR targets.

Howard


2016-06-07 13:33 GMT-06:00 Ralph Castain :

> Hi folks
>
> I’m trying to get a handle on our use of Jenkins testing for PRs prior to
> committing them. When we first discussed this, it was my impression that
> our objective was to screen PRs to catch any errors caused by differences
> in environment and to avoid regressions. However, it appears that the tests
> keep changing without warning, leading to the impression that we are using
> Jenkins as a “mini-MTT” testing device.
>
> So I think we need to come to consensus on the purpose of the Jenkins
> testing. If it is to screen for regressions, then the tests need to remain
> stable. A PR that does not introduce any new problems might not address old
> ones, but that is no reason to flag it as an “error”.
>
> On the other hand, if the objective is to use Jenkins as a “mini-MTT”,
> then we need to agree on how/when a PR is ready to be merged. Insisting
> that nothing be merged until even a mini-MTT is perfectly clean is probably
> excessively prohibitive - it would require that the entire community (and
> not just the one proposing the PR) take responsibility for cleaning up the
> code base against any and all imposed tests.
>
> So I would welcome opinions on this: are we using Jenkins as a screening
> tool on changes, or as a test for overall correctness of the code base?
>
> Ralph
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/06/19087.php


[OMPI devel] NERSC down today so lanl-bot getting time off

2016-05-24 Thread Howard Pritchard
Hi Folks,

NERSC is doing some major maintenance and both
the edison and cori systems are offline.  As a consequence,
the lanl-bot can't run some of its checks on OMPI PRs
today.

So you can either ignore the lack of status update for PRs
for today and go ahead and merge, or wait till tomorrow
when lanl-bot should be back in business wrt cori and edison.

Howard


[OMPI devel] updating the users migration guide request

2016-05-16 Thread Howard Pritchard
Hi Folks,

The last known blocker for 2.0.0 will hopefully be resolved this week,
which means it's time to be filling in the users' migration guide.

If you have a feature that went into the 2.0.x release stream that's
important, please add a short description of the feature to the
migration guide.

The wiki page format of the guide is at

https://
github.com/open-mpi/ompi/wiki/User-Migration-Guide%3A-1.8.x-and-v1.10.x-to-v2.0.0

We'll discuss this at the devel telecon tomorrow (5/17).

Thanks,

Howard


Re: [OMPI devel] Open MPI v2.0.0rc2

2016-04-30 Thread Howard Pritchard
Hi Jeff,

Let's just update the MPI_THREAD_MULTIPLE comment to say that
enable-mpi-thread-multiple is still required as part of config.
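
(i.e., something along the lines of

./configure --enable-mpi-thread-multiple ...

assuming that option name hasn't changed on the 2.x branch.)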

Howard

2016-04-29 22:20 GMT-06:00 Orion Poplawski :

> On 04/28/2016 05:01 PM, Jeff Squyres (jsquyres) wrote:
>
>> At long last, here's the next v2.0.0 release candidate: 2.0.0rc2:
>>
>>  https://www.open-mpi.org/software/ompi/v2.x/
>>
>> We didn't keep a good list of all the things that have changed since rc1
>> -- but it's many things.  Here's a link to the NEWS file for v2.0.0:
>>
>>  https://github.com/open-mpi/ompi-release/blob/v2.x/NEWS
>>
>> Please test test test!
>>
>>
> I see that --enable-mpi-cxx appears to default to disabled now, but I
> don't see any mention of it in the NEWS.
>
>
> https://github.com/open-mpi/ompi-release/commit/84f1e14b17dcc467e315038596535d8c7717c809
>
> I suspect I'll keep this enabled in the Fedora openmpi builds just
> because.  But I could be persuaded otherwise.
>
> Also, I see mention of improved MPI_THREAD_MULTIPLE support but that it
> still defaults to disabled, so I assume it probably should still be
> disabled in the basic fedora package.
>
> Also filed https://github.com/open-mpi/ompi/issues/1609 for failing to
> find system pmix library.
>
> I think that's it for me so far.
>
> --
> Orion Poplawski
> Technical Manager 303-415-9701 x222
> NWRA/CoRA DivisionFAX: 303-415-9702
> 3380 Mitchell Lane  or...@cora.nwra.com
> Boulder, CO 80301  http://www.cora.nwra.com
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18859.php
>


Re: [OMPI devel] 2.0.0 is coming: what do we need to communicate to users?

2016-04-29 Thread Howard Pritchard
Hi Jeff,

checkpoint/restart is not supported in this release.

Does this release work with totalview?  I recall we had some problems,
and do not remember if they were resolved.

We may also want to clarify if any PML/MTLs are experimental in this
release.

MPI_THREAD_MULTIPLE support.


Howard


2016-04-29 10:34 GMT-06:00 Cabral, Matias A :

> How about for “developers that have not been following the transition from
> 1.x to 2.0?  Particularly myself J. I started contributing to some
> specific parts (psm2 mtl) and following changes. However, I don’t have
> details of what is changing in 2.0. I see there could be different level of
> details in the “developer’s transition guide book”, ranging from
> architectural change to what pieces were moved where.
>
>
>
> Thanks,
>
>
>
> _MAC
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Joshua
> Ladd
> *Sent:* Friday, April 29, 2016 7:11 AM
> *To:* Open MPI Developers 
> *Subject:* Re: [OMPI devel] 2.0.0 is coming: what do we need to
> communicate to users?
>
>
>
> Certainly we need to communicate / advertise / evangelize the improvements
> in job launch - the largest and most substantial change between the two
> branches - and provide some best practice guidelines for usage (use direct
> modex for applications with sparse communication patterns and full modex
> for dense.) I would be happy to contribute some paragraphs.
>
>
>
> Obviously, we also need to communicate, reiterate the need to recompile
> codes built against the 1.10 series.
>
>
>
> Best,
>
>
>
> Josh
>
>
>
>
>
> On Thursday, April 28, 2016, Jeff Squyres (jsquyres) 
> wrote:
>
> We're getting darn close to v2.0.0.
>
> What "gotchas" do we need to communicate to users?  I.e., what will people
> upgrading from v1.8.x/v1.10.x be surprised by?
>
> The most obvious one I can think of is mpirun requiring -np when slots are
> not specified somehow.
>
> What else do we need to communicate?  It would be nice to avoid the
> confusion users experienced regarding affinity functionality/options when
> upgrading from v1.6 -> v1.8 (because we didn't communicate these changes
> well, IMHO).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18832.php
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18843.php
>


Re: [OMPI devel] PSM2 Intel folks question

2016-04-21 Thread Howard Pritchard
Hi Matias,

I updated issue 1559 with the info requested.
It might be simpler to just switch over to using the issue
for tracking this conversation?

I don't want to be posting emails with big attachments on this
list.

Thanks,

Howard


2016-04-20 19:21 GMT-06:00 Cabral, Matias A :

> Hi Howard,
>
>
>
> I’ve been playing with the same version of psm (
> hfi1-psm-0.7-221.ch6.x86_64) but cannot yet reproduce the issue.  Just in
> case, please share the version of the driver you have installed
> (hfi1-X.XX-XX.x86_64.rpm, modinfo hfi1).
>
>
>
> What I can tell so far, is that I still suspect this has  some relation to
> the job_id, that OMPI uses to generate the unique job key, that psm uses to
> generate the epid. By looking at the logfile.busted, I see some entries for
> ‘epid 1’. This can only happen if psm2_ep_open() is called with a
> unique job key of 1 and having the PSM2 hfi device disabled (only shm
> communication expected). In your workaround (hfi enabled) the epid
> generation goes through a different path that includes the HFI LID which
> ends with different number.  HOWEVER, I hardcoded the above (to get epid
> 1) case but I still see the hello_c running with stock OMPI 1.10.2.
>
>
> Would you please try forcing different jobid and share the results?
>
>
>
> Thanks,
>
>
>
> _MAC
>
>
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Wednesday, April 20, 2016 8:49 AM
>
> *To:* Open MPI Developers 
> *Subject:* Re: [OMPI devel] PSM2 Intel folks question
>
>
>
> HI Matias,
>
>
>
> Actually I found the problem.  I kept wondering why the OFI MTL works
> fine, but the
>
> PSM2 MTL doesn't.  When I cranked up the debugging level I noticed that
> for OFI MTL,
>
> it doesn't mess with the PSM2_DEVICES env variable.  So the PSM2 tries all
> three
>
> "devices" as part of initialization.  However, the PSM2 MTL sets the
> PSM2_DEVICES
>
> to not include hfi.  If I comment out those lines of code in the PSM2 MTL,
> my one-node
>
> problem vanishes.
>
>
>
> I suspect there's some setup code when "initializing" the hfi device that
> is actually
>
> required even when using the shm device for on-node messages.
>
>
>
> Is there by an chance some psm2 device driver parameter setting that might
>
> result in this behavior.
>
>
>
> Anyway, I set PSM2_TRACEMASK to 0x and got a bunch of output that
>
> might be helpful.  I attached the log files to issue 1559.
>
>
>
> For now, I will open a PR with fixes to get the PSM2 MTL working on our
>
> omnipath clusters.
>
>
>
> I don't think this problem has anything to do with SLURM except for the
> jobid
>
> manipulation to generate the unique key.
>
>
>
> Howard
>
>
>
>
>
> 2016-04-19 17:18 GMT-06:00 Cabral, Matias A :
>
> Howard,
>
>
>
> PSM2_DEVICES, I went back to the roots and found that shm is the only
> device supporting communication between ranks in the same node. Therefore,
> the below error “Endpoint could not be reached” would be expected.
>
>
>
> Back to the psm2_ep_connect() hanging, I cloned the same psm2 as you have
> from github and have hello_c and ring_c running with 80 ranks on a local
> node using PSM2 mtl. I do not have any SLURM setup on my system.  I will
> proceed to setup SLURM to see if I can reproduce the issue with it. In the
> meantime please share any extra detail you find relevant.
>
>
>
> Thanks,
>
>
>
> _MAC
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Tuesday, April 19, 2016 12:21 PM
> *To:* Open MPI Developers 
> *Subject:* Re: [OMPI devel] PSM2 Intel folks question
>
>
>
> Hi Matias,
>
>
>
> My usual favorites in ompi/examples/hello_c.c and ompi/examples/ring_c.c.
>
> If I disable the shared memory device using the PSM2_DEVICES option
>
> it looks like psm2 is unhappy:
>
>
>
>
>
> kit001.localdomain:08222] PSM2 EP connect error (Endpoint could not be
> reached):
>
> [kit001.localdomain:08222]  kit001
>
> [kit001.localdomain:08222] PSM2 EP connect error (unknown connect error):
>
> [kit001.localdomain:08222]  kit001
>
>  psm2_ep_connect returned 41
>
> [kit001.localdomain:08221] PSM2 EP connect error (unknown connect error):
>
> [kit001.localdomain:08221]  kit001
>
> [kit001.localdomain:08221] PSM2 EP connect error (Endpoint could not be
> reached):
>
> [kit001.localdomain:08221]  kit001
>
> leaving ompi_mtl_psm2_add_procs nprocs 2
>
>
>
> I we

Re: [OMPI devel] Common symbol warnings in tarballs (was: make install warns about 'common symbols')

2016-04-21 Thread Howard Pritchard
On Wednesday, April 20, 2016, Paul Hargrove  wrote:

> Not sure if Howard wants the check to be OFF by default in tarballs, or
> absent completely.
>
>
I meant the former.



> I test almost exclusively from RC tarballs, and have access to many
> uncommon platforms.
> So, if you think it is useful for my testing to help look for these
> warnings, then there should be some way to enable it from a tarball build.
> That could be a configure option, or even something as obscure as "mkdir
> .git".
>
> Yet another option is to default the check ON in all RC tarballs, but OFF
> in the release tarballs.
>
> Personally, the only thing I feel strongly about is not producing
> developer-oriented warnings for the end-user who uses the normal configure
> options.
>
> -Paul
>
> On Wed, Apr 20, 2016 at 2:44 PM, Howard Pritchard  > wrote:
>
>> I also think this symbol checker should not be in the tarball.
>>
>> Howard
>>
>>
>> 2016-04-20 13:08 GMT-06:00 Jeff Squyres (jsquyres) > >:
>>
>>> On Apr 20, 2016, at 2:08 PM, dpchoudh . >> > wrote:
>>> >
>>> > Just to clarify, I was doing a build (after adding code to support a
>>> new transport) from code pulled from git (a 'git clone') when I came across
>>> this warning, so I suppose this would be a 'developer build'.
>>>
>>> No worries.  I only brought it up because this is currently on master
>>> (and not v2.x), but it will eventually end up in a release branch -- even
>>> if it's v3.0.0.  So it's something we'd want figure out before it hits the
>>> release branch.
>>>
>>> > I know I am not a real MPI developer (I am doing OMPI internal
>>> development for the second time in my whole career), but if my vote counts,
>>> I'd vote for leaving the warning in.
>>>
>>> I don't know why you keep pretending that you're not an OMPI developer.
>>> :-)
>>>
>>> You're developing a BTL and asking all kinds of good questions about the
>>> code, and that's good enough for all of us.
>>>
>>> > It, in my opinion, encourages good coding practice, that should matter
>>> to everyone, not just 'core developers'. However, I agree that the phrasing
>>> of the warning is confusing, and adding a URL there to an appropriate page
>>> should be enough to prevent future questions like this in the support forum.
>>>
>>> FWIW: I think I agree with Ralph on this one.  Yes, we should make those
>>> common symbols zero.  But a user seeing this warning will likely be
>>> concerned, and there's nothing they can do about it.  So I think it should
>>> be a "developer only" kind of warning.
>>>
>>> My $0.02.
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com 
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org 
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2016/04/18797.php
>>>
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2016/04/18798.php
>>
>
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>


Re: [OMPI devel] Common symbol warnings in tarballs (was: make install warns about 'common symbols')

2016-04-20 Thread Howard Pritchard
I also think this symbol checker should not be in the tarball.

Howard


2016-04-20 13:08 GMT-06:00 Jeff Squyres (jsquyres) :

> On Apr 20, 2016, at 2:08 PM, dpchoudh .  wrote:
> >
> > Just to clarify, I was doing a build (after adding code to support a new
> transport) from code pulled from git (a 'git clone') when I came across
> this warning, so I suppose this would be a 'developer build'.
>
> No worries.  I only brought it up because this is currently on master (and
> not v2.x), but it will eventually end up in a release branch -- even if
> it's v3.0.0.  So it's something we'd want figure out before it hits the
> release branch.
>
> > I know I am not a real MPI developer (I am doing OMPI internal
> development for the second time in my whole career), but if my vote counts,
> I'd vote for leaving the warning in.
>
> I don't know why you keep pretending that you're not an OMPI developer.
> :-)
>
> You're developing a BTL and asking all kinds of good questions about the
> code, and that's good enough for all of us.
>
> > It, in my opinion, encourages good coding practice, that should matter
> to everyone, not just 'core developers'. However, I agree that the phrasing
> of the warning is confusing, and adding a URL there to an appropriate page
> should be enough to prevent future questions like this in the support forum.
>
> FWIW: I think I agree with Ralph on this one.  Yes, we should make those
> common symbols zero.  But a user seeing this warning will likely be
> concerned, and there's nothing they can do about it.  So I think it should
> be a "developer only" kind of warning.
>
> My $0.02.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18797.php
>


Re: [OMPI devel] PSM2 Intel folks question

2016-04-20 Thread Howard Pritchard
HI Matias,

Actually I found the problem.  I kept wondering why the OFI MTL works fine,
but the
PSM2 MTL doesn't.  When I cranked up the debugging level I noticed that
the OFI MTL doesn't mess with the PSM2_DEVICES env variable, so the PSM2
library tries all three "devices" as part of initialization.  The PSM2
MTL, however, sets PSM2_DEVICES to exclude hfi.  If I comment out those
lines of code in the PSM2 MTL, my one-node problem vanishes.
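
(A quicker experiment than editing the MTL would be to push the full device
list through mpirun, e.g.

mpirun -np 2 -x PSM2_DEVICES=self,shm,hfi --mca pml cm --mca mtl psm2 ./ring_c

though that assumes the MTL only sets PSM2_DEVICES when it isn't already in
the environment, which I have not verified.)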

I suspect there's some setup code when "initializing" the hfi device that
is actually
required even when using the shm device for on-node messages.

Is there by any chance some psm2 device driver parameter setting that might
result in this behavior?

Anyway, I set PSM2_TRACEMASK to 0x and got a bunch of output that
might be helpful.  I attached the log files to issue 1559.

For now, I will open a PR with fixes to get the PSM2 MTL working on our
omnipath clusters.

I don't think this problem has anything to do with SLURM except for the
jobid
manipulation to generate the unique key.

Howard


2016-04-19 17:18 GMT-06:00 Cabral, Matias A :

> Howard,
>
>
>
> PSM2_DEVICES, I went back to the roots and found that shm is the only
> device supporting communication between ranks in the same node. Therefore,
> the below error “Endpoint could not be reached” would be expected.
>
>
>
> Back to the psm2_ep_connect() hanging, I cloned the same psm2 as you have
> from github and have hello_c and ring_c running with 80 ranks on a local
> node using PSM2 mtl. I do not have any SLURM setup on my system.  I will
> proceed to setup SLURM to see if I can reproduce the issue with it. In the
> meantime please share any extra detail you find relevant.
>
>
>
> Thanks,
>
>
>
> _MAC
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Tuesday, April 19, 2016 12:21 PM
> *To:* Open MPI Developers 
> *Subject:* Re: [OMPI devel] PSM2 Intel folks question
>
>
>
> Hi Matias,
>
>
>
> My usual favorites in ompi/examples/hello_c.c and ompi/examples/ring_c.c.
>
> If I disable the shared memory device using the PSM2_DEVICES option
>
> it looks like psm2 is unhappy:
>
>
>
>
>
> kit001.localdomain:08222] PSM2 EP connect error (Endpoint could not be
> reached):
>
> [kit001.localdomain:08222]  kit001
>
> [kit001.localdomain:08222] PSM2 EP connect error (unknown connect error):
>
> [kit001.localdomain:08222]  kit001
>
>  psm2_ep_connect returned 41
>
> [kit001.localdomain:08221] PSM2 EP connect error (unknown connect error):
>
> [kit001.localdomain:08221]  kit001
>
> [kit001.localdomain:08221] PSM2 EP connect error (Endpoint could not be
> reached):
>
> [kit001.localdomain:08221]  kit001
>
> leaving ompi_mtl_psm2_add_procs nprocs 2
>
>
>
> I went back and tried again with the OFI MTL (without the PSM2_DEVICES set)
> and that works correctly on a single node.
>
> I get this same psm2_ep_connect timeout using mpirun, so its not a SLURM
> specific problem.
>
>
>
> 2016-04-19 12:25 GMT-06:00 Cabral, Matias A :
>
> Hi Howard,
>
>
>
> Couple more questions to understand a little better the context:
>
> -  What type of job running?
>
> -  Is this also under srun?
>
>
>
> For PSM2 you may find more details in the programmer’s guide:
>
>
> http://www.intel.com/content/dam/support/us/en/documents/network/omni-adptr/sb/Intel_PSM2_PG_H76473_v1_0.pdf
>
>
>
> To disable shared memory:
>
> Section 2.7.1:
>
> PSM2_DEVICES="self,fi"
>
>
>
> Thanks,
>
> _MAC
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Tuesday, April 19, 2016 11:04 AM
> *To:* Open MPI Developers List 
> *Subject:* [OMPI devel] PSM2 Intel folks question
>
>
>
> Hi Folks,
>
>
>
> I'm making progress with issue #1559 (patches on the mail list didn't
> help),
>
> and I'll open a PR to help the PSM2 MTL work on a single node, but I'm
>
> noticing something more troublesome.
>
>
>
> If I run on just one node, and I use more than one process, process zero
>
> consistently hangs in psm2_ep_connect.
>
>
>
> I've tried using the psm2 code on github - at sha e951cf31, but I still see
>
> the same behavior.
>
>
>
> The PSM2 related rpms installed on our system are:
>
>
>
> infinipath-*psm*-devel-3.3-0.g6f42cdb1bb8.2.el7.x86_64
>
> hfi1-*psm*-0.7-221.ch6.x86_64
>
> hfi1-*psm*-devel-0.7-221.ch6.x86_64
>
> infinipath-*psm*-3.3-0.g6f42cdb1bb8.2.el7.x86_64
>
> should we get newer rpms installed?
>
&

Re: [OMPI devel] PSM2 Intel folks question

2016-04-19 Thread Howard Pritchard
Hi Matias,

My usual favorites are ompi/examples/hello_c.c and ompi/examples/ring_c.c.
If I disable the shared memory device using the PSM2_DEVICES option
it looks like psm2 is unhappy:


kit001.localdomain:08222] PSM2 EP connect error (Endpoint could not be
reached):

[kit001.localdomain:08222]  kit001

[kit001.localdomain:08222] PSM2 EP connect error (unknown connect error):

[kit001.localdomain:08222]  kit001

 psm2_ep_connect returned 41

[kit001.localdomain:08221] PSM2 EP connect error (unknown connect error):

[kit001.localdomain:08221]  kit001

[kit001.localdomain:08221] PSM2 EP connect error (Endpoint could not be
reached):

[kit001.localdomain:08221]  kit001

leaving ompi_mtl_psm2_add_procs nprocs 2


I went back and tried again with the OFI MTL (without the PSM2_DEVICES set)
and that works correctly on a single node.

I get this same psm2_ep_connect timeout using mpirun, so it's not a
SLURM-specific problem.

2016-04-19 12:25 GMT-06:00 Cabral, Matias A :

> Hi Howard,
>
>
>
> Couple more questions to understand a little better the context:
>
> -  What type of job running?
>
> -  Is this also under srun?
>
>
>
> For PSM2 you may find more details in the programmer’s guide:
>
>
> http://www.intel.com/content/dam/support/us/en/documents/network/omni-adptr/sb/Intel_PSM2_PG_H76473_v1_0.pdf
>
>
>
> To disable shared memory:
>
> Section 2.7.1:
>
> PSM2_DEVICES="self,fi"
>
>
>
> Thanks,
>
> _MAC
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Tuesday, April 19, 2016 11:04 AM
> *To:* Open MPI Developers List 
> *Subject:* [OMPI devel] PSM2 Intel folks question
>
>
>
> Hi Folks,
>
>
>
> I'm making progress with issue #1559 (patches on the mail list didn't
> help),
>
> and I'll open a PR to help the PSM2 MTL work on a single node, but I'm
>
> noticing something more troublesome.
>
>
>
> If I run on just one node, and I use more than one process, process zero
>
> consistently hangs in psm2_ep_connect.
>
>
>
> I've tried using the psm2 code on github - at sha e951cf31, but I still see
>
> the same behavior.
>
>
>
> The PSM2 related rpms installed on our system are:
>
>
>
> infinipath-*psm*-devel-3.3-0.g6f42cdb1bb8.2.el7.x86_64
>
> hfi1-*psm*-0.7-221.ch6.x86_64
>
> hfi1-*psm*-devel-0.7-221.ch6.x86_64
>
> infinipath-*psm*-3.3-0.g6f42cdb1bb8.2.el7.x86_64
>
> should we get newer rpms installed?
>
>
>
> Is there a way to disable the AMSHM path?  I'm wondering if that
>
> would help since multi-node jobs seems to run fine.
>
>
>
> Thanks for any help,
>
>
>
> Howard
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18783.php
>


[OMPI devel] PSM2 Intel folks question

2016-04-19 Thread Howard Pritchard
Hi Folks,

I'm making progress with issue #1559 (patches on the mail list didn't help),
and I'll open a PR to help the PSM2 MTL work on a single node, but I'm
noticing something more troublesome.

If I run on just one node, and I use more than one process, process zero
consistently hangs in psm2_ep_connect.

I've tried using the psm2 code on github - at sha e951cf31, but I still see
the same behavior.

The PSM2 related rpms installed on our system are:

infinipath-*psm*-devel-3.3-0.g6f42cdb1bb8.2.el7.x86_64

hfi1-*psm*-0.7-221.ch6.x86_64

hfi1-*psm*-devel-0.7-221.ch6.x86_64

infinipath-*psm*-3.3-0.g6f42cdb1bb8.2.el7.x86_64
Should we get newer rpms installed?

Is there a way to disable the AMSHM path?  I'm wondering if that
would help since multi-node jobs seem to run fine.

Thanks for any help,

Howard


Re: [OMPI devel] psm2 and psm2_ep_open problems

2016-04-18 Thread Howard Pritchard
Please point me to the patch.

--

sent from my smart phonr so no good type.

Howard
On Apr 15, 2016 1:04 PM, "Ralph Castain"  wrote:

> I have a patch that I think will resolve this problem - would you please
> take a look?
>
> Ralph
>
>
>
> On Apr 15, 2016, at 7:32 AM, Ralph Castain  wrote:
>
> Actually, it did come across the developer list :-)
>
> Why don’t I resolve this by just ensuring that the key we create is
> properly filled? It’s a trivial fix in the PMI ess component
>
>
> On Apr 15, 2016, at 7:26 AM, Howard Pritchard  wrote:
>
> I didn't copy dev on this.
>
>
>
> -- Forwarded message --
> From: *Howard Pritchard* 
> Date: Thursday, 14 April 2016
> Subject: psm2 and psm2_ep_open problems
> To: Open MPI Developers 
>
>
> Hi Matias
>
> Actually I triaged this further.  Open mpi PMI subsystem is actually doing
> things correctly wrt env variable setting with or without mpi run.  The
> problem has to do with a psm2  and the fact that on my cluster right now
> SLURM has only scheduled about 25 jobs.  This results in the unique key
> PSM2 Mtl is feeding to PSM2 has lots of zeros inthe initial part of the
> key.  This ends up messing up the epid generated in PSM2.  OFI MTL doesn't
> have this problem because the PSM2 provider has some of these LSBs set in
> the value it passes to PSM2.
>
> I will open a PR to "fix" the PSM2MTL to handle this feature of PSM2.
>
> Howard
>
> On Thursday, 14 April 2016, Cabral, Matias A wrote:
>
>> Hi Howard,
>>
>>
>>
>> I suspect this is the known issue that when using SLURM with OMPI and PSM
>> that is discussed here:
>>
>> https://www.open-mpi.org/community/lists/users/2010/12/15220.php
>>
>>
>>
>> As per today, orte generates the psm_key, so when using SLURM this does
>> not happen and is necessary to set it in the environment.  Here Ralph
>> explains the workaround:
>>
>> https://www.open-mpi.org/community/lists/users/2010/12/15242.php
>>
>>
>>
>> As you found, epid of 0 is not a valid value. So, basing comments on:
>>
>> https://github.com/01org/opa-psm2/blob/master/psm_ep.c
>>
>>
>>
>> the assert of line 832. psmi_ep_open_device()  will do :
>>
>>
>>
>> /*
>>  * We use a LID of 0 for non-HFI communication.
>>  * Since a jobkey is not available from IPS, pull the
>>  * first 16 bits from the UUID.
>>  */
>>
>> *epid = PSMI_EPID_PACK(((uint16_t *) unique_job_key)[0],
>>                        (rank >> 3), rank, 0,
>>                        PSMI_HFI_TYPE_DEFAULT, rank);
>>
>>  In the particular case you mention below, when there is no HFI (shared
>> memory), rank 0 and the passed key is 0, epid will be 0.
>>
>>
>>
>> SOLUTION: set
>>
>> Set in the environment OMPI_MCA_orte_precondition_transports with a value
>> different than 0.
>>
>>
>>
>> Thanks,
>>
>>
>>
>> _MAC
>>
>>
>>
>> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Howard
>> Pritchard
>> *Sent:* Thursday, April 14, 2016 1:10 PM
>> *To:* Open MPI Developers List 
>> *Subject:* [OMPI devel] psm2 and psm2_ep_open problems
>>
>>
>>
>> Hi Folks,
>>
>>
>>
>> So we have this brand-new omnipath cluster here at work,
>>
>> but people are having problem using it on a single node using
>>
>> srun as the job launcher.
>>
>>
>>
>> The customer wants to use srun to launch jobs not the open mpi
>>
>> mpirun.
>>
>>
>>
>> The customer installed 1.10.1, but I can reproduce the
>>
>> problem with v2.x and I'm sure with master, unless I build the
>>
>> ofi mtl.  ofi mtl works, psm2 mtl doesn't.
>>
>>
>>
>> I downloaded the psm2 code from github and started hacking.
>>
>>
>>
>> What appears to be the problem is that when running on a single
>>
>> node one can go through a path in psmi_ep_open_device where
>>
>> for a single process job, the value stored into epid is zero.
>>
>&g

[OMPI devel] Fwd: psm2 and psm2_ep_open problems

2016-04-15 Thread Howard Pritchard
I didn't copy dev on this.



-- Forwarded message --
From: *Howard Pritchard* 
Date: Thursday, 14 April 2016
Subject: psm2 and psm2_ep_open problems
To: Open MPI Developers 


Hi Matias

Actually, I triaged this further.  The Open MPI PMI subsystem is doing
things correctly wrt env variable setting with or without mpirun.  The
problem has to do with a PSM2 feature and the fact that on my cluster
right now SLURM has only scheduled about 25 jobs.  This results in the
unique key the PSM2 MTL is feeding to PSM2 having lots of zeros in the
initial part of the key.  This ends up messing up the epid generated in
PSM2.  The OFI MTL doesn't have this problem because the PSM2 provider
has some of these LSBs set in the value it passes to PSM2.

I will open a PR to "fix" the PSM2 MTL to handle this feature of PSM2.

Howard

On Thursday, 14 April 2016, Cabral, Matias A wrote:

> Hi Howard,
>
>
>
> I suspect this is the known issue that when using SLURM with OMPI and PSM
> that is discussed here:
>
> https://www.open-mpi.org/community/lists/users/2010/12/15220.php
>
>
>
> As per today, orte generates the psm_key, so when using SLURM this does
> not happen and is necessary to set it in the environment.  Here Ralph
> explains the workaround:
>
> https://www.open-mpi.org/community/lists/users/2010/12/15242.php
>
>
>
> As you found, epid of 0 is not a valid value. So, basing comments on:
>
> https://github.com/01org/opa-psm2/blob/master/psm_ep.c
>
>
>
> the assert of line 832. psmi_ep_open_device()  will do :
>
>
>
> /*
>  * We use a LID of 0 for non-HFI communication.
>  * Since a jobkey is not available from IPS, pull the
>  * first 16 bits from the UUID.
>  */
>
> *epid = PSMI_EPID_PACK(((uint16_t *) unique_job_key)[0],
>                        (rank >> 3), rank, 0,
>                        PSMI_HFI_TYPE_DEFAULT, rank);
>
>  In the particular case you mention below, when there is no HFI (shared
> memory), rank 0 and the passed key is 0, epid will be 0.
>
>
>
> SOLUTION: set
>
> Set in the environment OMPI_MCA_orte_precondition_transports with a value
> different than 0.
>
>
>
> Thanks,
>
>
>
> _MAC
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Thursday, April 14, 2016 1:10 PM
> *To:* Open MPI Developers List 
> *Subject:* [OMPI devel] psm2 and psm2_ep_open problems
>
>
>
> Hi Folks,
>
>
>
> So we have this brand-new omnipath cluster here at work,
>
> but people are having problem using it on a single node using
>
> srun as the job launcher.
>
>
>
> The customer wants to use srun to launch jobs not the open mpi
>
> mpirun.
>
>
>
> The customer installed 1.10.1, but I can reproduce the
>
> problem with v2.x and I'm sure with master, unless I build the
>
> ofi mtl.  ofi mtl works, psm2 mtl doesn't.
>
>
>
> I downloaded the psm2 code from github and started hacking.
>
>
>
> What appears to be the problem is that when running on a single
>
> node one can go through a path in psmi_ep_open_device where
>
> for a single process job, the value stored into epid is zero.
>
>
>
> This results in an assert failing in the __psm2_ep_open_internal
>
> function.
>
>
>
> Is there a quick and dirty workaround that doesn't involve fixing
>
> psm2 MTL?  I could suggest to the sysadmins to install libfabric 1.3
>
> and build the openmpi to only have ofi mtl, but perhaps there's
>
> another way to get psm2 mtl to work for single node jobs?  I'd prefer
>
> to not ask users to disable psm2 mtl explicitly for their single node jobs.
>
>
>
> Thanks for suggestions.
>
>
>
> Howard
>
>
>
>
>
>
>


Re: [OMPI devel] psm2 and psm2_ep_open problems

2016-04-14 Thread Howard Pritchard
Hi Matias

Actually, I triaged this further.  The Open MPI PMI subsystem is doing
things correctly wrt env variable setting with or without mpirun.  The
problem has to do with a PSM2 feature and the fact that on my cluster
right now SLURM has only scheduled about 25 jobs.  This results in the
unique key the PSM2 MTL is feeding to PSM2 having lots of zeros in the
initial part of the key.  This ends up messing up the epid generated in
PSM2.  The OFI MTL doesn't have this problem because the PSM2 provider
has some of these LSBs set in the value it passes to PSM2.

I will open a PR to "fix" the PSM2 MTL to handle this feature of PSM2.
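
In the meantime, a rough sketch of the environment workaround Matias points
to below (the key value is only an example; what matters is that its leading
bits are nonzero):

  # give the job a key whose leading bits are nonzero before launching
  $ export OMPI_MCA_orte_precondition_transports=0123456789abcdef-fedcba9876543210
  $ srun -n 2 ./ring_c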

Howard

On Thursday, 14 April 2016, Cabral, Matias A wrote:

> Hi Howard,
>
>
>
> I suspect this is the known issue that when using SLURM with OMPI and PSM
> that is discussed here:
>
> https://www.open-mpi.org/community/lists/users/2010/12/15220.php
>
>
>
> As per today, orte generates the psm_key, so when using SLURM this does
> not happen and is necessary to set it in the environment.  Here Ralph
> explains the workaround:
>
> https://www.open-mpi.org/community/lists/users/2010/12/15242.php
>
>
>
> As you found, epid of 0 is not a valid value. So, basing comments on:
>
> https://github.com/01org/opa-psm2/blob/master/psm_ep.c
>
>
>
> the assert of line 832. psmi_ep_open_device()  will do :
>
>
>
> /*
>  * We use a LID of 0 for non-HFI communication.
>  * Since a jobkey is not available from IPS, pull the
>  * first 16 bits from the UUID.
>  */
>
> *epid = PSMI_EPID_PACK(((uint16_t *) unique_job_key)[0],
>                        (rank >> 3), rank, 0,
>                        PSMI_HFI_TYPE_DEFAULT, rank);
>
>  In the particular case you mention below, when there is no HFI (shared
> memory), rank 0 and the passed key is 0, epid will be 0.
>
>
>
> SOLUTION: set
>
> Set in the environment OMPI_MCA_orte_precondition_transports with a value
> different than 0.
>
>
>
> Thanks,
>
>
>
> _MAC
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org
> ] *On Behalf
> Of *Howard Pritchard
> *Sent:* Thursday, April 14, 2016 1:10 PM
> *To:* Open MPI Developers List  >
> *Subject:* [OMPI devel] psm2 and psm2_ep_open problems
>
>
>
> Hi Folks,
>
>
>
> So we have this brand-new omnipath cluster here at work,
>
> but people are having problem using it on a single node using
>
> srun as the job launcher.
>
>
>
> The customer wants to use srun to launch jobs not the open mpi
>
> mpirun.
>
>
>
> The customer installed 1.10.1, but I can reproduce the
>
> problem with v2.x and I'm sure with master, unless I build the
>
> ofi mtl.  ofi mtl works, psm2 mtl doesn't.
>
>
>
> I downloaded the psm2 code from github and started hacking.
>
>
>
> What appears to be the problem is that when running on a single
>
> node one can go through a path in psmi_ep_open_device where
>
> for a single process job, the value stored into epid is zero.
>
>
>
> This results in an assert failing in the __psm2_ep_open_internal
>
> function.
>
>
>
> Is there a quick and dirty workaround that doesn't involve fixing
>
> psm2 MTL?  I could suggest to the sysadmins to install libfabric 1.3
>
> and build the openmpi to only have ofi mtl, but perhaps there's
>
> another way to get psm2 mtl to work for single node jobs?  I'd prefer
>
> to not ask users to disable psm2 mtl explicitly for their single node jobs.
>
>
>
> Thanks for suggestions.
>
>
>
> Howard
>
>
>
>
>
>
>


[OMPI devel] psm2 and psm2_ep_open problems

2016-04-14 Thread Howard Pritchard
Hi Folks,

So we have this brand-new omnipath cluster here at work,
but people are having problems using it on a single node with
srun as the job launcher.

The customer wants to use srun to launch jobs, not the Open MPI
mpirun.

The customer installed 1.10.1, but I can reproduce the
problem with v2.x and I'm sure with master, unless I build the
ofi mtl.  The ofi mtl works, the psm2 mtl doesn't.

I downloaded the psm2 code from github and started hacking.

What appears to be the problem is that when running on a single
node one can go through a path in psmi_ep_open_device where
for a single process job, the value stored into epid is zero.

This results in an assert failing in the __psm2_ep_open_internal
function.

Is there a quick and dirty workaround that doesn't involve fixing
psm2 MTL?  I could suggest to the sysadmins to install libfabric 1.3
and build the openmpi to only have ofi mtl, but perhaps there's
another way to get psm2 mtl to work for single node jobs?  I'd prefer
to not ask users to disable psm2 mtl explicitly for their single node jobs.

Thanks for suggestions.

Howard


Re: [OMPI devel] RFC: RML change to multi-select

2016-03-17 Thread Howard Pritchard
Okay, I'll bring this up at the workshop.  There's been talk, but no
one's working on it.

2016-03-17 8:20 GMT-06:00 Ralph Castain :

> We are also targeting RDM for now, but I agree that the two may diverge at
> some point, and so flexibility makes sense. Only wish that libfabric had a
> decent shared memory provider...
>
>
> On Mar 17, 2016, at 7:10 AM, Howard  wrote:
>
> I think that's a better approach. Not clear you'd want to use same EP type
> as BTL.  I'm going for RDM type for now for BTL.
>
> Howard
>
> Von meinem iPhone gesendet
>
> Am 16.03.2016 um 09:35 schrieb Ralph Castain :
>
> Interesting! Yeah, we debated about BTL or go direct to OFI. Finally opted
> for the latter as it seemed simpler than the BTL interface.
>
> On Mar 16, 2016, at 7:29 AM, Howard  wrote:
>
> Hi Ralph
>
> I dont know if it's relevant, but I'm working on an ofi BTL so we can use
> the OSC rdma.
>
> Howard
>
> Von meinem iPhone gesendet
>
> Am 15.03.2016 um 17:21 schrieb Ralph Castain :
>
> Hi folks
>
> We are working on integrating the RML with libfabric so we have access to
> both management Ethernet and fabric transports. A first step in enabling
> this is to convert the RML framework to multi-select of active components.
> The stub functions then scan the components in priority order until one can
> perform the requested action (e.g., send a buffer). This will allow us to
> simultaneously support both OFI and other components.
>
> While making this change, we also:
>
> * removed the qos framework - this functionality has been moved to another
> library that builds on top of the RML
>
> * removed the ftrm component - this was stale, and it wasn’t clear to us
> how it would change under the new architecture
>
> We will be adding the new OFI component in a separate PR. This just
> contains the change to a multi-select framework.
>
> The PR is here:  https://github.com/open-mpi/ompi/pull/1457
>
> Please feel free to comment and/or make suggestions
> Ralph
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/03/18699.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/03/18702.php
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/03/18703.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/03/18709.php
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/03/18710.php
>


[OMPI devel] trying to view ciso-community master results and mtt has issue

2016-02-18 Thread Howard Pritchard
Hi Folks,

I noticed the cisco-community MTT results are really red/pink today.
If I try to view some of the ibm test results, though, something goes
south with mtt and this is what gets posted back to my browser:

Fatal error: Allowed memory size of 67108864 bytes exhausted (tried to
allocate 71 bytes) in /nfs/data/osl/www/
mtt.open-mpi.org/reporter/dashboard.inc on line 271

So, I guess the first priority is: do we know what's happened
with cisco MTT?

Second, is this a known problem with the mtt reporter?
Is there a way to work around it?

Thanks,

Howard


Re: [OMPI devel] Trunk is broken

2016-02-17 Thread Howard Pritchard
Hi Folks,

Should we revert PR 1351 till there is a fix?

Howard


2016-02-17 11:34 GMT-07:00 Ralph Castain :

> FWIW: I wouldn’t have seen that because I don’t have IB on my system.
>
> On Feb 17, 2016, at 10:11 AM, Nysal Jan K A  wrote:
>
> So this seems to be still broken.
>
> mca_btl_openib.so: undefined symbol: opal_memory_linux_malloc_set_alignment
>
> I built with "--with-memory-manager=none"
>
> Regards
> --Nysal
>
> On Tue, Feb 16, 2016 at 10:19 AM, Ralph Castain  wrote:
>
>> It is very easy to reproduce - configure with:
>> enable_mem_debug=no
>> enable_mem_profile=no
>> enable_memchecker=no
>> with_memory_manager=no
>>
>> I’m not sure which of those is required. However, your assertion is
>> incorrect. The person who introduced the original violation went to great
>> lengths to ensure it didn’t create a problem if the referenced component
>> was not built. I’m not saying it was a good thing to do, but we spent a lot
>> of time discussing it and figuring out how to do it without causing the
>> problem.
>>
>> So whatever was done missed those precautions and introduced this symbol
>> regardless of the configuration.
>>
>>
>> On Feb 15, 2016, at 8:39 PM, Gilles Gouaillardet 
>> wrote:
>>
>> Ralph,
>>
>> this is being discussed at https://github.com/open-mpi/ompi/pull/1351
>>
>> btw, how do you get this warning ? i do not see it.
>> fwiw, the abstraction violation was kind of already here, so i am
>> surprised it pops up now only
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2/16/2016 1:17 PM, Ralph Castain wrote:
>>
>> Looks like someone broke the master build on Linux:
>>
>> ../../../ompi/.libs/libmpi.so: undefined reference to
>> `opal_memory_linux_malloc_init_hook'
>>
>>
>> I suspect it was a hard-coded reference to some component’s variable?
>> Ralph
>>
>>
>>
>> ___
>> devel mailing listde...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2016/02/18598.php
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2016/02/18599.php
>>
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2016/02/18600.php
>>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/02/18601.php
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/02/18602.php
>


[OMPI devel] MTT error?

2016-02-11 Thread Howard Pritchard
Hi Folks

When I go to

https://mtt.open-mpi.org/

and then click the summary button I get some kind of DNS lookup error.

Howard


Re: [OMPI devel] Porting the underlying fabric interface

2016-02-04 Thread Howard Pritchard
Hi Durga

As an alternative, you could implement a libfabric provider for your
network.  In theory, if you can implement the reliable datagram endpoint
type and a tag-matching mechanism for your network, you could then just use
the ofi mtl and not have to do much, if anything, in Open MPI or MPICH, etc.

https://github.com/ofiwg/libfabric
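
A quick way to sanity-check what libfabric already sees on a node, assuming
the fi_info utility that ships with libfabric is installed:

  # list the providers this libfabric build knows about
  $ fi_info -l
  # show only providers offering reliable-datagram (FI_EP_RDM) endpoints
  $ fi_info -t FI_EP_RDM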

You may also want to see if the open ucx tl model might work for your
network.  It may be less work than implementing a libfabric provider.

good luck

Howard

--

sent from my smart phonr so no good type.

Howard
On Feb 4, 2016 6:00 AM, "Jeff Squyres (jsquyres)" 
wrote:

> +1 on what Gilles said.  :-)
>
> Check out this part of the v1.10 README file:
>
> https://github.com/open-mpi/ompi-release/blob/v1.10/README#L585-L625
>
> Basically:
>
> - PML is the back-end to functions like MPI_Send and MPI_Recv.
> - The ob1 PML uses BTL plugins in a many-of-many relationship to
> potentially utilize multiple networks.
> - The cm PML uses matching-style network APIs in CM plugins to utilize a
> single underlying network.
> - The yalla PML was written by Mellanox as a replacement for cm and ob1,
> in that it directly utilizes the MXM network library without going through
> any of the abstractions in ob1 and cm.  It was written at a time when cm
> was not well optimized, and basically just added a latency penalty before
> dispatching to the underlying MTL module.  Since then, cm has been
> optimized such that its abstraction penalty before invoking the underlying
> MTL module is negligible.
>
> So the question really comes down to:
>
> - if you have a network stack API that does MPI-style matching, you should
> write an MTL.
> - if not, you should write a BTL
>
> Does that help?
>
>
> > On Feb 4, 2016, at 2:29 AM, Gilles Gouaillardet 
> wrote:
> >
> > Durga,
> >
> > did you confuse PML and MTL ?
> >
> > basically, a BTL (Byte Transport Layer ?) is used with "primitive"
> interconnects that can only send bytes.
> > (e.g. if you need to transmit a tagged message, it is up to you
> send/recv the tag and manually match the tag on the receiver side so you
> can put the message into the right place)
> > on the other hand, MTL (Message Transport Layer ?) can be used with more
> advanced interconnects, that can "natively" send/recv (tagged) messages.
> >
> > for example, with infiniband, you can use the openib BTL, or the mxm MTL
> > (note the openib BTL only requires the free ibverbs libraries
> > and mxm MTL requires proprietary extensions provided by mellanox)
> >
> > a good starting point is the video Jeff posted at
> https://www.open-mpi.org/video/?category=internals
> >
> > Cheers,
> >
> > Gilles
> >
> > On 2/4/2016 2:20 PM, dpchoudh . wrote:
> >> Hi developers
> >>
> >> I am trying to add support for a new (proprietary) RDMA capable fabric
> >> to OpenMPI and have the following question:
> >>
> >> As I understand, some networks are implemented as a PML framework and
> >> some are implemented as a BTL framework. It seems there is even
> >> overlap as Myrinet seems to exist in both.
> >>
> >> My question is: what is the difference between these two frameworks?
> >> When adding support for a new fabric, what factors one should consider
> >> when choosing between one type of framework over the other?
> >>
> >> And, with apologies for asking a summary question: is there any kind
> >> of documentation and/or book that explains all the internal details of
> >> the implementation (which looks little like voodoo to a newcomer like
> >> me)?
> >>
> >> Thanks for your help.
> >>
> >> Durga Choudhury
> >>
> >> Life is complex. It has real and imaginary parts.
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/02/18544.php
> >>
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/02/18545.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/02/18549.php
>


Re: [OMPI devel] tm-less tm module

2016-01-25 Thread Howard Pritchard
HI Folks,

I like Paul's suggestion for configury summary output a lot.  It would have
helped me when I was trying to deal with an oddball
one-off install of the moab/torque software on one of the non-standard
front ends at LANL.  The libfabric configury has
such a summary output at the end of configure and it makes it much simpler
(for a much smaller project) to check that
you're getting what you expected.

I still say updating the FAQ with something more precise is in order.  I'll
work on that.

Howard


2016-01-25 15:20 GMT-07:00 Paul Hargrove :

> Ralph,
>
> As a practical matter most users probably aren't going to know what to do
> with anything that scrolls off their screen.
> So I think dumping the ompi_info output as-is would be just "noise" to
> many folks.
> That is one reason I didn't just suggest doing exactly that
> (cross-compilation being another)
>
> However, a suitably summarized output might be just the right thing.
> Perhaps something compact along the lines of
> MCA foo: component1 component2 component2
>  MCA foobar: componentA componentB
>   ...
>Bindings: C C++ Java Fortan(mpif.h 'use mpi')
>
> If this could information be generated at the end of configure, rather
> than "make install", it could save folks some time spent compiling
> incorrectly configured builds.
>
>
> Another thing one might independently want to consider is having configure
> warn when the required libs are present for a component but the "can
> compile" test fails.
> This would, for instance, catch the situation when the "libfoo" packages
> is installed but the "libfoo-dev" package is not.
> This approach, however, may require non-trivial changes to how all the
> configure probes are performed since I don't believe this is something
> autoconf has existing support for (the AC_CHECK_LIB macro is effectively a
> check for the "libfoo-dev" package only).
>
>
> Just my $0.02USD, of course.
>
> -Paul
>
> On Mon, Jan 25, 2016 at 1:46 PM, Ralph Castain  wrote:
>
>> That makes sense, Paul - what if we output effectively the ompi_info
>> summary of what was built at the end of the make install procedure? Then
>> you would have immediate feedback on the result.
>>
>> On Mon, Jan 25, 2016 at 1:27 PM, Paul Hargrove 
>> wrote:
>>
>>> As one who builds other people's software frequently, I have my own
>>> opinions here.
>>>
>>> Above all else, is that there is no one "right" answer, but that
>>> consistency with in a product is best.
>>> So (within reason) the same things that work to configure module A and B
>>> should work with C and D as well.
>>> To use an analogy from (human) languages, I dislike "irregular verbs".
>>>
>>> The proposal to report (at run time) the existence of TM support on the
>>> system (but lacking in ORTE), doesn't "feel" consistent with existing
>>> practice.
>>> In GASNet we *do* report at runtime if a high-speed network is present
>>> and you are not using it.
>>> For instance we warn if the headers were missing at configure time but
>>> we can see the /dev entry at runtime.
>>> However, we do that uniformly across all the networks and have done this
>>> for years.
>>> So, it is a *consistent* practice in that project.
>>>
>>> Keep It Simple Stupid is also an important one.
>>> So, I agree with those who think the proposal to catch this at runtime
>>> is an unnecessary complication.
>>>
>>> I think improving the FAQ a good idea
>>>
>>> I do, however, I can think of one thing that might help the "I thought I
>>> had configured X" problem Jeff mentions.
>>> What about a summary output at the end of configure or make?
>>>
>>> Right now I sometimes use something like the following:
>>>   $ grep 'bindings\.\.\. yes' configure.out
>>>   $ grep -e 'component .* can compile\.\.\. yes' configure.log
>>> This lets me see what is going to be built.
>>> Outputing something like this a the end of configure might encourage
>>> admins to check for their feature X before typing "make"
>>> The existing configury goop can easily be modified to keep a list of
>>> configured components and language bindings.
>>>
>>> However, another alternative is probably easier to implement:
>>> The last step of "make install" could print a message like
>>>   NOTICE: Your installation is complete.
>>>   NOTICE: You can run ompi_info to verify that all expected components
>>> and language bindings have been built.
>>>
>>> -Paul
>>>
>>> On Mon, Jan 25, 2016 at 11:13 AM, Jeff Squyres (jsquyres) <
>>> jsquy...@cisco.com> wrote:
>>>
 Haters gotta hate.  ;-)

 Kidding aside, ok, you make valid points.  So -- no tm "addition".  We
 just have to rely on people using functionality like "--with-tm" in the
 configure line to force/ensure that tm (or whatever feature) will actually
 get built.


 > On Jan 25, 2016, at 1:31 PM, Ralph Castain  wrote:
 >
 > I think we would be opening a real can of worms with this idea. There
 are environments, for example, that use PBSPro for one part of the 

Re: [OMPI devel] tm-less tm module

2016-01-25 Thread Howard Pritchard
Hi Gilles

I would prefer improving the FAQ rather than adding yet more complexity in
this area.  The way things go, you would add this feature, then someone else
with a different use case would complain we had broken something for them.
Then we would add another MCA param to disable the new tm-less module, etc.

I think the FAQ should be more explicit about the configury options required
for the orte/resource manager integration features to work.
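
For concreteness, the sort of recipe such a FAQ entry could spell out (the
PBSPro/torque install prefix below is only an example and is site-specific):

  $ ./configure --with-tm=/opt/pbs ...
  $ make install
  # verify the tm components (plm/ras) actually got built
  $ ompi_info | grep tm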

Howard
--

sent from my smart phonr so no good type.

Howard
On Jan 24, 2016 5:17 PM, "Gilles Gouaillardet"  wrote:

> Folks,
>
> there was a question about mtt on the mtt mailing list
> http://www.open-mpi.org/community/lists/mtt-users/2016/01/0840.php
>
> after a few emails (some offline) it seems that was a configuration issue.
> the user is running PBSPro and it seems OpenMPI was not configured with
> the tm module
> (e.g. tm is not included in the default location, and he did not configure
> with --with-tm=/.../pbspro)
>
> in this case, the tm module is not built, and when a job runs under PBSPro
> without any hostfile,
> the job runs on one node only.
> in order to make this easier to diagnose, what about always building the
> tm module :
> - if tm is found by configury, build the OpenMPI tm modules as usual
> - if tm is not found by configury, build a dumb module that will issue a
> warning or abort
>   if a job is ran under PBS/torque
>   (e.g. some PBS specific environment variable are defined)
>
> of course, the spec of this "dumb" module can be improved, for example
> - add a MCA parameter to disable the warning
> - issue the warning only if there is more that one node in the job *and*
> no machinefile nor host list was passed to the mpirun command line
>
> Any thoughts ?
>
> Cheers,
>
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/01/18497.php
>


[OMPI devel] UH jenkins node seems out for the holidays

2015-12-30 Thread Howard Pritchard
Hi Folks,

As those of you working on OMPI PRs this week have noticed,
it appears that Univ. Houston's CS department may have shut
a number of systems down for the holidays.

So, for now, ignore the status of the LANL-distcheck and LANL-dlopen
jenkins tests.  Hopefully the UH server(s) will be back on line
next week.

Howard


Re: [OMPI devel] Proposal on RFCs

2015-11-23 Thread Howard Pritchard
HI Ralph,

Let's definitely discuss on 12/1.  Unless it's something like code deletion
or a large external package update (like the PMIx 1.1 PR
or hwloc refresh), just opening a PR that touches 100+ files across a
range of the code base needs more than the current github
PR interface provides.

I'd add that it would not hurt for proposals involving major changes/new
algorithms, etc. to have a wiki page.

Howard




2015-11-21 11:16 GMT-07:00 Ralph Castain :

> Hi folks
>
> When we moved to Github, we decided that we would use their “pull
> requests” to replace our RFC process. Our thinking at the time was that
> everyone would receive these, and so would know that something had been
> proposed.
>
> What we hadn’t really anticipated was the volume of PRs that would be
> generated. Quite frankly, it has become hard to sift thru them all to
> identify those that involve significant change from those involving minor
> bug fixes.
>
> Josh and I were kicking this around last week at SC’15, and after some
> consideration, I thought it makes sense to at least propose a couple of
> modifications that might help people to track what’s going on:
>
> (a) revive the RFC for significant changes. If the PR touches core code,
> or involves a change that exceeds an isolated bug fix, it would help if
> people announced it on the devel mailing list with “RFC” in the subject
> line, an explanation appropriate in length to the corresponding change, and
> a pointer to the PR. We should also include a “timeout” to indicate when
> this PR is intended to be committed, minus any expressed concerns. This
> would allow people to become aware of a proposed change that could impact
> them.
>
> (b) send a note to the devel mailing list indicating you are about to
> start working on a significant change to the code base. We generally do
> this on our weekly telecon, but not everyone can attend those. So rather
> than surprising folks with a PR out of the blue, it would be good to let
> the community know of your intentions so people can chime in with
> suggestions and contact you off-list about possibly contributing to the
> change. Besides, it might help to avoid having others committing
> conflicting changes during the effort.
>
> I figured we could discuss this on the next telecon (Dec 1st), but wanted
> to throw it out there for comment and advanced consideration.
>
> Ralph
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/11/18379.php


[OMPI devel] master build fails

2015-10-27 Thread Howard Pritchard
Hi Folks,

Looks like master can't build any more, at least not on Cray with the
--enable-picky option:

-- make all -j 8 result_stderr ---
keyval_lex.c: In function 'yy_get_next_buffer':
keyval_lex.c:751:18: warning: comparison between signed and unsigned
integer expressions
[-Wsign-compare]
   for ( n = 0; n < max_size && \
  ^
keyval_lex.c:1284:3: note: in expansion of macro 'YY_INPUT'
   YY_INPUT( (&YY_CURRENT_BUFFER_LVALUE->yy_ch_buf[number_to_move]),
show_help_lex.c: In function 'yy_get_next_buffer':
show_help_lex.c:647:18: warning: comparison between signed and
unsigned integer expressions
[-Wsign-compare]
   for ( n = 0; n < max_size && \
  ^
show_help_lex.c:1081:3: note: in expansion of macro 'YY_INPUT'
   YY_INPUT( (&YY_CURRENT_BUFFER_LVALUE->yy_ch_buf[number_to_move]),
common_verbs_usnic_fake.c: In function 'fake_driver_init':
common_verbs_usnic_fake.c:92:9: error: implicit declaration of function 'sscanf'
[-Werror=implicit-function-declaration]
 if (sscanf(value, "%i", &vendor) != 1) {
common_verbs_usnic_fake.c:92:9: warning: incompatible implicit
declaration of built-in function
'sscanf'
cc1: some warnings being treated as errors
make[2]: *** [libmca_common_verbs_usnic_la-common_verbs_usnic_fake.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1


Howard


[OMPI devel] Fwd: mtt-submit, etc.

2015-10-22 Thread Howard Pritchard
Hi Folks,

I don't seem to have gotten subscribed to the mtt-users mailing list yet,
so I'm forwarding this to the dev team.

Howard

-- Forwarded message --
From: Howard Pritchard 
List-Post: devel@lists.open-mpi.org
Date: 2015-10-22 10:18 GMT-06:00
Subject: mtt-submit, etc.
To: mtt-us...@open-mpi.org


HI Folks,

I have the following issue with a cluster I would like to use for
submitting MTT results for Open MPI: the nodes on which I have to submit
batch jobs to run the tests don't have external internet connectivity,
so if my mtt ini file has an IU database reporter section, the run dies
in the "ping the mtt server" test.

What I have right now is a two-stage process where I check out and
compile/build Open MPI and the tests on a front end which does have
access to the mtt server.  This part works and gets reported back to
the IU database.

I can run the tests using mtt, but have to disable all the mtt server
reporter stuff.

I thought I could use mtt-submit to submit some kind of mttdatabase
debug file back to IU once the batch job has completed, but I can't
figure out a way to generate this file without enabling the mtt server
reporter section in the ini file, which puts me back at the ping
failure issue.

Would anyone have suggestions on how to work around this problem?

Thanks,

Howard


[OMPI devel] HPX?

2015-10-19 Thread Howard Pritchard
Hi Folks,

I got some kind of strange email from jenkins.crest.iu.edu
concerning an HPX project.

It looks like there's some code on some private repo on crest.

Does anyone know anything about this?

Howard


Re: [OMPI devel] Access to old users@ and devel@ Open MPI mails

2015-10-02 Thread Howard Pritchard
I'm okay with it as long as they use an MPI based mapreduce to do the
analytics.

Howard


2015-10-02 9:32 GMT-06:00 Jeff Squyres (jsquyres) :

> I've received a request from a researcher at Kansas State University to
> get a copy of all old us...@open-mpi.org and de...@open-mpi.org emails.
> In their words:
>
> Jeff:
>
> One of our new professors at K-State is interested in analyzing the
> dev and user archives for OpenMPI to get a statistical characterization
> of the types of bugs/issues that are encountered.  Is there an easy
> way to
> get a tarball of each for the past year or so to start?
>
> We do actually have the entire records of all users@ and devel@ mails in
> mbox format at IU.
>
> Does anyone have any opinions about us giving copies of these mbox files
> to this KState researcher?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/10/18120.php
>


[OMPI devel] new compiler warning with --enable-picky using UH --disable-dlopen jenkins project

2015-09-25 Thread Howard Pritchard
HI Folks,

First, the --disable-dlopen/--distcheck projects do not run on anything
having anything to do with Cray.  So if you see failures with the
disable-dlopen or distcheck projects and choose to ignore them, please
remember they:

1) run on a vanilla linux (Open Suse 13.1) x86_64 box
2) use gnu 4.8.1 and 5.2.0 compilers

So if these systems/config types are important to the project and your
PR doesn't pass both of these checks, it's probably a good idea not to
merge until you figure out what's going on.

So I am triaging the jenkins build failures.  A minor thing with the
--disable-dlopen project: I'm seeing these compiler warnings with
--enable-picky:

 CC   request/req_test.lo
info/info.c: In function 'ompi_info_set_value_enum':
info/info.c:281:57: warning: passing argument 3 of
'var_enum->string_from_value' from incompatible pointer type
[-Wincompatible-pointer-types]
 ret = var_enum->string_from_value (var_enum, value, &string_value);
 ^
info/info.c:281:57: note: expected 'const char **' but argument is of
type 'char **'
  CC   request/req_wait.lo
  CC   runtime/ompi_mpi_abort.lo
  CC   runtime/ompi_mpi_init.lo
proc/proc.c: In function 'ompi_proc_world':
proc/proc.c:487:24: warning: comparison between signed and unsigned
integer expressions [-Wsign-compare]
 for (int i = 0 ; i < count ; ++i) {
^
proc/proc.c:505:18: warning: assignment from incompatible pointer type
[-Wincompatible-pointer-types]
 procs[i] = ompi_proc_for_name (name);
  ^
proc/proc.c:470:25: warning: unused variable 'my_name' [-Wunused-variable]
 ompi_process_name_t my_name;
 ^
proc/proc.c:469:28: warning: unused variable 'mask' [-Wunused-variable]
 ompi_rte_cmp_bitmask_t mask;
^
proc/proc.c:467:18: warning: unused variable 'proc' [-Wunused-variable]
 ompi_proc_t *proc;
  ^
  CC   runtime/ompi_mpi_finalize.lo
  CC   runtime/ompi_mpi_params.lo
  CC   runtime/ompi_mpi_preconnect.lo
  CC   runtime/ompi_cr.lo
  CC   runtime/ompi_info_support.lo
runtime/ompi_mpi_init.c:119:2: warning: #ident is a GCC extension
 #ident OMPI_IDENT_STRING



I think they are new.


The UH jenkins disable-dlopen project is currently failing because
gfortran 5.2.0 doesn't like a change made in PR 595.  Prior to that PR,
gfortran 5.2.0 could build the usempi_f08 module.  Now apparently it
can't.

I'm like 100% sure this is a regression.


The UH jenkins disable-dlopen project first tests building with
gcc/gfortran 4.8.1, then proceeds to a build with 5.2.0.


Unfortunately, there are also now periodic hangs of the hello world runs
at startup, but most of the time it seems to run.


Howard


[OMPI devel] busted build

2015-09-25 Thread Howard Pritchard
Hi Folks,

The UH distcheck is now failing with this compile error:

  CC   pml_ob1_rdma.lo
pml_ob1_irecv.c: In function 'mca_pml_ob1_recv':
pml_ob1_irecv.c:138:28: error: called object 'mca_pml_ob1_recvreq' is
not a function or function pointer
 mca_pml_ob1_recvreq(recvreq);
^
pml_ob1_irecv.c:39:29: note: declared here
 mca_pml_ob1_recv_request_t *mca_pml_ob1_recvreq = NULL;
 ^
make[2]: *** [pml_ob1_irecv.lo] Error 1
make[2]: *** Waiting for unfinished jobs
make[2]: Leaving directory
`/home/hppritcha/jenkins/workspace/ompi_master_pr_disable_dlopen/ompi/mca/pml/ob1'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory
`/home/hppritcha/jenkins/workspace/ompi_master_pr_disable_dlopen/ompi'
make: *** [install-recursive] Error 1
Build step 'Execute shell' marked build as failure
Setting status of 6b9e67cfdb109f87b3fce6047e52e8fe72cdaf4c to FAILURE
with url http://jenkins.open-mpi.org/job/ompi_master_pr_disable_dlopen/303/
and message: 'Build finished. No test results found.'
Using context: LANL-disable-dlopen-check
Test FAILed.


[OMPI devel] PR 595 busted build of mpi_f08

2015-09-25 Thread Howard Pritchard
Hi Folks,

Well, jenkins doesn't lie.

http://jenkins.open-mpi.org/job/ompi_master_cle5.2up02/595/console

Looks like the commits associated with PR 595 busted the mpi_f08 build.

It's a bit frustrating to set all this jenkins stuff up and then have it ignored.

Howard


[OMPI devel] open mpi builds busted

2015-09-25 Thread Howard Pritchard
Hi Folks,

I don't know what's going on, but someone checked in something that broke
the build of mpi_f08.

Lots of the jenkins tests are now failing.


Howard


[OMPI devel] problems compiling ompi master

2015-09-22 Thread Howard Pritchard
Hi Folks,

Is anyone seeing a problem compiling ompi today?
This is what I'm getting:

  CC   osc_pt2pt_passive_target.lo
In file included from ../../../../opal/include/opal_config.h:2802:0,
 from ../../../../ompi/include/ompi_config.h:29,
 from osc_pt2pt_active_target.c:24:
osc_pt2pt_active_target.c: In function 'ompi_osc_pt2pt_get_peers':
osc_pt2pt_active_target.c:84:35: error: 'ompi_osc_rdma_peer_t' undeclared
(first use in this function)
 peers = calloc (size, sizeof (ompi_osc_rdma_peer_t *));
   ^
../../../../opal/include/opal_config_bottom.h:323:61: note: in definition
of macro 'calloc'
 #define calloc(nmembers, size) opal_calloc((nmembers), (size),
__FILE__, __LINE__)
 ^
osc_pt2pt_active_target.c:84:35: note: each undeclared identifier is
reported only once for each function it appears in
 peers = calloc (size, sizeof (ompi_osc_rdma_peer_t *));
   ^
../../../../opal/include/opal_config_bottom.h:323:61: note: in definition
of macro 'calloc'
 #define calloc(nmembers, size) opal_calloc((nmembers), (size),
__FILE__, __LINE__)
 ^
osc_pt2pt_active_target.c:84:57: error: expected expression before ')' token
 peers = calloc (size, sizeof (ompi_osc_rdma_peer_t *));
 ^
../../../../opal/include/opal_config_bottom.h:323:61: note: in definition
of macro 'calloc'
 #define calloc(nmembers, size) opal_calloc((nmembers), (size),
__FILE__, __LINE__)
 ^
Howard

