Re: [OMPI devel] Announcing Open MPI v5.0.0rc2

2022-01-10 Thread Ralph Castain via devel
Hi Marco

I added the libtool tweak to PMIx and changed the "interface" variable in PRRTE 
to "intf" - hopefully, the offending header didn't define that one too!

I'm not sure of the problem you're encountering, but I do note the PMIx error 
message:

> [116] 
> /pub/devel/openmpi/v5.0/openmpi-5.0.0-0.1.x86_64/src/openmpi-5.0.0rc2/3rd-party/openpmix/src/mca/ptl/base/ptl_base_listener.c:498
>  bind() failed for socket 13 storage size 16: Cannot assign requested address

IIRC, we may have had problems with sockets in Cygwin before, yes? You might 
need to look at the referenced code area to see if there needs to be some 
Cygwin-related tweak.
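
If it helps, a hypothetical diagnostic along these lines (not the actual PMIx
listener code) would show exactly which address/port the listener is trying to
bind when Cygwin returns "Cannot assign requested address" - note that the
"storage size 16" in the message matches sizeof(struct sockaddr_in):

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* print the IPv4 address/port before attempting the bind */
    static int try_bind(int sd, const struct sockaddr_in *addr)
    {
        char buf[INET_ADDRSTRLEN];
        inet_ntop(AF_INET, &addr->sin_addr, buf, sizeof(buf));
        if (bind(sd, (const struct sockaddr *) addr, sizeof(*addr)) < 0) {
            fprintf(stderr, "bind(%s:%u) failed: %s\n",
                    buf, (unsigned) ntohs(addr->sin_port), strerror(errno));
            return -1;
        }
        return 0;
    }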

Ralph

> On Jan 9, 2022, at 11:09 PM, Marco Atzeri via devel 
>  wrote:
> 
> On 10.01.2022 06:50, Marco Atzeri wrote:
>> On 09.01.2022 15:54, Ralph Castain via devel wrote:
>>> Hi Marco
>>> 
>>> Try the patch here (for the prrte 3rd-party subdirectory): 
>>> https://github.com/openpmix/prrte/pull/1173
>>> 
>>> 
>>> Ralph
>>> 
>> Thanks Ralph,
>> I will do on the next build
>> as I need still to test the current build.
> 
> The tests are not satisfactory
> 
> I have only one test failure
>  FAIL: dlopen_test.exe
> 
> that I suspect is due to a wrong name in the test
> 
> but a simple run fails
> 
> $ mpirun -n 4 ./hello_c.exe
> [116] 
> /pub/devel/openmpi/v5.0/openmpi-5.0.0-0.1.x86_64/src/openmpi-5.0.0rc2/3rd-party/openpmix/src/mca/ptl/base/ptl_base_listener.c:498
>  bind() failed for socket 13 storage size 16: Cannot assign requested address
> Hello, world, I am 0 of 1, (Open MPI v5.0.0rc2, package: Open MPI 
> Marco@LAPTOP-82F08ILC Distribution, ident: 5.0.0rc2, repo rev: v5.0.0rc2, Oct 
> 18, 2021, 125)
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>  PML add procs failed
>  --> Returned "Not found" (-13) instead of "Success" (0)
> --
> [LAPTOP-82F08ILC:0] *** An error occurred in MPI_Init
> [LAPTOP-82F08ILC:0] *** reported by process [36547002369,1]
> [LAPTOP-82F08ILC:0] *** on a NULL communicator
> [LAPTOP-82F08ILC:0] *** Unknown error
> [LAPTOP-82F08ILC:0] *** MPI_ERRORS_ARE_FATAL (processes in this 
> communicator will now abort,
> [LAPTOP-82F08ILC:0] ***and MPI will try to terminate your MPI job as 
> well)
> --
> 
> Suggestions on what to look for?
> 
> Regards
> Marco
> 
> 




Re: [OMPI devel] Announcing Open MPI v5.0.0rc2

2022-01-09 Thread Ralph Castain via devel
Hi Marco

Try the patch here (for the prrte 3rd-party subdirectory): 
https://github.com/openpmix/prrte/pull/1173


Ralph

> On Jan 9, 2022, at 12:29 AM, Marco Atzeri via devel 
>  wrote:
> 
> On 01.01.2022 20:07, Barrett, Brian wrote:
>> Marco -
>> There are some patches that haven't made it to the 5.0 branch to make this 
>> behavior better.  I didn't get a chance to back port them before the holiday 
>> break, but they will be in the next RC.  That said, the issue below is a 
>> warning, not an error, so you should still end up with a build that works 
>> (with an included PMIx).  The issue is that pkg-config can't be found, so we 
>> have trouble guessing what libraries are dependencies of PMIx, which is a 
>> potential problem in complicated builds with static libraries.
>> Brian
> 
> Thanks Brian,
> 
> the build error was actually in the thread settings.
> 
> Up to v4.1 I was using
> 
>  --with-threads=posix
> 
> which is no longer accepted, but no error is reported; it results
> in a different setting that does not work on Cygwin.
> Removing the option seems to work
> 
> 
> I have, however, found a logic error in prrte that
> probably needs a verification of all the HAVE_*_H macros
> between configuration and code
> 
> 
> /pub/devel/openmpi/v5.0/openmpi-5.0.0-0.1.x86_64/src/openmpi-5.0.0rc2/3rd-party/prrte/src/mca/odls/default/odls_default_module.c:114:14:
>  fatal error: sys/ptrace.h: No such file or directory
>  114 | #include <sys/ptrace.h>
>  |  ^~
> 
> caused by
> 
> $ grep -rH HAVE_SYS_PTRACE_H .
> ./3rd-party/prrte/config.log:| #define HAVE_SYS_PTRACE_H 0
> ./3rd-party/prrte/config.log:| #define HAVE_SYS_PTRACE_H 0
> ./3rd-party/prrte/config.log:#define HAVE_SYS_PTRACE_H 0
> ./3rd-party/prrte/config.status:D["HAVE_SYS_PTRACE_H"]=" 0"
> ./3rd-party/prrte/src/include/prte_config.h:#define HAVE_SYS_PTRACE_H 0
> 
> while the code in
>3rd-party/prrte/src/mca/odls/default/odls_default_module.c
> has
> 
> #ifdef HAVE_SYS_PTRACE_H
> #include <sys/ptrace.h>
> #endif
> 
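For reference: with configure generating "#define HAVE_SYS_PTRACE_H 0", a plain
#ifdef still takes the include. A guard along these lines (sketch only) would
respect the 0 value as well:

    #if defined(HAVE_SYS_PTRACE_H) && HAVE_SYS_PTRACE_H
    #include <sys/ptrace.h>
    #endif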
> 
> currently I am stuck at
> 
> 0rc2/3rd-party/prrte/src/mca/oob/tcp/oob_tcp_connection.c:61:
> /pub/devel/openmpi/v5.0/openmpi-5.0.0-0.1.x86_64/src/openmpi-5.0.0rc2/3rd-party/prrte/src/mca/oob/tcp/oob_tcp_connection.c: In function ‘prte_oob_tcp_peer_try_connect’:
> /pub/devel/openmpi/v5.0/openmpi-5.0.0-0.1.x86_64/src/openmpi-5.0.0rc2/3rd-party/prrte/src/mca/oob/tcp/oob_tcp_connection.c:163:16:
>  error: expected identifier or ‘(’ before ‘struct’
>  163 | prte_if_t *interface;
>  |^
> /pub/devel/openmpi/v5.0/openmpi-5.0.0-0.1.x86_64/src/openmpi-5.0.0rc2/3rd-party/prrte/src/mca/oob/tcp/oob_tcp_connection.c:180:19:
>  error: expected ‘{’ before ‘=’ token
>  180 | interface = PRTE_NEW(prte_if_t);
>  |   ^
> 
> 
> not sure if it is caused by a new GCC 11 requirement or by wrong headers
> being pulled in.
> 
> Has anyone built with GCC 11 ?
> 
> Regards
> Marco
> 




Re: [OMPI devel] openmpi/pmix compatibility

2021-10-10 Thread Ralph Castain via devel
It was a bug (typo in the attribute name when backported from OMPI master) in 
OMPI 4.1.1 - it has been fixed.


> On Oct 9, 2021, at 9:18 PM, Orion Poplawski via devel 
>  wrote:
> 
> It looks like openmpi 4.1.1 is not compatible with pmix 4.1.0 - is that 
> expected?
> 
> In file included from ../../../../opal/mca/pmix/base/base.h:22,
> from common_ofi.c:29:
> common_ofi.c: In function 'get_package_rank':
> common_ofi.c:321:40: error: 'PMIX_PACKAGE_RANK' undeclared (first use in this 
> function)
>  321 | OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, PMIX_PACKAGE_RANK,
>  |^
> ../../../../opal/mca/pmix/pmix.h:153:56: note: in definition of macro 
> 'OPAL_MODEX_RECV_VALUE_OPTIONAL'
>  153 | if (OPAL_SUCCESS == ((r) = opal_pmix.get((p), (s), &(_ilist), 
> &(_kv {\
>  |^
> common_ofi.c:321:40: note: each undeclared identifier is reported only once 
> for each function it appears in
>  321 | OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, PMIX_PACKAGE_RANK,
>  |^
> ../../../../opal/mca/pmix/pmix.h:153:56: note: in definition of macro 
> 'OPAL_MODEX_RECV_VALUE_OPTIONAL'
>  153 | if (OPAL_SUCCESS == ((r) = opal_pmix.get((p), (s), &(_ilist), 
> &(_kv {\
>  |^
> make[2]: *** [Makefile:1937: common_ofi.lo] Error 1
> 
> 
> -- 
> Orion Poplawski
> he/him/his - surely the least important thing about me
> Manager of NWRA Technical Systems  720-772-5637
> NWRA, Boulder/CoRA Office FAX: 303-415-9702
> 3380 Mitchell Lane   or...@nwra.com
> Boulder, CO 80301 https://www.nwra.com/
> 




Re: [OMPI devel] Support timelines

2021-09-16 Thread Ralph Castain via devel
Answered on packager list - apologies that it didn't get answered there in a 
timely fashion.

> On Sep 16, 2021, at 6:56 AM, Orion Poplawski via devel 
>  wrote:
> 
> Is there any documentation that would indicate how long the 4.0 (or any 
> particular release series) will be supported?  This would be helpful for 
> establishing downstream timelines.
> 
> Thanks!
> 
> -- 
> Orion Poplawski
> he/him/his - surely the least important thing about me
> Manager of NWRA Technical Systems  720-772-5637
> NWRA, Boulder/CoRA Office FAX: 303-415-9702
> 3380 Mitchell Lane   or...@nwra.com
> Boulder, CO 80301 https://www.nwra.com/
> 
> 




[OMPI devel] New host/hostfile policy

2021-06-16 Thread Ralph Castain via devel
We've been struggling a bit lately with the problem of resolving multiple names 
for the same host. Part of the problem has been the need to minimize DNS 
resolves as systems were taking way too long to perform them, resulting in very 
long startup times. I've done my best to minimize this and still get hostnames 
to properly resolve, even when people/systems insist on creating name confusion.

Historically, we disabled all DNS resolves when running under a managed 
allocation. We simply assumed that the RM would be consistent in its naming, 
and required users to always use the RM-provided host names for any -host or 
hostfile entries.

However, we defaulted to performing DNS resolves in non-managed situations, 
giving the user an MCA parameter to disable them if the DNS system was too 
slow. This unfortunately has been causing problems as people new to the project 
forget about the param and start seeing very long startup times.

Accordingly, we now no longer default to performing DNS resolves for 
non-managed scenarios, though the user can request that we do so if they run 
into hostname confusion issues. We still disable it completely for managed 
allocations.

This doesn't penalize the majority of users who don't engage in or have systems 
that generate multiple names for the same piece of hardware, and shifts the 
penalties onto those who do. Seemed more appropriate that those who want to 
screw around with host names should pay the price instead of inconveniencing 
everyone else out-of-the-box.

So if you need to resolve hostnames, then set PRTE_MCA_prte_do_not_resolve=0 
(or the equivalent in the PRRTE default MCA param or on the cmd line). 
Otherwise, you should be fine.
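
For example (a sketch - the user-level param file location matches the defaults
noted in the "Auto-forward of envars" mail below):

    # as an environment variable:
    export PRTE_MCA_prte_do_not_resolve=0

    # or as a default MCA param:
    echo "prte_do_not_resolve = 0" >> $HOME/.prte/mca-params.conf
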
Ralph




[OMPI devel] Auto-forward of envars

2021-04-14 Thread Ralph Castain via devel
PMIx and PRRTE both read and forward their respective default MCA parameters 
from default system and user-level param files:

/etc/pmix-mca-params.conf
$HOME/.pmix/mca-params.conf

/etc/prte-mca-params.conf
$HOME/.prte/mca-params.conf

PMIx will also do the same thing for OMPI default system and user-level params 
provided it is told the programming model. Getting the system-level defaults 
requires that the user have "OMPIHOME" set in their environment to point to the 
installation so we can find the default param file. In the case of "mpirun" and 
friends, we automatically do this. Our hope is that resource managers such as 
Slurm will update their PMIx support to include this feature - even though they 
auto-forward your environment, they don't know to collect the params in the 
default files, thus forcing every OMPI proc to open/read those files...and 
causing a logjam on the file system.

In addition to the default params, PMIx also (if the OMPI model is declared) 
picks up and forwards OMPI-specific envars. By default, we capture all envars 
that start with "OMPI_". However, you can control that by setting the 
PMIX_MCA_pmdl_ompi_include and PMIX_MCA_pmdl_ompi_exclude params.

There is a similar capability for OSHMEM. The default here is to forward all 
envars that start with "SHMEM_" or "SMA_", per the folks at Stonybrook.

The question that has arisen is: what should the default params include? 
Obviously, the system manager as well as the user can set them to be anything 
they want, but it would be nice if we could provide an adequate initial set. 
For example, one group has suggested that the OMPI model include all "CUDA_" 
envars.
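
For example, to also pick up the CUDA_ envars (a sketch only - the value is
assumed here to be a comma-delimited list of prefixes):

    export PMIX_MCA_pmdl_ompi_include=OMPI_,CUDA_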

I should note that PMIx also does this for fabrics! For example, if we see OPA 
is on the system, we automatically forward all "HFI_" and "PSM2_" envars. This 
is the only fabric with a PMIx plugin at the moment, so this feature is limited 
to OPA right now.

Any thoughts on this?
Ralph




Re: [OMPI devel] Slurm integration and rankfiles....

2021-03-21 Thread Ralph Castain via devel
You might want to take a look at PRRTE (https://github.com/openpmix/prrte) - it 
does exactly what you describe, only from the other direction. It provides a 
customizable launcher that supports the various cmd lines, and then uses a 
common RTE backend.

We don't use SLURM_TASKS_PER_NODE for placement - we only use it to determine 
how many processes we are allowed to run on that node. This is particularly 
important in multi-tenant environments. As you have discovered, we need it, so 
removing it isn't an option.


On Mar 21, 2021, at 12:22 PM, Martyn Foster via devel <devel@lists.open-mpi.org> wrote:

Sorry for the slow reply! 

I didn't want to get fixated on why the variable was unset, though I can 
understand the existence of a check if Slurm always sets this (I don't recall 
that being the case for all configurations historically, but perhaps it is 
now). The reason I'd unset it (!) is because I was trying to build an 
environment to support completely arbitrary task placement/distribution that 
works with various launchers (orterun/srun/hydra) and it seems tasks_per_node 
being set was upsetting one of the others. 

Slurm's internal geometry parameters can't possibly describe an arbitrary 
(rankfile) layout, so I was nervous about why they would be required if a 
rankfile was provided...

Martyn

On Mon, 15 Mar 2021 at 19:57, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:
Martyn? Why are you saying SLURM_TASKS_PER_NODE might not be present?

It sounds to me like something is wrong in your Slurm environment - I really 
believe that this envar is always supposed to be there.


> On Mar 15, 2021, at 4:20 AM, Peter Kjellström <c...@nsc.liu.se> wrote:
> 
> On Fri, 12 Mar 2021 22:19:09 +0000
> Ralph Castain via devel <devel@lists.open-mpi.org> wrote:
> 
>> Why would it not be set? AFAICT, Slurm is supposed to always set that
>> envar, or so we've been told.
> 
> Maybe confusion on the exact name?
> 
> AFAIK slurm always sets SLURM_TASKS_PER_NODE but only sets
> SLURM_NTASKS_PER_NODE (almost same name) when --ntasks-per-node is
> given.
> 
> /Peter K





[OMPI devel] Accessing HWLOC from inside OMPI

2021-03-17 Thread Ralph Castain via devel
Hi folks

I've written a wiki page explaining how OMPI handles HWLOC from inside the OMPI 
code base starting with OMPI v5. The link is on the home page under the 
Developer Documents (Accessing the HWLOC topology tree from inside the MPI/OPAL 
layers):

https://github.com/open-mpi/ompi/wiki/Accessing-the-HWLOC-topology-tree

I've tried to capture the various scenarios under which we operate and explain 
(a) how we deal with it, (b) the various options that were considered, and (c) 
the thought process behind the eventual solution we used. Seemed like something 
worth capturing as I ride off into the sunset.

Please let me know if there are things I should better clarify.
Ralph



Re: [OMPI devel] Slurm integration and rankfiles....

2021-03-15 Thread Ralph Castain via devel
Martyn? Why are you saying SLURM_TASKS_PER_NODE might not be present?

It sounds to me like something is wrong in your Slurm environment - I really 
believe that this envar is always supposed to be there.


> On Mar 15, 2021, at 4:20 AM, Peter Kjellström  wrote:
> 
> On Fri, 12 Mar 2021 22:19:09 +0000
> Ralph Castain via devel  wrote:
> 
>> Why would it not be set? AFAICT, Slurm is supposed to always set that
>> envar, or so we've been told.
> 
> Maybe confusion on the exact name?
> 
> AFAIK slurm always sets SLURM_TASKS_PER_NODE but only sets
> SLURM_NTASKS_PER_NODE (almost same name) when --ntasks-per-node is
> given.
> 
> /Peter K




Re: [OMPI devel] Slurm integration and rankfiles....

2021-03-12 Thread Ralph Castain via devel
Why would it not be set? AFAICT, Slurm is supposed to always set that envar, or 
so we've been told.


On Mar 12, 2021, at 2:15 AM, Martyn Foster via devel <devel@lists.open-mpi.org> wrote:

Hi Ralph,

Slurm is 19.05. 

To be clear - it's not unexpected that SLURM_TASKS_PER_NODE is unset in the 
configuration.

Martyn

On Thu, 11 Mar 2021 at 16:09, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:
What version of Slurm is this?

> On Mar 11, 2021, at 8:03 AM, Martyn Foster via devel <devel@lists.open-mpi.org> wrote:
> 
> Hi all,
> 
> Using a rather trivial example
> mpirun -np 1 -rf rankfile ./HelloWorld
> on a Slurm system;
> --
> While trying to determine what resources are available, the SLURM
> resource allocator expects to find the following environment variables:
> 
>     SLURM_NODELIST
>     SLURM_TASKS_PER_NODE
> 
> However, it was unable to find the following environment variable:
> 
>     SLURM_TASKS_PER_NODE
> 
> --
> 
> (Both for OpenMPI 4.0/4.1). It is correct the variable is not set,  but why 
> is  SLURM_TASKS_PER_NODE expected or required when using a rankfile where one 
> presumes it would not be a constant across the job anyway?
> 
> Martyn
> 





Re: [OMPI devel] Slurm integration and rankfiles....

2021-03-11 Thread Ralph Castain via devel
What version of Slurm is this?

> On Mar 11, 2021, at 8:03 AM, Martyn Foster via devel 
>  wrote:
> 
> Hi all,
> 
> Using a rather trivial example
> mpirun -np 1 -rf rankfile ./HelloWorld
> on a Slurm system;
> --
> While trying to determine what resources are available, the SLURM
> resource allocator expects to find the following environment variables:
> 
> SLURM_NODELIST
> SLURM_TASKS_PER_NODE
> 
> However, it was unable to find the following environment variable:
> 
> SLURM_TASKS_PER_NODE
> 
> --
> 
> (Both for OpenMPI 4.0/4.1). It is correct the variable is not set,  but why 
> is  SLURM_TASKS_PER_NODE expected or required when using a rankfile where one 
> presumes it would not be a constant across the job anyway?
> 
> Martyn
> 




Re: [OMPI devel] mpirun alternative

2021-03-09 Thread Ralph Castain via devel
If you can launch that cmd line on each node, then it should work. I'm not sure 
what the "-x 10.0.35.43" at the beginning of the line is supposed to do, so 
that might not be useful.


On Mar 9, 2021, at 7:25 AM, Gabriel Tanase <gabrieltan...@gmail.com> wrote:

George, 
I started to digg more in option 2 as you describe it. I believe I can make 
that work. 
For example I created this fake ssh :

$ cat ~/bin/ssh
#!/bin/bash
fname=env.$$
echo ">>>>>>>>>>>>> ssh" >> $fname
env >>$fname
echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>" >>$fname
echo $@ >>$fname

And this one prints all args that the remote process will receive:

>>>>>>>>>>>>>>>>>>>>>>>>>>
-x 10.0.35.43 orted -mca ess "env" -mca ess_base_jobid "2752512000" -mca 
ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex 
"ip-[2:10]-0-16-120,[2:10].0.35.43,[2:10].0.35.42@0(3)" -mca orte_hnp_uri 
"2752512000.0;tcp://10.0.16.120:44789 <http://10.0.16.120:44789/> " -mca plm 
"rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri 
"2752512000.0;tcp://10.0.16.120:44789 <http://10.0.16.120:44789/> " -mca 
rmaps_base_mapping_policy "node" -mca pmix "^s1,s2,cray,isolated"

Now I am thinking that I probably don't even need to create all those openmpi 
env variables, as I am hoping the orted that is started remotely will start the 
final executable with the right env set. Does this sound right?

Thx,
--Gabriel


On Fri, Mar 5, 2021 at 3:15 PM George Bosilca <bosi...@icl.utk.edu> wrote:
Gabriel,

You should be able to. Here are at least 2 different ways of doing this.


1. Purely MPI. Start singletons (or smaller groups), and connect via sockets 
using MPI_Comm_join. You can setup your own DNS-like service, with the goal of 
having the independent MPI jobs leave a trace there, such that they can find 
each other and create the initial socket.

2. You could replace ssh/rsh with a no-op script (that returns success such 
that the mpirun process thinks it successfully started the processes), and then 
handcraft the environment as you did for GASNet. 

3. We have support for a DVM (Distributed Virtual Machine) that basically creates 
an independent service where different mpirun invocations can connect to retrieve 
information. The mpirun jobs using this DVM start as singletons, and fall back to 
MPI_Comm_connect/accept to recreate an MPI world.

Good luck,
  George.


On Fri, Mar 5, 2021 at 2:08 PM Ralph Castain via devel <devel@lists.open-mpi.org> wrote:
I'm afraid that won't work - there is no way for the job to "self assemble". 
One could create a way to do it, but it would take some significant coding in 
the guts of OMPI to get there.


On Mar 5, 2021, at 9:40 AM, Gabriel Tanase via devel <devel@lists.open-mpi.org> wrote:

Hi all,
I decided to use MPI as the messaging layer for a multihost database. However, 
within my org I faced very strong opposition to allowing passwordless ssh or rsh. 
For security reasons we want to minimize the opportunities to execute arbitrary 
code on the db clusters. I don't want to run other things like slurm, etc.

My question would be: Is there a way to start an mpi application by running 
certain binaries on each host ? E.g., if my executable is "myapp" can I start a 
server (orted???) on host zero and then start myapp on each host with the right 
env variables set (for specifying the rank, num ranks, etc.) 

For example when using another messaging API (GASnet) I was able to start a 
server on host zero and then manually start the application binary on each host 
(with some environment variables properly set) and all was good.

I tried to reverse engineer a little the env variables used by mpirun (mpirun 
-np 2 env) and then I copied these env variables in a shell script prior to 
invoking my hello world mpirun but I got an error message implying a server is 
not present: 

PMIx_Init failed for the following reason:

  NOT-SUPPORTED

Open MPI requires access to a local PMIx server to execute. Please ensure
that either you are operating in a PMIx-enabled environment, or use "mpirun"
to execute the job.

Here is the shell script for host0:

$ cat env1.sh
#!/bin/bash

export OMPI_COMM_WORLD_RANK=0
export PMIX_NAMESPACE=mpirun-38f9d3525c2c-53291@1
export PRTE_MCA_prte_base_help_aggregate=0
export TERM_PROGRAM=Apple_Terminal
export OMPI_MCA_num_procs=2
export TERM=xterm-256color
export SHELL=/bin/bash
export PMIX_VERSION=4.1.0a1
export OPAL_USER_PARAMS_GIVEN=1
export TMPDIR=/var/folders/_k/c4_xr5vd14j97fw7j8vzmd45_9hjbq/T/
export 
Apple_PubSub_Socket_Render=/private/tmp/com.apple.launchd.HCXmdRI1WL/Ren

Re: [OMPI devel] mpirun alternative

2021-03-05 Thread Ralph Castain via devel
I'm afraid that won't work - there is no way for the job to "self assemble". 
One could create a way to do it, but it would take some significant coding in 
the guts of OMPI to get there.


On Mar 5, 2021, at 9:40 AM, Gabriel Tanase via devel <devel@lists.open-mpi.org> wrote:

Hi all,
I decided to use MPI as the messaging layer for a multihost database. However, 
within my org I faced very strong opposition to allowing passwordless ssh or rsh. 
For security reasons we want to minimize the opportunities to execute arbitrary 
code on the db clusters. I don't want to run other things like slurm, etc.

My question would be: Is there a way to start an mpi application by running 
certain binaries on each host ? E.g., if my executable is "myapp" can I start a 
server (orted???) on host zero and then start myapp on each host with the right 
env variables set (for specifying the rank, num ranks, etc.) 

For example when using another messaging API (GASnet) I was able to start a 
server on host zero and then manually start the application binary on each host 
(with some environment variables properly set) and all was good.

I tried to reverse engineer a little the env variables used by mpirun (mpirun 
-np 2 env) and then I copied these env variables in a shell script prior to 
invoking my hello world mpirun but I got an error message implying a server is 
not present: 

PMIx_Init failed for the following reason:

  NOT-SUPPORTED

Open MPI requires access to a local PMIx server to execute. Please ensure
that either you are operating in a PMIx-enabled environment, or use "mpirun"
to execute the job.

Here is the shell script for host0:

$ cat env1.sh
#!/bin/bash

export OMPI_COMM_WORLD_RANK=0
export PMIX_NAMESPACE=mpirun-38f9d3525c2c-53291@1
export PRTE_MCA_prte_base_help_aggregate=0
export TERM_PROGRAM=Apple_Terminal
export OMPI_MCA_num_procs=2
export TERM=xterm-256color
export SHELL=/bin/bash
export PMIX_VERSION=4.1.0a1
export OPAL_USER_PARAMS_GIVEN=1
export TMPDIR=/var/folders/_k/c4_xr5vd14j97fw7j8vzmd45_9hjbq/T/
export 
Apple_PubSub_Socket_Render=/private/tmp/com.apple.launchd.HCXmdRI1WL/Render
export 
PMIX_SERVER_URI41=mpirun-38f9d3525c2c-53291@0.0;tcp4://192.168.0.180:52093
export TERM_PROGRAM_VERSION=421.2
export PMIX_RANK=0
export TERM_SESSION_ID=18212D82-DEB2-4AE8-A271-FB47AC71337B
export OMPI_COMM_WORLD_LOCAL_RANK=0
export OMPI_ARGV=
export OMPI_MCA_initial_wdir=/Users/igtanase/ompi
export USER=igtanase
export OMPI_UNIVERSE_SIZE=2
export SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.PhcplcX3pC/Listeners
export OMPI_COMMAND=./exe
export __CF_USER_TEXT_ENCODING=0x54984577:0x0:0x0
export 
OMPI_FILE_LOCATION=/var/folders/_k/c4_xr5vd14j97fw7j8vzmd45_9hjbq/T//prte.38f9d3525c2c.1419265399/dvm.53291/1/0
export 
PMIX_SERVER_URI21=mpirun-38f9d3525c2c-53291@0.0;tcp4://192.168.0.180:52093
export 
PATH=/Users/igtanase/ompi/bin/:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
export OMPI_COMM_WORLD_LOCAL_SIZE=2
export PRTE_MCA_pmix_session_server=1
export PWD=/Users/igtanase/ompi
export OMPI_COMM_WORLD_SIZE=2
export OMPI_WORLD_SIZE=2
export LANG=en_US.UTF-8
export XPC_FLAGS=0x0
export PMIX_GDS_MODULE=hash
export XPC_SERVICE_NAME=0
export HOME=/Users/igtanase
export SHLVL=2
export PMIX_SECURITY_MODE=native
export PMIX_HOSTNAME=38f9d3525c2c
export LOGNAME=igtanase
export OMPI_WORLD_LOCAL_SIZE=2
export PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC
export PRTE_LAUNCHED=1
export 
PMIX_SERVER_TMPDIR=/var/folders/_k/c4_xr5vd14j97fw7j8vzmd45_9hjbq/T//prte.38f9d3525c2c.1419265399/dvm.53291
export OMPI_COMM_WORLD_NODE_RANK=0
export OMPI_MCA_cpu_type=x86_64
export PMIX_SYSTEM_TMPDIR=/var/folders/_k/c4_xr5vd14j97fw7j8vzmd45_9hjbq/T/
export 
PMIX_SERVER_URI4=mpirun-38f9d3525c2c-53291@0.0;tcp4://192.168.0.180:52093
export OMPI_NUM_APP_CTX=1
export SECURITYSESSIONID=186a9
export 
PMIX_SERVER_URI3=mpirun-38f9d3525c2c-53291@0.0;tcp4://192.168.0.180:52093
export 
PMIX_SERVER_URI2=mpirun-38f9d3525c2c-53291@0.0;tcp4://192.168.0.180:52093
export _=/usr/bin/env

./exe

Thx for your help,
--Gabriel



[OMPI devel] Remove stale OPAL dss and opal_tree code

2021-02-18 Thread Ralph Castain via devel
Hi folks

I'm planning on removing the OPAL dss (pack/unpack) code as part of my work to 
reduce the code base I historically supported. The pack/unpack functionality is 
now in PMIx (has been since v3.0 was released), and so we had duplicate 
capabilities spread across OPAL and PRRTE. I have already removed the PRRTE 
code in favor of unifying on PMIx as the lowest common denominator.

The following PR removes OPAL dss and an "opal_tree" class that called it but 
wasn't used anywhere in the code: https://github.com/open-mpi/ompi/pull/8492

I have updated the very few places in OMPI/OSHMEM that touched the dss to use 
the PMIx equivalents. Please take a look and note any concerns on the PR. Minus 
any objections, I'll plan on committing this after next Tuesday's telecon.

Thanks
Ralph



Re: [OMPI devel] [EXTERNAL] Support for AMD M100?

2021-02-11 Thread Ralph Castain via devel
FWIW: now that I am out of Intel, we are planning on upping the PMIx support 
for GPUs in general, so I expect we'll be including this one. Support will 
include providing info on capabilities (for both local and remote devices), 
distances from every proc to each of its local GPUs, affinity settings, etc.


> On Feb 11, 2021, at 10:57 AM, Atchley, Scott {Leadership Computing} via devel 
>  wrote:
> 
>> On Feb 11, 2021, at 1:56 PM, Atchley, Scott  wrote:
>> 
>>> On Feb 11, 2021, at 1:11 PM, Jeff Squyres (jsquyres) via devel 
>>>  wrote:
>>> 
>>> 
>>> 
>>> That being said, we just added the AVX MPI_Op component -- equivalent 
>>> components could be added for CUDA and/or AMD's GPU (what API does it use 
>>> -- OpenCL?).  
>> 
>> AMD’s API is HIP:
>> 
>> https://rocmdocs.amd.com/en/latest/Programming_Guides/HIP-GUIDE.html
>> 
>> It is an abstraction of CUDA that allows compiling to AMD or NVIDIA GPUs.
> 
> I should have added that there is an ECP project to port it to Intel GPUs as 
> well.




Re: [OMPI devel] configure problem on master

2021-02-04 Thread Ralph Castain via devel
Sounds like I need to resync the PMIx lustre configury with the OMPI one - I'll 
do that.
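
One thing that jumps out of the log below: the first "checking for library
containing llapi_file_create" probe (run without any search path) fails, and
its negative result is then cached, so the subsequent lib64/lib probes just
report "(cached) no". A sketch of the usual workaround, assuming the PMIx m4
uses AC_SEARCH_LIBS for that check (variable names below are placeholders):

    # clear the cached result before re-probing with a new -L path
    unset ac_cv_search_llapi_file_create
    LDFLAGS="$LDFLAGS -L/opt/lustre/2.12.2/lib64"
    AC_SEARCH_LIBS([llapi_file_create], [lustreapi],
                   [lustre_check_happy=yes], [lustre_check_happy=no])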


On Feb 4, 2021, at 11:56 AM, Gabriel, Edgar via devel <devel@lists.open-mpi.org> wrote:

I have a weird problem running configure on master on our cluster. Basically, 
configure fails when I request lustre support - not in ompio, but in openpmix.
What makes our cluster setup maybe a bit special is that the lustre libraries 
are not installed in the standard path, but in /opt, and thus we provide 
--with-lustre=/opt/lustre/2.12.2 as an option.
If I remove the 3rd-party/openpmix/src/mca/pstrg/lustre component, the 
configure script finishes correctly.
I looked at the ompi vs. openpmix check_lustre configure scripts; I could not 
detect at a quick glance any difference that would explain why the script is 
failing in one instance but not the other, but the openpmix version does seem 
to go through some additional hoops (checking separately for the include 
directory, the lib and lib64 directories, etc.), so it might be a difference in 
the PMIX_ macros vs. the OPAL_ macros.
 --snip--
 --- MCA component fs:lustre (m4 configuration macro)
checking for MCA component fs:lustre compile mode... dso
checking --with-lustre value... sanity check ok (/opt/lustre/2.12.2)
checking looking for lustre libraries and header files in... 
(/opt/lustre/2.12.2)
checking lustre/lustreapi.h usability... yes
checking lustre/lustreapi.h presence... yes
checking for lustre/lustreapi.h... yes
looking for library in lib
checking for library containing llapi_file_create... no
looking for library in lib64
checking for library containing llapi_file_create... -llustreapi
checking if liblustreapi requires libnl v1 or v3...
checking for required lustre data structures... yes
checking if MCA component fs:lustre can compile... yes
 --snip --
 --- MCA component pstrg:lustre (m4 configuration macro)
checking for MCA component pstrg:lustre compile mode... dso
checking --with-lustre value... sanity check ok (/opt/lustre/2.12.2)
checking looking for lustre libraries and header files in... 
(/opt/lustre/2.12.2)
looking for header in /opt/lustre/2.12.2
checking lustre/lustreapi.h usability... no
checking lustre/lustreapi.h presence... no
checking for lustre/lustreapi.h... no
looking for header in /opt/lustre/2.12.2/include
checking lustre/lustreapi.h usability... yes
checking lustre/lustreapi.h presence... yes
checking for lustre/lustreapi.h... yes
looking for library without search path
checking for library containing llapi_file_create... no
looking for library in /opt/lustre/2.12.2/lib64
checking for library containing llapi_file_create... (cached) no
looking for library in /opt/lustre/2.12.2/lib
checking for library containing llapi_file_create... (cached) no
configure: error: Lustre support requested but not found.  Aborting

 --snip --
 Does anybody have an idea on what could trigger this issue or a suggestion how 
to investigate it?
 Thanks
Edgar



Re: [OMPI devel] HWLOC duplication relief

2021-02-03 Thread Ralph Castain via devel
I have updated the site to reflect this discussion to-date. I'm still trying to 
figure out what to do about low-level libs. For now, I've removed the envars 
and modified suggestions.

https://openpmix.github.io/support/faq/avoid-hwloc-dup

Further comment/input is welcome.


> On Feb 3, 2021, at 8:09 AM, Ralph Castain via devel 
>  wrote:
> 
> What if we do this:
> 
> - if you are using PMIx v4.1 or above, then there is no problem. Call 
> PMIx_Load_topology and we will always return a valid pointer to the topology, 
> subject to the caveat that all members of the process (as well as the server) 
> must use the same hwloc version.
> 
> - if you are using PMIx v4.0 or below, then first do a PMIx_Get for 
> PMIX_TOPOLOGY. If "not found", then try to get the shmem info and adopt it. 
> If the shmem info isn't found, then do a topology_load to discover the 
> topology. Either way, when done, do a PMIx_Store_internal of the 
> hwloc_topology_t using the PMIX_TOPOLOGY key.
> 
> This still leaves open the question of what to do with low-level libraries 
> that really don't want to link against PMIx. I'm not sure what to do there. I 
> agree it is "ugly" to pass an addr in the environment, but there really isn't 
> any cleaner option that I can see short of asking every library to provide us 
> with the ability to pass hwloc_topology_t down to them. Outside of that 
> obvious answer, I suppose we could put the hwloc_topology_t address into the 
> environment and have them connect that way?
> 
> 
>> On Feb 3, 2021, at 7:36 AM, Ralph Castain via devel 
>>  wrote:
>> 
>> I guess this begs the question: how does a library detect that the shmem 
>> region has already been mapped? If we attempt to map it and fail, does that 
>> mean it has already been mapped or that it doesn't exist?
>> 
>> It isn't reasonable to expect that all the libraries in a process will 
>> coordinate such that they "know" hwloc has been initialized by the main 
>> program, for example. So how do they determine that the topology is present, 
>> and how do they gain access to it?
>> 
>> 
>>> On Feb 3, 2021, at 6:07 AM, Brice Goglin via devel 
>>>  wrote:
>>> 
>>> Hello Ralph
>>> 
>>> One thing that isn't clear in this document : the hwloc shmem region may
>>> only be mapped *once* per process (because the mmap address is always
>>> the same). Hence, if a library calls adopt() in the process, others will
>>> fail. This applies to the 2nd and 3rd case in "Accessing the HWLOC
>>> topology tree from clients".
>>> 
>>> For the 3rd case where low-level libraries don't want to depend on PMIx,
>>> storing the pointer to the topology in an environment variable might be
>>> a (ugly) solution.
>>> 
>>> By the way, you may want to specify somewhere that all these libraries
>>> using the topology pointer in the process must use the same hwloc
>>> version (e.g. not 2.0 vs 2.4). shmem_adopt() verifies that the exporter
>>> and importer are compatible. But passing the topology pointer doesn't
>>> provide any way to verify that the caller doesn't use its own
>>> incompatible embedded hwloc.
>>> 
>>> Brice
>>> 
>>> 
>>> Le 02/02/2021 à 18:32, Ralph Castain via devel a écrit :
>>>> Hi folks
>>>> 
>>>> Per today's telecon, here is a link to a description of the HWLOC
>>>> duplication issue for many-core environments and methods by which you
>>>> can mitigate the impact.
>>>> 
>>>> https://openpmix.github.io/support/faq/avoid-hwloc-dup
>>>> 
>>>> George: for lower-level libs like treematch or HAN, you might want to
>>>> look at the envar method (described about half-way down the page) to
>>>> avoid directly linking those libraries against PMIx. That wouldn't be
>>>> a problem while inside OMPI, but could be an issue if people want to
>>>> use them in a non-PMIx environment.
>>>> 
>>>> Ralph
>>>> 
>>> 
>> 
>> 
> 
> 




Re: [OMPI devel] HWLOC duplication relief

2021-02-03 Thread Ralph Castain via devel
What if we do this:

- if you are using PMIx v4.1 or above, then there is no problem. Call 
PMIx_Load_topology and we will always return a valid pointer to the topology, 
subject to the caveat that all members of the process (as well as the server) 
must use the same hwloc version.

- if you are using PMIx v4.0 or below, then first do a PMIx_Get for 
PMIX_TOPOLOGY. If "not found", then try to get the shmem info and adopt it. If 
the shmem info isn't found, then do a topology_load to discover the topology. 
Either way, when done, do a PMIx_Store_internal of the hwloc_topology_t using 
the PMIX_TOPOLOGY key.
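
A rough sketch of that fallback order (error handling and the shmem-adoption
step are omitted; storing the pointer as a PMIX_POINTER value is an assumption
on my part, not something taken from the OMPI code):

    #include <pmix.h>
    #include <hwloc.h>

    static hwloc_topology_t get_topology(const pmix_proc_t *me)
    {
        pmix_value_t *val = NULL;
        hwloc_topology_t topo = NULL;

        /* 1) has someone in this process already stored it? */
        if (PMIX_SUCCESS == PMIx_Get(me, PMIX_TOPOLOGY, NULL, 0, &val)) {
            topo = (hwloc_topology_t) val->data.ptr;   /* assumes PMIX_POINTER */
            PMIX_VALUE_RELEASE(val);
            return topo;
        }

        /* 2) not found - discover it ourselves (the shmem adoption attempt
         *    described above would go here, before a full discovery) */
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        /* 3) cache the pointer so other libraries in this process can get it */
        pmix_value_t cache;
        PMIX_VALUE_CONSTRUCT(&cache);
        cache.type = PMIX_POINTER;
        cache.data.ptr = (void *) topo;
        PMIx_Store_internal(me, PMIX_TOPOLOGY, &cache);
        return topo;
    }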

This still leaves open the question of what to do with low-level libraries that 
really don't want to link against PMIx. I'm not sure what to do there. I agree 
it is "ugly" to pass an addr in the environment, but there really isn't any 
cleaner option that I can see short of asking every library to provide us with 
the ability to pass hwloc_topology_t down to them. Outside of that obvious 
answer, I suppose we could put the hwloc_topology_t address into the 
environment and have them connect that way?
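
Purely as an illustration of that last idea (the envar name below is made up,
and of course this only works for libraries living in the same process):

    #include <stdio.h>
    #include <stdlib.h>
    #include <hwloc.h>

    /* publisher side: whoever loaded or adopted the topology */
    static void publish_topology(hwloc_topology_t topo)
    {
        char buf[32];
        snprintf(buf, sizeof(buf), "%p", (void *) topo);
        setenv("HWLOC_TOPO_ADDR", buf, 1);
    }

    /* consumer side: a low-level library with no PMIx dependency */
    static hwloc_topology_t lookup_topology(void)
    {
        void *addr = NULL;
        const char *env = getenv("HWLOC_TOPO_ADDR");
        if (NULL != env && 1 == sscanf(env, "%p", &addr)) {
            return (hwloc_topology_t) addr;
        }
        return NULL;   /* not published - fall back to loading our own copy */
    }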


> On Feb 3, 2021, at 7:36 AM, Ralph Castain via devel 
>  wrote:
> 
> I guess this begs the question: how does a library detect that the shmem 
> region has already been mapped? If we attempt to map it and fail, does that 
> mean it has already been mapped or that it doesn't exist?
> 
> It isn't reasonable to expect that all the libraries in a process will 
> coordinate such that they "know" hwloc has been initialized by the main 
> program, for example. So how do they determine that the topology is present, 
> and how do they gain access to it?
> 
> 
>> On Feb 3, 2021, at 6:07 AM, Brice Goglin via devel 
>>  wrote:
>> 
>> Hello Ralph
>> 
>> One thing that isn't clear in this document : the hwloc shmem region may
>> only be mapped *once* per process (because the mmap address is always
>> the same). Hence, if a library calls adopt() in the process, others will
>> fail. This applies to the 2nd and 3rd case in "Accessing the HWLOC
>> topology tree from clients".
>> 
>> For the 3rd case where low-level libraries don't want to depend on PMIx,
>> storing the pointer to the topology in an environment variable might be
>> a (ugly) solution.
>> 
>> By the way, you may want to specify somewhere that all these libraries
>> using the topology pointer in the process must use the same hwloc
>> version (e.g. not 2.0 vs 2.4). shmem_adopt() verifies that the exporter
>> and importer are compatible. But passing the topology pointer doesn't
>> provide any way to verify that the caller doesn't use its own
>> incompatible embedded hwloc.
>> 
>> Brice
>> 
>> 
>> Le 02/02/2021 à 18:32, Ralph Castain via devel a écrit :
>>> Hi folks
>>> 
>>> Per today's telecon, here is a link to a description of the HWLOC
>>> duplication issue for many-core environments and methods by which you
>>> can mitigate the impact.
>>> 
>>> https://openpmix.github.io/support/faq/avoid-hwloc-dup
>>> 
>>> George: for lower-level libs like treematch or HAN, you might want to
>>> look at the envar method (described about half-way down the page) to
>>> avoid directly linking those libraries against PMIx. That wouldn't be
>>> a problem while inside OMPI, but could be an issue if people want to
>>> use them in a non-PMIx environment.
>>> 
>>> Ralph
>>> 
>> 
> 
> 




Re: [OMPI devel] HWLOC duplication relief

2021-02-03 Thread Ralph Castain via devel
I guess this begs the question: how does a library detect that the shmem region 
has already been mapped? If we attempt to map it and fail, does that mean it 
has already been mapped or that it doesn't exist?

It isn't reasonable to expect that all the libraries in a process will 
coordinate such that they "know" hwloc has been initialized by the main 
program, for example. So how do they determine that the topology is present, 
and how do they gain access to it?


> On Feb 3, 2021, at 6:07 AM, Brice Goglin via devel  
> wrote:
> 
> Hello Ralph
> 
> One thing that isn't clear in this document : the hwloc shmem region may
> only be mapped *once* per process (because the mmap address is always
> the same). Hence, if a library calls adopt() in the process, others will
> fail. This applies to the 2nd and 3rd case in "Accessing the HWLOC
> topology tree from clients".
> 
> For the 3rd case where low-level libraries don't want to depend on PMIx,
> storing the pointer to the topology in an environment variable might be
> a (ugly) solution.
> 
> By the way, you may want to specify somewhere that all these libraries
> using the topology pointer in the process must use the same hwloc
> version (e.g. not 2.0 vs 2.4). shmem_adopt() verifies that the exporter
> and importer are compatible. But passing the topology pointer doesn't
> provide any way to verify that the caller doesn't use its own
> incompatible embedded hwloc.
> 
> Brice
> 
> 
> Le 02/02/2021 à 18:32, Ralph Castain via devel a écrit :
>> Hi folks
>> 
>> Per today's telecon, here is a link to a description of the HWLOC
>> duplication issue for many-core environments and methods by which you
>> can mitigate the impact.
>> 
>> https://openpmix.github.io/support/faq/avoid-hwloc-dup
>> 
>> George: for lower-level libs like treematch or HAN, you might want to
>> look at the envar method (described about half-way down the page) to
>> avoid directly linking those libraries against PMIx. That wouldn't be
>> a problem while inside OMPI, but could be an issue if people want to
>> use them in a non-PMIx environment.
>> 
>> Ralph
>> 
> 




[OMPI devel] HWLOC duplication relief

2021-02-02 Thread Ralph Castain via devel
Hi folks

Per today's telecon, here is a link to a description of the HWLOC duplication 
issue for many-core environments and methods by which you can mitigate the 
impact.

https://openpmix.github.io/support/faq/avoid-hwloc-dup

George: for lower-level libs like treematch or HAN, you might want to look at 
the envar method (described about half-way down the page) to avoid directly 
linking those libraries against PMIx. That wouldn't be a problem while inside 
OMPI, but could be an issue if people want to use them in a non-PMIx 
environment.

Ralph



Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Ralph Castain via devel
It could be a Slurm issue, but I'm seeing one thing that makes me suspicious 
that this might be a problem reported elsewhere.

Andrej - what  version of Slurm are you using here?


> On Feb 1, 2021, at 5:34 PM, Gilles Gouaillardet via devel 
>  wrote:
> 
> Andrej,
> 
> that really looks like a SLURM issue that does not involve Open MPI
> 
> In order to confirm, you can
> 
> $ salloc -N 2 -n 2
> /* and then from the allocation */
> srun hostname
> 
> If this does not work, then this is a SLURM issue you have to fix.
> Once fixed, I am confident Open MPI will just work
> 
> Cheers,
> 
> Gilles
> 
> On Tue, Feb 2, 2021 at 10:22 AM Andrej Prsa via devel
>  wrote:
>> 
>> Hi Gilles,
>> 
>>> Here is what you can try
>>> 
>>> $ salloc -N 4 -n 384
>>> /* and then from the allocation */
>>> 
>>> $ srun -n 1 orted
>>> /* that should fail, but the error message can be helpful */
>>> 
>>> $ /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true
>> 
>> andrej@terra:~/system/tests/MPI$ salloc -N 4 -n 384
>> salloc: Granted job allocation 837
>> andrej@terra:~/system/tests/MPI$ srun -n 1 orted
>> srun: Warning: can't run 1 processes on 4 nodes, setting nnodes to 1
>> srun: launch/slurm: launch_p_step_launch: StepId=837.0 aborted before
>> step completely launched.
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: task 0 launch failed: Unspecified error
>> andrej@terra:~/system/tests/MPI$ /usr/local/bin/mpirun -mca plm slurm
>> -mca plm_base_verbose 10 true
>> [terra:179991] mca: base: components_register: registering framework plm
>> components
>> [terra:179991] mca: base: components_register: found loaded component slurm
>> [terra:179991] mca: base: components_register: component slurm register
>> function successful
>> [terra:179991] mca: base: components_open: opening plm components
>> [terra:179991] mca: base: components_open: found loaded component slurm
>> [terra:179991] mca: base: components_open: component slurm open function
>> successful
>> [terra:179991] mca:base:select: Auto-selecting plm components
>> [terra:179991] mca:base:select:(  plm) Querying component [slurm]
>> [terra:179991] [[INVALID],INVALID] plm:slurm: available for selection
>> [terra:179991] mca:base:select:(  plm) Query of component [slurm] set
>> priority to 75
>> [terra:179991] mca:base:select:(  plm) Selected component [slurm]
>> [terra:179991] plm:base:set_hnp_name: initial bias 179991 nodename hash
>> 2928217987
>> [terra:179991] plm:base:set_hnp_name: final jobfam 7711
>> [terra:179991] [[7711,0],0] plm:base:receive start comm
>> [terra:179991] [[7711,0],0] plm:base:setup_job
>> [terra:179991] [[7711,0],0] plm:slurm: LAUNCH DAEMONS CALLED
>> [terra:179991] [[7711,0],0] plm:base:setup_vm
>> [terra:179991] [[7711,0],0] plm:base:setup_vm creating map
>> [terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],1]
>> [terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon
>> [[7711,0],1] to node node9
>> [terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],2]
>> [terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon
>> [[7711,0],2] to node node10
>> [terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],3]
>> [terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon
>> [[7711,0],3] to node node11
>> [terra:179991] [[7711,0],0] plm:base:setup_vm add new daemon [[7711,0],4]
>> [terra:179991] [[7711,0],0] plm:base:setup_vm assigning new daemon
>> [[7711,0],4] to node node12
>> [terra:179991] [[7711,0],0] plm:slurm: launching on nodes
>> node9,node10,node11,node12
>> [terra:179991] [[7711,0],0] plm:slurm: Set prefix:/usr/local
>> [terra:179991] [[7711,0],0] plm:slurm: final top-level argv:
>> srun --ntasks-per-node=1 --kill-on-bad-exit --ntasks=4 orted -mca
>> ess "slurm" -mca ess_base_jobid "505348096" -mca ess_base_vpid "1" -mca
>> ess_base_num_procs "5" -mca orte_node_regex
>> "terra,node[1:9],node[2:10-12]@0(5)" -mca orte_hnp_uri
>> "505348096.0;tcp://10.9.2.10,192.168.1.1:38995" -mca plm_base_verbose "10"
>> [terra:179991] [[7711,0],0] plm:slurm: reset PATH:
>> /usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
>> [terra:179991] [[7711,0],0] plm:slurm: reset LD_LIBRARY_PATH: /usr/local/lib
>> srun: launch/slurm: launch_p_step_launch: StepId=837.1 aborted before
>> step completely launched.
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: task 3 launch failed: Unspecified error
>> srun: error: task 1 launch failed: Unspecified error
>> srun: error: task 2 launch failed: Unspecified error
>> srun: error: task 0 launch failed: Unspecified error
>> [terra:179991] [[7711,0],0] plm:slurm: primary daemons complete!
>> [terra:179991] [[7711,0],0] plm:base:receive stop comm
>> [terra:179991] mca: base: close: component slurm closed
>> [terra:179991] mca: base: close: unloading component slurm
>> 
>> This is 

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Ralph Castain via devel
The Slurm launch component would only disqualify itself if it didn't see a 
Slurm allocation - i.e., there is no SLURM_JOBID in the environment. If you 
want to use mpirun in a Slurm cluster, you need to:

1. get an allocation from Slurm using "salloc"

2. then run "mpirun"

Did you remember to get the allocation first?
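
In other words, roughly:

    $ salloc -N 4 -n 384          # 1. get the allocation (sets SLURM_JOBID etc.)
    $ mpirun python testmpi.py    # 2. run from inside that allocation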


> On Feb 1, 2021, at 4:04 PM, Andrej Prsa via devel  
> wrote:
> 
> Hi Ralph, Gilles,
> 
>> I fail to understand why you continue to think that PMI has anything to do 
>> with this problem. I see no indication of a PMIx-related issue in anything 
>> you have provided to date.
> 
> Oh, I went off the traceback that yelled about pmix, and slurm not being able 
> to find it until I patched the latest version; I'm an astrophysicist 
> pretending to be a sys admin for our research cluster, so while I can hold my 
> ground with c, python and technical computing, I'm out of my depths when it 
> comes to mpi, pmix, slurm and all that good stuff. So I appreciate your 
> patience. I am trying though. :)
> 
>> In the output below, it is clear what the problem is - you locked it to the 
>> "slurm" launcher (with -mca plm slurm) and the "slurm" launcher was not 
>> found. Try adding "--mca plm_base_verbose 10" to your cmd line and let's see 
>> why that launcher wasn't accepted.
> 
> andrej@terra:~/system/tests/MPI$ mpirun -mca plm_base_verbose 10 -mca plm 
> slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
> [terra:168998] mca: base: components_register: registering framework plm 
> components
> [terra:168998] mca: base: components_register: found loaded component slurm
> [terra:168998] mca: base: components_register: component slurm register 
> function successful
> [terra:168998] mca: base: components_open: opening plm components
> [terra:168998] mca: base: components_open: found loaded component slurm
> [terra:168998] mca: base: components_open: component slurm open function 
> successful
> [terra:168998] mca:base:select: Auto-selecting plm components
> [terra:168998] mca:base:select:(  plm) Querying component [slurm]
> [terra:168998] mca:base:select:(  plm) No component selected!
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_plm_base_select failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --
> 
> Gilles, I did try all the suggestions from the previous email but that led me 
> to think that slurm is the culprit, and now I'm back to openmpi.
> 
> Cheers,
> Andrej
> 




Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Ralph Castain via devel
Andrej

I fail to understand why you continue to think that PMI has anything to do with 
this problem. I see no indication of a PMIx-related issue in anything you have 
provided to date.

In the output below, it is clear what the problem is - you locked it to the 
"slurm" launcher (with -mca plm slurm) and the "slurm" launcher was not found. 
Try adding "--mca plm_base_verbose 10" to your cmd line and let's see why that 
launcher wasn't accepted.


> On Feb 1, 2021, at 2:47 PM, Gilles Gouaillardet via devel 
>  wrote:
> 
> Andrej,
> 
> My previous email listed other things to try
> 
> Cheers,
> 
> Gilles
> 
> Sent from my iPod
> 
>> On Feb 2, 2021, at 6:23, Andrej Prsa via devel  
>> wrote:
>> 
>> The saga continues.
>> 
>> I managed to build slurm with pmix by first patching slurm using this patch 
>> and manually building the plugin:
>> 
>> https://bugs.schedmd.com/show_bug.cgi?id=10683
>> 
>> Now srun shows pmix as an option:
>> 
>> andrej@terra:~/system/tests/MPI$ srun --mpi=list
>> srun: MPI types are...
>> srun: cray_shasta
>> srun: none
>> srun: pmi2
>> srun: pmix
>> srun: pmix_v4
>> 
>> But when I try to run mpirun with slurm plugin, it still fails:
>> 
>> andrej@terra:~/system/tests/MPI$ mpirun -mca ess_base_verbose 10 --mca 
>> pmix_base_verbose 10 -mca plm slurm -np 384 -H 
>> node15:96,node16:96,node17:96,node18:96 python testmpi.py
>> [terra:149214] mca: base: components_register: registering framework ess 
>> components
>> [terra:149214] mca: base: components_register: found loaded component slurm
>> [terra:149214] mca: base: components_register: component slurm has no 
>> register or open function
>> [terra:149214] mca: base: components_register: found loaded component env
>> [terra:149214] mca: base: components_register: component env has no register 
>> or open function
>> [terra:149214] mca: base: components_register: found loaded component pmi
>> [terra:149214] mca: base: components_register: component pmi has no register 
>> or open function
>> [terra:149214] mca: base: components_register: found loaded component tool
>> [terra:149214] mca: base: components_register: component tool register 
>> function successful
>> [terra:149214] mca: base: components_register: found loaded component hnp
>> [terra:149214] mca: base: components_register: component hnp has no register 
>> or open function
>> [terra:149214] mca: base: components_register: found loaded component 
>> singleton
>> [terra:149214] mca: base: components_register: component singleton register 
>> function successful
>> [terra:149214] mca: base: components_open: opening ess components
>> [terra:149214] mca: base: components_open: found loaded component slurm
>> [terra:149214] mca: base: components_open: component slurm open function 
>> successful
>> [terra:149214] mca: base: components_open: found loaded component env
>> [terra:149214] mca: base: components_open: component env open function 
>> successful
>> [terra:149214] mca: base: components_open: found loaded component pmi
>> [terra:149214] mca: base: components_open: component pmi open function 
>> successful
>> [terra:149214] mca: base: components_open: found loaded component tool
>> [terra:149214] mca: base: components_open: component tool open function 
>> successful
>> [terra:149214] mca: base: components_open: found loaded component hnp
>> [terra:149214] mca: base: components_open: component hnp open function 
>> successful
>> [terra:149214] mca: base: components_open: found loaded component singleton
>> [terra:149214] mca: base: components_open: component singleton open function 
>> successful
>> [terra:149214] mca:base:select: Auto-selecting ess components
>> [terra:149214] mca:base:select:(  ess) Querying component [slurm]
>> [terra:149214] mca:base:select:(  ess) Querying component [env]
>> [terra:149214] mca:base:select:(  ess) Querying component [pmi]
>> [terra:149214] mca:base:select:(  ess) Querying component [tool]
>> [terra:149214] mca:base:select:(  ess) Querying component [hnp]
>> [terra:149214] mca:base:select:(  ess) Query of component [hnp] set priority 
>> to 100
>> [terra:149214] mca:base:select:(  ess) Querying component [singleton]
>> [terra:149214] mca:base:select:(  ess) Selected component [hnp]
>> [terra:149214] mca: base: close: component slurm closed
>> [terra:149214] mca: base: close: unloading component slurm
>> [terra:149214] mca: base: close: component env closed
>> [terra:149214] mca: base: close: unloading component env
>> [terra:149214] mca: base: close: component pmi closed
>> [terra:149214] mca: base: close: unloading component pmi
>> [terra:149214] mca: base: close: component tool closed
>> [terra:149214] mca: base: close: unloading component tool
>> [terra:149214] mca: base: close: component singleton closed
>> [terra:149214] mca: base: close: unloading component singleton
>> --
>> It looks like orte_init failed for some reason; your parallel process is
>> 

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-01-31 Thread Ralph Castain via devel
Just trying to understand - why are you saying this is a pmix problem? 
Obviously, something to do with mpirun is failing, but I don't see any 
indication here that it has to do with pmix.

Can you add --enable-debug to your configure line and inspect the core file 
from the dump?
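
Something along these lines (the core file name/location varies by system, and
you may need "ulimit -c unlimited" for a core to be written at all):

    $ ./configure --enable-debug ...      # rebuild with debug symbols
    $ ulimit -c unlimited
    $ mpirun                              # reproduce the abort
    $ gdb /usr/local/bin/mpirun ./core
    (gdb) bt                              # where "malloc(): corrupted top size" fired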

> On Jan 31, 2021, at 8:18 PM, Andrej Prsa via devel  
> wrote:
> 
> Hello list,
> 
> I just upgraded openmpi from 4.0.3 to 4.1.0 to see if it would solve a weird 
> openpmix problem we've been having; I configured it using:
> 
> ./configure --prefix=/usr/local --with-pmix=internal --with-slurm 
> --without-tm --without-moab --without-singularity --without-fca 
> --without-hcoll --without-ime --without-lustre --without-psm --without-psm2 
> --without-mxm --with-gnu-ld
> 
> (I also have an external pmix version installed and tried using that instead 
> of internal, but it doesn't change anything). Here's the output of configure:
> 
> Open MPI configuration:
> ---
> Version: 4.1.0
> Build MPI C bindings: yes
> Build MPI C++ bindings (deprecated): no
> Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
> MPI Build Java bindings (experimental): no
> Build Open SHMEM support: false (no spml)
> Debug build: no
> Platform file: (none)
> 
> Miscellaneous
> ---
> CUDA support: no
> HWLOC support: external
> Libevent support: external
> PMIx support: Internal
> 
> Transports
> ---
> Cisco usNIC: no
> Cray uGNI (Gemini/Aries): no
> Intel Omnipath (PSM2): no
> Intel TrueScale (PSM): no
> Mellanox MXM: no
> Open UCX: no
> OpenFabrics OFI Libfabric: no
> OpenFabrics Verbs: no
> Portals4: no
> Shared memory/copy in+copy out: yes
> Shared memory/Linux CMA: yes
> Shared memory/Linux KNEM: no
> Shared memory/XPMEM: no
> TCP: yes
> 
> Resource Managers
> ---
> Cray Alps: no
> Grid Engine: no
> LSF: no
> Moab: no
> Slurm: yes
> ssh/rsh: yes
> Torque: no
> 
> OMPIO File Systems
> ---
> DDN Infinite Memory Engine: no
> Generic Unix FS: yes
> IBM Spectrum Scale/GPFS: no
> Lustre: no
> PVFS2/OrangeFS: no
> 
> Once configured, make and sudo make install worked without a glitch; but when 
> I run mpirun, I get this:
> 
> andrej@terra:~/system/openmpi-4.1.0$ mpirun --version
> mpirun (Open MPI) 4.1.0
> 
> Report bugs to http://www.open-mpi.org/community/help/
> andrej@terra:~/system/openmpi-4.1.0$ mpirun
> malloc(): corrupted top size
> Aborted (core dumped)
> 
> No matter what I try to run, it always segfaults. Any suggestions on what I 
> can try to resolve this?
> 
> Oh, I should also mention that I tried to remove the global libevent; openmpi 
> configured its internal copy but then failed to build.
> 
> Thanks,
> Andrej
> 




Re: [OMPI devel] Submodule change

2021-01-16 Thread Ralph Castain via devel
Hi folks

Just a heads-up that I have switched the PMIx submodule on OMPI master back to 
the PMIx master branch so we could pick up a fix for PSM2 support. You will 
need to do the usual dance again:

git submodule sync
git submodule update --init --recursive --remote

Ralph

On Dec 17, 2020, at 3:21 PM, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:

Just a point of clarification since there was a comment on the PR that made 
this change. This is _not_ a permanent situation, nor was it done because PMIx 
had achieved some magic milestone. We changed the submodule to point to the new 
v4.0 branch so we could test that branch prior to its release. Once it is 
released, we will switch the submodule back to tracking the PMIx master.

When OMPI branches for v5, we will adjust the submodule in that branch to point 
at the latest PMIx v4.x release. Or we may remove the submodule and replace it 
with a tarball - not sure which the RMs would prefer, but we can do either.


On Dec 17, 2020, at 9:03 AM, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:

Hi folks

I just switched OMPI's PMIx submodule to point at the new v4.0 branch. When you 
want to update, you may need to do the following after a "git pull":

git submodule sync
git submodule update --init --recursive --remote

to get yourself onto the proper branch.

Ralph





Re: [OMPI devel] Submodule change

2020-12-17 Thread Ralph Castain via devel
Just a point of clarification since there was a comment on the PR that made 
this change. This is _not_ a permanent situation, nor was it done because PMIx 
had achieved some magic milestone. We changed the submodule to point to the new 
v4.0 branch so we could test that branch prior to its release. Once it is 
released, we will switch the submodule back to tracking the PMIx master.

When OMPI branches for v5, we will adjust the submodule in that branch to point 
at the latest PMIx v4.x release. Or we may remove the submodule and replace it 
with a tarball - not sure which the RMs would prefer, but we can do either.


On Dec 17, 2020, at 9:03 AM, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:

Hi folks

I just switched OMPI's PMIx submodule to point at the new v4.0 branch. When you 
want to update, you may need to do the following after a "git pull":

git submodule sync
git submodule update --init --recursive --remote

to get yourself onto the proper branch.

Ralph




[OMPI devel] Submodule change

2020-12-17 Thread Ralph Castain via devel
Hi folks

I just switched OMPI's PMIx submodule to point at the new v4.0 branch. When you 
want to update, you may need to do the following after a "git pull":

git submodule sync
git submodule update --init --recursive --remote

to get yourself onto the proper branch.

Ralph



Re: [OMPI devel] OpenMPI with Slurm support

2020-08-05 Thread Ralph Castain via devel
When you launch a job with "srun", you have to include the option that 
specifies which MPI "glue" to use. You can find the available options with 
"--mpi=list" - you should see a "pmix_v3" option. Your cmd line should then be 
"srun --mpi=pmix_v3 ..."

You can also set that to be the default in your Slurm config file.
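As a quick sketch of both approaches (the application name is a placeholder, 
and the slurm.conf line assumes Slurm itself was built with PMIx support):

# per-job selection on the srun command line
srun --mpi=pmix_v3 -n 4 ./my_mpi_app

# or make it the site-wide default in slurm.conf
MpiDefault=pmix_v3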

> On Aug 5, 2020, at 8:40 AM, Luis Cebamanos via devel 
>  wrote:
> 
> Hi,
> 
> Just an update on this. I've managed to build OpenMPI 4.0.1 using
> 
> --with-pmix=/lustre/z04/pmix --width-slurm
> -with-libevent=/lustre/z04/pmix/libevent/
> 
> Unfortunately I keep getting the same warning when using srun (i.e. srun
> my_ompi_app.x):
> 
> The application appears to have been direct launched using "srun",
> but OMPI was not built with SLURM's PMI support and therefore cannot
> execute. There are several options for building PMI support under
> SLURM, depending upon the SLURM version you are using:
> 
>   version 16.05 or later: you can use SLURM's PMIx support. This
>   requires that you configure and build SLURM --with-pmix.
> 
>   Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>   PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>   install PMI-2. You must then build Open MPI using --with-pmi pointing
>   to the SLURM PMI library location.
> 
> Please configure as appropriate and try again.
> 
> We use slurm 20.02.3 so configuring --with-pmix it is the correct way
> according to this message. My worry is if we need to use the same PMIx
> library that was used to build slurm, because that one has not been made
> available.
> 
> Regards,
> Luis
> 
> 
> 
> On 05/08/2020 16:10, Luis Cebamanos via devel wrote:
>> Hi Ralph,
>> 
>> Thanks for pointing me to this. I've done that and although configure
>> does not report any errors, make won't build with following errors
>> 
>> Making all in mca/pmix/ext3x
>> make[2]: Entering directory '/lustre/z04/openmpi-4.0.1/opal/mca/pmix/ext3x'
>>   CC   mca_pmix_ext3x_la-ext3x_local.lo
>>   CC   mca_pmix_ext3x_la-ext3x.lo
>>   CC   mca_pmix_ext3x_la-ext3x_client.lo
>>   CC   mca_pmix_ext3x_la-ext3x_component.lo
>> In file included from ext3x_local.c:21:
>> ext3x.h:35:10: fatal error: pmix_server.h: No such file or directory
>>  #include "pmix_server.h"
>>   ^~~
>> 
>> My configure options are as follow
>> 
>> ./configure --prefix=/lustre/z04/ompi_slurm 
>> --with-pmix=/lustre/z04/pmix --with-pmi=/lustre/z04/pmix --with-slurm
>> --with-cuda=/lustre/sw/nvidia/hpcsdk/Linux_x86_64/cuda/10.1/include/
>> --with-libevent=/lustre/z04/pmix/libevent/
>> 
>> and I can see the that header file in the pmix built:
>> $ls /lustre/z04/pmix/include/
>> pmi2.h  pmix_common.h  pmix.h pmix_server.h  pmix_version.h
>> pmi.h   pmix_extend.h  pmix_rename.h  pmix_tool.h
>> 
>> What am I missing?
>> 
>> Cheers,
>> Luis
>> On 05/08/2020 15:21, Ralph Castain via devel wrote:
>>> For OMPI, I would recommend installing PMIx: 
>>> https://github.com/openpmix/openpmix/releases/tag/v3.1.5
>>> 
>>> 
>>>> On Aug 5, 2020, at 12:40 AM, Luis Cebamanos via devel 
>>>>  wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> We are trying to install OpenMPI with Slurm support on a recently
>>>> upgraded system. Unfortunately libpmi, libpmi2 or limpix don't seem to
>>>> be available. Could I install these libraries myself to build OpenMPI
>>>> and run it with Slurm? If so, which one would be needed? I guess these
>>>> libraries should have been installed and made available whith the Slurm
>>>> installation, but unfortunately they had not...
>>>> 
>>>> checking for pmi.h... yes
>>>> checking for libpmi in /usr/include/slurm/lib... checking for libpmi in
>>>> /usr/include/slurm/lib64... not found
>>>> checking for pmi2.h in /usr/include/slurm... found
>>>> checking pmi2.h usability... yes
>>>> checking pmi2.h presence... yes
>>>> checking for pmi2.h... yes
>>>> checking for libpmi2 in /usr/include/slurm/lib... checking for libpmi2
>>>> in /usr/include/slurm/lib64... not found
>>>> checking for pmix.h in /usr/include/slurm... not found
>>>> checking for pmix.h in /usr/include/slurm/include... not found
>>>> checking can PMI support be built... no
>>>> configure: WARNING: PMI support requested (via --with-pmi) but neither

Re: [OMPI devel] OpenMPI with Slurm support

2020-08-05 Thread Ralph Castain via devel
This looks like a bug in the OMPI glue. Let OMPI use its embedded PMIx version 
(which is the same release anyway); OMPI does not need to use the same PMIx 
installation that Slurm uses. You only need the separate PMIx installation to 
build Slurm's PMIx support.
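In other words, something along these lines should work (a sketch only - the 
paths are the ones from this thread, and the exact flag spellings should be 
checked against your OMPI version):

# build OMPI against its bundled PMIx; the external copy under
# /lustre/z04/pmix is only needed when building Slurm's PMIx plugin
./configure --prefix=/lustre/z04/ompi_slurm \
    --with-slurm \
    --with-pmix=internal \
    --with-libevent=internal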


> On Aug 5, 2020, at 8:10 AM, Luis Cebamanos via devel 
>  wrote:
> 
> Hi Ralph,
> 
> Thanks for pointing me to this. I've done that and although configure
> does not report any errors, make won't build with following errors
> 
> Making all in mca/pmix/ext3x
> make[2]: Entering directory '/lustre/z04/openmpi-4.0.1/opal/mca/pmix/ext3x'
>   CC   mca_pmix_ext3x_la-ext3x_local.lo
>   CC   mca_pmix_ext3x_la-ext3x.lo
>   CC   mca_pmix_ext3x_la-ext3x_client.lo
>   CC   mca_pmix_ext3x_la-ext3x_component.lo
> In file included from ext3x_local.c:21:
> ext3x.h:35:10: fatal error: pmix_server.h: No such file or directory
>  #include "pmix_server.h"
>   ^~~
> 
> My configure options are as follow
> 
> ./configure --prefix=/lustre/z04/ompi_slurm 
> --with-pmix=/lustre/z04/pmix --with-pmi=/lustre/z04/pmix --with-slurm
> --with-cuda=/lustre/sw/nvidia/hpcsdk/Linux_x86_64/cuda/10.1/include/
> --with-libevent=/lustre/z04/pmix/libevent/
> 
> and I can see the that header file in the pmix built:
> $ls /lustre/z04/pmix/include/
> pmi2.h  pmix_common.h  pmix.h pmix_server.h  pmix_version.h
> pmi.h   pmix_extend.h  pmix_rename.h  pmix_tool.h
> 
> What am I missing?
> 
> Cheers,
> Luis
> On 05/08/2020 15:21, Ralph Castain via devel wrote:
>> For OMPI, I would recommend installing PMIx: 
>> https://github.com/openpmix/openpmix/releases/tag/v3.1.5
>> 
>> 
>>> On Aug 5, 2020, at 12:40 AM, Luis Cebamanos via devel 
>>>  wrote:
>>> 
>>> Hi all,
>>> 
>>> We are trying to install OpenMPI with Slurm support on a recently
>>> upgraded system. Unfortunately libpmi, libpmi2 or limpix don't seem to
>>> be available. Could I install these libraries myself to build OpenMPI
>>> and run it with Slurm? If so, which one would be needed? I guess these
>>> libraries should have been installed and made available whith the Slurm
>>> installation, but unfortunately they had not...
>>> 
>>> checking for pmi.h... yes
>>> checking for libpmi in /usr/include/slurm/lib... checking for libpmi in
>>> /usr/include/slurm/lib64... not found
>>> checking for pmi2.h in /usr/include/slurm... found
>>> checking pmi2.h usability... yes
>>> checking pmi2.h presence... yes
>>> checking for pmi2.h... yes
>>> checking for libpmi2 in /usr/include/slurm/lib... checking for libpmi2
>>> in /usr/include/slurm/lib64... not found
>>> checking for pmix.h in /usr/include/slurm... not found
>>> checking for pmix.h in /usr/include/slurm/include... not found
>>> checking can PMI support be built... no
>>> configure: WARNING: PMI support requested (via --with-pmi) but neither
>>> pmi.h,
>>> configure: WARNING: pmi2.h or pmix.h were found under locations:
>>> configure: WARNING: /usr/include/slurm
>>> configure: WARNING: /usr/include/slurm/slurm
>>> configure: WARNING: Specified path: /usr/include/slurm
>>> configure: WARNING: OR neither libpmi, libpmi2, or libpmix were found under:
>>> configure: WARNING: /lib
>>> configure: WARNING: /lib64
>>> 
>>> 
>>> Regards,
>>> Luis
>>> 
>>> The University of Edinburgh is a charitable body, registered in Scotland, 
>>> with registration number SC005336.
>>> 
>> 
> 
> -- 
> ~ | EPCC | ~
> 
> Luis Cebamanos, HPC Applications Consultant
> Email: l.cebama...@epcc.ed.ac.uk Phone: +44 (0) 131 651 3479  
>  
> http://www.epcc.ed.ac.uk/   
> The Bayes Centre, 47 Potterrow, Edinburgh UK
> EH8 9BT
> 
> 
> 
> 




Re: [OMPI devel] OpenMPI with Slurm support

2020-08-05 Thread Ralph Castain via devel
For OMPI, I would recommend installing PMIx: 
https://github.com/openpmix/openpmix/releases/tag/v3.1.5


> On Aug 5, 2020, at 12:40 AM, Luis Cebamanos via devel 
>  wrote:
> 
> Hi all,
> 
> We are trying to install OpenMPI with Slurm support on a recently
> upgraded system. Unfortunately libpmi, libpmi2 or limpix don't seem to
> be available. Could I install these libraries myself to build OpenMPI
> and run it with Slurm? If so, which one would be needed? I guess these
> libraries should have been installed and made available whith the Slurm
> installation, but unfortunately they had not...
> 
> checking for pmi.h... yes
> checking for libpmi in /usr/include/slurm/lib... checking for libpmi in
> /usr/include/slurm/lib64... not found
> checking for pmi2.h in /usr/include/slurm... found
> checking pmi2.h usability... yes
> checking pmi2.h presence... yes
> checking for pmi2.h... yes
> checking for libpmi2 in /usr/include/slurm/lib... checking for libpmi2
> in /usr/include/slurm/lib64... not found
> checking for pmix.h in /usr/include/slurm... not found
> checking for pmix.h in /usr/include/slurm/include... not found
> checking can PMI support be built... no
> configure: WARNING: PMI support requested (via --with-pmi) but neither
> pmi.h,
> configure: WARNING: pmi2.h or pmix.h were found under locations:
> configure: WARNING: /usr/include/slurm
> configure: WARNING: /usr/include/slurm/slurm
> configure: WARNING: Specified path: /usr/include/slurm
> configure: WARNING: OR neither libpmi, libpmi2, or libpmix were found under:
> configure: WARNING: /lib
> configure: WARNING: /lib64
> 
> 
> Regards,
> Luis
> 
> The University of Edinburgh is a charitable body, registered in Scotland, 
> with registration number SC005336.
> 




Re: [OMPI devel] Libevent changes

2020-07-10 Thread Ralph Castain via devel
We forgot to discuss this at the last telecon - GP, would you please ensure it 
is on next week's agenda?

FWIW: I agree that this should not have been committed. We need to stop making 
local patches to public packages and instead focus on getting the changes 
accepted upstream (which still has not happened for this one, I believe?).


> On Jul 6, 2020, at 8:13 AM, Barrett, Brian via devel 
>  wrote:
> 
> https://github.com/open-mpi/ompi/pull/6784 went in while I was on vacation.  
> I'm a little confused; I thought we no longer patched libevent locally?  This 
> is certainly going to be a problem as we move to external dependencies; we 
> won't have a way of pulling in this change (whether using the bundled 
> libevent or not).  I don't think we should have this patch locally in master, 
> because we're going to lose it in the next couple of weeks with the configure 
> changes I'm hopefully near completing.
> 
> Brian
> 




Re: [OMPI devel] Announcing Open MPI v4.0.4rc2

2020-06-06 Thread Ralph Castain via devel
I would have hoped that the added protections we put into PMIx would have 
resolved ds12 as well as ds21, but it is possible those changes didn't make it 
into OMPI v4.0.x. Regardless, I think you should be just fine using the 
gds/hash component for Cygwin. I would suggest simply "locking" that param into 
the PMIx default param file.
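A minimal sketch of that lock-in, assuming the default parameter file lives at 
$PMIX_PREFIX/etc/pmix-mca-params.conf (adjust the path to your install):

# $PMIX_PREFIX/etc/pmix-mca-params.conf
# force the hash component so the ds12/ds21 dstore paths are never used
gds = hash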


On Jun 5, 2020, at 8:22 PM, Marco Atzeri via devel <devel@lists.open-mpi.org> wrote:

On 05.06.2020 22:29, Marco Atzeri wrote:
On 01.06.2020 20:26, Geoffrey Paulsen via devel wrote:
Open MPI v4.0.4rc2 is now available for download and test at: 
https://www.open-mpi.org/software/ompi/v4.0/

It builds on Cygwin64 bit, and this time it runs,
but the PMIX is throwing some errors:
$ mpirun  -n 4 ./hello_c.exe
[LAPTOP-82F08ILC:02866] PMIX ERROR: INIT in file 
/pub/devel/openmpi/v4.0/openmpi-4.0.4-0.2.x86_64/src/openmpi-4.0.4rc2/opal/mca/pmix/pmix3x/pmix/src/mca/gds/ds12/gds_ds12_lock_pthread.c
 at line 138
[LAPTOP-82F08ILC:02866] PMIX ERROR: SUCCESS in file 
/pub/devel/openmpi/v4.0/openmpi-4.0.4-0.2.x86_64/src/openmpi-4.0.4rc2/opal/mca/pmix/pmix3x/pmix/src/mca/common/dstore/dstore_base.c
 at line 2450
[LAPTOP-82F08ILC:02867] PMIX ERROR: ERROR in file 
/pub/devel/openmpi/v4.0/openmpi-4.0.4-0.2.x86_64/src/openmpi-4.0.4rc2/opal/mca/pmix/pmix3x/pmix/src/mca/gds/ds12/gds_ds12_lock_pthread.c
 at line 168
[LAPTOP-82F08ILC:02868] PMIX ERROR: ERROR in file 
/pub/devel/openmpi/v4.0/openmpi-4.0.4-0.2.x86_64/src/openmpi-4.0.4rc2/opal/mca/pmix/pmix3x/pmix/src/mca/gds/ds12/gds_ds12_lock_pthread.c
 at line 168
[LAPTOP-82F08ILC:02869] PMIX ERROR: ERROR in file 
/pub/devel/openmpi/v4.0/openmpi-4.0.4-0.2.x86_64/src/openmpi-4.0.4rc2/opal/mca/pmix/pmix3x/pmix/src/mca/gds/ds12/gds_ds12_lock_pthread.c
 at line 168
[LAPTOP-82F08ILC:02870] PMIX ERROR: ERROR in file 
/pub/devel/openmpi/v4.0/openmpi-4.0.4-0.2.x86_64/src/openmpi-4.0.4rc2/opal/mca/pmix/pmix3x/pmix/src/mca/gds/ds12/gds_ds12_lock_pthread.c
 at line 168
Hello, world, I am 2 of 4, (Open MPI v4.0.4rc2, package: Open MPI 
Marco@LAPTOP-82F08ILC Distribution, ident: 4.0.4rc2, repo rev: v4.0.4rc2, May 
29, 2020, 125)
Hello, world, I am 3 of 4, (Open MPI v4.0.4rc2, package: Open MPI 
Marco@LAPTOP-82F08ILC Distribution, ident: 4.0.4rc2, repo rev: v4.0.4rc2, May 
29, 2020, 125)
Hello, world, I am 0 of 4, (Open MPI v4.0.4rc2, package: Open MPI 
Marco@LAPTOP-82F08ILC Distribution, ident: 4.0.4rc2, repo rev: v4.0.4rc2, May 
29, 2020, 125)
Hello, world, I am 1 of 4, (Open MPI v4.0.4rc2, package: Open MPI 
Marco@LAPTOP-82F08ILC Distribution, ident: 4.0.4rc2, repo rev: v4.0.4rc2, May 
29, 2020, 125)
[LAPTOP-82F08ILC:02866] [[19692,0],0] unable to open debugger attach fifo
the second is a bit funny "PMIX ERROR: SUCCESS "
In the past on 3.1.5
https://www.mail-archive.com/devel@lists.open-mpi.org//msg21004.html
the workaround "gds = ^ds21" was used but now it seems ds12 is having problems

taking the hint from
https://github.com/open-mpi/ompi/issues/7516

PMIX_MCA_gds=hash

$ mpirun  -n 4 ./hello_c.exe
Hello, world, I am 0 of 4, (Open MPI v4.0.4rc2, package: Open MPI 
Marco@LAPTOP-82F08ILC Distribution, ident: 4.0.4rc2, repo rev: v4.0.4rc2, May 
29, 2020, 125)
Hello, world, I am 1 of 4, (Open MPI v4.0.4rc2, package: Open MPI 
Marco@LAPTOP-82F08ILC Distribution, ident: 4.0.4rc2, repo rev: v4.0.4rc2, May 
29, 2020, 125)
Hello, world, I am 2 of 4, (Open MPI v4.0.4rc2, package: Open MPI 
Marco@LAPTOP-82F08ILC Distribution, ident: 4.0.4rc2, repo rev: v4.0.4rc2, May 
29, 2020, 125)
Hello, world, I am 3 of 4, (Open MPI v4.0.4rc2, package: Open MPI 
Marco@LAPTOP-82F08ILC Distribution, ident: 4.0.4rc2, repo rev: v4.0.4rc2, May 
29, 2020, 125)
[LAPTOP-82F08ILC:02913] [[19647,0],0] unable to open debugger attach fifo



Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread Ralph Castain via devel
My best guess is that port 1024 is being blocked in some fashion. Depending on 
how you start it, OMPI may well pick a different port (it all depends on what 
it gets assigned by the OS) that lets it make the connection. You could verify 
this by setting "OMPI_MCA_btl_tcp_port_min_v4=" to raise the lowest port the 
TCP BTL will try.



On May 4, 2020, at 9:19 AM, John DelSignore <jdelsign...@perforce.com> wrote:

That seems to work (much to my surprise):

mic:/amd/home/jdelsign/PMIx>pterm
pterm failed to initialize, likely due to no DVM being available
mic:/amd/home/jdelsign/PMIx>cat myhostfile3+1
microway3 slots=16
microway1 slots=16
mic:/amd/home/jdelsign/PMIx>cat myhostfile1+3
microway1 slots=16
microway3 slots=16
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile3+1 ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway3.totalviewtech.com
Hello from proc (1): microway3.totalviewtech.com
Hello from proc (2): microway1
All Done!
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile3+1 ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway3.totalviewtech.com
Hello from proc (1): microway3.totalviewtech.com
Hello from proc (2): microway1
All Done!
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile1+3 ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway1
Hello from proc (1): microway1
Hello from proc (2): microway3.totalviewtech.com
All Done!
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile1+3 ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway1
Hello from proc (1): microway1
Hello from proc (2): microway3.totalviewtech.com
All Done!
mic:/amd/home/jdelsign/PMIx>

Does that mean there is something wrong with microway2? If that were the case, 
then why would it ever work?

On 2020-05-04 12:08, Ralph Castain via devel wrote:
What happens if you run your "3 procs on two nodes" case using just microway1 
and 3 (i.e., omit microway2)?


On May 4, 2020, at 9:05 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:

Hi George,

10.71.2.58 is microway2 (which has been used in all of the configurations I've 
tried, so maybe that's why it appears to be the common denominator):

lid:/amd/home/jdelsign>host -l totalviewtech.com |grep microway
microway1.totalviewtech.com has address 10.71.2.52
microway2.totalviewtech.com has address 10.71.2.58
microway3.totalviewtech.com has address 10.71.2.55
lid:/amd/home/jdelsign>

All three systems are on the same Ethernet, and in fact they are probably all 
in the same rack. AFAIK, there is no firewall and there is no restriction on 
port 1024. There are configurations that work OK on those same three nodes:

mic:/amd/home/jdelsign/PMIx>pterm
pterm failed to initialize, likely due to no DVM being available
mic:/amd/home/jdelsign/PMIx>
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway1
Hello from proc (1): microway2.totalviewtech.com
Hello from proc (2): microway3.totalviewtech.com
All Done!
mic:/amd/home/jdelsign/PMIx>cat myhostfile
microway1 slots=16
microway2 slots=16
microway3 slots=16
mic:/amd/home/jdelsign/PMIx>

I haven't been able to find a combo where prun works. On the other hand, mpirun 
works in the above case, but not in the case where I put three processes on two 
nodes.

Cheers, John D.

On 2020-05-04 11:42, George Bosilca wrote:
John,

The common denominator across all these errors is an error from connect while 
trying to connect to 10.71.2.58 on port 1024. Who is 10.71.2.58 ? If the 
firewall open ? Is the port 1024 allowed to connect to ?


  George.


On Mon, May 4, 2020 at 11:36 AM John DelSignore via devel <devel@lists.open-mpi.org> wrote:
Inline below...

On 2020-05-04 11:09, Ralph Castain via devel wrote:
Staring at this some more, I do have the following questions:

* in your first case, it looks like "prte" was started from microway3 - correct?

Yes, "prte" was started from microway3. 


* in the second case, that worked, it looks like "mpirun" was executed from 
microway1 - correct?
No, "mpirun" was executed from micr

Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread Ralph Castain via devel
What happens if you run your "3 procs on two nodes" case using just microway1 
and 3 (i.e., omit microway2)?


On May 4, 2020, at 9:05 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:

Hi George,

10.71.2.58 is microway2 (which has been used in all of the configurations I've 
tried, so maybe that's why it appears to be the common denominator):

lid:/amd/home/jdelsign>host -l totalviewtech.com |grep microway
microway1.totalviewtech.com has address 10.71.2.52
microway2.totalviewtech.com has address 10.71.2.58
microway3.totalviewtech.com has address 10.71.2.55
lid:/amd/home/jdelsign>

All three systems are on the same Ethernet, and in fact they are probably all 
in the same rack. AFAIK, there is no firewall and there is no restriction on 
port 1024. There are configurations that work OK on those same three nodes:

mic:/amd/home/jdelsign/PMIx>pterm
pterm failed to initialize, likely due to no DVM being available
mic:/amd/home/jdelsign/PMIx>
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway1
Hello from proc (1): microway2.totalviewtech.com
Hello from proc (2): microway3.totalviewtech.com
All Done!
mic:/amd/home/jdelsign/PMIx>cat myhostfile
microway1 slots=16
microway2 slots=16
microway3 slots=16
mic:/amd/home/jdelsign/PMIx>

I haven't been able to find a combo where prun works. On the other hand, mpirun 
works in the above case, but not in the case where I put three processes on two 
nodes.

Cheers, John D.

On 2020-05-04 11:42, George Bosilca wrote:
John,

The common denominator across all these errors is an error from connect while 
trying to connect to 10.71.2.58 on port 1024. Who is 10.71.2.58 ? If the 
firewall open ? Is the port 1024 allowed to connect to ?


  George.


On Mon, May 4, 2020 at 11:36 AM John DelSignore via devel <devel@lists.open-mpi.org> wrote:
Inline below...

On 2020-05-04 11:09, Ralph Castain via devel wrote:
Staring at this some more, I do have the following questions:

* in your first case, it looks like "prte" was started from microway3 - correct?

Yes, "prte" was started from microway3. 


* in the second case, that worked, it looks like "mpirun" was executed from 
microway1 - correct?
No, "mpirun" was executed from microway3.

* in the third case, you state that "mpirun" was again executed from microway3, 
and the process output confirms that
Yes, "mpirun" was started from microway3.

I'm wondering if the issue here might actually be that PRRTE expects the 
ordering of hosts in the hostfile to start with the host it is sitting on - 
i.e., if the node index number between the various daemons is getting confused. 
Can you perhaps see what happens with the failing cases if you put microway3 at 
the top of the hostfile and execute prte/mpirun from microway3 as before?

OK, the first failing case:

mic:/amd/home/jdelsign/PMIx>pterm
pterm failed to initialize, likely due to no DVM being available
mic:/amd/home/jdelsign/PMIx>cat myhostfile3
microway3 slots=16
microway1 slots=16
microway2 slots=16
mic:/amd/home/jdelsign/PMIx>prte --hostfile ./myhostfile3 --daemonize
mic:/amd/home/jdelsign/PMIx>prun -n 3 --map-by node -x MESSAGE=name 
--personality ompi ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway3.totalviewtech.com
Hello from proc (1): microway1
Hello from proc (2): microway2.totalviewtech.com
--
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: microway1
  PID:    292266
  Message:    connect() to 10.71.2.58:1024 failed
  Error:  No route to host (113)
--
[microway1:292266] 
../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
mic:/amd/home/jdelsign/PMIx>hostname
microway3.totalviewtech.com
mic:/amd/home/jdelsign/PMIx>

And the second failing test case:

mic:/amd/home/jdelsign/PMIx>pterm
pterm failed to initialize, likely due to no DVM being available
mic:/amd/home/jdelsign/PMIx>cat myhostfile3+2
microway3 slots=16
microway2 slots=16
mic:/amd/home/jdelsign/PMIx>
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile3+2 ./tx_basic_mpi
tx_basic_mpi

Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread Ralph Castain via devel
Good to confirm - thanks! This does indeed look like an issue in the btl/tcp 
component's reachability code.



On May 4, 2020, at 8:34 AM, John DelSignore <jdelsign...@perforce.com> wrote:

Inline below...

On 2020-05-04 11:09, Ralph Castain via devel wrote:
Staring at this some more, I do have the following questions:

* in your first case, it looks like "prte" was started from microway3 - correct?

Yes, "prte" was started from microway3. 


* in the second case, that worked, it looks like "mpirun" was executed from 
microway1 - correct?
No, "mpirun" was executed from microway3.

* in the third case, you state that "mpirun" was again executed from microway3, 
and the process output confirms that
Yes, "mpirun" was started from microway3.

I'm wondering if the issue here might actually be that PRRTE expects the 
ordering of hosts in the hostfile to start with the host it is sitting on - 
i.e., if the node index number between the various daemons is getting confused. 
Can you perhaps see what happens with the failing cases if you put microway3 at 
the top of the hostfile and execute prte/mpirun from microway3 as before?

OK, the first failing case:

mic:/amd/home/jdelsign/PMIx>pterm
pterm failed to initialize, likely due to no DVM being available
mic:/amd/home/jdelsign/PMIx>cat myhostfile3
microway3 slots=16
microway1 slots=16
microway2 slots=16
mic:/amd/home/jdelsign/PMIx>prte --hostfile ./myhostfile3 --daemonize
mic:/amd/home/jdelsign/PMIx>prun -n 3 --map-by node -x MESSAGE=name 
--personality ompi ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway3.totalviewtech.com
Hello from proc (1): microway1
Hello from proc (2): microway2.totalviewtech.com
--
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: microway1
  PID:    292266
  Message:    connect() to 10.71.2.58:1024 failed
  Error:  No route to host (113)
--
[microway1:292266] 
../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
mic:/amd/home/jdelsign/PMIx>hostname
microway3.totalviewtech.com
mic:/amd/home/jdelsign/PMIx>

And the second failing test case:

mic:/amd/home/jdelsign/PMIx>pterm
pterm failed to initialize, likely due to no DVM being available
mic:/amd/home/jdelsign/PMIx>cat myhostfile3+2
microway3 slots=16
microway2 slots=16
mic:/amd/home/jdelsign/PMIx>
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile3+2 ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway3.totalviewtech.com
Hello from proc (1): microway3.totalviewtech.com
--
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: microway3
  PID:    271144
  Message:    connect() to 10.71.2.58:1024 failed
  Error:  No route to host (113)
--
[microway3.totalviewtech.com:271144] 
../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
Hello from proc (2): microway2.totalviewtech.com
mic:/amd/home/jdelsign/PMIx>

So, AFAICT, host name order didn't matter.

Cheers, John D.



 


On May 4, 2020, at 7:34 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:

Hi folks,

I cloned a fresh copy of OMPI master this morning at ~8:30am EDT and rebuilt. 
I'm running a very simple test code on three Centos 7.[56] nodes named 
microway[123] over TCP. I'm seeing a fatal error similar to the following:

[microway3.totalviewtech.com:227713] 
../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL

The case of prun launching an OMPI code does not work correctly. The MPI 
processes seem to launch OK, but there is the following OMPI error at the point 
where the processes communicate. In the following case, I have DVM running on 
three nodes "microway[123]":

mic:/amd/home/jdelsign/PMIx>prun -n 3 --map-by node -x MESSAGE=name 
--personality ompi ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway3.totalviewtech.com
Hello from proc (1): microway1
Hello from proc (2): microway2.totalviewtech.com
---

Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread Ralph Castain via devel
Staring at this some more, I do have the following questions:

* in your first case, it looks like "prte" was started from microway3 - correct?

* in the second case, that worked, it looks like "mpirun" was executed from 
microway1 - correct?

* in the third case, you state that "mpirun" was again executed from microway3, 
and the process output confirms that

I'm wondering if the issue here might actually be that PRRTE expects the 
ordering of hosts in the hostfile to start with the host it is sitting on - 
i.e., if the node index number between the various daemons is getting confused. 
Can you perhaps see what happens with the failing cases if you put microway3 at 
the top of the hostfile and execute prte/mpirun from microway3 as before?
 


On May 4, 2020, at 7:34 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:

Hi folks,

I cloned a fresh copy of OMPI master this morning at ~8:30am EDT and rebuilt. 
I'm running a very simple test code on three Centos 7.[56] nodes named 
microway[123] over TCP. I'm seeing a fatal error similar to the following:

[microway3.totalviewtech.com:227713] 
../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL

The case of prun launching an OMPI code does not work correctly. The MPI 
processes seem to launch OK, but there is the following OMPI error at the point 
where the processes communicate. In the following case, I have DVM running on 
three nodes "microway[123]":

mic:/amd/home/jdelsign/PMIx>prun -n 3 --map-by node -x MESSAGE=name 
--personality ompi ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway3.totalviewtech.com  
Hello from proc (1): microway1
Hello from proc (2): microway2.totalviewtech.com 
 
--
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: microway1
  PID:    282716
  Message:    connect() to 10.71.2.58:1024 failed
  Error:  No route to host (113)
--
[microway1:282716] 
../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
--
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: microway3
  Local PID:  214271
  Peer host:  microway1
--
mic:/amd/home/jdelsign/PMIx>

If I use mpirun to launch the program it works whether or not a DVM is already 
running (first without a DVM, then with a DVM):

mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway1
Hello from proc (1): microway2.totalviewtech.com 
 
Hello from proc (2): microway3.totalviewtech.com 
 
All Done!
mic:/amd/home/jdelsign/PMIx>
mic:/amd/home/jdelsign/PMIx>prte --hostfile ./myhostfile --daemonize
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway1
Hello from proc (1): microway2.totalviewtech.com 
 
Hello from proc (2): microway3.totalviewtech.com 
 
All Done!
mic:/amd/home/jdelsign/PMIx>

But if I use mpirun to launch 3 processes from microway3 and use a hostfile 
that contains only microway[23], I get a similar failure as the prun case:

mic:/amd/home/jdelsign/PMIx>hostname
microway3.totalviewtech.com  
mic:/amd/home/jdelsign/PMIx>cat myhostfile2
microway2 slots=16
microway3 slots=16
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile2 ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway2.totalviewtech.com  
Hello from proc (1): microway2.totalviewtech.com 
 
--
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local 

[OMPI devel] ORTE->PRRTE: some consequences to communicate to users

2020-04-28 Thread Ralph Castain via devel
So here is an interesting consequence of moving from ORTE to PRRTE. In ORTE, 
you could express any mapping policy as an MCA param - e.g., the following:

OMPI_MCA_rmaps_base_mapping_policy=core
OMPI_MCA_rmaps_base_display_map=1

would be the equivalent of a cmd line that included "--map-by core 
--display-map"

When defining what we wanted on the OMPI v5 cmd line, we removed some options 
like --display-map and replaced them with modifiers, so the above would have 
been replaced with:

OMPI_MCA_rmaps_base_mapping_policy=core:display

The move to PRRTE, however, means more than just changing the "OMPI" prefix to 
"PRRTE". PRRTE doesn't support setting the default mapping policy to include 
"display", as that would mean reporting the map for every job that was ever 
launched. Definitely not something the persistent DVM users would appreciate!

So if you put:

PRRTE_MCA_rmaps_default_mapping_policy=core:display

(note the name change!!!) in your environment, you are going to get an error 
when you execute "mpirun":

=
A mapping policy modifier was provided that is not supported as a default value:

 Modifier:  display

You can provide this modifier on a per-job basis, but it cannot
be the default setting.
===

And you will error out. However, it is perfectly okay to put "--map-by 
core:display" on your cmd line - that is legit and understood as it only 
applies to that specific job.

It's these changes, plus the name changes (e.g., we replace "base" with 
"default" to emphasize these are ONLY the default settings), that will need to 
be communicated.
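To summarize the distinction with a small sketch (the params file name is 
illustrative):

# default settings - no per-job modifiers such as ":display" allowed here
# (in the environment or in a prte-mca-params.conf file)
export PRRTE_MCA_rmaps_default_mapping_policy=core

# per-job modifiers go on the command line instead
mpirun --map-by core:display -n 8 ./a.out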

Ralph


Re: [OMPI devel] MPI_Info args to spawn - resolving deprecated values?

2020-04-24 Thread Ralph Castain via devel
I have completed the deprecation/warning code - would people please take a 
quick look to ensure that this is what we want?

https://github.com/open-mpi/ompi/pull/7662

Ralph


On Apr 8, 2020, at 9:12 AM, George Bosilca via devel <devel@lists.open-mpi.org> wrote:

Deprecate, warn, and convert seems reasonable. But for how long?

As the number of automatic conversions OMPI supports has tended to increase, 
and as these conversions happen all over the code base, we might want to set 
up a well-defined deprecation path: what has been deprecated and when, how 
long we intend to keep the conversions around, and when they should be 
completely removed (or moved to a state where we warn but do not convert).

  George.




On Wed, Apr 8, 2020 at 10:33 AM Jeff Squyres (jsquyres) via devel <devel@lists.open-mpi.org> wrote:
On Apr 8, 2020, at 9:51 AM, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:
> 
> We have deprecated a number of cmd line options (e.g., bynode, npernode, 
> npersocket) - what do we want to do about their MPI_Info equivalents when 
> calling comm_spawn?
> 
> Do I silently convert them? Should we output a deprecation warning? Return an 
> error?


We should probably do something similar to what happens on the command line 
(i.e., warn and convert).

-- 
Jeff Squyres
jsquy...@cisco.com




[OMPI devel] Mapping/ranking/binding defaults for OMPI v5

2020-04-24 Thread Ralph Castain via devel
I just want to confirm the default behaviors we want for OMPI v5. This is what 
we have currently set:

* if the user specifies nothing:
if np <=2: map-by core, rank-by core, bind-to core
if np > 2: map-by socket, rank-by core, bind-to socket

* if the user only specifies map-by:
rank-by  and bind-to adopt the same policy - e.g., map-by numa would 
result in rank-by numa, bind-to numa

If anyone wants something else, NOW is the time to speak up!
Ralph




[OMPI devel] Cross-job shared memory support

2020-04-12 Thread Ralph Castain via devel
There was a recent discussion regarding whether or not two jobs could 
communicate via shared memory. I recalled adding support for this, but thought 
that Nathan needed to do something in "vader" to enable it. Turns out I 
remembered correctly about adding the support - but I believe "vader" actually 
just works out-of-the-box. In OMPI master's connect/accept code, we 
obtain/compute the relative locality of all newly connected peers so that 
"vader" will correctly identify which are available for shmem support:

> if (0 < opal_list_get_size()) {
> uint32_t *peer_ranks = NULL;
> int prn, nprn = 0;
> char *val, *mycpuset;
> uint16_t u16;
> opal_process_name_t wildcard_rank;
> /* convert the list of new procs to a proc_t array */
> new_proc_list = (ompi_proc_t**)calloc(opal_list_get_size(),
>   sizeof(ompi_proc_t *));
> /* get the list of local peers for the new procs */
> cd = (ompi_dpm_proct_caddy_t*)opal_list_get_first();
> proc = cd->p;
> wildcard_rank.jobid = proc->super.proc_name.jobid;
> wildcard_rank.vpid = OMPI_NAME_WILDCARD->vpid;
> /* retrieve the local peers */
> OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, PMIX_LOCAL_PEERS,
>_rank, , PMIX_STRING);
> if (OPAL_SUCCESS == rc && NULL != val) {
> char **peers = opal_argv_split(val, ',');
> free(val);
> nprn = opal_argv_count(peers);
> peer_ranks = (uint32_t*)calloc(nprn, sizeof(uint32_t));
> for (prn = 0; NULL != peers[prn]; prn++) {
> peer_ranks[prn] = strtoul(peers[prn], NULL, 10);
> }
> opal_argv_free(peers);
> }
> 
> /* get my locality string */
> val = NULL;
> OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, PMIX_LOCALITY_STRING,
>OMPI_PROC_MY_NAME, , PMIX_STRING);
> if (OPAL_SUCCESS == rc && NULL != val) {
> mycpuset = val;
> } else {
> mycpuset = NULL;
> }
> 
> i = 0;
> OPAL_LIST_FOREACH(cd, , ompi_dpm_proct_caddy_t) {
> proc = cd->p;
> new_proc_list[i] = proc ;
> /* ompi_proc_complete_init_single() initializes and optionally 
> retrieves
>  * OPAL_PMIX_LOCALITY and OPAL_PMIX_HOSTNAME. since we can live 
> without
>  * them, we are just fine */
> ompi_proc_complete_init_single(proc);
> /* if this proc is local, then get its locality */
> if (NULL != peer_ranks) {
> for (prn=0; prn < nprn; prn++) {
> if (peer_ranks[prn] == proc->super.proc_name.vpid) {
> /* get their locality string */
> val = NULL;
> OPAL_MODEX_RECV_VALUE_IMMEDIATE(rc, 
> PMIX_LOCALITY_STRING,
>
> >super.proc_name, , OPAL_STRING);
> if (OPAL_SUCCESS == rc && NULL != val) {
> u16 = 
> opal_hwloc_compute_relative_locality(mycpuset, val);
> free(val);
> } else {
> /* all we can say is that it shares our node */
> u16 = OPAL_PROC_ON_CLUSTER | OPAL_PROC_ON_CU | 
> OPAL_PROC_ON_NODE;
> }
> proc->super.proc_flags = u16;
> /* save the locality for later */
> OPAL_PMIX_CONVERT_NAME(, 
> >super.proc_name);
> pval.type = PMIX_UINT16;
> pval.data.uint16 = proc->super.proc_flags;
> PMIx_Store_internal(, PMIX_LOCALITY, );
> break;
> }
> }
> }
> ++i;
> }

So I believe this feature may actually be available on master today.
Ralph




[OMPI devel] MPI_Info args to spawn - resolving deprecated values?

2020-04-08 Thread Ralph Castain via devel
We have deprecated a number of cmd line options (e.g., bynode, npernode, 
npersocket) - what do we want to do about their MPI_Info equivalents when 
calling comm_spawn?

Do I silently convert them? Should we output a deprecation warning? Return an 
error?

Ralph




[OMPI devel] External libevent, hwloc, pmix intertwined

2020-04-02 Thread Ralph Castain via devel
Hey folks

I have been fighting the build system for the last two days and discovered 
something a little bothersome. It appears that there are only two ways to build 
OMPI:

* with all three of libevent, hwloc, and pmix internal

* with all three of libevent, hwloc, and pmix external

In other words, you cannot mix internal/external combinations of these three 
packages - their headers intertwine. If you try, you'll get a flood of errors 
about redefining various constants, such as;

In file included from 
/Users/rhc/openmpi/foobar/opal/mca/pmix/pmix-internal.h:20,
                 from ../src/pmix/pmix-internal.h:34,
                 from ../src/util/proc_info.h:45,
                 from pmix/pmix.c:33:
/Users/rhc/openmpi/foobar/opal/include/opal_config.h:1315: warning: 
"HWLOC_VERSION_RELEASE" redefined
 1315 | #define HWLOC_VERSION_RELEASE 1
      | 
In file included from /Users/rhc/hwloc/build/v2.0.4/include/hwloc.h:56,
                 from ../src/hwloc/hwloc-internal.h:29,
                 from ../src/util/proc_info.h:44,
                 from pmix/pmix.c:33:
/Users/rhc/hwloc/build/v2.0.4/include/hwloc/autogen/config.h:18: note: this is 
the location of the previous definition
   18 | #define HWLOC_VERSION_RELEASE 4


I think the time has come to ask ourselves: should we simplify things and just 
give one option for these packages? Either you build them ALL internal, or you 
build them ALL external?
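In configure terms, the two supported combinations would look roughly like 
this (install prefixes are hypothetical):

# all three bundled
./configure --with-libevent=internal --with-hwloc=internal --with-pmix=internal

# all three external
./configure --with-libevent=/opt/libevent --with-hwloc=/opt/hwloc --with-pmix=/opt/pmix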

Ralph



Re: [OMPI devel] Taking advantage of PMIx: Hierarchical collective support

2020-03-22 Thread Ralph Castain via devel
I should have reminded everyone of the basics:

* PMIX_NETWORK_ENDPT - gives you an array of network endpoints for the 
specified proc, one per NIC, ordered from closest to farthest distance from 
where that proc is bound

Similarly, PMIX_NETWORK_COORDINATE provides the array of coordinates for the 
specified proc, one per NIC, ordered as above.

I'll be posting some example code illustrating the use of all these in the near 
future and will alert anyone interested when I do.

Ralph


> On Mar 22, 2020, at 11:36 AM, Ralph Castain via devel 
>  wrote:
> 
> I'll be writing a series of notes containing thoughts on how to exploit 
> PMIx-provided information, especially covering aspects that might not be 
> obvious (e.g., attributes that might not be widely known). This first note 
> covers the topic of collective optimization.
> 
> PMIx provides network-related information that can be used in construction of 
> collectives - in this case, hierarchical collectives that minimize 
> cross-switch communications. Several pieces of information that might help 
> with construction of such collectives are provided by PMIx at time of process 
> execution. These include:
> 
> * PMIX_LOCAL_PEERS - the list of local peers (i.e., procs from your nspace) 
> sharing your node. This can be used to aggregate the contribution from 
> participating procs on the node to (for example) the lowest rank participator 
> on that node (call this the "node leader").
> 
> * PMIX_SWITCH_PEERS - the list of peers that share the same switch as the 
> proc specified in the call to PMIx_Get. Multi-NIC environments will return an 
> array of results, each element containing the NIC and the list of peers 
> sharing the switch to which that NIC is connected. This can be used to 
> aggregate the contribution across switches - e.g., by having the lowest 
> ranked participating proc on each switch participate in an allgather, and 
> then distribute the results to the participating node leaders for final 
> distribution across their nodes.
> 
> In the case of non-flat fabrics, further information regarding the topology 
> of the fabric and the location of each proc within that topology is provided 
> to aid in the construction of a collective. These include:
> 
> * PMIX_NETWORK_COORDINATE - network coordinate of the specified process in 
> the given view type (e.g., logical vs physical), expressed as a pmix_coord_t 
> struct that contains both the coordinates and the number of dimensions
> * PMIX_NETWORK_VIEW - Requested view type (e.g., logical vs physical)
> * PMIX_NETWORK_DIMS - Number of dimensions in the specified network plane/view
> 
> In addition, there are some values that can aid in interpreting this info 
> and/or describing it (e.g., in diagnostic output):
> 
> * PMIX_NETWORK_PLANE - string ID of a network plane
> * PMIX_NETWORK_SWITCH - string ID of a network switch
> * PMIX_NETWORK_NIC - string ID of a NIC
> * PMIX_NETWORK_SHAPE - number of interfaces (uint32_t) on each dimension of 
> the specified network plane in the requested view
> * PMIX_NETWORK_SHAPE_STRING - network shape expressed as a string (e.g., 
> "10x12x2")
> 
> Obviously, the availability of this support depends directly on access to the 
> required information. In the case of managed fabrics, this is provided by 
> PMIx plugins that directly obtain it from the respective fabric manager. I am 
> writing the support for Cray's Slingshot fabric, but any managed fabric can 
> be supported should someone wish to do so.
> 
> Unmanaged fabrics pose a bit of a challenge (e.g., how does one determine who 
> shares your switch?), but I suspect those who understand those environments 
> can probably devise a solution should they choose to pursue it. Remember, 
> PMIx includes interfaces that allow the daemon-level PMIx servers to collect 
> any information the fabric plugins deem useful from either the fabric or 
> local node level and roll it up for later use - this allows us, for example, 
> to provide the fabric support plugins with information on the local locality 
> of NICs on each node which they then use in assigning network endpoints.
> 
> This support will be appearing in PMIx (and thus, in OMPI) starting this 
> summer. You can play with it now, if you like - there are a couple of test 
> examples in the PMIx code base (see src/mca/pnet) that provide simulated 
> values being used by our early adopters for development. You are welcome to 
> use those, or to write your own plugin.
> 
> As always, I'm happy to provide advice/help to those interested in utilizing 
> these capabilities.
> Ralph
> 
> 




[OMPI devel] Taking advantage of PMIx: Hierarchical collective support

2020-03-22 Thread Ralph Castain via devel
I'll be writing a series of notes containing thoughts on how to exploit 
PMIx-provided information, especially covering aspects that might not be 
obvious (e.g., attributes that might not be widely known). This first note 
covers the topic of collective optimization.

PMIx provides network-related information that can be used in construction of 
collectives - in this case, hierarchical collectives that minimize cross-switch 
communications. Several pieces of information that might help with construction 
of such collectives are provided by PMIx at time of process execution. These 
include:

* PMIX_LOCAL_PEERS - the list of local peers (i.e., procs from your nspace) 
sharing your node. This can be used to aggregate the contribution from 
participating procs on the node to (for example) the lowest rank participator 
on that node (call this the "node leader").

* PMIX_SWITCH_PEERS - the list of peers that share the same switch as the proc 
specified in the call to PMIx_Get. Multi-NIC environments will return an array 
of results, each element containing the NIC and the list of peers sharing the 
switch to which that NIC is connected. This can be used to aggregate the 
contribution across switches - e.g., by having the lowest ranked participating 
proc on each switch participate in an allgather, and then distribute the 
results to the participating node leaders for final distribution across their 
nodes.

In the case of non-flat fabrics, further information regarding the topology of 
the fabric and the location of each proc within that topology is provided to 
aid in the construction of a collective. These include:

* PMIX_NETWORK_COORDINATE - network coordinate of the specified process in the 
given view type (e.g., logical vs physical), expressed as a pmix_coord_t struct 
that contains both the coordinates and the number of dimensions
* PMIX_NETWORK_VIEW - Requested view type (e.g., logical vs physical)
* PMIX_NETWORK_DIMS - Number of dimensions in the specified network plane/view

In addition, there are some values that can aid in interpreting this info 
and/or describing it (e.g., in diagnostic output):

* PMIX_NETWORK_PLANE - string ID of a network plane
* PMIX_NETWORK_SWITCH - string ID of a network switch
* PMIX_NETWORK_NIC - string ID of a NIC
* PMIX_NETWORK_SHAPE - number of interfaces (uint32_t) on each dimension of the 
specified network plane in the requested view
* PMIX_NETWORK_SHAPE_STRING - network shape expressed as a string (e.g., 
"10x12x2")

Obviously, the availability of this support depends directly on access to the 
required information. In the case of managed fabrics, this is provided by PMIx 
plugins that directly obtain it from the respective fabric manager. I am 
writing the support for Cray's Slingshot fabric, but any managed fabric can be 
supported should someone wish to do so.

Unmanaged fabrics pose a bit of a challenge (e.g., how does one determine who 
shares your switch?), but I suspect those who understand those environments can 
probably devise a solution should they choose to pursue it. Remember, PMIx 
includes interfaces that allow the daemon-level PMIx servers to collect any 
information the fabric plugins deem useful from either the fabric or local node 
level and roll it up for later use - this allows us, for example, to provide 
the fabric support plugins with information on the local locality of NICs on 
each node which they then use in assigning network endpoints.

This support will be appearing in PMIx (and thus, in OMPI) starting this 
summer. You can play with it now, if you like - there are a couple of test 
examples in the PMIx code base (see src/mca/pnet) that provide simulated values 
being used by our early adopters for development. You are welcome to use those, 
or to write your own plugin.

As always, I'm happy to provide advice/help to those interested in utilizing 
these capabilities.
Ralph




Re: [OMPI devel] Add multi nic support for ofi MTL using hwloc

2020-03-20 Thread Ralph Castain via devel
If you call "hwloc_topology_load", then hwloc merrily does its discovery and 
slams many-core systems. If you call "opal_hwloc_get_topology", then that is 
fine - it checks if we already have it, tries to get it from PMIx (using shared 
mem for hwloc 2.x), and only does the discovery if no other method is available.

IIRC, we might have decided to let those who needed the topology call 
"opal_hwloc_get_topology" to ensure the topo was available so that we don't 
load it unless someone actually needs it. However, I get the sense we wound up 
always needing the topology, so it was kind of a moot point.

Given that all we do is set up a shmem link (since hwloc 2 is now widely 
available), it shouldn't matter. However, if you want to stick with the "only 
get it if needed" approach, then just add a call to "opal_hwloc_get_topology" 
prior to using the topology and close that PR as "unneeded".


On Mar 20, 2020, at 9:35 AM, Barrett, Brian <bbarr...@amazon.com> wrote:

But that does raise the question: should we call get_topology() for belt and 
suspenders in OFI? Or will that cause the concerns you raised at the start of 
this thread?
 Brian
From: Ralph Castain <r...@open-mpi.org>
Date: Friday, March 20, 2020 at 9:31 AM
To: OpenMPI Devel <devel@lists.open-mpi.org>
Cc: "Barrett, Brian" <bbarr...@amazon.com>
Subject: RE: [EXTERNAL] [OMPI devel] Add multi nic support for ofi MTL using 
hwloc
  https://github.com/open-mpi/ompi/pull/7547 fixes it and has an explanation as 
to why it wasn't catching us elsewhere in the MPI code
 

On Mar 20, 2020, at 9:22 AM, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:
 Odd - the topology object gets filled in during init, well before the fence 
(as it doesn't need the fence, being a purely local op). Let me take a look



On Mar 20, 2020, at 9:15 AM, Barrett, Brian <bbarr...@amazon.com> wrote:

PMIx folks -

When using mpirun for launching, it looks like opal_hwloc_topology isn't filled 
in at the point where we need the information (mtl_ofi_component_init()).  This 
would end up being before the modex fence, since the goal is to figure out 
which address the process should publish.  I'm not sure that makes a difference 
here, but wanted to figure out if this was expected and, if so, if we had 
options for getting the right data from PMIx early enough in the process.  
Sorry, this is part of the runtime changes I haven't been following closely 
enough.

Brian

-----Original Message-----
From: devel <devel-boun...@lists.open-mpi.org> on behalf of Ralph Castain via 
devel <devel@lists.open-mpi.org>
Reply-To: Open MPI Developers <devel@lists.open-mpi.org>
Date: Wednesday, March 18, 2020 at 2:08 PM
To: "Zhang, William" <wilzh...@amazon.com>

Cc: Ralph Castain <r...@open-mpi.org>, OpenMPI Devel <devel@lists.open-mpi.org>
Subject: RE: [EXTERNAL] [OMPI devel] Add multi nic support for ofi MTL using 
hwloc




  Excellent - thanks! Now if only the OpenMP people would be so 
reasonable...sigh.



On Mar 18, 2020, at 10:26 AM, Zhang, William <wilzh...@amazon.com> wrote:

Hello,

We're getting the topology info using the opal_hwloc_topology object, we won't 
be doing our own discovery.

William

On 3/17/20, 11:54 PM, "devel on behalf of Ralph Castain via devel" 
<devel-boun...@lists.open-mpi.org on behalf of devel@lists.open-mpi.org> wrote:




 Hey folks

 I saw the referenced "new feature" on the v5 feature spreadsheet and wanted to 
ask a quick question. Is the OFI MTL going to be doing its own hwloc topology 
discovery for this feature? Or is it going to access the topology info via PMIx 
and the OPAL hwloc abstraction?

 I ask because we know that having every proc do its own topology discovery is 
a major problem on large-core systems (e.g., KNL or Power9). If OFI is going to 
do an hwloc discovery operation, then we need to ensure this doesn't happen 
unless specifically requested by a user willing to pay that price (and it was 
significant).

 Can someone from Amazon (as the item is assigned to them) please clarify?
 Ralph




Re: [OMPI devel] Add multi nic support for ofi MTL using hwloc

2020-03-20 Thread Ralph Castain via devel
https://github.com/open-mpi/ompi/pull/7547 fixes it and has an explanation as 
to why it wasn't catching us elsewhere in the MPI code


On Mar 20, 2020, at 9:22 AM, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:

Odd - the topology object gets filled in during init, well before the fence (as 
it doesn't need the fence, being a purely local op). Let me take a look


On Mar 20, 2020, at 9:15 AM, Barrett, Brian <bbarr...@amazon.com> wrote:

PMIx folks -

When using mpirun for launching, it looks like opal_hwloc_topology isn't filled 
in at the point where we need the information (mtl_ofi_component_init()).  This 
would end up being before the modex fence, since the goal is to figure out 
which address the process should publish.  I'm not sure that makes a difference 
here, but wanted to figure out if this was expected and, if so, if we had 
options for getting the right data from PMIx early enough in the process.  
Sorry, this is part of the runtime changes I haven't been following closely 
enough.

Brian

-----Original Message-----
From: devel <devel-boun...@lists.open-mpi.org> on behalf of Ralph Castain via 
devel <devel@lists.open-mpi.org>
Reply-To: Open MPI Developers <devel@lists.open-mpi.org>
Date: Wednesday, March 18, 2020 at 2:08 PM
To: "Zhang, William" <wilzh...@amazon.com>

Cc: Ralph Castain <r...@open-mpi.org>, OpenMPI Devel <devel@lists.open-mpi.org>
Subject: RE: [EXTERNAL] [OMPI devel] Add multi nic support for ofi MTL using 
hwloc




   Excellent - thanks! Now if only the OpenMP people would be so 
reasonable...sigh.


On Mar 18, 2020, at 10:26 AM, Zhang, William <wilzh...@amazon.com> wrote:

Hello,

We're getting the topology info using the opal_hwloc_topology object, we won't 
be doing our own discovery.

William

On 3/17/20, 11:54 PM, "devel on behalf of Ralph Castain via devel" 
<devel-boun...@lists.open-mpi.org on behalf of devel@lists.open-mpi.org> wrote:




  Hey folks

  I saw the referenced "new feature" on the v5 feature spreadsheet and wanted 
to ask a quick question. Is the OFI MTL going to be doing its own hwloc 
topology discovery for this feature? Or is it going to access the topology info 
via PMIx and the OPAL hwloc abstraction?

  I ask because we know that having every proc do its own topology discovery is 
a major problem on large-core systems (e.g., KNL or Power9). If OFI is going to 
do an hwloc discovery operation, then we need to ensure this doesn't happen 
unless specifically requested by a user willing to pay that price (and it was 
significant).

  Can someone from Amazon (as the item is assigned to them) please clarify?
  Ralph













Re: [OMPI devel] Add multi nic support for ofi MTL using hwloc

2020-03-20 Thread Ralph Castain via devel
Odd - the topology object gets filled in during init, well before the fence (as 
it doesn't need the fence, being a purely local op). Let me take a look


> On Mar 20, 2020, at 9:15 AM, Barrett, Brian  wrote:
> 
> PMIx folks -
> 
> When using mpirun for launching, it looks like opal_hwloc_topology isn't 
> filled in at the point where we need the information 
> (mtl_ofi_component_init()).  This would end up being before the modex fence, 
> since the goal is to figure out which address the process should publish.  
> I'm not sure that makes a difference here, but wanted to figure out if this 
> was expected and, if so, if we had options for getting the right data from 
> PMIx early enough in the process.  Sorry, this is part of the runtime changes 
> I haven't been following closely enough.
> 
> Brian
> 
> -Original Message-----
> From: devel  on behalf of Ralph Castain via 
> devel 
> Reply-To: Open MPI Developers 
> Date: Wednesday, March 18, 2020 at 2:08 PM
> To: "Zhang, William" 
> Cc: Ralph Castain , OpenMPI Devel 
> 
> Subject: RE: [EXTERNAL] [OMPI devel] Add multi nic support for ofi MTL using 
> hwloc
> 
> 
> 
> 
>Excellent - thanks! Now if only the OpenMP people would be so 
> reasonable...sigh.
> 
> 
>> On Mar 18, 2020, at 10:26 AM, Zhang, William  wrote:
>> 
>> Hello,
>> 
>> We're getting the topology info using the opal_hwloc_topology object, we 
>> won't be doing our own discovery.
>> 
>> William
>> 
>> On 3/17/20, 11:54 PM, "devel on behalf of Ralph Castain via devel" 
>>  
>> wrote:
>> 
>> 
>> 
>> 
>>   Hey folks
>> 
>>   I saw the referenced "new feature" on the v5 feature spreadsheet and 
>> wanted to ask a quick question. Is the OFI MTL going to be doing its own 
>> hwloc topology discovery for this feature? Or is it going to access the 
>> topology info via PMIx and the OPAL hwloc abstraction?
>> 
>>   I ask because we know that having every proc do its own topology discovery 
>> is a major problem on large-core systems (e.g., KNL or Power9). If OFI is 
>> going to do an hwloc discovery operation, then we need to ensure this 
>> doesn't happen unless specifically requested by a user willing to pay that 
>> price (and it was significant).
>> 
>>   Can someone from Amazon (as the item is assigned to them) please clarify?
>>   Ralph
>> 
>> 
>> 
>> 
> 
> 
> 
> 




Re: [OMPI devel] Add multi nic support for ofi MTL using hwloc

2020-03-18 Thread Ralph Castain via devel
Excellent - thanks! Now if only the OpenMP people would be so reasonable...sigh.


> On Mar 18, 2020, at 10:26 AM, Zhang, William  wrote:
> 
> Hello,
> 
> We're getting the topology info using the opal_hwloc_topology object, we 
> won't be doing our own discovery.
> 
> William
> 
> On 3/17/20, 11:54 PM, "devel on behalf of Ralph Castain via devel" 
>  
> wrote:
> 
>CAUTION: This email originated from outside of the organization. Do not 
> click links or open attachments unless you can confirm the sender and know 
> the content is safe.
> 
> 
> 
>Hey folks
> 
>I saw the referenced "new feature" on the v5 feature spreadsheet and 
> wanted to ask a quick question. Is the OFI MTL going to be doing its own 
> hwloc topology discovery for this feature? Or is it going to access the 
> topology info via PMIx and the OPAL hwloc abstraction?
> 
>I ask because we know that having every proc do its own topology discovery 
> is a major problem on large-core systems (e.g., KNL or Power9). If OFI is 
> going to do an hwloc discovery operation, then we need to ensure this doesn't 
> happen unless specifically requested by a user willing to pay that price (and 
> it was significant).
> 
>Can someone from Amazon (as the item is assigned to them) please clarify?
>Ralph
> 
> 
> 
> 




[OMPI devel] Add multi nic support for ofi MTL using hwloc

2020-03-18 Thread Ralph Castain via devel
Hey folks

I saw the referenced "new feature" on the v5 feature spreadsheet and wanted to 
ask a quick question. Is the OFI MTL going to be doing its own hwloc topology 
discovery for this feature? Or is it going to access the topology info via PMIx 
and the OPAL hwloc abstraction?

I ask because we know that having every proc do its own topology discovery is a 
major problem on large-core systems (e.g., KNL or Power9). If OFI is going to 
do an hwloc discovery operation, then we need to ensure this doesn't happen 
unless specifically requested by a user willing to pay that price (and it was 
significant).

Can someone from Amazon (as the item is assigned to them) please clarify?
Ralph




Re: [OMPI devel] New coll component

2020-03-05 Thread Ralph Castain via devel
You have a missing symbol in your component:

undefined symbol: ompi_coll_libpnbc_osc_neighbor_alltoall_init (ignored)
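
That error means the component's shared object references a function that was 
never compiled into it - typically the prototype (or a function-table entry) 
exists but the defining source file was left out of the component's build. A 
hedged sketch (the signature below is simplified and is not the real coll 
framework prototype), just to show what "resolving the symbol" looks like:

/* Any symbol the component references, such as the reported
 * ompi_coll_libpnbc_osc_neighbor_alltoall_init, must have a definition
 * compiled into mca_coll_libpnbc_osc.so (i.e. its source file listed in
 * the component's Makefile.am), or dlopen() fails with "undefined symbol"
 * and ompi_info ignores the component. */
int ompi_coll_libpnbc_osc_neighbor_alltoall_init(void);   /* referenced... */

int ompi_coll_libpnbc_osc_neighbor_alltoall_init(void)    /* ...and defined */
{
    return 0;   /* placeholder body for illustration only */
}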




On Mar 5, 2020, at 5:57 AM, Luis Cebamanos via devel <devel@lists.open-mpi.org> wrote:

Hi folks,


We are developing a (hopefully) new component for the coll framework.
The component compiles fine, but I get the following error when running
the ompi_info tool:

~/test_ompi/bin/ompi_info --param coll all
[indy2-login0:21620] mca_base_component_repository_open: unable to open
mca_coll_libpnbc_osc:
/lustre/home/z04/lc/test_ompi/lib/openmpi/mca_coll_libpnbc_osc.so:
undefined symbol: ompi_coll_libpnbc_osc_neighbor_alltoall_init (ignored)

    MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.0)
    MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.0)
    MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.0)
    MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component
  v4.1.0)
    MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.0)
    MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.0)
    MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.0)
    MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.0)

I don't see any obvious problems and I am wondering if this component
needs to be registered somewhere else so that ompi_info sees it. Any
hints on where I should look at?

Regards,
Luis
The University of Edinburgh is a charitable body, registered in Scotland, with 
registration number SC005336.



Re: [OMPI devel] Today's OMPI master is failing with "ompi_mpi_init: ompi_rte_init failed"

2020-03-04 Thread Ralph Castain via devel
I checked this with a fresh clone and everything is working fine, so I expect 
this is a stale submodule issue again. I've asked John to check.

> On Mar 4, 2020, at 8:05 AM, John DelSignore via devel 
>  wrote:
> 
> Hi,
> 
> I've been working with Ralph to try to get the PMIx debugging interfaces 
> working with OMPI v5 master. I've been periodically pulling new versions to 
> try to pickup the changes Ralph has been pushing into PRRTE/OpenPMIx. After 
> pulling this morning, I'm getting the following error. This all worked OK 
> yesterday with a pull from late last week, so it seems to me that something 
> got broken in the last few days. Is this a known problem, or am I doing 
> something wrong?
> 
> Thanks, John D.
> 
> mic:/amd/home/jdelsign/PMIx>prun -x MESSAGE=name -n 1 --map-by node 
> --personality ompi ./tx_basic_mpi
> --
> It looks like MPI runtime init failed for some reason; your parallel process 
> is
> likely to abort.  There are many reasons that a parallel process can
> fail during RTE init; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>  local size
>  --> Returned "Not found" (-13) instead of "Success" (0)
> --
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>  ompi_mpi_init: ompi_rte_init failed
>  --> Returned "Not found" (-13) instead of "Success" (0)
> --
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> [microway3:110345] Local abort before MPI_INIT completed completed 
> successfully, but am not able to aggregate error messages, and not able to 
> guarantee that all other processes were killed!
> mic:/amd/home/jdelsign/PMIx>
> 
> 
> This e-mail may contain information that is privileged or confidential. If 
> you are not the intended recipient, please delete the e-mail and any 
> attachments and notify us immediately.
> 



Re: [OMPI devel] Adding a new RAS module

2020-02-29 Thread Ralph Castain via devel
You'll have to do it in the PRRTE project: https://github.com/openpmix/prrte

OMPI has removed the ORTE code base and replaced it with PRRTE, which is 
effectively the same code but taken from a repo that multiple projects support. 
You can use any of the components in there as a template - I don't believe we 
have a formal guide. Feel free to open an issue on the PRRTE repo to track any 
questions.

Ralph


On Feb 29, 2020, at 6:10 AM, Davide Giuseppe Siciliano via devel <devel@lists.open-mpi.org> wrote:

Hello everyone, 

I'm trying to add a new RAS module to integrate a framework developed at 
university with Open MPI.

Can you please tell me whether there is a template (or a guide) to follow so 
that I develop it the right way?

Thanks, 
Davide



[OMPI devel] Github is sad today

2020-02-27 Thread Ralph Castain via devel
Just an FYI: GitHub is degraded today, especially on the webhooks and actions 
that we depend upon for things like CI. Hopefully, they will get it fixed soon.

Ralph




[OMPI devel] Conflicting definitions

2020-02-20 Thread Ralph Castain via devel
Hey folks

Now that we have multiple projects sharing a build system, we need to be 
careful how we name our #if's. For example, using:

#ifndef MCA_FOO_BAR
#define MCA_FOO_BAR
...

might have been fine in the past, but you wind up clobbering another project 
that also has a "foo_bar" component. Please prefix your definitions with the 
project name, like:

#ifndef OPAL_MCA_FOO_BAR
#define OPAL_MCA_FOO_BAR

This will help to reduce a number of headaches
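
To make the failure mode concrete, here is a hedged illustration (hypothetical 
headers, not real OMPI/PRRTE files) of two projects clobbering each other when 
both guard a "foo_bar" header with the same unprefixed macro:

/* project A: opal/mca/foo/bar/foo_bar.h */
#ifndef MCA_FOO_BAR
#define MCA_FOO_BAR
/* ... project A declarations ... */
#endif

/* project B: another project's foo_bar.h, included later in the same
 * translation unit: MCA_FOO_BAR is already defined, so everything below
 * is silently skipped and project B's declarations never appear. */
#ifndef MCA_FOO_BAR
#define MCA_FOO_BAR
/* ... project B declarations (lost) ... */
#endif
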
Ralph




[OMPI devel] Deprecated configure options in OMPI v5

2020-02-19 Thread Ralph Castain via devel
What do we want to do with the following options? These have either been 
renamed (changing from "orte..." to a "prrte" equivalent) or are no longer 
valid:

--enable-orterun-prefix-by-default
--enable-mpirun-prefix-by-default
These are now --enable-prte-prefix-by-default. Should I error out via the 
deprecation mechanism? Or should we silently translate to the new option?


 --enable-per-user-config-files
This is no longer valid if launching either via mpirun or on a system that has 
adequate PMIx support. Still, it does apply to direct launch on systems that 
lack the requisite support. My only concern here is that we ARE going to use 
user-level config files with mpirun and supported systems, and it is now a 
runtime decision (not a configure option). So do we remove this and explain 
another method for doing it on systems lacking support? Or leave it and just 
"do the right thing" under the covers?


--enable-mpi-cxx
--enable-mpi-cxx-seek
--enable-cxx-exceptions
I assume these should be added to the "deprecation" m4?


Ralph



[OMPI devel] OMPI Developer Meeting for Feb 2020 - Minutes

2020-02-19 Thread Ralph Castain via devel
Hi folks

I integrated the minutes from this week's meeting into the meeting's wiki page:

https://github.com/open-mpi/ompi/wiki/Meeting-2020-02

Feel free to update and/or let me know of errors or omissions
Ralph


[OMPI devel] Command line and envar processing

2020-02-19 Thread Ralph Castain via devel
Hey folks

Based on the discussion at the OMPI developer's meeting this week, I have 
created the following wiki page explaining how OMPI's command line and envars 
will be processed for OMPI v5:

https://github.com/open-mpi/ompi/wiki/Command-Line-Envar-Parsing

Feel free to comment and/or ask questions
Ralph



[OMPI devel] Fix your MTT scripts!

2020-02-09 Thread Ralph Castain via devel
We are seeing many failures on MTT because of errors on the cmd line. Note that 
by request of the OMPI community, PRRTE is strictly enforcing the Posix "dash" 
syntax:

  * a single-dash must be used only for single-character options. You can 
combine the single-character options like "-abc" as shorthand for "-a -b -c"
  * two-dashes must precede ALL multi-character options. For example, "--mca" 
as opposed to "-mca". The latter will be rejected with an error

Please adjust your MTT scripts
Ralph




[OMPI devel] ORTE has been removed!

2020-02-08 Thread Ralph Castain via devel
FYI: pursuant to the objectives outlined last year, I have committed PR #7202 
and removed ORTE from the OMPI repository. It has been replaced with a PRRTE 
submodule pointed at the PRRTE master branch. At the same time, we replaced the 
embedded PMIx code tree with a submodule pointed at the PMIx master branch.

The mpirun command hasn't changed. It simply starts PRRTE under the covers and 
then launches your job (using "prun") against it. So everything behaves the 
same in that regard.

Some notes on possible differences:

* singletons work, but singleton comm-spawn does not. We will address this in a 
little while

* we haven't extensively tested MCA params and cannot claim that every cmd line 
option works. There are some MCA params that don't have a PRRTE equivalent. 
Please let us know when you hit something that appears to be missing.

Note that you will need to pull the submodules once you update from the repo, 
and PRs may well need to be rebased.

Ralph




Re: [OMPI devel] Git submodules are coming

2020-02-07 Thread Ralph Castain via devel
FWIW: I have major problems when rebasing if that rebase runs across the point 
where a submodule is added. Every file that was removed and replaced by the 
submodule generates a conflict. Only solution I could find was to whack the 
subdirectory containing the files-to-be-replaced and work thru it (and it isn't 
an easy process). Rather painful, which is why rebasing the "remove ORTE" PR 
has been a nightmare.


> On Feb 7, 2020, at 7:31 AM, Jeff Squyres (jsquyres) via devel 
>  wrote:
> 
> On Feb 7, 2020, at 4:27 AM, Brice Goglin via devel  
> wrote:
>> 
>> PR#7367 was initially on top of PR #7366. When Jeff merged PR#7366, I 
>> rebased my #7367 with git prrs and got this error:
>> 
>> $ git prrs origin master
>> From 
>> https://github.com/open-mpi/ompi
>> 
>> * branch  master -> FETCH_HEAD
>> Fetching submodule opal/mca/hwloc/hwloc2/hwloc
>> fatal: cannot rebase with locally recorded submodule modifications
>> 
>> I didn't touch the hwloc submodule as far as I can see. The hwloc submodule 
>> also didn't change in origin/master between before and after the rebasing.
> 
> Huh.  I can't see from this what happened; I have no insight to offer here, 
> sorry...
> 
>> $ git submodule status
>> 38433c0f5fae0b761bd20e7b928c77f3ff2e76dc opal/mca/hwloc/hwloc2/hwloc 
>> (hwloc-2.1.0rc2-33-g38433c0f)
> 
> I see this in my ompi clone as well (i.e., it's where the master/HEAD hwloc 
> submodule is pointing).
> 
>> opal/mca/hwloc/hwloc2/hwloc $ git status
>> HEAD detached from f1a2e22a
>> nothing to commit, working tree clean
>> 
>> I am not sure what's this "HEAD detached ..." is doing here.
> 
> If you look at the graph log in the opal/mca/hwloc/hwloc2/hwloc tree, you'll 
> see:
> 
> * 03d42600 (origin/v2.1) doxy: add a ref to envvar from the XML section
> ...a bunch more commits...
> * 38433c0f (HEAD) .gitignore: add config/ltmain.sh.orig
> ...a bunch more commits...
> * f1a2e22a (tag: hwloc-2.1.0rc2, tag: hwloc-2.1.0) contrib/windows: update 
> README
> 
> Meaning:
> - 03d42600 is the head of the "v2.1" branch in the hwloc repo
> - 38433c0f is where the submodule is pointing (i.e., local HEAD)
> - f1a2e22a is the last tag before that
> 
> So I think the "HEAD detached" means that the HEAD is not pointing to a named 
> commit (i.e., there's no tags or branches pointing to 38433c0f).
> 
>> I seem to be able to reproduce the issue in my master branch by doing "git 
>> reset --hard HEAD^". git prrs will then fail the same.
>> 
>> I worked around the issue by manually reapplying all commits from my PR on 
>> top of master with git cherry-pick, but I'd like to understand what's going 
>> on. It looks like my submodule is clean but not clean enough for a rebase?
> 
> I haven't had problems with rebasing and submodules; I'm not sure what I'm 
> doing different than you.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 




Re: [OMPI devel] 3.1.6rc2: Cygwin fifo warning

2020-02-03 Thread Ralph Castain via devel
It is the latter one it is complaining about:

>  /tmp/ompi.LAPTOP-82F08ILC.197609/pid.93/0/debugger_attach_fifo

I have no idea why it is complaining.

> On Feb 3, 2020, at 2:03 PM, Marco Atzeri via devel  
> wrote:
> 
> On 03.02.2020 at 18:15, Ralph Castain via devel wrote:
>> Hi Marco
>> mpirun isn't trying to run a debugger. It is opening a fifo pipe in case a 
>> debugger later wishes to attach to the running job - it is used by an 
>> MPIR-based debugger to let mpirun know that it is attaching. My guess is 
>> that the code is attempting to create the fifo in an unacceptable place 
>> under Cygwin - I forget the directory it is trying to use.
> 
> From what I see, it writes in two places:
> 
> under /dev/shm, where it leaves some trace
> $ ls -l /dev/shm/
> total 33M
> -rw--- 1 Marco Kein 4.1M Feb  3 22:01 
> vader_segment.LAPTOP-82F08ILC.45860001.0
> -rw--- 1 Marco Kein 4.1M Feb  3 22:01 
> vader_segment.LAPTOP-82F08ILC.45860001.1
> -rw--- 1 Marco Kein 4.1M Feb  3 22:01 
> vader_segment.LAPTOP-82F08ILC.45860001.2
> -rw--- 1 Marco Kein 4.1M Feb  3 22:01 
> vader_segment.LAPTOP-82F08ILC.45860001.3
> -rw--- 1 Marco Kein 4.1M Feb  3 22:02 
> vader_segment.LAPTOP-82F08ILC.45a60001.0
> -rw--- 1 Marco Kein 4.1M Feb  3 22:02 
> vader_segment.LAPTOP-82F08ILC.45a60001.1
> -rw--- 1 Marco Kein 4.1M Feb  3 22:02 
> vader_segment.LAPTOP-82F08ILC.45a60001.2
> -rw--- 1 Marco Kein 4.1M Feb  3 22:02 
> vader_segment.LAPTOP-82F08ILC.45a60001.3
> 
> and under /tmp/ompi.LAPTOP-82F08ILC.197609
> as /tmp/ompi.LAPTOP-82F08ILC.197609/pid.93/0/debugger_attach_fifo
> 
> where at the end nothing remains under /tmp
> 
> The only strange thing I see is some attempted access to /proc/elog,
> which does not exist in Cygwin
> 




Re: [OMPI devel] 3.1.6rc2: Cygwin fifo warning

2020-02-03 Thread Ralph Castain via devel
Hi Marco

mpirun isn't trying to run a debugger. It is opening a fifo pipe in case a 
debugger later wishes to attach to the running job - it is used by an 
MPIR-based debugger to let mpirun know that it is attaching. My guess is that 
the code is attempting to create the fifo in an unacceptable place under Cygwin 
- I forget the directory it is trying to use.


On Feb 2, 2020, at 6:07 AM, Marco Atzeri via devel <devel@lists.open-mpi.org> wrote:

On 02.02.2020 at 14:16, Jeff Squyres (jsquyres) wrote:
On Feb 2, 2020, at 2:17 AM, Marco Atzeri via devel <devel@lists.open-mpi.org> wrote:

This is not a new issue, as it was also in 3.1.5. What is causing the
last line of warning? And why should a simple run try to run a debugger?

$ mpirun -n 4 ./hello_c
...
Hello, world, I am 3 of 4, (Open MPI v3.1.6rc2, package: Open MPI 
Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.6rc2, repo rev: v3.1.6rc2, Jan 
30, 2020, 125)
[LAPTOP-82F08ILC:00154] [[18244,0],0] unable to open debugger attach fifo

this is a Cygwin 64 bit.
Can you get a stack trace for that, perchance?  The function in question to 
trap is open_fifo() in orted_submit.c.  This function can be called from 3 
different places; it would be good to know in which of the 3 it is happening.
Does Cygwin support mkfifo()?


/usr/include/sys/stat.h:int mkfifo (const char *__path, mode_t __mode );


Assuming that the message is coming from the last open_fifo call

Thread 1 "orterun" hit Breakpoint 1, open_fifo ()
   at /usr/src/debug/openmpi-3.1.6rc2-1/orte/orted/orted_submit.c:2857
2857    {
(gdb) bt
#0  open_fifo () at 
/usr/src/debug/openmpi-3.1.6rc2-1/orte/orted/orted_submit.c:2857
#1  0x0003783f1cf1 in attach_debugger (fd=, event=, arg=0x800155430)
   at /usr/src/debug/openmpi-3.1.6rc2-1/orte/orted/orted_submit.c:2913
#2  0x0003784bbca0 in event_process_active_single_queue 
(activeq=0x80008b650, base=0x80008af90)
   at 
/usr/src/debug/openmpi-3.1.6rc2-1/opal/mca/event/libevent2022/libevent/event.c:1370
#3  event_process_active (base=)
   at 
/usr/src/debug/openmpi-3.1.6rc2-1/opal/mca/event/libevent2022/libevent/event.c:1440
#4  opal_libevent2022_event_base_loop (base=0x80008af90, flags=flags@entry=1)
   at 
/usr/src/debug/openmpi-3.1.6rc2-1/opal/mca/event/libevent2022/libevent/event.c:1644
#5  0x0001004013de in orterun (argc=, argv=)
   at /usr/src/debug/openmpi-3.1.6rc2-1/orte/tools/orterun/orterun.c:201
#6  0x00018004a826 in _cygwin_exit_return ()
   at /usr/src/debug/cygwin-3.1.2-1/winsup/cygwin/dcrt0.cc:1028 
 
#7  0x000180048353 in _cygtls::call2 (this=0xce00, func=0x180049800 
,
   arg=0x0, buf=buf@entry=0xcdf0) at 
/usr/src/debug/cygwin-3.1.2-1/winsup/cygwin/cygtls.cc:40  
#8  0x000180048404 in _cygtls::call (func=, arg=)
   at /usr/src/debug/cygwin-3.1.2-1/winsup/cygwin/cygtls.cc:27 
 
#9  0x in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

(gdb) c
Continuing.
[Thread 127164.0x54860 exited with code 0]
[LAPTOP-82F08ILC:02101] [[20459,0],0] unable to open debugger attach fifo



Re: [OMPI devel] Git submodules are coming

2020-01-08 Thread Ralph Castain via devel
Actually, I take that back - making a separate PR to change the opal/pmix 
embedded component to a submodule was way too painful. I simply added it to the 
existing #7202.


> On Jan 7, 2020, at 1:33 PM, Ralph Castain via devel 
>  wrote:
> 
> Just an FYI: there will soon be THREE PRs introducing submodules - I am 
> breaking #7202 into two pieces. The first will replace opal/pmix with direct 
> use of PMIx everywhere and replace the embedded pmix component with a 
> submodule pointing to PMIx master, and the second will replace ORTE with 
> PRRTE.
> 
> 
>> On Jan 7, 2020, at 9:02 AM, Jeff Squyres (jsquyres) via devel 
>>  wrote:
>> 
>> We now have two PRs pending that will introduce the use of Git submodules 
>> (and there are probably more such PRs on the way).  At last one of these 
>> first two PRs will likely be merged "Real Soon Now".
>> 
>> We've been talking about using Git submodules forever.  Now we're just about 
>> ready.
>> 
>> **
>> *** DEVELOPERS: THIS AFFECTS YOU!! ***
>> **
>> 
>> You cannot just "clone and build" any more:
>> 
>> -
>> git clone g...@github.com:open-mpi/ompi.git
>> cd ompi && ./autogen.pl && ./configure ...
>> -
>> 
>> You will *have* to initialize the Git submodule(s) -- either during or after 
>> the clone.  *THEN* you can build Open MPI.
>> 
>> Go read this wiki: https://github.com/open-mpi/ompi/wiki/GitSubmodules
>> 
>> May the force be with us!
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> 
> 
> 




Re: [OMPI devel] Git submodules are coming

2020-01-07 Thread Ralph Castain via devel
Just an FYI: there will soon be THREE PRs introducing submodules - I am 
breaking #7202 into two pieces. The first will replace opal/pmix with direct 
use of PMIx everywhere and replace the embedded pmix component with a submodule 
pointing to PMIx master, and the second will replace ORTE with PRRTE.


> On Jan 7, 2020, at 9:02 AM, Jeff Squyres (jsquyres) via devel 
>  wrote:
> 
> We now have two PRs pending that will introduce the use of Git submodules 
> (and there are probably more such PRs on the way).  At last one of these 
> first two PRs will likely be merged "Real Soon Now".
> 
> We've been talking about using Git submodules forever.  Now we're just about 
> ready.
> 
> **
> *** DEVELOPERS: THIS AFFECTS YOU!! ***
> **
> 
> You cannot just "clone and build" any more:
> 
> -
> git clone g...@github.com:open-mpi/ompi.git
> cd ompi && ./autogen.pl && ./configure ...
> -
> 
> You will *have* to initialize the Git submodule(s) -- either during or after 
> the clone.  *THEN* you can build Open MPI.
> 
> Go read this wiki: https://github.com/open-mpi/ompi/wiki/GitSubmodules
> 
> May the force be with us!
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 




Re: [OMPI devel] PMIX ERROR: INIT spurious message on 3.1.5

2020-01-07 Thread Ralph Castain via devel
I was able to create the fix - it is in OMPI master. I have provided a patch 
for OMPI v3.1.5 here:

https://github.com/open-mpi/ompi/pull/7276

Ralph


> On Jan 3, 2020, at 6:04 PM, Ralph Castain via devel 
>  wrote:
> 
> I'm afraid the fix uncovered an issue in the ds21 component that will require 
> Mellanox to address it - unsure of the timetable for that to happen.
> 
> 
>> On Jan 3, 2020, at 6:28 AM, Ralph Castain via devel 
>>  wrote:
>> 
>> I committed something upstream in PMIx master and v3.1 that probably 
>> resolves this - another user reported it over there and provided a patch. I 
>> can probably backport it to v2.x and give you a patch for OMPI v3.1.
>> 
>> 
>>> On Jan 3, 2020, at 3:25 AM, Jeff Squyres (jsquyres) via devel 
>>>  wrote:
>>> 
>>> Is there a configure test we can add to make this kind of behavior be the 
>>> default?
>>> 
>>> 
>>>> On Jan 1, 2020, at 11:50 PM, Marco Atzeri via devel 
>>>>  wrote:
>>>> 
>>>> thanks Ralph
>>>> 
>>>> gds = ^ds21
>>>> works as expected
>>>> 
>>>> Am 31.12.2019 um 19:27 schrieb Ralph Castain via devel:
>>>>> PMIx likely defaults to the ds12 component - which will work fine but a 
>>>>> tad slower than ds21. It is likely something to do with the way cygwin 
>>>>> handles memory locks. You can avoid the error message by simply adding 
>>>>> "gds = ^ds21" to your default MCA param file (the pmix one - should be 
>>>>> named pmix-mca-params.conf).
>>>>> Artem - any advice here?
>>>>>> On Dec 25, 2019, at 9:56 AM, Marco Atzeri via devel 
>>>>>>  wrote:
>>>>>> 
>>>>>> I have no multinode around for testing
>>>>>> 
>>>>>> I will need to setup one for testing after the holidays
>>>>>> 
>>>>>> Am 24.12.2019 um 23:27 schrieb Jeff Squyres (jsquyres):
>>>>>>> That actually looks like a legit error -- it's failing to initialize a 
>>>>>>> shared mutex.
>>>>>>> I'm not sure what the consequence is of this failure, though, since the 
>>>>>>> job seemed to run ok.
>>>>>>> Are you able to run multi-node jobs ok?
>>>>>>>> On Dec 22, 2019, at 1:20 AM, Marco Atzeri via devel 
>>>>>>>>  wrote:
>>>>>>>> 
>>>>>>>> Hi Developers,
>>>>>>>> 
>>>>>>>> Cygwin 64bit, openmpi-3.1.5-1
>>>>>>>> testing the cygwin package before releasing it
>>>>>>>> I see a never seen before spurious error messages that do not seem
>>>>>>>> about error at all:
>>>>>>>> 
>>>>>>>> $ mpirun -n 4 ./hello_c.exe
>>>>>>>> [LAPTOP-82F08ILC:02395] PMIX ERROR: INIT in file 
>>>>>>>> /cygdrive/d/cyg_pub/devel/openmpi/v3.1/openmpi-3.1.5-1.x86_64/src/openmpi-3.1.5/opal/mca/pmix/pmix2x/pmix/src/mca/gds/ds21/gds_ds21_lock_pthread.c
>>>>>>>>  at line 188
>>>>>>>> [LAPTOP-82F08ILC:02395] PMIX ERROR: SUCCESS in file 
>>>>>>>> /cygdrive/d/cyg_pub/devel/openmpi/v3.1/openmpi-3.1.5-1.x86_64/src/openmpi-3.1.5/opal/mca/pmix/pmix2x/pmix/src/mca/common/dstore/dstore_base.c
>>>>>>>>  at line 2432
>>>>>>>> Hello, world, I am 0 of 4, (Open MPI v3.1.5, package: Open MPI 
>>>>>>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, 
>>>>>>>> Nov 15, 2019, 116)
>>>>>>>> Hello, world, I am 1 of 4, (Open MPI v3.1.5, package: Open MPI 
>>>>>>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, 
>>>>>>>> Nov 15, 2019, 116)
>>>>>>>> Hello, world, I am 2 of 4, (Open MPI v3.1.5, package: Open MPI 
>>>>>>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, 
>>>>>>>> Nov 15, 2019, 116)
>>>>>>>> Hello, world, I am 3 of 4, (Open MPI v3.1.5, package: Open MPI 
>>>>>>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, 
>>>>>>>> Nov 15, 2019, 116)
>>>>>>>> [LAPTOP-82F08ILC:02395] [[20101,0],0] unable to open debugger attach 
>>>>>>>> fifo
>>>>>>>> 
>>>>>>>> There is a know workaround ?
>>>>>>>> I have not found anything on the issue list.
>>>>>>>> 
>>>>>>>> Regards
>>>>>>>> MArcp
>>> 
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> 
>> 
>> 
> 
> 




Re: [OMPI devel] PMIX ERROR: INIT spurious message on 3.1.5

2020-01-03 Thread Ralph Castain via devel
I'm afraid the fix uncovered an issue in the ds21 component that will require 
Mellanox to address it - unsure of the timetable for that to happen.


> On Jan 3, 2020, at 6:28 AM, Ralph Castain via devel 
>  wrote:
> 
> I committed something upstream in PMIx master and v3.1 that probably resolves 
> this - another user reported it over there and provided a patch. I can 
> probably backport it to v2.x and give you a patch for OMPI v3.1.
> 
> 
>> On Jan 3, 2020, at 3:25 AM, Jeff Squyres (jsquyres) via devel 
>>  wrote:
>> 
>> Is there a configure test we can add to make this kind of behavior be the 
>> default?
>> 
>> 
>>> On Jan 1, 2020, at 11:50 PM, Marco Atzeri via devel 
>>>  wrote:
>>> 
>>> thanks Ralph
>>> 
>>> gds = ^ds21
>>> works as expected
>>> 
>>> Am 31.12.2019 um 19:27 schrieb Ralph Castain via devel:
>>>> PMIx likely defaults to the ds12 component - which will work fine but a 
>>>> tad slower than ds21. It is likely something to do with the way cygwin 
>>>> handles memory locks. You can avoid the error message by simply adding 
>>>> "gds = ^ds21" to your default MCA param file (the pmix one - should be 
>>>> named pmix-mca-params.conf).
>>>> Artem - any advice here?
>>>>> On Dec 25, 2019, at 9:56 AM, Marco Atzeri via devel 
>>>>>  wrote:
>>>>> 
>>>>> I have no multinode around for testing
>>>>> 
>>>>> I will need to setup one for testing after the holidays
>>>>> 
>>>>> Am 24.12.2019 um 23:27 schrieb Jeff Squyres (jsquyres):
>>>>>> That actually looks like a legit error -- it's failing to initialize a 
>>>>>> shared mutex.
>>>>>> I'm not sure what the consequence is of this failure, though, since the 
>>>>>> job seemed to run ok.
>>>>>> Are you able to run multi-node jobs ok?
>>>>>>> On Dec 22, 2019, at 1:20 AM, Marco Atzeri via devel 
>>>>>>>  wrote:
>>>>>>> 
>>>>>>> Hi Developers,
>>>>>>> 
>>>>>>> Cygwin 64bit, openmpi-3.1.5-1
>>>>>>> testing the cygwin package before releasing it
>>>>>>> I see a never seen before spurious error messages that do not seem
>>>>>>> about error at all:
>>>>>>> 
>>>>>>> $ mpirun -n 4 ./hello_c.exe
>>>>>>> [LAPTOP-82F08ILC:02395] PMIX ERROR: INIT in file 
>>>>>>> /cygdrive/d/cyg_pub/devel/openmpi/v3.1/openmpi-3.1.5-1.x86_64/src/openmpi-3.1.5/opal/mca/pmix/pmix2x/pmix/src/mca/gds/ds21/gds_ds21_lock_pthread.c
>>>>>>>  at line 188
>>>>>>> [LAPTOP-82F08ILC:02395] PMIX ERROR: SUCCESS in file 
>>>>>>> /cygdrive/d/cyg_pub/devel/openmpi/v3.1/openmpi-3.1.5-1.x86_64/src/openmpi-3.1.5/opal/mca/pmix/pmix2x/pmix/src/mca/common/dstore/dstore_base.c
>>>>>>>  at line 2432
>>>>>>> Hello, world, I am 0 of 4, (Open MPI v3.1.5, package: Open MPI 
>>>>>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, Nov 
>>>>>>> 15, 2019, 116)
>>>>>>> Hello, world, I am 1 of 4, (Open MPI v3.1.5, package: Open MPI 
>>>>>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, Nov 
>>>>>>> 15, 2019, 116)
>>>>>>> Hello, world, I am 2 of 4, (Open MPI v3.1.5, package: Open MPI 
>>>>>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, Nov 
>>>>>>> 15, 2019, 116)
>>>>>>> Hello, world, I am 3 of 4, (Open MPI v3.1.5, package: Open MPI 
>>>>>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, Nov 
>>>>>>> 15, 2019, 116)
>>>>>>> [LAPTOP-82F08ILC:02395] [[20101,0],0] unable to open debugger attach 
>>>>>>> fifo
>>>>>>> 
>>>>>>> There is a know workaround ?
>>>>>>> I have not found anything on the issue list.
>>>>>>> 
>>>>>>> Regards
>>>>>>> MArcp
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> 
> 
> 




Re: [OMPI devel] PMIX ERROR: INIT spurious message on 3.1.5

2020-01-03 Thread Ralph Castain via devel
I committed something upstream in PMIx master and v3.1 that probably resolves 
this - another user reported it over there and provided a patch. I can probably 
backport it to v2.x and give you a patch for OMPI v3.1.


> On Jan 3, 2020, at 3:25 AM, Jeff Squyres (jsquyres) via devel 
>  wrote:
> 
> Is there a configure test we can add to make this kind of behavior be the 
> default?
> 
> 
>> On Jan 1, 2020, at 11:50 PM, Marco Atzeri via devel 
>>  wrote:
>> 
>> thanks Ralph
>> 
>> gds = ^ds21
>> works as expected
>> 
>> Am 31.12.2019 um 19:27 schrieb Ralph Castain via devel:
>>> PMIx likely defaults to the ds12 component - which will work fine but a tad 
>>> slower than ds21. It is likely something to do with the way cygwin handles 
>>> memory locks. You can avoid the error message by simply adding "gds = 
>>> ^ds21" to your default MCA param file (the pmix one - should be named 
>>> pmix-mca-params.conf).
>>> Artem - any advice here?
>>>> On Dec 25, 2019, at 9:56 AM, Marco Atzeri via devel 
>>>>  wrote:
>>>> 
>>>> I have no multinode around for testing
>>>> 
>>>> I will need to setup one for testing after the holidays
>>>> 
>>>> Am 24.12.2019 um 23:27 schrieb Jeff Squyres (jsquyres):
>>>>> That actually looks like a legit error -- it's failing to initialize a 
>>>>> shared mutex.
>>>>> I'm not sure what the consequence is of this failure, though, since the 
>>>>> job seemed to run ok.
>>>>> Are you able to run multi-node jobs ok?
>>>>>> On Dec 22, 2019, at 1:20 AM, Marco Atzeri via devel 
>>>>>>  wrote:
>>>>>> 
>>>>>> Hi Developers,
>>>>>> 
>>>>>> Cygwin 64bit, openmpi-3.1.5-1
>>>>>> testing the cygwin package before releasing it
>>>>>> I see a never seen before spurious error messages that do not seem
>>>>>> about error at all:
>>>>>> 
>>>>>> $ mpirun -n 4 ./hello_c.exe
>>>>>> [LAPTOP-82F08ILC:02395] PMIX ERROR: INIT in file 
>>>>>> /cygdrive/d/cyg_pub/devel/openmpi/v3.1/openmpi-3.1.5-1.x86_64/src/openmpi-3.1.5/opal/mca/pmix/pmix2x/pmix/src/mca/gds/ds21/gds_ds21_lock_pthread.c
>>>>>>  at line 188
>>>>>> [LAPTOP-82F08ILC:02395] PMIX ERROR: SUCCESS in file 
>>>>>> /cygdrive/d/cyg_pub/devel/openmpi/v3.1/openmpi-3.1.5-1.x86_64/src/openmpi-3.1.5/opal/mca/pmix/pmix2x/pmix/src/mca/common/dstore/dstore_base.c
>>>>>>  at line 2432
>>>>>> Hello, world, I am 0 of 4, (Open MPI v3.1.5, package: Open MPI 
>>>>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, Nov 
>>>>>> 15, 2019, 116)
>>>>>> Hello, world, I am 1 of 4, (Open MPI v3.1.5, package: Open MPI 
>>>>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, Nov 
>>>>>> 15, 2019, 116)
>>>>>> Hello, world, I am 2 of 4, (Open MPI v3.1.5, package: Open MPI 
>>>>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, Nov 
>>>>>> 15, 2019, 116)
>>>>>> Hello, world, I am 3 of 4, (Open MPI v3.1.5, package: Open MPI 
>>>>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, Nov 
>>>>>> 15, 2019, 116)
>>>>>> [LAPTOP-82F08ILC:02395] [[20101,0],0] unable to open debugger attach fifo
>>>>>> 
>>>>>> There is a know workaround ?
>>>>>> I have not found anything on the issue list.
>>>>>> 
>>>>>> Regards
>>>>>> MArcp
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 




Re: [OMPI devel] Reachable framework integration

2020-01-02 Thread Ralph Castain via devel
Hmmm...pretty complex code in there. Looks like it has to be "replicated" for 
reuse as functions are passing in btl_tcp specific structs. Is it worth 
developing an abstracted version of such functions as 
mca_btl_tcp_proc_create_interface_graph and 
mca_btl_tcp_proc_store_matched_interfaces?



On Jan 2, 2020, at 9:35 AM, George Bosilca <bosi...@icl.utk.edu> wrote:

Ralph,

I think the first use is still pending reviews (more precisely my review) at 
https://github.com/open-mpi/ompi/pull/7134.

  George.


On Wed, Jan 1, 2020 at 9:53 PM Ralph Castain via devel <devel@lists.open-mpi.org> wrote:
Hey folks

I can't find where the opal/reachable framework is being used in OMPI. I would 
like to utilize it in the PRRTE oob/tcp component, but need some guidance on 
how to do so, or pointers to an example.

Ralph





[OMPI devel] Reachable framework integration

2020-01-01 Thread Ralph Castain via devel
Hey folks

I can't find where the opal/reachable framework is being used in OMPI. I would 
like to utilize it in the PRRTE oob/tcp component, but need some guidance on 
how to do so, or pointers to an example.

Ralph




Re: [OMPI devel] PMIX ERROR: INIT spurious message on 3.1.5

2019-12-31 Thread Ralph Castain via devel
PMIx likely defaults to the ds12 component - which will work fine but a tad 
slower than ds21. It is likely something to do with the way cygwin handles 
memory locks. You can avoid the error message by simply adding "gds = ^ds21" to 
your default MCA param file (the pmix one - should be named 
pmix-mca-params.conf).

Artem - any advice here?


> On Dec 25, 2019, at 9:56 AM, Marco Atzeri via devel 
>  wrote:
> 
> I have no multinode around for testing
> 
> I will need to setup one for testing after the holidays
> 
> On 24.12.2019 at 23:27, Jeff Squyres (jsquyres) wrote:
>> That actually looks like a legit error -- it's failing to initialize a 
>> shared mutex.
>> I'm not sure what the consequence is of this failure, though, since the job 
>> seemed to run ok.
>> Are you able to run multi-node jobs ok?
>>> On Dec 22, 2019, at 1:20 AM, Marco Atzeri via devel 
>>>  wrote:
>>> 
>>> Hi Developers,
>>> 
>>> Cygwin 64bit, openmpi-3.1.5-1
>>> testing the cygwin package before releasing it
>>> I see a never seen before spurious error messages that do not seem
>>> about error at all:
>>> 
>>> $ mpirun -n 4 ./hello_c.exe
>>> [LAPTOP-82F08ILC:02395] PMIX ERROR: INIT in file 
>>> /cygdrive/d/cyg_pub/devel/openmpi/v3.1/openmpi-3.1.5-1.x86_64/src/openmpi-3.1.5/opal/mca/pmix/pmix2x/pmix/src/mca/gds/ds21/gds_ds21_lock_pthread.c
>>>  at line 188
>>> [LAPTOP-82F08ILC:02395] PMIX ERROR: SUCCESS in file 
>>> /cygdrive/d/cyg_pub/devel/openmpi/v3.1/openmpi-3.1.5-1.x86_64/src/openmpi-3.1.5/opal/mca/pmix/pmix2x/pmix/src/mca/common/dstore/dstore_base.c
>>>  at line 2432
>>> Hello, world, I am 0 of 4, (Open MPI v3.1.5, package: Open MPI 
>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, Nov 15, 
>>> 2019, 116)
>>> Hello, world, I am 1 of 4, (Open MPI v3.1.5, package: Open MPI 
>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, Nov 15, 
>>> 2019, 116)
>>> Hello, world, I am 2 of 4, (Open MPI v3.1.5, package: Open MPI 
>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, Nov 15, 
>>> 2019, 116)
>>> Hello, world, I am 3 of 4, (Open MPI v3.1.5, package: Open MPI 
>>> Marco@LAPTOP-82F08ILC Distribution, ident: 3.1.5, repo rev: v3.1.5, Nov 15, 
>>> 2019, 116)
>>> [LAPTOP-82F08ILC:02395] [[20101,0],0] unable to open debugger attach fifo
>>> 
>>> There is a know workaround ?
>>> I have not found anything on the issue list.
>>> 
>>> Regards
>>> MArcp




[OMPI devel] ORTE replacement

2019-12-25 Thread Ralph Castain via devel
Hi folks

The move to replace ORTE with PRRTE is now ready to go (the OSHMEM team needs 
to fix something in that project). This means that all further development 
activity and/or PRs involving ORTE should be transferred to the PRRTE project 
(https://github.com/openpmix/prrte). Existing PRs that reference ORTE should 
remove all such references, transferring any relevant ORTE changes to PRRTE. 
You may still get some conflicts in the non-ORTE areas as we "purged" all 
runtime references from the MPI layer in favor of PMIx, but they should be 
relatively simple to resolve.

I'll be marking a few PRs that have changes we either already have made in 
PRRTE or that I will carry across myself (very few of those). We will be 
discussing the commit schedule for the ORTE-replacement PR 
(https://github.com/open-mpi/ompi/pull/7202) at the next telecon the first week 
of January. Expectation is that we will make the commit rather soon as the 
debugger community needs it for their development efforts in support of OMPI 
5.0 coming summer 2020.

Ralph



Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-13 Thread Ralph Castain via devel
Just want to clarify my remarks to avoid any misunderstanding. I'm not in any 
way saying MPIR or the debugger are at fault here, nor was I trying to imply 
that PMIx-based tools are somehow "superior" to MPIR-based ones.

My point was solely focused on the question of reliability. The MPIR-based 
tools activate a code path in OMPI that is only used when MPIR-based tools are 
involved - it is an "exception" code path and therefore not exercised during 
any normal operations. Thus, all the nightly regression testing and normal 
daily uses of OMPI do not activate that code path, leaving it effectively 
untested.

In contrast, PMIx-based tools utilize code paths that are active during normal 
operations. Thus, those code paths are exercised and tested with every nightly 
regression test, and 10's of thousands of times a day when users run OMPI-based 
applications. There is a much higher probability of detecting a race condition 
problem in the PMIx code path, and a correspondingly higher confidence level 
that the code is working correctly.

We are not hearing of any "hangs" such as the one described in this thread from 
our user base. This means that it is unlikely any similar race condition 
resides in the "normal" code paths shared by PMIx-based tools. Thus, it is most 
likely something in the MPIR-based code paths that is the root cause of the 
trouble.

The uniqueness of the MPIR-based code paths and the corresponding lack of 
testing of those paths is why we are moving to PMIx-based tool support in OMPI 
v5.

HTH
Ralph


On Nov 13, 2019, at 10:40 AM, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:

Agreed and understood. My point was only that I'm not convinced the problem was 
"fixed" as it is entirely consistent with your findings for the race condition 
to still exist, but be biased so strongly that it now "normally" passes. 
Without determining the precise code that causes things to hang vs complete, 
there is no way to say that the code path is truly "fixed".

The fact that this only appears to happen IF the debugger_attach flag is set 
would indicate it has something to do with debugger-related code. Could be 
something in PMIx, or it could be that the change in PMIx just modified the 
race condition. It could be something in the OMPI debugger code, it could be in 
the abstraction layer between PMIx and OMPI, etc.

I don't have an immediate plan for digging deeper into possible root cause - 
and as I said, I'm not all that motivated to do so as PMIx-based tools are not 
displaying the same behavior  :-)

Ralph


On Nov 13, 2019, at 8:41 AM, John DelSignore <jdelsign...@perforce.com> wrote:

Hi Ralph,

I assume you are referring to your previous email, where you wrote:

Personally, I have never been entirely comfortable with the claim that the PMIx 
modification was the solution to the problem being discussed here. We have 
never seen a report of an application hanging in that spot outside of a 
debugger. Not one report. Yet that code has been "in the wild" now for several 
years.

What I suspect is actually happening is that the debugger is interfering with 
the OMPI internals that are involved in a way that creates a potential loss of 
the release event. The modified timing of the PMIx update biases that race 
sufficiently to make it happen "virtually never", which only means that it 
doesn't trigger when you run it a few times in quick succession. I don't know 
how to further debug it, nor am I particularly motivated to do so as the 
PMIx-based tools work within (not alongside) the release mechanism and are 
unlikely to evince the same behavior.

For now, it appears 4.0.2 is "good enough".

I'm not an OMPI/PMIx expert here, so I can only tell you what I observe, which 
is even without a debugger in the picture, I can reliably make OMPI 4.0.1 hang 
in that code by setting env ORTE_TEST_DEBUGGER_ATTACH=1. However, OMPI 4.0.2 
has not hung once after running the same test over 1,000 times.

Here's what I did:

*   I added two fprintfs to the rte_orte_module.c file in both 4.0.1 and 
4.0.2:
*   One inside _release_fn().
*   One inside ompi_rte_wait_for_debugger() at the start of the 
block that calls "OMPI_WAIT_FOR_COMPLETION(debugger_event_active);".
*   Ran w/ 4.0.1: env OMPI_MPIR_DO_NOT_WARN=1 ORTE_TEST_DEBUGGER_ATTACH=1 
mpir -np 4 ./cpi401
*   Ran w/ 4.0.2: env OMPI_MPIR_DO_NOT_WARN=1 ORTE_TEST_DEBUGGER_ATTACH=1 
mpir -np 4 ./cpi402
*   Ran w/ 4.0.1: env OMPI_MPIR_DO_NOT_WARN=1 mpir -np 4 ./cpi401

With 4.0.1 and ORTE_TEST_DEBUGGER_ATTACH=1, all of the runs hang and looks like 
this:

mic:/amd/home/jdelsign>env OMPI_MPIR_DO_NOT_WARN=1 ORTE_TEST_DEBUGGER_ATTACH=1 
mpirun -np 4 ./cpi401
Called ompi_rte_wait_for_debugger(), 
../../../../../openmpi-4.0.1/ompi/mca/rte/orte/rte_orte_module.c:182
Called ompi_rte

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-13 Thread Ralph Castain via devel
16009869231249, Error is 0.0818
wall clock time = 0.000133
mic:/amd/home/jdelsign>

With 4.0.1 and ORTE_TEST_DEBUGGER_ATTACH not set, all of the runs complete and 
look like this:

mic:/amd/home/jdelsign>env OMPI_MPIR_DO_NOT_WARN=1 mpirun -np 4 ./cpi401
Process 2 on microway1
Process 0 on microway1
Process 3 on microway1
Process 1 on microway1
pi is approximately 3.1416009869231249, Error is 0.0818
wall clock time = 0.000153
mic:/amd/home/jdelsign>

As you can see in this last test, if ORTE_TEST_DEBUGGER_ATTACH is not set, the 
code in ompi_rte_wait_for_debugger() is not executed.

Honesty, I don't know if this is a valid test or not, but it strongly suggests 
that there is a problem in that code in 4.0.1 and it cannot be the debugger's 
fault, because there is no debugger in the picture. Tthe GitHub issues Austen 
pointed at seem to accurately describe what I have seen and the conclusion 
there was that it was a bug in PMIx. I have no basis to believe otherwise.

Finally, I'd like to reply to your statement, "What I suspect is actually 
happening is that the debugger is interfering with the OMPI internals that are 
involved in a way that creates a potential loss of the release event." In my 
experience, a debugger can most commonly affect application execution in the 
following ways:

*   It can change execution timing of target processes and threads, so if 
there's already a race in the application code, the debugger can either provoke 
it or cover it up.
*   It can cause data caches to be flushed back to memory when the process 
is stopped.
*   It can cause code that is not EINTR safe. Linux is pretty good these 
days about avoiding EINTR problems in the code, but it can still happen if the 
debugger happens to stop the process while in an interruptible system call. If 
the code does not handle EINTR correctly, then it can fail.

I think that unless there's already a problem in the code, the debugger should 
not be able to interfere at all.

Cheers, John D.



On 11/12/19 6:51 PM, Ralph Castain via devel wrote:
Again, John, I'm not convinced your last statement is true. However, I think it 
is "good enough" for now as it seems to work for you and it isn't seen outside 
of a debugger scenario.


On Nov 12, 2019, at 3:13 PM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:

Hi Austen,

Thanks very much, the issues you show below do indeed describe what I am seeing.

Using printfs and breakpoints I inserted into the _release_fn() function, I was 
able to see that with OMPI 4.0.1, at most one of the MPI processes called the 
function. Most of the time rank 0 would be the only one to execute the 
function, but sometimes none of the MPI processes would execute it. However, 
with OMPI 4.0.2, all of the MPI processes execute the function reliably.

I'm glad to know that the problem was actually fixed in OMPI 4.0.2, and not 
just accidentally working for my test cases.

Cheers, John D.

On 11/12/19 3:41 PM, Austen W Lauria wrote:
I think you are hitting this issue here in 4.0.1:

https://github.com/open-mpi/ompi/issues/6613

MPIR was broken in 4.0.1 due to a race condition in PMIx. It was patched, it 
looks to me, for 4.0.2. Here is the openpmix 
issue:https://github.com/openpmix/openpmix/issues/1189 

I think this lines up - 4.0.2 should be good with a fix.

John DelSignore ---11/12/2019 02:25:14 PM---Hi Austen, Thanks for 
the reply. What I am seeing is consistent with your thought, in that when I se

From: John DelSignore
To: Open MPI Developers
Cc: Austen W Lauria, devel 

Date: 11/12/2019 02:25 PM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside 
MPI_Init() when debugged with TotalView





Hi Austen,

Thanks for the reply. What I am seeing is consistent with your thought, in that 
when I see the hang, one or more processes did not have a flag updated. I don't 
understand how the Open MPI code works well enough to say if it is a memory 
barrier problem or not. It almost looks like a event delivery or dropped event 
problem to me. 

The place in the MPI_init() code where the MPI processes hang and the number of 
"hung" processes seems to vary from run to run. In some cases the processes are 
waiting for an event or waiting for a fence (whatever that is).

I did the following run today, which shows that it can hang waiting for an 
event that apparently was not generated or was dropped:

1. Started TV on mpirun: totalview -args mpirun -np 4 ./cpi
2. Ran the mpirun process until it hit the MPIR_Breakpoint() event.
3. TV attached to all four of the MPI processes and left all five processes 
stopped.
4. Continued all of the processes/threads and let them run freely for about 60 
seconds. They should have run to completion in that amount of time.
5. Halted all of the processes. I included an aggregated backtrace of all of 
the processes below.

In this particular run, all f

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Ralph Castain via devel
eads.c> #109 : 4:4[0-3.3]
| |+opal_libevent2022_event_base_loop@event.c#1630
| | +epoll_dispatch@epoll.c#407
| |  +__epoll_wait_nocancel
| +progress_engine : 1:2[p1.2, p1.4]
|  +opal_libevent2022_event_base_loop@event.c#1630
|   +epoll_dispatch@epoll.c#407 : 1:1[p1.2]
|   |+__epoll_wait_nocancel
|   +poll_dispatch@poll.c#165 : 1:1[p1.4]
|    +__poll_nocancel
+_start : 5:5[0-3.1, p1.1]
 +__libc_start_main
  +main@cpi.c#27 : 4:4[0-3.1]
  |+PMPI_Init@pinit.c#67
  | +ompi_mpi_init@ompi_mpi_init.c#890
  |  +ompi_rte_wait_for_debugger@rte_orte_module.c#196
  |   +opal_progress@opal_progress.c#245 : 1:1[0.1]
  |   |+opal_progress_events@opal_progress.c#191
  |   | +opal_libevent2022_event_base_loop@event.c#1630
  |   |  +poll_dispatch@poll.c#165
  |   |   +__poll_nocancel
  |   +opal_progress@opal_progress.c#247 : 3:3[1-3.1]
  |    +opal_progress_events@opal_progress.c#191
  |     +opal_libevent2022_event_base_loop@event.c#1630
  |      +poll_dispatch@poll.c#165
  |       +__poll_nocancel
  +orterun : 1:1[p1.1]
   +opal_libevent2022_event_base_loop@event.c#1630
    +poll_dispatch@poll.c#165
     +__poll_nocancel

d1.<> 

On 11/12/19 9:47 AM, Austen W Lauria via devel wrote: 

Could it be that some processes are not seeing the flag get updated? I don't 
think just using a simple while loop with a volatile variable is sufficient in 
all cases in a multi-threaded environment. It's my understanding that the 
volatile keyword just tells the compiler to not optimize or do anything funky 
with it - because it can change at any time. However, this doesn't provide any 
memory barrier - so it's possible that the thread polling on this variable is 
never seeing the update.

Looking at the code - I see:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg) \
do { \
opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
"%s lazy waiting on RTE event at %s:%d", \
OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
__FILE__, __LINE__); \
while ((flg)) { \
opal_progress(); \
usleep(100); \
} \
}while(0);

I think replacing that with:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock) \
do { \
opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
"%s lazy waiting on RTE event at %s:%d", \
OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
__FILE__, __LINE__); \

pthread_mutex_lock(&lock); \
while ((flg)) { \
pthread_cond_wait(&cond, &lock); \ // releases the lock while waiting for a
signal from another thread to wake up
} \
pthread_mutex_unlock(&lock); \

}while(0);

Is much more standard when dealing with threads updating a shared variable - 
and might lead to a more expected result in this case.

On the other end, this would require the thread updating this variable to:

pthread_mutex_lock(&lock);
flg = new_val;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);

This provides the memory barrier for the thread polling on the flag to see the 
update - something the volatile keyword doesn't do on its own. I think it's 
also much cleaner as it eliminates an arbitrary sleep from the code - which I 
see as a good thing as well.
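
For reference, a minimal, self-contained sketch of the condition-variable pattern described above (illustrative only, not OMPI code; all names here are made up for the example):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int flg = 1;                /* "RTE event still pending" */

/* stand-in for the thread that delivers the RTE event */
static void *event_thread(void *arg)
{
    (void)arg;
    sleep(1);                      /* pretend the event takes a while to arrive */
    pthread_mutex_lock(&lock);
    flg = 0;                       /* update the shared flag ...          */
    pthread_cond_signal(&cond);    /* ... and wake the waiting thread     */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, event_thread, NULL);

    /* the waiter sleeps inside pthread_cond_wait() until signaled; the
     * mutex/cond pair provides the memory ordering discussed above */
    pthread_mutex_lock(&lock);
    while (flg) {
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);

    printf("flag observed as cleared\n");
    pthread_join(tid, NULL);
    return 0;
}

(Note Ralph's point later in the thread: inside OMPI the waiter must also keep driving opal_progress(), so a fully blocking wait like this is not directly usable there.)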


"Ralph Castain via devel" ---11/12/2019 09:24:23 AM---> On Nov 11, 
2019, at 4:53 PM, Gilles Gouaillardet via devel 
<mailto:devel@lists.open-mpi.org> wrote: >

From: "Ralph Castain via devel"  
<mailto:devel@lists.open-mpi.org> 
To: "OpenMPI Devel"  
<mailto:devel@lists.open-mpi.org> 
Cc: "Ralph Castain"  <mailto:r...@open-mpi.org> 
Date: 11/12/2019 09:24 AM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside 
MPI_Init() when debugged with TotalView
Sent by: "devel"  
<mailto:devel-boun...@lists.open-mpi.org> 






> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel 
> <devel@lists.open-mpi.org> wrote:
> 
> John,
> 
> OMPI_LAZY_WAIT_FOR_COMPLETION(active)
> 
> 
> is a simple loop that periodically checks the (volatile) "active" condition, 
> that is expected to be updated by another thread.
> So if you set your breakpoint too early, and **all

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Ralph Castain via devel
progress@opal_progress.c #245 : 1:1[0.1]
   |   |+opal_progress_events@opal_progress.c #191
   |   | +opal_libevent2022_event_base_loop@event.c #1630
   |   |  +poll_dispatch@poll.c #165
   |   |   +__poll_nocancel
   |   +opal_progress@opal_progress.c #247 : 3:3[1-3.1]
   |    +opal_progress_events@opal_progress.c #191
   |     +opal_libevent2022_event_base_loop@event.c #1630
   |      +poll_dispatch@poll.c #165
   |       +__poll_nocancel
   +orterun : 1:1[p1.1]
    +opal_libevent2022_event_base_loop@event.c #1630
     +poll_dispatch@poll.c #165
      +__poll_nocancel

d1.<> 

On 11/12/19 9:47 AM, Austen W Lauria via devel wrote:

Could it be that some processes are not seeing the flag get updated? I don't 
think just using a simple while loop with a volatile variable is sufficient in 
all cases in a multi-threaded environment. It's my understanding that the 
volatile keyword just tells the compiler to not optimize or do anything funky 
with it - because it can change at any time. However, this doesn't provide any 
memory barrier - so it's possible that the thread polling on this variable is 
never seeing the update.

Looking at the code - I see:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg) \
do { \
opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
"%s lazy waiting on RTE event at %s:%d", \
OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
__FILE__, __LINE__); \
while ((flg)) { \
opal_progress(); \
usleep(100); \
} \
}while(0);

I think replacing that with:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock) \
do { \
opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
"%s lazy waiting on RTE event at %s:%d", \
OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
__FILE__, __LINE__); \

pthread_mutex_lock(&lock); \
while ((flg)) { \
pthread_cond_wait(&cond, &lock); \ // releases the lock while waiting for a
signal from another thread to wake up
} \
pthread_mutex_unlock(&lock); \

}while(0);

Is much more standard when dealing with threads updating a shared variable - 
and might lead to a more expected result in this case.

On the other end, this would require the thread updating this variable to:

pthread_mutex_lock(&lock);
flg = new_val;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);

This provides the memory barrier for the thread polling on the flag to see the 
update - something the volatile keyword doesn't do on its own. I think it's 
also much cleaner as it eliminates an arbitrary sleep from the code - which I 
see as a good thing as well.


"Ralph Castain via devel" ---11/12/2019 09:24:23 AM---> On Nov 11, 
2019, at 4:53 PM, Gilles Gouaillardet via devel  
<mailto:devel@lists.open-mpi.org> wrote: >

From: "Ralph Castain via devel"  
<mailto:devel@lists.open-mpi.org> 
To: "OpenMPI Devel"  
<mailto:devel@lists.open-mpi.org> 
Cc: "Ralph Castain"  <mailto:r...@open-mpi.org> 
Date: 11/12/2019 09:24 AM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside 
MPI_Init() when debugged with TotalView
Sent by: "devel"  
<mailto:devel-boun...@lists.open-mpi.org> 







> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel 
> <devel@lists.open-mpi.org> wrote:
> 
> John,
> 
> OMPI_LAZY_WAIT_FOR_COMPLETION(active)
> 
> 
> is a simple loop that periodically checks the (volatile) "active" condition, 
> that is expected to be updated by another thread.
> So if you set your breakpoint too early, and **all** threads are stopped when 
> this breakpoint is hit, you might experience
> what looks like a race condition.
> I guess a similar scenario can occur if the breakpoint is set in mpirun/orted 
> too early, and prevents the pmix (or oob/tcp) thread
> from sending the message to all MPI tasks)
> 
> 
> 
> Ralph,
> 
> does the v4.0.x branch still need the oob/tcp progress thread running inside 
> the MPI app?
> or are we missing some commits (since all interactions with mpirun/orted are 
> handled by PMIx, at least in the master branch) ?

IIRC, that progress thread only runs if explicitly asked to do so by MCA param. 
We don't need that code any more as PMIx takes care of it.

> 
> Cheers,
> 
> Gilles
> 
> On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
>> Hi John
>> 
>> Sorry to say, but there is no way to really answer your question as the OMPI 
>> community doesn't actively test MPIR sup

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Ralph Castain via devel
Just to be clear as well: you cannot use the pthread method you propose because 
you must loop over opal_progress - the "usleep" is in there simply to avoid 
consuming 100% cpu while we wait.
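
A hedged sketch of what such a progress-driven wait with explicit memory ordering could look like (illustrative only, not the OMPI implementation; the stub progress() merely stands in for opal_progress(), and all other names are made up for the example):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool active = true;   /* fence not yet released */

static void progress(void)
{
    /* stand-in for opal_progress(): drives libevent, PMIx callbacks, etc. */
}

/* stand-in for the callback thread that completes the fence */
static void *fence_release(void *arg)
{
    (void)arg;
    sleep(1);   /* pretend the fence takes a while */
    atomic_store_explicit(&active, false, memory_order_release);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, fence_release, NULL);

    /* keep driving progress while polling the flag with acquire loads;
     * the usleep() keeps the loop from burning 100% CPU */
    while (atomic_load_explicit(&active, memory_order_acquire)) {
        progress();
        usleep(100);
    }

    printf("fence released\n");
    pthread_join(tid, NULL);
    return 0;
}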


On Nov 12, 2019, at 8:52 AM, George Bosilca via devel <devel@lists.open-mpi.org> wrote:

I don't think there is a need any protection around that variable. It will 
change value only once (in a callback triggered from opal_progress), and the 
volatile guarantees that loads will be issued for every access, so the waiting 
thread will eventually notice the change.

 George.


On Tue, Nov 12, 2019 at 9:48 AM Austen W Lauria via devel 
<devel@lists.open-mpi.org> wrote:
Could it be that some processes are not seeing the flag get updated? I don't 
think just using a simple while loop with a volatile variable is sufficient in 
all cases in a multi-threaded environment. It's my understanding that the 
volatile keyword just tells the compiler to not optimize or do anything funky 
with it - because it can change at any time. However, this doesn't provide any 
memory barrier - so it's possible that the thread polling on this variable is 
never seeing the update.

Looking at the code - I see:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg) \
 do { \
 opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
 "%s lazy waiting on RTE event at %s:%d", \
 OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
 __FILE__, __LINE__); \
 while ((flg)) { \
 opal_progress(); \
 usleep(100); \
 } \
 }while(0);

I think replacing that with:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock) \
 do { \
 opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
 "%s lazy waiting on RTE event at %s:%d", \
 OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
 __FILE__, __LINE__); \

pthread_mutex_lock(&lock); \
while ((flg)) { \
 pthread_cond_wait(&cond, &lock); \ // releases the lock while waiting for a
signal from another thread to wake up
} \
pthread_mutex_unlock(&lock); \

 }while(0);

Is much more standard when dealing with threads updating a shared variable - 
and might lead to a more expected result in this case.

On the other end, this would require the thread updating this variable to:

pthread_mutex_lock(&lock);
flg = new_val;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);

This provides the memory barrier for the thread polling on the flag to see the 
update - something the volatile keyword doesn't do on its own. I think it's 
also much cleaner as it eliminates an arbitrary sleep from the code - which I 
see as a good thing as well.


"Ralph Castain via devel" ---11/12/2019 09:24:23 AM---> On Nov 11, 
2019, at 4:53 PM, Gilles Gouaillardet via devel mailto:devel@lists.open-mpi.org> > wrote: >

From: "Ralph Castain via devel" mailto:devel@lists.open-mpi.org> >
To: "OpenMPI Devel" mailto:devel@lists.open-mpi.org> 
>
Cc: "Ralph Castain" mailto:r...@open-mpi.org> >
Date: 11/12/2019 09:24 AM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside 
MPI_Init() when debugged with TotalView
Sent by: "devel" mailto:devel-boun...@lists.open-mpi.org> >







> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel 
> <devel@lists.open-mpi.org> wrote:
> 
> John,
> 
> OMPI_LAZY_WAIT_FOR_COMPLETION(active)
> 
> 
> is a simple loop that periodically checks the (volatile) "active" condition, 
> that is expected to be updated by another thread.
> So if you set your breakpoint too early, and **all** threads are stopped when 
> this breakpoint is hit, you might experience
> what looks like a race condition.
> I guess a similar scenario can occur if the breakpoint is set in mpirun/orted 
> too early, and prevents the pmix (or oob/tcp) thread
> from sending the message to all MPI tasks)
> 
> 
> 
> Ralph,
> 
> does the v4.0.x branch still need the oob/tcp progress thread running inside 
> the MPI app?
> or are we missing some commits (since all interactions with mpirun/orted are 
> handled by PMIx, at least in the master branch) ?

IIRC, that progress thread only runs if explicitly asked to do so by MCA param. 
We don't need that code any more as PMIx takes care of it.

> 
> Cheers,
> 
> Gilles
> 
> On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
>> Hi John
>> 
>> Sorry to say, but there is no way to really answer your question as the OMPI 
>> community doesn't actively test MPIR support. I haven't seen any reports of 
>> hangs during MPI_Init from any release series, including 4.x. My guess is 
>> that it may have something to do with the debugger interactions as opposed 
>> to being a true race condition.
>> 
>> Ralph
>> 
>> 
>>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel 
>

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Ralph Castain via devel


> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel 
>  wrote:
> 
> John,
> 
> OMPI_LAZY_WAIT_FOR_COMPLETION(active)
> 
> 
> is a simple loop that periodically checks the (volatile) "active" condition, 
> that is expected to be updated by another thread.
> So if you set your breakpoint too early, and **all** threads are stopped when 
> this breakpoint is hit, you might experience
> what looks like a race condition.
> I guess a similar scenario can occur if the breakpoint is set in mpirun/orted 
> too early, and prevents the pmix (or oob/tcp) thread
> from sending the message to all MPI tasks)
> 
> 
> 
> Ralph,
> 
> does the v4.0.x branch still need the oob/tcp progress thread running inside 
> the MPI app?
> or are we missing some commits (since all interactions with mpirun/orted are 
> handled by PMIx, at least in the master branch) ?

IIRC, that progress thread only runs if explicitly asked to do so by MCA param. 
We don't need that code any more as PMIx takes care of it.

> 
> Cheers,
> 
> Gilles
> 
> On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
>> Hi John
>> 
>> Sorry to say, but there is no way to really answer your question as the OMPI 
>> community doesn't actively test MPIR support. I haven't seen any reports of 
>> hangs during MPI_Init from any release series, including 4.x. My guess is 
>> that it may have something to do with the debugger interactions as opposed 
>> to being a true race condition.
>> 
>> Ralph
>> 
>> 
>>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel 
>>> <devel@lists.open-mpi.org> wrote:
>>> 
>>> Hi,
>>> 
>>> An LLNL TotalView user on a Mac reported that their MPI job was hanging 
>>> inside MPI_Init() when started under the control of TotalView. They were 
>>> using Open MPI 4.0.1, and TotalView was using the MPIR Interface (sorry, we 
>>> don't support the PMIx debugging hooks yet).
>>> 
>>> I was able to reproduce the hang on my own Linux system with my own build 
>>> of Open MPI 4.0.1, which I built with debug symbols. As far as I can tell, 
>>> there is some sort of race inside of Open MPI 4.0.1, because if I placed 
>>> breakpoints at certain points in the Open MPI code, and thus change the 
>>> timing slightly, that was enough to avoid the hang.
>>> 
>>> When the code hangs, it appeared as if one or more MPI processes are 
>>> waiting inside ompi_mpi_init() at line ompi_mpi_init.c#904 for a fence to 
>>> be released. In one of the runs, rank 0 was the only one that was hanging 
>>> there (though I have seen runs where two ranks were hung there).
>>> 
>>> Here's a backtrace of the first thread in the rank 0 process in the case 
>>> where one rank was hung:
>>> 
>>> d1.<> f 10.1 w
>>> >  0 __nanosleep_nocancel PC=0x774e2efd, FP=0x7fffd1e0 
>>> > [/lib64/libc.so.6]
>>>1 usleep PC=0x77513b2f, FP=0x7fffd200 [/lib64/libc.so.6]
>>>2 ompi_mpi_init PC=0x77a64009, FP=0x7fffd350 
>>> [/home/jdelsign/src/tools-external/openmpi-4.0.1/ompi/runtime/ompi_mpi_init.c#904]
>>>3 PMPI_Init PC=0x77ab0be4, FP=0x7fffd390 
>>> [/home/jdelsign/src/tools-external/openmpi-4.0.1-lid/ompi/mpi/c/profile/pinit.c#67]
>>>4 main PC=0x00400c5e, FP=0x7fffd550 
>>> [/home/jdelsign/cpi.c#27]
>>>5 __libc_start_main PC=0x77446b13, FP=0x7fffd610 
>>> [/lib64/libc.so.6]
>>>6 _start   PC=0x00400b04, FP=0x7fffd618 
>>> [/amd/home/jdelsign/cpi]
>>> 
>>> Here's the block of code where the thread is hung:
>>> 
>>> /* if we executed the above fence in the background, then
>>>  * we have to wait here for it to complete. However, there
>>>  * is no reason to do two barriers! */
>>> if (background_fence) {
>>> OMPI_LAZY_WAIT_FOR_COMPLETION(active);
>>> } else if (!ompi_async_mpi_init) {
>>> /* wait for everyone to reach this point - this is a hard
>>>  * barrier requirement at this time, though we hope to relax
>>>  * it at a later point */
>>> if (NULL != opal_pmix.fence_nb) {
>>> active = true;
>>> OPAL_POST_OBJECT(&active);
>>> if (OMPI_SUCCESS != (ret = opal_pmix.fence_nb(NULL, false,
>>> fence_release, (void*)&active))) {
>>> error = "opal_pmix.fence_nb() failed";
>>> goto error;
>&g

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-11 Thread Ralph Castain via devel
Hi John

Sorry to say, but there is no way to really answer your question as the OMPI 
community doesn't actively test MPIR support. I haven't seen any reports of 
hangs during MPI_Init from any release series, including 4.x. My guess is that 
it may have something to do with the debugger interactions as opposed to being 
a true race condition.

Ralph


On Nov 8, 2019, at 11:27 AM, John DelSignore via devel 
<devel@lists.open-mpi.org> wrote:

Hi,

An LLNL TotalView user on a Mac reported that their MPI job was hanging inside 
MPI_Init() when started under the control of TotalView. They were using Open 
MPI 4.0.1, and TotalView was using the MPIR Interface (sorry, we don't support 
the PMIx debugging hooks yet).

I was able to reproduce the hang on my own Linux system with my own build of 
Open MPI 4.0.1, which I built with debug symbols. As far as I can tell, there 
is some sort of race inside of Open MPI 4.0.1, because if I placed breakpoints 
at certain points in the Open MPI code, and thus change the timing slightly, 
that was enough to avoid the hang.

When the code hangs, it appeared as if one or more MPI processes are waiting 
inside ompi_mpi_init() at line ompi_mpi_init.c#904 for a fence to be released. 
In one of the runs, rank 0 was the only one that was hanging there (though I 
have seen runs where two ranks were hung there).

Here's a backtrace of the first thread in the rank 0 process in the case where 
one rank was hung:

d1.<> f 10.1 w
>  0 __nanosleep_nocancel PC=0x774e2efd, FP=0x7fffd1e0 
>[/lib64/libc.so.6]
   1 usleep   PC=0x77513b2f, FP=0x7fffd200 [/lib64/libc.so.6]
   2 ompi_mpi_init    PC=0x77a64009, FP=0x7fffd350 
[/home/jdelsign/src/tools-external/openmpi-4.0.1/ompi/runtime/ompi_mpi_init.c#904]
   3 PMPI_Init    PC=0x77ab0be4, FP=0x7fffd390 
[/home/jdelsign/src/tools-external/openmpi-4.0.1-lid/ompi/mpi/c/profile/pinit.c#67]
   4 main PC=0x00400c5e, FP=0x7fffd550 [/home/jdelsign/cpi.c#27]
   5 __libc_start_main PC=0x77446b13, FP=0x7fffd610 [/lib64/libc.so.6]
   6 _start   PC=0x00400b04, FP=0x7fffd618 [/amd/home/jdelsign/cpi]

Here's the block of code where the thread is hung:

    /* if we executed the above fence in the background, then
 * we have to wait here for it to complete. However, there
 * is no reason to do two barriers! */
    if (background_fence) {
    OMPI_LAZY_WAIT_FOR_COMPLETION(active);
    } else if (!ompi_async_mpi_init) {
    /* wait for everyone to reach this point - this is a hard
 * barrier requirement at this time, though we hope to relax
 * it at a later point */
    if (NULL != opal_pmix.fence_nb) {
    active = true;
    OPAL_POST_OBJECT(&active);
    if (OMPI_SUCCESS != (ret = opal_pmix.fence_nb(NULL, false,
                                   fence_release, (void*)&active))) {
    error = "opal_pmix.fence_nb() failed";
    goto error;
    }
    OMPI_LAZY_WAIT_FOR_COMPLETION(active);   - STUCK HERE 
WAITING FOR THE FENCE TO BE RELEASED
    } else {
    if (OMPI_SUCCESS != (ret = opal_pmix.fence(NULL, false))) {
    error = "opal_pmix.fence() failed";
    goto error;
    }
    }
    }

And here is an aggregated backtrace of all of the processes and threads in the 
job:

d1.<> f g w -g f+l
+/
 +__clone : 5:12[0-3.2-3, p1.2-5]
 |+start_thread
 | +listen_thread@oob_tcp_listener.c #705 : 1:1[p1.5]
 | |+__select_nocancel
 | +listen_thread@ptl_base_listener.c #214 : 1:1[p1.3]
 | |+__select_nocancel
 | +progress_engine@opal_progress_threads.c #105 : 5:5[0-3.2, p1.4]
 | |+opal_libevent2022_event_base_loop@event.c #1632
 | | +poll_dispatch@poll.c #167
 | |  +__poll_nocancel
 | +progress_engine@pmix_progress_threads.c #108 : 5:5[0-3.3, p1.2]
 |  +opal_libevent2022_event_base_loop@event.c #1632
 |   +epoll_dispatch@epoll.c #409
 |    +__epoll_wait_nocancel
 +_start : 5:5[0-3.1, p1.1]
  +__libc_start_main
   +main@cpi.c #27 : 4:4[0-3.1]
   |+PMPI_Init@pinit.c #67
   | +ompi_mpi_init@ompi_mpi_init.c#890 : 3:3[1-3.1]   THE 3 OTHER MPI PROCS MADE IT PAST FENCE
   | |+ompi_rte_wait_for_debugger@rte_orte_module.c #196
   | | +opal_progress@opal_progress.c #251
   | |  +opal_progress_events@opal_progress.c #191
   | |   +opal_libevent2022_event_base_loop@event.c 

[OMPI devel] Network support in OMPI

2019-07-24 Thread Ralph Castain via devel
Hi folks

I mentioned this very briefly at the Tues telecon, but didn't explain it well 
as there just wasn't adequate time available. With the recent updates of the 
embedded PMIx code, OMPI's mpirun now has the ability to fully support 
pre-launch network resource assignment for processes. This includes endpoints 
as well as network coordinates.

In brief, what happens is:

* at startup, the PMIx network support plugins in mpirun obtain their network 
configuration info. In cases where a fabric manager is present, we directly 
communicate to that FM for the info we need. Where no fabric manager is 
available, an MCA param can point us to a file containing the info, or the 
plugin can get it in whatever way the vendor chooses

* when ORTE launches its daemons, the daemons query their PMIx network support 
plugins for any network inventory info they would like to communicate back to 
mpirun. Each plugin (TCP, whatever) is given an opportunity to contribute to 
that payload. The data is included in the daemon's "phone home" message

* when the inventory arrives at mpirun, ORTE delivers it to the PMIx network 
support plugins for processing. As far as ORTE is concerned, it is an opaque 
"blob" - only the fabric plugin provider knows what is in it and how to process 
it. In the case of TCP (which I wrote), we store information on both the 
available static ports on each node and the available NICs (e.g., subnet they 
are attached to).

* when mpirun is ready to launch, it passes the process map down to the PMIx 
network support plugins (again, every plugin gets to see it) so they can 
assign/allocate network resources to the procs. In the case of TCP, we assign a 
static socket (or multiple sockets if they request it) to each process on each 
node, a prioritized list of the NICs they can use (based on distance), and the 
network coordinates of the NICs. This all gets bundled up into a per-plugin 
"blob" and passed up to mpirun for inclusion in the launch command sent to the 
daemons.

* when a daemon receives the launch command, it passes the "blobs" down to the 
local PMIx network support plugins, which parse the blob as they desire. In the 
case of TCP, we simply store the assignment info in the PMIx datastore for 
retrieval by the procs when they want to communicate to a peer or compute a 
topologically aware collective pattern.


The definition of coordinate values for each NIC is up to the network support 
plugins. The pmix_coord_t struct includes an array of integer coordinates along 
with a value indicating the number of dimensions and a flag indicating whether 
it is a "logical" or "physical" view - this is in keeping with the MPI topology 
WG. Some fabrics are writing plugins that provide that info per the vendor's 
algorithms. In the case of TCP, what I've done is rather simple. I provide an 
x,y,z coordinate "logical" coordinate for each NIC where:

* x represents the relative NIC index on the host where the proc is located - 
just a simple counter (e.g., this is the third NIC on the host)

* y represents the switch to which that NIC is attached - i.e., if you have the 
same y-coord as another NIC, you are attached to the same switch

* z represents the subnet - i.e., if you have the same z-coord as another NIC, 
then that NIC is on the same subnet as you

It is totally up to the plugin - the idea is to provide each process with 
information that allows them to know relative location. I'm quite open to 
modifying the TCP one as it was just done as an example for testing the 
infrastructure. You can retrieve coordinate info for any proc using PMIx_Get. 
You can also retrieve the relative communication cost to any proc - the plugin 
will compute it for you based on the coordinates, assuming the plugin supports 
that ability (in the case of my TCP one, it uses the coordinate to compute the 
number of hops because I numbered things to support that algo).
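
For illustration, a hedged sketch of what that client-side retrieval could look like. PMIx_Init, PMIx_Get, and PMIx_Finalize are standard PMIx client calls, but the attribute name PMIX_FABRIC_COORDINATES used below is an assumption - check pmix_common.h of your PMIx version for the exact coordinate key exposed by the network/fabric plugins:

#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_value_t *val = NULL;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        fprintf(stderr, "PMIx_Init failed\n");
        return 1;
    }

    /* Ask for the coordinate info the launch-time plugins stored for us.
     * Per the description above, the payload is a pmix_coord_t: an array of
     * integer coordinates, the number of dimensions, and a logical/physical
     * view flag.  The key name here is an assumption, not verified. */
    if (PMIX_SUCCESS == PMIx_Get(&myproc, PMIX_FABRIC_COORDINATES, NULL, 0, &val)) {
        printf("rank %u: got a network coordinate from the PMIx datastore\n",
               (unsigned)myproc.rank);
        PMIX_VALUE_RELEASE(val);
    } else {
        printf("rank %u: no coordinate available (plugin not active?)\n",
               (unsigned)myproc.rank);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}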

PRRTE already knows how to do all this - there are a few simple changes 
required to sync OMPI. If folks are interested in exploring this further, 
please let me know.
Ralph


___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] [OMPI] issue with mpirun

2019-07-12 Thread Ralph Castain via devel
Again, I have no knowledge of what this program is supposed to do. I would have 
thought it would only print once as there is only one answer, but I don't know 
the code. I'd suggest looking to see where it prints.


On Jul 12, 2019, at 6:32 AM, Dangeti Tharun kumar <cs15mtech11...@iith.ac.in> wrote:

Program is a hybrid OpenMP and OpenMPI matrix multiplication.
Time is 0.004174
4.00 4.00 4.00 4.00 
4.00 4.00 4.00 4.00 
4.00 4.00 4.00 4.00 
4.00 4.00 4.00 4.00 
This is the program output. If you see, it is just printed only once by the 
second case.

On Fri, Jul 12, 2019 at 6:00 PM Ralph Castain via devel 
<devel@lists.open-mpi.org> wrote:
Afraid I don't know anything about that program, but it looks like it is 
printing the same number of times in both cases. It only appears to be more in 
the first case because the line wraps due to the number of PUs in the list


On Jul 12, 2019, at 3:00 AM, Dangeti Tharun kumar <cs15mtech11...@iith.ac.in> wrote:

Thanks, Ralph.

Why is the output of the program being run (mm-llvm.out) printed only once, 
while the mpirun from Intel prints it as many times as mentioned in the command 
line? 


On Thu, Jul 11, 2019 at 11:08 PM Ralph Castain via devel 
<devel@lists.open-mpi.org> wrote:
Because OMPI binds to core by default when np=2. If you have an OpenMP process, 
you want to add "--bind-to numa" to your mpirun cmd line.


On Jul 11, 2019, at 10:28 AM, Dangeti Tharun kumar via devel 
<devel@lists.open-mpi.org> wrote:


Hi Devs,

I have built openmpi with the LLVM-8 compiler and tried a simple example on a 2 
socket machine.
The following is the output of mpirun from ICC (Intel C compiler package) and 
mpirun from openmpi:

Why is the CPU topology (highlighted below) identified by the two of them 
different? Not sure if this behavior is correct.

$>intel/bin/mpirun -np 2 ./mm-llvm.out

OMP: Info #211: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #211: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #209: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109
OMP: Info #156: KMP_AFFINITY: 56 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 14 cores/pkg x 2 threads/core (28 
total cores)
OMP: Info #249: KMP_AFFINITY: pid 89736 tid 89736 thread 0 bound to OS proc set 
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109
OMP: Info #249: KMP_AFFINITY: pid 89736 tid 89752 thread 1 bound to OS proc set 
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109
OMP: Info #249: KMP_AFFINITY: pid 89736 tid 89753 thread 2 bound to OS proc set 
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109
OMP: Info #249: KMP_AFFINITY: pid 89736 tid 89754 thread 3 bound to OS proc set 
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109
OMP: Info #209: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111
OMP: Info #156: KMP_AFFINITY: 56 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 14 cores/pkg x 2 threads/core (28 
total cores)
OMP: Info #249: KMP_AFFINITY: pid 89737 tid 89737 thread 0 bound to OS proc set 
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111
OMP: Info #249: KMP_AFFINITY: pid 89737 tid 89755 thread 1 bound to OS proc set 
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111
OMP: Info #249: KMP_AFFINITY: pid 89737 tid 89756 thread 2 bound to OS proc set 
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111
OMP: Info #249: KMP_AFFINITY: pid 89737 t

Re: [OMPI devel] [OMPI] issue with mpirun

2019-07-12 Thread Ralph Castain via devel
Afraid I don't know anything about that program, but it looks like it is 
printing the same number of times in both cases. It only appears to be more in 
the first case because the line wraps due to the number of PUs in the list


On Jul 12, 2019, at 3:00 AM, Dangeti Tharun kumar <cs15mtech11...@iith.ac.in> wrote:

Thanks, Ralph.

Why is the output of the program being run (mm-llvm.out) printed only once, 
while the mpirun from Intel prints it as many times as mentioned in the command 
line? 


On Thu, Jul 11, 2019 at 11:08 PM Ralph Castain via devel 
<devel@lists.open-mpi.org> wrote:
Because OMPI binds to core by default when np=2. If you have an OpenMP process, 
you want to add "--bind-to numa" to your mpirun cmd line.


On Jul 11, 2019, at 10:28 AM, Dangeti Tharun kumar via devel 
<devel@lists.open-mpi.org> wrote:


Hi Devs,

I have built openmpi with the LLVM-8 compiler and tried a simple example on a 2 
socket machine.
The following is the output of mpirun from ICC (Intel C compiler package) and 
mpirun from openmpi:

Why is the CPU topology (highlighted below) identified by the two of them 
different? Not sure if this behavior is correct.

$>intel/bin/mpirun -np 2 ./mm-llvm.out

OMP: Info #211: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #211: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #209: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109
OMP: Info #156: KMP_AFFINITY: 56 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 14 cores/pkg x 2 threads/core (28 
total cores)
OMP: Info #249: KMP_AFFINITY: pid 89736 tid 89736 thread 0 bound to OS proc set 
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109
OMP: Info #249: KMP_AFFINITY: pid 89736 tid 89752 thread 1 bound to OS proc set 
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109
OMP: Info #249: KMP_AFFINITY: pid 89736 tid 89753 thread 2 bound to OS proc set 
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109
OMP: Info #249: KMP_AFFINITY: pid 89736 tid 89754 thread 3 bound to OS proc set 
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109
OMP: Info #209: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111
OMP: Info #156: KMP_AFFINITY: 56 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 14 cores/pkg x 2 threads/core (28 
total cores)
OMP: Info #249: KMP_AFFINITY: pid 89737 tid 89737 thread 0 bound to OS proc set 
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111
OMP: Info #249: KMP_AFFINITY: pid 89737 tid 89755 thread 1 bound to OS proc set 
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111
OMP: Info #249: KMP_AFFINITY: pid 89737 tid 89756 thread 2 bound to OS proc set 
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111
OMP: Info #249: KMP_AFFINITY: pid 89737 tid 89757 thread 3 bound to OS proc set 
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111
Time is 0.004174
4.00 4.00 4.00 4.00 
4.00 4.00 4.00 4.00 
4.00 4.00 4.00 4.00 
4.00 4.00 4.00 4.00 
Time is 0.004542
4.00 4.00 4.00 4.00 
4.00 4.00 4.00 4.00 
4.00 4.00 4.00 4.00 
4.00 4.00 4.00 4.00

$>openmpi/bin/mpirun -np 2 ./mm-llvm.out 

OMP: Info #211: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #211: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #209: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0,56
OMP: Info #156: KMP_AFFIN

Re: [OMPI devel] [OMPI] issue with mpirun

2019-07-11 Thread Ralph Castain via devel
Because OMPI binds to core by default when np=2. If you have an OpenMP process, 
you want to add "--bind-to numa" to your mpirun cmd line.
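
For example, using the command from this thread, something like

    openmpi/bin/mpirun --bind-to numa -np 2 ./mm-llvm.out

should give each rank a whole NUMA domain for its OpenMP threads instead of a single core.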


On Jul 11, 2019, at 10:28 AM, Dangeti Tharun kumar via devel 
<devel@lists.open-mpi.org> wrote:


Hi Devs,

I have built openmpi with the LLVM-8 compiler and tried a simple example on a 2 
socket machine.
The following is the output of mpirun from ICC (Intel C compiler package) and 
mpirun from openmpi:

Why is the CPU topology (highlighted below) identified by the two of them 
different? Not sure if this behavior is correct.

$>intel/bin/mpirun -np 2 ./mm-llvm.out

OMP: Info #211: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #211: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #209: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109
OMP: Info #156: KMP_AFFINITY: 56 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 14 cores/pkg x 2 threads/core (28 
total cores)
OMP: Info #249: KMP_AFFINITY: pid 89736 tid 89736 thread 0 bound to OS proc set 
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109
OMP: Info #249: KMP_AFFINITY: pid 89736 tid 89752 thread 1 bound to OS proc set 
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109
OMP: Info #249: KMP_AFFINITY: pid 89736 tid 89753 thread 2 bound to OS proc set 
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109
OMP: Info #249: KMP_AFFINITY: pid 89736 tid 89754 thread 3 bound to OS proc set 
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109
OMP: Info #209: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111
OMP: Info #156: KMP_AFFINITY: 56 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 14 cores/pkg x 2 threads/core (28 
total cores)
OMP: Info #249: KMP_AFFINITY: pid 89737 tid 89737 thread 0 bound to OS proc set 
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111
OMP: Info #249: KMP_AFFINITY: pid 89737 tid 89755 thread 1 bound to OS proc set 
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111
OMP: Info #249: KMP_AFFINITY: pid 89737 tid 89756 thread 2 bound to OS proc set 
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111
OMP: Info #249: KMP_AFFINITY: pid 89737 tid 89757 thread 3 bound to OS proc set 
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111
Time is 0.004174
4.00 4.00 4.00 4.00 
4.00 4.00 4.00 4.00 
4.00 4.00 4.00 4.00 
4.00 4.00 4.00 4.00 
Time is 0.004542
4.00 4.00 4.00 4.00 
4.00 4.00 4.00 4.00 
4.00 4.00 4.00 4.00 
4.00 4.00 4.00 4.00

$>openmpi/bin/mpirun -np 2 ./mm-llvm.out 

OMP: Info #211: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #211: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #209: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0,56
OMP: Info #156: KMP_AFFINITY: 2 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #209: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 4,60
OMP: Info #179: KMP_AFFINITY: 1 packages x 1 cores/pkg x 2 threads/core (1 
total cores)
OMP: Info #156: KMP_AFFINITY: 2 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 1 cores/pkg x 2 threads/core (1 
total cores)
OMP: Info #249: KMP_AFFINITY: pid 89768 tid 89768 thread 0 bound to OS proc set 
4,60
OMP: Info #249: KMP_AFFINITY: pid 89767 tid 89767