Re: [OMPI users] prterun: symbol lookup error: /usr/lib/libprrte.so.3: undefined symbol: PMIx_Session_control

2024-08-15 Thread Jeff Squyres (jsquyres) via users
This isn't enough information to provide a definitive answer.  Can you provide 
more information about your setup, how you built and installed Open MPI, ... 
etc.?

In general, the error message is the standard Linux error message when a symbol 
cannot be found at run time.  In particular, mpirun launches a process called 
prterun, and one of prterun's dependencies (libprrte.so.3) is unable to find a 
symbol named `PMIx_Session_control`.  It's likely that it is somehow finding the 
"wrong" libpmix.so at run time (i.e., one that does not have that symbol).
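
A quick diagnostic sketch (the library path comes from the error message; the 
libpmix soname below is an assumption, so adjust it for your system): ask the 
dynamic linker which libpmix the PRRTE library resolves, and whether that 
library actually exports the symbol:

```
# which libpmix does libprrte.so.3 resolve at run time?
ldd /usr/lib/libprrte.so.3 | grep -i pmix

# does that libpmix export PMIx_Session_control?
nm -D /usr/lib/libpmix.so.2 | grep PMIx_Session_control
```

If the symbol is missing, the PMIx library being picked up is likely too old 
for this PRRTE / Open MPI build.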


From: users  on behalf of Kook Jin Noh via 
users 
Sent: Tuesday, August 13, 2024 10:56 PM
To: Open MPI Users 
Cc: Kook Jin Noh 
Subject: [OMPI users] prterun: symbol lookup error: /usr/lib/libprrte.so.3: 
undefined symbol: PMIx_Session_control


[vorlket@server openmpi-ucx]$ mpirun -host server:2,midiserver:2 -np 4 
/home/vorlket/sharedfolder/mpi-prime

prterun: symbol lookup error: /usr/lib/libprrte.so.3: undefined symbol: 
PMIx_Session_control



What’s going on?


Re: [OMPI users] Fwd: Unable to run basic mpirun command (OpenMPI v5.0.3)

2024-05-05 Thread Jeff Squyres (jsquyres) via users
Note that, depending on your environment, you might need to set these env 
variables on every node where you're running the Open MPI job.  For example: 
https://docs.open-mpi.org/en/v5.0.x/launching-apps/quickstart.html#launching-in-a-non-scheduled-environments-via-ssh
 and 
https://docs.open-mpi.org/en/v5.0.x/launching-apps/ssh.html#finding-open-mpi-executables-and-libraries.
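
For example, a minimal sketch of what that can look like for ssh-launched jobs 
(reusing the install prefix that appears later in this thread) is to put the 
exports in a shell startup file that every node reads, e.g. ~/.bashrc:

```
export PATH=/usr/local/openmpi-5.0.3/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi-5.0.3/lib:$LD_LIBRARY_PATH
export OPAL_PREFIX=/usr/local/openmpi-5.0.3
```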

From: T Brouns 
Sent: Sunday, May 5, 2024 4:37 PM
To: users@lists.open-mpi.org 
Cc: Jeff Squyres (jsquyres) ; hear...@gmail.com 

Subject: Re: [OMPI users] Fwd: Unable to run basic mpirun command (OpenMPI 
v5.0.3)

Hi all,

I solved the problem by doing:

```
INSTALL_DIR=/usr/local/openmpi-5.0.3
export PATH=$INSTALL_DIR/bin:$PATH
export LD_LIBRARY_PATH=$INSTALL_DIR/lib:$LD_LIBRARY_PATH
export OPAL_PREFIX=$INSTALL_DIR
```

That OPAL_PREFIX line was the tricky one.

After doing that, these mpirun commands are now working correctly:

```
mpirun --version
mpirun uptime
```

Thanks for pointing me in the right direction!


@John Hearns,
I'm not setting up a Modules environment, but this sounds like a great solution 
to the problem. I might need to look into that! Thanks.


Best,
Terence

On Sat, 4 May 2024 at 17:22, Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:
You might want to see if your OS has Open MPI installed into default binary / 
library search paths; you might be able to uninstall it easily.

Otherwise, even if you explicitly run the mpirun​ you just built+installed, it 
might find the libmpi.so​ from some other copy of Open MPI.

Alternatively, you could prefix your LD_LIBRARY_PATH environment variable 
with the libdir from the Open MPI installation you just created.

From: T Brouns mailto:t.s.n.bro...@gmail.com>>
Sent: Saturday, May 4, 2024 10:56 AM
To: Jeff Squyres (jsquyres) mailto:jsquy...@cisco.com>>; 
users@lists.open-mpi.org<mailto:users@lists.open-mpi.org> 
mailto:users@lists.open-mpi.org>>
Subject: Re: [OMPI users] Fwd: Unable to run basic mpirun command (OpenMPI 
v5.0.3)

Hi Jeff,

I think you're onto something with the multiple copies.

For this reason, I also tried to run:

```
/usr/local/openmpi-5.0.3/bin/mpirun --version
```

To make sure I'm running the correct copy, but this one crashes with the same 
error.

As a next step, I can try to install OpenMPI on a different system to narrow 
down the problem. Or run it in a Docker container.

And thanks for the pointer on the
`mpirun hello_c.c`. This command made no sense.

Best,
Terence


On Sat, 4 May 2024, 14:30 Jeff Squyres (jsquyres), 
mailto:jsquy...@cisco.com>> wrote:
My apologies – I must have somehow been looking at the wrong config.log file.

I see there's an extra -​ in the script on the help page; I'll get that fixed.


Thanks for the tarball; that's easier to get everything.  Looking in there, it 
looks like you built with a prefix of /usr/local/openmpi-5.0.3, but your 
original email referred to looking for a help file in 
/usr/share/openmpi/help-mpirun.txt -- this seems to be a disparity.

You might want to check that you don't have multiple copies of Open MPI 
installed, and you're not running an unexpected copy somewhere – not the one 
you just built.

Also, your first mail mentioned "mpirun hello_c.c" – you don't want to do that. 
 mpirun is used for launching applications.  hello_c.c is the source code – you 
need to compile it first.  In the examples directory, you can make​, or you can 
manually build it via mpicc hello_c.c -o hello_c​.
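
A short sketch of the intended workflow (run from the examples directory of the 
source tree; the -np value is arbitrary):

```
cd examples
mpicc hello_c.c -o hello_c
mpirun -np 2 ./hello_c
```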

____
From: T Brouns mailto:t.s.n.bro...@gmail.com>>
Sent: Saturday, May 4, 2024 2:00 AM
To: Jeff Squyres (jsquyres) mailto:jsquy...@cisco.com>>
Subject: Re: [OMPI users] Fwd: Unable to run basic mpirun command (OpenMPI 
v5.0.3)

Hi Jeff,

Thanks for the response.


"Your config.log file shows that you are trying to build Open MPI 2.1.6 and 
that configure failed."

Where are you seeing version 2.1.6 exactly? Version 5.0.3 is mentioned many 
times in the config.log file. Whereas if I do a recursive search for "2.1.6", 
it doesn't come up in any of the log files.

Also, the configure didn't give any error message. It successfully completed 
with: configure: exit 0

And I never installed version 2.1.6.

Are you sure you are looking at the right file?


"Can you provide all the information from 
https://docs.open-mpi.org/en/v5.0.x/getting-help.html?  (e.g., tar all the 
files up in a single file – makes it easier to download and examine everything)"

Here's the TAR file:

https://drive.google.com/file/d/19cr7Y4gyCEP0Aa2isTnASItOe9wmfTSK/view?usp=sharing

When I used the first script provided on that webpage, I got the following 
error:

```
+ tar -x -C /home/jupyter/openmpi-5.0.3/ompi-output -
++ find . -name config.log
+ tar -cf ./3rd-party/libevent-2.

Re: [OMPI users] Fwd: Unable to run basic mpirun command (OpenMPI v5.0.3)

2024-05-04 Thread Jeff Squyres (jsquyres) via users
You might want to see if your OS has Open MPI installed into default binary / 
library search paths; you might be able to uninstall it easily.

Otherwise, even if you explicitly run the mpirun​ you just built+installed, it 
might find the libmpi.so​ from some other copy of Open MPI.

Alternatively, you could prefix your LD_LIBRARY_PATH environment variable 
with the libdir from the Open MPI installation you just created.

From: T Brouns 
Sent: Saturday, May 4, 2024 10:56 AM
To: Jeff Squyres (jsquyres) ; users@lists.open-mpi.org 

Subject: Re: [OMPI users] Fwd: Unable to run basic mpirun command (OpenMPI 
v5.0.3)

Hi Jeff,

I think you're onto something with the multiple copies.

For this reason, I also tried to run:

```
/usr/local/openmpi-5.0.3/bin/mpirun --version
```

To make sure I'm running the correct copy, but this one crashes with the same 
error.

As a next step, I can try to install OpenMPI on a different system to narrow 
down the problem. Or run it in a Docker container.

And thanks for the pointer on the
`mpirun hello_c.c`. This command made no sense.

Best,
Terence


On Sat, 4 May 2024, 14:30 Jeff Squyres (jsquyres), 
mailto:jsquy...@cisco.com>> wrote:
My apologies – I must have somehow been looking at the wrong config.log file.

I see there's an extra -​ in the script on the help page; I'll get that fixed.


Thanks for the tarball; that's easier to get everything.  Looking in there, it 
looks like you built with a prefix of /usr/local/openmpi-5.0.3, but your 
original email referred to looking for a help file in 
/usr/share/openmpi/help-mpirun.txt -- this seems to be a disparity.

You might want to check that you don't have multiple copies of Open MPI 
installed, and you're not running an unexpected copy somewhere – not the one 
you just built.

Also, your first mail mentioned "mpirun hello_c.c" – you don't want to do that. 
 mpirun is used for launching applications.  hello_c.c is the source code – you 
need to compile it first.  In the examples directory, you can make​, or you can 
manually build it via mpicc hello_c.c -o hello_c​.


From: T Brouns mailto:t.s.n.bro...@gmail.com>>
Sent: Saturday, May 4, 2024 2:00 AM
To: Jeff Squyres (jsquyres) mailto:jsquy...@cisco.com>>
Subject: Re: [OMPI users] Fwd: Unable to run basic mpirun command (OpenMPI 
v5.0.3)

Hi Jeff,

Thanks for the response.


"Your config.log file shows that you are trying to build Open MPI 2.1.6 and 
that configure failed."

Where are you seeing version 2.1.6 exactly? Version 5.0.3 is mentioned many 
times in the config.log file. Whereas if I do a recursive search for "2.1.6", 
it doesn't come up in any of the log files.

Also, the configure didn't give any error message. It successfully completed 
with: configure: exit 0

And I never installed version 2.1.6.

Are you sure you are looking at the right file?


"Can you provide all the information from 
https://docs.open-mpi.org/en/v5.0.x/getting-help.html?  (e.g., tar all the 
files up in a single file – makes it easier to download and examine everything)"

Here's the TAR file:

https://drive.google.com/file/d/19cr7Y4gyCEP0Aa2isTnASItOe9wmfTSK/view?usp=sharing

When I used the first script provided on that webpage, I got the following 
error:

```
+ tar -x -C /home/jupyter/openmpi-5.0.3/ompi-output -
++ find . -name config.log
+ tar -cf ./3rd-party/libevent-2.1.12-stable/config.log 
./3rd-party/openpmix/config.log ./3rd-party/romio341/mpl/config.log 
./3rd-party/romio341/config.log ./3rd-party/prrte/config.log ./config.log
tar: This does not look like a tar archive
tar: -: Not found in archive
tar: Exiting with failure status due to previous errors
```

This is why I didn't generate the TAR file in the first place. I fixed the 
script now.
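
For reference, a sketch of what the collection step presumably intends (my 
reconstruction, not the actual text of the fixed script): stream the tar 
archive through a pipe instead of passing the file list as the archive name:

```
mkdir -p ompi-output
tar -cf - $(find . -name config.log) | tar -x -C ompi-output -
```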


Best,
Terence



On Fri, 3 May 2024 at 23:43, Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:
Your config.log file shows that you are trying to build Open MPI 2.1.6 and that 
configure failed.

I'm not sure how to square this with the information that you provided in your 
message... did you upload the wrong config.log?

Can you provide all the information from 
https://docs.open-mpi.org/en/v5.0.x/getting-help.html?  (e.g., tar all the 
files up in a single file – makes it easier to download and examine everything)

From: users 
mailto:users-boun...@lists.open-mpi.org>> on 
behalf of T Brouns via users 
mailto:users@lists.open-mpi.org>>
Sent: Friday, May 3, 2024 4:04 PM
To: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org> 
mailto:users@lists.open-mpi.org>>
Cc: T Brouns mailto:t.s.n.bro...@gmail.com>>
Subject: [OMPI users] Fwd: Unable to run basic mpirun command (OpenMPI v5.0.3)


Hello,

I'm experiencing issues running simple `mpirun` commands, after insta

Re: [OMPI users] Fwd: Unable to run basic mpirun command (OpenMPI v5.0.3)

2024-05-03 Thread Jeff Squyres (jsquyres) via users
Your config.log file shows that you are trying to build Open MPI 2.1.6 and that 
configure failed.

I'm not sure how to square this with the information that you provided in your 
message... did you upload the wrong config.log?

Can you provide all the information from 
https://docs.open-mpi.org/en/v5.0.x/getting-help.html?  (e.g., tar all the 
files up in a single file – makes it easier to download and examine everything)

From: users  on behalf of T Brouns via users 

Sent: Friday, May 3, 2024 4:04 PM
To: users@lists.open-mpi.org 
Cc: T Brouns 
Subject: [OMPI users] Fwd: Unable to run basic mpirun command (OpenMPI v5.0.3)


Hello,

I'm experiencing issues running simple `mpirun` commands, after installing 
OpenMPI v5.0.3.

When I run any command with `mpirun`, for example:

```
mpirun --help
mpirun --version
mpirun uptime
mpirun hello_c.c
```

I end up with the following error (in every case):

```
--
Sorry!  You were supposed to get help about:
prterun-exec-failed
from the file:
/usr/share/openmpi/help-mpirun.txt: No such file or directory
But I couldn't find that topic in the file.  Sorry!
--
```

I've installed OpenMPI using these steps:
https://docs.open-mpi.org/en/v5.0.x/installing-open-mpi/quickstart.html

When I install an older version of OpenMPI (such as v4.0.5), I end up with the 
following error instead, when running `mpirun`:

```
--
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_shmem_base_select failed
  --> Returned value -1 instead of OPAL_SUCCESS
--
```

You can find all the log files over here:
https://drive.google.com/drive/folders/163N5Xx5UJZ7fKU172VZSGF2nPY6z0tJF?usp=sharing


Love to get some help on this. Thanks.

Best,
Terence



Re: [OMPI users] [EXTERNAL] Help deciphering error message

2024-03-08 Thread Jeff Squyres (jsquyres) via users
(sorry this is so long – it's a bunch of explanations followed by 2 suggestions 
at the bottom)

One additional thing worth mentioning is that your mpirun command line does not 
seem to explicitly be asking for the "ucx" PML component, but the error message 
you're getting indicates that you specifically asked for the "ucx" PML.  Here's 
your command line, line-broken and re-ordered for ease of reading:


/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun \
    -np 1 \
    -map-by ppr:1:node \
    --allow-run-as-root \
    --mca btl '^openib' \
    --mca btl_openib_warn_default_gid_prefix 0 \
    --mca btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 \
    --mca plm_base_verbose 0 \
    --mca plm rsh \
    /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d /raid/bcm/mdtest

A few things of note on your parameters:


  * With the "btl" parameter, you're specifically telling Open MPI to skip using
    the openib component.  But then you pass in 2 btl_openib_* parameters anyway
    (which will just be ignored, because you told Open MPI to not use openib).
    This is harmless, but worth mentioning.
  * You explicitly set plm_base_verbose to 0, but 0 is the default value.  Again,
    this is harmless (i.e., it's unnecessary because you're setting it to the
    same as the default value), but I thought I'd point it out.
  * You're explicitly setting the plm value (Program Launch Module – i.e., how
    Open MPI launches remote executables), but you're not specifying any remote
    hosts.  In this local-only case, Open MPI will effectively just fork/exec the
    process locally.  So specifying the plm isn't needed.  Again, harmless, but I
    thought I'd point it out.
  * We always advise against --allow-run-as-root.  If you have a strong need for
    it, ok – that's what it's there for, after all – but it definitely isn't
    recommended.

I suspect you have some environment variables and/or a config file that is 
telling Open MPI to set the pml​ to ucx​ (perhaps from your environment 
modules?).  Look in your environment for OMPI_mca_pml=ucx​, or something 
similar.
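
A quick sketch of how to check for that (plain shell; ~/.openmpi/mca-params.conf 
is Open MPI's per-user parameter file):

```
# MCA settings coming from the environment
env | grep -i OMPI_MCA

# MCA settings coming from a per-user config file, if present
cat ~/.openmpi/mca-params.conf 2>/dev/null
```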

That being said, the command line always trumps environment variables and 
config files in Open MPI.  So what Howard said – mpirun --mca pml '^ucx' ...​ – 
will effectively override any env variable or config file specifications 
telling Open MPI to use the UCX PML.

And all that​ being said, the full error message says that the UCX PML may not 
have been able to be loaded.  That might mean that the UCX PML isn't present 
(i.e., that plugin literally isn't present in the filesystem), but it may also 
mean that the plugin was present and Open MPI tried to load it, and failed.  
This typically means that shared library dependencies of that plugin weren't 
able to be loaded by the linker, so the linker gave up and simply told Open MPI 
"sorry, I can't dynamically open that plugin."  Open MPI basically just passed 
on the error to you.

To figure out which is the case, you might want to run with mpirun --mca 
mca_component_show_load_errors 1 ...​  This will tell Open MPI to display 
errors when it tries to load a plugin, but fails (e.g., due to the linker not 
being able to find dependent libraries).  This is probably what I would do 
first – you might find that the dgx-14 node either is missing some libraries, 
or your LD_LIBRARY_PATH is not set correctly to find dependent libraries, or 
somesuch.
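
For concreteness, a trimmed-down sketch of that first diagnostic run (keeping 
only the essential pieces of the command line from above):

```
mpirun --mca mca_component_show_load_errors 1 \
    -np 1 -map-by ppr:1:node \
    /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d /raid/bcm/mdtest
```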

Hope that helps!



From: users  on behalf of Pritchard Jr., 
Howard via users 
Sent: Thursday, March 7, 2024 3:01 PM
To: Open MPI Users 
Cc: Pritchard Jr., Howard 
Subject: Re: [OMPI users] [EXTERNAL] Help deciphering error message


Hello Jeffrey,



A couple of things to try first.



Try running without UCX.  Add --mca pml ^ucx to the mpirun command line.  If 
the app functions without ucx, then the next thing is to see what may be going 
wrong with UCX and the Open MPI components that use it.



You may want to set the UCX_LOG_LEVEL environment variable to see if Open MPI’s 
UCX PML component is actually able to initialize UCX and start trying to use it.



See https://openucx.readthedocs.io/en/master/faq.html  for an example to do 
this using mpirun and the type of output you should be getting.
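
A sketch of that, along the lines of the UCX FAQ (the application name and the 
log level here are placeholders):

```
mpirun -np 2 -x UCX_LOG_LEVEL=info ./your_app
```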



Another simple thing to try is

mpirun -np 1 ucx_info -v

and see if you get something like this back on stdout:

 Library version: 1.14.0
# Library path: /usr/lib64/libucs.so.0
# API headers version: 1.14.0
# Git branch '', revision f8877c5
# Configured with: --build=aarch64-redhat-linux-gnu --host=aarch64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt -

Re: [OMPI users] Seg error when using v5.0.1

2024-01-31 Thread Jeff Squyres (jsquyres) via users
No worries – glad you figured it out!

From: users  on behalf of afernandez via 
users 
Sent: Wednesday, January 31, 2024 10:56 AM
To: Open MPI Users 
Cc: afernandez 
Subject: Re: [OMPI users] Seg error when using v5.0.1

Hello,
I'm sorry as I totally messed up here. It turns out that the problem was caused 
because there's a previous installation of OpenMPI (v4.1.6) and it was trying 
to run the codes compiled against v5 with the mpirun from v4. I always set up 
the systems so that the OS picks up the latest MPI version, but it apparently 
didn't become effective this time, prompting me to the wrong conclusion. I 
should have realized this earlier and not wasted everyone's time. My 
apologies.
Arturo

Gilles Gouaillardet via users wrote:


Hi,

please open an issue on GitHub at https://github.com/open-mpi/ompi/issues
and provide the requested information.

If the compilation failed when configured with --enable-debug, please share the 
logs.

the name of the WRF subroutine suggests the crash might occur in 
MPI_Comm_split(),
if so, are you able to craft a reproducer that causes the crash?

How many nodes and MPI tasks are needed in order to evidence the crash?
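
For illustration, a minimal sketch of what such a reproducer could look like 
(hypothetical; it only exercises MPI_Comm_split and does not mirror WRF's 
actual communicator setup):

```
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, newrank;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* split the world into two halves, then report the new rank */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &newcomm);
    MPI_Comm_rank(newcomm, &newrank);
    printf("world rank %d/%d -> split rank %d\n", rank, size, newrank);

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}
```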


Cheers,

Gilles

On Wed, Jan 31, 2024 at 10:09 PM afernandez via users 
mailto:users@lists.open-mpi.org>> wrote:
Hello Joseph,
Sorry for the delay but I didn't know if I was missing something yesterday 
evening and wanted to double check everything this morning. This is for WRF but 
other apps exhibit the same behavior.
* I had no problem with the serial version (and gdb obviously didn't report any 
issue).
* I tried compiling with the --enable-debug flag but it was generating errors 
during the compilation and never completed.
* I went back to my standard flags for debugging: -g -fbacktrace -ggdb 
-fcheck=bounds,do,mem,pointer -ffpe-trap=invalid,zero,overflow. WRF is still 
crashing with little extra info vs yesterday:
Backtrace for this error:
#0  0x7f5a4e54451f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1  0x7f5a4e5a73fe in __GI___libc_free
at ./malloc/malloc.c:3368
#2  0x7f5a4c7aa5c3 in ???
#3  0x7f5a4e83b048 in ???
#4  0x7f5a4e7d3ef1 in ???
#5  0x7f5a4e8dab7b in ???
#6  0x8f6bbf in __module_dm_MOD_split_communicator
at /home/ubuntu/WRF-4.5.2/frame/module_dm.f90:5734
#7  0x1879ebd in init_modules_
at /home/ubuntu/WRF-4.5.2/share/init_modules.f90:63
#8  0x406fe4 in __module_wrf_top_MOD_wrf_init
at ../main/module_wrf_top.f90:130
#9  0x405ff3 in wrf
at /home/ubuntu/WRF-4.5.2/main/wrf.f90:22
#10  0x40605c in main
at /home/ubuntu/WRF-4.5.2/main/wrf.f90:6
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-31-163 exited 
on signal 11 (Segmentation fault).
--
Any pointers on what might be going on here as this never happened with OMPIv4. 
Thanks.



Joseph Schuchart via users wrote:


Hello,

This looks like memory corruption. Do you have more details on what your app is 
doing? I don't see any MPI calls inside the call stack. Could you rebuild Open 
MPI with debug information enabled (by adding `--enable-debug` to configure)? 
If this error occurs on singleton runs (1 process) then you can easily attach 
gdb to it to get a better stack trace. Also, valgrind may help pin down the 
problem by telling you which memory block is being free'd here.
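
A sketch of those steps (the prefix, executable name, and valgrind flags are 
illustrative):

```
# rebuild Open MPI with debug information
./configure --enable-debug --prefix=$HOME/ompi-debug && make -j install

# for a singleton (1-process) run, attach a debugger directly...
gdb --args ./wrf.exe

# ...or let valgrind report which block is being freed incorrectly
mpirun -np 1 valgrind --track-origins=yes ./wrf.exe
```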

Thanks
Joseph

On 1/30/24 07:41, afernandez via users wrote:

Hello,
I upgraded one of the systems to v5.0.1 and have compiled everything exactly 
as dozens of previous times with v4. I wasn't expecting any issue (and the 
compilations didn't report anything out of ordinary) but running several apps 
has resulted in error messages such as:
Backtrace for this error:
#0  0x7f7c9571f51f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1  0x7f7c957823fe in __GI___libc_free
at ./malloc/malloc.c:3368
#2  0x7f7c93a635c3 in ???
#3  0x7f7c95f84048 in ???
#4  0x7f7c95f1cef1 in ???
#5  0x7f7c95e34b7b in ???
#6  0x6e05be in ???
#7  0x6e58d7 in ???
#8  0x405d2c in ???
#9  0x7f7c95706d8f in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#10  0x7f7c95706e3f in __libc_start_main_impl
at ../csu/libc-start.c:392
#11  0x405d64 in ???
#12  0x in ???
OS is Ubuntu 22.04, OpenMPI was built with GCC13.2, and before building 
OpenMPI, I had prev

Re: [OMPI users] MPI Wireshark Packet Dissector

2023-12-11 Thread Jeff Squyres (jsquyres) via users
Cool!

I dimly remember this project; it was written independently of the main Open 
MPI project.

It looks like it supports the TCP OOB and TCP BTL.

The TCP OOB has since moved from Open MPI's "ORTE" sub-project to the 
independent PRRTE project.  Regardless, TCP OOB traffic is effectively about 
the control plane -- it's management messages, setup, teardown, stdout/err 
redirection, ... etc.  Depending on the goals of the Open MPI wireshark plugin, 
it may or may not be worth it to dissect that traffic.

The TCP BTL is the actual MPI messages that are sent across TCP (assuming you 
don't have some kind of HPC-class networking stack, that likely uses OS-bypass 
and probably doesn't use TCP).

Keep in mind that neither of these plugins have ever formally published a wire 
protocol, and are therefore subject to change at any time.  That's a 
not-insignificant risk for having an mpi-dissector plugin.

From: users  on behalf of Belanger, Martin 
via users 
Sent: Monday, December 11, 2023 10:55 AM
To: users@lists.open-mpi.org 
Cc: Belanger, Martin ; jul...@rilli.eu 

Subject: [OMPI users] MPI Wireshark Packet Dissector


I’m new to MPI and I needed to analyze MPI packets with Wireshark. I found 
Julian Rilli’s “mpi-dissector” project on GitHub 
(https://github.com/juhulian/mpi-dissector). The project is about 9 years old 
and does not compile with the latest Wireshark code. Fortunately, I was able to 
port it, build it, and make it work.



It is not clear which version of the MPI protocol this project supports. 
Suffice it to say that since the code is 9 years old, it probably does not 
support all of the MPI protocol changes/additions made in the last 9 years.



I wanted to share this with the Open-MPI community in case someone is 
interested and would like to update the code to support the latest version of 
MPI (I don’t know enough about MPI to do this work myself). Eventually, I will 
submit this to the Wireshark project so that it can be part of Wireshark going 
forward.



For anyone interested, the ported code can be found in my fork of the Wireshark 
repo: 
https://gitlab.com/martin-belanger/wireshark/-/tree/mpi-support-v1?ref_type=heads.
 It can be cloned as follows:



git clone -b mpi-support-v1 https://gitlab.com/martin-belanger/wireshark.git



Regards,

Martin Belanger








Re: [OMPI users] OpenMPI 5.0.0 & Intel OneAPI 2023.2.0 on MacOS 14.0:

2023-11-06 Thread Jeff Squyres (jsquyres) via users
We develop and build with clang on macOS frequently; it would be surprising if 
it didn't work.

That being said, I was able to replicate both errors reported here.  On macOS 
Sonoma with XCode 15.x and the OneAPI compilers:

  * configure fails in the PMIx libevent section, complaining about how it
    can't find a suitable libevent
      * Filed github issue https://github.com/open-mpi/ompi/issues/12051 to track
  * build fails complaining that it can't find 
      * Filed github issue https://github.com/open-mpi/ompi/issues/12052 to track

Thanks for reporting these issues!

From: users  on behalf of Matt Thompson via 
users 
Sent: Monday, November 6, 2023 1:38 PM
To: Open MPI Users 
Cc: Matt Thompson ; Christophe Peyret 

Subject: Re: [OMPI users] OpenMPI 5.0.0 & Intel OneAPI 2023.2.0 on MacOS 14.0:

I have built Open MPI 5 (well, 5.0.0rc12) with Intel oneAPI under Rosetta2 with:

 $ lt_cv_ld_force_load=no ../configure --disable-wrapper-rpath 
--disable-wrapper-runpath \
CC=clang CXX=clang++ FC=ifort \
--with-hwloc=internal --with-libevent=internal --with-pmix=internal

I'm fairly sure the two wrapper flags are not needed, I just have them for 
historical reasons (long ago I needed them and until they cause an issue, I 
just keep all my flags around).

Maybe it works for me because I'm using clang instead of icc? I can "get away" 
with that because the code I work on is nearly all Fortran so the C compiler is 
not as important to us. And all the libraries we care about seem happy with 
mixed ifort-clang as well.

If you don't have a driving need for icc, maybe this will let things work?

On Mon, Nov 6, 2023 at 8:55 AM Volker Blum via users 
mailto:users@lists.open-mpi.org>> wrote:
I don’t have a solution to this but am interested in finding one.

There is an issue with some include statements between OneAPI and XCode on 
MacOS 14.x , at least for C++ (the example below seems to be C?). It appears 
that many standard headers are not being found.

I did not encounter this problem with OpenMPI, though, since I got stuck at an 
earlier point. My workaround, OpenMPI 4.1.6, compiled fine.

While compiling a different C++ code, these missing headers struck me, too.

Many of the include related error messages went away after installing XCode 
15.1 beta 2 - however, not all of them. That’s as far as I got … sorry about 
the experience.

Best wishes
Volker


Volker Blum
Vinik Associate Professor, Duke MEMS & Chemistry
https://aims.pratt.duke.edu
https://bsky.app/profile/aimsduke.bsky.social

> On Nov 6, 2023, at 4:25 AM, Christophe Peyret via users 
> mailto:users@lists.open-mpi.org>> wrote:
>
> Hello,
>
> I am tring to compile openmpi 5.0.0 on MacOS 14.1 with Intel oneapi Version 
> 2021.9.0 Build 20230302_00.
>
> I enter commande :
>
> lt_cv_ld_force_load=no  ../openmpi-5.0.0/configure 
> --prefix=$APP_DIR/openmpi-5.0.0 F77=ifort FC=ifort CC=icc CXX=icpc  
> --with-pmix=internal  --with-libevent=internal --with-hwloc=internal
>
> Then
>
> make
>
> And compilation stops with error message :
>
> /Users/christophe/Developer/openmpi-5.0.0/3rd-party/openpmix/src/util/pmix_path.c(55):
>  catastrophic error: cannot open source file 
> "/Users/christophe/Developer/openmpi-5.0.0/3rd-party/openpmix/src/util/pmix_path.c"
>  #include 
> ^
>
> compilation aborted for 
> /Users/christophe/Developer/openmpi-5.0.0/3rd-party/openpmix/src/util/pmix_path.c
>  (code 4)
> make[4]: *** [pmix_path.lo] Error 1
> make[3]: *** [all-recursive] Error 1
> make[2]: *** [all-recursive] Error 1
> make[1]: *** [all-recursive] Error 1
> make: *** [all-recursive] Error 1
>



--
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna 
Rampton


[OMPI users] Open MPI BOF at SC'23

2023-11-06 Thread Jeff Squyres (jsquyres) via users
We're excited to see everyone next week in Denver, Colorado, USA at SC23!

Open MPI will be hosting our usual State of the Union Birds of a Feather (BOF) 
session on Wednesday, November 15, 2023, from 12:15-1:15pm US Mountain time.  
The event is in-person only; SC does not allow us to livestream.

During the BOF, we'll present the state of Open MPI, where we are, and where 
we're going.  We also use the BOF as an opportunity to directly respond to your 
questions.  We only have an hour; it's really helpful if you submit your 
questions ahead of time so that we can include them directly in our 
presentation.  We'll obviously take questions in-person, too, and will be 
available after the presentation as well, but chances are: if you have a 
question, others have the same question.  So submit your question to us so that 
we can include them in the presentation!  🙂

Hope to see you in Denver!

--
Jeff Squyres


Re: [OMPI users] OpenMPI 5.0.0 & Intel OneAPI 2023.2.0 on MacOS 14.0:

2023-10-30 Thread Jeff Squyres (jsquyres) via users
Volker --

If that doesn't work, send all the information requested here: 
https://docs.open-mpi.org/en/v5.0.x/getting-help.html

From: users  on behalf of Volker Blum via 
users 
Sent: Saturday, October 28, 2023 8:47 PM
To: Matt Thompson 
Cc: Volker Blum ; Open MPI Users 

Subject: Re: [OMPI users] OpenMPI 5.0.0 & Intel OneAPI 2023.2.0 on MacOS 14.0:

Thank you very much, Matt! That sounds like it. Will try when I next get to 
work on this. (and I would really like to make 5.0.0 work …)

(I lost a few extra hours afterwards, with the OneAPI based OpenMPI 4.1.6 
mpic++ not being able to find various C++ headers - and I think this is due to 
Apple XCode. mpif90 and mpicc work and the situation for mpic++ did improve 
with XCode 15.1 beta 2, which is why I am suspecting this is an XCode related 
problem. However, I now need to find more time …)

Thanks again & best wishes
Volker

Volker Blum
Associate Professor, Duke MEMS & Chemistry
https://aims.pratt.duke.edu
https://bsky.app/profile/aimsduke.bsky.social

> On Oct 28, 2023, at 1:32 PM, Matt Thompson  wrote:
>
> On my Mac I build Open MPI 5 with (among other flags):
>
> --with-hwloc=internal --with-libevent=internal --with-pmix=internal
>
> In my case, I should have had libevent through brew, but it didn't seem to 
> see it. But then I figured I might as well let Open MPI build its own for 
> convenience.
>
> Matt
>
> On Fri, Oct 27, 2023 at 7:51 PM Volker Blum via users 
>  wrote:
> OpenMPI 5.0.0 & Intel OneAPI 2023.2.0 on MacOS 14.0:
>
> In an ostensibly clean system, the following configure on MacOS ends without 
> a viable pmix build:
>
> configure: WARNING: Either libevent or libev support is required, but neither
> configure: WARNING: was found. Please use the configure options to point us
> configure: WARNING: to where we can find one or the other library
> configure: error: Cannot continue
> configure: = done with 3rd-party/openpmix configure =
> checking for pmix pkg-config name... pmix
> checking if pmix pkg-config module exists... yes
> checking for pmix pkg-config cflags... 
> -I/usr/local/Cellar/open-mpi/4.1.5/include
> checking for pmix pkg-config ldflags... -L/usr/local/Cellar/open-mpi/4.1.5/lib
> checking for pmix pkg-config static ldflags... 
> -L/usr/local/Cellar/open-mpi/4.1.5/lib
> checking for pmix pkg-config libs... -lpmix -lz
> checking for pmix pkg-config static libs... -lpmix -lz
> checking for pmix.h... no
> configure: error: Could not find viable pmix build.
>
> configure command used was:
>
> lt_cv_ld_force_load=no ./configure --prefix=/usr/local/openmpi/5.0.0 FC=ifort 
> F77=ifort CC=icc CXX=icpc
>
> ***
>
> The same command works (up to the end of the configure stage) with OpenMPI 
> 4.1.6.
>
> My guess is that this is related to some earlier pmix related issues that can 
> be found by google but wanted to report.
>
> Thank you!
> Best wishes
> Volker
>
>
> Volker Blum
> Associate Professor, Duke MEMS & Chemistry
> https://aims.pratt.duke.edu
> https://bsky.app/profile/aimsduke.bsky.social
>
>
>
>
>
> --
> Matt Thompson
>“The fact is, this is about us identifying what we do best and
>finding more ways of doing less of it better” -- Director of Better Anna 
> Rampton



Re: [OMPI users] MPI4Py Only Using Rank 0

2023-10-25 Thread Jeff Squyres (jsquyres) via users
(let's keep users@lists.open-mpi.org in the CC list so that others can reply, 
too)

I don't know exactly how conda installs / re-installs mpi4py -- e.g., I don't 
know which MPI implementation it compiles and links against.

You can check to see which MPI implementation mpiexec uses -- for Open MPI, you 
should be able to run "mpiexec --version" and it should have "Open MPI" 
somewhere in the output.  I suspect MPICH's mpiexec will show something else 
(at a bare minimum, it won't have "Open MPI" in the output).

After that, you might look to see if the conda package documentation describes 
which MPI implementation it uses, and/or if it has instructions about choosing 
which one it uses.  I'm afraid I don't have much more detail here; we don't 
have much control on how the downstream packages bundle up Open MPI and/or use 
it.
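
Two quick checks, as a sketch (mpi4py.get_config() prints the MPI compiler 
configuration that mpi4py was built with):

```
mpiexec --version    # "Open MPI" should appear somewhere for an Open MPI install
python -c "import mpi4py; print(mpi4py.get_config())"
```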

From: caitlin lamirez 
Sent: Wednesday, October 25, 2023 1:17 PM
To: Jeff Squyres (jsquyres) 
Subject: Re: [OMPI users] MPI4Py Only Using Rank 0

Hi Jeff,

After getting that error, I did reinstall MPI4py using conda remove mpi4py and 
conda install mpi4py. However, I am still getting the same error. If I did 
happen to accidentally switch to a different MPI implementation, what should I 
do to fix this?

Thank you,
Caitlin

On Wednesday, October 25, 2023 at 12:05:15 PM CDT, Jeff Squyres (jsquyres) 
 wrote:


This usually​ means that you have accidentally switched to using a different 
MPI implementation under the covers somehow.  E.g., did you somehow 
accidentally start using mpiexec from MPICH instead of Open MPI?  Or did MPI4Py 
somehow get upgraded or otherwise re-build itself for MPICH, but you're still 
using the mpiexec from Open MPI?  Stuff like that.

From: users  on behalf of caitlin lamirez via 
users 
Sent: Wednesday, October 25, 2023 11:57 AM
To: users@lists.open-mpi.org 
Cc: caitlin lamirez 
Subject: [OMPI users] MPI4Py Only Using Rank 0

Hello,

I am having a problem with MPI4Py (version: mpiexec (OpenRTE) 4.1.5). I have 
been using it for months without a problem, however, out of nowhere, I am 
getting a bug where my program is only using Rank 0.

For example, when I run the command:

mpiexec -n 5 python -m mpi4py.bench helloworld

I get the following output:

Hello, World! I am process 0 of 1 on Caitlins-MacBook-Air.local.
Hello, World! I am process 0 of 1 on Caitlins-MacBook-Air.local.
Hello, World! I am process 0 of 1 on Caitlins-MacBook-Air.local.
Hello, World! I am process 0 of 1 on Caitlins-MacBook-Air.local.
Hello, World! I am process 0 of 1 on Caitlins-MacBook-Air.local.

I have not changed anything with my system. I was wondering if I can get some 
help on this issue?

Thank you,
Caitlin Lamirez


Re: [OMPI users] MPI4Py Only Using Rank 0

2023-10-25 Thread Jeff Squyres (jsquyres) via users
This usually​ means that you have accidentally switched to using a different 
MPI implementation under the covers somehow.  E.g., did you somehow 
accidentally start using mpiexec from MPICH instead of Open MPI?  Or did MPI4Py 
somehow get upgraded or otherwise re-build itself for MPICH, but you're still 
using the mpiexec from Open MPI?  Stuff like that.

From: users  on behalf of caitlin lamirez via 
users 
Sent: Wednesday, October 25, 2023 11:57 AM
To: users@lists.open-mpi.org 
Cc: caitlin lamirez 
Subject: [OMPI users] MPI4Py Only Using Rank 0

Hello,

I am having a problem with MPI4Py (version: mpiexec (OpenRTE) 4.1.5). I have 
been using it for months without a problem, however, out of nowhere, I am 
getting a bug where my program is only using Rank 0.

For example, when I run the command:

mpiexec -n 5 python -m mpi4py.bench helloworld

I get the following output:

Hello, World! I am process 0 of 1 on Caitlins-MacBook-Air.local.
Hello, World! I am process 0 of 1 on Caitlins-MacBook-Air.local.
Hello, World! I am process 0 of 1 on Caitlins-MacBook-Air.local.
Hello, World! I am process 0 of 1 on Caitlins-MacBook-Air.local.
Hello, World! I am process 0 of 1 on Caitlins-MacBook-Air.local.

I have not changed anything with my system. I was wondering if I can get some 
help on this issue?

Thank you,
Caitlin Lamirez


Re: [OMPI users] Binding to thread 0

2023-09-08 Thread Jeff Squyres (jsquyres) via users
In addition to what Gilles mentioned, I'm curious: is there a reason you have 
hardware threads enabled?  You could disable them in the BIOS, and then each of 
your MPI processes can use the full core, not just a single hardware thread.

From: users  on behalf of Luis Cebamanos via 
users 
Sent: Friday, September 8, 2023 7:10 AM
To: Ralph Castain via users 
Cc: Luis Cebamanos 
Subject: [OMPI users] Binding to thread 0

Hello,

Up to now, I have been using numerous ways of binding with wrappers (numactl, 
taskset) whenever I wanted to play with core placing. Another way I have been 
using is via -rankfile, however I notice that some ranks jump from thread 0 to 
thread 1 on SMT chips. I can control this with numactl for instance, but it 
would be great to see similar behaviour when using -rankfile. Is there a way to 
pack all ranks to one of the threads of each core (preferably to thread 0) so I 
can nicely see all ranks with htop on either left or right of the screen?

The command I am using is pretty simple:

mpirun -np $MPIRANKS --rankfile ./myrankfile

and ./myrankfile looks like

rank 33=argon slot=33
rank 34=argon slot=34
rank 35=argon slot=35
rank 36=argon slot=36

Thanks!


Re: [OMPI users] Segmentation fault

2023-08-09 Thread Jeff Squyres (jsquyres) via users
Without knowing anything about SU2, we can't really help debug the issue.  The 
seg fault stack trace that you provided was quite deep; we don't really have 
the resources to go learn about how a complex application like SU2 is 
implemented -- sorry!

Can you or they provide a small, simple MPI application that replicates the 
issue?  That would be something we could dig into and investigate.

From: Aziz Ogutlu 
Sent: Wednesday, August 9, 2023 10:31 AM
To: Jeff Squyres (jsquyres) ; Open MPI Users 

Subject: Re: [OMPI users] Segmentation fault


Hi Jeff,


I'm also using the latest SU2 version, and I opened an issue on its GitHub 
page. They say it could be about OpenMPI :)


On 8/9/23 17:28, Jeff Squyres (jsquyres) wrote:
Ok, thanks for upgrading.  Are you also using the latest version of SU2?

Without knowing what that application is doing, it's a little hard to debug the 
issue from our side.  At first glance, it looks like it is crashing when it has 
completed writing a file and is attempting to close it.  But the pointer that 
Open MPI got to close the file looks like it is bogus (i.e., 0x30 instead of a 
real pointer value).

You might need to raise the issue with the SU2 community and ask if there are 
any known issues with the application, or your particular use case of that 
application.

From: Aziz Ogutlu 
<mailto:aziz.ogu...@eduline.com.tr>
Sent: Wednesday, August 9, 2023 10:08 AM
To: Jeff Squyres (jsquyres) <mailto:jsquy...@cisco.com>; 
Open MPI Users <mailto:users@lists.open-mpi.org>
Subject: Re: [OMPI users] Segmentation fault


Hi Jeff,

I also tried with OpenMPI 4.1.5, I got same error.


On 8/9/23 17:05, Jeff Squyres (jsquyres) wrote:
I'm afraid I don't know anything about the SU2 application.

You are using Open MPI v4.0.3, which is fairly old.  Many bug fixes have been 
released since that version.  Can you upgrade to the latest version of Open MPI 
(v4.1.5)?

From: users 
<mailto:users-boun...@lists.open-mpi.org> on 
behalf of Aziz Ogutlu via users 
<mailto:users@lists.open-mpi.org>
Sent: Wednesday, August 9, 2023 3:26 AM
To: Open MPI Users <mailto:users@lists.open-mpi.org>
Cc: Aziz Ogutlu <mailto:aziz.ogu...@eduline.com.tr>
Subject: [OMPI users] Segmentation fault


Hi there all,

We're using SU2 with OpenMPI 4.0.3, gcc 8.5.0 on Redhat 7.9. We compiled all 
component for using on HPC system.

When I use SU2 with QuickStart config file with OpenMPI, it gives error like in 
attached file.
Command is:
mpirun -np 8 --allow-run-as-root SU2_CFD inv_NACA0012.cfg

--
İyi çalışmalar,
Aziz Öğütlü

Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.  
www.eduline.com.tr<http://www.eduline.com.tr>
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza
Kat:6 Ofis No:118 Kağıthane -  İstanbul - Türkiye 34406
Tel : +90 212 324 60 61 Cep: +90 541 350 40 72

--
İyi çalışmalar,
Aziz Öğütlü

Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.  
www.eduline.com.tr<http://www.eduline.com.tr>
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza
Kat:6 Ofis No:118 Kağıthane -  İstanbul - Türkiye 34406
Tel : +90 212 324 60 61 Cep: +90 541 350 40 72

--
İyi çalışmalar,
Aziz Öğütlü

Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.  
www.eduline.com.tr<http://www.eduline.com.tr>
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza
Kat:6 Ofis No:118 Kağıthane -  İstanbul - Türkiye 34406
Tel : +90 212 324 60 61 Cep: +90 541 350 40 72


Re: [OMPI users] Segmentation fault

2023-08-09 Thread Jeff Squyres (jsquyres) via users
Ok, thanks for upgrading.  Are you also using the latest version of SU2?

Without knowing what that application is doing, it's a little hard to debug the 
issue from our side.  At first glance, it looks like it is crashing when it has 
completed writing a file and is attempting to close it.  But the pointer that 
Open MPI got to close the file looks like it is bogus (i.e., 0x30 instead of a 
real pointer value).

You might need to raise the issue with the SU2 community and ask if there are 
any known issues with the application, or your particular use case of that 
application.

From: Aziz Ogutlu 
Sent: Wednesday, August 9, 2023 10:08 AM
To: Jeff Squyres (jsquyres) ; Open MPI Users 

Subject: Re: [OMPI users] Segmentation fault


Hi Jeff,

I also tried with OpenMPI 4.1.5, I got same error.


On 8/9/23 17:05, Jeff Squyres (jsquyres) wrote:
I'm afraid I don't know anything about the SU2 application.

You are using Open MPI v4.0.3, which is fairly old.  Many bug fixes have been 
released since that version.  Can you upgrade to the latest version of Open MPI 
(v4.1.5)?

From: users 
<mailto:users-boun...@lists.open-mpi.org> on 
behalf of Aziz Ogutlu via users 
<mailto:users@lists.open-mpi.org>
Sent: Wednesday, August 9, 2023 3:26 AM
To: Open MPI Users <mailto:users@lists.open-mpi.org>
Cc: Aziz Ogutlu <mailto:aziz.ogu...@eduline.com.tr>
Subject: [OMPI users] Segmentation fault


Hi there all,

We're using SU2 with OpenMPI 4.0.3, gcc 8.5.0 on Redhat 7.9. We compiled all 
component for using on HPC system.

When I use SU2 with QuickStart config file with OpenMPI, it gives error like in 
attached file.
Command is:
mpirun -np 8 --allow-run-as-root SU2_CFD inv_NACA0012.cfg

--
İyi çalışmalar,
Aziz Öğütlü

Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.  
www.eduline.com.tr<http://www.eduline.com.tr>
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza
Kat:6 Ofis No:118 Kağıthane -  İstanbul - Türkiye 34406
Tel : +90 212 324 60 61 Cep: +90 541 350 40 72

--
İyi çalışmalar,
Aziz Öğütlü

Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.  
www.eduline.com.tr<http://www.eduline.com.tr>
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza
Kat:6 Ofis No:118 Kağıthane -  İstanbul - Türkiye 34406
Tel : +90 212 324 60 61 Cep: +90 541 350 40 72


Re: [OMPI users] Segmentation fault

2023-08-09 Thread Jeff Squyres (jsquyres) via users
I'm afraid I don't know anything about the SU2 application.

You are using Open MPI v4.0.3, which is fairly old.  Many bug fixes have been 
released since that version.  Can you upgrade to the latest version of Open MPI 
(v4.1.5)?

From: users  on behalf of Aziz Ogutlu via 
users 
Sent: Wednesday, August 9, 2023 3:26 AM
To: Open MPI Users 
Cc: Aziz Ogutlu 
Subject: [OMPI users] Segmentation fault


Hi there all,

We're using SU2 with OpenMPI 4.0.3, gcc 8.5.0 on Redhat 7.9. We compiled all 
component for using on HPC system.

When I use SU2 with QuickStart config file with OpenMPI, it gives error like in 
attached file.
Command is:
mpirun -np 8 --allow-run-as-root SU2_CFD inv_NACA0012.cfg

--
İyi çalışmalar,
Aziz Öğütlü

Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.  
www.eduline.com.tr
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza
Kat:6 Ofis No:118 Kağıthane -  İstanbul - Türkiye 34406
Tel : +90 212 324 60 61 Cep: +90 541 350 40 72


Re: [OMPI users] [EXT] Re: Error handling

2023-07-19 Thread Jeff Squyres (jsquyres) via users
MPI_Allreduce should work just fine, even with negative numbers.  If you are 
seeing something different, can you provide a small reproducer program that 
shows the problem?  We can dig deeper into if if we can reproduce the problem.

mpirun's exit status can't distinguish between MPI processes who call 
MPI_Finalize and then return a non-zero exit status and those who invoked 
MPI_Abort.  But if you have 1 process that invokes MPI_Abort with an exit 
status <255, it should be reflected in mpirun's exit status.  For example:


$ cat abort.c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int i, rank, size;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == size - 1) {
        int err_code = 79;
        fprintf(stderr, "I am rank %d and am aborting with error code %d\n",
                rank, err_code);
        MPI_Abort(MPI_COMM_WORLD, err_code);
    }

    fprintf(stderr, "I am rank %d and am exiting with 0\n", rank);
    MPI_Finalize();
    return 0;
}


$ mpicc abort.c -o abort

$ mpirun --host mpi004:2,mpi005:2 -np 4 ./abort
I am rank 0 and am exiting with 0
I am rank 1 and am exiting with 0
I am rank 2 and am exiting with 0
I am rank 3 and am aborting with error code 79
--
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 79.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--

$ echo $?
79
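
A side note on why negative codes look "random": the shell only ever sees the 
low 8 bits of an exit status, so for example exit(-8) is reported as 248 and 
exit(-3) as 253.  A minimal sketch (a hypothetical standalone program, not part 
of this thread):

```
#include <stdlib.h>

/* run this and then "echo $?": the shell prints 248, i.e. (-8) & 0xFF */
int main(void)
{
    exit(-8);
}
```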


From: users  on behalf of Alexander Stadik 
via users 
Sent: Wednesday, July 19, 2023 12:45 AM
To: George Bosilca ; Open MPI Users 

Cc: Alexander Stadik 
Subject: Re: [OMPI users] [EXT] Re: Error handling

Hey George,

I said random only because I do not see the method behind it, but exactly like 
this when I do allreduce by MIN and return a negative number I get either 248, 
253, 11 or 6 usually. Meaning that's purely a number from MPI side.

The Problem with MPI_Abort is it shows the correct number in its output in 
Logfile, but it does not communicate its value to other processes, or forward 
its value to exit. So one also always sees these "random" values.

When using positive numbers in range it seems to work, so my question was on 
how it works, and how one can do it? Is there a way to let MPI_Abort 
communicate  the value as exit code?
Why do negative numbers not work, or does one simply have to always use 
positive numbers? Why I would prefer Abort is because it seems safer.

BR Alex



Von: George Bosilca 
Gesendet: Dienstag, 18. Juli 2023 18:47
An: Open MPI Users 
Cc: Alexander Stadik 
Betreff: [EXT] Re: [OMPI users] Error handling

External: Check sender address and use caution opening links or attachments

Alex,

How are your values "random" if you provide correct values ? Even for negative 
values you could use MIN to pick one value and return it. What is the problem 
with `MPI_Abort` ? it does seem to do what you want.

  George.


On Tue, Jul 18, 2023 at 4:38 AM Alexander Stadik via users 
mailto:users@lists.open-mpi.org>> wrote:
Hey everyone,

I am working for longer time now with cuda-aware OpenMPI, and developed longer 
time back a small exceptions handling framework including MPI and CUDA 
exceptions.
Currently I am using MPI_Abort with custom error numbers, to terminate 
everything elegantly, which works well, by just reading the logfile in case of 
a crash.

Now I was wondering how one can handle return / exit codes properly between 
processes, since we would like to filter non-zero exits by return code.

One way is a simple Allreduce (in my case) + exit instead of Abort. But the 
problem seems to be the values are always "random" (since I was using negative 
codes), only by using MPI error codes it seems to work correctly.
But usage of that is limited.

Any suggestions on how to do this / how it can work properly?

BR Alex




Re: [OMPI users] libnuma.so error

2023-07-19 Thread Jeff Squyres (jsquyres) via users
It's not clear if that message is being emitted by Open MPI.

It does say it's falling back to a different behavior if libnuma.so is not 
found, so it appears it's treating it as a warning, not an error.

From: users  on behalf of Luis Cebamanos via 
users 
Sent: Wednesday, July 19, 2023 10:09 AM
To: users@lists.open-mpi.org 
Cc: Luis Cebamanos 
Subject: [OMPI users] libnuma.so error

Hello,

I was wondering if anyone has ever seen the following runtime error:

mpirun -np 32 ./hello
.
[LOG_CAT_SBGP] libnuma.so: cannot open shared object file: No such file
or directory
[LOG_CAT_SBGP] Failed to dlopen libnuma.so. Fallback to GROUP_BY_SOCKET
manual.
.

The funny thing is that the binary is executed despite the errors.
What could be causing it?

Regards,
Lusi


Re: [OMPI users] Error build Open MPI 4.1.5 with GCC 11.3

2023-07-18 Thread Jeff Squyres (jsquyres) via users
The GNU-generated Makefile dependencies may not be removed during "make clean" 
-- they may only be removed during "make distclean" (which is kinda equivalent 
to rm -rf'ing the tree and extracting a fresh tarball).
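
A sketch of the corresponding reset before switching compilers (configure 
options elided):

```
make distclean        # also removes the generated Makefile dependency files
./configure ...
make -j all && make install
```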

From: Jeffrey Layton 
Sent: Tuesday, July 18, 2023 12:51 PM
To: Jeff Squyres (jsquyres) 
Cc: Open MPI Users 
Subject: Re: [OMPI users] Error build Open MPI 4.1.5 with GCC 11.3

As soon as you pointed out /usr/lib/gcc/x86_64-linux-gnu/9/include/float.h  
that made me think of the previous build.

I did "make clean" a _bunch_ of times before running configure and it didn't 
cure it. Strange.

But, nuking the source tree from orbit, just to be sure, and then 
configure/rebuild worked just great!

Thanks!

Jeff


On Tue, Jul 18, 2023 at 12:29 PM Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:
There were probably quite a few differences from the output of "configure" 
between GCC 9.4 and GCC 11.3.

For example, your original post cited 
"/usr/lib/gcc/x86_64-linux-gnu/9/include/float.h", which, I assume, does not 
exist on your new GCC 11.3-based system.

Meaning: if you had run make clean and then re-ran configure, it probably would 
have built ok.  But deleting the whole source tree and re-configuring + 
re-building also works.  🙂

From: Jeffrey Layton mailto:layto...@gmail.com>>
Sent: Tuesday, July 18, 2023 11:38 AM
To: Jeff Squyres (jsquyres) mailto:jsquy...@cisco.com>>
Cc: Open MPI Users mailto:users@lists.open-mpi.org>>
Subject: Re: [OMPI users] Error build Open MPI 4.1.5 with GCC 11.3

Jeff,

Thanks for the tip - it started me thinking a bit.

I was using a directory in my /home account with 4.1.5 that I had previously 
built using GCC 9.4 (Ubuntu 20.04). I rebuilt the system with Ubuntu-22.04 but 
I did a backup of /home. Then I copied the 4.1.5 directory to /home again.

I checked and I did a "make clean" before attempting to build 4.1.5 but with 
GCC 11.3 that came with Ubuntu 22.04. In fact, I did it several times before I 
ran configure.

Even after running "make clean" I got the error I mentioned in my initial post. 
This happened several times.

This morning, I blew away my 4.1.5 directory and downloaded a fresh 4.1.5. 
Configure went fine as did compiling it.

My theory is that some cruft from building 4.1.5 with GCC 9.4 compilers hung 
around, even after "make clean". Using a "fresh" download of 4.1.5 did not 
include this "cruft" so configure and make all proceeds just fine.

I don't know if this is correct and I can't point to any smoking gun though.

Thanks!

Jeff


On Mon, Jul 17, 2023 at 2:53 PM Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:
That's a little odd.  Usually, the specific .h files that are listed as 
dependencies came from somewhere -- usually either part of the GNU Autotools 
dependency analysis.

I'm guessing that /usr/lib/gcc/x86_64-linux-gnu/9/include/float.h doesn't 
actually exist on your system -- but then how did it get into Open MPI's 
makefiles?

Did you run configure on one machine and make on a different machine, perchance?

From: users 
mailto:users-boun...@lists.open-mpi.org>> on 
behalf of Jeffrey Layton via users 
mailto:users@lists.open-mpi.org>>
Sent: Monday, July 17, 2023 2:05 PM
To: Open MPI Users mailto:users@lists.open-mpi.org>>
Cc: Jeffrey Layton mailto:layto...@gmail.com>>
Subject: [OMPI users] Error build Open MPI 4.1.5 with GCC 11.3

Good afternoon,

I'm trying to build Open MPI 4.1.5 using GCC 11.3. However, I get an error that 
I'm not sure how to correct. The error is,

...
  CC   pscatter.lo
  CC   piscatter.lo
  CC   pscatterv.lo
  CC   piscatterv.lo
  CC   psend.lo
  CC   psend_init.lo
  CC   psendrecv.lo
  CC   psendrecv_replace.lo
  CC   pssend_init.lo
  CC   pssend.lo
  CC   pstart.lo
  CC   pstartall.lo
  CC   pstatus_c2f.lo
  CC   pstatus_f2c.lo
  CC   pstatus_set_cancelled.lo
  CC   pstatus_set_elements.lo
  CC   pstatus_set_elements_x.lo
  CC   ptestall.lo
  CC   ptestany.lo
  CC   ptest.lo
  CC   ptest_cancelled.lo
  CC   ptestsome.lo
  CC   ptopo_test.lo
  CC   ptype_c2f.lo
  CC   ptype_commit.lo
  CC   ptype_contiguous.lo
  CC   ptype_create_darray.lo
make[3]: *** No rule to make target 
'/usr/lib/gcc/x86_64-linux-gnu/9/include/float.h', needed by 
'ptype_create_f90_complex.lo'.  Stop.
make[3]: Leaving directory '/home/laytonjb/src/openmpi-4.1.5/ompi/mpi/c/profile'
make[2]: *** [Makefile:2559: all-recursive] Error 1
make[2]: Leaving directory '/home/laytonjb/src/openmpi-4.1.5/ompi/mpi/c'
make[1]: *** [Makefile:3566: all-recursive] Error 1
make[1]: Leaving di

Re: [OMPI users] Error build Open MPI 4.1.5 with GCC 11.3

2023-07-18 Thread Jeff Squyres (jsquyres) via users
There were probably quite a few differences from the output of "configure" 
between GCC 9.4 and GCC 11.3.

For example, your original post cited 
"/usr/lib/gcc/x86_64-linux-gnu/9/include/float.h", which, I assume, does not 
exist on your new GCC 11.3-based system.

Meaning: if you had run make clean and then re-ran configure, it probably would 
have built ok.  But deleting the whole source tree and re-configuring + 
re-building also works.  🙂

From: Jeffrey Layton 
Sent: Tuesday, July 18, 2023 11:38 AM
To: Jeff Squyres (jsquyres) 
Cc: Open MPI Users 
Subject: Re: [OMPI users] Error build Open MPI 4.1.5 with GCC 11.3

Jeff,

Thanks for the tip - it started me thinking a bit.

I was using a directory in my /home account with 4.1.5 that I had previously 
built using GCC 9.4 (Ubuntu 20.04). I rebuilt the system with Ubuntu-22.04 but 
I did a backup of /home. Then I copied the 4.1.5 directory to /home again.

I checked and I did a "make clean" before attempting to build 4.1.5 but with 
GCC 11.3 that came with Ubuntu 22.04. In fact, I did it several times before I 
ran configure.

Even after running "make clean" I got the error I mentioned in my initial post. 
This happened several times.

This morning, I blew away my 4.1.5 directory and downloaded a fresh 4.1.5. 
Configure went fine as did compiling it.

My theory is that some cruft from building 4.1.5 with GCC 9.4 compilers hung 
around, even after "make clean". Using a "fresh" download of 4.1.5 did not 
include this "cruft" so configure and make all proceeds just fine.

I don't know if this is correct and I can't point to any smoking gun though.

Thanks!

Jeff


On Mon, Jul 17, 2023 at 2:53 PM Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:
That's a little odd.  Usually, the specific .h files that are listed as 
dependencies came from somewhere -- usually either part of the GNU Autotools 
dependency analysis.

I'm guessing that /usr/lib/gcc/x86_64-linux-gnu/9/include/float.h doesn't 
actually exist on your system -- but then how did it get into Open MPI's 
makefiles?

Did you run configure on one machine and make on a different machine, perchance?

From: users 
mailto:users-boun...@lists.open-mpi.org>> on 
behalf of Jeffrey Layton via users 
mailto:users@lists.open-mpi.org>>
Sent: Monday, July 17, 2023 2:05 PM
To: Open MPI Users mailto:users@lists.open-mpi.org>>
Cc: Jeffrey Layton mailto:layto...@gmail.com>>
Subject: [OMPI users] Error build Open MPI 4.1.5 with GCC 11.3

Good afternoon,

I'm trying to build Open MPI 4.1.5 using GCC 11.3. However, I get an error that 
I'm not sure how to correct. The error is,

...
  CC   pscatter.lo
  CC   piscatter.lo
  CC   pscatterv.lo
  CC   piscatterv.lo
  CC   psend.lo
  CC   psend_init.lo
  CC   psendrecv.lo
  CC   psendrecv_replace.lo
  CC   pssend_init.lo
  CC   pssend.lo
  CC   pstart.lo
  CC   pstartall.lo
  CC   pstatus_c2f.lo
  CC   pstatus_f2c.lo
  CC   pstatus_set_cancelled.lo
  CC   pstatus_set_elements.lo
  CC   pstatus_set_elements_x.lo
  CC   ptestall.lo
  CC   ptestany.lo
  CC   ptest.lo
  CC   ptest_cancelled.lo
  CC   ptestsome.lo
  CC   ptopo_test.lo
  CC   ptype_c2f.lo
  CC   ptype_commit.lo
  CC   ptype_contiguous.lo
  CC   ptype_create_darray.lo
make[3]: *** No rule to make target 
'/usr/lib/gcc/x86_64-linux-gnu/9/include/float.h', needed by 
'ptype_create_f90_complex.lo'.  Stop.
make[3]: Leaving directory '/home/laytonjb/src/openmpi-4.1.5/ompi/mpi/c/profile'
make[2]: *** [Makefile:2559: all-recursive] Error 1
make[2]: Leaving directory '/home/laytonjb/src/openmpi-4.1.5/ompi/mpi/c'
make[1]: *** [Makefile:3566: all-recursive] Error 1
make[1]: Leaving directory '/home/laytonjb/src/openmpi-4.1.5/ompi'
make: *** [Makefile:1912: all-recursive] Error 1



Here is the configuration output from configure:

Open MPI configuration:
---
Version: 4.1.5
Build MPI C bindings: yes
Build MPI C++ bindings (deprecated): no
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
MPI Build Java bindings (experimental): no
Build Open SHMEM support: false (no spml)
Debug build: no
Platform file: (none)

Miscellaneous
---
CUDA support: no
HWLOC support: external
Libevent support: internal
Open UCC: no
PMIx support: Internal

Transports
---
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: no
OpenFabrics OFI Libfabric: no
OpenFabrics Verbs: no
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Resource Managers
---

Re: [OMPI users] Error build Open MPI 4.1.5 with GCC 11.3

2023-07-17 Thread Jeff Squyres (jsquyres) via users
That's a little odd.  Usually, the specific .h files that are listed as 
dependencies came from somewhere​ -- usually either part of the GNU Autotools 
dependency analysis.

I'm guessing that /usr/lib/gcc/x86_64-linux-gnu/9/include/float.h doesn't 
actually exist on your system -- but then how did it get into Open MPI's 
makefiles?

Did you run configure on one machine and make on a different machine, perchance?

From: users  on behalf of Jeffrey Layton via 
users 
Sent: Monday, July 17, 2023 2:05 PM
To: Open MPI Users 
Cc: Jeffrey Layton 
Subject: [OMPI users] Error build Open MPI 4.1.5 with GCC 11.3

Good afternoon,

I'm trying to build Open MPI 4.1.5 using GCC 11.3. However, I get an error that 
I'm not sure how to correct. The error is,

...
  CC   pscatter.lo
  CC   piscatter.lo
  CC   pscatterv.lo
  CC   piscatterv.lo
  CC   psend.lo
  CC   psend_init.lo
  CC   psendrecv.lo
  CC   psendrecv_replace.lo
  CC   pssend_init.lo
  CC   pssend.lo
  CC   pstart.lo
  CC   pstartall.lo
  CC   pstatus_c2f.lo
  CC   pstatus_f2c.lo
  CC   pstatus_set_cancelled.lo
  CC   pstatus_set_elements.lo
  CC   pstatus_set_elements_x.lo
  CC   ptestall.lo
  CC   ptestany.lo
  CC   ptest.lo
  CC   ptest_cancelled.lo
  CC   ptestsome.lo
  CC   ptopo_test.lo
  CC   ptype_c2f.lo
  CC   ptype_commit.lo
  CC   ptype_contiguous.lo
  CC   ptype_create_darray.lo
make[3]: *** No rule to make target 
'/usr/lib/gcc/x86_64-linux-gnu/9/include/float.h', needed by 
'ptype_create_f90_complex.lo'.  Stop.
make[3]: Leaving directory '/home/laytonjb/src/openmpi-4.1.5/ompi/mpi/c/profile'
make[2]: *** [Makefile:2559: all-recursive] Error 1
make[2]: Leaving directory '/home/laytonjb/src/openmpi-4.1.5/ompi/mpi/c'
make[1]: *** [Makefile:3566: all-recursive] Error 1
make[1]: Leaving directory '/home/laytonjb/src/openmpi-4.1.5/ompi'
make: *** [Makefile:1912: all-recursive] Error 1



Here is the configuration output from configure:

Open MPI configuration:
---
Version: 4.1.5
Build MPI C bindings: yes
Build MPI C++ bindings (deprecated): no
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
MPI Build Java bindings (experimental): no
Build Open SHMEM support: false (no spml)
Debug build: no
Platform file: (none)

Miscellaneous
---
CUDA support: no
HWLOC support: external
Libevent support: internal
Open UCC: no
PMIx support: Internal

Transports
---
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: no
OpenFabrics OFI Libfabric: no
OpenFabrics Verbs: no
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Resource Managers
---
Cray Alps: no
Grid Engine: no
LSF: no
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no

OMPIO File Systems
---
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no
Lustre: no
PVFS2/OrangeFS: no



Any suggestions! Thanks!

Jeff





Re: [OMPI users] OMPI compilation error in Making all datatypes

2023-07-12 Thread Jeff Squyres (jsquyres) via users
If the file opal/datatype/.libs/libdatatype_reliable.a does not exist after 
running "ar cru .libs/libdatatype_reliable.a .libs/libdataty...etc.", then 
there is something wrong with your system.  Specifically, "ar" is a Linux 
command that makes an archive file; this command is not part of Open MPI.  If 
"ar" isn't working, then ... 🤷‍♂️

What happens if you run the full "ar cru " command manually from within the 
opal/datatype directory?

(you can see the full command if you invoke "make V=1")
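
For example, a rough sequence along these lines would show whether "ar" itself can create the archive (a sketch; take the exact object-file list from whatever "make V=1" actually prints on your system):

```
cd opal/datatype
make V=1

# Re-run the archive and index steps by hand, using the object names
# shown in the libtool output from this thread
ar cru .libs/libdatatype_reliable.a \
    .libs/libdatatype_reliable_la-opal_datatype_pack.o \
    .libs/libdatatype_reliable_la-opal_datatype_unpack.o
ls -l .libs/libdatatype_reliable.a
ranlib .libs/libdatatype_reliable.a
```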

From: George Bosilca 
Sent: Wednesday, July 12, 2023 2:26 PM
To: Open MPI Users 
Cc: Jeff Squyres (jsquyres) ; Elad Cohen 

Subject: Re: [OMPI users] OMPI compilation error in Making all datatypes

I can't replicate this on my setup, but I am not using the tar archive from 
the OMPI website (I use the git tag). Can you do `ls -l opal/datatype/.libs` in 
your build directory?

  George.

On Wed, Jul 12, 2023 at 7:14 AM Elad Cohen via users 
mailto:users@lists.open-mpi.org>> wrote:

Hi Jeff, thanks for replying


opal/datatype/.libs/libdatatype_reliable.a doesn't exist.


I tried building on a networked filesystem, and a local one.


When building in /root, I'm getting more output, but eventually the same error:


make[2]: Entering directory '/root/openmpi-v4.1.5/openmpi-4.1.5/opal/datatype'
  CC   libdatatype_reliable_la-opal_datatype_pack.lo
  CC   libdatatype_reliable_la-opal_datatype_unpack.lo
  CC   opal_convertor_raw.lo
  CC   opal_convertor.lo
  CC   opal_copy_functions.lo
  CC   opal_copy_functions_heterogeneous.lo
  CC   opal_datatype_add.lo
  CC   opal_datatype_clone.lo
  CC   opal_datatype_copy.lo
  CC   opal_datatype_create.lo
  CC   opal_datatype_create_contiguous.lo
  CC   opal_datatype_destroy.lo
  CC   opal_datatype_dump.lo
  CC   opal_datatype_fake_stack.lo
  CC   opal_datatype_get_count.lo
  CC   opal_datatype_module.lo
  CC   opal_datatype_monotonic.lo
  CC   opal_datatype_optimize.lo
  CC   opal_datatype_pack.lo
  CC   opal_datatype_position.lo
  CC   opal_datatype_resize.lo
  CC   opal_datatype_unpack.lo
  CCLD libdatatype_reliable.la
ranlib: '.libs/libdatatype_reliable.a': No such file
make[2]: *** [Makefile:1870: libdatatype_reliable.la] Error 1





From: Jeff Squyres (jsquyres) mailto:jsquy...@cisco.com>>
Sent: Wednesday, July 12, 2023 1:09:35 PM
To: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>
Cc: Elad Cohen
Subject: Re: OMPI compilation error in Making all datatypes

The output you sent (in the attached tarball) doesn't really make much sense:


libtool: link: ar cru .libs/libdatatype_reliable.a 
.libs/libdatatype_reliable_la-opal_datatype_pack.o 
.libs/libdatatype_reliable_la-opal_datatype_unpack.o

libtool: link: ranlib .libs/libdatatype_reliable.a

ranlib: '.libs/libdatatype_reliable.a': No such file

Specifically:

  1.  "ar cru .libs/libdatatype_reliable.a" should have created the file 
.libs/libdatatype_reliable.a.
  2.  "ranlib .libs/libdatatype_reliable.a" then should modify the 
.libs/libdatatype_reliable.a that was just created.

I'm not sure how #2 fails to find the file that was just created in step #1.  
No errors were reported by step #1, so that file should be there.

Can you confirm if the file opal/datatype/.libs/libdatatype_reliable.a exists?
Are you building on a networked filesystem, perchance?  If so, is the time 
synchronized between the machine on which you are building and the file server?
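
As a quick sanity check, something along these lines would answer all three questions (a sketch; "fileserver" is a placeholder for whatever host actually exports the filesystem):

```
# Does the archive exist at all after the failing step?
ls -l opal/datatype/.libs/libdatatype_reliable.a

# Is the build tree on a networked filesystem?
df -hT .

# Compare the clocks of the build host and the file server
date
ssh fileserver date    # "fileserver" is a placeholder, not a real hostname
```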


From: users 
mailto:users-boun...@lists.open-mpi.org>> on 
behalf of Elad Cohen via users 
mailto:users@lists.open-mpi.org>>
Sent: Wednesday, July 12, 2023 4:27 AM
To: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org> 
mailto:users@lists.open-mpi.org>>
Cc: Elad Cohen mailto:el...@volcani.agri.gov.il>>
Subject: [OMPI users] OMPI compilation error in Making all datatypes


Hello,

I'm getting this error in both v4.1.4 and v4.1.5:

Making all in datatype
make[2]: Entering directory 
'/data/bin/openmpi-v4.1.5/openmpi-4.1.5/opal/datatype'
  CCLD libdatatype_reliable.la
ranlib: '.libs/libdatatype_reliable.a': No such file
make[2]: *** [Makefile:1870: libdatatype_reliable.la] Error 1
make[2]: Leaving directory 
'/data/bin/openmpi-v4.1.5/openmpi-4.1.5/opal/datatype'
make[1]: *** [Makefile:2394: all-recursive] Error 1
make[1]: Leaving directory '/data/bin/openmpi-v4.1.5/openmpi-4.1.5/opal'
make: *** [Makefile:1912: all-recursive] Error 1

Thank you





Re: [OMPI users] OMPI compilation error in Making all datatypes

2023-07-12 Thread Jeff Squyres (jsquyres) via users
The output you sent (in the attached tarball) doesn't really make much sense:


libtool: link: ar cru .libs/libdatatype_reliable.a 
.libs/libdatatype_reliable_la-opal_datatype_pack.o 
.libs/libdatatype_reliable_la-opal_datatype_unpack.o

libtool: link: ranlib .libs/libdatatype_reliable.a

ranlib: '.libs/libdatatype_reliable.a': No such file

Specifically:

  1.  "ar cru .libs/libdatatype_reliable.a" should have created the file 
.libs/libdatatype_reliable.a.
  2.  "ranlib .libs/libdatatype_reliable.a" then should modify the 
.libs/libdatatype_reliable.a that was just created.

I'm not sure how #2 fails to find the file that was just created in step #1.  
No errors were reported by step #1, so that file should be there.

Can you confirm if the file opal/datatype/.libs/libdatatype_reliable.a exists?
Are you building on a networked filesystem, perchance?  If so, is the time 
synchronized between the machine on which you are building and the file server?


From: users  on behalf of Elad Cohen via 
users 
Sent: Wednesday, July 12, 2023 4:27 AM
To: users@lists.open-mpi.org 
Cc: Elad Cohen 
Subject: [OMPI users] OMPI compilation error in Making all datatypes


Hello,

I'm getting this error in both v4.1.4 and v4.1.5:

Making all in datatype
make[2]: Entering directory 
'/data/bin/openmpi-v4.1.5/openmpi-4.1.5/opal/datatype'
  CCLD libdatatype_reliable.la
ranlib: '.libs/libdatatype_reliable.a': No such file
make[2]: *** [Makefile:1870: libdatatype_reliable.la] Error 1
make[2]: Leaving directory 
'/data/bin/openmpi-v4.1.5/openmpi-4.1.5/opal/datatype'
make[1]: *** [Makefile:2394: all-recursive] Error 1
make[1]: Leaving directory '/data/bin/openmpi-v4.1.5/openmpi-4.1.5/opal'
make: *** [Makefile:1912: all-recursive] Error 1

Thank you


Re: [OMPI users] Issue with Running MPI Job on CentOS 7

2023-06-14 Thread Jeff Squyres (jsquyres) via users
I have seen the "pipe" error message when MPI applications do not call 
MPI_Finalize() before exiting.  I don't know what your application is doing, 
but it might be worth checking that if you call MPI_Init(), you must call 
MPI_Finalize().

You can also try the sample MPI applications in the "examples" directory.

From: 深空探测 
Sent: Tuesday, June 13, 2023 8:59 PM
To: Open MPI Users 
Cc: John Hearns ; Jeff Squyres (jsquyres) 
; gilles.gouaillar...@gmail.com 
; t...@pasteur.fr 
Subject: Re: [OMPI users] Issue with Running MPI Job on CentOS 7

Hello,

As you mentioned before, I was initially puzzled by the inability to generate 
the libmpicxx.so.12 and libmpi.so.12 files after installing OpenMPI. When I 
attempted to run the command "mpirun -H wude,wude mpispeed 1000 10s 1," I 
received the following error message:

mpispeed: error while loading shared libraries: libmpicxx.so.12: cannot open 
shared object file: No such file or directory

Initially, I had installed mpich-4.0.3, but I have since uninstalled it and 
reinstalled openmpi-1.6.5. However, even with the new installation, I 
encountered the same issue where it still required the libmpicxx.so.12 file to 
be loaded. I suspect that there might have been some remnants from the previous 
installation that were not completely removed.

I conducted a fresh installation of openmpi-1.6.5 on another system, using the 
--enable-mpi-cxx flag added to the ./configure command to enable the C++ 
bindings in Open MPI. After successfully installing openmpi-1.6.5, I ran the 
program again with the command "mpirun -H localhost,localhost mpispeed 1000 10s 
1 | head" and it executed successfully on both nodes. My username is "wude," 
and the displayed results were as follows:

Processor = wude
Rank = 0/2
[0] Starting
Processor = wude
Rank = 1/2
[1] Starting
[0] Sent 0 -> 0
[0] Sent 1 -> 0
[0] Sent 2 -> 0
[0] Sent 3 -> 0
[wude:109888] mpirun: SIGPIPE detected on fd 13 - aborting
mpirun: killing job...

However, I am unsure if the message "mpirun: killing job..." is considered a 
normal occurrence.

In conclusion, the root cause of the issue was the interference between the 
previously installed MPICH and the subsequent installation of OpenMPI, as you 
suggested. It is possible that mpirun and libmpi.so originated from different 
vendors and/or significantly different versions.

I would like to express my sincere appreciation for your patient assistance 
throughout this troubleshooting process. Your guidance has been invaluable in 
helping me understand and resolve the challenges I encountered.

Thank you once again for your support.

Best regards,
De Wu

John Hearns via users 
mailto:users@lists.open-mpi.org>> wrote on Tuesday, June 13, 2023 at 14:13:
You talk about adjusting your PATH and LD_LIBRARY_PATH in your .bashrc. Jeff 
Squyres has given you some guidance on this.
Please investigate the following.
It is common to use Modules in an HPC environment: 
https://www.admin-magazine.com/HPC/Articles/Lmod-Alternative-Environment-Modules

For compiling software packages and creating Modules files investigate these 
frameworks:
https://spack.io/
https://easybuild.io/



On Mon, 12 Jun 2023 at 22:44, Jeff Squyres (jsquyres) via users 
mailto:users@lists.open-mpi.org>> wrote:
Your steps are generally correct, but I cannot speak for whether your 
/home/wude/.bashrc file is executed for both non-interactive and interactive 
logins.  If /home/wude is your $HOME, it probably is, but I don't know about 
your specific system.

Also, you should be aware that MPI applications built with Open MPI v1.6.x will 
not be ABI compatible with Open MPI v4.1.x.  Specifically: you will need to 
re-compile / re-build your "mpispeed" application with Open MPI v4.1.x.

If you are using the MPI C++ bindings in your application:

  1.  I suggest you migrate away from them, because the MPI Forum (i.e., the 
standards body that governs the MPI API) removed the C++ bindings in version 
3.0 of the MPI specification in 2012 -- over a decade ago.
  2.  That being said, the C++ bindings are still available in Open MPI v4.1.x 
-- they're just not built and installed by default (frankly, to discourage 
their use).  You can enable the C++ bindings in Open MPI 4.1.x with by adding 
--enable-mpi-cxx to the ./configure command that you use to build Open MPI.  
You will need to have a C++ compiler present to build and install the C++ 
bindings.

Also note that in Open MPI, the C++ bindings library is named "libmpi_cxx.so", 
not "libmpicxx.so" (I checked both Open MPI v1.6.5 and 4.1.5).  If your MPI 
executable is dependent upon a file named "libmpicxx.so", then, as Gilles 
mentioned earlier in this thread, you might accidentally be mixing the 
libraries between two different implmenetations of MPI (e.g., Open MPI and 
MPICH are two entirely different impleme

Re: [OMPI users] Issue with Running MPI Job on CentOS 7

2023-06-12 Thread Jeff Squyres (jsquyres) via users
Your steps are generally correct, but I cannot speak for whether your 
/home/wude/.bashrc file is executed for both non-interactive and interactive 
logins.  If /home/wude is your $HOME, it probably is, but I don't know about 
your specific system.

Also, you should be aware that MPI applications built with Open MPI v1.6.x will 
not be ABI compatible with Open MPI v4.1.x.  Specifically: you will need to 
re-compile / re-build your "mpispeed" application with Open MPI v4.1.x.

If you are using the MPI C++ bindings in your application:

  1.  I suggest you migrate away from them, because the MPI Forum (i.e., the 
standards body that governs the MPI API) removed the C++ bindings in version 
3.0 of the MPI specification in 2012 -- over a decade ago.
  2.  That being said, the C++ bindings are still available in Open MPI v4.1.x 
-- they're just not built and installed by default (frankly, to discourage 
their use).  You can enable the C++ bindings in Open MPI 4.1.x with by adding 
--enable-mpi-cxx to the ./configure command that you use to build Open MPI.  
You will need to have a C++ compiler present to build and install the C++ 
bindings.

Also note that in Open MPI, the C++ bindings library is named "libmpi_cxx.so", 
not "libmpicxx.so" (I checked both Open MPI v1.6.5 and 4.1.5).  If your MPI 
executable is dependent upon a file named "libmpicxx.so", then, as Gilles 
mentioned earlier in this thread, you might accidentally be mixing the 
libraries between two different implmenetations of MPI (e.g., Open MPI and 
MPICH are two entirely different implementations of the same MPI API.  They are 
written and maintained by different sets of people, and are not binary 
compatible with each other).

If you're just starting out in MPI, I'd strongly suggest ensuring that your 
system has exactly 1 implementation of MPI installed (e.g., Open MPI v4.1.5).  
Ensure that no other versions of Open MPI or MPICH -- or any other MPI 
implementation -- are installed.  That way, you can avoid confusing issues with 
libraries that are similar-but-different, ... etc.
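
For instance, a quick way to check which MPI libraries an executable actually resolves at run time, and to rebuild Open MPI with the C++ bindings, might look like this (a sketch; "mpispeed" is just the binary name used earlier in this thread, and the prefix is only an example):

```
# Which MPI shared libraries does the binary depend on, and where do they resolve?
ldd ./mpispeed | grep -i mpi

# Which mpirun/mpicc are first in the PATH, and what version are they?
which mpirun mpicc
mpirun --version

# Rebuilding Open MPI 4.1.x with the (deprecated) C++ bindings enabled
./configure --prefix=/usr/local/openmpi --enable-mpi-cxx
make
sudo make install
```

If the ldd output shows a dependency on libmpicxx.so.12 that cannot be found, the binary was almost certainly built against a different MPI implementation (Open MPI's C++ library is named libmpi_cxx.so), and it needs to be recompiled against the Open MPI you intend to use.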



From: users  on behalf of 深空探测 via users 

Sent: Sunday, June 11, 2023 11:28 AM
To: Open MPI Users 
Cc: 深空探测 
Subject: Re: [OMPI users] Issue with Running MPI Job on CentOS 7

Subject: Open MPI Installation Issues

Hello,

Despite following your previous suggestions, I am still encountering some 
problems. Below, I have outlined the specific challenges I am facing:

1. Installation with Updated Open MPI Version:
I attempted to install the latest version of Open MPI (v4.1.5) using the 
following steps:
- Downloaded the package from 
https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz.
- Executed the installation commands:
  ./configure --prefix=/usr/local/openmpi
  make
  sudo make install
- Added the environment variables to the "/home/wude/.bashrc" file:
  export PATH=/usr/local/openmpi/bin:$PATH
  export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH
- Ran the command: source /home/wude/.bashrc

Could you please verify if these installation steps are correct?

2. Missing "libmpicxx.so.12" file after Open MPI 1.6.5 installation:
After reinstalling Open MPI 1.6.5, I noticed the existence of the "mpirun" 
executable file in the "/usr/local/openmpi/bin" directory, as well as the 
"libmpi.so" file in the "/usr/lib" directory. However, when I executed the 
command "mpirun -n 2 -H wude,wude mpispeed 1000 10s 1", an error occurred: 
"mpispeed: error while loading shared libraries: libmpicxx.so.12: cannot open 
shared object file: No such file or directory". It seems that the 
"libmpicxx.so.12" file was not generated during the installation process. Could 
you please help me identify the cause of this issue?

3. Missing "libmpicxx.so.12" file in CentOS 7 default Open MPI installation:
In case I install the Open MPI version provided by CentOS 7 using the command 
"sudo yum install openmpi-devel.x86_64", I encountered a similar problem. Even 
after installation, I cannot find the "libmpicxx.so.12" file. It appears that 
the "/usr/lib64/openmpi/lib" directory does not contain any files related to 
"libmpicxx". Could you kindly advise on this matter?

I greatly appreciate your time and assistance in resolving these issues. Thank 
you in advance for your support.

Best regards,
De Wu

Zhéxué M. Krawutschke via users 
mailto:users@lists.open-mpi.org>> wrote on Thursday, June 1, 2023 at 20:32:

Hello everyone,

Regardless of the fact that CentOS 7.x already has EOL status, I would recommend 
that you always build/compile OpenMPI and the other tools yourself, tailored to 
your own needs.

It is true that it is more effort, but in the end it pays off, because the
distributions are sometimes very far behind.

I have already thought about providing something like this, i.e. how to do it or how 
to make the whole process available in an automated way, so to speak.


If someone would like to help me develop this, I would be very

Re: [OMPI users] What is the best choice of pml and btl for intranode communication

2023-03-06 Thread Jeff Squyres (jsquyres) via users
Per George's comments, I stand corrected: UCX does​ work fine in single-node 
cases -- he confirmed to me that he tested it on his laptop, and it worked for 
him.

I think some of the mails in this thread got delivered out of order.  Edgar's 
and George's comments about how/when the UCX PML is selected make my above 
comment moot.  Sorry for any confusion!

From: users  on behalf of Jeff Squyres 
(jsquyres) via users 
Sent: Monday, March 6, 2023 10:40 AM
To: Chandran, Arun ; Open MPI Users 

Cc: Jeff Squyres (jsquyres) 
Subject: Re: [OMPI users] What is the best choice of pml and btl for intranode 
communication

Per George's comments, I stand corrected: UCX does​ work fine in single-node 
cases -- he confirmed to me that he tested it on his laptop, and it worked for 
him.

That being said, you're passing "--mca pml ucx" in the correct place now, and 
you're therefore telling Open MPI "_only_ use the UCX PML".  Hence, if the UCX 
PML can't be used, it's an aborting type of error.  The question is: why​ is 
the UCX PML not usable on your node?  Your output clearly shows that UCX 
chooses to disable itself -- is that because there are no IB / RoCE interfaces 
at all?  (this is an open question to George / the UCX team)
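
One way to see what UCX itself thinks is available on the node is to ask it directly (a sketch; ucx_info ships with UCX, and ibv_devices with the verbs/rdma-core utilities, if they are installed):

```
# Transports and devices UCX can use on this node
ucx_info -d | grep -E 'Transport|Device'

# InfiniBand / RoCE devices visible to the verbs stack (if any)
ibv_devices
```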

From: Chandran, Arun 
Sent: Monday, March 6, 2023 10:31 AM
To: Jeff Squyres (jsquyres) ; Open MPI Users 

Subject: RE: [OMPI users] What is the best choice of pml and btl for intranode 
communication





Hi,



Yes, it is run on a single node; there is no IB or RoCE attached to it.



Pasting the complete output (I might have mistakenly copy-pasted the command in 
the previous mail)



#

perf_benchmark $ mpirun -np 2 --map-by core --bind-to core --mca pml ucx  --mca 
pml_base_verbose 10 --mca mtl_base_verbose 10 -x OMPI_MCA_pml_ucx_verbose=10 -x 
UCX_LOG_LEVEL=info -x UCX_PROTO_ENABLE=y -x UCX_PROTO_INFO=y   ./perf

[1678115882.908665] [lib-ssp-04:759377:0] ucp_context.c:1849 UCX  INFO  
Version 1.13.1 (loaded from 
/home/arun/openmpi_work/ucx-1.13.1/install/lib/libucp.so.0)

[lib-ssp-04:759377] mca: base: components_register: registering framework pml 
components

[lib-ssp-04:759377] mca: base: components_register: found loaded component ucx

[lib-ssp-04:759377] mca: base: components_register: component ucx register 
function successful

[lib-ssp-04:759377] mca: base: components_open: opening pml components

[lib-ssp-04:759377] mca: base: components_open: found loaded component ucx

[lib-ssp-04:759377] common_ucx.c:174 using OPAL memory hooks as external events

[lib-ssp-04:759377] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.13.1

[lib-ssp-04:759377] mca: base: components_open: component ucx open function 
successful

[lib-ssp-04:759377] select: initializing pml component ucx

[lib-ssp-04:759377] common_ucx.c:333 self/memory0: did not match transport list

[lib-ssp-04:759377] common_ucx.c:333 tcp/lo: did not match transport list

[lib-ssp-04:759377] common_ucx.c:333 tcp/enp33s0: did not match transport list

[lib-ssp-04:759377] common_ucx.c:333 sysv/memory: did not match transport list

[lib-ssp-04:759377] common_ucx.c:333 posix/memory: did not match transport list

[lib-ssp-04:759377] common_ucx.c:333 cma/memory: did not match transport list

[lib-ssp-04:759377] common_ucx.c:333 xpmem/memory: did not match transport list

[lib-ssp-04:759377] common_ucx.c:337 support level is none

[lib-ssp-04:759377] select: init returned failure for component ucx

--

No components were able to be opened in the pml framework.



This typically means that either no components of this type were

installed, or none of the installed components can be loaded.

Sometimes this means that shared libraries required by these

components are unable to be found/loaded.



  Host:  lib-ssp-04

  Framework: pml

--

[lib-ssp-04:759377] PML ucx cannot be selected

[lib-ssp-04:759376] mca: base: components_register: registering framework pml 
components

[lib-ssp-04:759376] mca: base: components_register: found loaded component ucx

[lib-ssp-04:759376] mca: base: components_register: component ucx register 
function successful

[lib-ssp-04:759376] mca: base: components_open: opening pml components

[lib-ssp-04:759376] mca: base: components_open: found loaded component ucx

[lib-ssp-04:759376] common_ucx.c:174 using OPAL memory hooks as external events

[lib-ssp-04:759376] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.13.1

[1678115882.913551] [lib-ssp-04:759376:0] ucp_context.c:1849 UCX  INFO  
Version 1.13.1 (loaded from 
/home/arun/openmpi_work/ucx-1.13.1/install/lib/libucp.so.0)

##



So, pml/ucx is disabled by default if no compatible networking equipment is found.



--Arun



From: Jef

Re: [OMPI users] What is the best choice of pml and btl for intranode communication

2023-03-06 Thread Jeff Squyres (jsquyres) via users
Per George's comments, I stand corrected: UCX does​ work fine in single-node 
cases -- he confirmed to me that he tested it on his laptop, and it worked for 
him.

That being said, you're passing "--mca pml ucx" in the correct place now, and 
you're therefore telling Open MPI "_only_ use the UCX PML".  Hence, if the UCX 
PML can't be used, it's an aborting type of error.  The question is: why​ is 
the UCX PML not usable on your node?  Your output clearly shows that UCX 
chooses to disable itself -- is that because there are no IB / RoCE interfaces 
at all?  (this is an open question to George / the UCX team)

From: Chandran, Arun 
Sent: Monday, March 6, 2023 10:31 AM
To: Jeff Squyres (jsquyres) ; Open MPI Users 

Subject: RE: [OMPI users] What is the best choice of pml and btl for intranode 
communication





Hi,



Yes, it is run on a single node; there is no IB or RoCE attached to it.



Pasting the complete output (I might have mistakenly copy-pasted the command in 
the previous mail)



#

perf_benchmark $ mpirun -np 2 --map-by core --bind-to core --mca pml ucx  --mca 
pml_base_verbose 10 --mca mtl_base_verbose 10 -x OMPI_MCA_pml_ucx_verbose=10 -x 
UCX_LOG_LEVEL=info -x UCX_PROTO_ENABLE=y -x UCX_PROTO_INFO=y   ./perf

[1678115882.908665] [lib-ssp-04:759377:0] ucp_context.c:1849 UCX  INFO  
Version 1.13.1 (loaded from 
/home/arun/openmpi_work/ucx-1.13.1/install/lib/libucp.so.0)

[lib-ssp-04:759377] mca: base: components_register: registering framework pml 
components

[lib-ssp-04:759377] mca: base: components_register: found loaded component ucx

[lib-ssp-04:759377] mca: base: components_register: component ucx register 
function successful

[lib-ssp-04:759377] mca: base: components_open: opening pml components

[lib-ssp-04:759377] mca: base: components_open: found loaded component ucx

[lib-ssp-04:759377] common_ucx.c:174 using OPAL memory hooks as external events

[lib-ssp-04:759377] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.13.1

[lib-ssp-04:759377] mca: base: components_open: component ucx open function 
successful

[lib-ssp-04:759377] select: initializing pml component ucx

[lib-ssp-04:759377] common_ucx.c:333 self/memory0: did not match transport list

[lib-ssp-04:759377] common_ucx.c:333 tcp/lo: did not match transport list

[lib-ssp-04:759377] common_ucx.c:333 tcp/enp33s0: did not match transport list

[lib-ssp-04:759377] common_ucx.c:333 sysv/memory: did not match transport list

[lib-ssp-04:759377] common_ucx.c:333 posix/memory: did not match transport list

[lib-ssp-04:759377] common_ucx.c:333 cma/memory: did not match transport list

[lib-ssp-04:759377] common_ucx.c:333 xpmem/memory: did not match transport list

[lib-ssp-04:759377] common_ucx.c:337 support level is none

[lib-ssp-04:759377] select: init returned failure for component ucx

--

No components were able to be opened in the pml framework.



This typically means that either no components of this type were

installed, or none of the installed components can be loaded.

Sometimes this means that shared libraries required by these

components are unable to be found/loaded.



  Host:  lib-ssp-04

  Framework: pml

--

[lib-ssp-04:759377] PML ucx cannot be selected

[lib-ssp-04:759376] mca: base: components_register: registering framework pml 
components

[lib-ssp-04:759376] mca: base: components_register: found loaded component ucx

[lib-ssp-04:759376] mca: base: components_register: component ucx register 
function successful

[lib-ssp-04:759376] mca: base: components_open: opening pml components

[lib-ssp-04:759376] mca: base: components_open: found loaded component ucx

[lib-ssp-04:759376] common_ucx.c:174 using OPAL memory hooks as external events

[lib-ssp-04:759376] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.13.1

[1678115882.913551] [lib-ssp-04:759376:0] ucp_context.c:1849 UCX  INFO  
Version 1.13.1 (loaded from 
/home/arun/openmpi_work/ucx-1.13.1/install/lib/libucp.so.0)

##



So, pml/ucx is disabled by default if no compatible networking equipment is found.



--Arun



From: Jeff Squyres (jsquyres) 
Sent: Monday, March 6, 2023 8:13 PM
To: Open MPI Users 
Cc: Chandran, Arun 
Subject: Re: [OMPI users] What is the best choice of pml and btl for intranode 
communication






If this run was on a single node, then UCX probably disabled itself since it 
wouldn't be using InfiniBand or RoCE to communicate between peers.



Also, I'm not sure your command line was correct:



perf_benchmark $ mpirun -np 32 --map-by core --bind-to core ./perf  --mca pml 
ucx



You probably need to list all of mpirun

Re: [OMPI users] What is the best choice of pml and btl for intranode communication

2023-03-06 Thread Jeff Squyres (jsquyres) via users
If this run was on a single node, then UCX probably disabled itself since it 
wouldn't be using InfiniBand or RoCE to communicate between peers.

Also, I'm not sure your command line was correct:


perf_benchmark $ mpirun -np 32 --map-by core --bind-to core ./perf  --mca pml 
ucx

You probably need to list all of mpirun's CLI options before you list the 
./perf executable.  In its left-to-right traversal, once mpirun hits a CLI 
option it does not recognize (e.g., "./perf"), it assumes that it is the user's 
executable name, and does not process the CLI options to the right of that.

Hence, the output you show must have forced the UCX PML another way -- perhaps 
you set an environment variable or something?
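
In other words, using the same ./perf binary from this thread (a sketch):

```
# Everything after the executable is passed to ./perf, not parsed by mpirun:
mpirun -np 32 --map-by core --bind-to core ./perf --mca pml ucx

# All mpirun options go before the executable so that mpirun actually sees them:
mpirun -np 32 --map-by core --bind-to core --mca pml ucx ./perf
```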


From: users  on behalf of Chandran, Arun via 
users 
Sent: Monday, March 6, 2023 3:33 AM
To: Open MPI Users 
Cc: Chandran, Arun 
Subject: Re: [OMPI users] What is the best choice of pml and btl for intranode 
communication





Hi Gilles,



Thanks very much for the information.



I was looking for the best pml + btl combination for a standalone intra-node 
setup with a high task count (>= 192) and no HPC-class networking installed.



I just now realized that I can't use pml/ucx for such cases, as it is unable to 
find IB and fails.



perf_benchmark $ mpirun -np 32 --map-by core --bind-to core ./perf  --mca pml 
ucx

--

No components were able to be opened in the pml framework.



This typically means that either no components of this type were

installed, or none of the installed components can be loaded.

Sometimes this means that shared libraries required by these

components are unable to be found/loaded.



  Host:  lib-ssp-04

  Framework: pml

--

[lib-ssp-04:753542] PML ucx cannot be selected

[lib-ssp-04:753531] PML ucx cannot be selected

[lib-ssp-04:753541] PML ucx cannot be selected

[lib-ssp-04:753539] PML ucx cannot be selected

[lib-ssp-04:753545] PML ucx cannot be selected

[lib-ssp-04:753547] PML ucx cannot be selected

[lib-ssp-04:753572] PML ucx cannot be selected

[lib-ssp-04:753538] PML ucx cannot be selected

[lib-ssp-04:753530] PML ucx cannot be selected

[lib-ssp-04:753537] PML ucx cannot be selected

[lib-ssp-04:753546] PML ucx cannot be selected

[lib-ssp-04:753544] PML ucx cannot be selected

[lib-ssp-04:753570] PML ucx cannot be selected

[lib-ssp-04:753567] PML ucx cannot be selected

[lib-ssp-04:753534] PML ucx cannot be selected

[lib-ssp-04:753592] PML ucx cannot be selected

[lib-ssp-04:753529] PML ucx cannot be selected





That means my only choice is pml/ob1 + btl/vader.



--Arun



From: users  On Behalf Of Gilles Gouaillardet 
via users
Sent: Monday, March 6, 2023 12:56 PM
To: Open MPI Users 
Cc: Gilles Gouaillardet 
Subject: Re: [OMPI users] What is the best choice of pml and btl for intranode 
communication






Arun,



First Open MPI selects a pml for **all** the MPI tasks (for example, pml/ucx or 
pml/ob1)



Then, if pml/ob1 ends up being selected, a btl component (e.g. btl/uct, 
btl/vader) is used for each pair of MPI tasks

(tasks on the same node will use btl/vader, tasks on different nodes will use 
btl/uct)



Note that if UCX is available, pml/ucx takes the highest priority, so no btl is 
involved

(in your case, it means intra-node communications will be handled by UCX and 
not btl/vader).

You can force ob1 and try different combinations of btl with

mpirun --mca pml ob1 --mca btl self,, ...



I expect pml/ucx is faster than pml/ob1 with btl/uct for inter-node 
communications.



I have not benchmarked Open MPI for a while and it is possible btl/vader 
outperforms pml/ucx for intra-node communications,

so if you run on a small number of Infiniband interconnected nodes with a large 
number of tasks per node, you might be able

to get the best performances by forcing pml/ob1.



Bottom line, I think it is best for you to benchmark your application and pick 
the combination that leads to the best performances,

and you are more than welcome to share your conclusions.
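
As a concrete starting point, a comparison run could look like this (a sketch, reusing the ./perf benchmark from this thread; any intra-node benchmark would do):

```
# Default component selection (pml/ucx if UCX is available and usable)
mpirun -np 32 --map-by core --bind-to core ./perf

# Force pml/ob1 with the shared-memory (vader) and self BTLs
mpirun -np 32 --map-by core --bind-to core --mca pml ob1 --mca btl self,vader ./perf
```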



Cheers,



Gilles





On Mon, Mar 6, 2023 at 3:12 PM Chandran, Arun via users 
mailto:users@lists.open-mpi.org>> wrote:


Hi Folks,

I can run benchmarks and find the pml+btl (ob1, ucx, uct, vader, etc.) 
combination that gives the best performance,
but I wanted to hear from the community about what is generally used in 
"__high_core_count_intra_node_" cases before jumping to conclusions.

As I am a newcomer to openMPI I don't want to end up using a combination only 
because it fared better in a benchmark (overfitting?)

Or the choice of pml+btl for the 'intranode' case is not so important as 
openmpi is mainly used in 'internode' and the 'networking-equipment

Re: [OMPI users] Compile options to disable Infiniband

2022-12-12 Thread Jeff Squyres (jsquyres) via users
You can use:

./configure --enable-mca-no-build=btl-openib,pml-ucx,mtl-psm

That should probably do it in the 3.x and 4.x series.

You can double check after it installs: look in $prefix/lib/openmpi for any 
files with "ucx", "openib", or "psm" in them.  If they're there, remove them 
(those are the IB plugins).  You can further run "ompi_info" and look for 
"ucx", "openib", and/or "psm" in the output (ompi_info shows all 
currently-available plugins -- take from both plugins that were statically 
compiled into Open MPI's libraries and from what's available in 
$prefix/lib/openmpi [the latter is the default]).  If ompi_info doesn't show 
any output with "ucx", "openib", and/or "psm", then your Open MPI does not 
contain any IB support.
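
Concretely, the build and the follow-up checks could look like this (a sketch; the prefix is only an example, so substitute your actual installation prefix):

```
prefix=/opt/openmpi        # example prefix; use your actual --prefix
./configure --prefix=$prefix --enable-mca-no-build=btl-openib,pml-ucx,mtl-psm
make && sudo make install

# No IB-related plugin files should remain in the plugin directory...
ls $prefix/lib/openmpi | grep -E 'ucx|openib|psm'

# ...and no such components should be reported as available
$prefix/bin/ompi_info | grep -E 'ucx|openib|psm'
```

Empty output from both greps means the installation contains no IB support.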

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of Gestió Servidors 
via users 
Sent: Monday, December 12, 2022 10:38 AM
To: users@lists.open-mpi.org 
Cc: Gestió Servidors 
Subject: [OMPI users] Compile options to disable Infiniband


Hi,



I'm getting several errors and problems with an old OpenMPI installation: 
versions 3.1.5, 4.0.2 and 4.1.4, compiled with InfiniBand support 
(--with-verbs). I don't know why, but those versions are now failing with my 
OpenMPI programs, and after running some tests and OMPI recompilations, I would 
like to do a "clean" build with NO InfiniBand support (in other words, I need to 
be 100% sure that OpenMPI will NOT detect any QLogic card/device as "ib0") 
because I only want to use my ethernet card "ethX".



What are the "configure" parameters to disable InfiniBand support?



Thanks a lot!




Re: [OMPI users] mpi program gets stuck

2022-12-07 Thread Jeff Squyres (jsquyres) via users
To tie up this issue for the web mail archives...

There were a bunch more off-list emails exchanged on this thread.  It was 
determined that something is going wrong down in the IB networking stack.  It 
looks like it may be a problem in the environment itself, not Open MPI.  The 
user is continuing to investigate.  If it turns into a problem with Open MPI, 
we'll report back here.

--
Jeff Squyres
jsquy...@cisco.com

From: Jeff Squyres (jsquyres) 
Sent: Wednesday, November 30, 2022 7:42 AM
To: timesir ; Open MPI Users 
Subject: Re: mpi program gets stuck

Ok, this looks like the same type of output running ring_c as your Python MPI 
app -- good.  Using a C MPI program for testing just eliminates some possible 
variables / issues.

Ok, let's try running again, but add some more command line parameters:

mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca 
rmaps_base_verbose 100 --mca ras_base_verbose 100 --prtemca 
grpcomm_base_verbose 5 --prtemca state_base_verbose 5 ./ring_c

And please send the output back here to the list.

--
Jeff Squyres
jsquy...@cisco.com

From: timesir 
Sent: Tuesday, November 29, 2022 9:44 PM
To: Jeff Squyres (jsquyres) 
Subject: Re: mpi program gets stuck


Do you think the information below is enough? If not, I will add more


(py3.9) ➜  /share  cat hosts
192.168.180.48 slots=1
192.168.60.203 slots=1



(py3.9) ➜  examples  mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 
--mca rmaps_base_verbose 100 --mca ras_base_verbose 100  ./ring_c

[computer01:74388] mca: base: component_find: searching NULL for plm components
[computer01:74388] mca: base: find_dyn_components: checking NULL for plm 
components
[computer01:74388] pmix:mca: base: components_register: registering framework 
plm components
[computer01:74388] pmix:mca: base: components_register: found loaded component 
slurm
[computer01:74388] pmix:mca: base: components_register: component slurm 
register function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component 
ssh
[computer01:74388] pmix:mca: base: components_register: component ssh register 
function successful
[computer01:74388] mca: base: components_open: opening plm components
[computer01:74388] mca: base: components_open: found loaded component slurm
[computer01:74388] mca: base: components_open: component slurm open function 
successful
[computer01:74388] mca: base: components_open: found loaded component ssh
[computer01:74388] mca: base: components_open: component ssh open function 
successful
[computer01:74388] mca:base:select: Auto-selecting plm components
[computer01:74388] mca:base:select:(  plm) Querying component [slurm]
[computer01:74388] mca:base:select:(  plm) Querying component [ssh]
[computer01:74388] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
[computer01:74388] mca:base:select:(  plm) Query of component [ssh] set 
priority to 10
[computer01:74388] mca:base:select:(  plm) Selected component [ssh]
[computer01:74388] mca: base: close: component slurm closed
[computer01:74388] mca: base: close: unloading component slurm
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh_setup on agent ssh : 
rsh path NULL
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive start comm
[computer01:74388] mca: base: component_find: searching NULL for ras components
[computer01:74388] mca: base: find_dyn_components: checking NULL for ras 
components
[computer01:74388] pmix:mca: base: components_register: registering framework 
ras components
[computer01:74388] pmix:mca: base: components_register: found loaded component 
simulator
[computer01:74388] pmix:mca: base: components_register: component simulator 
register function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component 
pbs
[computer01:74388] pmix:mca: base: components_register: component pbs register 
function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component 
slurm
[computer01:74388] pmix:mca: base: components_register: component slurm 
register function successful
[computer01:74388] mca: base: components_open: opening ras components
[computer01:74388] mca: base: components_open: found loaded component simulator
[computer01:74388] mca: base: components_open: found loaded component pbs
[computer01:74388] mca: base: components_open: component pbs open function 
successful
[computer01:74388] mca: base: components_open: found loaded component slurm
[computer01:74388] mca: base: components_open: component slurm open function 
successful
[computer01:74388] mca:base:select: Auto-selecting ras components
[computer01:74388] mca:base:select:(  ras) Querying component [simulator]
[computer01:74388] mca:base:select:(  ras) Querying component [pbs]
[computer01:74388] mca:base:select:(  ras) Querying component [slurm]
[computer01:74388] mca:base:select:(  ras) No component selected!
[

Re: [OMPI users] Can't run an MPI program through mpirun command

2022-12-04 Thread Jeff Squyres (jsquyres) via users
Can you try steps 1-3 in 
https://docs.open-mpi.org/en/v5.0.x/validate.html#testing-your-open-mpi-installation
 ?
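
Roughly, those steps amount to the following (a sketch; see the linked page for the authoritative version):

```
# 1. Does ompi_info run and report the version you expect?
ompi_info | head

# 2. Can mpirun launch a non-MPI program?
mpirun -np 2 hostname

# 3. Can mpirun launch a trivial MPI program (from the "examples" directory
#    of the Open MPI source tree)?
cd examples && make && mpirun -np 2 ./hello_c
```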

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of Blaze Kort via 
users 
Sent: Saturday, December 3, 2022 5:52 AM
To: users@lists.open-mpi.org 
Cc: Blaze Kort 
Subject: [OMPI users] Can't run an MPI program through mpirun command

I have an MPI program (C code for a school project) that I want to run on
multiple nodes (this time 2 nodes), but it doesn't work and just waits
indefinitely.
First I tried to run it on both machines with command `mpirun -np 2 --
host 192.168.0.147,192.168.0.116 ./mandelbrot_mpi_omp`,
and now I tried to be more specific with subnet masks and to enable
logging with: `mpirun --mca oob_base_verbose 100 --mca
oob_tcp_if_include 192.168.0.0/24 --mca btl_tcp_if_include
192.168.0.0/24 -np 2 --host 192.168.0.147,192.168.0.116
./mandelbrot_mpi_omp`
The local IP addresses are the addresses of those computers, entered in the
same order on both PCs, so 192.168.0.147 would be rank 0 (the master).
The first command waits without printing any text or error.
The second one produces a log and stops at the step "get transports for
component tcp" on both machines.
log from both machines: https://pastebin.com/bt32ZddX
lsof from both machines: https://pastebin.com/s3HHFWZB
The lsof output is not trimmed; it's everything that had any ports open at the
moment, and nothing else was running. The weird thing is the CLOSE_WAIT state
that the ssh connection has on both sides.

here is my code:
```
// Headers needed by the calls below (assumed; not shown in the original snippet)
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

// SCALE_X, SCALE_Y, PIXEL, send_buffer, recv_buffer, mandelbrot() and
// save_to_png() are defined elsewhere in the original program.
int main(int argc, char* argv[]){
int width = SCALE_X;
int height = SCALE_Y;

// MPI init & setup
MPI_Init(&argc, &argv);

int world_size;
int rank;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

// calculate size of buffer according to server count
int part_height = SCALE_Y/world_size;
int buffer_size = (width+1)*(part_height+1)*3;

// dynamically allocate arrays for image data according to server count
send_buffer = calloc( buffer_size, sizeof(PIXEL));
recv_buffer = calloc( buffer_size*world_size, sizeof(PIXEL));

if(rank == 0) printf("MPI node count: %i\n", world_size);
MPI_Barrier(MPI_COMM_WORLD);

// OpenMP setup
int cpu_count = omp_get_num_procs();
omp_set_num_threads(cpu_count);
printf("OpenMP cpu count on node %i: %i\n", rank, cpu_count);
printf("OpenMP (max) thread count on node %i: %i\n", rank,
omp_get_num_threads());
MPI_Barrier(MPI_COMM_WORLD);

// generate a part of the mandelbrot set according to world size and rank of this server
mandelbrot(rank, world_size, width, part_height);

// gather parts of mandelbrot from all nodes
MPI_Gather(send_buffer, (width)*(part_height)*3, MPI_CHAR,
recv_buffer, (width)*(part_height)*3, MPI_CHAR, 0, MPI_COMM_WORLD);


// save raster array of mandelbrot data to png file
if(rank == 0) save_to_png(width, height);


printf("Process %i finished.\n", rank);

MPI_Finalize();

return 0;
}
```

My OS is Debian 11 and Open MPI (v4.1.0) is installed through the official
Debian repositories (on both machines). Neither iptables nor nftables is
installed on either system, so IP blocking should not be a problem right now.
The machines are connected to one router and they can reach each other - I can
ping them and connect to them with ssh in both directions. I also tried
connecting them directly with an ethernet cable and setting IP addresses
manually, but that didn't work either. Also, they have the same username and
password on both systems.

What am I missing?
I am new to MPI and not very savvy about networking as it is.

Thanks in advance.



Re: [OMPI users] mpi program gets stuck

2022-12-01 Thread Jeff Squyres (jsquyres) via users
Ok, this looks like the same type of output running ring_c as your Python MPI 
app -- good.  Using a C MPI program for testing just eliminates some possible 
variables / issues.

Ok, let's try running again, but add some more command line parameters:

mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca 
rmaps_base_verbose 100 --mca ras_base_verbose 100 --prtemca 
grpcomm_base_verbose 5 --prtemca state_base_verbose 5 ./ring_c

And please send the output back here to the list.

--
Jeff Squyres
jsquy...@cisco.com

From: timesir 
Sent: Tuesday, November 29, 2022 9:44 PM
To: Jeff Squyres (jsquyres) 
Subject: Re: mpi program gets stuck


Do you think the information below is enough? If not, I will add more


(py3.9) ➜  /share  cat hosts
192.168.180.48 slots=1
192.168.60.203 slots=1



(py3.9) ➜  examples  mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 
--mca rmaps_base_verbose 100 --mca ras_base_verbose 100  ./ring_c

[computer01:74388] mca: base: component_find: searching NULL for plm components
[computer01:74388] mca: base: find_dyn_components: checking NULL for plm 
components
[computer01:74388] pmix:mca: base: components_register: registering framework 
plm components
[computer01:74388] pmix:mca: base: components_register: found loaded component 
slurm
[computer01:74388] pmix:mca: base: components_register: component slurm 
register function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component 
ssh
[computer01:74388] pmix:mca: base: components_register: component ssh register 
function successful
[computer01:74388] mca: base: components_open: opening plm components
[computer01:74388] mca: base: components_open: found loaded component slurm
[computer01:74388] mca: base: components_open: component slurm open function 
successful
[computer01:74388] mca: base: components_open: found loaded component ssh
[computer01:74388] mca: base: components_open: component ssh open function 
successful
[computer01:74388] mca:base:select: Auto-selecting plm components
[computer01:74388] mca:base:select:(  plm) Querying component [slurm]
[computer01:74388] mca:base:select:(  plm) Querying component [ssh]
[computer01:74388] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
[computer01:74388] mca:base:select:(  plm) Query of component [ssh] set 
priority to 10
[computer01:74388] mca:base:select:(  plm) Selected component [ssh]
[computer01:74388] mca: base: close: component slurm closed
[computer01:74388] mca: base: close: unloading component slurm
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh_setup on agent ssh : 
rsh path NULL
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive start comm
[computer01:74388] mca: base: component_find: searching NULL for ras components
[computer01:74388] mca: base: find_dyn_components: checking NULL for ras 
components
[computer01:74388] pmix:mca: base: components_register: registering framework 
ras components
[computer01:74388] pmix:mca: base: components_register: found loaded component 
simulator
[computer01:74388] pmix:mca: base: components_register: component simulator 
register function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component 
pbs
[computer01:74388] pmix:mca: base: components_register: component pbs register 
function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component 
slurm
[computer01:74388] pmix:mca: base: components_register: component slurm 
register function successful
[computer01:74388] mca: base: components_open: opening ras components
[computer01:74388] mca: base: components_open: found loaded component simulator
[computer01:74388] mca: base: components_open: found loaded component pbs
[computer01:74388] mca: base: components_open: component pbs open function 
successful
[computer01:74388] mca: base: components_open: found loaded component slurm
[computer01:74388] mca: base: components_open: component slurm open function 
successful
[computer01:74388] mca:base:select: Auto-selecting ras components
[computer01:74388] mca:base:select:(  ras) Querying component [simulator]
[computer01:74388] mca:base:select:(  ras) Querying component [pbs]
[computer01:74388] mca:base:select:(  ras) Querying component [slurm]
[computer01:74388] mca:base:select:(  ras) No component selected!
[computer01:74388] mca: base: component_find: searching NULL for rmaps 
components
[computer01:74388] mca: base: find_dyn_components: checking NULL for rmaps 
components
[computer01:74388] pmix:mca: base: components_register: registering framework 
rmaps components
[computer01:74388] pmix:mca: base: components_register: found loaded component 
ppr
[computer01:74388] pmix:mca: base: components_register: component ppr register 
function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component 
rank_file
[computer01:74388] pmix:mca: base: components_register: comp

Re: [OMPI users] mpi program gets stuck

2022-11-29 Thread Jeff Squyres (jsquyres) via users
(we've conversed a bit off-list; bringing this back to the list with a good 
subject to differentiate it from other digest threads)

I'm glad the tarball I provided (that included the PMIx fix) resolved running 
"uptime" for you.

Can you try running a plain C MPI program instead of a Python MPI program?  
That would just eliminate a few more variables from the troubleshooting process.

In the "examples" directory in the tarball I provided are trivial "hello world" 
and "ring" MPI programs.  A "make" should build them all.  Try running hello_c 
and ring_c.
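
For example, from the top of the extracted tarball (a sketch, reusing the same "hosts" machinefile you used for the uptime test):

```
cd examples
make
mpirun -n 2 --machinefile hosts ./hello_c
mpirun -n 2 --machinefile hosts ./ring_c
```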

--
Jeff Squyres
jsquy...@cisco.com

From: timesir 
Sent: Tuesday, November 29, 2022 10:42 AM
To: Jeff Squyres (jsquyres) ; Open MPI Users 

Subject: mpi program gets stuck


see also: https://pastebin.com/s5tjaUkF

(py3.9) ➜  /share  cat hosts
192.168.180.48 slots=1
192.168.60.203 slots=1

1.  This command now runs correctly using your openmpi-gitclone-pr11096.tar.bz2
(py3.9) ➜  /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 
--mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime


2. But this command gets stuck. It seems to be the mpi program that gets stuck.
test.py:
import mpi4py
from mpi4py import MPI

(py3.9) ➜  /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 
--mca rmaps_base_verbose 100 --mca ras_base_verbose 100 python test.py
[computer01:47982] mca: base: component_find: searching NULL for plm components
[computer01:47982] mca: base: find_dyn_components: checking NULL for plm 
components
[computer01:47982] pmix:mca: base: components_register: registering framework 
plm components
[computer01:47982] pmix:mca: base: components_register: found loaded component 
slurm
[computer01:47982] pmix:mca: base: components_register: component slurm 
register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component 
ssh
[computer01:47982] pmix:mca: base: components_register: component ssh register 
function successful
[computer01:47982] mca: base: components_open: opening plm components
[computer01:47982] mca: base: components_open: found loaded component slurm
[computer01:47982] mca: base: components_open: component slurm open function 
successful
[computer01:47982] mca: base: components_open: found loaded component ssh
[computer01:47982] mca: base: components_open: component ssh open function 
successful
[computer01:47982] mca:base:select: Auto-selecting plm components
[computer01:47982] mca:base:select:(  plm) Querying component [slurm]
[computer01:47982] mca:base:select:(  plm) Querying component [ssh]
[computer01:47982] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
[computer01:47982] mca:base:select:(  plm) Query of component [ssh] set 
priority to 10
[computer01:47982] mca:base:select:(  plm) Selected component [ssh]
[computer01:47982] mca: base: close: component slurm closed
[computer01:47982] mca: base: close: unloading component slurm
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh_setup on agent ssh : 
rsh path NULL
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive start comm
[computer01:47982] mca: base: component_find: searching NULL for ras components
[computer01:47982] mca: base: find_dyn_components: checking NULL for ras 
components
[computer01:47982] pmix:mca: base: components_register: registering framework 
ras components
[computer01:47982] pmix:mca: base: components_register: found loaded component 
simulator
[computer01:47982] pmix:mca: base: components_register: component simulator 
register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component 
pbs
[computer01:47982] pmix:mca: base: components_register: component pbs register 
function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component 
slurm
[computer01:47982] pmix:mca: base: components_register: component slurm 
register function successful
[computer01:47982] mca: base: components_open: opening ras components
[computer01:47982] mca: base: components_open: found loaded component simulator
[computer01:47982] mca: base: components_open: found loaded component pbs
[computer01:47982] mca: base: components_open: component pbs open function 
successful
[computer01:47982] mca: base: components_open: found loaded component slurm
[computer01:47982] mca: base: components_open: component slurm open function 
successful
[computer01:47982] mca:base:select: Auto-selecting ras components
[computer01:47982] mca:base:select:(  ras) Querying component [simulator]
[computer01:47982] mca:base:select:(  ras) Querying component [pbs]
[computer01:47982] mca:base:select:(  ras) Querying component [slurm]
[computer01:47982] mca:base:select:(  ras) No component selected!
[computer01:47982] mca: base: component_find: searching NULL for rmaps 
components
[computer01:47982] mca: base: find_dyn_components: checking 

Re: [OMPI users] CephFS and striping_factor

2022-11-29 Thread Jeff Squyres (jsquyres) via users
More specifically, Gilles created a skeleton "ceph" component in this draft 
pull request: https://github.com/open-mpi/ompi/pull/11122

If anyone has any cycles to work on it and develop it beyond the skeleton that 
is currently there, that would be great!

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of Gilles Gouaillardet 
via users 
Sent: Monday, November 28, 2022 9:48 PM
To: users@lists.open-mpi.org 
Cc: Gilles Gouaillardet 
Subject: Re: [OMPI users] CephFS and striping_factor

Hi Eric,


Currently, Open MPI does not provide specific support for CephFS.

MPI-IO is implemented either by ROMIO (imported from MPICH; it does not
support CephFS today) or by the "native" ompio component (which also does not
support CephFS today).

A proof of concept for CephFS in ompio might not be a huge amount of work for
someone motivated: it could be as simple as (so to speak, since these things
are generally not easy) creating a new fs/ceph component (e.g. in
ompi/mca/fs/ceph) and implementing the "file_open" callback that uses the
ceph API.

I think the fs/lustre component can be used as inspiration.


I cannot commit to doing this myself, but if you are willing to take a crack
at it, I can create such a component so you can go directly to implementing
the callback without spending too much time on Open MPI internals
(e.g. component creation).
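
To make that concrete, here is a very rough sketch of what the core of such a
file_open callback could do with libcephfs, ignoring all of the Open MPI
component glue (the ceph_* calls are from the libcephfs.h referenced in the
quoted message below; the function name, the fixed open flags, and the use of
0/NULL for the other layout fields are illustrative assumptions only):

#include <stdio.h>
#include <fcntl.h>
#include <sys/types.h>
#include <cephfs/libcephfs.h>

/* Create a file on CephFS with an explicit striping factor. */
int ceph_file_open_sketch(const char *path, int stripe_count)
{
    struct ceph_mount_info *cmount = NULL;
    int fd, rc;

    rc = ceph_create(&cmount, NULL);           /* create a mount handle      */
    if (rc < 0) return rc;
    ceph_conf_read_file(cmount, NULL);         /* read the default ceph.conf */
    rc = ceph_mount(cmount, "/");              /* mount the filesystem root  */
    if (rc < 0) { ceph_release(cmount); return rc; }

    /* ceph_open_layout() picks the file layout at creation time; here only
     * stripe_count is forced, and 0/NULL are assumed to select the defaults
     * for stripe_unit, object_size and the data pool. */
    fd = ceph_open_layout(cmount, path, O_CREAT | O_WRONLY, 0644,
                          0, stripe_count, 0, NULL);
    if (fd < 0)
        fprintf(stderr, "ceph_open_layout(%s) failed: %d\n", path, fd);
    else
        ceph_close(cmount, fd);

    ceph_unmount(cmount);
    ceph_release(cmount);
    return (fd < 0) ? fd : 0;
}

int main(int argc, char *argv[])
{
    return ceph_file_open_sketch(argc > 1 ? argv[1] : "/striped-file", 4);
}

A real fs/ceph component would of course map this onto the ompio file handle
and take the striping factor from the MPI_Info hints rather than a hard-coded
argument.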



Cheers,


Gilles


On 11/29/2022 6:55 AM, Eric Chamberland via users wrote:
> Hi,
>
> I would like to know if OpenMPI is supporting file creation with
> "striping_factor" for CephFS?
>
> According to CephFS library, I *think* it would be possible to do it
> at file creation with "ceph_open_layout".
>
> https://github.com/ceph/ceph/blob/main/src/include/cephfs/libcephfs.h
>
> Is it a possible futur enhancement?
>
> Thanks,
>
> Eric
>



Re: [OMPI users] Question about "mca" parameters

2022-11-29 Thread Jeff Squyres (jsquyres) via users
Also, you probably want to add "vader" into your BTL specification.  Although 
the name is counter-intuitive, "vader" in Open MPI v3.x and v4.x is the shared 
memory transport.  Hence, if you run with "btl=tcp,self", you are only allowing 
MPI processes to talk via the TCP stack or process loopback (which, by 
definition, is only for a process to talk to itself) -- even if they are on the 
same node.

Instead, if you run with "btl=tcp,vader,self", then MPI processes can talk via 
TCP, process loopback, or shared memory.  Hence, if two MPI processes are on 
the same node, they can use shared memory to communicate, which is 
significantly​ faster than TCP.

NOTE:​ In the upcoming Open MPI v5.0.x, the name "vader" has (finally) been 
deprecated and replaced with the more intuitive name "sm".  While 
"btl=tcp,vader,self" will work fine in v5.0.x for backwards compatibility with 
v4.x and v3.x, "btl=tcp,sm,self" is preferred for v5.0.x and forward (and "sm" 
is just a more intuitive name than "vader").
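
For example, taking the mpirun command from the original question (a sketch
only; adjust the program name and process count to your setup):

# Open MPI v3.x / v4.x
mpirun --mca btl tcp,vader,self -n 12 ./my_program

# Open MPI v5.0.x and later
mpirun --mca btl tcp,sm,self -n 12 ./my_program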

The problem you were seeing was because the openib BTL component was 
complaining that, as the help message described, the environment was not set 
correctly to allow using the qib0 device correctly.  Hence, it seems like you 
have a secondary / HPC-quality network available (which could be faster / more 
efficient than TCP), but it isn't configured properly in your environment.  You 
might want to investigate the suggestion from the help message to set the 
memlock limits correctly, and see if using the qib0 interfaces would yield 
better performance.
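
For reference, the usual way to raise that limit (assuming a pam_limits-based
setup; your administrator may manage it differently) is an entry like the
following in /etc/security/limits.conf on the compute nodes, after which you
re-login and verify with "ulimit -l":

*  soft  memlock  unlimited
*  hard  memlock  unlimited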

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of Gilles Gouaillardet 
via users 
Sent: Tuesday, November 29, 2022 3:36 AM
To: Gestió Servidors via users 
Cc: Gilles Gouaillardet 
Subject: Re: [OMPI users] Question about "mca" parameters

Hi,


Simply add


btl = tcp,self


If the openib error message persists, try also adding

osc_rdma_btls = ugni,uct,ucp

or simply

osc = ^rdma
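
(Those lines use the MCA parameter file syntax; assuming they are meant for a
per-user file such as $HOME/.openmpi/mca-params.conf or the system-wide
$prefix/etc/openmpi-mca-params.conf, a complete example would be:

# $HOME/.openmpi/mca-params.conf
btl = tcp,self
osc = ^rdma

The same settings can also be passed on the mpirun command line with --mca.)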



Cheers,


Gilles

On 11/29/2022 5:16 PM, Gestió Servidors via users wrote:
>
> Hi,
>
> If I run “mpirun --mca btl tcp,self --mca allow_ib 0 -n 12
> ./my_program”, I get to disable some “extra” info in the output file like:
>
> The OpenFabrics (openib) BTL failed to initialize while trying to
>
> allocate some locked memory.  This typically can indicate that the
>
> memlock limits are set too low.  For most HPC installations, the
>
> memlock limits should be set to "unlimited".  The failure occured
>
> here:
>
> Local host:clus11
>
> OMPI source:   btl_openib.c:757
>
> Function:  opal_free_list_init()
>
> Device:qib0
>
> Memlock limit: 65536
>
> You may need to consult with your system administrator to get this
>
> problem fixed.  This FAQ entry on the Open MPI web site may also be
>
> helpful:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
> --
>
> [clus11][[33029,1],0][btl_openib.c:1062:mca_btl_openib_add_procs]
> could not prepare openib device for use
>
> [clus11][[33029,1],1][btl_openib.c:1062:mca_btl_openib_add_procs]
> could not prepare openib device for use
>
> [clus11][[33029,1],9][btl_openib.c:1062:mca_btl_openib_add_procs]
> could not prepare openib device for use
>
> [clus11][[33029,1],8][btl_openib.c:1062:mca_btl_openib_add_procs]
> could not prepare openib device for use
>
> [clus11][[33029,1],2][btl_openib.c:1062:mca_btl_openib_add_procs]
> could not prepare openib device for use
>
> [clus11][[33029,1],6][btl_openib.c:1062:mca_btl_openib_add_procs]
> could not prepare openib device for use
>
> [clus11][[33029,1],10][btl_openib.c:1062:mca_btl_openib_add_procs]
> could not prepare openib device for use
>
> [clus11][[33029,1],11][btl_openib.c:1062:mca_btl_openib_add_procs]
> could not prepare openib device for use
>
> [clus11][[33029,1],5][btl_openib.c:1062:mca_btl_openib_add_procs]
> could not prepare openib device for use
>
> [clus11][[33029,1],3][btl_openib.c:1062:mca_btl_openib_add_procs]
> could not prepare openib device for use
>
> [clus11][[33029,1],4][btl_openib.c:1062:mca_btl_openib_add_procs]
> could not prepare openib device for use
>
> [clus11][[33029,1],7][btl_openib.c:1062:mca_btl_openib_add_procs]
> could not prepare openib device for use
>
> or like
>
> By default, for Open MPI 4.0 and later, infiniband ports on a device
>
> are not used by default.  The intent is to use UCX for these devices.
>
> You can override this policy by setting the btl_openib_allow_ib MCA
> parameter
>
> to true.
>
> Local host:  clus11
>
> Local adapter:   qib0
>
> Local port:  1
>
> --
>
> --
>
> WARNING: There was an error initializing an OpenFabrics device.
>
> Local host:   clus11
>
> Local device: qib0
>
> 

Re: [OMPI users] users Digest, Vol 4818, Issue 1

2022-11-25 Thread Jeff Squyres (jsquyres) via users
Ok, this is a good / consistent output.  That being said, I don't grok what is 
happening here: it says it finds 2 slots, but then it tells you it doesn't have 
enough slots.

Let me dig deeper and get back to you...

--
Jeff Squyres
jsquy...@cisco.com

From: timesir 
Sent: Friday, November 18, 2022 10:20 AM
To: Jeff Squyres (jsquyres) ; users@lists.open-mpi.org 
; gilles.gouaillar...@gmail.com 

Subject: Re: users Digest, Vol 4818, Issue 1

(py3.9) ➜  /share   ompi_info --version

Open MPI v5.0.0rc9

https://www.open-mpi.org/community/help/


(py3.9) ➜  /share  cat hosts
192.168.180.48 slots=1
192.168.60.203 slots=1


(py3.9) ➜  /share  mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 
--mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime
[computer01:53933] mca: base: component_find: searching NULL for plm components
[computer01:53933] mca: base: find_dyn_components: checking NULL for plm 
components
[computer01:53933] pmix:mca: base: components_register: registering framework 
plm components
[computer01:53933] pmix:mca: base: components_register: found loaded component 
slurm
[computer01:53933] pmix:mca: base: components_register: component slurm 
register function successful
[computer01:53933] pmix:mca: base: components_register: found loaded component 
ssh
[computer01:53933] pmix:mca: base: components_register: component ssh register 
function successful
[computer01:53933] mca: base: components_open: opening plm components
[computer01:53933] mca: base: components_open: found loaded component slurm
[computer01:53933] mca: base: components_open: component slurm open function 
successful
[computer01:53933] mca: base: components_open: found loaded component ssh
[computer01:53933] mca: base: components_open: component ssh open function 
successful
[computer01:53933] mca:base:select: Auto-selecting plm components
[computer01:53933] mca:base:select:(  plm) Querying component [slurm]
[computer01:53933] mca:base:select:(  plm) Querying component [ssh]
[computer01:53933] mca:base:select:(  plm) Query of component [ssh] set 
priority to 10
[computer01:53933] mca:base:select:(  plm) Selected component [ssh]
[computer01:53933] mca: base: close: component slurm closed
[computer01:53933] mca: base: close: unloading component slurm
[computer01:53933] mca: base: component_find: searching NULL for ras components
[computer01:53933] mca: base: find_dyn_components: checking NULL for ras 
components
[computer01:53933] pmix:mca: base: components_register: registering framework 
ras components
[computer01:53933] pmix:mca: base: components_register: found loaded component 
simulator
[computer01:53933] pmix:mca: base: components_register: component simulator 
register function successful
[computer01:53933] pmix:mca: base: components_register: found loaded component 
pbs
[computer01:53933] pmix:mca: base: components_register: component pbs register 
function successful
[computer01:53933] pmix:mca: base: components_register: found loaded component 
slurm
[computer01:53933] pmix:mca: base: components_register: component slurm 
register function successful
[computer01:53933] mca: base: components_open: opening ras components
[computer01:53933] mca: base: components_open: found loaded component simulator
[computer01:53933] mca: base: components_open: found loaded component pbs
[computer01:53933] mca: base: components_open: component pbs open function 
successful
[computer01:53933] mca: base: components_open: found loaded component slurm
[computer01:53933] mca: base: components_open: component slurm open function 
successful
[computer01:53933] mca:base:select: Auto-selecting ras components
[computer01:53933] mca:base:select:(  ras) Querying component [simulator]

[computer01:53933] mca:base:select:(  ras) Querying component [pbs] 
   [71/1815]
[computer01:53933] mca:base:select:(  ras) Querying component [slurm]
[computer01:53933] mca:base:select:(  ras) No component selected!
[computer01:53933] mca: base: component_find: searching NULL for rmaps 
components
[computer01:53933] mca: base: find_dyn_components: checking NULL for rmaps 
components
[computer01:53933] pmix:mca: base: components_register: registering framework 
rmaps components
[computer01:53933] pmix:mca: base: components_register: found loaded component 
ppr
[computer01:53933] pmix:mca: base: components_register: component ppr register 
function successful
[computer01:53933] pmix:mca: base: components_register: found loaded component 
rank_file
[computer01:53933] pmix:mca: base: components_register: component rank_file has 
no register or open function
[computer01:53933] pmix:mca: base: components_register: found loaded component 
round_robin
[computer01:53933] pmix:mca: base: components_register: component round_robin 
register function successful
[computer01:53933] pmix:mca: base: components_register: found loaded component 
seq
[

Re: [OMPI users] users Digest, Vol 4818, Issue 1

2022-11-25 Thread Jeff Squyres (jsquyres) via users
Thanks for the output.

I'm seeing inconsistent output between your different outputs, however.  For 
example, one of your outputs seems to ignore the hostfile and only show slots 
on the local host, but another output shows 2 hosts with 1 slot each.  But I 
don't know what was in the hosts file for that run.

Also, I see a weird "state=UNKNOWN" in the output in the 2nd node.  Not sure 
what that means; we might need to track that down.

Can you send the output from these commands, in a single session (I added 
another MCA verbose parameter in here, too):

ompi_info --version
cat hosts
mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca 
rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime

Make sure to use "dash dash" before the CLI options; ensure that copy-and-paste 
from email doesn't replace the dashes with non-ASCII dashes, such as an "em 
dash", or somesuch.

--
Jeff Squyres
jsquy...@cisco.com

From: timesir 
Sent: Friday, November 18, 2022 8:59 AM
To: Jeff Squyres (jsquyres) ; users@lists.open-mpi.org 
; gilles.gouaillar...@gmail.com 

Subject: Re: users Digest, Vol 4818, Issue 1


The ompi_info -all output for both machines is attached.



On 2022/11/18 21:54, Jeff Squyres (jsquyres) wrote:
I see 2 config.log files -- can you also send the other information requested 
on that page?  I.e, the version you're using (I think​ you said in a prior 
email that it was 5.0rc9, but I'm not 100% sure), and the output from ompi_info 
--all.

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>

From: timesir <mailto:mrlong...@gmail.com>
Sent: Friday, November 18, 2022 8:49 AM
To: Jeff Squyres (jsquyres) <mailto:jsquy...@cisco.com>; 
users@lists.open-mpi.org<mailto:users@lists.open-mpi.org> 
<mailto:users@lists.open-mpi.org>; 
gilles.gouaillar...@gmail.com<mailto:gilles.gouaillar...@gmail.com> 
<mailto:gilles.gouaillar...@gmail.com>
Subject: Re: users Digest, Vol 4818, Issue 1


The information you need is attached.


On 2022/11/18 21:08, Jeff Squyres (jsquyres) wrote:
Yes, Gilles responded within a few hours: 
https://www.mail-archive.com/users@lists.open-mpi.org/msg35057.html

Looking closer, we should still be seeing more output compared to what you 
posted.  It's almost like you have a busted Open MPI installation -- perhaps 
it's missing the "hostfile" component altogether.

How did you install Open MPI?  Can you send the information from "Run time 
problems" on 
https://docs.open-mpi.org/en/v5.0.x/getting-help.html#for-run-time-problems ?

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>

From: timesir <mailto:mrlong...@gmail.com>
Sent: Monday, November 14, 2022 11:32 PM
To: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org> 
<mailto:users@lists.open-mpi.org>; Jeff Squyres 
(jsquyres) <mailto:jsquy...@cisco.com>; 
gilles.gouaillar...@gmail.com<mailto:gilles.gouaillar...@gmail.com> 
<mailto:gilles.gouaillar...@gmail.com>
Subject: Re: users Digest, Vol 4818, Issue 1


(py3.9) ➜  /share   mpirun -n 2 --machinefile hosts --mca rmaps_base_verbose 
100 --mca ras_base_verbose 100  which mpirun
[computer01:39342] mca: base: component_find: searching NULL for ras components
[computer01:39342] mca: base: find_dyn_components: checking NULL for ras 
components
[computer01:39342] pmix:mca: base: components_register: registering framework 
ras components
[computer01:39342] pmix:mca: base: components_register: found loaded component 
simulator
[computer01:39342] pmix:mca: base: components_register: component simulator 
register function successful
[computer01:39342] pmix:mca: base: components_register: found loaded component 
pbs
[computer01:39342] pmix:mca: base: components_register: component pbs register 
function successful
[computer01:39342] pmix:mca: base: components_register: found loaded component 
slurm
[computer01:39342] pmix:mca: base: components_register: component slurm 
register function successful
[computer01:39342] mca: base: components_open: opening ras components
[computer01:39342] mca: base: components_open: found loaded component simulator
[computer01:39342] mca: base: components_open: found loaded component pbs
[computer01:39342] mca: base: components_open: component pbs open function 
successful
[computer01:39342] mca: base: components_open: found loaded component slurm
[computer01:39342] mca: base: components_open: component slurm open function 
successful
[computer01:39342] mca:base:select: Auto-selecting ras components
[computer01:39342] mca:base:select:(  ras) Querying component [simulator]
[computer01:39342] mca:base:select:(  ras) Querying component [pbs]
[computer01:39342] mca:base:select:(  ras) Querying component [slurm]
[computer01:39342] mca:base:select:(  ras) No component selected!

Re: [OMPI users] users Digest, Vol 4818, Issue 1

2022-11-25 Thread Jeff Squyres (jsquyres) via users
Yes, Gilles responded within a few hours: 
https://www.mail-archive.com/users@lists.open-mpi.org/msg35057.html

Looking closer, we should still be seeing more output compared to what you 
posted.  It's almost like you have a busted Open MPI installation -- perhaps 
it's missing the "hostfile" component altogether.

How did you install Open MPI?  Can you send the information from "Run time 
problems" on 
https://docs.open-mpi.org/en/v5.0.x/getting-help.html#for-run-time-problems ?

--
Jeff Squyres
jsquy...@cisco.com

From: timesir 
Sent: Monday, November 14, 2022 11:32 PM
To: users@lists.open-mpi.org ; Jeff Squyres 
(jsquyres) ; gilles.gouaillar...@gmail.com 

Subject: Re: users Digest, Vol 4818, Issue 1


(py3.9) ➜  /share   mpirun -n 2 --machinefile hosts --mca rmaps_base_verbose 
100 --mca ras_base_verbose 100  which mpirun
[computer01:39342] mca: base: component_find: searching NULL for ras components
[computer01:39342] mca: base: find_dyn_components: checking NULL for ras 
components
[computer01:39342] pmix:mca: base: components_register: registering framework 
ras components
[computer01:39342] pmix:mca: base: components_register: found loaded component 
simulator
[computer01:39342] pmix:mca: base: components_register: component simulator 
register function successful
[computer01:39342] pmix:mca: base: components_register: found loaded component 
pbs
[computer01:39342] pmix:mca: base: components_register: component pbs register 
function successful
[computer01:39342] pmix:mca: base: components_register: found loaded component 
slurm
[computer01:39342] pmix:mca: base: components_register: component slurm 
register function successful
[computer01:39342] mca: base: components_open: opening ras components
[computer01:39342] mca: base: components_open: found loaded component simulator
[computer01:39342] mca: base: components_open: found loaded component pbs
[computer01:39342] mca: base: components_open: component pbs open function 
successful
[computer01:39342] mca: base: components_open: found loaded component slurm
[computer01:39342] mca: base: components_open: component slurm open function 
successful
[computer01:39342] mca:base:select: Auto-selecting ras components
[computer01:39342] mca:base:select:(  ras) Querying component [simulator]
[computer01:39342] mca:base:select:(  ras) Querying component [pbs]
[computer01:39342] mca:base:select:(  ras) Querying component [slurm]
[computer01:39342] mca:base:select:(  ras) No component selected!
[computer01:39342] mca: base: component_find: searching NULL for rmaps 
components
[computer01:39342] mca: base: find_dyn_components: checking NULL for rmaps 
components
[computer01:39342] pmix:mca: base: components_register: registering framework 
rmaps components
[computer01:39342] pmix:mca: base: components_register: found loaded component 
ppr
[computer01:39342] pmix:mca: base: components_register: component ppr register 
function successful
[computer01:39342] pmix:mca: base: components_register: found loaded component 
rank_file
[computer01:39342] pmix:mca: base: components_register: component rank_file has 
no register or open function
[computer01:39342] pmix:mca: base: components_register: found loaded component 
round_robin
[computer01:39342] pmix:mca: base: components_register: component round_robin 
register function successful
[computer01:39342] pmix:mca: base: components_register: found loaded component 
seq
[computer01:39342] pmix:mca: base: components_register: component seq register 
function successful
[computer01:39342] mca: base: components_open: opening rmaps components
[computer01:39342] mca: base: components_open: found loaded component ppr
[computer01:39342] mca: base: components_open: component ppr open function 
successful
[computer01:39342] mca: base: components_open: found loaded component rank_file
[computer01:39342] mca: base: components_open: found loaded component 
round_robin
[computer01:39342] mca: base: components_open: component round_robin open 
function successful
[computer01:39342] mca: base: components_open: found loaded component seq   
[35/405]
[computer01:39342] mca: base: components_open: component seq open function 
successful
[computer01:39342] mca:rmaps:select: checking available component ppr
[computer01:39342] mca:rmaps:select: Querying component [ppr]
[computer01:39342] mca:rmaps:select: checking available component rank_file
[computer01:39342] mca:rmaps:select: Querying component [rank_file]
[computer01:39342] mca:rmaps:select: checking available component round_robin
[computer01:39342] mca:rmaps:select: Querying component [round_robin]
[computer01:39342] mca:rmaps:select: checking available component seq
[computer01:39342] mca:rmaps:select: Querying component [seq]
[computer01:39342] [prterun-computer01-39342@0,0]: Final mapper priorities
[computer01:39342]  Mapper: ppr Priority: 90

Re: [OMPI users] users Digest, Vol 4818, Issue 1

2022-11-25 Thread Jeff Squyres (jsquyres) via users
I see 2 config.log files -- can you also send the other information requested 
on that page?  I.e, the version you're using (I think​ you said in a prior 
email that it was 5.0rc9, but I'm not 100% sure), and the output from ompi_info 
--all.

--
Jeff Squyres
jsquy...@cisco.com

From: timesir 
Sent: Friday, November 18, 2022 8:49 AM
To: Jeff Squyres (jsquyres) ; users@lists.open-mpi.org 
; gilles.gouaillar...@gmail.com 

Subject: Re: users Digest, Vol 4818, Issue 1


The information you need is attached.


On 2022/11/18 21:08, Jeff Squyres (jsquyres) wrote:
Yes, Gilles responded within a few hours: 
https://www.mail-archive.com/users@lists.open-mpi.org/msg35057.html

Looking closer, we should still be seeing more output compared to what you 
posted.  It's almost like you have a busted Open MPI installation -- perhaps 
it's missing the "hostfile" component altogether.

How did you install Open MPI?  Can you send the information from "Run time 
problems" on 
https://docs.open-mpi.org/en/v5.0.x/getting-help.html#for-run-time-problems ?

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>

From: timesir <mailto:mrlong...@gmail.com>
Sent: Monday, November 14, 2022 11:32 PM
To: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org> 
<mailto:users@lists.open-mpi.org>; Jeff Squyres 
(jsquyres) <mailto:jsquy...@cisco.com>; 
gilles.gouaillar...@gmail.com<mailto:gilles.gouaillar...@gmail.com> 
<mailto:gilles.gouaillar...@gmail.com>
Subject: Re: users Digest, Vol 4818, Issue 1


(py3.9) ➜  /share   mpirun -n 2 --machinefile hosts --mca rmaps_base_verbose 
100 --mca ras_base_verbose 100  which mpirun
[computer01:39342] mca: base: component_find: searching NULL for ras components
[computer01:39342] mca: base: find_dyn_components: checking NULL for ras 
components
[computer01:39342] pmix:mca: base: components_register: registering framework 
ras components
[computer01:39342] pmix:mca: base: components_register: found loaded component 
simulator
[computer01:39342] pmix:mca: base: components_register: component simulator 
register function successful
[computer01:39342] pmix:mca: base: components_register: found loaded component 
pbs
[computer01:39342] pmix:mca: base: components_register: component pbs register 
function successful
[computer01:39342] pmix:mca: base: components_register: found loaded component 
slurm
[computer01:39342] pmix:mca: base: components_register: component slurm 
register function successful
[computer01:39342] mca: base: components_open: opening ras components
[computer01:39342] mca: base: components_open: found loaded component simulator
[computer01:39342] mca: base: components_open: found loaded component pbs
[computer01:39342] mca: base: components_open: component pbs open function 
successful
[computer01:39342] mca: base: components_open: found loaded component slurm
[computer01:39342] mca: base: components_open: component slurm open function 
successful
[computer01:39342] mca:base:select: Auto-selecting ras components
[computer01:39342] mca:base:select:(  ras) Querying component [simulator]
[computer01:39342] mca:base:select:(  ras) Querying component [pbs]
[computer01:39342] mca:base:select:(  ras) Querying component [slurm]
[computer01:39342] mca:base:select:(  ras) No component selected!
[computer01:39342] mca: base: component_find: searching NULL for rmaps 
components
[computer01:39342] mca: base: find_dyn_components: checking NULL for rmaps 
components
[computer01:39342] pmix:mca: base: components_register: registering framework 
rmaps components
[computer01:39342] pmix:mca: base: components_register: found loaded component 
ppr
[computer01:39342] pmix:mca: base: components_register: component ppr register 
function successful
[computer01:39342] pmix:mca: base: components_register: found loaded component 
rank_file
[computer01:39342] pmix:mca: base: components_register: component rank_file has 
no register or open function
[computer01:39342] pmix:mca: base: components_register: found loaded component 
round_robin
[computer01:39342] pmix:mca: base: components_register: component round_robin 
register function successful
[computer01:39342] pmix:mca: base: components_register: found loaded component 
seq
[computer01:39342] pmix:mca: base: components_register: component seq register 
function successful
[computer01:39342] mca: base: components_open: opening rmaps components
[computer01:39342] mca: base: components_open: found loaded component ppr
[computer01:39342] mca: base: components_open: component ppr open function 
successful
[computer01:39342] mca: base: components_open: found loaded component rank_file
[computer01:39342] mca: base: components_open: found loaded component 
round_robin
[computer01:39342] mca: base: components_open: component round_robin open 
function su

Re: [OMPI users] Tracing of openmpi internal functions

2022-11-14 Thread Jeff Squyres (jsquyres) via users
Open MPI uses plug-in modules for its implementations of the MPI collective 
algorithms.  From that perspective, once you understand that infrastructure, 
it's exactly the same regardless of whether the MPI job is using intra-node or 
inter-node collectives.

We don't have much in the way of detailed internal function call tracing inside 
Open MPI itself, due to performance considerations.  You might want to look 
into flamegraphs, or something similar...?
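
As a rough sketch of the flamegraph approach (assuming Linux "perf" is
available and the stackcollapse-perf.pl / flamegraph.pl scripts from the
FlameGraph project are in your PATH; the wrapper and file names here are just
illustrative):

$ cat perf-wrap.sh
#!/bin/sh
# Run the real program under "perf record", one output file per MPI rank
exec perf record -g -o perf-rank${OMPI_COMM_WORLD_RANK}.data "$@"

$ chmod +x perf-wrap.sh
$ mpirun -np 4 ./perf-wrap.sh ./my_alltoall_test
$ perf script -i perf-rank0.data | stackcollapse-perf.pl | flamegraph.pl > rank0.svg

Open MPI sets OMPI_COMM_WORLD_RANK in each process's environment, so each rank
writes its own perf-rank*.data file.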

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of arun c via users 

Sent: Saturday, November 12, 2022 9:46 AM
To: users@lists.open-mpi.org 
Cc: arun c 
Subject: [OMPI users] Tracing of openmpi internal functions

Hi All,

I am new to Open MPI and trying to learn the internals (at the source code
level) of data transfer during collective operations. At first, I will
limit it to intra-node (between CPU cores and sockets) to minimize the
scope of learning.

What are the best options (looking only for free and open methods) for
tracing the Open MPI code? Say I want to execute an alltoall collective
and trace all the function calls and event callbacks that happen inside
libmpi.so on all the cores.

The Linux kernel has something called ftrace; it gives a neat call graph
of all the internal functions inside the kernel, with timing. Is
something similar available?

--Arun


Re: [OMPI users] [OMPI devel] There are not enough slots available in the system to satisfy the 2, slots that were requested by the application

2022-11-14 Thread Jeff Squyres (jsquyres) via users
BE option to ignore the
number of available slots when deciding the number of processes to
launch.
----------



On 2022/11/13 23:42, Jeff Squyres (jsquyres) wrote:
Interesting.  It says:

[computer01:106117] AVAILABLE NODES FOR MAPPING:
[computer01:106117] node: computer01 daemon: 0 slots_available: 1

This is why it tells you you're out of slots: you're asking for 2, but it only 
found 1.  This means it's not seeing your hostfile somehow.

I should have asked you to run with 2​ variables last time -- can you re-run 
with "mpirun --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 ..."?

Turning on the RAS verbosity should show us what the hostfile component is 
doing.

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>
____
From: 龙龙 <mailto:mrlong...@gmail.com>
Sent: Sunday, November 13, 2022 3:13 AM
To: Jeff Squyres (jsquyres) <mailto:jsquy...@cisco.com>; 
Open MPI Users <mailto:users@lists.open-mpi.org>
Subject: Re: [OMPI devel] There are not enough slots available in the system to 
satisfy the 2, slots that were requested by the application


(py3.9) ➜ /share mpirun –version

mpirun (Open MPI) 5.0.0rc9

Report bugs to https://www.open-mpi.org/community/help/

(py3.9) ➜ /share cat hosts

192.168.180.48 slots=1
192.168.60.203 slots=1

(py3.9) ➜ /share mpirun -n 2 -machinefile hosts –mca rmaps_base_verbose 100 
which mpirun

[computer01:106117] mca: base: component_find: searching NULL for rmaps 
components
[computer01:106117] mca: base: find_dyn_components: checking NULL for rmaps 
components
[computer01:106117] pmix:mca: base: components_register: registering framework 
rmaps components
[computer01:106117] pmix:mca: base: components_register: found loaded component 
ppr
[computer01:106117] pmix:mca: base: components_register: component ppr register 
function successful
[computer01:106117] pmix:mca: base: components_register: found loaded component 
rank_file
[computer01:106117] pmix:mca: base: components_register: component rank_file 
has no register or open function
[computer01:106117] pmix:mca: base: components_register: found loaded component 
round_robin
[computer01:106117] pmix:mca: base: components_register: component round_robin 
register function successful
[computer01:106117] pmix:mca: base: components_register: found loaded component 
seq
[computer01:106117] pmix:mca: base: components_register: component seq register 
function successful
[computer01:106117] mca: base: components_open: opening rmaps components
[computer01:106117] mca: base: components_open: found loaded component ppr
[computer01:106117] mca: base: components_open: component ppr open function 
successful
[computer01:106117] mca: base: components_open: found loaded component rank_file
[computer01:106117] mca: base: components_open: found loaded component 
round_robin
[computer01:106117] mca: base: components_open: component round_robin open 
function successful
[computer01:106117] mca: base: components_open: found loaded component seq
[computer01:106117] mca: base: components_open: component seq open function 
successful
[computer01:106117] mca:rmaps:select: checking available component ppr
[computer01:106117] mca:rmaps:select: Querying component [ppr]
[computer01:106117] mca:rmaps:select: checking available component rank_file
[computer01:106117] mca:rmaps:select: Querying component [rank_file]
[computer01:106117] mca:rmaps:select: checking available component round_robin
[computer01:106117] mca:rmaps:select: Querying component [round_robin]
[computer01:106117] mca:rmaps:select: checking available component seq
[computer01:106117] mca:rmaps:select: Querying component [seq]
[computer01:106117] [prterun-computer01-106117@0,0]: Final mapper priorities
[computer01:106117] Mapper: ppr Priority: 90
[computer01:106117] Mapper: seq Priority: 60
[computer01:106117] Mapper: round_robin Priority: 10
[computer01:106117] Mapper: rank_file Priority: 0
[computer01:106117] mca:rmaps: mapping job prterun-computer01-106117@1

[computer01:106117] mca:rmaps: setting mapping policies for job 
prterun-computer01-106117@1 inherit TRUE hwtcpus FALSE [9/1957]
[computer01:106117] mca:rmaps[358] mapping not given - using bycore
[computer01:106117] setdefaultbinding[365] binding not given - using bycore
[computer01:106117] mca:rmaps:ppr: job prterun-computer01-106117@1 not using 
ppr mapper PPR NULL policy PPR NOTSET
[computer01:106117] mca:rmaps:seq: job prterun-computer01-106117@1 not using 
seq mapper
[computer01:106117] mca:rmaps:rr: mapping job prterun-computer01-106117@1
[computer01:106117] AVAILABLE NODES FOR MAPPING:
[computer01:106117] node: computer01 daemon: 0 slots_available: 1
[computer01:106117] mca:rmaps:rr: mapping by Core for job 
prterun-computer01-106117@1 slots 1 num_procs 2



There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

which

Eithe

Re: [OMPI users] [OMPI devel] There are not enough slots available in the system to satisfy the 2, slots that were requested by the application

2022-11-13 Thread Jeff Squyres (jsquyres) via users
Interesting.  It says:

[computer01:106117] AVAILABLE NODES FOR MAPPING:
[computer01:106117] node: computer01 daemon: 0 slots_available: 1

This is why it tells you you're out of slots: you're asking for 2, but it only 
found 1.  This means it's not seeing your hostfile somehow.

I should have asked you to run with 2​ variables last time -- can you re-run 
with "mpirun --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 ..."?

Turning on the RAS verbosity should show us what the hostfile component is 
doing.

--
Jeff Squyres
jsquy...@cisco.com

From: 龙龙 
Sent: Sunday, November 13, 2022 3:13 AM
To: Jeff Squyres (jsquyres) ; Open MPI Users 

Subject: Re: [OMPI devel] There are not enough slots available in the system to 
satisfy the 2, slots that were requested by the application


(py3.9) ➜ /share mpirun –version

mpirun (Open MPI) 5.0.0rc9

Report bugs to https://www.open-mpi.org/community/help/

(py3.9) ➜ /share cat hosts

192.168.180.48 slots=1
192.168.60.203 slots=1

(py3.9) ➜ /share mpirun -n 2 -machinefile hosts –mca rmaps_base_verbose 100 
which mpirun

[computer01:106117] mca: base: component_find: searching NULL for rmaps 
components
[computer01:106117] mca: base: find_dyn_components: checking NULL for rmaps 
components
[computer01:106117] pmix:mca: base: components_register: registering framework 
rmaps components
[computer01:106117] pmix:mca: base: components_register: found loaded component 
ppr
[computer01:106117] pmix:mca: base: components_register: component ppr register 
function successful
[computer01:106117] pmix:mca: base: components_register: found loaded component 
rank_file
[computer01:106117] pmix:mca: base: components_register: component rank_file 
has no register or open function
[computer01:106117] pmix:mca: base: components_register: found loaded component 
round_robin
[computer01:106117] pmix:mca: base: components_register: component round_robin 
register function successful
[computer01:106117] pmix:mca: base: components_register: found loaded component 
seq
[computer01:106117] pmix:mca: base: components_register: component seq register 
function successful
[computer01:106117] mca: base: components_open: opening rmaps components
[computer01:106117] mca: base: components_open: found loaded component ppr
[computer01:106117] mca: base: components_open: component ppr open function 
successful
[computer01:106117] mca: base: components_open: found loaded component rank_file
[computer01:106117] mca: base: components_open: found loaded component 
round_robin
[computer01:106117] mca: base: components_open: component round_robin open 
function successful
[computer01:106117] mca: base: components_open: found loaded component seq
[computer01:106117] mca: base: components_open: component seq open function 
successful
[computer01:106117] mca:rmaps:select: checking available component ppr
[computer01:106117] mca:rmaps:select: Querying component [ppr]
[computer01:106117] mca:rmaps:select: checking available component rank_file
[computer01:106117] mca:rmaps:select: Querying component [rank_file]
[computer01:106117] mca:rmaps:select: checking available component round_robin
[computer01:106117] mca:rmaps:select: Querying component [round_robin]
[computer01:106117] mca:rmaps:select: checking available component seq
[computer01:106117] mca:rmaps:select: Querying component [seq]
[computer01:106117] [prterun-computer01-106117@0,0]: Final mapper priorities
[computer01:106117] Mapper: ppr Priority: 90
[computer01:106117] Mapper: seq Priority: 60
[computer01:106117] Mapper: round_robin Priority: 10
[computer01:106117] Mapper: rank_file Priority: 0
[computer01:106117] mca:rmaps: mapping job prterun-computer01-106117@1

[computer01:106117] mca:rmaps: setting mapping policies for job 
prterun-computer01-106117@1 inherit TRUE hwtcpus FALSE [9/1957]
[computer01:106117] mca:rmaps[358] mapping not given - using bycore
[computer01:106117] setdefaultbinding[365] binding not given - using bycore
[computer01:106117] mca:rmaps:ppr: job prterun-computer01-106117@1 not using 
ppr mapper PPR NULL policy PPR NOTSET
[computer01:106117] mca:rmaps:seq: job prterun-computer01-106117@1 not using 
seq mapper
[computer01:106117] mca:rmaps:rr: mapping job prterun-computer01-106117@1
[computer01:106117] AVAILABLE NODES FOR MAPPING:
[computer01:106117] node: computer01 daemon: 0 slots_available: 1
[computer01:106117] mca:rmaps:rr: mapping by Core for job 
prterun-computer01-106117@1 slots 1 num_procs 2



There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

which

Either request fewer procs for your application, or make more slots
available for use.

A “slot” is the PRRTE term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which PRRTE processes are run:

  1.  Hostfile, via “slots=N” clauses (N defaults to 

Re: [OMPI users] --mca btl_base_verbose 30 not working in version 5.0

2022-11-07 Thread Jeff Squyres (jsquyres) via users
Sorry for the delay in replying.

To tie up this thread for the web mail archives: this same question was 
cross-posted over in the devel list; I replied there.

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of mrlong via users 

Sent: Sunday, October 30, 2022 10:02 AM
To: users@lists.open-mpi.org 
Cc: mrlong 
Subject: [OMPI users] --mca btl_base_verbose 30 not working in version 5.0


mpirun --mca btl self,sm,tcp --mca btl_base_verbose 30 -np 2 --machinefile 
hostfile  hostname

Why does this command not print whether IP addresses are routable in Open MPI 5.0.0rc9?



Re: [OMPI users] [OMPI devel] There are not enough slots available in the system to satisfy the 2, slots that were requested by the application

2022-11-07 Thread Jeff Squyres (jsquyres) via users
In the future, can you please just mail one of the lists?  This particular 
question is probably more of a users type of question (since we're not talking 
about the internals of Open MPI itself), so I'll reply just on the users list.

For what it's worth, I'm unable to replicate your error:


$ mpirun --version

mpirun (Open MPI) 5.0.0rc9


Report bugs to https://www.open-mpi.org/community/help/

$ cat hostfile

mpi002 slots=1

mpi005 slots=1

$ mpirun -n 2 --machinefile hostfile hostname

mpi002

mpi005

Can you try running with "--mca rmaps_base_verbose 100" so that we can get some 
debugging output and see why the slots aren't working for you?  Show the full 
output, like I did above (e.g., cat the hostfile, and then mpirun with the MCA 
param and all the output).  Thanks!

--
Jeff Squyres
jsquy...@cisco.com

From: devel  on behalf of mrlong via devel 

Sent: Monday, November 7, 2022 3:37 AM
To: de...@lists.open-mpi.org ; Open MPI Users 

Cc: mrlong 
Subject: [OMPI devel] There are not enough slots available in the system to 
satisfy the 2, slots that were requested by the application


Two machines, each with 64 cores. The contents of the hosts file are:

192.168.180.48 slots=1
192.168.60.203 slots=1

Why do I get the following error when running with Open MPI 5.0.0rc9?

(py3.9) [user@machine01 share]$  mpirun -n 2 --machinefile hosts hostname
--
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  hostname

Either request fewer procs for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
 processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
 hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
 RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.



Re: [OMPI users] [EXTERNAL] Beginner Troubleshooting OpenMPI Installation - pmi.h Error

2022-10-06 Thread Jeff Squyres (jsquyres) via users
Hmm; that's a little unexpected, but it actually helps simplify the debugging 
process.

It looks like you are using an external hwloc build from 
/cm/shared/apps/hwloc/1.11.11.  Is there a libhwloc.la file in there somewhere? 
If so, can you see if "-lnuma" and "-ludev" are in this file?  If that's the
case, then that's where Open MPI is getting these CLI arguments.
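
Something along these lines should show it (the lib/ subdirectory is my guess
based on the prefix above):

grep -E -e '-lnuma|-ludev' /cm/shared/apps/hwloc/1.11.11/lib/libhwloc.la
grep dependency_libs /cm/shared/apps/hwloc/1.11.11/lib/libhwloc.la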

--
Jeff Squyres
jsquy...@cisco.com

From: Jeffrey D. (JD) Tamucci 
Sent: Wednesday, October 5, 2022 5:16 PM
To: Jeff Squyres (jsquyres) 
Cc: Open MPI Users ; Pritchard Jr., Howard 

Subject: Re: [OMPI users] [EXTERNAL] Beginner Troubleshooting OpenMPI 
Installation - pmi.h Error

Thank you for your assistance! I tried to compile it with either the
--enable-shared or the --enable-static flag, and both seemed to hit the same
error as before (-lnuma and -ludev), unfortunately. A dropbox link to the full
output for each install is below:

https://www.dropbox.com/s/axnbze56iokyyqe/ompi-output_static_v_shared.tar.bz2?dl=0

I appreciate your help. I will run this by the admins of our HPC as well.

Best,
JD






Jeffrey D. (JD) Tamucci

University of Connecticut
Molecular & Cell Biology
RA in Lab of Eric R. May
PhD / MPH Candidate
he/him


On Wed, Oct 5, 2022 at 1:53 PM Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:

*Message sent from a system outside of UConn.*


Actually, I think the problem might be a little more subtle.

I see that you configured with both --enable-static and --enable-shared.

My gut reaction is that there might be some kind of issue with enabling both of 
those options (by default, shared is enabled and static is disabled).  If you 
configure+build with just one of those two options, does it work?

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>

From: users 
mailto:users-boun...@lists.open-mpi.org>> on 
behalf of Pritchard Jr., Howard via users 
mailto:users@lists.open-mpi.org>>
Sent: Wednesday, October 5, 2022 11:47 AM
To: Jeffrey D. (JD) Tamucci 
mailto:jeffrey.tamu...@uconn.edu>>
Cc: Pritchard Jr., Howard mailto:howa...@lanl.gov>>; Open MPI 
Users mailto:users@lists.open-mpi.org>>
Subject: Re: [OMPI users] [EXTERNAL] Beginner Troubleshooting OpenMPI 
Installation - pmi.h Error


Hi Jeff,



I think you are now at the “send the system admin an email and ask that the
numa and udev devel RPMs be installed” stage.  They will need to install these
RPMs on the compute node image(s) as well.



Howard





From: "Jeffrey D. (JD) Tamucci" 
mailto:jeffrey.tamu...@uconn.edu>>
Date: Wednesday, October 5, 2022 at 9:20 AM
To: "Pritchard Jr., Howard" mailto:howa...@lanl.gov>>
Cc: "bbarr...@amazon.com<mailto:bbarr...@amazon.com>" 
mailto:bbarr...@amazon.com>>, Open MPI Users 
mailto:users@lists.open-mpi.org>>
Subject: Re: [EXTERNAL] [OMPI users] Beginner Troubleshooting OpenMPI 
Installation - pmi.h Error



Gladly, I tried it that way and it worked in that it was able to find pmi.h.
Unfortunately there's a new error about ld not finding -lnuma and -ludev.



make[2]: Entering directory '/shared/maylab/src/openmpi-4.1.4/opal'
  CCLD libopen-pal.la
/usr/bin/ld: cannot find -lnuma
/usr/bin/ld: cannot find -ludev
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:2249: libopen-pal.la] Error 1
make[2]: Leaving directory '/shared/maylab/src/openmpi-4.1.4/opal'
make[1]: *** [Makefile:2394: install-recursive] Error 1
make[1]: Leaving directory '/shared/maylab/src/openmpi-4.1.4/opal'
make: *** [Makefile:1912: install-recursive] Error 1



Here is a dropbox link to the full output: 
https://www.dropbox.com/s/4rv8n2yp320ix08/ompi-output_Oct4_2022.tar.bz2?dl=0



Thank you for your help!



JD





Jeffrey D. (JD) Tamucci

University of Connecticut

Molecular & Cell Biology

RA in Lab of Eric R. May

PhD / MPH Candidate

he/him





On Tue, Oct 4, 2022 at 1:51 PM Pritchard Jr., Howard 
mailto:howa...@lanl.gov>> wrote:

*Message sent from a system outside of UConn.*



Could you change the –with-pmi to be

--with-pmi=/cm/shared/apps/slurm21.08.8



?





From: "Jeffrey D. (JD) Tamucci" 
mailto:jeffrey.tamu...@uconn.edu>>
Date: Tuesday, October 4, 2022 at 10:40 AM
To: 

Re: [OMPI users] [EXTERNAL] Beginner Troubleshooting OpenMPI Installation - pmi.h Error

2022-10-05 Thread Jeff Squyres (jsquyres) via users
Actually, I think the problem might be a little more subtle.

I see that you configured with both --enable-static and --enable-shared.

My gut reaction is that there might be some kind of issue with enabling both of 
those options (by default, shared is enabled and static is disabled).  If you 
configure+build with just one of those two options, does it work?
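
In other words (a sketch based on the configure line you sent, with
--enable-static simply dropped and everything else unchanged):

./configure \
--prefix=/shared/maylab/mayapps/mpi/openmpi/4.1.4 \
--with-slurm \
--with-lsf=no \
--with-pmi=/cm/shared/apps/slurm/21.08.8/include/slurm \
--with-pmi-libdir=/cm/shared/apps/slurm/21.08.8/lib64 \
--with-hwloc=/cm/shared/apps/hwloc/1.11.11 \
--with-cuda=/gpfs/sharedfs1/admin/hpc2.0/apps/cuda/11.6 \
--enable-shared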

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of Pritchard Jr., 
Howard via users 
Sent: Wednesday, October 5, 2022 11:47 AM
To: Jeffrey D. (JD) Tamucci 
Cc: Pritchard Jr., Howard ; Open MPI Users 

Subject: Re: [OMPI users] [EXTERNAL] Beginner Troubleshooting OpenMPI 
Installation - pmi.h Error


Hi Jeff,



I think you are now at the “send the system admin an email and ask that the
numa and udev devel RPMs be installed” stage.  They will need to install these
RPMs on the compute node image(s) as well.



Howard





From: "Jeffrey D. (JD) Tamucci" 
Date: Wednesday, October 5, 2022 at 9:20 AM
To: "Pritchard Jr., Howard" 
Cc: "bbarr...@amazon.com" , Open MPI Users 

Subject: Re: [EXTERNAL] [OMPI users] Beginner Troubleshooting OpenMPI 
Installation - pmi.h Error



Gladly, I tried it that way and it worked in that it was able to find pmi.h.
Unfortunately there's a new error about ld not finding -lnuma and -ludev.



make[2]: Entering directory '/shared/maylab/src/openmpi-4.1.4/opal'
  CCLD 
libopen-pal.la
/usr/bin/ld: cannot find -lnuma
/usr/bin/ld: cannot find -ludev
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:2249: 
libopen-pal.la]
 Error 1
make[2]: Leaving directory '/shared/maylab/src/openmpi-4.1.4/opal'
make[1]: *** [Makefile:2394: install-recursive] Error 1
make[1]: Leaving directory '/shared/maylab/src/openmpi-4.1.4/opal'
make: *** [Makefile:1912: install-recursive] Error 1



Here is a dropbox link to the full output: 
https://www.dropbox.com/s/4rv8n2yp320ix08/ompi-output_Oct4_2022.tar.bz2?dl=0



Thank you for your help!



JD





Jeffrey D. (JD) Tamucci

University of Connecticut

Molecular & Cell Biology

RA in Lab of Eric R. May

PhD / MPH Candidate

he/him





On Tue, Oct 4, 2022 at 1:51 PM Pritchard Jr., Howard 
mailto:howa...@lanl.gov>> wrote:

*Message sent from a system outside of UConn.*



Could you change the –with-pmi to be

--with-pmi=/cm/shared/apps/slurm21.08.8



?





From: "Jeffrey D. (JD) Tamucci" 
mailto:jeffrey.tamu...@uconn.edu>>
Date: Tuesday, October 4, 2022 at 10:40 AM
To: "Pritchard Jr., Howard" mailto:howa...@lanl.gov>>, 
"bbarr...@amazon.com" 
mailto:bbarr...@amazon.com>>
Cc: Open MPI Users mailto:users@lists.open-mpi.org>>
Subject: Re: [EXTERNAL] [OMPI users] Beginner Troubleshooting OpenMPI 
Installation - pmi.h Error



Hi Howard and Brian,



Of course. Here's a dropbox link to the full folder: 
https://www.dropbox.com/s/raqlcnpgk9wz78b/ompi-output_Sep30_2022.tar.bz2?dl=0



This was the configure and make commands:

./configure \
--prefix=/shared/maylab/mayapps/mpi/openmpi/4.1.4 \
--with-slurm \
--with-lsf=no \
--with-pmi=/cm/shared/apps/slurm/21.08.8/include/slurm \
--with-pmi-libdir=/cm/shared/apps/slurm/21.08.8/lib64 \
--with-hwloc=/cm/shared/apps/hwloc/1.11.11 \
--with-cuda=/gpfs/sharedfs1/admin/hpc2.0/apps/cuda/11.6 \
--enable-shared \
--enable-static &&
make -j 32 &&
make -j 32 check
make install

The output of the make command is in the install_open-mpi_4.1.4_hpc2.log file.





Jeffrey D. (JD) Tamucci

University of Connecticut

Molecular & Cell Biology

RA in Lab of Eric R. May

PhD / MPH Candidate

he/him





On Tue, Oct 4, 2022 at 12:33 PM Pritchard Jr., Howard 
mailto:howa...@lanl.gov>> wrote:

*Message sent from a system outside of UConn.*



HI JD,



Could you post the configure options your script uses to build Open MPI?



Howard



From: users 
mailto:users-boun...@lists.open-mpi.org>> on 
behalf of "Jeffrey D. (JD) Tamucci via users" 
mailto:users@lists.open-mpi.org>>
Reply-To: Open MPI Users 
mailto:users@lists.open-mpi.org>>
Date: Tuesday, October 4, 2022 at 10:07 AM
To: "users@lists.open-mpi.org" 
mailto:users@lists.open

Re: [OMPI users] openmpi compile failure

2022-09-28 Thread Jeff Squyres (jsquyres) via users
Looking at the detailed compile line in the "make" output that you sent, I 
don't see anything too unusual (e.g., in -I or other preprocessor directives).

You might want to look around your machine and see if there's an alternate 
signal.h that is somehow getting found and included.

If that doesn't yield anything interesting, then perhaps copy the "/usr/bin/gcc 
..." command for compiling signal.c from your make.out file, and add in a -E so 
that you can see the preprocessor output.  Then you can probably track down 
exactly which signal.h is being used.  For example, this is the command I see 
in your make.out (with line breaks added for readability):


/usr/bin/gcc -DHAVE_CONFIG_H -I. \
-I../../../../../../opal/mca/event/libevent2022/libevent \
-I../../../../../../opal/mca/event/libevent2022/libevent/compat \
-I../../../../../../opal/mca/event/libevent2022/libevent/include \
-I./include -I/home/zmumba/LIBS/src/openmpi-4.1.4 \
-I/home/zmumba/LIBS/src/openmpi-4.1.4/build \
-I/home/zmumba/LIBS/src/openmpi-4.1.4/opal/include \
-I/home/zmumba/LIBS/src/openmpi-4.1.4/build/opal/mca/hwloc/hwloc201/hwloc/include \
-I/home/zmumba/LIBS/src/openmpi-4.1.4/opal/mca/hwloc/hwloc201/hwloc/include \
-DNDEBUG -Drandom=opal_random -O -fPIC -D_XOPEN_SOURCE=500 -Wall \
-fno-strict-aliasing -pthread -MT signal.lo -MD -MP -MF \
.deps/signal.Tpo -c \
../../../../../../opal/mca/event/libevent2022/libevent/signal.c -fPIC \
-DPIC -o .libs/signal.o


If you remove the -o .libs/signal.o and instead put in a -E, you can redirect 
that and see the source code that came out of the preprocessor, and do a little 
backwards digging to figure out which signal.h was used:


/usr/bin/gcc -DHAVE_CONFIG_H -I. \
-I../../../../../../opal/mca/event/libevent2022/libevent \
-I../../../../../../opal/mca/event/libevent2022/libevent/compat \
-I../../../../../../opal/mca/event/libevent2022/libevent/include \
-I./include -I/home/zmumba/LIBS/src/openmpi-4.1.4 \
-I/home/zmumba/LIBS/src/openmpi-4.1.4/build \
-I/home/zmumba/LIBS/src/openmpi-4.1.4/opal/include \
-I/home/zmumba/LIBS/src/openmpi-4.1.4/build/opal/mca/hwloc/hwloc201/hwloc/include \
-I/home/zmumba/LIBS/src/openmpi-4.1.4/opal/mca/hwloc/hwloc201/hwloc/include \
-DNDEBUG -Drandom=opal_random -O -fPIC -D_XOPEN_SOURCE=500 -Wall \
-fno-strict-aliasing -pthread -MT signal.lo -MD -MP -MF \
.deps/signal.Tpo -c \
../../../../../../opal/mca/event/libevent2022/libevent/signal.c -fPIC \
-DPIC -E > signal-preprocessed.c
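
Once signal-preprocessed.c exists, the preprocessor's "# <line> <file>" markers
record every header that was pulled in, so something like this will list
exactly which signal.h files were used:

grep '^# ' signal-preprocessed.c | grep 'signal\.h' | sort -u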

--
Jeff Squyres
jsquy...@cisco.com

From: Zilore Mumba 
Sent: Wednesday, September 28, 2022 1:50 AM
To: Jeff Squyres (jsquyres) 
Cc: users@lists.open-mpi.org 
Subject: Re: [OMPI users] openmpi compile failure

Thanks once again for that insight Jeff. Indeed it is my configuration. When I 
run the code snippet you sent I get exactly the result you have " NSIG is 65".
So I have to ensure my configure is pointing to the right libraries.

On Wed, Sep 28, 2022 at 2:02 AM Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:
I'm not sure why that would happen; it does sound like some kind of 
misconfiguration on your system.

If I compile this trivial application on Ubuntu 18.04:


#include <stdio.h>
#include <signal.h>

int main() {
    printf("NSIG is %d\n", NSIG);
    return 0;
}

Like this:


$ gcc foo.c -o foo && ./foo

NSIG is 65

You can see that NSIG is definitely defined for me.

It's likely that until the above trivial program can compile properly, Open MPI 
won't compile properly, either.

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>
____
From: Zilore Mumba mailto:zmu...@gmail.com>>
Sent: Tuesday, September 27, 2022 2:51 PM
To: Jeff Squyres (jsquyres) mailto:jsquy...@cisco.com>>
Cc: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org> 
mailto:users@lists.open-mpi.org>>
Subject: Re: [OMPI users] openmpi compile failure

Thanks Jeff,
I have tried with openmpi-4.1.4, but I still get the same error. The main error 
being
../../../../../../opal/mca/event/libevent2022/libevent/signal.c:135:14: error: 
‘NSIG’ undeclared (first use in this function); did you mean ‘_NSIG’?
  int ncaught[NSIG];
  ^~~~
  _NSIG
But I notice that in the file "/usr/include/x86_64-linux-gnu/asm/signal.h" there
is some definition of NSIG:
#define NSIG 32
typedef unsigned long sigset_t;

/* These should not be considered constants from userland.  */
#define SIGRTMIN 32
#define SIGRTMAX _NSIG

So I am wondering if it is my system which is not picking up the correct 
version of signal.h
I have attached a new zipped file ompi-output-tar.bz2, which is also on 
dropbox, link https://www.dropbox.com/s/ps49xqximjnn8oy/ompi-output.tar.bz2?dl=0


On Tue, Sep 27, 2022 at 2:19 PM Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:
Can you re-try

Re: [OMPI users] openmpi compile failure

2022-09-27 Thread Jeff Squyres (jsquyres) via users
I'm not sure why that would happen; it does sound like some kind of 
misconfiguration on your system.

If I compile this trivial application on Ubuntu 18.04:


#include <signal.h>

#include <stdio.h>


int main() {

printf("NSIG is %d\n", NSIG);

return 0;

}

Like this:


$ gcc foo.c -o foo && ./foo

NSIG is 65

You can see that NSIG is definitely defined for me.

It's likely that until the above trivial program can compile properly, Open MPI 
won't compile properly, either.

--
Jeff Squyres
jsquy...@cisco.com

From: Zilore Mumba 
Sent: Tuesday, September 27, 2022 2:51 PM
To: Jeff Squyres (jsquyres) 
Cc: users@lists.open-mpi.org 
Subject: Re: [OMPI users] openmpi compile failure

Thanks Jeff,
I have tried with openmpi-4.1.4, but I still get the same error. The main error 
being
../../../../../../opal/mca/event/libevent2022/libevent/signal.c:135:14: error: 
‘NSIG’ undeclared (first use in this function); did you mean ‘_NSIG’?
  int ncaught[NSIG];
  ^~~~
  _NSIG
But I notice that in the file "/usr/include/x86_64-linux-gnu/asm/signal.h" there 
is some definition of NSIG
#define NSIG            32
typedef unsigned long sigset_t;


/* These should not be considered constants from userland.  */
#define SIGRTMIN        32
#define SIGRTMAX        _NSIG

So I am wondering if it is my system which is not picking up the correct 
version of signal.h
I have a attached a new zipped file ompi-output-tar.bz2, which is also on 
dropbox, link https://www.dropbox.com/s/ps49xqximjnn8oy/ompi-output.tar.bz2?dl=0


On Tue, Sep 27, 2022 at 2:19 PM Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:
Can you re-try with the latest Open MPI v4.1.x release (v4.1.4)?  There have 
been many bug fixes since v4.1.0.

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>


From: users 
mailto:users-boun...@lists.open-mpi.org>> on 
behalf of Zilore Mumba via users 
mailto:users@lists.open-mpi.org>>
Sent: Tuesday, September 27, 2022 5:10 AM
To: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org> 
mailto:users@lists.open-mpi.org>>
Cc: Zilore Mumba mailto:zmu...@gmail.com>>
Subject: Re: [OMPI users] openmpi compile failure

I am seeking help compiling openmpi. My compilation and installation output is 
in dropbox at the link below
https://www.dropbox.com/s/1a9tv5lnwicnhds/ompi-output.tar.bz2?dl=0
Help will be appreciated.



Re: [OMPI users] openmpi compile failure

2022-09-27 Thread Jeff Squyres (jsquyres) via users
Can you re-try with the latest Open MPI v4.1.x release (v4.1.4)?  There have 
been many bug fixes since v4.1.0.

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Zilore Mumba via 
users 
Sent: Tuesday, September 27, 2022 5:10 AM
To: users@lists.open-mpi.org 
Cc: Zilore Mumba 
Subject: Re: [OMPI users] openmpi compile failure

I am seeking help compiling openmpi. My compilation and installation output is 
in dropbox at the link below
https://www.dropbox.com/s/1a9tv5lnwicnhds/ompi-output.tar.bz2?dl=0
Help will be appreciated.



Re: [OMPI users] --mca parameter explainer; mpirun WARNING: There was an error initializing an OpenFabrics device

2022-09-26 Thread Jeff Squyres (jsquyres) via users
Just to follow up for the email web archives: this issue was followed up in 
https://github.com/open-mpi/ompi/issues/10841.

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of Rob Kudyba via 
users 
Sent: Thursday, September 22, 2022 2:15 PM
To: users@lists.open-mpi.org 
Cc: Rob Kudyba 
Subject: [OMPI users] --mca parameter explainer; mpirun WARNING: There was an 
error initializing an OpenFabrics device

We're using OpenMPI 4.1.1, CUDA aware on RHEL 8 cluster that we load as a 
module with Infiniband controller Mellanox Technologies MT28908 Family 
ConnectX-6, we see this warning running mpirun without any MCA 
options/parameters:
WARNING: There was an error initializing an OpenFabrics device.
  Local host:   
  Local device: mlx5_0
-

I did add 0x02c9 to our mca-btl-openib-device-params.ini file for the Mellanox 
ConnectX6 stanza as we were getting the following warning that no longer 
appears:

WARNING: No preset parameters were found for the device that Open MPI detected:

  Local host:
  Device name:   mlx5_0
  Device vendor ID:  0x02c9
  Device vendor part ID: 4123


Which I found is referenced in these 
comments:

# Note: Several vendors resell Mellanox hardware and put their own firmware
# on the cards, therefore overriding the default Mellanox vendor ID.
#
# Mellanox  0x02c9

Running  ompi_info --param btl all we have:
MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.1.1)

So I am trying to wrap my head around the various warnings, and to understand how 
the various options/parameters that are available can improve performance and/or 
when to use them.

I've gone through the OpenMPI run-time tuning documentation, and I've used this 
STREAMS benchmark, https://anilmaurya.wordpress.com/2016/10/12/stream-benchmarks/, 
as well as these OSU Micro-Benchmarks at 
https://ulhpc-tutorials.readthedocs.io/en/latest/parallel/mpi/OSU_MicroBenchmarks/

With version 4.1.1, if I use --mca btl 'openib' I get seg faults which I 
believe is expected as it's 
deprecated. 
I've tried --mca  btl '^openib', --mca  btl 'tcp' (or  --mca  btl 'tcp,self' 
using the OSU BMs) and the benchmark results are very similar even when I use 
multiple CPUs, threads and/or nodes. They also run without the warning 
messages. If I don't use a --mca option, I get the WARNING: message.

Does anyone know of a tried and true way to run these benchmarks so I can tell 
whether these MCA parameters make a difference, or am I just not understanding how 
to use them? Perhaps running these benchmarks on a very active cluster with 
shared CPUs/nodes will affect the results?

I can share any desired results if that helps the discussion.

Thanks!


Re: [OMPI users] Hardware topology influence

2022-09-14 Thread Jeff Squyres (jsquyres) via users
It was pointed out to me off-list that I should update my worldview on HPC in 
VMs.  :-)

So let me clarify my remarks about VMs: yes, many organizations run bare-metal 
HPC environments.  However, it is no longer unusual to run HPC in VMs.  Using 
modern VM technology, especially when tuned for HPC workloads (e.g., bind each 
vCPU to a physical CPU), VMs can achieve quite low overhead these days.  There 
are many benefits to running virtualized environments, and those are no longer 
off-limits to HPC workloads.  Indeed, VM overheads may be outweighed by other 
benefits of running in VM-based environments.

That being said, I'm not encouraging you to run 96 VMs on a single host, for 
example.  I have not done any VM testing myself, but I imagine that the same 
adage that applies to HPC bare metal environments also applies to HPC VM 
environments: let Open MPI use shared memory to communicate (vs. a network) 
whenever possible.  In your environment, this likely translates to having a 
single VM per host (encompassing all the physical CPUs that you want to use on 
that host) and launching N_x MPI processes in each VM (where N_x is the number 
of vCPU/physical CPUs available in VM x).  This will allow the MPI processes to 
use shared memory for on-node communication.

--
Jeff Squyres
jsquy...@cisco.com

From: Jeff Squyres (jsquyres) 
Sent: Tuesday, September 13, 2022 10:08 AM
To: Open MPI Users 
Cc: Gilles Gouaillardet 
Subject: Re: [OMPI users] Hardware topology influence

Let me add a little more color on what Gilles stated.

First, you should probably upgrade to the latest v4.1.x release: v4.1.4.  It 
has a bunch of bug fixes compared to v4.1.0.

Second, you should know that it is relatively uncommon to run HPC/MPI apps 
inside VMs because the virtualization infrastructure will -- by definition -- 
decrease your overall performance.  This is usually counter to the goal of 
writing/running HPC applications.  If you do run HPC/MPI applications in VMs, 
it is strongly recommended that you bind the cores in the VM to physical cores 
to attempt to minimize the performance loss.

By default, Open MPI maps MPI processes by core when deciding how many 
processes to place on each machine (and also deciding how to bind them).  For 
example, Open MPI looks at a machine and sees that it has N cores, and (by 
default) maps N MPI processes to that machine.  You can change Open MPI's 
defaults to map by hardware thread ("Hyperthread" in Intel parlance) instead 
of by core, but conventional knowledge is that math-heavy processes don't 
perform well with the limited resources of a single hardware thread, and 
benefit from the full resources of the core (this depends on your specific app, 
of course -- YMMV).  Intel's and AMD's hardware threads have gotten better over 
the years, but I think they still represent a division of resources in the 
core, and will likely still be performance-detrimental to at least some classes 
of HPC applications.  It's a surprisingly complicated topic.

In the v4.x series, note that you can use "mpirun --report-bindings ..." to see 
exactly where Open MPI thinks it has bound each process.  Note that this 
binding occurs before each MPI process starts; it's nothing that the 
application itself needs to do.
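
For illustration, a hedged sketch (the application name and process count are hypothetical) 
comparing the default per-core behavior with an explicit per-hardware-thread mapping:

shell$ mpirun --report-bindings -np 4 ./my_mpi_app
shell$ mpirun --map-by hwthread --bind-to hwthread --report-bindings -np 4 ./my_mpi_app

The first command uses the defaults described above; the second maps and binds each MPI process 
to a single hardware thread instead of a full core.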

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of Gilles Gouaillardet 
via users 
Sent: Tuesday, September 13, 2022 9:07 AM
To: Open MPI Users 
Cc: Gilles Gouaillardet 
Subject: Re: [OMPI users] Hardware topology influence

Lucas,

the number of MPI tasks started by mpirun is either
 - explicitly passed via the command line (e.g. mpirun -np 2306 ...)
 - equals to the number of available slots, and this value is either
 a) retrieved from the resource manager (such as a SLURM allocation)
 b) explicitly set in a machine file (e.g. mpirun -machinefile 
 ...) or the command line
 (e.g. mpirun --hosts host0:96,host1:96 ...)
 c) if none of the above is set, the number of detected cores on the system
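
As a concrete (hypothetical) sketch of the machine file and command line cases above, assuming 
two 96-core hosts named host0 and host1:

shell$ cat myhosts
host0 slots=96
host1 slots=96
shell$ mpirun --hostfile myhosts ./my_app           # 192 slots => 192 MPI tasks
shell$ mpirun -np 96 --hostfile myhosts ./my_app    # an explicit -np takes precedence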

Cheers,

Gilles

On Tue, Sep 13, 2022 at 9:23 PM Lucas Chaloyard via users 
mailto:users@lists.open-mpi.org>> wrote:
Hello,

I'm working as a research intern in a lab where we're studying virtualization.
And I've been working with several benchmarks using OpenMPI 4.1.0 (ASKAP, GPAW 
and Incompact3d from Phrononix Test suite).

To briefly explain my experiments, I'm running those benchmarks on several 
virtual machines using different topologies.
During one experiment I've been comparing those two topologies :
- Topology1 : 96 vCPUs divided into 96 sockets containing 1 thread each
- Topology2 : 96 vCPUs divided into 48 sockets containing 2 threads each (i.e., using 
hyperthreading)

For the ASKAP Benchmark :
- While using Topology2, 2306 processes will be created by the application to 
do its work.
- While using Topology1, 4612 pro

Re: [OMPI users] Hardware topology influence

2022-09-13 Thread Jeff Squyres (jsquyres) via users
Let me add a little more color on what Gilles stated.

First, you should probably upgrade to the latest v4.1.x release: v4.1.4.  It 
has a bunch of bug fixes compared to v4.1.0.

Second, you should know that it is relatively uncommon to run HPC/MPI apps 
inside VMs because the virtualization infrastructure will -- by definition -- 
decrease your overall performance.  This is usually counter to the goal of 
writing/running HPC applications.  If you do run HPC/MPI applications in VMs, 
it is strongly recommended that you bind the cores in the VM to physical cores 
to attempt to minimize the performance loss.

By default, Open MPI maps MPI processes by core when deciding how many 
processes to place on each machine (and also deciding how to bind them).  For 
example, Open MPI looks at a machine and sees that it has N cores, and (by 
default) maps N MPI processes to that machine.  You can change Open MPI's 
defaults to map by hardware thread ("Hyperthread" in Intel parlance) instead 
of by core, but conventional knowledge is that math-heavy processes don't 
perform well with the limited resources of a single hardware thread, and 
benefit from the full resources of the core (this depends on your specific app, 
of course -- YMMV).  Intel's and AMD's hardware threads have gotten better over 
the years, but I think they still represent a division of resources in the 
core, and will likely still be performance-detrimental to at least some classes 
of HPC applications.  It's a surprisingly complicated topic.

In the v4.x series, note that you can use "mpirun --report-bindings ..." to see 
exactly where Open MPI thinks it has bound each process.  Note that this 
binding occurs before each MPI process starts; it's nothing that the 
application itself needs to do.

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of Gilles Gouaillardet 
via users 
Sent: Tuesday, September 13, 2022 9:07 AM
To: Open MPI Users 
Cc: Gilles Gouaillardet 
Subject: Re: [OMPI users] Hardware topology influence

Lucas,

the number of MPI tasks started by mpirun is either
 - explicitly passed via the command line (e.g. mpirun -np 2306 ...)
 - equals to the number of available slots, and this value is either
 a) retrieved from the resource manager (such as a SLURM allocation)
 b) explicitly set in a machine file (e.g. mpirun -machinefile 
 ...) or the command line
 (e.g. mpirun --hosts host0:96,host1:96 ...)
 c) if none of the above is set, the number of detected cores on the system

Cheers,

Gilles

On Tue, Sep 13, 2022 at 9:23 PM Lucas Chaloyard via users 
mailto:users@lists.open-mpi.org>> wrote:
Hello,

I'm working as a research intern in a lab where we're studying virtualization.
And I've been working with several benchmarks using OpenMPI 4.1.0 (ASKAP, GPAW 
and Incompact3d from Phrononix Test suite).

To briefly explain my experiments, I'm running those benchmarks on several 
virtual machines using different topologies.
During one experiment I've been comparing those two topologies :
- Topology1 : 96 vCPUs divided into 96 sockets containing 1 thread each
- Topology2 : 96 vCPUs divided into 48 sockets containing 2 threads each (i.e., using 
hyperthreading)

For the ASKAP Benchmark :
- While using Topology2, 2306 processes will be created by the application to 
do its work.
- While using Topology1, 4612 processes will be created by the application to 
do its work.
This is also happening when running GPAW and Incompact3d benchmarks.

What I've been wondering (and looking for) is: does OpenMPI take the topology into account 
and reduce the number of processes created to execute its work, in order to avoid the use 
of hyperthreading?
Or is that something done by the application itself?

I was looking at the source code, and I've been trying to find how and when the information 
about the MPI_COMM_WORLD communicator is filled in, to see if the 'num_procs' field depends on 
the topology, but I haven't had any luck so far.

Respectfully, Chaloyard Lucas.


Re: [OMPI users] Disabling barrier in MPI_Finalize

2022-09-09 Thread Jeff Squyres (jsquyres) via users
No, it does not, sorry.

What are you trying to do?

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of Mccall, Kurt E. 
(MSFC-EV41) via users 
Sent: Friday, September 9, 2022 2:30 PM
To: OpenMpi User List (users@lists.open-mpi.org) 
Cc: Mccall, Kurt E. (MSFC-EV41) 
Subject: [OMPI users] Disabling barrier in MPI_Finalize


Hi,



If a single process needs to exit, MPI_Finalize will pause at a barrier, 
possibly waiting for pending communications to complete.  Does OpenMPI have any 
means to disable this behavior, so the a single process can exit normally if 
the application calls for it?



Thanks,

Kurt


Re: [OMPI users] MPI with RoCE

2022-09-06 Thread Jeff Squyres (jsquyres) via users
You can think of RoCE as "IB over IP" -- RoCE is essentially the IB protocol 
over IP packets (which is different than IPoIB, which is emulating IP and TCP 
over the InfiniBand protocol).

You'll need to consult the docs for your Mellanox cards, but if you have 
Ethernet cards, you'll want to set them up the "normal" way (i.e., as Linux 
Ethernet interfaces), but then you'll also setup the RoCE drivers and 
interfaces.  If you compile Open MPI with UCX support, the UCX PML plugin in 
Open MPI should see those RoCE interfaces and automatically use the RoCE 
protocols for MPI message passing (and ignore the "normal" Ethernet interfaces).
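
As a rough sketch (the UCX install path, process count, and application name are placeholders), 
that usually amounts to something like:

shell$ ./configure --with-ucx=/path/to/ucx ...
shell$ make -j 8 install
shell$ ucx_info -d | grep Transport              # sanity-check that UCX sees the RoCE ports
shell$ mpirun -np 4 ./my_mpi_app                 # the UCX PML is selected automatically
shell$ mpirun --mca pml ucx -np 4 ./my_mpi_app   # or request it explicitly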

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of Harutyun Umrshatyan 
via users 
Sent: Tuesday, September 6, 2022 2:58 AM
To: Open MPI Users 
Cc: Harutyun Umrshatyan 
Subject: Re: [OMPI users] MPI with RoCE

Guys,

I actually could make it work!
I had to change Mellanox configuration from Ethernet to Infiniband and set up 
IPoIB.
That was in fact a good experience, but the issue is that not all my Mellanoxes 
can be configured to Infiniband.
My final destination is to make it work without Mellanox OFED on RoCE 
(Ethernet).

Thank you again guys!
Harutyun

On Mon, Sep 5, 2022 at 11:36 AM John Hearns via users 
mailto:users@lists.open-mpi.org>> wrote:
Stupid reply from me. You do know that Infiniband adapters operate without an 
IP address?
Yes, configuring IPOIB is a good idea - however Infiniband adapters are more 
than 'super ethernet adapters'
I would run the following utilities to investigate your Infiniband fabric

sminfo
ibhosts
ibdiagnet

Then on one of the compute nodes

ofed_info

ompi_info












On Sat, 3 Sept 2022 at 19:32, Harutyun Umrshatyan via users 
mailto:users@lists.open-mpi.org>> wrote:
Hi everyone

Could someone please share any experience using MPI with RoCE ?
I am trying to set up infiniband adapters (Mellanox cards for example) and run 
MPI applications with RoCE (Instead of TCP).
As I understand, there might be some environment requirements or restrictions 
like kernel version, installed drivers, etc.
I have tried a lot of versions of mpi libs and could not succeed. Would highly 
appreciate any hint or experience shared.

Best regards,
Harutyun Umrshatyan



Re: [OMPI users] ucx problems

2022-08-31 Thread Jeff Squyres (jsquyres) via users
Yes, that is the intended behavior: Open MPI basically only uses UCX for IB 
transports (and shared memory -- but only when also used with IB transports).

If IB can't be used, the UCX PML disqualifies itself.  This is by design, even 
though UCX can handle other transports (including TCP and shared memory).  The 
rationale for that is that the Open MPI community wanted direct control non-IB 
transports (e.g., shared memory).  Otherwise, a very large portion of the Open 
MPI code base and functionality would be subsumed by the UCX code base, and we 
would be reliant on the UCX community for core Open MPI functionality across 
several different transport types.

Hence, by default, UCX is basically used for IB and nothing else.

You can override this behavior by setting the opal_common_ucx_tls env variable 
to a comma-delimited list of UCX transports that the UCX PML will be allowed to 
use.  This MCA param defaults to:


rc_verbs,ud_verbs,rc_mlx5,dc_mlx5,ud_mlx5,cuda_ipc,rocm_ipc

(you'll need to ask the UCX community what each of those do/are)
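
For example, to additionally allow UCX's TCP and shared-memory transports (the transport names 
below belong to UCX, not Open MPI -- check "ucx_info -d" on your system for what is actually 
available):

shell$ export OMPI_MCA_opal_common_ucx_tls=tcp,posix,sysv,self,rc_verbs,ud_verbs,rc_mlx5,dc_mlx5,ud_mlx5,cuda_ipc,rocm_ipc
shell$ mpirun -np 4 ./my_mpi_app

or, equivalently, on the command line:

shell$ mpirun --mca opal_common_ucx_tls tcp,posix,sysv,self -np 4 ./my_mpi_app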

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of Bernstein, Noam CIV 
USN NRL (6393) Washington DC (USA) via users 
Sent: Thursday, August 25, 2022 12:27 PM
To: Tim Carlson 
Cc: Bernstein, Noam CIV USN NRL (6393) Washington DC (USA) 
; Open MPI Users 
Subject: Re: [OMPI users] ucx problems

Yeah, that appears to have been the issue - IB is entirely dead (it's a new 
machine, so maybe no subnet manager, or maybe a bad cable). I'll track that 
down, and follow up here if there's still an issue once the low level IB 
problem is fixed.

However, given that ucx says it supports shared memory transports, I'm a bit 
surprised that it cannot operate (at least in OpenMPI) without IB active (it's 
a single node job).  I added some print statements to common_ucx.c, and 
discovered that ucx knows about a few transports like posix and tcp, but 
OpenMPI never tries to use those, so it never finds a match.  Is that expected 
from how OpenMPI tries to use ucx?

thanks,
Noam

On Aug 25, 2022, at 12:10 PM, Tim Carlson 
mailto:timothy.carl...@pnnl.gov>> wrote:

And the output of

ibstat
ibhosts

Is what? Maybe no subnet manager running?



Re: [OMPI users] Oldest version of SLURM in use?

2022-08-17 Thread Jeff Squyres (jsquyres) via users
Fair point.

If there's anyone out there who's unwilling to reply publicly, please feel free 
to reply directly to me.

Specifically: we want to know whether it is going to be a problem if Open MPI v5.0.0 stops 
supporting SLURM versions older than 2017.11.

--
Jeff Squyres
jsquy...@cisco.com

From: Tim Carlson 
Sent: Wednesday, August 17, 2022 11:34 AM
To: Open MPI Users 
Cc: Jeff Squyres (jsquyres) 
Subject: Re: [OMPI users] Oldest version of SLURM in use?


To be honest, I only upgrade SLURM when there is a feature I absolutely have to 
have, or a big bug that needs to be fixed.  Like when GRES was introduced, and 
we started using GPUs.



Updating SLURM and then going to find all the binaries that have been running 
for years and need to be relinked (or binary edited) to a new PMI library is 
painful.



I’m guessing there are plenty of folks on the OpenMPI list who are less than 
willing to reply as they are certainly running versions of SLURM previous to 
20.11 that have been summarily yanked from schedmd’s download site due to the 
security flaw that was discovered.



All that being said, it is true that we still have ancient project clusters 
tucked away that have limited users and are fairly static in terms of the 
software stack.  I’d be fibbing if I said all of my SLURM installations are 
from this decade.



Tim

--

Tim Carlson

Team Lead – HPC/ML/Q, Research Computing

Computing & Information Technology Directorate

Pacific Northwest National Laboratory | www.pnnl.gov<http://www.pnnl.gov/>



509.371.6435 | t...@pnnl.gov<mailto:t...@pnnl.gov>







From: users  on behalf of "Jeff Squyres 
(jsquyres) via users" 
Reply-To: Open MPI Users 
Date: Wednesday, August 17, 2022 at 8:18 AM
To: Open MPI Users 
Cc: "Jeff Squyres (jsquyres)" 
Subject: Re: [OMPI users] Oldest version of SLURM in use?






These are great data points!



I'd love to hear from others, too.



--
Jeff Squyres
jsquy...@cisco.com



From: users  on behalf of Andrew Reid via 
users 
Sent: Tuesday, August 16, 2022 10:21 AM
To: Open MPI Users 
Cc: Andrew Reid 
Subject: Re: [OMPI users] Oldest version of SLURM in use?



Sorry, 2022-2016=6 years old.  This is why I let the computers do the 
arithmetic



On Tue, Aug 16, 2022 at 10:19 AM Andrew Reid 
mailto:andrew.ce.r...@gmail.com>> wrote:

Wondering if I should reply from an alt for this, but in my case, it's not 
so much "less well-funded" as "less well-organized".



I have some small clusters that, for convenience, run the Debian-packaged 
version of SLURM. Debian 9 reached the end of LTS on June 30, 2022, and packaged 
version 16.05 of SLURM, which we were running on some systems right up until 
that point, when it was eight years old.



More generally, the Debian-packaged version tends to be a year or two behind at 
distro-release time, and Debian LTS lifetimes can be five years, so you can get 
into a window late in the distro lifecycle where things are pretty old.



But, to be clear, my expectation for support, which was the actual question, is 
pretty much zero. I'm juggling my time and tasks with my eyes open, and if I 
find myself in a corner where some software doesn't run because the version 
mismatch between OpenMPI and SLURM is too big, my first line of attack will be 
to do the required upgrades -- I'm pretty unlikely to look for support. Also, 
there's a selection effect, usually the *reason* the cluster has not been 
upgraded is that users want to keep running their legacy software on it, so as 
a practical matter, I do not often find myself in the version-mismatch corner.



Pardon my rambling, the upshot is, some lazy/disorganized people rely on 
third-party packagers, and do get pretty far behind.



On Tue, Aug 16, 2022 at 9:54 AM Jeff Squyres (jsquyres) via users 
mailto:users@lists.open-mpi.org>> wrote:

I have a curiosity question for the Open MPI user community: what version of 
SLURM are you using?



I ask because we're honestly curious about what the expectations are regarding 
new versions of Open MPI supporting older versions of SLURM.



I believe that SchedMD's policy is that they support up to 5-year old versions 
of SLURM, which is perfectly reasonable.  But then again, there's lots of 
people who don't have support contracts with SchedMD, and therefore don't want 
or need support from SchedMD.  Indeed, in well-funded institutions, HPC 
clusters tend to have a lifetime of 2-4 years before they are refreshed, which 
fits nicely within that 5-year window.  But in less well-funded institutions, 
HPC clusters could have lifetimes longer than 5 years.



Do any of you run versions of SLURM that are more than 5 years old?



--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>




Re: [OMPI users] Oldest version of SLURM in use?

2022-08-17 Thread Jeff Squyres (jsquyres) via users
These are great data points!

I'd love to hear from others, too.

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of Andrew Reid via 
users 
Sent: Tuesday, August 16, 2022 10:21 AM
To: Open MPI Users 
Cc: Andrew Reid 
Subject: Re: [OMPI users] Oldest version of SLURM in use?

Sorry, 2022-2016=6 years old.  This is why I let the computers do the 
arithmetic

On Tue, Aug 16, 2022 at 10:19 AM Andrew Reid 
mailto:andrew.ce.r...@gmail.com>> wrote:
Wondering if I should reply from an alt for this, but in my case, it's not 
so much "less well-funded" as "less well-organized".

I have some small clusters that, for convenience, run the Debian-packaged 
version of SLURM. Debian 9 reached the end of LTS on June 30, 2022, and packaged 
version 16.05 of SLURM, which we were running on some systems right up until 
that point, when it was eight years old.

More generally, the Debian-packaged version tends to be a year or two behind at 
distro-release time, and Debian LTS lifetimes can be five years, so you can get 
into a window late in the distro lifecycle where things are pretty old.

But, to be clear, my expectation for support, which was the actual question, is 
pretty much zero. I'm juggling my time and tasks with my eyes open, and if I 
find myself in a corner where some software doesn't run because the version 
mismatch between OpenMPI and SLURM is too big, my first line of attack will be 
to do the required upgrades -- I'm pretty unlikely to look for support. Also, 
there's a selection effect, usually the *reason* the cluster has not been 
upgraded is that users want to keep running their legacy software on it, so as 
a practical matter, I do not often find myself in the version-mismatch corner.

Pardon my rambling, the upshot is, some lazy/disorganized people rely on 
third-party packagers, and do get pretty far behind.

On Tue, Aug 16, 2022 at 9:54 AM Jeff Squyres (jsquyres) via users 
mailto:users@lists.open-mpi.org>> wrote:
I have a curiosity question for the Open MPI user community: what version of 
SLURM are you using?

I ask because we're honestly curious about what the expectations are regarding 
new versions of Open MPI supporting older versions of SLURM.

I believe that SchedMD's policy is that they support up to 5-year old versions 
of SLURM, which is perfectly reasonable.  But then again, there's lots of 
people who don't have support contracts with SchedMD, and therefore don't want 
or need support from SchedMD.  Indeed, in well-funded institutions, HPC 
clusters tend to have a lifetime of 2-4 years before they are refreshed, which 
fits nicely within that 5-year window.  But in less well-funded institutions, 
HPC clusters could have lifetimes longer than 5 years.

Do any of you run versions of SLURM that are more than 5 years old?

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>


--
Andrew Reid / andrew.ce.r...@gmail.com<mailto:andrew.ce.r...@gmail.com>


--
Andrew Reid / andrew.ce.r...@gmail.com<mailto:andrew.ce.r...@gmail.com>


[OMPI users] Oldest version of SLURM in use?

2022-08-16 Thread Jeff Squyres (jsquyres) via users
I have a curiosity question for the Open MPI user community: what version of 
SLURM are you using?

I ask because we're honestly curious about what the expectations are regarding 
new versions of Open MPI supporting older versions of SLURM.

I believe that SchedMD's policy is that they support up to 5-year old versions 
of SLURM, which is perfectly reasonable.  But then again, there's lots of 
people who don't have support contracts with SchedMD, and therefore don't want 
or need support from SchedMD.  Indeed, in well-funded institutions, HPC 
clusters tend to have a lifetime of 2-4 years before they are refreshed, which 
fits nicely within that 5-year window.  But in less well-funded institutions, 
HPC clusters could have lifetimes longer than 5 years.

Do any of you run versions of SLURM that are more than 5 years old?

--
Jeff Squyres
jsquy...@cisco.com


Re: [OMPI users] RUNPATH vs. RPATH

2022-08-11 Thread Jeff Squyres (jsquyres) via users
Thanks for the feedback!  I made a follow-up PR 
https://github.com/open-mpi/ompi/pull/10652 incorporating your feedback and 
feedback from Harmen Stoppels.

I would have @mentioned you in the PR, but it doesn't appear that you have a 
Github ID (or, I couldn't find it, at least).

--
Jeff Squyres
jsquy...@cisco.com


From: Reuti
Sent: Tuesday, August 9, 2022 12:03 PM
To: Open MPI Users
Cc: Jeff Squyres (jsquyres); zuelc...@staff.uni-marburg.de
Subject: Re: [OMPI users] RUNPATH vs. RPATH

Hi Jeff,

> Am 09.08.2022 um 16:17 schrieb Jeff Squyres (jsquyres) via users 
> :
>
> Just to follow up on this thread...
>
> Reuti: I merged the PR on to the main docs branch.  They're now live -- we 
> changed the text:
>• here: 
> https://docs.open-mpi.org/en/main/installing-open-mpi/configure-cli-options/installation.html

On this page I read:

Using --disable-wrapper-rpath will disable both “runpath” and “rpath” behavior 
in the wrapper compilers.

I would phrase it:

Using --disable-wrapper-rpath in addition will disable both “runpath” and 
“rpath” behavior in the wrapper compilers.

(otherwise I get a "configure: error: --enable-wrapper-runpath cannot be 
selected with --disable-wrapper-rpath")


>• and here: 
> https://docs.open-mpi.org/en/main/installing-open-mpi/configure-cli-options/rpath-and-runpath.html

The last command reads `shell$ ./configure LDFLAGS=--enable-new-dtags ...`. But 
the LDFLAGS will be given to the compiler wrapper, hence it seems to need 
-Wl,--enable-new-dtags, which is what I used initially to avoid:

configure:6591: checking whether the C compiler works
configure:6613: gcc   --enable-new-dtags conftest.c  >&5
cc1: error: unknown pass new-dtags specified in -fenable


> Here's the corresponding PR to update the v5.0.x docs: 
> https://github.com/open-mpi/ompi/pull/10640
>
> Specifically, the answer to your original question is twofold:
>• It's complicated. 🙂
>• It looks like you did the Right Thing for your environment, but you 
> might want to check the output of "readelf -d ..." to be sure.
> Does that additional text help explain things?

Yes, thx a lot for the clarification and update of the documentation.

-- Reuti


> --
> Jeff Squyres
> jsquy...@cisco.com
> From: Jeff Squyres (jsquyres) 
> Sent: Saturday, August 6, 2022 9:36 AM
> To: Open MPI Users 
> Subject: Re: [OMPI users] RUNPATH vs. RPATH
>
> Reuti --
>
> See my disclaimers on other posts about apologies for taking so long to reply!
>
> This code was written forever ago; I had to dig through it a bit, read the 
> comments and commit messages, and try to remember why it was done this way.  
> What I thought would be a 5-minute search turned into a few hours of digging 
> through code, multiple conversations with Brian, and one pull request (so 
> far).  We don't have a definitive answer yet, but we think we're getting 
> closer.
>
> The short version is that what you did appears to be correct:
>
> ./configure LDFLAGS=-Wl,--enable-new-dtags ...
>
> The longer answer is that whenever you think you understand the shared 
> library and run-time linkers, you inevitably find out that you don't.  The 
> complicated cases come from the fact that the handling of rpath and runpath 
> can be different on different platforms, and there are subtle differences in 
> their behavior (beyond the initial "search before or after LD_LIBRARY_PATH, 
> such as the handling of primary and secondary/transitive dependencies).
>
> The pull request I have so far is https://github.com/open-mpi/ompi/pull/10624 
> (rendered here: 
> https://ompi--10624.org.readthedocs.build/en/10624/installing-open-mpi/configure-cli-options/installation.html).
>   We're not 100% confident in that text yet, but I think we're close to at 
> least documenting what the current behavior is.  Once we nail that down, we 
> can talk about whether we want to change that behavior.
>
>
> 
> From: users  on behalf of Reuti via users 
> 
> Sent: Friday, July 22, 2022 9:48 AM
> To: Open MPI Users
> Cc: Reuti; zuelc...@staff.uni-marburg.de
> Subject: [OMPI users] RUNPATH vs. RPATH
>
> Hi,
>
> using Open MPI 4.1.4
>
> $ mpicc --show …
>
> tells me, that the command line contains "… -Wl,--enable-new-dtags …" so that 
> even older linkers will include RUNPATH instead of RPATH in the created 
> dynamic binary. On the other hand, Open MPI itself doesn't use this option 
> for its own libraries:
>
> ./liboshmem.so.40.30.2
> ./libmpi_mpifh.so.40.30.0
> ./libmpi.so.40.30.4
> ./libmpi_usempi_ignore_tkr.so.40.30.0
> ./libopen-rte.so.40.30.2
>
> Is this inten

Re: [OMPI users] RUNPATH vs. RPATH

2022-08-10 Thread Jeff Squyres (jsquyres) via users
Reuti -- thanks for the comments+fix about missing "-Wl," (oops!).  In addition 
to yours, some more came in on https://github.com/open-mpi/ompi/pull/10624 
after it was merged.  I'll make a follow-on PR with these suggestions.

--
Jeff Squyres
jsquy...@cisco.com


From: Reuti
Sent: Tuesday, August 9, 2022 12:03 PM
To: Open MPI Users
Cc: Jeff Squyres (jsquyres); zuelc...@staff.uni-marburg.de
Subject: Re: [OMPI users] RUNPATH vs. RPATH

Hi Jeff,

> Am 09.08.2022 um 16:17 schrieb Jeff Squyres (jsquyres) via users 
> :
>
> Just to follow up on this thread...
>
> Reuti: I merged the PR on to the main docs branch.  They're now live -- we 
> changed the text:
>• here: 
> https://docs.open-mpi.org/en/main/installing-open-mpi/configure-cli-options/installation.html

On this page I read:

Using --disable-wrapper-rpath will disable both “runpath” and “rpath” behavior 
in the wrapper compilers.

I would phrase it:

Using --disable-wrapper-rpath in addition will disable both “runpath” and 
“rpath” behavior in the wrapper compilers.

(otherwise I get a "configure: error: --enable-wrapper-runpath cannot be 
selected with --disable-wrapper-rpath")


>• and here: 
> https://docs.open-mpi.org/en/main/installing-open-mpi/configure-cli-options/rpath-and-runpath.html

The last command reads `shell$ ./configure LDFLAGS=--enable-new-dtags ...`. But 
the LDFLAGS will be given to the compiler wrapper, hence it seems to need 
-Wl,--enable-new-dtags, which is what I used initially to avoid:

configure:6591: checking whether the C compiler works
configure:6613: gcc   --enable-new-dtags conftest.c  >&5
cc1: error: unknown pass new-dtags specified in -fenable


> Here's the corresponding PR to update the v5.0.x docs: 
> https://github.com/open-mpi/ompi/pull/10640
>
> Specifically, the answer to your original question is twofold:
>• It's complicated. 🙂
>• It looks like you did the Right Thing for your environment, but you 
> might want to check the output of "readelf -d ..." to be sure.
> Does that additional text help explain things?

Yes, thx a lot for the clarification and update of the documentation.

-- Reuti


> --
> Jeff Squyres
> jsquy...@cisco.com
> From: Jeff Squyres (jsquyres) 
> Sent: Saturday, August 6, 2022 9:36 AM
> To: Open MPI Users 
> Subject: Re: [OMPI users] RUNPATH vs. RPATH
>
> Reuti --
>
> See my disclaimers on other posts about apologies for taking so long to reply!
>
> This code was written forever ago; I had to dig through it a bit, read the 
> comments and commit messages, and try to remember why it was done this way.  
> What I thought would be a 5-minute search turned into a few hours of digging 
> through code, multiple conversations with Brian, and one pull request (so 
> far).  We don't have a definitive answer yet, but we think we're getting 
> closer.
>
> The short version is that what you did appears to be correct:
>
> ./configure LDFLAGS=-Wl,--enable-new-dtags ...
>
> The longer answer is that whenever you think you understand the shared 
> library and run-time linkers, you inevitably find out that you don't.  The 
> complicated cases come from the fact that the handling of rpath and runpath 
> can be different on different platforms, and there are subtle differences in 
> their behavior (beyond the initial "search before or after LD_LIBRARY_PATH, 
> such as the handling of primary and secondary/transitive dependencies).
>
> The pull request I have so far is https://github.com/open-mpi/ompi/pull/10624 
> (rendered here: 
> https://ompi--10624.org.readthedocs.build/en/10624/installing-open-mpi/configure-cli-options/installation.html).
>   We're not 100% confident in that text yet, but I think we're close to at 
> least documenting what the current behavior is.  Once we nail that down, we 
> can talk about whether we want to change that behavior.
>
>
> 
> From: users  on behalf of Reuti via users 
> 
> Sent: Friday, July 22, 2022 9:48 AM
> To: Open MPI Users
> Cc: Reuti; zuelc...@staff.uni-marburg.de
> Subject: [OMPI users] RUNPATH vs. RPATH
>
> Hi,
>
> using Open MPI 4.1.4
>
> $ mpicc --show …
>
> tells me, that the command line contains "… -Wl,--enable-new-dtags …" so that 
> even older linkers will include RUNPATH instead of RPATH in the created 
> dynamic binary. On the other hand, Open MPI itself doesn't use this option 
> for its own libraries:
>
> ./liboshmem.so.40.30.2
> ./libmpi_mpifh.so.40.30.0
> ./libmpi.so.40.30.4
> ./libmpi_usempi_ignore_tkr.so.40.30.0
> ./libopen-rte.so.40.30.2
>
> Is this intended?
>
> Setting LD_LIBRARY_PATH will instruct t

[OMPI users] Open MPI Java MPI bindings

2022-08-09 Thread Jeff Squyres (jsquyres) via users
During a planning meeting for Open MPI v5.0.0 today, the question came up: is 
anyone using the Open MPI Java bindings?

These bindings are not official MPI Forum bindings -- they are an Open 
MPI-specific extension.  They were added a few years ago as a result of a 
research project.

We ask this question because we're wondering if it's worthwhile to bring these 
bindings forward to the v5.0.x series, or whether we should remove them from 
v5.0.x, and just leave them available back in the v4.0.x and v4.1.x series.

Please reply here to this list if you are using the Open MPI Java bindings, or 
know of anyone who is using them.

Thank you!

--
Jeff Squyres
jsquy...@cisco.com


Re: [OMPI users] RUNPATH vs. RPATH

2022-08-09 Thread Jeff Squyres (jsquyres) via users
Just to follow up on this thread...

Reuti: I merged the PR on to the main docs branch.  They're now live -- we 
changed the text:

  *   here: 
https://docs.open-mpi.org/en/main/installing-open-mpi/configure-cli-options/installation.html
  *   and here: 
https://docs.open-mpi.org/en/main/installing-open-mpi/configure-cli-options/rpath-and-runpath.html

Here's the corresponding PR to update the v5.0.x docs: 
https://github.com/open-mpi/ompi/pull/10640

Specifically, the answer to your original question is twofold:

  1.  It's complicated. 🙂
  2.  It looks like you did the Right Thing for your environment, but you might 
want to check the output of "readelf -d ..." to be sure.

Does that additional text help explain things?
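
To do that "readelf -d ..." check, for example (the library path is a placeholder for whatever 
you installed):

shell$ readelf -d /path/to/install/lib/libmpi.so | grep -E 'RPATH|RUNPATH'

A RUNPATH entry in the output means the new dtags took effect; an RPATH entry means the old 
behavior is still in place.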

--
Jeff Squyres
jsquy...@cisco.com
________
From: Jeff Squyres (jsquyres) 
Sent: Saturday, August 6, 2022 9:36 AM
To: Open MPI Users 
Subject: Re: [OMPI users] RUNPATH vs. RPATH

Reuti --

See my disclaimers on other posts about apologies for taking so long to reply!

This code was written forever ago; I had to dig through it a bit, read the 
comments and commit messages, and try to remember why it was done this way.  
What I thought would be a 5-minute search turned into a few hours of digging 
through code, multiple conversations with Brian, and one pull request (so far). 
 We don't have a definitive answer yet, but we think we're getting closer.

The short version is that what you did appears to be correct:

./configure LDFLAGS=-Wl,--enable-new-dtags ...

The longer answer is that whenever you think you understand the shared library 
and run-time linkers, you inevitably find out that you don't.  The complicated 
cases come from the fact that the handling of rpath and runpath can be 
different on different platforms, and there are subtle differences in their 
behavior (beyond the initial "search before or after LD_LIBRARY_PATH, such as 
the handling of primary and secondary/transitive dependencies).

The pull request I have so far is https://github.com/open-mpi/ompi/pull/10624 
(rendered here: 
https://ompi--10624.org.readthedocs.build/en/10624/installing-open-mpi/configure-cli-options/installation.html).
  We're not 100% confident in that text yet, but I think we're close to at 
least documenting what the current behavior is.  Once we nail that down, we can 
talk about whether we want to change that behavior.



From: users  on behalf of Reuti via users 

Sent: Friday, July 22, 2022 9:48 AM
To: Open MPI Users
Cc: Reuti; zuelc...@staff.uni-marburg.de
Subject: [OMPI users] RUNPATH vs. RPATH

Hi,

using Open MPI 4.1.4

$ mpicc --show …

tells me, that the command line contains "… -Wl,--enable-new-dtags …" so that 
even older linkers will include RUNPATH instead of RPATH in the created dynamic 
binary. On the other hand, Open MPI itself doesn't use this option for its own 
libraries:

./liboshmem.so.40.30.2
./libmpi_mpifh.so.40.30.0
./libmpi.so.40.30.4
./libmpi_usempi_ignore_tkr.so.40.30.0
./libopen-rte.so.40.30.2

Is this intended?

Setting LD_LIBRARY_PATH will instruct the created binary to look for libraries 
first in that location and resolve it, but the loaded library in turn will then 
use RPATH inside itself first to load additional libraries.

(I compile Open MPI in my home directory and move it then to the final 
destination for the group; setting OPAL_PREFIX of course. I see a mix of 
library locations when I run the created binary on my own with `ldd`.)

Looks like I can get the intended behavior while configuring Open MPI on this 
(older) system:

$ ./configure …  LDFLAGS=-Wl,--enable-new-dtags

-- Reuti



Re: [OMPI users] Problem with OpenMPI as Third pary library

2022-08-09 Thread Jeff Squyres (jsquyres) via users
I can't see the image that you sent; it seems to be broken.

But I think you're asking about this: 
https://www.open-mpi.org/faq/?category=building#installdirs

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of Sebastian Gutierrez 
via users 
Sent: Tuesday, August 9, 2022 9:52 AM
To: users@lists.open-mpi.org 
Cc: Sebastian Gutierrez 
Subject: [OMPI users] Problem with OpenMPI as Third pary library


Good morning Open-MPI organization,



I have been trying to distribute your program as a third-party library in my 
CMake project, because I do not want my Linux users to have to install OpenMPI 
on their own; I just want them to use my final product, which uses OpenMPI 
dependencies. When I execute make install for my project on my own machine, it 
works perfectly, but the problem appears when I move the executable to another 
machine that does not have OpenMPI installed (see image attached): when I run my 
executable, an error message appears because it is trying to find one of your 
files at the absolute path in which OpenMPI was installed. In fact this file 
does exist, but not at that path; I would like to make everything relative to my 
project's origin folder. I used the CMake macro ExternalProject, so that OpenMPI 
will be installed inside my project as an external dependency. Here is a piece 
of the config that I used to install OpenMPI:

ExternalProject_Add(
    openmpi_external

    URL https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.2.tar.gz
    URL_MD5 2f86dc37b7a00b96ca964637ee68826e

    BUILD_IN_SOURCE 1
    SOURCE_DIR ${CMAKE_BINARY_DIR}/toyapp/external/openmpi

    CONFIGURE_COMMAND
        ${CMAKE_BINARY_DIR}/toyapp/external/openmpi/configure
        --prefix=${CMAKE_BINARY_DIR}/toyapp/external/openmpi

    BUILD_COMMAND make all

    INSTALL_COMMAND make install
)

I honestly have no idea to solve this. I wrote this email in hopes that you 
kindly help me out with this.

I am looking forward to your answer.

Best regards,

Sebastian



Re: [OMPI users] RUNPATH vs. RPATH

2022-08-06 Thread Jeff Squyres (jsquyres) via users
Reuti --

See my disclaimers on other posts about apologies for taking so long to reply!

This code was written forever ago; I had to dig through it a bit, read the 
comments and commit messages, and try to remember why it was done this way.  
What I thought would be a 5-minute search turned into a few hours of digging 
through code, multiple conversations with Brian, and one pull request (so far). 
 We don't have a definitive answer yet, but we think we're getting closer.

The short version is that what you did appears to be correct:

./configure LDFLAGS=-Wl,--enable-new-dtags ...

The longer answer is that whenever you think you understand the shared library 
and run-time linkers, you inevitably find out that you don't.  The complicated 
cases come from the fact that the handling of rpath and runpath can be 
different on different platforms, and there are subtle differences in their 
behavior (beyond the initial "search before or after LD_LIBRARY_PATH, such as 
the handling of primary and secondary/transitive dependencies).

The pull request I have so far is https://github.com/open-mpi/ompi/pull/10624 
(rendered here: 
https://ompi--10624.org.readthedocs.build/en/10624/installing-open-mpi/configure-cli-options/installation.html).
  We're not 100% confident in that text yet, but I think we're close to at 
least documenting what the current behavior is.  Once we nail that down, we can 
talk about whether we want to change that behavior.



From: users  on behalf of Reuti via users 

Sent: Friday, July 22, 2022 9:48 AM
To: Open MPI Users
Cc: Reuti; zuelc...@staff.uni-marburg.de
Subject: [OMPI users] RUNPATH vs. RPATH

Hi,

using Open MPI 4.1.4

$ mpicc --show …

tells me, that the command line contains "… -Wl,--enable-new-dtags …" so that 
even older linkers will include RUNPATH instead of RPATH in the created dynamic 
binary. On the other hand, Open MPI itself doesn't use this option for its own 
libraries:

./liboshmem.so.40.30.2
./libmpi_mpifh.so.40.30.0
./libmpi.so.40.30.4
./libmpi_usempi_ignore_tkr.so.40.30.0
./libopen-rte.so.40.30.2

Is this intended?

Setting LD_LIBRARY_PATH will instruct the created binary to look for libraries 
first in that location and resolve it, but the loaded library in turn will then 
use RPATH inside itself first to load additional libraries.

(I compile Open MPI in my home directory and move it then to the final 
destination for the group; setting OPAL_PREFIX of course. I see a mix of 
library locations when I run the created binary on my own with `ldd`.)

Looks like I can get the intended behavior while configuring Open MPI on this 
(older) system:

$ ./configure …  LDFLAGS=-Wl,--enable-new-dtags

-- Reuti



Re: [OMPI users] Multiple IPs on network interface

2022-07-07 Thread Jeff Squyres (jsquyres) via users
Can you send the full output of "ifconfig" (or "ip addr") from one of your 
compute nodes?

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of George Johnson via 
users 
Sent: Monday, July 4, 2022 11:06 AM
To: users@lists.open-mpi.org
Cc: George Johnson
Subject: [OMPI users] Multiple IPs on network interface

Hi,

I am aware that section 13 in the FAQ says that MPI "in general" won't work with 
a network interface that has two IPs. However, I've had slurm running python 
and C programs on a cluster of 21 nodes for a while and haven't had any issues 
until I tried running some OSU micro benchmarks. This resulted in this 
error. I'm not entirely sure why each node has two IPs; I believe it is related 
to netbooting, as they are all netbooted.

This is the slurm script I'm using to start the job. The -mca section I added to 
fix the problem doesn't do anything, however, as both of the IPs are on the eth0 
interface.

Is there anything I can do to run these benchmarks?

Let me know what other details I need to provide as I'm not sure where to start.

Thanks,

George Johnson


Re: [OMPI users] Intercommunicator issue (any standard about communicator?)

2022-06-24 Thread Jeff Squyres (jsquyres) via users
Open MPI and MPICH are completely unrelated -- we're entirely different code 
bases (note that Intel MPI is derived from MPICH).

Case in point is what Gilles cited: Open MPI chose to implement MPI_Comm 
handles as pointers, but MPICH chose to implement MPI_Comm handles as integers. 
 Hence, you can't really compare the MPI_Comm values from Open MPI vs. MPI_Comm 
values from MPICH/Intel MPI -- they're fundamentally representing different 
things.

The MPI standard doesn't say anything about the values of MPI handles (e.g., 
MPI_Comm handles).  They're just a value that a user program can pass around.  
When that handle is given to the MPI implementation (e.g., by passing it to 
MPI_Send() or other MPI API), the only rule is that the MPI implementation has 
to be able to map that handle into whatever back end data structures are 
relevant to implement the concept of an MPI communicator.  Hence: the handle is 
meaningless to the application -- it's just an opaque value that the user 
program can pass around.

User applications *can* compare it to the value for MPI_COMM_NULL, but that's 
about it.
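
As a small illustrative C sketch of that point (nothing here is Open MPI-specific; these are 
just the portable handle operations -- equality against MPI_COMM_NULL and MPI_Comm_compare):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    MPI_Comm dup;
    MPI_Comm_dup(MPI_COMM_WORLD, &dup);

    /* Comparing a handle against the null handle is allowed... */
    if (dup == MPI_COMM_NULL) {
        printf("dup is the null communicator\n");
    }

    /* ...as is asking MPI itself to compare two communicators. */
    int result;
    MPI_Comm_compare(MPI_COMM_WORLD, dup, &result);
    if (result == MPI_CONGRUENT) {
        printf("same group of processes, different communication context\n");
    }

    MPI_Comm_free(&dup);
    MPI_Finalize();
    return 0;
}

The raw value of the handle (a pointer in Open MPI, an integer in MPICH) never needs to be 
inspected.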

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Guillaume De Nayer 
via users 
Sent: Friday, June 24, 2022 8:29 AM
To: users@lists.open-mpi.org
Cc: Guillaume De Nayer
Subject: Re: [OMPI users] Intercommunicator issue (any standard about 
communicator?)

Hi Gilles,

I'm using both openmpi and intel mpi. I have with both problem with the
communicators. Therefore, I tried to get some infos about them.

Thx a lot for your help.
Have a nice day

On 06/24/2022 02:14 PM, Gilles Gouaillardet via users wrote:
> Guillaume,
>
> MPI_Comm is an opaque handler that should not be interpreted by an end user.
>
> Open MPI chose to implement is as an opaque pointer, and MPICH chose to
> implement it as a 32 bits unsigned integer.
> The 4400 value strongly suggests you are using MPICH and you are
> hence posting to the wrong mailing list
>
>
> Cheers,
>
> Gilles
>
> On Fri, Jun 24, 2022 at 9:06 PM Guillaume De Nayer via users
> mailto:users@lists.open-mpi.org>> wrote:
>
> Hi Gilles,
>
> MPI_COMM_WORLD is positive (4400).
>
> In a short code I wrote I have something like that:
>
> MPI_Comm_dup(MPI_COMM_WORLD, &world);
> cout << "intra-communicator: " << "world" << "---" << hex << world
> << endl;
>
> It returns "8406" (in hex).
>
> later I have:
>
> MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, world, &interClient);
> cout << "intercommunicator interClient=" << interClient << endl;
>
> After connection from a third party client it returns "c403" (in
> hex).
>
> Both 8406 and c403 are negative integer in dec.
>
> I don't know if it is "normal". Therefore I'm looking about rules on the
> communicators, intercommunicators.
>
> Regards,
> Guillaume
>
>
> On 06/24/2022 11:56 AM, Gilles Gouaillardet via users wrote:
> > Guillaume,
> >
> > what do you mean by (the intercommunicators are all negative"?
> >
> >
> > Cheers,
> >
> > Gilles
> >
> > On Fri, Jun 24, 2022 at 4:23 PM Guillaume De Nayer via users
> > mailto:users@lists.open-mpi.org>
> >>
> wrote:
> >
> > Hi,
> >
> > I am new on this list. Let me introduce myself shortly: I am a
> > researcher in fluid mechanics. In this context I am using
> softwares
> > related on MPI.
> >
> > I am facing a problem:
> > - 3 programs forms a computational framework. Soft1 is a coupling
> > program, i.e., it opens an MPI port at the beginning. Soft2
> and Soft3
> > are clients, which connect to the coupling program using
> > MPI_Comm_connect.
> > - After the start and the connections of Soft2 and Soft3 with
> Soft1, it
> > hangs.
> >
> > I started to debug this issue and as usual I found another
> issue (or
> > perhaps it is not an issue):
> > - The intercommunicators I get between Soft1-Soft2 and
> Soft1-Soft3 are
> > all negative (running on CentOS 7 with infiniband Mellanox
> OFED driver).
> > - Is there some standard about communicator? I don't find anything
> > about
> > this topic.
> > - What is a valid communicator, intercommunicator?
> >
> > thx a lot
> > Regards
> > Guillaume
> >
>
>




Re: [OMPI users] Intercommunicator issue (any standard about communicator?)

2022-06-24 Thread Jeff Squyres (jsquyres) via users
Guillaume --

There is an MPI Standard document that you can obtain from mpi-forum.org.  Open 
MPI v4.x adheres to MPI version 3.1 (the latest version of the MPI standard is 
v4.0, but that is unrelated to Open MPI's version number).

Frankly, Open MPI's support of the dynamic API functionality 
(connect/accept/etc.) has always been a bit shaky; they have been tested to 
work in very, very specific conditions, and not made super robust to work in 
many different / generalized cases.  Is there a chance you can orient your app 
to not use the MPI dynamic APIs?

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Gilles Gouaillardet 
via users 
Sent: Friday, June 24, 2022 5:56 AM
To: Open MPI Users
Cc: Gilles Gouaillardet
Subject: Re: [OMPI users] Intercommunicator issue (any standard about 
communicator?)

Guillaume,

what do you mean by (the intercommunicators are all negative"?


Cheers,

Gilles

On Fri, Jun 24, 2022 at 4:23 PM Guillaume De Nayer via users 
mailto:users@lists.open-mpi.org>> wrote:
Hi,

I am new on this list. Let me introduce myself shortly: I am a
researcher in fluid mechanics. In this context I am using softwares
related on MPI.

I am facing a problem:
- 3 programs forms a computational framework. Soft1 is a coupling
program, i.e., it opens an MPI port at the beginning. Soft2 and Soft3
are clients, which connect to the coupling program using MPI_Comm_connect.
- After the start and the connections of Soft2 and Soft3 with Soft1, it
hangs.

I started to debug this issue and as usual I found another issue (or
perhaps it is not an issue):
- The intercommunicators I get between Soft1-Soft2 and Soft1-Soft3 are
all negative (running on CentOS 7 with infiniband Mellanox OFED driver).
- Is there some standard about communicator? I don't find anything about
this topic.
- What is a valid communicator, intercommunicator?

thx a lot
Regards
Guillaume



Re: [OMPI users] OpenMPI and names of the nodes in a cluster

2022-06-24 Thread Jeff Squyres (jsquyres) via users
I think the files suggested by Gilles are more about the underlying call to get 
the hostname; those won't be problematic.

The regex Open MPI modules are where Open MPI is running into a problem with 
your hostnames (i.e., your hostnames don't fit into Open MPI's expectations of 
the format of the hostname).  I'm surprised that using the naive module 
(instead of the fwd module) doesn't solve your problem.  ...oh shoot, I see 
why.  It's because I had a typo in what I suggested to you.

Please try:  mpirun --mca regx naive ...

(i.e., "regx", not "regex")

--
Jeff Squyres
jsquy...@cisco.com


From: Patrick Begou 
Sent: Tuesday, June 21, 2022 12:10 PM
To: Jeff Squyres (jsquyres); Open MPI Users
Subject: Re: [OMPI users] OpenMPI and names of the nodes in a cluster

Hi Jeff,

Unfortunately the workaround with "--mca regex naive" does not change the 
behaviour. I'm going to investigate OpenMPI sources files as suggested by 
Gilles.

Patrick

Le 16/06/2022 à 17:43, Jeff Squyres (jsquyres) a écrit :

Ah; this is a slightly different error than what Gilles was guessing from your 
prior description.  This is what you're running in to: 
https://github.com/open-mpi/ompi/blob/v4.0.x/orte/mca/regx/fwd/regx_fwd.c#L130-L134

Try running with:

mpirun --mca regex naive ...

Specifically: the "fwd" regex component is selected by default, but it has 
certain expectations about the format of hostnames.  Try using the "naive" 
regex component, instead.

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>


From: Patrick Begou 
<mailto:patrick.be...@univ-grenoble-alpes.fr>
Sent: Thursday, June 16, 2022 9:48 AM
To: Jeff Squyres (jsquyres); Open MPI Users
Subject: Re: [OMPI users] OpenMPI and names of the nodes in a cluster

Hi  Gilles and Jeff,

@Gilles I will have a look at these files, thanks.

@Jeff this is the error message (screen dump attached) and of course the node 
names do not agree with the standard.

Patrick


On 16/06/2022 at 14:30, Jeff Squyres (jsquyres) wrote:

What exactly is the error that is occurring?

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>


From: users 
<mailto:users-boun...@lists.open-mpi.org> on behalf of Patrick Begou via users 
<mailto:users@lists.open-mpi.org>
Sent: Thursday, June 16, 2022 3:21 AM
To: Open MPI Users
Cc: Patrick Begou
Subject: [OMPI users] OpenMPI and names of the nodes in a cluster

Hi all,

we are facing a serious problem with OpenMPI (4.0.2) that we have
deployed on a cluster. We do not manage this large cluster and the names
of the nodes do not agree with Internet standards for protocols: they
contain a "_" (underscore) character.

So OpenMPI complains about this and does not run.

I've tried to use IP addresses instead of host names in the host file,
without any success.

Is there a known workaround for this, as asking the administrators to
change the node names on this large cluster may be difficult?

Thanks

Patrick








Re: [OMPI users] OpenMPI and names of the nodes in a cluster

2022-06-16 Thread Jeff Squyres (jsquyres) via users
Ah; this is a slightly different error than what Gilles was guessing from your 
prior description.  This is what you're running in to: 
https://github.com/open-mpi/ompi/blob/v4.0.x/orte/mca/regx/fwd/regx_fwd.c#L130-L134

Try running with:

mpirun --mca regex naive ...

Specifically: the "fwd" regex component is selected by default, but it has 
certain expectations about the format of hostnames.  Try using the "naive" 
regex component, instead.

-- 
Jeff Squyres
jsquy...@cisco.com


From: Patrick Begou 
Sent: Thursday, June 16, 2022 9:48 AM
To: Jeff Squyres (jsquyres); Open MPI Users
Subject: Re: [OMPI users] OpenMPI and names of the nodes in a cluster

Hi  Gilles and Jeff,

@Gilles I will have a look at these files, thanks.

@Jeff this is the error message (screen dump attached) and of course the node 
names do not agree with the standard.

Patrick


On 16/06/2022 at 14:30, Jeff Squyres (jsquyres) wrote:

What exactly is the error that is occurring?

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>


From: users 
<mailto:users-boun...@lists.open-mpi.org> on 
behalf of Patrick Begou via users 
<mailto:users@lists.open-mpi.org>
Sent: Thursday, June 16, 2022 3:21 AM
To: Open MPI Users
Cc: Patrick Begou
Subject: [OMPI users] OpenMPI and names of the nodes in a cluster

Hi all,

we are facing a serious problem with OpenMPI (4.0.2) that we have
deployed on a cluster. We do not manage this large cluster and the names
of the nodes do not agree with Internet standards for protocols: they
contain a "_" (underscore) character.

So OpenMPI complains about this and does not run.

I've tried to use IP addresses instead of host names in the host file,
without any success.

Is there a known workaround for this, as asking the administrators to
change the node names on this large cluster may be difficult?

Thanks

Patrick






Re: [OMPI users] OpenMPI and names of the nodes in a cluster

2022-06-16 Thread Jeff Squyres (jsquyres) via users
What exactly is the error that is occurring?

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Patrick Begou via 
users 
Sent: Thursday, June 16, 2022 3:21 AM
To: Open MPI Users
Cc: Patrick Begou
Subject: [OMPI users] OpenMPI and names of the nodes in a cluster

Hi all,

we are facing a serious problem with OpenMPI (4.0.2) that we have
deployed on a cluster. We do not manage this large cluster and the names
of the nodes do not agree with Internet standards for protocols: they
contain a "_" (underscore) character.

So OpenMPI complains about this and does not run.

I've tried to use IP addresses instead of host names in the host file,
without any success.

Is there a known workaround for this, as asking the administrators to
change the node names on this large cluster may be difficult?

Thanks

Patrick




[OMPI users] Passing of an MPI luminary: Rusty Lusk

2022-05-23 Thread Jeff Squyres (jsquyres) via users
In case you had not heard, Dr. Ewing "Rusty" Lusk passed away at age 78 last 
week.  Rusty was one of the founders and prime movers of the entire MPI 
ecosystem: the MPI Forum, the MPI standard, and MPICH.  Without Rusty, our 
community would not exist.  In addition to all of that, he was an all-around 
great guy: he was a thoughtful scientist and engineer, a kind mentor, and a 
genuinely nice guy.  Rusty was on my Ph.D. committee, and I was fortunate 
enough to work with him on a few projects over the years.

Thank you for everything, Rusty.

https://obituaries.neptunesociety.com/obituaries/downers-grove-il/ewing-lusk-10754811/amp

-- 
Jeff Squyres
jsquy...@cisco.com

Re: [OMPI users] Network traffic packets documentation

2022-05-17 Thread Jeff Squyres (jsquyres) via users
Just to clarify: Open MPI's "out of band" messaging *used* to be called "OOB".  
Then PMIx split off into its own project, and Open MPI effectively offloaded 
our out-of-band messaging to PMIx.

If you want to inspect PMIx messages, you'll need to look at the headers in its 
source code repo: https://github.com/openpmix/openpmix/.  It's a different 
project than Open MPI, but you can certainly ask questions on their mailing 
lists, too.

--
Jeff Squyres
jsquy...@cisco.com


From: victor sv 
Sent: Tuesday, May 17, 2022 4:00 AM
To: Jeff Squyres (jsquyres)
Cc: users@lists.open-mpi.org
Subject: Network traffic packets documentation

Hi Jeff,

Thanks for your help. It seems a good point to start :)

And what about PMIx messages (I think it's called OOB) during the process 
spawn? Where can I look at the data structures?

I will probably ask similar questions in the near future. I'm going to start a 
proof-of-concept project (only for fun) to monitor OMPI apps from different 
points of view using eBPF (not only network).

Hopefully you and other active members in the list can help me during the 
process.

Thanks again.
BR,
Víctor.





On Monday, May 16, 2022, Jeff Squyres (jsquyres) 
<jsquy...@cisco.com> wrote:
Open MPI is generally structured in layers, but adjacent layers don't 
necessarily have any knowledge of each other.

For example, the PML (point-to-point messaging layer) is the first layer behind 
MPI point-to-point functions such as MPI_SEND and MPI_RECV.  Different PMLs do 
not have the same packet layouts, and may, themselves, be layered.  For 
example, the OB1 PML is mostly a high-level protocol engine that uses BTLs for 
low-level sending to and receiving from peers.  The BTLs, therefore, are 
responsible for their outermost packet formats, and encapsulate the OB1 payload 
(which, itself, has a header and a payload).

If you want to look at the TCP BTL (over the OB1 PML), you can look at the 
various structs in opal/mca/btl/tcp.  Additionally, you'll need to look at the 
OB1 PML structs in ompi/mca/pml/ob1.  It's conceivable that you could make a 
Wireshark plugin for this use case, for example.  The TCP BTL and OB1 PML 
structs are subject to change at any time -- it's not like they're published 
standards -- but they have been pretty stable for years.

For other use cases -- e.g., OS-bypass networks -- you'll need to sniff the 
packets from the network itself (because, by definition, the OS won't have 
visibility of the packets).  Regardless, all of those structs are defined in 
their BTL / MTL / PML / etc. components.  We don't have formal documentation of 
any of them, sorry!

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>

________
From: victor sv mailto:victo...@gmail.com>>
Sent: Monday, May 16, 2022 1:17 PM
To: Jeff Squyres (jsquyres)
Cc: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>
Subject: Re: [OMPI users] Network traffic packets documentation

Hi Jeff,

Ok, maybe "packet headers" are not the right words. What I would like to know 
is how MPI application data is structured inside each packet in order to 
dissect and caracterize the messages.

As a first step I would like to start with TCP over Ethernet (MCA BTL TCP, I 
think). How can I figure out how the application data structure looks like 
inside network packets?

In the future I would like to extend it to other network and transport 
combinations.

What do you think? Has it sense?

Thanks in advance.
Víctor

On Mon, May 16, 2022 at 15:45, Jeff Squyres (jsquyres) 
(<jsquy...@cisco.com>) wrote:
Open MPI doesn't prescribe a specific network protocol for anything.  Indeed, 
each network transport uses its own protocols, headers, etc.  It's basically 
a case of "each Open MPI plugin needs to be able to talk to itself", and 
therefore no commonality is needed (or desired).

Which network and Open MPI transport are you looking to sniff?

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>


From: users 
<mailto:users-boun...@lists.open-mpi.org> on behalf of victor sv via users 
<mailto:users@lists.open-mpi.org>
Sent: Sunday, May 15, 2022 3:55 PM
To: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>
Cc: victor sv
Subject: [OMPI users] Network traffic packets documentation

Hello,

I would like to sniff the OMPI network

Re: [OMPI users] Network traffic packets documentation

2022-05-16 Thread Jeff Squyres (jsquyres) via users
Open MPI is generally structured in layers, but adjacent layers don't 
necessarily have any knowledge of each other.

For example, the PML (point-to-point messaging layer) is the first layer behind 
MPI point-to-point functions such as MPI_SEND and MPI_RECV.  Different PMLs do 
not have the same packet layouts, and may, themselves, be layered.  For 
example, the OB1 PML is mostly a high-level protocol engine that uses BTLs for 
low-level sending to and receiving from peers.  The BTLs, therefore, are 
responsible for their outermost packet formats, and encapsulate the OB1 payload 
(which, itself, has a header and a payload).

If you want to look at the TCP BTL (over the OB1 PML), you can look at the 
various structs in opal/mca/btl/tcp.  Additionally, you'll need to look at the 
OB1 PML structs in ompi/mca/pml/ob1.  It's conceivable that you could make a 
Wireshark plugin for this use case, for example.  The TCP BTL and OB1 PML 
structs are subject to change at any time -- it's not like they're published 
standards -- but they have been pretty stable for years.

For other use cases -- e.g., OS-bypass networks -- you'll need to sniff the 
packets from the network itself (because, by definition, the OS won't have 
visibility of the packets).  Regardless, all of those structs are defined in 
their BTL / MTL / PML / etc. components.  We don't have formal documentation of 
any of them, sorry!
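
As a purely illustrative starting point for the TCP BTL case (host and 
interface names below are placeholders), you could capture the raw traffic 
with tcpdump and then decode the BTL / OB1 headers offline against those 
structs:

    # capture all TCP traffic between two nodes of the job for later dissection
    tcpdump -i eth0 -w ompi-traffic.pcap 'tcp and host node01 and host node02'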

--
Jeff Squyres
jsquy...@cisco.com


From: victor sv 
Sent: Monday, May 16, 2022 1:17 PM
To: Jeff Squyres (jsquyres)
Cc: users@lists.open-mpi.org
Subject: Re: [OMPI users] Network traffic packets documentation

Hi Jeff,

Ok, maybe "packet headers" are not the right words. What I would like to know 
is how MPI application data is structured inside each packet in order to 
dissect and caracterize the messages.

As a first step I would like to start with TCP over Ethernet (MCA BTL TCP, I 
think). How can I figure out how the application data structure looks like 
inside network packets?

In the future I would like to extend it to other network and transport 
combinations.

What do you think? Has it sense?

Thanks in advance.
Víctor

On Mon, May 16, 2022 at 15:45, Jeff Squyres (jsquyres) 
(<jsquy...@cisco.com>) wrote:
Open MPI doesn't prescribe a specific network protocol for anything.  Indeed, 
each network transport uses its own protocols, headers, etc.  It's basically 
a case of "each Open MPI plugin needs to be able to talk to itself", and 
therefore no commonality is needed (or desired).

Which network and Open MPI transport are you looking to sniff?

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>


From: users 
mailto:users-boun...@lists.open-mpi.org>> on 
behalf of victor sv via users 
mailto:users@lists.open-mpi.org>>
Sent: Sunday, May 15, 2022 3:55 PM
To: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>
Cc: victor sv
Subject: [OMPI users] Network traffic packets documentation

Hello,

I would like to sniff the OMPI network traffic from outside the MPI application.

I was traversing the OpenMPI code and documentation, but I have not found any 
central point explaining MPI communications from the network point of view.

Please, is there any official documentation, or paper, or presentation or 
picture about MPI packet headers?

Sorry if this a basic question or if it was already answered.

Thanks in advance for your help!
BR,
Víctor.


Re: [OMPI users] Network traffic packets documentation

2022-05-16 Thread Jeff Squyres (jsquyres) via users
Open MPI doesn't prescribe a specific network protocol for anything.  Indeed, 
each network transport uses its own protocols, headers, etc.  It's basically 
a case of "each Open MPI plugin needs to be able to talk to itself", and 
therefore no commonality is needed (or desired).

Which network and Open MPI transport are you looking to sniff?

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of victor sv via users 

Sent: Sunday, May 15, 2022 3:55 PM
To: users@lists.open-mpi.org
Cc: victor sv
Subject: [OMPI users] Network traffic packets documentation

Hello,

I would like to sniff the OMPI network traffic from outside the MPI application.

I was traversing the OpenMPI code and documentation, but I have not found any 
central point explaining MPI communications from the network point of view.

Please, is there any official documentation, or paper, or presentation or 
picture about MPI packet headers?

Sorry if this a basic question or if it was already answered.

Thanks in advance for your help!
BR,
Víctor.


Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

2022-05-05 Thread Jeff Squyres (jsquyres) via users
For the mailing list: the issue is that some M1's apparently default to an 
unlimited max number of open files per process.  If you use "ulimit -n" to set 
some reasonable number, then Open MPI should behave better:

ulimit -n 1024
mpirun ...

We don't (yet?) know why this seems to be necessary on some M1s (e.g., Scott's) 
but not others (e.g., George's).

We'll put a guard in against the "unlimited" case in future releases.

See https://github.com/open-mpi/ompi/issues/10358 for more details, but I 
figured I'd put the workaround out here on the mailing list.

--
Jeff Squyres
jsquy...@cisco.com

________
From: users  on behalf of Jeff Squyres 
(jsquyres) via users 
Sent: Thursday, May 5, 2022 3:31 PM
To: George Bosilca; Open MPI Users
Cc: Jeff Squyres (jsquyres)
Subject: Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

Scott and I conversed a bit off list, and I got more data.  I posted everything 
in https://github.com/open-mpi/ompi/issues/10358 -- let's follow up on this 
issue there.

--
Jeff Squyres
jsquy...@cisco.com


From: George Bosilca 
Sent: Thursday, May 5, 2022 3:19 PM
To: Open MPI Users
Cc: Jeff Squyres (jsquyres); Scott Sayres
Subject: Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

That is weird, but maybe it is not a deadlock but very slow progress. In the 
child, can you print fdmax and i in the do_child frame?

George.

On Thu, May 5, 2022 at 11:50 AM Scott Sayres via users 
mailto:users@lists.open-mpi.org>> wrote:
Jeff, thanks.
from 1:

(lldb) process attach --pid 95083

Process 95083 stopped

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP

frame #0: 0x0001bde25628 libsystem_kernel.dylib`close + 8

libsystem_kernel.dylib`close:

->  0x1bde25628 <+8>:  b.lo   0x1bde25648   ; <+40>

0x1bde2562c <+12>: pacibsp

0x1bde25630 <+16>: stpx29, x30, [sp, #-0x10]!

0x1bde25634 <+20>: movx29, sp

Target 0: (orterun) stopped.

Executable module set to "/usr/local/bin/orterun".

Architecture set to: arm64e-apple-macosx-.

(lldb) thread backtrace

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP

  * frame #0: 0x0001bde25628 libsystem_kernel.dylib`close + 8

frame #1: 0x000101563074 
mca_odls_default.so`do_child(cd=0x61e28000, write_fd=40) at 
odls_default_module.c:410:17

frame #2: 0x000101562d7c 
mca_odls_default.so`odls_default_fork_local_proc(cdptr=0x61e28000) at 
odls_default_module.c:646:9

frame #3: 0x000100e2c6f8 
libopen-rte.40.dylib`orte_odls_base_spawn_proc(fd=-1, sd=4, 
cbdata=0x61e28000) at odls_base_default_fns.c:1046:31

frame #4: 0x0001011827a0 
libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined] 
event_process_active_single_queue(base=0x00010df069d0) at event.c:1370:4 
[opt]

frame #5: 0x000101182628 
libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined] 
event_process_active(base=0x00010df069d0) at event.c:1440:8 [opt]

frame #6: 0x0001011825ec 
libopen-pal.40.dylib`opal_libevent2022_event_base_loop(base=0x00010df069d0, 
flags=) at event.c:1644:12 [opt]

frame #7: 0x000100bbfb04 orterun`orterun(argc=4, 
argv=0x00016f2432f8) at orterun.c:179:9

frame #8: 0x000100bbf904 orterun`main(argc=4, argv=0x00016f2432f8) 
at main.c:13:12

frame #9: 0x000100f19088 dyld`start + 516

from 2:

scottsayres@scotts-mbp ~ % lldb -p 95082

(lldb) process attach --pid 95082

Process 95082 stopped

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP

frame #0: 0x0001bde25654 libsystem_kernel.dylib`read + 8

libsystem_kernel.dylib`read:

->  0x1bde25654 <+8>:  b.lo   0x1bde25674   ; <+40>

0x1bde25658 <+12>: pacibsp

0x1bde2565c <+16>: stpx29, x30, [sp, #-0x10]!

0x1bde25660 <+20>: movx29, sp

Target 0: (orterun) stopped.

Executable module set to "/usr/local/bin/orterun".

Architecture set to: arm64e-apple-macosx-.

(lldb) thread backtrace

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP

  * frame #0: 0x0001bde25654 libsystem_kernel.dylib`read + 8

frame #1: 0x00010116969c libopen-pal.40.dylib`opal_fd_read(fd=22, 
len=20, buffer=0x00016f24299c) at fd.c:51:14

frame #2: 0x000101563388 
mca_odls_default.so`do_parent(cd=0x61e28200, read_fd=22) at 
odls_default_module.c:495:14

frame #3: 0x000101562d90 
mca_odls_default.so`odls_default_fork_local_proc(cdptr=0x61e28200) at 
odls_default_module.c:651:12

frame #4: 0x000100e2c6f8 
libopen-rte.40.dylib`orte_odls_base_spawn_proc(fd=-1, sd=4, 
cbdata=0x61e28200) at odls_base_default_fns.c:1046

Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

2022-05-05 Thread Jeff Squyres (jsquyres) via users
Scott and I conversed a bit off list, and I got more data.  I posted everything 
in https://github.com/open-mpi/ompi/issues/10358 -- let's follow up on this 
issue there.

--
Jeff Squyres
jsquy...@cisco.com


From: George Bosilca 
Sent: Thursday, May 5, 2022 3:19 PM
To: Open MPI Users
Cc: Jeff Squyres (jsquyres); Scott Sayres
Subject: Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

That is weird, but maybe it is not a deadlock but very slow progress. In the 
child, can you print fdmax and i in the do_child frame?

George.

On Thu, May 5, 2022 at 11:50 AM Scott Sayres via users 
mailto:users@lists.open-mpi.org>> wrote:
Jeff, thanks.
from 1:

(lldb) process attach --pid 95083

Process 95083 stopped

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP

frame #0: 0x0001bde25628 libsystem_kernel.dylib`close + 8

libsystem_kernel.dylib`close:

->  0x1bde25628 <+8>:  b.lo   0x1bde25648   ; <+40>

0x1bde2562c <+12>: pacibsp

0x1bde25630 <+16>: stpx29, x30, [sp, #-0x10]!

0x1bde25634 <+20>: movx29, sp

Target 0: (orterun) stopped.

Executable module set to "/usr/local/bin/orterun".

Architecture set to: arm64e-apple-macosx-.

(lldb) thread backtrace

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP

  * frame #0: 0x0001bde25628 libsystem_kernel.dylib`close + 8

frame #1: 0x000101563074 
mca_odls_default.so`do_child(cd=0x61e28000, write_fd=40) at 
odls_default_module.c:410:17

frame #2: 0x000101562d7c 
mca_odls_default.so`odls_default_fork_local_proc(cdptr=0x61e28000) at 
odls_default_module.c:646:9

frame #3: 0x000100e2c6f8 
libopen-rte.40.dylib`orte_odls_base_spawn_proc(fd=-1, sd=4, 
cbdata=0x61e28000) at odls_base_default_fns.c:1046:31

frame #4: 0x0001011827a0 
libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined] 
event_process_active_single_queue(base=0x00010df069d0) at event.c:1370:4 
[opt]

frame #5: 0x000101182628 
libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined] 
event_process_active(base=0x00010df069d0) at event.c:1440:8 [opt]

frame #6: 0x0001011825ec 
libopen-pal.40.dylib`opal_libevent2022_event_base_loop(base=0x00010df069d0, 
flags=) at event.c:1644:12 [opt]

frame #7: 0x000100bbfb04 orterun`orterun(argc=4, 
argv=0x00016f2432f8) at orterun.c:179:9

frame #8: 0x000100bbf904 orterun`main(argc=4, argv=0x00016f2432f8) 
at main.c:13:12

frame #9: 0x000100f19088 dyld`start + 516

from 2:

scottsayres@scotts-mbp ~ % lldb -p 95082

(lldb) process attach --pid 95082

Process 95082 stopped

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP

frame #0: 0x0001bde25654 libsystem_kernel.dylib`read + 8

libsystem_kernel.dylib`read:

->  0x1bde25654 <+8>:  b.lo   0x1bde25674   ; <+40>

0x1bde25658 <+12>: pacibsp

0x1bde2565c <+16>: stpx29, x30, [sp, #-0x10]!

0x1bde25660 <+20>: movx29, sp

Target 0: (orterun) stopped.

Executable module set to "/usr/local/bin/orterun".

Architecture set to: arm64e-apple-macosx-.

(lldb) thread backtrace

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP

  * frame #0: 0x0001bde25654 libsystem_kernel.dylib`read + 8

frame #1: 0x00010116969c libopen-pal.40.dylib`opal_fd_read(fd=22, 
len=20, buffer=0x00016f24299c) at fd.c:51:14

frame #2: 0x000101563388 
mca_odls_default.so`do_parent(cd=0x61e28200, read_fd=22) at 
odls_default_module.c:495:14

frame #3: 0x000101562d90 
mca_odls_default.so`odls_default_fork_local_proc(cdptr=0x61e28200) at 
odls_default_module.c:651:12

frame #4: 0x000100e2c6f8 
libopen-rte.40.dylib`orte_odls_base_spawn_proc(fd=-1, sd=4, 
cbdata=0x61e28200) at odls_base_default_fns.c:1046:31

frame #5: 0x0001011827a0 
libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined] 
event_process_active_single_queue(base=0x00010df069d0) at event.c:1370:4 
[opt]

frame #6: 0x000101182628 
libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined] 
event_process_active(base=0x00010df069d0) at event.c:1440:8 [opt]

frame #7: 0x0001011825ec 
libopen-pal.40.dylib`opal_libevent2022_event_base_loop(base=0x00010df069d0, 
flags=) at event.c:1644:12 [opt]

frame #8: 0x000100bbfb04 orterun`orterun(argc=4, 
argv=0x00016f2432f8) at orterun.c:179:9

frame #9: 0x000100bbf904 orterun`main(argc=4, argv=0x00016f2432f8) 
at main.c:13:12

frame #10: 0x000100f19088 dyld`start + 516



Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

2022-05-05 Thread Jeff Squyres (jsquyres) via users
You can use "lldb -p PID" to attach to a running process.

--
Jeff Squyres
jsquy...@cisco.com


From: Scott Sayres 
Sent: Thursday, May 5, 2022 11:22 AM
To: Jeff Squyres (jsquyres)
Cc: Open MPI Users
Subject: Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

Jeff,
It does launch two mpirun processes (checked from another terminal window while it is hung):

scottsayres  95083  99.0  0.0 408918416   1472 s002  R 8:20AM   0:04.48 
mpirun -np 4 foo.sh

scottsayres  95085   0.0  0.0 408628368   1632 s006  S+8:20AM   0:00.00 
egrep mpirun|foo.sh

scottsayres  95082   0.0  0.1 408918416  10384 s002  S+8:20AM   0:00.03 
mpirun -np 4 foo.sh


I'm looking up how to get the backtrace from them both but if you know the 
answer I could use advice.


best

Scott


Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

2022-05-05 Thread Jeff Squyres (jsquyres) via users
Scott --

Sorry; something I should have clarified in my original email: I meant for you to 
run the "ps" command **while mpirun was still hung**.  I.e., do it in another 
terminal, before you hit ctrl-C to exit mpirun.

I want to see if mpirun has launched the foo.sh or not.  Gilles' test is a 
different mechanism to give a similar result (i.e., it produces a side effect that 
allows you to tell if the child process was actually launched or not).  In the 
ps test, if there are *2* copies of mpirun, it would be useful to lldb attach to 
each of them and get the backtrace from both (you have the parent backtrace 
already; I'm really interested to see what the child mpirun's backtrace is -- 
that would tell us the exact line number where the child is hung).

Gilles' observation about the firewall and IP/hostname stuff is interesting, 
too.  The weirdness here is that the backtrace you posted earlier implies that 
the parent mpirun hadn't even finished its fork/exec sequence (i.e., mpirun 
itself is still in the "do_parent()" function, which implies that it didn't 
complete the pipe handshake that happens immediately after forking the child 
process... which is weird).

--
Jeff Squyres
jsquy...@cisco.com


From: Scott Sayres 
Sent: Wednesday, May 4, 2022 4:02 PM
To: Jeff Squyres (jsquyres)
Cc: Open MPI Users
Subject: Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

foo.sh is executable, again hangs without output.
I command c x2 to return to shell, then

ps auxwww | egrep 'mpirun|foo.sh'

output shown below


scottsayres@scotts-mbp trouble-shoot % ./foo.sh

Wed May  4 12:59:15 MST 2022

Wed May  4 12:59:16 MST 2022

Wed May  4 12:59:17 MST 2022

Wed May  4 12:59:18 MST 2022

Wed May  4 12:59:19 MST 2022

Wed May  4 12:59:20 MST 2022

Wed May  4 12:59:21 MST 2022

Wed May  4 12:59:22 MST 2022

Wed May  4 12:59:23 MST 2022

Wed May  4 12:59:24 MST 2022

scottsayres@scotts-mbp trouble-shoot % mpirun -np 1 foo.sh

^C^C%   

  scottsayres@scotts-mbp trouble-shoot % ps auxwww | egrep 'mpirun|foo.sh'

scottsayres  91795 100.0  0.0 409067920   1456 s002  R12:59PM   0:14.07 
mpirun -np 1 foo.sh

scottsayres  91798   0.0  0.0 408628368   1632 s002  S+1:00PM   0:00.00 
egrep mpirun|foo.sh

scottsayres@scotts-mbp trouble-shoot %


On Wed, May 4, 2022 at 12:42 PM Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:
That backtrace seems to imply that the launch may not have completed.

Can you make an executable script foo.sh with:


#!/bin/bash


i=0

while test $i -lt 10; do

date

sleep 1

let i=$i+1

done

Make sure that foo.sh is executable and then run it via:

mpirun -np 1 foo.sh

If you start seeing output, good!  If it completes, better!

If it hangs, and/or if you don't see any output at all, do this:


ps auxwww | egrep 'mpirun|foo.sh'

It should show mpirun and 2 copies of foo.sh (and probably a grep).  Does it?

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>


From: Scott Sayres mailto:ssay...@asu.edu>>
Sent: Wednesday, May 4, 2022 2:47 PM
To: Open MPI Users
Cc: Jeff Squyres (jsquyres)
Subject: Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

Following Jeff's advice, I have rebuilt Open MPI by hand using the -g option.   
This shows more information, as below.   I am attempting George's advice on how 
to track the child, but notice that gdb does not support arm64; attempting to 
update lldb.


scottsayres@scotts-mbp openmpi-4.1.3 % lldb mpirun -- -np 1 hostname

(lldb) target create "mpirun"

Current executable set to 'mpirun' (arm64).

(lldb) settings set -- target.run-args  "-np" "1" "hostname"

(lldb) run

Process 90950 launched: '/usr/local/bin/mpirun' (arm64)

Process 90950 stopped

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP

frame #0: 0x0001bde25654 libsystem_kernel.dylib`read + 8

libsystem_kernel.dylib`read:

->  0x1bde25654 <+8>:  b.lo   0x1bde25674   ; <+40>

0x1bde25658 <+12>: pacibsp

0x1bde2565c <+16>: stpx29, x30, [sp, #-0x10]!

0x1bde25660 <+20>: movx29, sp

Target 0: (mpirun) stopped.

(lldb) ^C

(lldb) thread backtrace

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP

  * frame #0: 0x0001bde25654 libsystem_kernel.dylib`read + 8

frame #1: 0x00010056169c libopen-pal.40.dylib`opal_fd_read(fd=27, 
len=20, buffer=0x00016fdfe90c) at fd.c:51:14

frame #2: 0x0001027b3388 
mca_odls_default.so`do_parent(cd=0x63e0, read_fd=27) at 
odls_default_module.c:495:14

Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

2022-05-04 Thread Jeff Squyres (jsquyres) via users
That backtrace seems to imply that the launch may not have completed.

Can you make an executable script foo.sh with:


#!/bin/bash


i=0

while test $i -lt 10; do

date

sleep 1

let i=$i+1

done

Make sure that foo.sh is executable and then run it via:

mpirun -np 1 foo.sh

If you start seeing output, good!  If it completes, better!

If it hangs, and/or if you don't see any output at all, do this:


ps auxwww | egrep 'mpirun|foo.sh'

It should show mpirun and 2 copies of foo.sh (and probably a grep).  Does it?

--
Jeff Squyres
jsquy...@cisco.com


From: Scott Sayres 
Sent: Wednesday, May 4, 2022 2:47 PM
To: Open MPI Users
Cc: Jeff Squyres (jsquyres)
Subject: Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

Following Jeff's advice, I have rebuilt Open MPI by hand using the -g option.   
This shows more information, as below.   I am attempting George's advice on how 
to track the child, but notice that gdb does not support arm64; attempting to 
update lldb.


scottsayres@scotts-mbp openmpi-4.1.3 % lldb mpirun -- -np 1 hostname

(lldb) target create "mpirun"

Current executable set to 'mpirun' (arm64).

(lldb) settings set -- target.run-args  "-np" "1" "hostname"

(lldb) run

Process 90950 launched: '/usr/local/bin/mpirun' (arm64)

Process 90950 stopped

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP

frame #0: 0x0001bde25654 libsystem_kernel.dylib`read + 8

libsystem_kernel.dylib`read:

->  0x1bde25654 <+8>:  b.lo   0x1bde25674   ; <+40>

0x1bde25658 <+12>: pacibsp

0x1bde2565c <+16>: stpx29, x30, [sp, #-0x10]!

0x1bde25660 <+20>: movx29, sp

Target 0: (mpirun) stopped.

(lldb) ^C

(lldb) thread backtrace

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP

  * frame #0: 0x0001bde25654 libsystem_kernel.dylib`read + 8

frame #1: 0x00010056169c libopen-pal.40.dylib`opal_fd_read(fd=27, 
len=20, buffer=0x00016fdfe90c) at fd.c:51:14

frame #2: 0x0001027b3388 
mca_odls_default.so`do_parent(cd=0x63e0, read_fd=27) at 
odls_default_module.c:495:14

frame #3: 0x0001027b2d90 
mca_odls_default.so`odls_default_fork_local_proc(cdptr=0x63e0) at 
odls_default_module.c:651:12

frame #4: 0x0001003246f8 
libopen-rte.40.dylib`orte_odls_base_spawn_proc(fd=-1, sd=4, 
cbdata=0x63e0) at odls_base_default_fns.c:1046:31

frame #5: 0x00010057a7a0 
libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined] 
event_process_active_single_queue(base=0x0001007061c0) at event.c:1370:4 
[opt]

frame #6: 0x00010057a628 
libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined] 
event_process_active(base=0x0001007061c0) at event.c:1440:8 [opt]

frame #7: 0x00010057a5ec 
libopen-pal.40.dylib`opal_libevent2022_event_base_loop(base=0x0001007061c0, 
flags=) at event.c:1644:12 [opt]

frame #8: 0x00013b04 mpirun`orterun(argc=4, 
argv=0x00016fdff268) at orterun.c:179:9

frame #9: 0x00013904 mpirun`main(argc=4, argv=0x00016fdff268) 
at main.c:13:12

frame #10: 0x000100015088 dyld`start + 516




Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

2022-05-04 Thread Jeff Squyres (jsquyres) via users
George beat me to the reply.  :-)

His advice is the correct one (check out what's happening in a debugger).  This 
will likely work better with a hand-built Open MPI (vs. Homebrew), because then 
you can configure/build Open MPI with -g so that the debugger will be able to 
see the source code.  E.g.:

./configure CFLAGS=-g ...
make -j 8 all
[sudo] make install

(put whatever other configure flags you want in there, such as a custom prefix, 
... etc.)

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of George Bosilca via 
users 
Sent: Wednesday, May 4, 2022 12:35 PM
To: Open MPI Users
Cc: George Bosilca
Subject: Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

I compiled a fresh copy of the 4.1.3 branch on my M1 laptop, and I can run both 
MPI and non-MPI apps without any issues.

Try running `lldb mpirun -- -np 1 hostname` and once it deadlocks, do a CTRL+C 
to get back on the debugger and then `backtrace` to see where it is waiting.

George.


On Wed, May 4, 2022 at 11:28 AM Scott Sayres via users 
mailto:users@lists.open-mpi.org>> wrote:
Thanks for looking at this Jeff.
No, I cannot use mpirun to launch a non-MPI application.  The command "mpirun 
-np 2 hostname" also hangs.

I get the following output if I add the -d option (I've replaced the 
server name with hashtags):

[scotts-mbp.3500.dhcp.###:05469] procdir: 
/var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501/pid.5469/0/0

[scotts-mbp.3500.dhcp.###:05469] jobdir: 
/var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501/pid.5469/0

[scotts-mbp.3500.dhcp.###:05469] top: 
/var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501/pid.5469

[scotts-mbp.3500.dhcp.###:05469] top: 
/var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501

[scotts-mbp.3500.dhcp.###:05469] tmp: 
/var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T/

[scotts-mbp.3500.dhcp.###:05469] sess_dir_cleanup: job session dir does not 
exist

[scotts-mbp.3500.dhcp.###:05469] sess_dir_cleanup: top session dir not empty - 
leaving

[scotts-mbp.3500.dhcp.###:05469] procdir: 
/var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501/pid.5469/0/0

[scotts-mbp.3500.dhcp.###:05469] jobdir: 
/var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501/pid.5469/0

[scotts-mbp.3500.dhcp.###:05469] top: 
/var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501/pid.5469

[scotts-mbp.3500.dhcp.###:05469] top: 
/var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T//ompi.scotts-mbp.501

[scotts-mbp.3500.dhcp.###:05469] tmp: 
/var/folders/l0/94hsdtwj09xd62d90nfh_3h0gn/T/

[scotts-mbp.3500.dhcp.###:05469] [[48286,0],0] Releasing job data for [INVALID]

Can you recommend a way to find where mpirun gets stuck?
Thanks!
Scott

On Wed, May 4, 2022 at 6:06 AM Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:
Are you able to use mpirun to launch a non-MPI application?  E.g.:

mpirun -np 2 hostname

And if that works, can you run the simple example MPI apps in the "examples" 
directory of the MPI source tarball (the "hello world" and "ring" programs)?  
E.g.:

cd examples
make
mpirun -np 4 hello_c
mpirun -np 4 ring_c

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>


From: users 
mailto:users-boun...@lists.open-mpi.org>> on 
behalf of Scott Sayres via users 
mailto:users@lists.open-mpi.org>>
Sent: Tuesday, May 3, 2022 1:07 PM
To: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>
Cc: Scott Sayres
Subject: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

Hello,
I am new to openmpi, but would like to use it for ORCA calculations, and plan 
to run codes on the 10 processors of my macbook pro.  I installed this manually 
and also through homebrew with similar results.  I am able to compile codes 
with mpicc and run them as native codes, but everything that I attempt with 
mpirun, mpiexec just freezes.  I can end the program by typing 'control C' 
twice, but it continues to run in the background and requires me to 'kill 
'.
even as simple as 'mpirun uname' freezes

I have tried one installation by: 'arch -arm64 brew install openmpi '
and a second by downloading the source file, './configure --prefix=/usr/local', 
'make all', make install

the commands: 'which mpicc', 'which 'mpirun', etc are able to find them on the 
path... it just hangs.

Can anyone suggest how to fix the problem of the program hanging?
Thanks!
Scott


--
Scott G Sayres
Assistant Professor
School of Molecular Sciences (formerly Department of Chemistry & Biochemistry)
Biodesign Center for Applied Structural Discovery
Arizona State University


Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

2022-05-04 Thread Jeff Squyres (jsquyres) via users
Are you able to use mpirun to launch a non-MPI application?  E.g.:

mpirun -np 2 hostname

And if that works, can you run the simple example MPI apps in the "examples" 
directory of the MPI source tarball (the "hello world" and "ring" programs)?  
E.g.:

cd examples
make
mpirun -np 4 hello_c
mpirun -np 4 ring_c

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Scott Sayres via 
users 
Sent: Tuesday, May 3, 2022 1:07 PM
To: users@lists.open-mpi.org
Cc: Scott Sayres
Subject: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

Hello,
I am new to openmpi, but would like to use it for ORCA calculations, and plan 
to run codes on the 10 processors of my macbook pro.  I installed this manually 
and also through homebrew with similar results.  I am able to compile codes 
with mpicc and run them as native codes, but everything that I attempt with 
mpirun, mpiexec just freezes.  I can end the program by typing 'control C' 
twice, but it continues to run in the background and requires me to 'kill 
'.
even as simple as 'mpirun uname' freezes

I have tried one installation by: 'arch -arm64 brew install openmpi '
and a second by downloading the source file, './configure --prefix=/usr/local', 
'make all', make install

the commands: 'which mpicc', 'which 'mpirun', etc are able to find them on the 
path... it just hangs.

Can anyone suggest how to fix the problem of the program hanging?
Thanks!
Scott


Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-22 Thread Jeff Squyres (jsquyres) via users
Can you send all the information listed under "For compile problems" (please 
compress!):

https://www.open-mpi.org/community/help/

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Cici Feng via users 

Sent: Friday, April 22, 2022 5:30 AM
To: Open MPI Users
Cc: Cici Feng
Subject: Re: [OMPI users] help with M1 chip macOS openMPI installation

Hi George,

Thanks so much for the tips; I have installed Rosetta so that my 
computer can run the Intel software. However, the same error appears when I try 
to make Open MPI, and here's how it looks:

../../../../opal/threads/thread_usage.h(163): warning #266: function 
"opal_atomic_swap_ptr" declared implicitly

  OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)

  ^


In file included from ../../../../opal/class/opal_object.h(126),

 from ../../../../opal/dss/dss_types.h(40),

 from ../../../../opal/dss/dss.h(32),

 from pmix3x_server_north.c(27):

../../../../opal/threads/thread_usage.h(163): warning #120: return value type 
does not match the function type

  OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)

  ^


pmix3x_server_north.c(157): warning #266: function "opal_atomic_rmb" declared 
implicitly

  OPAL_ACQUIRE_OBJECT(opalcaddy);

  ^


  CCLD mca_pmix_pmix3x.la

Making all in mca/pstat/test

  CCLD mca_pstat_test.la

Making all in mca/rcache/grdma

  CCLD mca_rcache_grdma.la

Making all in mca/reachable/weighted

  CCLD mca_reachable_weighted.la

Making all in mca/shmem/mmap

  CCLD mca_shmem_mmap.la

Making all in mca/shmem/posix

  CCLD mca_shmem_posix.la

Making all in mca/shmem/sysv

  CCLD mca_shmem_sysv.la

Making all in tools/wrappers

  CCLD opal_wrapper

Undefined symbols for architecture x86_64:

  "_opal_atomic_add_fetch_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_compare_exchange_strong_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_compare_exchange_strong_ptr", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_lock", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_lock_init", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_mb", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_rmb", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_sub_fetch_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_swap_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_swap_ptr", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_unlock", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_wmb", referenced from:

  import-atom in libopen-pal.dylib

ld: symbol(s) not found for architecture x86_64

make[2]: *** [opal_wrapper] Error 1

make[1]: *** [all-recursive] Error 1

make: *** [all-recursive] Error 1


I am not sure if the ld part affects the make process or not. Either way, 
Error 1 appears at "opal_wrapper", which I think is the error I kept 
encountering.

Is there any explanation to this specific error?

P.S. The configure command I used is as follows, provided by the official 
MARE2DEM website:

sudo  ./configure --prefix=/opt/openmpi CC=icc CXX=icc F77=ifort FC=ifort \
lt_prog_compiler_wl_FC='-Wl,';
make all install

Thanks again,
Cici

On Thu, Apr 21, 2022 at 11:18 PM George Bosilca via users 
mailto:users@lists.open-mpi.org>> wrote:
1. I am not aware of any outstanding OMPI issues with the M1 chip that would 
prevent OMPI from compiling and running efficiently in an M1-based setup, 
assuming the compilation chain is working properly.

2. M1 supports x86 code via Rosetta, an app provided by Apple to ensure a 
smooth transition from the Intel-based to the M1-based laptop's line. I do 
recall running an OMPI compiled on my Intel laptop on my M1 laptop to test the 
performance of the Rosetta binary translator. We even had some discussions 
about this, on the mailing list (or github issues).

3. Based on your original message, and their webpage, MARE2DEM is not 
supporting any other compilation chain than Intel. As explained above, that 
might not be by itself a showstopper, because you can run x86 code on the M1 
chip, using Rosetta. However, MARE2DEM relies on MKL, the Intel Math Library, 
and that library will not run on a M1 chip.

  George.


On Thu, Apr 21, 

Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Jeff Squyres (jsquyres) via users
With THREAD_FUNNELED, it means that there can only be one thread in MPI at a 
time -- and it needs to be the same thread as the one that called 
MPI_INIT_THREAD.

Is that the case in your app?
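
For example, here is a minimal sketch of requesting FUNNELED support and 
checking what the library actually provided (illustrative only, not taken 
from your app):

    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        /* the MPI library cannot give us the thread level we asked for */
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* From here on, only the thread that called MPI_Init_thread
       (typically the OpenMP master thread) may make MPI calls. */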

Also, what is your app doing at src/pcorona_main.f90:627?  Is it making a call 
to MPI, or something else?  It might be useful to compile Open MPI (and/or 
other libraries that you're using) with -g so that you can get more meaningful 
stack traces upon error -- that might give some insight into where / why the 
failure is occurring.

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Angel de Vicente 
via users 
Sent: Friday, April 22, 2022 10:54 AM
To: Gilles Gouaillardet via users
Cc: Angel de Vicente
Subject: Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation 
fault only when run with --bind-to none

Thanks Gilles,

Gilles Gouaillardet via users  writes:

> You can first double check you
> MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...)

my code uses "mpi_thread_funneled" and OpenMPI was compiled with
MPI_THREAD_MULTIPLE support:

,
| ompi_info | grep  -i thread
|   Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, 
OMPI progress: no, ORTE progress: yes, Event lib: yes)
|FT Checkpoint support: no (checkpoint thread: no)
`

Cheers,
--
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/
-
AVISO LEGAL: Este mensaje puede contener información confidencial y/o 
privilegiada. Si usted no es el destinatario final del mismo o lo ha recibido 
por error, por favor notifíquelo al remitente inmediatamente. Cualquier uso no 
autorizadas del contenido de este mensaje está estrictamente prohibida. Más 
información en: https://www.iac.es/es/responsabilidad-legal
DISCLAIMER: This message may contain confidential and / or privileged 
information. If you are not the final recipient or have received it in error, 
please notify the sender immediately. Any unauthorized use of the content of 
this message is strictly prohibited. More information:  
https://www.iac.es/en/disclaimer


Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-21 Thread Jeff Squyres (jsquyres) via users
A little more color on Gilles' answer: I believe that we had some Open MPI 
community members work on adding M1 support to Open MPI, but Gilles is 
absolutely correct: the underlying compiler has to support the M1, or you won't 
get anywhere.

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Cici Feng via users 

Sent: Thursday, April 21, 2022 6:11 AM
To: Open MPI Users
Cc: Cici Feng
Subject: Re: [OMPI users] help with M1 chip macOS openMPI installation

Gilles,

Thank you so much for the quick response!
The openMPI installed by brew is compiled with gcc and gfortran, using the 
original compilers from Apple. I haven't figured out yet how to use this gcc 
openMPI for the inversion software :(
Given your answer, I think I'll pause the M1 / Intel compilers / openMPI route 
for now and switch to an Intel cluster until someone figures 
out the M1 chip problem ~

Thanks again for your help!
Cici

On Thu, Apr 21, 2022 at 5:59 PM Gilles Gouaillardet via users 
mailto:users@lists.open-mpi.org>> wrote:
Cici,

I do not think the Intel C compiler is able to generate native code for the M1 
(aarch64).
The best case scenario is it would generate code for x86_64 and then Rosetta 
would be used to translate it to aarch64 code,
and this is a very downgraded solution.

So if you really want to stick to the Intel compiler, I strongly encourage you 
to run on Intel/AMD processors.
Otherwise, use a native compiler for aarch64, and in this case, brew is not a 
bad option.


Cheers,

Gilles

On Thu, Apr 21, 2022 at 6:36 PM Cici Feng via users 
mailto:users@lists.open-mpi.org>> wrote:
Hi there,

I am trying to install an electromagnetic inversion software (MARE2DEM) for 
which the Intel C compilers and Open MPI are considered prerequisites. 
However, since I am completely new to computer science and coding, and given 
some technical issues with the computer I am building all this on, I 
have run into some questions with the whole process.

The computer I am working on is a MacBook Pro with an M1 Max chip. Even though 
my friends have discouraged me from keeping at it on my M1 laptop, I still want to 
reach out to the developers since I feel like you might have a solution.

After downloading the Open MPI source code from the .org website and doing "sudo 
configure and make all install", I was not able to install Open MPI onto my 
computer. The error mentioned something about the chip not being 
supported.

I have also tried to install Open MPI through Homebrew using the command "brew 
install openmpi" and it worked just fine. However, since Homebrew 
automatically sets up the configuration of Open MPI (it uses gcc and gfortran), I 
was not able to use my Intel compilers to build Open MPI, which causes further 
problems in the installation of my inversion software.

In conclusion, I think right now the M1 chip is the biggest problem in the 
whole installation process, yet I think you might have some solution for 
the installation. I would assume that Apple is switching all of its chips to M1, 
which makes these shifts and changes inevitable.

I would really like to hear from you about a solution for installing Open MPI on 
an M1 MacBook, and I would like to thank you for taking the time to read my long 
email.

Thank you very much.
Sincerely,

Cici







Re: [OMPI users] mixed OpenMP/MPI

2022-03-15 Thread Jeff Squyres (jsquyres) via users
Thanks for the poke!  Sorry we missed replying to your github issue.  Josh 
replied to it this morning.

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Bernstein, Noam CIV 
USN NRL (6393) Washington DC (USA) via users 
Sent: Tuesday, March 15, 2022 8:56 AM
To: users@lists.open-mpi.org
Cc: Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Subject: [OMPI users] mixed OpenMP/MPI

Hi - I'm trying to run multi-node mixed OpenMP/MPI with each MPI task bound to 
a set of cores.  I thought this would be relatively straightforward with  
"--map-by slot:PE=$OMP_NUM_THREADS --bind-to core", but I can't get it to work. 
 I couldn't figure out if it was a bug or just something missing from the 
documentation, so I created github issue 
https://github.com/open-mpi/ompi/issues/10071, but that hasn't gotten any 
response.  Does anyone have an example of an mpirun command like that -- say, 
for 4 16-core nodes running 8 MPI processes with 8 threads each, each process 
bound to 8 of the 16 cores on each node?


Re: [OMPI users] handle_wc() in openib and IBV_WC_DRIVER2/MLX5DV_WC_RAW_WQE completion code

2022-02-23 Thread Jeff Squyres (jsquyres) via users
The short answer is likely that UCX and Open MPI v4.1.x is your way forward.

openib has basically been unmaintained for quite a while -- Nvidia (Mellanox) 
made it quite clear long ago that UCX was their path forward.  openib was kept 
around until UCX became stable enough to become the preferred IB network 
transport -- which it now is.  Due to Open MPI's backwards compatibility 
guarantees, we can't remove openib from the 4.0.x and 4.1.x series, but it 
won't be present in the upcoming Open MPI v5.0.x -- IB will be solely supported 
via UCX.

What I suspect you're seeing is that you've got new firmware and/or drivers on 
some nodes, and those are reporting a new opcode error up to Open MPI's old 
openib code.  The openib code hasn't been updated to handle that new opcode, 
and it gets confused and throws an error, and therefore aborts.  UCX and/or 
Open MPI v4.1.x, presumably, have been updated to handle that new opcode, and 
therefore things run smoothly.

This is just an educated guess.  But if you're running in an 
effectively-heterogeneous scenario (i.e., some nodes with old OFED some nodes 
with new MLNX OFED), weird backwards/forwards compatibility issues like this 
can occur.
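
If/when you move everything to Open MPI v4.1.x, you can also make the 
transport selection explicit so that openib never gets involved -- something 
like the following (component names as documented for the 4.1.x series; adapt 
the rest of the command line to your job):

    mpirun --mca pml ucx --mca btl ^openib -np 6 ./your_app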

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Crni Gorac via 
users 
Sent: Tuesday, February 22, 2022 7:37 AM
To: users@lists.open-mpi.org
Cc: Crni Gorac
Subject: [OMPI users] handle_wc() in openib and 
IBV_WC_DRIVER2/MLX5DV_WC_RAW_WQE completion code

We've encountered OpenMPI crashing in handle_wc(), with following error message:
[.../opal/mca/btl/openib/btl_openib_component.c:3610:handle_wc]
Unhandled work completion opcode is 136

Our setup is admittedly a little tricky, but I'm still worried that it
may be a genuine problem, so please bear with me while I try to
explain.  The OpenMPI version is 3.1.2, built from source; here
is the relevant ompi_info excerpt:
 Configure command line: '--prefix=/opt/openmpi/3.1.2'
'--disable-silent-rules' '--with-tm=/opt/pbs' '--enable-static=yes'
'--enable-shared=yes' '--with-cuda'

Our nodes initially had the open-source OFED installed, and then on a
couple of nodes we replaced it with a recent MLNX_OFED (version
5.5-1.0.3.2), with the idea of testing for some time, then upgrading them
all, and then switching to OpenMPI 4.x.  However, the system is still
in use in this intermediate state, and our code sometimes crashes,
with the error message mentioned above.  FWIW, the
configuration used for the runs in question is 2 nodes with 3 MPI ranks
each, and crashes only occur if at least one of the nodes used is among
those upgraded to MLNX_OFED.  We also have OpenMPI 4.1.2,
built after MLNX_OFED was installed, and when our code runs linked with
this version, the crash doesn't occur -- but we've built this one with UCX
(1.12.0) and openib disabled, so the code path for handling this
completion opcode (if it occurs at all) is different.

So when I looked into /usr/include/infiniband/verbs.h, I was able to
see that opcode 136 in this context means IBV_WC_DRIVER2.  However,
this opcode, as well as some other opcodes, is not present in the
/usr/include/infiniband/verbs.h from the open-source OFED installation
that we had used so far.  On the other hand, in /usr/include/infiniband
from MLNX_OFED, there is MLX5DV_WC_RAW_WQE, which is set to
IBV_WC_DRIVER2 in /usr/include/infiniband/mlx5dv.h, so I'm concluding
that the opcode 136 that OpenMPI reports as an error comes from the
MLNX_OFED driver returning MLX5DV_WC_RAW_WQE.

Apparently, handle_wc() in opal/mca/btl/openib/btl_openib_component.c
deals with only 6 completion codes, and reports a fatal error for the
rest of them; this doesn't seem to have changed between OpenMPI 3.1.2
and 4.1.2.   So my question here is: is anyone able to shed some light on
the MLX5DV_WC_RAW_WQE completion code, and what kind of problem could
cause it to be returned?  Or is it really just about us having built OpenMPI
before the MLNX_OFED upgrade, i.e. is it to be expected that with OpenMPI
rebuilt now (with the same configure flags as initially, that means
with openib kept) the problem won't occur?

Thanks.


Re: [OMPI users] Unknown breakdown (Transport retry count exceeded on mlx5_0:1/IB)

2022-02-23 Thread Jeff Squyres (jsquyres) via users
I can't comment much on UCX; you'll need to ask Nvidia for support on that.

But transport retry count exceeded errors mean that the underlying IB network 
tried to send a message a bunch of times but never received the corresponding 
ACK from the receiver indicating that the receiver successfully got the 
message.  From back in my IB days, the typical first place to look for errors 
like this is to check the layer 0 and layer 1 networking with Nvidia-level 
diagnostics to ensure that the network itself is healthy.
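
For example (exact tool availability depends on the OFED / MLNX_OFED install 
on the nodes; these are the usual suspects, not an exhaustive list):

    ibstat            # local HCA / port state, link width and speed
    ibdiagnet         # fabric-wide sweep for bad links and error counters
    ibqueryerrors     # per-port error counters (symbol errors, link downed, ...)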

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Feng Wade via users 

Sent: Saturday, February 19, 2022 4:04 PM
To: users@lists.open-mpi.org
Cc: Feng Wade
Subject: [OMPI users] Unknown breakdown (Transport retry count exceeded on 
mlx5_0:1/IB)

Hi,

Good afternoon.

I am using openmpi/4.0.3 on Compute Canada to do 3D flow simulations. It worked 
quite well for lower Reynolds numbers. However, after increasing it from 3600 
to 9000, openmpi reported errors as shown below:

[gra1288:149104:0:149104] ib_mlx5_log.c:132  Transport retry count exceeded on 
mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[gra1288:149104:0:149104] ib_mlx5_log.c:132  DCI QP 0x2ecc1 wqe[475]: SEND s-e 
[rqpn 0xd7b7 rlid 1406] [va 0x2b6140d4ca80 len 8256 lkey 0x2e1bb1]
 backtrace (tid: 149102) 
 0 0x00020753 ucs_debug_print_backtrace()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/debug/debug.c:653
 1 0x0001dfa8 uct_ib_mlx5_completion_with_err()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/mlx5/ib_mlx5_log.c:132
 2 0x00056fae uct_ib_mlx5_poll_cq()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/mlx5/ib_mlx5.inl:81
 3 0x00056fae uct_dc_mlx5_iface_progress()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/dc/dc_mlx5.c:238
 4 0x000263ca ucs_callbackq_dispatch()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/datastruct/callbackq.h:211
 5 0x000263ca uct_worker_progress()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/api/uct.h:2221
 6 0x000263ca ucp_worker_progress()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucp/core/ucp_worker.c:1951
 7 0x36b7 mca_pml_ucx_progress()  ???:0
 8 0x000566bb opal_progress()  ???:0
 9 0x0007acf5 ompi_request_default_wait()  ???:0
10 0x000b3ad9 MPI_Sendrecv()  ???:0
11 0x9c86 transpose_chunks()  transpose-pairwise.c:0
12 0x9d0f apply()  transpose-pairwise.c:0
13 0x00422b5f channelflow::FlowFieldFD::transposeX1Y0()  ???:0
14 0x00438d50 channelflow::grad_uDalpha()  ???:0
15 0x00434a47 channelflow::VE_NL()  ???:0
16 0x00432783 channelflow::MultistepVEDNSFD::advance()  ???:0
17 0x00413767 main()  ???:0
18 0x00023e1b __libc_start_main()  
/cvmfs/soft.computecanada.ca/gentoo/2020/usr/src/debug/sys-libs/glibc-2.30-r8/glibc-2.30/csu/../csu/libc-start.c:308
19 0x004109aa _start()  ???:0
=
[gra1288:149102] *** Process received signal ***
[gra1288:149102] Signal: Aborted (6)
[gra1288:149102] Signal code:  (-6)
[gra1288:149102] [ 0] 
/cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(+0x38980)[0x2addb0310980]
[gra1288:149102] [ 1] 
/cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(gsignal+0x141)[0x2addb0310901]
[gra1288:149102] [ 2] 
/cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(abort+0x127)[0x2addb02fa56b]
[gra1288:149102] [ 3] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucs.so.0(+0x1f435)[0x2addb6cd7435]
[gra1288:149102] [ 4] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucs.so.0(+0x236b5)[0x2addb6cdb6b5]
[gra1288:149102] [ 5] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucs.so.0(ucs_log_dispatch+0xc9)[0x2addb6cdb7d9]
[gra1288:149102] [ 6] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x528)[0x2addb6ec1fa8]

Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support

2022-02-23 Thread Jeff Squyres (jsquyres) via users
I'd recommend against using Open MPI v3.1.0 -- it's quite old.  If you have to 
use Open MPI v3.1.x, I'd at least suggest using v3.1.6, which has all the 
rolled-up bug fixes on the v3.1.x series.

That being said, Open MPI v4.1.2 is the most current.  Open MPI v4.1.2 does 
restrict which versions of UCX it uses because there are bugs in the older 
versions of UCX.  I am not intimately familiar with UCX -- you'll need to ask 
Nvidia for support there -- but I was under the impression that it's just a 
user-level library, and you could certainly install your own copy of UCX to use 
with your compilation of Open MPI.  I.e., you're not restricted to whatever UCX 
is installed in the cluster system-default locations.

I don't know why you're getting MXM-specific error messages; those don't appear 
to be coming from Open MPI (especially since you configured Open MPI with 
--without-mxm).  If you can upgrade to Open MPI v4.1.2 and the latest UCX, see 
if you are still getting those MXM error messages.

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Angel de Vicente 
via users 
Sent: Friday, February 18, 2022 5:46 PM
To: Gilles Gouaillardet via users
Cc: Angel de Vicente
Subject: Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support

Hello,

Gilles Gouaillardet via users  writes:

> Infiniband detection likely fails before checking expanded verbs.

thanks for this. In the end, after playing a bit with different options,
I managed to install OpenMPI 3.1.0 OK on our cluster using UCX (I wanted
4.1.1, but that would not compile cleanly with the old version of UCX
that is installed in the cluster). The configure command line (as
reported by ompi_info) was:

,
|   Configure command line: 
'--prefix=/storage/projects/can30/angelv/spack/opt/spack/linux-sles12-sandybridge/gcc-9.3.0/openmpi-3.1.0-g5a7szwxcsgmyibqvwwavfkz5b4i2ym7'
|   '--enable-shared' '--disable-silent-rules'
|   '--disable-builtin-atomics' '--with-pmi=/usr'
|   
'--with-zlib=/storage/projects/can30/angelv/spack/opt/spack/linux-sles12-sandybridge/gcc-9.3.0/zlib-1.2.11-hrstx5ffrg4f4k3xc2anyxed3mmgdcoz'
|   '--without-knem' '--with-hcoll=/opt/mellanox/hcoll'
|   '--without-psm' '--without-ofi' '--without-cma'
|   '--with-ucx=/opt/ucx' '--without-fca'
|   '--without-mxm' '--without-verbs' '--without-xpmem'
|   '--without-psm2' '--without-alps' '--without-lsf'
|   '--without-sge' '--with-slurm' '--without-tm'
|   '--without-loadleveler' '--disable-memchecker'
|   
'--with-hwloc=/storage/projects/can30/angelv/spack/opt/spack/linux-sles12-sandybridge/gcc-9.3.0/hwloc-1.11.13-kpjkidab37wn25h2oyh3eva43ycjb6c5'
|   '--disable-java' '--disable-mpi-java'
|   '--without-cuda' '--enable-wrapper-rpath'
|   '--disable-wrapper-runpath' '--disable-mpi-cxx'
|   '--disable-cxx-exceptions'
|   
'--with-wrapper-ldflags=-Wl,-rpath,/storage/projects/can30/angelv/spack/opt/spack/linux-sles12-sandybridge/gcc-7.2.0/gcc-9.3.0-ghr2jekwusoa4zip36xsa3okgp3bylqm/lib/gcc/x86_\
| 64-pc-linux-gnu/9.3.0
|   
-Wl,-rpath,/storage/projects/can30/angelv/spack/opt/spack/linux-sles12-sandybridge/gcc-7.2.0/gcc-9.3.0-ghr2jekwusoa4zip36xsa3okgp3bylqm/lib64'
`


The versions that I'm using are:

gcc:   9.3.0
mxm:   3.6.3102  (though I configure OpenMPI --without-mxm)
hcoll: 3.8.1649
knem:  1.1.2.90mlnx2 (though I configure OpenMPI --without-knem)
ucx:   1.2.2947
slurm: 18.08.7


It looks like everything executes fine, but I have a couple of warnings,
and I'm not sure how much I should worry and what I could do about them:

1) Conflicting CPU frequencies detected:

[1645221586.038838] [s01r3b78:11041:0] sys.c:744  MXM  WARN  
Conflicting CPU frequencies detected, using: 3151.41
[1645221585.740595] [s01r3b79:11484:0] sys.c:744  MXM  WARN  
Conflicting CPU frequencies detected, using: 2998.76

2) Won't use knem. In a previous attempt I specified --with-knem, but
I got this warning about not being able to open /dev/knem. I guess our
cluster is not properly configured w.r.t. knem, so I rebuilt OpenMPI
with --without-knem, but I still get this message?

[1645221587.091122] [s01r3b74:9054 :0] shm.c:65   MXM  WARN  Could not 
open the KNEM device file at /dev/knem : No such file or directory. Won't use 
knem.
[1645221587.104807] [s01r3b76:8610 :0] shm.c:65   MXM  WARN  Could not 
open the KNEM device file at /dev/knem : No such file or directory. Won't use 
knem.


Any help/pointers appreciated. Many thanks,
--
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/p

Re: [OMPI users] Building Open MPI without zlib: what might go wrong/different?

2022-01-31 Thread Jeff Squyres (jsquyres) via users
It's used for compressing the startup time messages in PMIx.  I.e., the traffic 
for when you "mpirun ...".

It's mostly beneficial when launching very large MPI jobs.  If you're only 
launching across several nodes, the performance improvement isn't really 
noticeable.

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Matt Thompson via 
users 
Sent: Monday, January 31, 2022 10:53 AM
To: Open MPI Users
Cc: Matt Thompson
Subject: [OMPI users] Building Open MPI without zlib: what might go 
wrong/different?

Open MPI List,

Recently in trying to build some libraries with NVHPC + Open MPI, I hit an 
error building HDF5 where it died at configure time saying that the zlib that 
Open MPI wanted to link to (my system one) was incompatible with the zlib I 
built in my libraries leading up to HDF5. So, in the end I "fixed" my issue by 
adding:

--without-zlib

to my configure line for Open MPI and rebuilt. And hey, it worked. HDF5 built. 
And Hello world still works as well.

But I'm now wondering: what might I be missing now? Zlib isn't required by the 
MPI Standard (as far as I can tell), so I'm guessing it's not functionality but 
rather performance?

Just curious,
Matt

--
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna 
Rampton


Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Jeff Squyres (jsquyres) via users
This is part of the challenge of HPC: there are general solutions, but no 
specific silver bullet that works in all scenarios.  In short: everyone's setup 
is different.  So we can offer advice, but not necessarily a 100%-guaranteed 
solution that will work in your environment.

In general, we advise users to:

* Configure/build Open MPI with their favorite compilers (either a proprietary 
compiler or modern/recent GCC or clang).  More recent compilers tend to give 
better performance than older compilers.
* Configure/build Open MPI against the communication library for your HPC 
interconnect.  These days, it's mostly Libfabric or UCX.  If you have 
InfiniBand, it's UCX.

That's probably table stakes right there; you can tweak more beyond that, but 
with those 2 things, you'll go pretty far.

FWIW, we did a series of talks about Open MPI in conjunction with the EasyBuild 
community recently.  The videos and slides are available here:

* https://www.open-mpi.org/video/?category=general#abcs-of-open-mpi-part-1
* https://www.open-mpi.org/video/?category=general#abcs-of-open-mpi-part-2
* https://www.open-mpi.org/video/?category=general#abcs-of-open-mpi-part-3

For a beginner, parts 1 and 2 are probably the most relevant, and you can 
probably skip the parts about PMIx (circle back to that later for more advanced 
knowledge).

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Diego Zuccato via 
users 
Sent: Thursday, January 27, 2022 2:59 AM
To: users@lists.open-mpi.org
Cc: Diego Zuccato
Subject: Re: [OMPI users] RES: OpenMPI - Intel MPI

Sorry for the noob question, but: what should I configure for OpenMPI
"to perform on the host cluster"? Any link to a guide would be welcome!

Slightly extended rationale for the question: I'm currently using
"unconfigured" Debian packages and getting some strange behaviour...
Maybe it's just something that a little tuning can fix easily.

On 27/01/2022 07:58, Ralph Castain via users wrote:
> I'll disagree a bit there. You do want to use an MPI library in your
> container that is configured to perform on the host cluster. However,
> that doesn't mean you are constrained as Gilles describes. It takes a
> little more setup knowledge, true, but there are lots of instructions
> and knowledgeable people out there to help. Experiments have shown that
> using non-system MPIs provides at least equivalent performance to the
> native MPIs when configured. Matching the internal/external MPI
> implementations may simplify the mechanics of setting it up, but it is
> definitely not required.
>
>
>> On Jan 26, 2022, at 8:55 PM, Gilles Gouaillardet via users
>> mailto:users@lists.open-mpi.org>> wrote:
>>
>> Brian,
>>
>> FWIW
>>
>> Keep in mind that when running a container on a supercomputer, it is
>> generally recommended to use the supercomputer MPI implementation
>> (fine tuned and with support for the high speed interconnect) instead
>> of the one of the container (generally a vanilla MPI with basic
>> support for TCP and shared memory).
>> That scenario implies several additional constraints, and one of them
>> is the MPI library of the host and the container are (oversimplified)
>> ABI compatible.
>>
>> In your case, you would have to rebuild your container with MPICH
>> (instead of Open MPI) so it can be "substituted" at run time with
>> Intel MPI (MPICH based and ABI compatible).
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thu, Jan 27, 2022 at 1:07 PM Brian Dobbins via users
>> mailto:users@lists.open-mpi.org>> wrote:
>>
>>
>> Hi Ralph,
>>
>>   Thanks for the explanation - in hindsight, that makes perfect
>> sense, since each process is operating inside the container and
>> will of course load up identical libraries, so data types/sizes
>> can't be inconsistent.  I don't know why I didn't realize that
>> before.  I imagine the past issues I'd experienced were just due
>> to the PMI differences in the different MPI implementations at the
>> time.  I owe you a beer or something at the next in-person SC
>> conference!
>>
>>   Cheers,
>>   - Brian
>>
>>
>> On Wed, Jan 26, 2022 at 4:54 PM Ralph Castain via users
>> mailto:users@lists.open-mpi.org>> wrote:
>>
>> There is indeed an ABI difference. However, the _launcher_
>> doesn't have anything to do with the MPI library. All that is
>> needed is a launcher that can provide the key exchange
>> required to wireup the MPI processes. At this point, both
>> MPICH and OMPI have PMIx support, so you can use the same
>> launcher for both. IMPI does not, and so the IMPI launcher
>> will only support PMI-1 or PMI-2 (I forget which one).
>>
>> You can, however, work around that problem. For example, if
>> the host system is using Slurm, then you could "srun" the
>> containers and let Slurm perform the wireup. Again, you'd have
>> to ensure that OMPI was built to sup

Re: [OMPI users] Gadget2 error 818 when using more than 1 process?

2022-01-27 Thread Jeff Squyres (jsquyres) via users
I'm afraid that without any further details, it's hard to help. I don't know 
why Gadget2 would complain about its parameters file.  From what you've stated, 
it could be a problem with the application itself.

Have you talked to the Gadget2 authors?

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Diego Zuccato via 
users 
Sent: Wednesday, January 26, 2022 2:06 AM
To: users@lists.open-mpi.org
Cc: Diego Zuccato
Subject: Re: [OMPI users] Gadget2 error 818 when using more than 1 process?

On 26/01/2022 02:10, Jeff Squyres (jsquyres) via users wrote:

> I'm afraid I don't know anything about Gadget, so I can't comment there.  How 
> exactly does the application fail?
Neither did I :(
It fails saying a 'timestep' is 0, and that's usually caused by an error
in the parameters file. But the parameters file is OK, and it actually
works if the user runs it in a single process. Or even with
multithreaded runs, sometimes and on some nodes. That's quite random :(
But the runs are usually single-node (simple examples for students).

> Can you try upgrading to Open MPI v4.1.2?
That would be a real mess. I'm stuck with packages provided by Debian
stable. I lack both the manpower and the knowledge to compile everything
from scratch, given the intricate relations between slurm, openmpi,
infiniband, etc. :(

> What networking are you using?
Infiniband (Mellanox cards, w/ Debian-supplied drivers and support
programs) and ethernet. Infiniband is also used by IPoIB to reach the
storage servers (gluster). Some nodes lacks IB, so access to the storage
is achieved by a couple of iptables rules.

> 
> From: users  on behalf of Diego Zuccato via 
> users 
> Sent: Tuesday, January 25, 2022 5:43 AM
> To: Open MPI Users
> Cc: Diego Zuccato
> Subject: [OMPI users] Gadget2 error 818 when using more than 1 process?
>
> Hello all.
>
> A user of our cluster is experiencing a weird problem that I can't pinpoint.
>
> He does have a job script that worked well on every node. It's based on
> Gadget2.
>
> Lately, *sometimes*, the same executable with the same parameters file
> works, sometimes it fails. On the same node and submitting with the same
> command. On some nodes it always fails. But if it gets reduced to
> sequential (asking for just one process), it completes correctly (so the
> parameters file, common source of Gadget2 error 818, seems innocent).
>
> The cluster uses SLURM and limits resources using cgroups, if that matters.
>
> Seems most of the issues started after upgrading from openmpi 3.1.3 to
> 4.1.0 in september.
>
> Maybe related, the nodes started spitting out these warnings (that IIUC
> should be harmless... but I'd like to debug & resolve anyway):
> -8<--
> Open MPI's OFI driver detected multiple equidistant NICs from the
> current process, but had insufficient information to ensure MPI
> processes fairly pick a NIC for use.
> This may negatively impact performance. A more modern PMIx server is
> necessary to resolve this issue.
> -8<--
>
> Code is run (from the jobfile) with:
> srun --mpi=pmix_v4 ./Gadget2 paramfile
> (we also tried with a simple mpirun w/ no extra parameters leveraging
> SLURM's integration/autodetection -- same result)
>
> Any hints?
>
> TIA
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


Re: [OMPI users] Gadget2 error 818 when using more than 1 process?

2022-01-25 Thread Jeff Squyres (jsquyres) via users
I'm afraid I don't know anything about Gadget, so I can't comment there.  How 
exactly does the application fail?

Can you try upgrading to Open MPI v4.1.2?

What networking are you using?

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Diego Zuccato via 
users 
Sent: Tuesday, January 25, 2022 5:43 AM
To: Open MPI Users
Cc: Diego Zuccato
Subject: [OMPI users] Gadget2 error 818 when using more than 1 process?

Hello all.

A user of our cluster is experiencing a weird problem that I can't pinpoint.

He does have a job script that worked well on every node. It's based on
Gadget2.

Lately, *sometimes*, the same executable with the same parameters file
works, sometimes it fails. On the same node and submitting with the same
command. On some nodes it always fails. But if it gets reduced to
sequential (asking for just one process), it completes correctly (so the
parameters file, common source of Gadget2 error 818, seems innocent).

The cluster uses SLURM and limits resources using cgroups, if that matters.

Seems most of the issues started after upgrading from openmpi 3.1.3 to
4.1.0 in september.

Maybe related, the nodes started spitting out these warnings (that IIUC
should be harmless... but I'd like to debug & resolve anyway):
-8<--
Open MPI's OFI driver detected multiple equidistant NICs from the
current process, but had insufficient information to ensure MPI
processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is
necessary to resolve this issue.
-8<--

Code is run (from the jobfile) with:
srun --mpi=pmix_v4 ./Gadget2 paramfile
(we also tried with a simple mpirun w/ no extra parameters leveraging
SLURM's integration/autodetection -- same result)

Any hints?

TIA

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2

2022-01-04 Thread Jeff Squyres (jsquyres) via users
Thanks Paul!

I do not doubt that our configury has some not-quite-perfect Fortran tests; I 
know enough Fortran to be dangerous -- I am definitely not a Fortran expert.

Themos from NAG has been iterating with us on 
https://github.com/open-mpi/ompi/pull/9812 -- we're closer, but we haven't 
fixed everything yet.


--
Jeff Squyres
jsquy...@cisco.com


From: users on behalf of Paul Kapinos via users
Sent: Tuesday, January 4, 2022 4:27 AM
To: Jeff Squyres (jsquyres) via users
Cc: Paul Kapinos
Subject: Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2

Dear Jeff,
I should like to point out that the NAG Fortran compiler is [and likely their
developers are] the most picky and overly didactic Fortran compiler [developers]
I know.

(I have worked closely with more than 5 vendors and dozens of compiler versions, and I
reported some 200 bugs during the early development of the Mercurium Fortran
compiler https://github.com/bsc-pm/mcxx and dozens to Intel's 'ifort' - sorry for
praising myself :-)

In about 5 cases I firmly believed 'that is a bug in the NAG compiler!'
because it did not compile code accepted (and often working!) by all other
compilers - intel, gfortran, Sun/Oracle studio, PGI... Then I tried opening a
case with NAG (once or twice, IIRC) and reading the Fortran language
standard, and in *all* cases - without exception! - NAG's interpretation of
the standard was the *right* one. (I cannot say the same about gfortran and intel,
by the way.)

So these guys may be snarky, but they definitely know their Fortran. And if the Open MPI
bindings can be compiled by this compiler, they are likely to be very
standard-conforming.

Have a nice day and a nice year 2022,

Paul Kapinos



On 12/30/21 16:27, Jeff Squyres (jsquyres) via users wrote:
> Snarky comments from the NAG tech support people aside, if they could be a 
> little more specific about what non-conformant Fortran code they're referring 
> to, we'd be happy to work with them to get it fixed.
>
> I'm one of the few people in the Open MPI dev community who has a clue about 
> Fortran, and I'm *very far* from being a Fortran expert.  Modern Fortran is a 
> legitimately complicated language.  So it doesn't surprise me that we might 
> have some code in our configure tests that isn't quite right.
>
> Let's also keep in mind that the state of F2008 support varies widely across 
> compilers and versions.  The current Open MPI configure tests straddle the 
> line of trying to find *enough* F2008 support in a given compiler to be 
> sufficient for the mpi_f08 module without being so overly proscriptive as to 
> disqualify compilers that aren't fully F2008-compliant.  Frankly, the state 
> of F2008 support across the various Fortran compilers was a mess when we 
> wrote those configure tests; we had to cobble together a variety of 
> complicated tests to figure out if any given compiler supported enough F2008 
> support for some / all of the mpi_f08 module.  That's why the configure tests 
> are... complicated.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> 
> From: users  on behalf of Matt Thompson via 
> users 
> Sent: Thursday, December 23, 2021 11:41 AM
> To: Wadud Miah
> Cc: Matt Thompson; Open MPI Users
> Subject: Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2
>
> I heard back from NAG:
>
> Regarding OpenMPI, we have attempted the build ourselves but cannot make 
> sense of the configure script. Only the OpenMPI maintainers can do something 
> about that and it looks like they assume that all compilers will just swallow 
> non-conforming Fortran code. The error downgrading options for NAG compiler 
> remain "-dusty", "-mismatch" and "-mismatch_all" and none of them seem to 
> help with the mpi_f08 module of OpenMPI. If there is a bug in the NAG Fortran 
> Compiler that is responsible for this, we would love to hear about it, but at 
> the moment we are not aware of such.
>
> So it might mean the configure script itself might need to be altered to use 
> F2008 conforming code?
>
> On Thu, Dec 23, 2021 at 8:31 AM Wadud Miah 
> mailto:wmiah...@gmail.com>> wrote:
> You can contact NAG support at supp...@nag.co.uk<mailto:supp...@nag.co.uk> 
> but they will look into this in the new year.
>
> Regards,
>
> On Thu, 23 Dec 2021, 13:18 Matt Thompson via users, 
> mailto:users@lists.open-mpi.org>> wrote:
> Oh. Yes, I am on macOS. The Linux cluster I work on doesn't have NAG 7.1 on 
> it...mainly because I haven't asked for it. Until NAG fix the bug we are 
> seeing, I figured why bother the admins.
>
> Still, it does *seem* like it should work. I might ask NAG su

Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2

2021-12-30 Thread Jeff Squyres (jsquyres) via users
I filed https://github.com/open-mpi/ompi/issues/9795 to track the issue; let's 
followup there.

I tried to tag everyone on this thread; feel free to subscribe to the issue if 
I didn't guess your github ID properly.

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Jeff Squyres 
(jsquyres) via users 
Sent: Thursday, December 30, 2021 4:39 PM
To: Matt Thompson
Cc: Jeff Squyres (jsquyres); Open MPI Users
Subject: Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2

Sweet; thanks!

The top-level Fortran test is here: 
https://github.com/open-mpi/ompi/blob/master/config/ompi_setup_mpi_fortran.m4

That file invokes a lot of subtests, all of which are named 
config/ompi_fortran_*.m4.

People who aren't familiar with the GNU Autotools may make the mistake of 
trying to read the configure script itself.  But that's generated code, and 
pretty impossible to read.  Perhaps this is what the NAG people did...?  While 
m4 isn't a picnic to read, it should be quite a bit more readable than the 
configure script itself.

Finally, the generated config.log file itself should have at least a decent 
amount of information in terms of stdout / stderr from running each test.  If 
there's a test that should be passing that isn't, config.log is a good place to 
start.  It will show some level of detail about what test failed and why, and 
with some creative grepping, you should be able to find the corresponding .m4 
file for the test source code.

--
Jeff Squyres
jsquy...@cisco.com


From: Matt Thompson 
Sent: Thursday, December 30, 2021 4:01 PM
To: Jeff Squyres (jsquyres)
Cc: Wadud Miah; Open MPI Users
Subject: Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2

Jeff,

I'll take a look when I'm back at work next week. I work with someone on the 
Fortran Standards Committee, so if I can find the code, we can probably figure 
out how to fix it.

That said, I know just enough Autotools to cause massive damage and fix
minor bugs. Can you give me a pointer as to where to look for the Fortran tests
the configure script runs? conftest.f90 is the "generic" name I assume 
Autotools uses for tests, so I'm guessing there is an... m4 script somewhere 
generating it? In config/ maybe?

Matt

On Thu, Dec 30, 2021 at 10:27 AM Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:
Snarky comments from the NAG tech support people aside, if they could be a 
little more specific about what non-conformant Fortran code they're referring 
to, we'd be happy to work with them to get it fixed.

I'm one of the few people in the Open MPI dev community who has a clue about 
Fortran, and I'm *very far* from being a Fortran expert.  Modern Fortran is a 
legitimately complicated language.  So it doesn't surprise me that we might 
have some code in our configure tests that isn't quite right.

Let's also keep in mind that the state of F2008 support varies widely across 
compilers and versions.  The current Open MPI configure tests straddle the line 
of trying to find *enough* F2008 support in a given compiler to be sufficient 
for the mpi_f08 module without being so overly proscriptive as to disqualify 
compilers that aren't fully F2008-compliant.  Frankly, the state of F2008 
support across the various Fortran compilers was a mess when we wrote those 
configure tests; we had to cobble together a variety of complicated tests to 
figure out if any given compiler supported enough F2008 support for some / all 
of the mpi_f08 module.  That's why the configure tests are... complicated.

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>


From: users 
mailto:users-boun...@lists.open-mpi.org>> on 
behalf of Matt Thompson via users 
mailto:users@lists.open-mpi.org>>
Sent: Thursday, December 23, 2021 11:41 AM
To: Wadud Miah
Cc: Matt Thompson; Open MPI Users
Subject: Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2

I heard back from NAG:

Regarding OpenMPI, we have attempted the build ourselves but cannot make sense 
of the configure script. Only the OpenMPI maintainers can do something about 
that and it looks like they assume that all compilers will just swallow 
non-conforming Fortran code. The error downgrading options for NAG compiler 
remain "-dusty", "-mismatch" and "-mismatch_all" and none of them seem to help 
with the mpi_f08 module of OpenMPI. If there is a bug in the NAG Fortran 
Compiler that is responsible for this, we would love to hear about it, but at 
the moment we are not aware of such.

So it might mean the configure script itself might need to be altered to use 
F2008 conforming code?

On Thu, Dec 23, 2021 at 8:31 AM Wadud Miah 
mailto:wmiah...@gmail.com><mailto:wmiah...@gmail.com<mailto

Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2

2021-12-30 Thread Jeff Squyres (jsquyres) via users
Sweet; thanks!

The top-level Fortran test is here: 
https://github.com/open-mpi/ompi/blob/master/config/ompi_setup_mpi_fortran.m4

That file invokes a lot of subtests, all of which are named 
config/ompi_fortran_*.m4.

People who aren't familiar with the GNU Autotools may make the mistake of 
trying to read the configure script itself.  But that's generated code, and 
pretty impossible to read.  Perhaps this is what the NAG people did...?  While 
m4 isn't a picnic to read, it should be quite a bit more readable than the 
configure script itself.

Finally, the generated config.log file itself should have at least a decent 
amount of information in terms of stdout / stderr from running each test.  If 
there's a test that should be passing that isn't, config.log is a good place to 
start.  It will show some level of detail about what test failed and why, and 
with some creative grepping, you should be able to find the corresponding .m4 
file for the test source code.

-- 
Jeff Squyres
jsquy...@cisco.com


From: Matt Thompson 
Sent: Thursday, December 30, 2021 4:01 PM
To: Jeff Squyres (jsquyres)
Cc: Wadud Miah; Open MPI Users
Subject: Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2

Jeff,

I'll take a look when I'm back at work next week. I work with someone on the 
Fortran Standards Committee, so if I can find the code, we can probably figure 
out how to fix it.

That said, I know just enough Autotools to cause massive damage and fix
minor bugs. Can you give me a pointer as to where to look for the Fortran tests
the configure script runs? conftest.f90 is the "generic" name I assume 
Autotools uses for tests, so I'm guessing there is an... m4 script somewhere 
generating it? In config/ maybe?

Matt

On Thu, Dec 30, 2021 at 10:27 AM Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:
Snarky comments from the NAG tech support people aside, if they could be a 
little more specific about what non-conformant Fortran code they're referring 
to, we'd be happy to work with them to get it fixed.

I'm one of the few people in the Open MPI dev community who has a clue about 
Fortran, and I'm *very far* from being a Fortran expert.  Modern Fortran is a 
legitimately complicated language.  So it doesn't surprise me that we might 
have some code in our configure tests that isn't quite right.

Let's also keep in mind that the state of F2008 support varies widely across 
compilers and versions.  The current Open MPI configure tests straddle the line 
of trying to find *enough* F2008 support in a given compiler to be sufficient 
for the mpi_f08 module without being so overly proscriptive as to disqualify 
compilers that aren't fully F2008-compliant.  Frankly, the state of F2008 
support across the various Fortran compilers was a mess when we wrote those 
configure tests; we had to cobble together a variety of complicated tests to 
figure out if any given compiler supported enough F2008 support for some / all 
of the mpi_f08 module.  That's why the configure tests are... complicated.

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>


From: users 
mailto:users-boun...@lists.open-mpi.org>> on 
behalf of Matt Thompson via users 
mailto:users@lists.open-mpi.org>>
Sent: Thursday, December 23, 2021 11:41 AM
To: Wadud Miah
Cc: Matt Thompson; Open MPI Users
Subject: Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2

I heard back from NAG:

Regarding OpenMPI, we have attempted the build ourselves but cannot make sense 
of the configure script. Only the OpenMPI maintainers can do something about 
that and it looks like they assume that all compilers will just swallow 
non-conforming Fortran code. The error downgrading options for NAG compiler 
remain "-dusty", "-mismatch" and "-mismatch_all" and none of them seem to help 
with the mpi_f08 module of OpenMPI. If there is a bug in the NAG Fortran 
Compiler that is responsible for this, we would love to hear about it, but at 
the moment we are not aware of such.

So it might mean the configure script itself might need to be altered to use 
F2008 conforming code?

On Thu, Dec 23, 2021 at 8:31 AM Wadud Miah 
mailto:wmiah...@gmail.com><mailto:wmiah...@gmail.com<mailto:wmiah...@gmail.com>>>
 wrote:
You can contact NAG support at 
supp...@nag.co.uk<mailto:supp...@nag.co.uk><mailto:supp...@nag.co.uk<mailto:supp...@nag.co.uk>>
 but they will look into this in the new year.

Regards,

On Thu, 23 Dec 2021, 13:18 Matt Thompson via users, 
mailto:users@lists.open-mpi.org><mailto:users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>>>
 wrote:
Oh. Yes, I am on macOS. The Linux cluster I work on doesn't have NAG 7.1 on 
it...mainly because I haven

Re: [OMPI users] Mac OS + openmpi-4.1.2 + intel oneapi

2021-12-30 Thread Jeff Squyres (jsquyres) via users
Fair enough.

For the moment, then, we should probably just document the workaround.  I'll 
add it to README.md for the 4.0.x/4.1.x series and the upcoming 5.0 RST-based 
docs.

I wasn't too excited about making a patch for Libtool -- such that the 
workaround wouldn't be needed -- because that process is fairly tricky, and 
somewhat fragile (because we have to patch Libtool _after_ it is created).  But 
if someone wants to make a PR, we can evaluate it.

--
Jeff Squyres
jsquy...@cisco.com


From: Matt Thompson 
Sent: Thursday, December 30, 2021 3:55 PM
To: Jeff Squyres (jsquyres)
Cc: Open MPI Users; Christophe Peyret
Subject: Re: [OMPI users] Mac OS + openmpi-4.1.2 + intel oneapi

Jeff,

I'm not sure it'll happen. For understandable reasons (for Intel), I think 
Intel is not putting too much emphasis on supporting macOS. I guess since I had 
a workaround I didn't press them. (Maybe the workaround has performance issues? 
I don't know, but I only ever run with macOS on laptops, so performance isn't 
primary for me yet.)

On Thu, Dec 30, 2021 at 10:15 AM Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:
The conclusion we came to on that issue was that this was an issue with Intel 
ifort.  Was anyone able to raise this with Intel ifort tech support?

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>


From: users 
mailto:users-boun...@lists.open-mpi.org>> on 
behalf of Matt Thompson via users 
mailto:users@lists.open-mpi.org>>
Sent: Thursday, December 30, 2021 9:56 AM
To: Open MPI Users
Cc: Matt Thompson; Christophe Peyret
Subject: Re: [OMPI users] Mac OS + openmpi-4.1.2 + intel oneapi

Oh yeah. I know that error. This is due to a long standing issue with Intel on 
macOS and Open MPI:

https://github.com/open-mpi/ompi/issues/7615

You need to configure Open MPI with "lt_cv_ld_force_load=no" at the beginning. 
(You can see an example at the top of my modulefile here: 
https://github.com/mathomp4/parcelmodulefiles/blob/main/Compiler/intel-clang-2022.0.0/openmpi/4.1.2.lua)

Matt

On Thu, Dec 30, 2021 at 5:47 AM Christophe Peyret via users 
mailto:users@lists.open-mpi.org><mailto:users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>>>
 wrote:

Hello,

I have built openmpi-4.1.2 with the latest Intel oneAPI compilers, including Fortran,

but I am facing problems at compile time:


mpif90 toto.f90

Undefined symbols for architecture x86_64:

  "_ompi_buffer_detach_f08", referenced from:

  import-atom in libmpi_usempif08.dylib

ld: symbol(s) not found for architecture x86_64

library libmpi_usempif08.dylib is present in $MPI_DIR/lib


mpif90 -showme

ifort -I/Users/chris/Applications/Intel/openmpi-4.1.2/include 
-Wl,-flat_namespace -Wl,-commons,use_dylibs 
-I/Users/chris/Applications/Intel/openmpi-4.1.2/lib 
-L/Users/chris/Applications/Intel/openmpi-4.1.2/lib -lmpi_usempif08 
-lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi


if I remove -lmpi_usempif08 from that command line it works !

ifort -I/Users/chris/Applications/Intel/openmpi-4.1.2/include 
-Wl,-flat_namespace -Wl,-commons,use_dylibs 
-I/Users/chris/Applications/Intel/openmpi-4.1.2/lib 
-L/Users/chris/Applications/Intel/openmpi-4.1.2/lib  -lmpi_usempi_ignore_tkr 
-lmpi_mpifh -lmpi toto.f90


And program runs:

mpirun -n 4 a.out

rank=2/4

rank=3/4

rank=0/4

rank=1/4


Appendix: the program

program toto
  use mpi
  implicit none
  integer :: i
  integer :: comm,rank,size,ierror
  call mpi_init(ierror)
  comm=MPI_COMM_WORLD
  call mpi_comm_rank(comm, rank, ierror)
  call mpi_comm_size(comm, size, ierror)
  print '("rank=",i0,"/",i0)',rank,size
  call mpi_finalize(ierror)
end program toto


--

Christophe Peyret

ONERA/DAAA/NFLU

29 ave de la Division Leclerc
F92322 Châtillon Cedex


--
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna 
Rampton


--
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna 
Rampton


Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2

2021-12-30 Thread Jeff Squyres (jsquyres) via users
Snarky comments from the NAG tech support people aside, if they could be a 
little more specific about what non-conformant Fortran code they're referring 
to, we'd be happy to work with them to get it fixed.

I'm one of the few people in the Open MPI dev community who has a clue about 
Fortran, and I'm *very far* from being a Fortran expert.  Modern Fortran is a 
legitimately complicated language.  So it doesn't surprise me that we might 
have some code in our configure tests that isn't quite right.

Let's also keep in mind that the state of F2008 support varies widely across 
compilers and versions.  The current Open MPI configure tests straddle the line 
of trying to find *enough* F2008 support in a given compiler to be sufficient 
for the mpi_f08 module without being so overly proscriptive as to disqualify 
compilers that aren't fully F2008-compliant.  Frankly, the state of F2008 
support across the various Fortran compilers was a mess when we wrote those 
configure tests; we had to cobble together a variety of complicated tests to 
figure out if any given compiler supported enough F2008 support for some / all 
of the mpi_f08 module.  That's why the configure tests are... complicated.

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Matt Thompson via 
users 
Sent: Thursday, December 23, 2021 11:41 AM
To: Wadud Miah
Cc: Matt Thompson; Open MPI Users
Subject: Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2

I heard back from NAG:

Regarding OpenMPI, we have attempted the build ourselves but cannot make sense 
of the configure script. Only the OpenMPI maintainers can do something about 
that and it looks like they assume that all compilers will just swallow 
non-conforming Fortran code. The error downgrading options for NAG compiler 
remain "-dusty", "-mismatch" and "-mismatch_all" and none of them seem to help 
with the mpi_f08 module of OpenMPI. If there is a bug in the NAG Fortran 
Compiler that is responsible for this, we would love to hear about it, but at 
the moment we are not aware of such.

So it might mean the configure script itself might need to be altered to use 
F2008 conforming code?

On Thu, Dec 23, 2021 at 8:31 AM Wadud Miah 
mailto:wmiah...@gmail.com>> wrote:
You can contact NAG support at supp...@nag.co.uk but 
they will look into this in the new year.

Regards,

On Thu, 23 Dec 2021, 13:18 Matt Thompson via users, 
mailto:users@lists.open-mpi.org>> wrote:
Oh. Yes, I am on macOS. The Linux cluster I work on doesn't have NAG 7.1 on 
it...mainly because I haven't asked for it. Until NAG fix the bug we are 
seeing, I figured why bother the admins.

Still, it does *seem* like it should work. I might ask NAG support about it.

On Wed, Dec 22, 2021 at 6:28 PM Tom Kacvinsky 
mailto:tkacv...@gmail.com>> wrote:
On Wed, Dec 22, 2021 at 5:45 PM Tom Kacvinsky 
mailto:tkacv...@gmail.com>> wrote:
>
> On Wed, Dec 22, 2021 at 4:11 PM Matt Thompson 
> mailto:fort...@gmail.com>> wrote:
> >
> > All,
> >
> > When I build Open MPI with NAG, I have to pass in:
> >
> >   FCFLAGS"=-mismatch_all -fpp"
> >
> > this flag tells nagfor to downgrade some errors with interfaces to warnings:
> >
> >-mismatch_all
> >  Further downgrade consistency checking of procedure 
> > argument lists so that calls to routines in the same file which are
> >  incorrect will produce warnings instead of error messages. 
> >  This option disables -C=calls.
> >
> > The fpp flag is how you tell NAG to do preprocessing (it doesn't 
> > automatically do it with .F90 files).
> >
> > I also have to pass in a lot of other flags as seen here:
> >
> > https://github.com/mathomp4/parcelmodulefiles/blob/main/Compiler/nag-7.1_7101/openmpi/4.1.2.lua
> >
> > Now I hadn't yet tried NAG 7.1 with Open MPI because NAG 7.1 has a bug with 
> > a library I depend on, but it does promise better F2008 support. To see 
> > what happens, I tried myself and added --enable-mpi-fortran=all, but:
> >
> > checking if building Fortran 'use mpi_f08' bindings... no
> > configure: error: Cannot build requested Fortran bindings, aborting
> >
> > Unfortunately, the NAG Fortran guru I work with is off until the new year. 
> > When he comes back, I might ask him about this. He might know something we 
> > can do to make NAG happy with mpif08.
> >
>
> The very curious thing about this is that with NAG 7.1, mpif08
> configured properly with the macOS (Intel architecture) flavor of
> it.  But as this thread seems to indicate, it barfs on Linux.  Just
> an extra data point.
>

I'd like to recall that statement, I was not looking at the config.log
carefully enough.  I see this still, even on macOS

checking if building Fortran 'use mpi_f08' bindings... no


--
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna 
Rampton


--
Matt Thompson
   

Re: [OMPI users] Mac OS + openmpi-4.1.2 + intel oneapi

2021-12-30 Thread Jeff Squyres (jsquyres) via users
The conclusion we came to on that issue was that this was an issue with Intel 
ifort.  Was anyone able to raise this with Intel ifort tech support?

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Matt Thompson via 
users 
Sent: Thursday, December 30, 2021 9:56 AM
To: Open MPI Users
Cc: Matt Thompson; Christophe Peyret
Subject: Re: [OMPI users] Mac OS + openmpi-4.1.2 + intel oneapi

Oh yeah. I know that error. This is due to a long standing issue with Intel on 
macOS and Open MPI:

https://github.com/open-mpi/ompi/issues/7615

You need to configure Open MPI with "lt_cv_ld_force_load=no" at the beginning. 
(You can see an example at the top of my modulefile here: 
https://github.com/mathomp4/parcelmodulefiles/blob/main/Compiler/intel-clang-2022.0.0/openmpi/4.1.2.lua)

Matt

On Thu, Dec 30, 2021 at 5:47 AM Christophe Peyret via users 
mailto:users@lists.open-mpi.org>> wrote:

Hello,

I have built openmpi-4.1.2 with the latest Intel oneAPI compilers, including Fortran,

but I am facing problems at compile time:


mpif90 toto.f90

Undefined symbols for architecture x86_64:

  "_ompi_buffer_detach_f08", referenced from:

  import-atom in libmpi_usempif08.dylib

ld: symbol(s) not found for architecture x86_64

library libmpi_usempif08.dylib is present in $MPI_DIR/lib


mpif90 -showme

ifort -I/Users/chris/Applications/Intel/openmpi-4.1.2/include 
-Wl,-flat_namespace -Wl,-commons,use_dylibs 
-I/Users/chris/Applications/Intel/openmpi-4.1.2/lib 
-L/Users/chris/Applications/Intel/openmpi-4.1.2/lib -lmpi_usempif08 
-lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi


if I remove -lmpi_usempif08 from that command line it works !

ifort -I/Users/chris/Applications/Intel/openmpi-4.1.2/include 
-Wl,-flat_namespace -Wl,-commons,use_dylibs 
-I/Users/chris/Applications/Intel/openmpi-4.1.2/lib 
-L/Users/chris/Applications/Intel/openmpi-4.1.2/lib  -lmpi_usempi_ignore_tkr 
-lmpi_mpifh -lmpi toto.f90


And program runs:

mpirun -n 4 a.out

rank=2/4

rank=3/4

rank=0/4

rank=1/4


Appendix: the program

program toto
  use mpi
  implicit none
  integer :: i
  integer :: comm,rank,size,ierror
  call mpi_init(ierror)
  comm=MPI_COMM_WORLD
  call mpi_comm_rank(comm, rank, ierror)
  call mpi_comm_size(comm, size, ierror)
  print '("rank=",i0,"/",i0)',rank,size
  call mpi_finalize(ierror)
end program toto


--

Christophe Peyret

ONERA/DAAA/NFLU

29 ave de la Division Leclerc
F92322 Châtillon Cedex


--
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna 
Rampton


Re: [OMPI users] stdout scrambled in file

2021-12-07 Thread Jeff Squyres (jsquyres) via users
Open MPI launches a single "helper" process on each node (in Open MPI <= v4.x, 
that helper process is called "orted").  This process is responsible for 
launching all the individual MPI processes, and it's also responsible for 
capturing all the stdout/stderr from those processes and sending it back to 
mpirun via an out-of-band network message protocol (using TCP sockets).  mpirun 
accepts those network messages and emits them to mpirun's stdout/stderr.

There are multiple places in that pipeline where messages can get fragmented, and 
therefore emitted as incomplete lines (OS stdout/stderr buffering, network MTU 
size, TCP buffering, etc.).
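
Purely as an illustrative sketch (not Open MPI code; the line text and buffer size
are made up), this shows the kind of fragmentation a reader sees when a long line
arrives from a stream socket in pieces:

/* fragment_demo.c: write one long line into a socket, then read it back
 * with a deliberately tiny buffer.  Each read() returns only a fragment,
 * much like a forwarding daemon reading output from a pipe or TCP socket
 * under load; anything printed between fragments can interleave.
 */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        perror("socketpair");
        return 1;
    }

    const char *line = "*EXAMPLE* 1626 7.392E-02 2.470E-01 aerosurfs\n";
    if (write(sv[0], line, strlen(line)) < 0) {
        perror("write");
    }
    close(sv[0]);   /* close the write side so the reader sees EOF */

    char buf[8];
    ssize_t n;
    while ((n = read(sv[1], buf, sizeof(buf))) > 0) {
        printf("fragment of %zd bytes: %.*s\n", n, (int)n, buf);
    }
    close(sv[1]);
    return 0;
}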

This is mainly because we have always assumed that stdout/stderr is not the 
primary work output of an MPI application.  We've seen many MPI applications 
either write their results to stable files or send the results back to a single 
MPI process, who then gathers and emits them (i.e., if there's only 
stdout/stderr coming from a single MPI process, the output won't get 
interleaved with anything else).
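
For example, a minimal sketch of that funneling approach (purely illustrative; the
line format and LINE_LEN are made up): each rank formats its own line, rank 0
gathers the fixed-size strings, and only rank 0 ever writes to stdout:

/* gather_print.c: funnel per-rank status lines through rank 0 so that a
 * single process produces all of the stdout.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define LINE_LEN 128

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank formats its complete line into a fixed-size buffer. */
    char line[LINE_LEN];
    snprintf(line, sizeof(line), "rank %d: residual %e", rank, 1.0 / (rank + 1));

    char *all = NULL;
    if (rank == 0) {
        all = malloc((size_t)size * LINE_LEN);
    }

    /* Collect one fixed-size line per rank onto rank 0. */
    MPI_Gather(line, LINE_LEN, MPI_CHAR, all, LINE_LEN, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++) {
            printf("%s\n", &all[(size_t)i * LINE_LEN]);
        }
        fflush(stdout);
        free(all);
    }

    MPI_Finalize();
    return 0;
}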

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Fisher (US), Mark S 
via users 
Sent: Monday, December 6, 2021 3:45 PM
To: Joachim Protze; Open MPI Users
Cc: Fisher (US), Mark S
Subject: Re: [OMPI users] stdout scrambled in file

This usually happens if we get a number of warning messages from multiple 
processes. Seems like unbuffered is what we want but not sure how this 
interacts with MPI since stdout/stderr is pulled back from different hosts. Not 
sure how you are doing that.

-Original Message-
From: Joachim Protze 
Sent: Monday, December 06, 2021 11:12 AM
To: Fisher (US), Mark S ; Open MPI Users 

Subject: Re: [OMPI users] stdout scrambled in file

I would assume that the buffering mode is compiler/runtime specific. At
least for the Intel compiler, the default seems to be (or have been) unbuffered
for stdout, but there is a flag for buffered output:

https://community.intel.com/t5/Intel-Fortran-Compiler/Enabling-buffered-I-O-to-stdout-with-Intel-ifort-compiler/td-p/993203

In the worst case, each character might be written individually. If the
scrambling only happens from time to time, I guess you really just see
the buffer flush when the buffer filled up.
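
A minimal C sketch of the pattern I mean (illustrative only): format the complete
line into a private buffer, write it with a single call, then flush. This still
gives no ordering guarantee across processes, but it avoids emitting one line in
several chunks:

/* atomic_line.c: build the whole line first, then hand it to stdio in one
 * call and flush immediately, so the line leaves the process in one piece.
 */
#include <stdio.h>

static void print_line(int rank, int iter, double residual)
{
    char buf[256];
    int n = snprintf(buf, sizeof(buf),
                     "rank %d iter %d residual %e\n", rank, iter, residual);
    if (n > 0 && (size_t)n < sizeof(buf)) {
        fwrite(buf, 1, (size_t)n, stdout);
        fflush(stdout);
    }
}

int main(void)
{
    print_line(0, 1626, 7.392e-2);
    return 0;
}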

- Joachim

On 06.12.21 at 16:42, Fisher (US), Mark S wrote:
> All strings are written as one output, so that is not the issue. Adding in 
> some flushing is a good idea and we can try that. We do not open stdout; we just 
> write to unit 6, but we could open it if there is some unbuffered option 
> that could help. I will look into that also.  Thanks!
>
> -Original Message-
> From: Joachim Protze 
> Sent: Monday, December 6, 2021 9:24 AM
> To: Open MPI Users 
> Cc: Fisher (US), Mark S 
> Subject: Re: [OMPI users] stdout scrambled in file
>
> Hi Mark,
>
> "[...] MPI makes neither requirements nor recommendations for the output
> [...]" (MPI4.0, §2.9.1)
>
>   From my experience, an application can avoid such scrambling (still no
> guarantee) if each output line is written atomically. C++ streams
> are the worst for concurrent output, as every stream operator writes a
> chunk. It can help to collect output into a stringstream and print it out
> at once. Using printf in C is typically the least problematic. Flushing the
> buffer (fflush) helps avoid the output buffer filling up and being
> flushed in the middle of printing.
>
> I'm not a Fortran expert. But I think there are some options to
> change to a buffered output mode (at least I found such options for file
> I/O). Again, the goal should be that a write statement is printed at
> once and the buffer doesn't fill up while printing.
>
> In any case, it could help to write warnings to stderr and separate the
> stdout and stderr streams.
>
> Best
> Joachim
>
> On 02.12.21 at 16:48, Fisher (US), Mark S via users wrote:
>> We are using Mellanox HPC-X MPI based on OpenMPI 4.1.1RC1 and having
>> issues with lines scrambling together occasionally. This causes issues for
>> our convergence-checking code since we put convergence data there. We
>> are not using any mpirun options for stdout; we just redirect
>> stdout/stderr to a file before we run the mpirun command, so all output
>> goes there. We had a similar issue with Intel MPI in the past and used the
>> -ordered-output option to fix it, but I do not see any similar option for
>> OpenMPI. See example below. Is there any way to ensure a line from a
>> process gets one line in the output file?
>>
>> *The data in red below is scrambled up and should look like the
>> cleaned-up version. You can see it put a line from a different process
>> inside a line from another process, and the rest of the line ended up a
>> couple of lines down.*
>>
>> ZONE   0 : Min/Max CFL= 5.000E-01 1.500E+01 Min/Max DT= 8.411E-10
>> 1.004E-01 sec
>>
>> *IGSTAB* 1626 7.392E-02 2.470E-01 -9.075E-04 8.607E-03 -5.911E-04
>> -4.945E-06  aerosurfs
>>
>> *IGMNTAERO* 1626 -6.120E-04 1.406E-02 6.395E-0

Re: [OMPI users] stdout scrambled in file

2021-12-05 Thread Jeff Squyres (jsquyres) via users
FWIW: Open MPI 4.1.2 has been released -- you can probably stop using an RC 
release.

I think you're probably running into an issue that is just a fact of life.  
Especially when there's a lot of output simultaneously from multiple MPI 
processes (potentially on different nodes), the stdout/stderr lines can just 
get munged together.

Can you check for convergence a different way?

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Fisher (US), Mark S 
via users 
Sent: Thursday, December 2, 2021 10:48 AM
To: users@lists.open-mpi.org
Cc: Fisher (US), Mark S
Subject: [OMPI users] stdout scrambled in file

We are using Mellanox HPC-X MPI based on OpenMPI 4.1.1RC1 and having issues
with lines scrambling together occasionally. This causes issues for our
convergence-checking code since we put convergence data there. We are not using
any mpirun options for stdout; we just redirect stdout/stderr to a file before
we run the mpirun command, so all output goes there. We had a similar issue with
Intel MPI in the past and used the -ordered-output option to fix it, but I do
not see any similar option for OpenMPI. See example below. Is there any way to
ensure a line from a process gets one line in the output file?


The data in red below is scrambled up and should look like the cleaned-up
version. You can see it put a line from a different process inside a line from
another process, and the rest of the line ended up a couple of lines down.

ZONE   0 : Min/Max CFL= 5.000E-01 1.500E+01 Min/Max DT= 8.411E-10 1.004E-01 sec

*IGSTAB* 1626 7.392E-02 2.470E-01 -9.075E-04 8.607E-03 -5.911E-04 -4.945E-06  
aerosurfs
*IGMNTAERO* 1626 -6.120E-04 1.406E-02 6.395E-04 4.473E-08 3.112E-04 -2.785E-05  
aerosurfs
*IGSTAB* 1626 7.392E-02 2.470E-01 -9.075E-04 8.607E-03 -5.911E-04 -4.945E-06  
Aircraft-Total
*IGMNTAERO* 1626 -6.120E-04 1.406E-02 6.395E-04 4.473E-08 3.112E-04 -2.785E-05  
Aircr Warning: BCFD: US_UPDATEQ: izon, iter, nBadpmin:  699  1625 12
Warning: BCFD: US_UPDATEQ: izon, iter, nBadpmin:  111  1626  6
aft-Total
*IGSTAB* 1626 6.623E-02 2.137E-01 -9.063E-04 8.450E-03 -5.485E-04 -4.961E-06  
Aircraft-OML
*IGMNTAERO* 1626 -6.118E-04 -1.602E-02 6.404E-04 5.756E-08 3.341E-04 -2.791E-05 
 Aircraft-OML


Cleaned up version:

ZONE   0 : Min/Max CFL= 5.000E-01 1.500E+01 Min/Max DT= 8.411E-10 1.004E-01 sec

*IGSTAB* 1626 7.392E-02 2.470E-01 -9.075E-04 8.607E-03 -5.911E-04 -4.945E-06  
aerosurfs
*IGMNTAERO* 1626 -6.120E-04 1.406E-02 6.395E-04 4.473E-08 3.112E-04 -2.785E-05  
aerosurfs
*IGSTAB* 1626 7.392E-02 2.470E-01 -9.075E-04 8.607E-03 -5.911E-04 -4.945E-06  
Aircraft-Total
*IGMNTAERO* 1626 -6.120E-04 1.406E-02 6.395E-04 4.473E-08 3.112E-04 -2.785E-05  
Aircraft-Total
 Warning: BCFD: US_UPDATEQ: izon, iter, nBadpmin:  699  1625 12
Warning: BCFD: US_UPDATEQ: izon, iter, nBadpmin:  111  1626  6
*IGSTAB* 1626 6.623E-02 2.137E-01 -9.063E-04 8.450E-03 -5.485E-04 -4.961E-06  
Aircraft-OML
*IGMNTAERO* 1626 -6.118E-04 -1.602E-02 6.404E-04 5.756E-08 3.341E-04 -2.791E-05 
 Aircraft-OML

Thanks!

