Re: [OMPI users] [EXTERNAL] Confusions on building and running OpenMPI over Slingshot 10 on Cray EX HPC
Hi Jerry,

Cray EX HPC with Slingshot 10 (not 11!) is basically a Mellanox IB cluster using RoCE rather than IB. For this sort of interconnect, don't use OFI; use UCX. UCX 1.12.0 is getting a bit old, though. I'd recommend 1.14.0 or newer, especially if your system has nodes with GPUs.

CXI is the name of the vendor libfabric provider and doesn't function on this system, unless parts of the cluster are wired up with Slingshot 11 NICs. For the node where you ran lspci this doesn't seem to be the case. You'd see something like this if you had Slingshot 11:

27:00.0 Ethernet controller: Cray Inc Device 0501 (rev 02)
a8:00.0 Ethernet controller: Cray Inc Device 0501 (rev 02)

For your first question, double check the final output from a configure run and make sure the summary says UCX support is enabled. Please see https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/ib-and-roce.html for answers to some of your other questions below. Note there are some RoCE-specific items in that doc page you may want to check.

The PMIx Slingshot config option is getting you confused. Just ignore it for this network. I'd suggest tweaking your configure options to the following:

--enable-mpi-fortran \
--enable-shared \
--with-pic \
--with-ofi=no \
--with-ucx=/project/app/ucx/1.12.1 \
--with-pmix=internal \
--with-pbs \
--with-tm=/opt/pbs \
--with-singularity=/project/app/singularity/3.10.3 \
--with-lustre=/usr \
CC=icc \
FC=ifort \
CXX=icpc

This will end up with a build of Open MPI that uses UCX, which is what you want. You are getting the error message from the btl framework because the OFI BTL can't find a suitable/workable OFI provider. If you really need to build with OFI support, add --with-ofi, but set the following MCA parameters (here shown using environment variables) when running applications built using this Open MPI installation:

export OMPI_MCA_pml=ucx
export OMPI_MCA_osc=ucx
export OMPI_MCA_btl=^ofi
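The "double check the final output from a configure run" step above can be scripted. The exact summary wording ("Open UCX: yes") is an assumption that can vary between Open MPI versions, so the sketch below simulates the captured line; on a real build tree you would grep the saved configure output (e.g. `./configure ... | tee configure.out`) instead.

```shell
# Sketch: confirm the configure summary reported UCX support.
summary="Open UCX: yes"   # simulated summary line; wording is an assumption
if echo "$summary" | grep -q 'Open UCX: yes'; then
  echo "UCX support enabled"
else
  echo "UCX support missing - check the --with-ucx path"
fi
```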
Hope this helps,

Howard

From: users on behalf of Jianyu Liu via users
Reply-To: Open MPI Users
Date: Wednesday, May 8, 2024 at 7:41 PM
To: "users@lists.open-mpi.org"
Cc: Jianyu Liu
Subject: [EXTERNAL] [OMPI users] Confusions on building and running OpenMPI over Slingshot 10 on Cray EX HPC

Hi,

I'm trying to build an OpenMPI 5.0.3 environment on the Cray EX HPC with Slingshot 10 support. Generally speaking, there were no error messages while building OpenMPI, and make check also didn't report any failure.

When I tested the OpenMPI environment with a simple 'hello world' MPI Fortran code, it threw out these error messages and caught signal 11 in libucs if '-mca btl ofi' was specified:

No components were able to be opened in the btl framework. This typically means that either no components of this type were installed, or none of the installed components can be loaded. Sometimes this means that shared libraries required by these components are unable to be found/loaded.

Host: x3001c027b4n0
Framework: btl

Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
/project/app/ucx/1.12.1/lib/libucs.so.0 (ucs_handle_error+0x134)

This made me unsure whether I got OpenMPI built with full Slingshot 10 support and running over Slingshot 10 properly. Here is the build environment:
on Cray EX HPC with SLES 15 SP3:
OpenMPI 5.0.3 + Intel 2022.0.2 + UCX 1.12.1 + libfabric 1.11.0.4.125-SSHOT2.0.0 + mlnx-ofed 5.5.1

Here are my configure options:

--enable-mpi-fortran \
--enable-shared \
--with-pic \
--with-ofi=/opt/cray/libfabric/1.11.0.4.125 \
--with-ofi-libdir=/opt/cray/libfabric/1.11.0.4.125/lib64 \
--with-ucx=/project/app/ucx/1.12.1 \
--with-pmix=internal \
--with-slingshot \
--with-pbs \
--with-tm=/opt/pbs \
--with-singularity=/project/app/singularity/3.10.3 \
--with-lustre=/usr \
CC=icc \
FC=ifort \
CXX=icpc

Here is the output of lspci on the compute nodes:

03:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
24:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

Here is what confuses me:

1. After the configuration completed, the PMIx summary didn't say Slingshot support is turned on for the transports.
2. config.log didn't show any checking info against Slingshot while conducting the MCA checks; it just showed that --with-slingshot was passed as an argument.
3. Looking further into the configure script, the only script which checks Slingshot support is 3rd-party/openpmix/src/mca/pnet/sshot/configure.m4, but it looks like it's never called, as config.log didn't show any checks against the appropriate dependencies, such as CXI and JANSSON, and I believe the CXI library is not installed on the machine.

Here are my questions:

1. How it could tell OpenMPI was built with full Slingshot
Re: [OMPI users] [EXTERNAL] Helping interpreting error output
Hi Jeffrey,

I would suggest trying to debug what may be going wrong with UCX on your DGX box. There are several things to try from the UCX FAQ: https://openucx.readthedocs.io/en/master/faq.html

I'd suggest setting the UCX_LOG_LEVEL environment variable to info or debug and seeing whether UCX says something about what's going wrong. Also add --mca plm_base_verbose 10 to the mpirun command line. Have you used DGX boxes with only a single NIC successfully?

Howard

From: users on behalf of Jeffrey Layton via users
Reply-To: Open MPI Users
Date: Tuesday, April 16, 2024 at 12:30 PM
To: Open MPI Users
Cc: Jeffrey Layton
Subject: [EXTERNAL] [OMPI users] Helping interpreting error output

Good afternoon MPI fans of all ages,

Yet again, I'm getting an error that I'm having trouble interpreting. This time, I'm trying to run ior. I've done it a thousand times, but not on an NVIDIA DGX A100 with multiple NICs. The ultimate command is the following:

/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun --mca btl '^openib' -np 4 -map-by ppr:4:node --allow-run-as-root --mca btl_openib_warn_default_gid_prefix 0 --mca btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 --mca plm_base_verbose 0 --mca plm rsh /home/bcm/bin/bin/ior -w -r -z -e -C -t 1m -b 1g -s 1000 -o /mnt/test

It was suggested to me to use these MPI options. The error I get is the following:

--
A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that Open MPI stopped checking at the first component that it did not find.

Host: dgx-02
Framework: pml
Component: ucx
--
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

mca_pml_base_open() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--
[dgx-02:2399932] *** An error occurred in MPI_Init
[dgx-02:2399932] *** reported by process [2099773441,3]
[dgx-02:2399932] *** on a NULL communicator
[dgx-02:2399932] *** Unknown error
[dgx-02:2399932] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dgx-02:2399932] *** and potentially your MPI job)

My first inclination was that it couldn't find UCX, so I loaded that module and re-ran it. I get the exact same error message. I'm still checking whether the ucx module gets loaded when I run via Slurm; mdtest ran without issue, but I'm verifying that.

Any thoughts? Thanks!

Jeff
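Howard's debugging suggestion above can be sketched as a script. The mpirun arguments are taken from the original post; the command is only echoed here so the sketch runs without an MPI installation on hand.

```shell
# Turn up UCX and Open MPI launcher verbosity for a single debug run.
export UCX_LOG_LEVEL=debug   # "info" is a quieter first step
mpirun_cmd="mpirun --mca plm_base_verbose 10 --mca btl ^openib -np 4 -map-by ppr:4:node ior -w -r -o /mnt/test"
echo "would run: UCX_LOG_LEVEL=$UCX_LOG_LEVEL $mpirun_cmd"
```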
Re: [OMPI users] [EXTERNAL] Help deciphering error message
Hello Jeffrey,

A couple of things to try first. Try running without UCX: add --mca pml ^ucx to the mpirun command line. If the app functions without UCX, then the next thing is to see what may be going wrong with UCX and the Open MPI components that use it.

You may want to set the UCX_LOG_LEVEL environment variable to see if Open MPI's UCX PML component is actually able to initialize UCX and start trying to use it. See https://openucx.readthedocs.io/en/master/faq.html for an example of doing this using mpirun and the type of output you should be getting.

Another simple thing to try is

mpirun -np 1 ucx_info -v

and see if you get something like this back on stdout:

# Library version: 1.14.0
# Library path: /usr/lib64/libucs.so.0
# API headers version: 1.14.0
# Git branch '', revision f8877c5
# Configured with: --build=aarch64-redhat-linux-gnu --host=aarch64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-go --without-java --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --with-cuda=/usr/local/cuda-11.7

Are you running the mpirun command on dgx-14? If that's a different host, a likely problem is that for some reason the information from your ucx/1.10.1 module is not getting picked up on dgx-14.

One other thing: if the UCX module name is indicating the version of UCX, it's rather old. I'd suggest, if possible, updating to a newer version, like 1.14.1 or newer.
There are many enhancements in more recent versions of UCX for GPU support, and I would bet you'd want that for your DGX boxes.

Howard

From: users on behalf of Jeffrey Layton via users
Reply-To: Open MPI Users
Date: Thursday, March 7, 2024 at 11:53 AM
To: Open MPI Users
Cc: Jeffrey Layton
Subject: [EXTERNAL] [OMPI users] Help deciphering error message

Good afternoon,

I'm getting an error message I'm not sure how to use to debug an issue. I'll try to give you all of the pertinent details about the setup, but I didn't build the system nor install the software. It's an NVIDIA SuperPod system with Base Command Manager 10.0. I'm building IOR, but I'm really interested in mdtest.

"module list" says I'm using the following modules:

gcc/64/4.1.5a1
ucx/1.10.1
openmpi4/gcc/4.1.5

There are no problems building the code. I'm using Slurm to run mdtest using a script. The output from the script and Slurm is the following (the command to run it is included):

/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun --mca btl '^openib' -np 1 -map-by ppr:1:node --allow-run-as-root --mca btl_openib_warn_default_gid_prefix 0 --mca btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 --mca plm_base_verbose 0 --mca plm rsh /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d /raid/bcm/mdtest

--
A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that Open MPI stopped checking at the first component that it did not find.
Host: dgx-14
Framework: pml
Component: ucx
--
[dgx-14:4055623] [[42340,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[dgx-14:4055632] *** An error occurred in MPI_Init
[dgx-14:4055632] *** reported by process [2774794241,0]
[dgx-14:4055632] *** on a NULL communicator
[dgx-14:4055632] *** Unknown error
[dgx-14:4055632] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dgx-14:4055632] *** and potentially your MPI job)

Any pointers/help is greatly appreciated. Thanks!

Jeff
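A "component ucx ... not found" failure like the one above is often a library path problem rather than a missing component. A minimal sketch for checking whether UCX's core library is resolvable on the failing node (libucs is the standard UCX library name; paths and module setups vary per system):

```shell
# Check whether the dynamic linker can resolve UCX's core library.
if ldconfig -p 2>/dev/null | grep -q libucs; then
  echo "libucs found in ldconfig cache"
else
  echo "libucs not in ldconfig cache - check the ucx module / LD_LIBRARY_PATH"
fi
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
```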
Re: [OMPI users] [EXTERNAL] Re: MPI_Init_thread error
Hi Aziz,

Oh, I see you referenced the FAQ. That section of the FAQ is discussing how to make the Open MPI 4 series (and older) job launcher "know" about the batch scheduler you are using. The relevant section for launching with srun is covered by this FAQ: https://www-lb.open-mpi.org/faq/?category=slurm

Howard

From: "Pritchard Jr., Howard"
Date: Tuesday, July 25, 2023 at 8:26 AM
To: Open MPI Users
Cc: Aziz Ogutlu
Subject: Re: [EXTERNAL] Re: [OMPI users] MPI_Init_thread error

Hi Aziz,

Did you include --with-pmi2 on your Open MPI configure line?

Howard

From: users on behalf of Aziz Ogutlu via users
Organization: Eduline Bilisim
Reply-To: Open MPI Users
Date: Tuesday, July 25, 2023 at 8:18 AM
To: Open MPI Users
Cc: Aziz Ogutlu
Subject: [EXTERNAL] Re: [OMPI users] MPI_Init_thread error

Hi Gilles,

Thank you for your response. When I run srun --mpi=list, I get only pmi2. When I run the command with the --mpi=pmi2 parameter, I get the same error. OpenMPI automatically supports Slurm after the 4.x version: https://www.open-mpi.org/faq/?category=building#build-rte

On 7/25/23 12:55, Gilles Gouaillardet via users wrote:

Aziz,

When using direct run (e.g. srun), OpenMPI has to interact with SLURM. This is typically achieved via PMI2 or PMIx. You can run srun --mpi=list to list the available options on your system. If PMIx is available, you can srun --mpi=pmix ...; if only PMI2 is available, you need to make sure Open MPI was built with SLURM support (e.g. configure --with-slurm ...) and then srun --mpi=pmi2 ...

Cheers,

Gilles

On Tue, Jul 25, 2023 at 5:07 PM Aziz Ogutlu via users <users@lists.open-mpi.org> wrote:

Hi there all,

We're using Slurm 21.08 on a RedHat 7.9 HPC cluster with OpenMPI 4.0.3 + gcc 8.5.0.
When we run the command below to call SU2, we get an error message:

$ srun -p defq --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
$ module load su2/7.5.1
$ SU2_CFD config.cfg

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[cnode003.hpc:17534] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

--
Best regards,
Aziz Öğütlü
Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.
www.eduline.com.tr
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza Kat:6 Ofis No:118
Kağıthane - İstanbul - Türkiye 34406
Tel : +90 212 324 60 61  Cep: +90 541 350 40 72
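Gilles's advice above (prefer --mpi=pmix if the scheduler offers it, fall back to --mpi=pmi2 otherwise) can be sketched as a small shell helper. The srun output is simulated here so the sketch runs off-cluster; on a real system the assignment would read `available=$(srun --mpi=list 2>&1)`.

```shell
# Choose the srun --mpi flag from what the scheduler reports.
available="srun: pmi2"   # simulated `srun --mpi=list` output
case "$available" in
  *pmix*) flag="--mpi=pmix" ;;
  *pmi2*) flag="--mpi=pmi2" ;;   # needs an Open MPI built with SLURM/PMI2 support
  *)      flag="" ;;
esac
echo "launch with: srun $flag ./SU2_CFD config.cfg"
```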
Re: [OMPI users] [EXTERNAL] Re: MPI_Init_thread error
Hi Aziz,

Did you include --with-pmi2 on your Open MPI configure line?

Howard

From: users on behalf of Aziz Ogutlu via users
Organization: Eduline Bilisim
Reply-To: Open MPI Users
Date: Tuesday, July 25, 2023 at 8:18 AM
To: Open MPI Users
Cc: Aziz Ogutlu
Subject: [EXTERNAL] Re: [OMPI users] MPI_Init_thread error

Hi Gilles,

Thank you for your response. When I run srun --mpi=list, I get only pmi2. When I run the command with the --mpi=pmi2 parameter, I get the same error. OpenMPI automatically supports Slurm after the 4.x version: https://www.open-mpi.org/faq/?category=building#build-rte

On 7/25/23 12:55, Gilles Gouaillardet via users wrote:

Aziz,

When using direct run (e.g. srun), OpenMPI has to interact with SLURM. This is typically achieved via PMI2 or PMIx. You can run srun --mpi=list to list the available options on your system. If PMIx is available, you can srun --mpi=pmix ...; if only PMI2 is available, you need to make sure Open MPI was built with SLURM support (e.g. configure --with-slurm ...) and then srun --mpi=pmi2 ...

Cheers,

Gilles

On Tue, Jul 25, 2023 at 5:07 PM Aziz Ogutlu via users <users@lists.open-mpi.org> wrote:

Hi there all,

We're using Slurm 21.08 on a RedHat 7.9 HPC cluster with OpenMPI 4.0.3 + gcc 8.5.0. When we run the command below to call SU2, we get an error message:

$ srun -p defq --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
$ module load su2/7.5.1
$ SU2_CFD config.cfg

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[cnode003.hpc:17534] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--
Best regards,
Aziz Öğütlü
Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.
www.eduline.com.tr
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza Kat:6 Ofis No:118
Kağıthane - İstanbul - Türkiye 34406
Tel : +90 212 324 60 61  Cep: +90 541 350 40 72
Re: [OMPI users] [EXTERNAL] Re: How to use hugetlbfs with openmpi and ucx
Hi Arun,

Interesting. For problem b) I would suggest one of two things:

- If you want to dig deeper yourself, and it's possible on your system, I'd look at the output of dmesg -H -w on the node where the job is hitting this failure (you'll need to rerun the job).
- Ping the UCX group mail list (see https://elist.ornl.gov/mailman/listinfo/ucx-group).

As for your more general question, I would suggest keeping it simple and letting the applications use large pages via the usual libhugetlbfs mechanism (LD_PRELOAD libhugetlbfs and set the libhugetlbfs env variables specifying what type of process memory to try to map to large pages). But I'm no expert in the ways UCX may be able to take advantage of internally allocated large pages, nor the extent to which such use of large pages has led to demonstrable application speedups.

Howard

On 7/21/23, 8:37 AM, "Chandran, Arun" <arun.chand...@amd.com> wrote:

Hi Howard,

Thank you very much for the reply. UCX is trying to set up the FIFO for shared memory communication using both sysv and posix. By default, these allocations are failing when tried with hugetlbfs.

a) Failure log from strace (pasting only for rank 0):

[pid 3541286] shmget(IPC_PRIVATE, 6291456, IPC_CREAT|IPC_EXCL|SHM_HUGETLB|0660) = -1 EPERM (Operation not permitted)
[pid 3541286] mmap(NULL, 6291456, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_HUGETLB, 29, 0) = -1 EINVAL (Invalid argument)

b) I was able to overcome the shmget allocation failure with hugetlbfs by adding my gid to /proc/sys/vm/hugetlb_shm_group:

[pid 3541465] shmget(IPC_PRIVATE, 6291456, IPC_CREAT|IPC_EXCL|SHM_HUGETLB|0660) = 2916410  --> success
[pid 3541465] mmap(NULL, 6291456, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_HUGETLB, 29, 0) = -1 EINVAL (Invalid argument)  --> still fails

But mmap with MAP_SHARED|MAP_HUGETLB is still failing. Any clues?
I am aware of the advantages of huge page tables. I am asking from the OpenMPI library perspective: should I use it for OpenMPI internal buffers and data structures, or leave it for the applications to use? What are the community recommendations in this regard?

--Arun

-----Original Message-----
From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Thursday, July 20, 2023 9:36 PM
To: Open MPI Users <users@lists.open-mpi.org>; Florent GERMAIN <florent.germ...@eviden.com>
Cc: Chandran, Arun <arun.chand...@amd.com>
Subject: Re: [EXTERNAL] Re: [OMPI users] How to use hugetlbfs with openmpi and ucx

Hi Arun,

It's going to be chatty, but you may want to see if strace helps in diagnosing:

mpirun -np 2 (all your favorite mpi args) strace -f send_recv 1000 1

Huge pages often help reduce pressure on a NIC's I/O MMU and speed up resolving virtual to physical memory addresses.

Good luck,

Howard

On 7/19/23, 9:24 PM, "users on behalf of Chandran, Arun via users" <users@lists.open-mpi.org> wrote:

Hi,

I am trying to use static huge pages, not transparent huge pages. UCX is allowed to allocate via hugetlbfs:

$ ./bin/ucx_info -c | grep -i huge
UCX_SELF_ALLOC=huge,thp,md,mmap,heap
UCX_TCP_ALLOC=huge,thp,md,mmap,heap
UCX_SYSV_HUGETLB_MODE=try    ---> it is trying this and failing
UCX_SYSV_FIFO_HUGETLB=n
UCX_POSIX_HUGETLB_MODE=try   ---> it is trying this and failing
UCX_POSIX_FIFO_HUGETLB=n
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_CMA_ALLOC=huge,thp,mmap,heap

It is failing even though I have static hugepages available on my system.
$ cat /proc/meminfo | grep HugePages_Total
HugePages_Total: 20

THP is also enabled:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

--Arun

-----Original Message-----
From: Florent GERMAIN <florent.germ...@eviden.com>
Sent: Wednesday, July 19, 2023 7:51 PM
To: Open MPI Users <users@lists.open-mpi.org>; Chandran, Arun <arun.chand...@amd.com>
Subject: RE: How to use hugetlbfs with openmpi and ucx

Hi,

You can check if there are dedicated huge pages on your system or if transparent huge pages are allowed.

Transparent huge pages on rhel systems:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

-> this means that transparent huge pages are selected through mmap + madvise
-> always = always try to aggregate pages on thp (for large enough allocations with good alignment)
-> never = never try to aggregate pages on thp
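The individual checks discussed in this thread (dedicated huge pages, the SysV shm group gate, THP mode) can be gathered into one short script. The procfs/sysfs paths are standard Linux; the values will of course differ per system.

```shell
# Summarize huge-page availability relevant to UCX's hugetlb allocations.
# Dedicated (static) huge pages reserved on this node:
grep '^HugePages_Total' /proc/meminfo || echo "HugePages_Total: not exposed"
# gid allowed to create SysV SHM_HUGETLB segments (0 means root group only):
cat /proc/sys/vm/hugetlb_shm_group 2>/dev/null || echo "hugetlb_shm_group: unavailable"
# Transparent huge page mode, if the kernel exposes it:
cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null || echo "THP: unavailable"
```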
Re: [OMPI users] [EXTERNAL] Re: How to use hugetlbfs with openmpi and ucx
Hi Arun,

It's going to be chatty, but you may want to see if strace helps in diagnosing:

mpirun -np 2 (all your favorite mpi args) strace -f send_recv 1000 1

Huge pages often help reduce pressure on a NIC's I/O MMU and speed up resolving virtual to physical memory addresses.

Good luck,

Howard

On 7/19/23, 9:24 PM, "users on behalf of Chandran, Arun via users" <users@lists.open-mpi.org> wrote:

Hi,

I am trying to use static huge pages, not transparent huge pages. UCX is allowed to allocate via hugetlbfs:

$ ./bin/ucx_info -c | grep -i huge
UCX_SELF_ALLOC=huge,thp,md,mmap,heap
UCX_TCP_ALLOC=huge,thp,md,mmap,heap
UCX_SYSV_HUGETLB_MODE=try    ---> it is trying this and failing
UCX_SYSV_FIFO_HUGETLB=n
UCX_POSIX_HUGETLB_MODE=try   ---> it is trying this and failing
UCX_POSIX_FIFO_HUGETLB=n
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_CMA_ALLOC=huge,thp,mmap,heap

It is failing even though I have static hugepages available on my system.

$ cat /proc/meminfo | grep HugePages_Total
HugePages_Total: 20

THP is also enabled:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

--Arun

-----Original Message-----
From: Florent GERMAIN <florent.germ...@eviden.com>
Sent: Wednesday, July 19, 2023 7:51 PM
To: Open MPI Users <users@lists.open-mpi.org>; Chandran, Arun <arun.chand...@amd.com>
Subject: RE: How to use hugetlbfs with openmpi and ucx

Hi,

You can check if there are dedicated huge pages on your system or if transparent huge pages are allowed.
Transparent huge pages on rhel systems:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

-> this means that transparent huge pages are selected through mmap + madvise
-> always = always try to aggregate pages on thp (for large enough allocations with good alignment)
-> never = never try to aggregate pages on thp

Dedicated huge pages on rhel systems:

$ cat /proc/meminfo | grep HugePages_Total
HugePages_Total: 0
-> no dedicated huge pages here

It seems that ucx tries to use dedicated huge pages (mmap(addr=(nil), length=6291456, flags=HUGETLB, fd=29)). If there are no dedicated huge pages available, mmap fails.

Huge pages can accelerate virtual address to physical address translation and reduce TLB consumption. They may be useful for large and frequently used buffers.

Regards,
Florent

-----Original Message-----
From: users <users-boun...@lists.open-mpi.org> on behalf of Chandran, Arun via users
Sent: Wednesday, July 19, 2023 15:44
To: users@lists.open-mpi.org
Cc: Chandran, Arun <arun.chand...@amd.com>
Subject: [OMPI users] How to use hugetlbfs with openmpi and ucx

Hi All,

I am trying to see whether hugetlbfs improves the latency of communication with a small send/receive program:

mpirun -np 2 --map-by core --bind-to core --mca pml ucx --mca opal_common_ucx_tls any --mca opal_common_ucx_devices any -mca pml_base_verbose 10 --mca mtl_base_verbose 10 -x OMPI_MCA_pml_ucx_verbose=10 -x UCX_LOG_LEVEL=debug -x UCX_PROTO_INFO=y send_recv 1000 1

But the internal buffer allocation in ucx is unable to select the hugetlbfs.
[1688297246.205092] [lib-ssp-04:4022755:0] ucp_context.c:1979 UCX DEBUG allocation method[2] is 'huge'
[1688297246.208660] [lib-ssp-04:4022755:0] mm_sysv.c:97 UCX DEBUG mm failed to allocate 8447 bytes with hugetlb  -> I checked the code; this is a valid failure, as the size is small compared to the 2 MB huge page size
[1688297246.208704] [lib-ssp-04:4022755:0] mm_sysv.c:97 UCX DEBUG mm failed to allocate 4292720 bytes with hugetlb
[1688297246.210048] [lib-ssp-04:4022755:0] mm_posix.c:332 UCX DEBUG shared memory mmap(addr=(nil), length=6291456, flags= HUGETLB, fd=29) failed: Invalid argument
[1688297246.211451] [lib-ssp-04:4022754:0] ucp_context.c:1979 UCX DEBUG allocation method[2] is 'huge'
[1688297246.214849] [lib-ssp-04:4022754:0] mm_sysv.c:97 UCX DEBUG mm failed to allocate 8447 bytes with hugetlb
[1688297246.214888] [lib-ssp-04:4022754:0] mm_sysv.c:97 UCX DEBUG mm failed to allocate 4292720 bytes with hugetlb
[1688297246.216235] [lib-ssp-04:4022754:0] mm_posix.c:332 UCX DEBUG shared memory mmap(addr=(nil), length=6291456, flags= HUGETLB, fd=29) failed: Invalid argument

Can someone suggest what steps need to be done to enable hugetlbfs (I cannot run my application as root)? Is using hugetlbfs for the internal buffers recommended?

--Arun
Re: [OMPI users] [EXTERNAL] Requesting information about MPI_T events
Hi Kingshuk,

Looks like the MPI_T Events feature is parked in this PR at the moment: https://github.com/open-mpi/ompi/pull/8057

Howard

From: users on behalf of Kingshuk Haldar via users
Reply-To: Open MPI Users
Date: Wednesday, March 15, 2023 at 4:00 AM
To: OpenMPI-lists-users
Cc: Kingshuk Haldar
Subject: [EXTERNAL] [OMPI users] Requesting information about MPI_T events

Hi all,

Is there any public branch of OpenMPI with which one can test the MPI_T Events interface? Alternatively, any information about its potential availability in upcoming releases would be good to know.

Best,
--
Kingshuk Haldar   email: kingshuk.hal...@hlrs.de
Re: [OMPI users] [EXTERNAL] OFI, destroy_vni_context(1137).......: OFI domain close failed (ofi_init.c:1137:destroy_vni_context:Device or resource busy)
Hi,

You are using MPICH or a vendor derivative of MPICH. You probably want to resend this email to the MPICH users/help mail list.

Howard

From: users on behalf of mrlong via users
Reply-To: Open MPI Users
Date: Tuesday, November 1, 2022 at 11:26 AM
To: "de...@lists.open-mpi.org", "users@lists.open-mpi.org"
Cc: mrlong
Subject: [EXTERNAL] [OMPI users] OFI, destroy_vni_context(1137)...: OFI domain close failed (ofi_init.c:1137:destroy_vni_context:Device or resource busy)

Hi, teachers

Code:

import mpi4py
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
print("rank", rank)

if __name__ == '__main__':
    if rank == 0:
        mem = np.array([0], dtype='i')
        win = MPI.Win.Create(mem, comm=comm)
    else:
        win = MPI.Win.Create(None, comm=comm)
    print(rank, "end")

(py3.6.8) ➜ ~ mpirun -n 2 python -u test.py
rank 0
rank 1
0 end
1 end
Abort(806449679): Fatal error in internal_Finalize: Other MPI error, error stack:
internal_Finalize(50)...: MPI_Finalize failed
MPII_Finalize(345)..:
MPID_Finalize(511)..:
MPIDI_OFI_mpi_finalize_hook(895):
destroy_vni_context(1137)...: OFI domain close failed (ofi_init.c:1137:destroy_vni_context:Device or resource busy)

Why is this happening? How to debug? This error is not reported on the other machine.
Re: [OMPI users] [EXTERNAL] Beginner Troubleshooting OpenMPI Installation - pmi.h Error
Hi Jeff,

I think you are now at the "send the system admin an email to install RPMs" stage; in particular, ask that the numa and udev devel RPMs be installed. They will need to install these RPMs on the compute node image(s) as well.

Howard

From: "Jeffrey D. (JD) Tamucci"
Date: Wednesday, October 5, 2022 at 9:20 AM
To: "Pritchard Jr., Howard"
Cc: "bbarr...@amazon.com", Open MPI Users
Subject: Re: [EXTERNAL] [OMPI users] Beginner Troubleshooting OpenMPI Installation - pmi.h Error

Gladly. I tried it that way and it worked, in that it was able to find pmi.h. Unfortunately there's a new error about finding -lnuma and -ludev:

make[2]: Entering directory '/shared/maylab/src/openmpi-4.1.4/opal'
  CCLD     libopen-pal.la
/usr/bin/ld: cannot find -lnuma
/usr/bin/ld: cannot find -ludev
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:2249: libopen-pal.la] Error 1
make[2]: Leaving directory '/shared/maylab/src/openmpi-4.1.4/opal'
make[1]: *** [Makefile:2394: install-recursive] Error 1
make[1]: Leaving directory '/shared/maylab/src/openmpi-4.1.4/opal'
make: *** [Makefile:1912: install-recursive] Error 1

Here is a dropbox link to the full output: https://www.dropbox.com/s/4rv8n2yp320ix08/ompi-output_Oct4_2022.tar.bz2?dl=0

Thank you for your help!
JD

Jeffrey D. (JD) Tamucci
University of Connecticut
Molecular & Cell Biology
RA in Lab of Eric R.
May
PhD / MPH Candidate
he/him

On Tue, Oct 4, 2022 at 1:51 PM Pritchard Jr., Howard <howa...@lanl.gov> wrote:

*Message sent from a system outside of UConn.*

Could you change the --with-pmi to be --with-pmi=/cm/shared/apps/slurm21.08.8 ?

From: "Jeffrey D. (JD) Tamucci" <jeffrey.tamu...@uconn.edu>
Date: Tuesday, October 4, 2022 at 10:40 AM
To: "Pritchard Jr., Howard" <howa...@lanl.gov>, "bbarr...@amazon.com"
Cc: Open MPI Users
Subject: Re: [EXTERNAL] [OMPI users] Beginner Troubleshooting OpenMPI Installation - pmi.h Error

Hi Howard and Brian,

Of course. Here's a dropbox link to the full folder: https://www.dropbox.com/s/raqlcnpgk9wz78b/ompi-output_Sep30_2022.tar.bz2?dl=0

These were the configure and make commands:

./configure \
--prefix=/shared/maylab/mayapps/mpi/openmpi/4.1.4 \
--with-slurm \
--with-lsf=no \
--with-pmi=/cm/shared/apps/slurm/21.08.8/include/slurm \
--with-pmi-libdir=/cm/shared/apps/slurm/21.08.8/lib64 \
--with-hwloc=/cm/shared/apps/hwloc/1.11.11 \
--with-cuda=/gpfs/sharedfs1/admin/hpc2.0/apps/cuda/11.6 \
--enable-shared \
--enable-static \
&& make -j 32 && make -j 32 check
make install

The output of the make command is in the install_open-mpi_4.1.4_hpc2.log file.

Jeffrey D. (JD) Tamucci
University of Connecticut
Molecular & Cell Biology
RA in Lab of Eric R. May
PhD / MPH Candidate
he/him

On Tue, Oct 4, 2022 at 12:33 PM Pritchard Jr., Howard <howa...@lanl.gov> wrote:

*Message sent from a system outside of UConn.*

Hi JD,

Could you post the configure options your script uses to build Open MPI?

Howard

From: users on behalf of "Jeffrey D.
(JD) Tamucci via users" mailto:users@lists.open-mpi.org>> Reply-To: Open MPI Users mailto:users@lists.open-mpi.org>> Date: Tuesday, October 4, 2022 at 10:07 AM To: "users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>" mailto:users@lists.open-mpi.org>> Cc: "Jeffrey D. (JD) Tamucci" mailto:jeffrey.tamu...@uconn.edu>> Subject: [EXTERNAL] [OMPI users] Beginner Troubleshooting OpenMPI Installation - pmi.h Error Hi, I have been trying to install OpenMPI v4.1.4 on a university HPC cluster. We use the Bright cluster manager and have SLURM v21.08.8 and RHEL 8.6. I used a script to install OpenMPI that a former co-worker had used to successfully install OpenMPI v3.0.0 previously. I updated it to include new versions of the dependencies and new paths to those installs. Each t
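The missing -lnuma and -ludev libraries come from distro development packages, which is why Howard points at the admins. A minimal pre-check sketch you could run before (or after) mailing them; the RHEL 8 package names in the comment are an assumption, and the packages must also land on the compute node images:

```shell
# On RHEL 8 the relevant packages are likely numactl-devel and systemd-devel
# (assumed names; adjust for your distro):
#   dnf install numactl-devel systemd-devel
# Check whether the dynamic linker can already see the libraries:
missing=""
for lib in numa udev; do
  if ldconfig -p 2>/dev/null | grep -q "lib${lib}\.so"; then
    :  # library visible to the linker cache
  else
    missing="$missing lib${lib}"
  fi
done
echo "missing dev libs:${missing:- none}"
```

If anything is reported missing, re-run configure and make only after the devel packages are installed, since configure caches what it found.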
Re: [OMPI users] [EXTERNAL] Beginner Troubleshooting OpenMPI Installation - pmi.h Error
Could you change the --with-pmi to be --with-pmi=/cm/shared/apps/slurm/21.08.8 ?

From: "Jeffrey D. (JD) Tamucci"
Date: Tuesday, October 4, 2022 at 10:40 AM
To: "Pritchard Jr., Howard", "bbarr...@amazon.com"
Cc: Open MPI Users
Subject: Re: [EXTERNAL] [OMPI users] Beginner Troubleshooting OpenMPI Installation - pmi.h Error

Hi Howard and Brian, Of course. Here's a Dropbox link to the full folder: https://www.dropbox.com/s/raqlcnpgk9wz78b/ompi-output_Sep30_2022.tar.bz2?dl=0

These were the configure and make commands:

./configure \
  --prefix=/shared/maylab/mayapps/mpi/openmpi/4.1.4 \
  --with-slurm \
  --with-lsf=no \
  --with-pmi=/cm/shared/apps/slurm/21.08.8/include/slurm \
  --with-pmi-libdir=/cm/shared/apps/slurm/21.08.8/lib64 \
  --with-hwloc=/cm/shared/apps/hwloc/1.11.11 \
  --with-cuda=/gpfs/sharedfs1/admin/hpc2.0/apps/cuda/11.6 \
  --enable-shared \
  --enable-static \
&& make -j 32 && make -j 32 check
make install

The output of the make command is in the install_open-mpi_4.1.4_hpc2.log file.

Jeffrey D. (JD) Tamucci
University of Connecticut, Molecular & Cell Biology
RA in Lab of Eric R. May PhD / MPH Candidate, he/him

On Tue, Oct 4, 2022 at 12:33 PM Pritchard Jr., Howard wrote:
*Message sent from a system outside of UConn.*
Hi JD, Could you post the configure options your script uses to build Open MPI? Howard

From: users on behalf of "Jeffrey D. (JD) Tamucci via users"
Reply-To: Open MPI Users
Date: Tuesday, October 4, 2022 at 10:07 AM
To: "users@lists.open-mpi.org"
Cc: "Jeffrey D. (JD) Tamucci"
Subject: [EXTERNAL] [OMPI users] Beginner Troubleshooting OpenMPI Installation - pmi.h Error

Hi, I have been trying to install OpenMPI v4.1.4 on a university HPC cluster. We use the Bright cluster manager and have SLURM v21.08.8 and RHEL 8.6. I used a script to install OpenMPI that a former co-worker had used to successfully install OpenMPI v3.0.0 previously. I updated it to include new versions of the dependencies and new paths to those installs. Each time, it fails in the make install step. There is a fatal error about finding pmi.h. It specifically says:

make[2]: Entering directory '/shared/maylab/src/openmpi-4.1.4/opal/mca/pmix/s1'
  CC       libmca_pmix_s1_la-pmix_s1_component.lo
  CC       libmca_pmix_s1_la-pmix_s1.lo
pmix_s1.c:29:10: fatal error: pmi.h: No such file or directory
   29 | #include <pmi.h>

I've looked through the archives and seen others face similar errors in years past, but I couldn't understand the solutions. One person suggested that SLURM may be missing PMI libraries. I think I've verified that SLURM has PMI. I include paths to those files, and it seems to find them earlier in the process. I'm not sure what the next step is in troubleshooting this. I have included a bz2 file containing my install script, a log file containing the script output (from build, make, make install), the config.log, and the opal_config.h file. If anyone could provide any guidance, I'd sincerely appreciate it. Best, JD
Re: [OMPI users] [EXTERNAL] Beginner Troubleshooting OpenMPI Installation - pmi.h Error
HI JD, Could you post the configure options your script uses to build Open MPI? Howard From: users on behalf of "Jeffrey D. (JD) Tamucci via users" Reply-To: Open MPI Users Date: Tuesday, October 4, 2022 at 10:07 AM To: "users@lists.open-mpi.org" Cc: "Jeffrey D. (JD) Tamucci" Subject: [EXTERNAL] [OMPI users] Beginner Troubleshooting OpenMPI Installation - pmi.h Error Hi, I have been trying to install OpenMPI v4.1.4 on a university HPC cluster. We use the Bright cluster manager and have SLURM v21.08.8 and RHEL 8.6. I used a script to install OpenMPI that a former co-worker had used to successfully install OpenMPI v3.0.0 previously. I updated it to include new versions of the dependencies and new paths to those installs. Each time, it fails in the make install step. There is a fatal error about finding pmi.h. It specifically says: make[2]: Entering directory '/shared/maylab/src/openmpi-4.1.4/opal/mca/pmix/s1' CC libmca_pmix_s1_la-pmix_s1_component.lo CC libmca_pmix_s1_la-pmix_s1.lo pmix_s1.c:29:10: fatal error: pmi.h: No such file or directory 29 | #include I've looked through the archives and seen others face similar errors in years past but I couldn't understand the solutions. One person suggested that SLURM may be missing PMI libraries. I think I've verified that SLURM has PMI. I include paths to those files and it seems to find them earlier in the process. I'm not sure what the next step is in troubleshooting this. I have included a bz2 file containing my install script, a log file containing the script output (from build, make, make install), the config.log, and the opal_config.h file. If anyone could provide any guidance, I'd sincerely appreciate it. Best, JD
Re: [OMPI users] [EXTERNAL] Problem with Mellanox ConnectX3 (FDR) and openmpi 4
Hi Boyrie, The warning message is coming from the older ibverbs component of the Open MPI 4.0/4.1 releases. You can get rid of this message in several ways. One, at configure time, is to add --disable-verbs to the configure options. At runtime you can set

export OMPI_MCA_btl=^openib

The ucx messages are just being chatty about which ucx transport type is being selected. The VASP hang may be something else. Howard

From: users on behalf of Boyrie Fabrice via users
Reply-To: Open MPI Users
Date: Friday, August 19, 2022 at 9:51 AM
To: "users@lists.open-mpi.org"
Cc: Boyrie Fabrice
Subject: [EXTERNAL] [OMPI users] Problem with Mellanox ConnectX3 (FDR) and openmpi 4

Hi, I had to reinstall a cluster under AlmaLinux 8.6, and I am unable to make openmpi 4 work with InfiniBand. I get the following message in a trivial pingpong test:

mpirun --hostfile hostfile -np 2 pingpong
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
  Local host:   node2
  Local device: mlx4_0
--------------------------------------------------------------------------
[node2:12431] common_ucx.c:107 using OPAL memory hooks as external events
[node2:12431] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[node1:13188] common_ucx.c:174 using OPAL memory hooks as external events
[node1:13188] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[node2:12431] pml_ucx.c:289 mca_pml_ucx_init
[node1:13188] common_ucx.c:333 posix/memory: did not match transport list
[node1:13188] common_ucx.c:333 sysv/memory: did not match transport list
[node1:13188] common_ucx.c:333 self/memory0: did not match transport list
[node1:13188] common_ucx.c:333 tcp/lo: did not match transport list
[node1:13188] common_ucx.c:333 tcp/eno1: did not match transport list
[node1:13188] common_ucx.c:333 tcp/ib0: did not match transport list
[node1:13188] common_ucx.c:228 driver '../../../../bus/pci/drivers/mlx4_core' matched by 'mlx*'
[node1:13188] common_ucx.c:324 rc_verbs/mlx4_0:1: matched both transport and device list
[node1:13188] common_ucx.c:337 support level is transports and devices
[node1:13188] pml_ucx.c:289 mca_pml_ucx_init
[node2:12431] pml_ucx.c:114 Pack remote worker address, size 155
[node2:12431] pml_ucx.c:114 Pack local worker address, size 291
[node2:12431] pml_ucx.c:351 created ucp context 0xf832a0, worker 0x109fc50
[node1:13188] pml_ucx.c:114 Pack remote worker address, size 155
[node1:13188] pml_ucx.c:114 Pack local worker address, size 291
[node1:13188] pml_ucx.c:351 created ucp context 0x1696320, worker 0x16c9ce0
[node1:13188] pml_ucx_component.c:147 returning priority 51
[node2:12431] pml_ucx.c:182 Got proc 0 address, size 291
[node2:12431] pml_ucx.c:411 connecting to proc. 0
[node1:13188] pml_ucx.c:182 Got proc 1 address, size 291
[node1:13188] pml_ucx.c:411 connecting to proc. 1

 length   time/message (usec)   transfer rate (Gbyte/sec)

[node2:12431] pml_ucx.c:182 Got proc 1 address, size 155
[node2:12431] pml_ucx.c:411 connecting to proc. 1
[node1:13188] pml_ucx.c:182 Got proc 0 address, size 155
[node1:13188] pml_ucx.c:411 connecting to proc. 0

      1   45.683729    0.88
   1001    4.286029    0.934198
   2001    5.755391    1.390696
   3001    6.902443    1.739095
   4001    8.485305    1.886084
   5001    9.596994    2.084403
   6001   11.055146    2.171297
   7001   11.977093    2.338130
   8001   13.324408    2.401908
   9001   14.471116    2.487991
  10001   15.806676    2.530829

[node2:12431] common_ucx.c:240 disconnecting from rank 0
[node2:12431] common_ucx.c:240 disconnecting from rank 1
[node2:12431] common_ucx.c:204 waiting for 1 disconnect requests
[node2:12431] common_ucx.c:204 waiting for 0 disconnect requests
[node1:13188] common_ucx.c:466 disconnecting from rank 0
[node1:13188] common_ucx.c:430 waiting for 1 disconnect requests
[node1:13188] common_ucx.c:466 disconnecting from rank 1
[node1:13188] common_ucx.c:430 waiting for 0 disconnect requests
[node2:12431] pml_ucx.c:367 mca_pml_ucx_cleanup
[node1:13188] pml_ucx.c:367 mca_pml_ucx_cleanup
[node2:12431] pml_ucx.c:268 mca_pml_ucx_close
[node1:13188] pml_ucx.c:268 mca_pml_ucx_close

cat hostfile
node1 slots=1
node2 slots=1

And with a real program (VASP) it stops.
InfiniBand seems to be working: I can ssh over InfiniBand, and qperf works in rdma mode:

qperf -t 10 ibnode1 ud_lat ud_bw
ud_lat: latency = 18.2 us
ud_bw:  send_bw = 2.81 GB/sec
        recv_bw = 2.81 GB/sec

I use the standard AlmaLinux module for InfiniBand:

82:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

I can not install MLNX_OFED_LINUX-5.6-2.0.9.0-rhel8.6-x86_64 because it does not support ConnectX-3, and I can not install MLNX_OFED_LINUX-4.9-5.1.0.0-rhel8.6-x86_64 because the module compilation fails.
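The runtime workaround from Howard's reply, expressed as a standard Open MPI MCA environment setting (the command-line form in the comment is equivalent):

```shell
# Disable the openib BTL at run time; UCX then carries the IB traffic
# and the OpenFabrics warning goes away:
export OMPI_MCA_btl=^openib
# Equivalent per-run form:
#   mpirun --mca btl ^openib --hostfile hostfile -np 2 pingpong
echo "btl selection: $OMPI_MCA_btl"
```

The `^` prefix means "everything except the listed components", so shared-memory and self BTLs stay available for on-node traffic.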
Re: [OMPI users] [EXTERNAL] Java Segmentation Fault
Hi Janek, A few questions. First, which version of Open MPI are you using? Did you compile your code with the Open MPI mpijavac wrapper? Howard

From: users on behalf of "Laudan, Janek via users"
Reply-To: "Laudan, Janek", Open MPI Users
Date: Thursday, March 17, 2022 at 9:52 AM
To: "users@lists.open-mpi.org"
Cc: "Laudan, Janek"
Subject: [EXTERNAL] [OMPI users] Java Segmentation Fault

Hi, I am trying to extend an existing Java project to be run with open-mpi. I have managed to successfully set up open-mpi and my project on my local machine to conduct some test runs. However, when I tried to set up things on our cluster I ran into some problems. I was able to run some trivial examples such as "HelloWorld" and "Ring", which I found in the ompi GitHub repo. Unfortunately, when I try to run our app wrapped between MPI.Init(args) and MPI.Finalize() I get the following segmentation fault:

$ mpirun -np 1 java -cp matsim-p-1.0-SNAPSHOT.jar org.matsim.parallel.RunMinimalMPIExample
Java-Version: 11.0.2
before getTestScenario
before load config
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
[cluster-i:1272 :0:1274] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xc)
backtrace (tid: 1274)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x14a85752fdf4, pid=1272, tid=1274
#
# JRE version: Java(TM) SE Runtime Environment (11.0.2+9) (build 11.0.2+9-LTS)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (11.0.2+9-LTS, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 612 c2 java.lang.StringBuilder.append(Ljava/lang/String;)Ljava/lang/StringBuilder; java.base@11.0.2 (8 bytes) @ 0x14a85752fdf4 [0x14a85752fdc0+0x0034]
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /net/ils/laudan/mpi-test/matsim-p/hs_err_pid1272.log

Compiled method (c2) 1052 612 4 java.lang.StringBuilder::append (8 bytes)
 total in heap  [0x14a85752fc10,0x14a8575306a8] = 2712
 relocation     [0x14a85752fd88,0x14a85752fdb8] = 48
 main code      [0x14a85752fdc0,0x14a857530360] = 1440
 stub code      [0x14a857530360,0x14a857530378] = 24
 metadata       [0x14a857530378,0x14a8575303c0] = 72
 scopes data    [0x14a8575303c0,0x14a857530578] = 440
 scopes pcs     [0x14a857530578,0x14a857530658] = 224
 dependencies   [0x14a857530658,0x14a857530660] = 8
 handler table  [0x14a857530660,0x14a857530678] = 24
 nul chk table  [0x14a857530678,0x14a8575306a8] = 48
Compiled method (c1) 1053 263 3 java.lang.StringBuilder::<init> (7 bytes)
 total in heap  [0x14a850102790,0x14a850102b30] = 928
 relocation     [0x14a850102908,0x14a850102940] = 56
 main code      [0x14a850102940,0x14a850102a20] = 224
 stub code      [0x14a850102a20,0x14a850102ac8] = 168
 metadata       [0x14a850102ac8,0x14a850102ad0] = 8
 scopes data    [0x14a850102ad0,0x14a850102ae8] = 24
 scopes pcs     [0x14a850102ae8,0x14a850102b28] = 64
 dependencies   [0x14a850102b28,0x14a850102b30] = 8
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
[cluster-i:01272] *** Process received signal ***
[cluster-i:01272] Signal: Aborted (6)
[cluster-i:01272] Signal code:  (-6)
[cluster-i:01272] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x14a86e477630]
[cluster-i:01272] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x14a86dcbb387]
[cluster-i:01272] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x14a86dcbca78]
[cluster-i:01272] [ 3] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xc00be9)[0x14a86d3f8be9]
[cluster-i:01272] [ 4] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29619)[0x14a86d621619]
[cluster-i:01272] [ 5] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29e9b)[0x14a86d621e9b]
[cluster-i:01272] [ 6] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29ece)[0x14a86d621ece]
[cluster-i:01272] [ 7] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(JVM_handle_linux_signal+0x1c0)[0x14a86d403a00]
[cluster-i:01272] [ 8] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xbff5e8)[0x14a86d3f75e8]
[cluster-i:01272] [ 9] /usr/lib64/libpthread.so.0(+0xf630)[0x14a86e477630]
[cluster-i:01272] [10] [0x14a85752fdf4]
[cluster-i:01272] *** End of error message ***
Re: [OMPI users] [EXTERNAL] OpenMPI, Slurm and MPI_Comm_spawn
Hi Kurt, This documentation is rather Slurm-centric. If you build the Open MPI 4.1.x series the default way, it will build its internal PMIx package and use that when launching your app with mpirun. In that case, you can use MPI_Comm_spawn within a Slurm allocation as long as there are sufficient slots in the allocation to hold both the spawner processes and the spawnee processes. Note the Slurm PMIx implementation doesn't support spawn, at least currently, so the documentation is accurate if you are building Open MPI against the Slurm PMIx library. In any case, you can't use MPI_Comm_spawn if you use srun to launch the application. Hope this helps, Howard

From: users on behalf of "Mccall, Kurt E. (MSFC-EV41) via users"
Reply-To: Open MPI Users
Date: Tuesday, March 8, 2022 at 7:49 AM
To: "OpenMpi User List (users@lists.open-mpi.org)"
Cc: "Mccall, Kurt E. (MSFC-EV41)"
Subject: [EXTERNAL] [OMPI users] OpenMPI, Slurm and MPI_Comm_spawn

The Slurm MPI User's Guide at https://slurm.schedmd.com/mpi_guide.html#open_mpi has a note that states:

NOTE: OpenMPI has a limitation that does not support calls to MPI_Comm_spawn() from within a Slurm allocation. If you need to use the MPI_Comm_spawn() function you will need to use another MPI implementation combined with PMI-2 since PMIx doesn't support it either.

Is this still true in OpenMPI 4.1? Thanks, Kurt
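The slot requirement in Howard's answer (spawners plus spawnees must fit in the allocation) can be sanity-checked before launching. A toy sketch, with hypothetical numbers:

```shell
# e.g. salloc -N 2 -n 8 gives the allocation 8 slots in total:
TOTAL_SLOTS=8
SPAWNERS=4      # mpirun -np 4 ./spawner
SPAWNEES=4      # MPI_Comm_spawn with maxprocs=4
if [ $((SPAWNERS + SPAWNEES)) -le "$TOTAL_SLOTS" ]; then
  echo "spawn fits in allocation"
else
  echo "not enough slots for MPI_Comm_spawn"
fi
```

The key operational point from the thread survives the arithmetic: launch with mpirun (which brings Open MPI's internal PMIx) rather than srun, and leave enough free slots for the processes MPI_Comm_spawn will create.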
Re: [OMPI users] [EXTERNAL] openib BTL disabled when using MPI_Init_thread
Hi Jose, I bet this device has not been tested with UCX. You may want to join the UCX users mailing list at https://elist.ornl.gov/mailman/listinfo/ucx-group and ask whether this Marvell device has been tested, and about workarounds for disabling features that the device doesn't support. Again, though, you really may want to first see if the TCP BTL will be good enough for your cluster. Howard

On 2/4/22, 8:03 AM, "Jose E. Roman" wrote:

Howard, I don't have much time now to try with --enable-debug. The RoCE device we have is a FastLinQ QL41000 Series 10/25/40/50GbE Controller. The output of ibv_devinfo is:

hca_id: qedr0
    transport:        InfiniBand (0)
    fw_ver:           8.20.0.0
    node_guid:        2267:7cff:fe11:4a50
    sys_image_guid:   2267:7cff:fe11:4a50
    vendor_id:        0x1077
    vendor_part_id:   32880
    hw_ver:           0x0
    phys_port_cnt:    1
    port: 1
        state:        PORT_ACTIVE (4)
        max_mtu:      4096 (5)
        active_mtu:   1024 (3)
        sm_lid:       0
        port_lid:     0
        port_lmc:     0x00
        link_layer:   Ethernet

hca_id: qedr1
    transport:        InfiniBand (0)
    fw_ver:           8.20.0.0
    node_guid:        2267:7cff:fe11:4a51
    sys_image_guid:   2267:7cff:fe11:4a51
    vendor_id:        0x1077
    vendor_part_id:   32880
    hw_ver:           0x0
    phys_port_cnt:    1
    port: 1
        state:        PORT_DOWN (1)
        max_mtu:      4096 (5)
        active_mtu:   1024 (3)
        sm_lid:       0
        port_lid:     0
        port_lmc:     0x00
        link_layer:   Ethernet

Regarding UCX, we have tried with the latest version.
Compilation goes through, but the ucx_info command gives an error:

# Memory domain: qedr0
#     Component: ib
#         register: unlimited, cost: 180 nsec
#         remote key: 8 bytes
#         local memory handle is required for zcopy
#
#   Transport: rc_verbs
#      Device: qedr0:1
#        Type: network
# System device: qedr0 (0)
[1643982133.674556] [kahan01:8217 :0] rc_iface.c:505 UCX ERROR ibv_create_srq() failed: Function not implemented
#   < failed to open interface >
#
#   Transport: ud_verbs
#      Device: qedr0:1
#        Type: network
# System device: qedr0 (0)
[qelr_create_qp:545]create qp: failed on ibv_cmd_create_qp with 22
[1643982133.681169] [kahan01:8217 :0] ib_iface.c:994 UCX ERROR iface=0x56074944bf10: failed to create UD QP TX wr:256 sge:6 inl:64 resp:0 RX wr:4096 sge:1 resp:0: Invalid argument
#   < failed to open interface >
#
# Memory domain: qedr1
#     Component: ib
#         register: unlimited, cost: 180 nsec
#         remote key: 8 bytes
#         local memory handle is required for zcopy
#   < no supported devices found >

Any idea what the error in ibv_create_srq() means? Thanks for your help. Jose

> On 3 Feb 2022, at 17:52, Pritchard Jr., Howard wrote:
>
> Hi Jose,
>
> A number of things.
>
> First, for recent versions of Open MPI including the 4.1.x release stream, MPI_THREAD_MULTIPLE is supported by default. However, some transport options available when using MPI_Init may not be available when requesting MPI_THREAD_MULTIPLE. You may want to let Open MPI trundle along with tcp used for inter-node messaging and see if your application performs well enough. For a small system tcp may well suffice.
>
> Second, if you want to pursue this further, you want to rebuild Open MPI with --enable-debug. The debug output will be considerably more verbose and provides more info. I think you will get a message saying the rdmacm CPC is excluded owing to the requested thread support level. There may be info about why udcm is not selected as well.
>
> Third, what sort of RoCE devices are available on your system?
The output from ibv_devinfo may be useful.
>
> As for UCX, if it's the version that came with your Ubuntu 18.04 release, it may be pretty old. It's likely that UCX has not been tested on the RoCE devices on your system.
Re: [OMPI users] [EXTERNAL] openib BTL disabled when using MPI_Init_thread
Hi Jose, A number of things.

First, for recent versions of Open MPI, including the 4.1.x release stream, MPI_THREAD_MULTIPLE is supported by default. However, some transport options available when using MPI_Init may not be available when requesting MPI_THREAD_MULTIPLE. You may want to let Open MPI trundle along with tcp used for inter-node messaging and see if your application performs well enough; for a small system tcp may well suffice.

Second, if you want to pursue this further, you want to rebuild Open MPI with --enable-debug. The debug output will be considerably more verbose and provides more info. I think you will get a message saying the rdmacm CPC is excluded owing to the requested thread support level. There may be info about why udcm is not selected as well.

Third, what sort of RoCE devices are available on your system? The output from ibv_devinfo may be useful.

As for UCX, if it's the version that came with your Ubuntu 18.04 release, it may be pretty old. It's likely that UCX has not been tested on the RoCE devices on your system. You can run ucx_info -v to check the version number of the UCX that you are picking up. You can download the latest release of UCX at https://github.com/openucx/ucx/releases/tag/v1.12.0 and instructions for how to build are in the README.md at https://github.com/openucx/ucx. You will want to configure with

contrib/configure-release-mt --enable-gtest

You want to add --enable-gtest to the configure options so that you can run the UCX sanity checks. Note this takes quite a while to run but is pretty thorough at validating your UCX build. You'll want to run this test on one of the nodes with a RoCE device:

ucx_info -d

This will show which UCX transports/devices are available. See the "Running internal unit tests" section of the README.md. Hope this helps, Howard

On 2/3/22, 8:46 AM, "Jose E. Roman" wrote:

Thanks.
The verbose output is:

[kahan01.upvnet.upv.es:29732] mca: base: components_register: registering framework btl components
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component self
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component self register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component sm
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component openib
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component openib register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component vader
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component vader register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component tcp
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component tcp register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: opening btl components
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component self
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component self open function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component openib
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component openib open function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component vader
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component vader open function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component tcp
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component tcp open function successful
[kahan01.upvnet.upv.es:29732] select: initializing btl component self
[kahan01.upvnet.upv.es:29732] select: init of component self returned success
[kahan01.upvnet.upv.es:29732] select: initializing btl component openib
[kahan01.upvnet.upv.es:29732] Checking distance from this process to device=qedr0
[kahan01.upvnet.upv.es:29732] hwloc_distances->nbobjs=4
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[0]=10
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[1]=16
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[2]=16
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[3]=16
[kahan01.upvnet.upv.es:29732] ibv_obj->type set to NULL
[kahan01.upvnet.upv.es:29732] Process is bound: distance to device is 0.00
[kahan01.upvnet.upv.es:29732] Checking distance from this process to device=qedr1
[kahan01.upvnet.upv.es:29732] hwloc_distances->nbobjs=4
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[0]=10
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[1]=16
[kahan01.upvnet.upv.es:29732] hwloc_distances->value
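The ucx_info checks suggested in this thread can be wrapped so they degrade gracefully on machines where UCX is not on the PATH. A sketch; the output is system-dependent, and the -d check should be run on a node that actually has the RoCE device:

```shell
# Report the UCX version on PATH, then the available transports/devices:
if command -v ucx_info >/dev/null 2>&1; then
  ucx_version=$(ucx_info -v 2>/dev/null | head -n 1)
  ucx_info -d 2>/dev/null | grep -E 'Transport|Device' || true
else
  ucx_version="ucx_info not found on PATH"
fi
echo "UCX check: $ucx_version"
```

A stale distro-provided ucx_info shadowing a freshly built one is a common way to end up testing the wrong UCX, so checking the version first is worthwhile.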
Re: [OMPI users] [EXTERNAL] openib BTL disabled when using MPI_Init_thread
Hello Jose, I suspect the issue here is that the openib BTL isn't finding a connection module when you are requesting MPI_THREAD_MULTIPLE. The rdmacm connection is deselected if the MPI_THREAD_MULTIPLE thread support level is being requested. If you run the test in a shell with

export OMPI_MCA_btl_base_verbose=100

there may be some more info to help diagnose what's going on. Another option would be to build Open MPI with UCX support. That's the better way to use Open MPI over IB/RoCE. Howard

On 2/2/22, 10:52 AM, "users on behalf of Jose E. Roman via users" wrote:

Hi. I am using Open MPI 4.1.1 with the openib BTL on a 4-node cluster with Ethernet 10/25Gb (RoCE). It is using libibverbs from Ubuntu 18.04 (kernel 4.15.0-166-generic). With this hello world example:

#include <stdio.h>
#include <mpi.h>
int main (int argc, char *argv[])
{
  int rank, size, provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  printf("Hello world from process %d of %d, provided=%d\n", rank, size, provided);
  MPI_Finalize();
  return 0;
}

I get the following output when run on one node:

$ ./hellow
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be used on a specific port. As such, the openib BTL (OpenFabrics support) will be disabled for this port.

  Local host:     kahan01
  Local device:   qedr0
  Local port:     1
  CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
Hello world from process 0 of 1, provided=1

The message does not appear if I run on the front-end (which does not have the RoCE network), or if I run it on the node either using MPI_Init() instead of MPI_Init_thread() or using MPI_THREAD_SINGLE instead of MPI_THREAD_FUNNELED. Is there any reason why MPI_Init_thread() is behaving differently from MPI_Init()? Note that I am not using threads, and just one MPI process. The question has a second part: is there a way to determine (without running an MPI program) that MPI_Init_thread() won't work but MPI_Init() will work?
I am asking this because PETSc programs default to use MPI_Init_thread() when PETSc's configure script finds the MPI_Init_thread() symbol in the MPI library. But in situations like the one reported here, it would be better to revert to MPI_Init() since MPI_Init_thread() will not work as expected. [The configure script cannot run an MPI program due to batch systems.] Thanks for your help. Jose
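The two concrete suggestions in Howard's reply translate into environment settings like these (standard Open MPI MCA variable spellings; the ucx value only makes sense once Open MPI is built with UCX support):

```shell
# Verbose BTL selection output, for diagnosing why a CPC is rejected:
export OMPI_MCA_btl_base_verbose=100
# Preferred path on IB/RoCE: select the UCX PML explicitly:
export OMPI_MCA_pml=ucx
echo "verbose=$OMPI_MCA_btl_base_verbose pml=$OMPI_MCA_pml"
```

With the verbose flag set, re-running the hello world under mpirun should print the per-component selection decisions, including whether rdmacm or udcm was rejected for the requested thread level.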
[OMPI users] Open MPI v4.0.7rc2 available for testing
A second release candidate for Open MPI v4.0.7 is now available for testing: https://www.open-mpi.org/software/ompi/v4.0/

New fixes with this release candidate:

- Fix an issue with MPI_IALLREDUCE_SCATTER when using large count arguments.
- Fixed an issue with POST/START/COMPLETE/WAIT when using subsets of processes. Thanks to Thomas Gilles for reporting.

Your Open MPI release team.

Howard Pritchard
Research Scientist, HPC-ENV
Los Alamos National Laboratory
howa...@lanl.gov
[OMPI users] Open MPI v4.0.7rc1 available for testing
The first release candidate for Open MPI v4.0.7 is now available for testing: https://www.open-mpi.org/software/ompi/v4.0/

Some fixes include:

- Numerous fixes from vendor partners.
- Fix a problem with a couple of MPI_IALLREDUCE algorithms. Thanks to John Donners for reporting.
- Fix an edge case where MPI_Reduce is invoked with zero count and NULL source and destination buffers.
- Use the mfence instruction in opal_atomic_rmb on x86_64 cpus. Thanks to George Katevenis for proposing a fix.
- Fix an issue with the Open MPI build system using the SLURM-provided PMIx when not requested by the user. Thanks to Alexander Grund for reporting.
- Fix a problem compiling Open MPI with clang on case-insensitive file systems. Thanks to @srpgilles for reporting.
- Fix some OFI usNIC/OFI MTL interaction problems. Thanks to @roguephysicist for reporting this issue.
- Fix a problem with the Posix fbtl component failing to load. Thanks to Honggang Li for reporting.

Your Open MPI release team.

Howard Pritchard
Research Scientist, HPC-ENV
Los Alamos National Laboratory
howa...@lanl.gov
Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"
Hi Greg, I believe so, concerning your TCP question. I think the patch probably isn't actually being used; otherwise you would have noticed the curious print statement. Sorry about that. I'm out of ideas on what may be happening. Howard

From: "Fischer, Greg A."
Date: Friday, October 15, 2021 at 9:17 AM
To: "Pritchard Jr., Howard", Open MPI Users
Cc: "Fischer, Greg A."
Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

I tried the patch, but I get the same result:

error obtaining device attributes for mlx4_0 errno says Success

I'm getting (what I think are) good transfer rates using "--mca btl self,tcp" on the osu_bw test (~7000 MB/s). It seems to me that the only way that could be happening is if the InfiniBand interfaces are being used over TCP, correct? Would such an arrangement preclude the ability to do RDMA or openib? Perhaps the network is set up in such a way that the IB hardware is not discoverable by openib? (I'm not a network admin, and I wasn't involved in the setup of the network. Unfortunately, the person who knows the most has recently left the organization.) Greg

From: Pritchard Jr., Howard
Sent: Thursday, October 14, 2021 5:45 PM
To: Fischer, Greg A.; Open MPI Users
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

[External Email] Hi Greg, Oh yes, that's not good about rdmacm, and yes, the OFED looks pretty old. Did you by any chance apply that patch? I generated that for a sysadmin here who was in the situation where they needed to maintain Open MPI 3.1.6 but had to also upgrade to some newer RHEL release, and Open MPI wasn't compiling after the RHEL upgrade. Howard

From: "Fischer, Greg A."
Date: Thursday, October 14, 2021 at 1:47 PM
To: "Pritchard Jr., Howard", Open MPI Users
Cc: "Fischer, Greg A."
Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

I added --enable-mt and re-installed UCX. Same result. (I didn't re-compile OpenMPI.) A conspicuous warning I see in my UCX configure output is:

checking for rdma_establish in -lrdmacm... no
configure: WARNING: RDMACM requested but librdmacm is not found or does not provide rdma_establish() API

The version of librdmacm we have comes from librdmacm-devel-41mlnx1-OFED.4.1.0.1.0.41102.x86_64, which seems to date from mid-2017. I wonder if that's too old? Greg

From: Pritchard Jr., Howard
Sent: Thursday, October 14, 2021 3:31 PM
To: Fischer, Greg A.; Open MPI Users
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

[External Email] Hi Greg, I think the UCX PML may be discomfited by the lack of thread safety. Could you try using the contrib/configure-release-mt in your ucx folder? You want to add --enable-mt.
That's what stands out in your configure output compared to the one I usually get when building on a MLNX ConnectX-5 cluster with MLNX_OFED_LINUX-4.5-1.0.1.0. Here's the output from one of my UCX configs:

configure: =
configure: UCX build configuration:
configure:         Build prefix: /ucx_testing/ucx/test_install
configure:    Configuration dir: ${prefix}/etc/ucx
configure:   Preprocessor flags: -DCPU_FLAGS="" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure:           C compiler: /users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/gcc -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch -Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length -Wnested-externs -Wshadow -Werror=declaration-after-statement
configure:         C++ compiler: /users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/g++ -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure:         Multi-thread: enabled
configure:         NUMA support: disabled
configure:            MPI tests: disabled
configure:          VFS support: no
configure:        Devel headers: no
configure: io_demo CUDA support: no
configure:             Bindings: < >
configure:          UCS modules: < &
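The rebuild being suggested above can be sketched as follows. This is a minimal outline, not a definitive recipe: the source directories and install prefixes are placeholders, and Open MPI must be reconfigured against the resulting UCX for the change to take effect.

```shell
# Reconfigure UCX with multi-threading enabled (the --enable-mt flag from
# the thread). Paths and version numbers are illustrative placeholders.
cd ucx-1.11.2
./contrib/configure-release --prefix=$HOME/ucx-mt --enable-mt
make -j8 install

# Check the configure summary reports "Multi-thread: enabled", then
# rebuild Open MPI pointing --with-ucx at the new install:
cd ../openmpi-4.1.1
./configure --prefix=$HOME/ompi-ucx --with-ucx=$HOME/ucx-mt
make -j8 install
```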
Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"
Hi Greg, Oh yes, that's not good about rdmacm. Yes, the OFED looks pretty old. Did you by any chance apply that patch? I generated it for a sysadmin here who needed to maintain Open MPI 3.1.6 but also had to upgrade to a newer RHEL release, after which Open MPI no longer compiled. Howard

From: "Fischer, Greg A." Date: Thursday, October 14, 2021 at 1:47 PM To: "Pritchard Jr., Howard" , Open MPI Users Cc: "Fischer, Greg A." Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

I added --enable-mt and re-installed UCX. Same result. (I didn't re-compile OpenMPI.) A conspicuous warning I see in my UCX configure output is:

checking for rdma_establish in -lrdmacm... no
configure: WARNING: RDMACM requested but librdmacm is not found or does not provide rdma_establish() API

The version of librdmacm we have comes from librdmacm-devel-41mlnx1-OFED.4.1.0.1.0.41102.x86_64, which seems to date from mid-2017. I wonder if that's too old? Greg

From: Pritchard Jr., Howard Sent: Thursday, October 14, 2021 3:31 PM To: Fischer, Greg A. ; Open MPI Users Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success" [External Email]

Hi Greg, I think the UCX PML may be discomfited by the lack of thread safety. Could you try using the contrib/configure-release-mt in your ucx folder? You want to add --enable-mt.
That's what stands out in your configure output compared to the one I usually get when building on a MLNX ConnectX-5 cluster with MLNX_OFED_LINUX-4.5-1.0.1.0. Here's the output from one of my UCX configs:

configure: =
configure: UCX build configuration:
configure:         Build prefix: /ucx_testing/ucx/test_install
configure:    Configuration dir: ${prefix}/etc/ucx
configure:   Preprocessor flags: -DCPU_FLAGS="" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure:           C compiler: /users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/gcc -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch -Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length -Wnested-externs -Wshadow -Werror=declaration-after-statement
configure:         C++ compiler: /users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/g++ -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure:         Multi-thread: enabled
configure:         NUMA support: disabled
configure:            MPI tests: disabled
configure:          VFS support: no
configure:        Devel headers: no
configure: io_demo CUDA support: no
configure:             Bindings: < >
configure:          UCS modules: < >
configure:          UCT modules: < ib cma knem >
configure:         CUDA modules: < >
configure:         ROCM modules: < >
configure:           IB modules: < >
configure:          UCM modules: < >
configure:         Perf modules: < >
configure: =

Howard

From: "Fischer, Greg A." Date: Thursday, October 14, 2021 at 12:46 PM To: "Pritchard Jr., Howard" , Open MPI Users Cc: "Fischer, Greg A."
Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

Thanks, Howard. I downloaded a current version of UCX (1.11.2) and installed it with OpenMPI 4.1.1. When I try to specify "-mca pml ucx" for a simple, 2-process benchmark problem, I get:

--
No components were able to be opened in the pml framework.

This typically means that either no components of this type were installed, or none of the installed components can be loaded. Sometimes this means that shared libraries required by these components are unable to be found/loaded.

Host: bl1311
Framework: pml
--
[bl1311:20168] PML ucx cannot be selected
[bl1311:20169] PML ucx cannot be selected

I've attached my ucx_info -d output, as well as the ucx configuration information. I'm n
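When the UCX PML refuses to be selected like this, raising the framework's verbosity usually shows which check failed (no usable transports, a library that can't be loaded, and so on). A diagnostic sketch; the executable name is a placeholder:

```shell
# Ask the pml framework to explain its selection decisions.
# "./a.out" stands in for the benchmark executable.
mpirun -np 2 --mca pml ucx --mca pml_base_verbose 10 ./a.out

# Independently, list the transports and devices UCX itself can open:
ucx_info -d
```

If `ucx_info -d` shows only tcp/shared-memory transports and no `ib` devices, the problem is in the UCX build or the OFED stack rather than in Open MPI.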
Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"
Hi Greg, I think the UCX PML may be discomfited by the lack of thread safety. Could you try using the contrib/configure-release-mt in your ucx folder? You want to add --enable-mt. That's what stands out in your configure output compared to the one I usually get when building on a MLNX ConnectX-5 cluster with MLNX_OFED_LINUX-4.5-1.0.1.0. Here's the output from one of my UCX configs:

configure: =
configure: UCX build configuration:
configure:         Build prefix: /ucx_testing/ucx/test_install
configure:    Configuration dir: ${prefix}/etc/ucx
configure:   Preprocessor flags: -DCPU_FLAGS="" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure:           C compiler: /users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/gcc -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch -Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length -Wnested-externs -Wshadow -Werror=declaration-after-statement
configure:         C++ compiler: /users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/g++ -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure:         Multi-thread: enabled
configure:         NUMA support: disabled
configure:            MPI tests: disabled
configure:          VFS support: no
configure:        Devel headers: no
configure: io_demo CUDA support: no
configure:             Bindings: < >
configure:          UCS modules: < >
configure:          UCT modules: < ib cma knem >
configure:         CUDA modules: < >
configure:         ROCM modules: < >
configure:           IB modules: < >
configure:          UCM modules: < >
configure:         Perf modules: < >
configure: =====

Howard

From: "Fischer, Greg A."
Date: Thursday, October 14, 2021 at 12:46 PM To: "Pritchard Jr., Howard" , Open MPI Users Cc: "Fischer, Greg A." Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

Thanks, Howard. I downloaded a current version of UCX (1.11.2) and installed it with OpenMPI 4.1.1. When I try to specify "-mca pml ucx" for a simple, 2-process benchmark problem, I get:

--
No components were able to be opened in the pml framework.

This typically means that either no components of this type were installed, or none of the installed components can be loaded. Sometimes this means that shared libraries required by these components are unable to be found/loaded.

Host: bl1311
Framework: pml
--
[bl1311:20168] PML ucx cannot be selected
[bl1311:20169] PML ucx cannot be selected

I've attached my ucx_info -d output, as well as the ucx configuration information. I'm not sure I follow everything on the UCX FAQ page, but it seems like everything is being routed over TCP, which is probably not what I want. Any thoughts as to what I might be doing wrong? Thanks, Greg

From: Pritchard Jr., Howard Sent: Wednesday, October 13, 2021 12:28 PM To: Open MPI Users Cc: Fischer, Greg A. Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success" [External Email]

Hi Greg, It's the aging of the openib btl. You may be able to apply the attached patch. Note the 3.1.x release stream is no longer supported. You may want to try using the 4.1.1 release, in which case you'll want to use UCX. Howard

From: users on behalf of "Fischer, Greg A. via users" Reply-To: Open MPI Users Date: Wednesday, October 13, 2021 at 10:06 AM To: "users@lists.open-mpi.org" Cc: "Fischer, Greg A."
Subject: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

Hello, I have compiled OpenMPI 3.1.6 from source on SLES12-SP3, and I am seeing the following errors when I try to use the openib btl:

WARNING: There was an error initializing an OpenFabrics device. Local ho
Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"
Hi Greg, It's the aging of the openib btl. You may be able to apply the attached patch. Note the 3.1.x release stream is no longer supported. You may want to try using the 4.1.1 release, in which case you'll want to use UCX. Howard

From: users on behalf of "Fischer, Greg A. via users" Reply-To: Open MPI Users Date: Wednesday, October 13, 2021 at 10:06 AM To: "users@lists.open-mpi.org" Cc: "Fischer, Greg A." Subject: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

Hello, I have compiled OpenMPI 3.1.6 from source on SLES12-SP3, and I am seeing the following errors when I try to use the openib btl:

WARNING: There was an error initializing an OpenFabrics device.

Local host: bl1308
Local device: mlx4_0
--
[bl1308][[44866,1],5][../../../../../openmpi-3.1.6/opal/mca/btl/openib/btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success

I have disabled UCX ("--without-ucx") because the UCX installation we have seems to be too out-of-date. ofed_info says "MLNX_OFED_LINUX-4.1-1.0.2.0". I've attached the detailed output of ofed_info and ompi_info. This issue seems similar to Issue #7461 (https://github.com/open-mpi/ompi/issues/7461), which I don't see a resolution for. Does anyone know what the likely explanation is? Is the version of OFED on the system badly out-of-sync with contemporary OpenMPI? Thanks, Greg
0001-patch-ibv_exp_dev_query-function-call.patch Description: 0001-patch-ibv_exp_dev_query-function-call.patch
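As a quick way to confirm whether the openib BTL is the only thing failing, it can be excluded so the run falls back to TCP and shared memory. This is a diagnostic sketch, not a performance configuration; the benchmark name is a placeholder:

```shell
# Exclude the failing openib BTL; traffic falls back to TCP/shared memory.
mpirun -np 2 --mca btl ^openib ./osu_bw

# Equivalently, name the fallback BTLs explicitly:
mpirun -np 2 --mca btl self,vader,tcp ./osu_bw
```

If these runs succeed while the default run fails, the problem is confined to openib's device initialization rather than the fabric itself.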
Re: [OMPI users] [EXTERNAL] Error Signal code: Address not mapped (1)
Hello Arturo, Would you mind filing an issue against Open MPI, using the template to provide info we could use to help triage this problem? https://github.com/open-mpi/ompi/issues/new Thanks, Howard

From: users on behalf of Arturo Fernandez via users Reply-To: Open MPI Users Date: Monday, June 21, 2021 at 3:33 PM To: Open MPI Users Cc: Arturo Fernandez Subject: [EXTERNAL] [OMPI users] Error Signal code: Address not mapped (1)

Hello, I'm getting the error message (with either v4.1.0 or v4.1.1):

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: (nil)
*** End of error message ***
Segmentation fault (core dumped)

The AWS system is running CentOS8, but I don't think that is the problem. After some troubleshooting, the error seems to appear and disappear depending on the libfabric version. When the system uses libfabric-aws-1.10.2g everything sails smoothly; the problems appear when libfabric-aws is upgraded to 1.11.2. I've tried to understand the differences between these versions, but it's beyond my expertise. Thanks, Arturo
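When triaging a crash that tracks the libfabric version like this, it helps to record exactly which libfabric build and providers are being resolved at run time. A sketch; the package names follow the AWS naming in the report, and the executable is a placeholder:

```shell
# Report the libfabric version and the providers it exposes:
fi_info --version
fi_info -l

# On an RPM-based system, confirm which libfabric-aws package is installed:
rpm -qa | grep -i libfabric

# While testing, the provider can be pinned explicitly on the mpirun line, e.g.:
mpirun -np 2 --mca mtl_ofi_provider_include efa ./a.out
```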
Re: [OMPI users] [EXTERNAL] Linker errors in Fedora 34 Docker container
Hi John, Good to know. For the record, were you using a Docker container unmodified from Docker Hub? Howard

From: John Haiducek Date: Wednesday, May 26, 2021 at 9:35 AM To: "Pritchard Jr., Howard" Cc: "users@lists.open-mpi.org" Subject: Re: [EXTERNAL] [OMPI users] Linker errors in Fedora 34 Docker container

That was it, thank you! After installing findutils it builds successfully. John

On May 26, 2021, at 10:49 AM, Pritchard Jr., Howard wrote:

Hi John, I don't like this in the make output:

../../libtool: line 5705: find: command not found

Maybe you need to install findutils or the relevant Fedora RPM in your container? Howard

From: John Haiducek Date: Wednesday, May 26, 2021 at 7:29 AM To: "Pritchard Jr., Howard" , "users@lists.open-mpi.org" Subject: Re: [EXTERNAL] [OMPI users] Linker errors in Fedora 34 Docker container

On May 25, 2021, at 6:53 PM, Pritchard Jr., Howard wrote:

In your build area, do you see any .lo files in opal/util/coeval?

That directory doesn't exist in my build area. In opal/util/keyval I have keyval_lex.lo.

Which compiler are you using?

gcc 11.1.1

Also, are you building from the tarballs at https://www.open-mpi.org/software/ompi/v4.1/ ?

Yes; specifically I'm using the tarball from https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.bz2

John
Re: [OMPI users] [EXTERNAL] Linker errors in Fedora 34 Docker container
Hi John, I don't like this in the make output:

../../libtool: line 5705: find: command not found

Maybe you need to install findutils or the relevant Fedora RPM in your container? Howard

From: John Haiducek Date: Wednesday, May 26, 2021 at 7:29 AM To: "Pritchard Jr., Howard" , "users@lists.open-mpi.org" Subject: Re: [EXTERNAL] [OMPI users] Linker errors in Fedora 34 Docker container

On May 25, 2021, at 6:53 PM, Pritchard Jr., Howard wrote:

In your build area, do you see any .lo files in opal/util/coeval?

That directory doesn't exist in my build area. In opal/util/keyval I have keyval_lex.lo.

Which compiler are you using?

gcc 11.1.1

Also, are you building from the tarballs at https://www.open-mpi.org/software/ompi/v4.1/ ?

Yes; specifically I'm using the tarball from https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.bz2

John
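In a Dockerfile, the fix amounts to installing build prerequisites before running configure, since minimal Fedora base images omit `find` (which libtool shells out to). A minimal sketch for a Fedora 34 image; the exact package list is an assumption beyond findutils:

```shell
# Inside the Fedora 34 container (or as Dockerfile RUN steps):
dnf -y install findutils gcc gcc-c++ make

# Rebuild from a clean tree so libtool's earlier silent failures don't linger:
cd openmpi-4.1.1
make distclean || true
./configure --prefix=/usr/local/openmpi
make -j8 install
```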
Re: [OMPI users] [EXTERNAL] Linker errors in Fedora 34 Docker container
Hi John, I don't think an external dependency is going to fix this. In your build area, do you see any .lo files in opal/util/keyval? Which compiler are you using? Also, are you building from the tarballs at https://www.open-mpi.org/software/ompi/v4.1/ ? Howard

From: users on behalf of John Haiducek via users Reply-To: Open MPI Users Date: Tuesday, May 25, 2021 at 3:49 PM To: "users@lists.open-mpi.org" Cc: John Haiducek Subject: [EXTERNAL] [OMPI users] Linker errors in Fedora 34 Docker container

Hi, When attempting to build OpenMPI in a Fedora 34 Docker image I get the following linker errors:

#22 77.36 make[2]: Entering directory '/build/openmpi-4.1.1/opal/tools/wrappers'
#22 77.37 CC opal_wrapper.o
#22 77.67 CCLD opal_wrapper
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_yytext'
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_yyin'
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_yylineno'
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_yynewlines'
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_yylex'
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_parse_done'
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_yylex_destroy'
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_init_buffer'
#22 77.81 collect2: error: ld returned 1 exit status

My configure command is just ./configure --prefix=/usr/local/openmpi.
I also tried ./configure --prefix=/usr/local/openmpi --disable-silent-rules --enable-builtin-atomics --with-hwloc=/usr --with-libevent=external --with-pmix=external --with-valgrind (similar to what is in the Fedora spec file for OpenMPI) but that produces the same errors. Is there a third-party library I need to install or an additional configure option I can set that will fix these? John
Re: [OMPI users] [EXTERNAL] Re: Newbie With Issues
Hi Ben, You're heading down the right path. On our HPC systems, we use modules to handle things like setting LD_LIBRARY_PATH etc. when using Intel 21.x.y and other Intel compilers. For example, for Intel/21.1.1 the following were added (edited to avoid posting explicit paths on our systems):

prepend-path LD_LIBRARY_PATH /path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/lib:/path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/compiler/lib/intel64_lin
prepend-path PATH /path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/bin
prepend-path LD_LIBRARY_PATH /path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/lib/emu
prepend-path LD_LIBRARY_PATH /path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/lib/x64
prepend-path LD_LIBRARY_PATH /path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/lib

You should check which Intel compiler libraries you installed and make sure you're prepending the relevant folders to LD_LIBRARY_PATH. We have tested building Open MPI with the Intel oneAPI compilers and, except for ifx, things went okay. Howard

On 3/30/21, 11:12 AM, "users on behalf of bend linux4ms.net via users" wrote:

I think I have found one of the issues. I took the check C program from OpenMPI and tried to compile it, and got the following:

[root@jean-r8-sch24 benchmarks]# icc dummy.c
ld: cannot find -lstdc++
[root@jean-r8-sch24 benchmarks]# cat dummy.c
int main () { ; return 0; }
[root@jean-r8-sch24 benchmarks]#

Ben Duncan - Business Network Solutions, Inc. 336 Elton Road Jackson MS, 39212 "Never attribute to malice, that which can be adequately explained by stupidity" - Hanlon's Razor

From: users on behalf of bend linux4ms.net via users Sent: Tuesday, March 30, 2021 12:00 PM To: Open MPI Users Cc: bend linux4ms.net Subject: Re: [OMPI users] Newbie With Issues

Thanks Mr. Heinz for responding.
It may be the case with clang, but after sourcing the Intel setvars.sh, issuing the following compile gives me this message:

[root@jean-r8-sch24 openmpi-4.1.0]# icc
icc: command line error: no files specified; for help type "icc -help"
[root@jean-r8-sch24 openmpi-4.1.0]# icc -v
icc version 2021.1 (gcc version 8.3.1 compatibility)
[root@jean-r8-sch24 openmpi-4.1.0]#

That would lead me to believe that icc is still available to use. This is a government contract and they want the latest and greatest.

Ben Duncan - Business Network Solutions, Inc. 336 Elton Road Jackson MS, 39212 "Never attribute to malice, that which can be adequately explained by stupidity" - Hanlon's Razor

From: Heinz, Michael William Sent: Tuesday, March 30, 2021 11:52 AM To: Open MPI Users Cc: bend linux4ms.net Subject: RE: Newbie With Issues

It looks like you're trying to build Open MPI with the Intel C compiler. TBH - I think that icc isn't included with the latest release of oneAPI; I think they've switched to including clang instead. I had a similar issue to yours, but I resolved it by installing a 2020 version of the Intel HPC software. Unfortunately, those versions require purchasing a license.

-----Original Message-----
From: users On Behalf Of bend linux4ms.net via users Sent: Tuesday, March 30, 2021 12:42 PM To: Open MPI Open MPI Cc: bend linux4ms.net Subject: [OMPI users] Newbie With Issues

Hello group, My name is Ben Duncan. I have been tasked with installing OpenMPI and the Intel compiler on an HPC system. I am new to the whole HPC and MPI environment, so be patient with me. I have successfully gotten the Intel compiler (oneAPI version, from l_HPCKit_p_2021.1.0.2684_offline.sh) installed without any errors. I am trying to install and configure OpenMPI version 4.1.0; however, trying to run configuration for openmpi gives me the following error:

== Configuring Open MPI

*** Startup tests
checking build system type... x86_64-unknown-linux-gnu
checking host system type...
x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for gcc... icc
checking whether the C compiler works... no
configure: error: in `/p/app/openmpi-4.1.0':
configure: error: C compiler cannot create executables
See `config.log' for more details

With the error in config.log being:

configure:6499: $? = 0
configure:6488: icc -qversion >&5
icc: command line warning #100
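The `ld: cannot find -lstdc++` failure earlier in this thread usually means the GNU C++ standard library development package is missing; icc links against the system libstdc++. A sketch of the check and fix on an RHEL 8 class system (package names are an assumption):

```shell
# Install the GNU C++ runtime/development bits that icc links against:
dnf -y install gcc-c++ libstdc++-devel

# Re-run the minimal compile test from the thread:
cat > dummy.c <<'EOF'
int main () { ; return 0; }
EOF
icc dummy.c && echo "icc link OK"
```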
Re: [OMPI users] [EXTERNAL] building openshem on opa
Hi Michael, You may want to try https://github.com/Sandia-OpenSHMEM/SOS if you want to use OpenSHMEM over OPA. If you have lots of cycles for development work, you could write an OFI SPML for the OSHMEM component of Open MPI. Howard

On 3/22/21, 8:56 AM, "users on behalf of Michael Di Domenico via users" wrote:

I can build and run OpenMPI on an OPA network just fine, but it turns out building OpenSHMEM fails. The message is "(no spml) found". Looking at the config log, it looks like it tries to build the spml ikrit and ucx components, which fail. I turn UCX off because it doesn't support OPA and isn't needed. So this message is really just a confirmation that OpenSHMEM and OPA are not capable of being built together (or did I do something wrong?), plus a curiosity about what kind of effort would be involved in getting it to work.
Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path
Hi Folks, I'm also having problems reproducing this on one of our OPA clusters:

libpsm2-11.2.78-1.el7.x86_64
libpsm2-devel-11.2.78-1.el7.x86_64

The cluster runs RHEL 7.8.

hca_id: hfi1_0
    transport:      InfiniBand (0)
    fw_ver:         1.27.0
    node_guid:      0011:7501:0179:e2d7
    sys_image_guid: 0011:7501:0179:e2d7
    vendor_id:      0x1175
    vendor_part_id: 9456
    hw_ver:         0x11
    board_id:       Intel Omni-Path Host Fabric Interface Adapter 100 Series
    phys_port_cnt:  1
    port: 1
        state:      PORT_ACTIVE (4)
        max_mtu:    4096 (5)
        active_mtu: 4096 (5)
        sm_lid:     1
        port_lid:   99
        port_lmc:   0x00
        link_layer: InfiniBand

Using gcc/gfortran 9.3.0. Built Open MPI 4.0.5 without any special configure options. Howard

On 1/27/21, 9:47 AM, "users on behalf of Michael Di Domenico via users" wrote:

For whatever it's worth, running the test program on my OPA cluster seems to work. Well, it keeps spitting out [INFO MEMORY] lines; I'm not sure if it's supposed to stop at some point. I'm running rhel7, gcc 10.1, openmpi 4.0.5rc2, with-ofi, without-{psm,ucx,verbs}

On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users wrote:
>
> Hi Michael
>
> Indeed I'm a little bit lost with all these parameters in OpenMPI, mainly because for years it has worked just fine out of the box in all my deployments on various architectures, interconnects and Linux flavors. Some weeks ago I deployed OpenMPI 4.0.5 on CentOS 8 with gcc10, Slurm and UCX on an AMD Epyc2 cluster with ConnectX-6, and it just worked fine. It is the first time I've had such trouble deploying this library.
>
> If you have my mail posted the 25/01/2021 in this discussion at 18h54 (maybe Paris TZ), there is a small test case attached that shows the problem. Did you get it, or did the list strip the attachments? I can provide it again.
>
> Many thanks
>
> Patrick
>
> Le 26/01/2021 à 19:25, Heinz, Michael William a écrit :
>
> Patrick, how are you using the original PSM if you're using Omni-Path hardware? The original PSM was written for QLogic DDR and QDR InfiniBand adapters.
> >
> > As far as needing openib - the issue is that the PSM2 MTL doesn't support a subset of MPI operations that we previously used the pt2pt BTL for. For recent versions of OMPI, the preferred BTL to use with PSM2 is OFI.
> >
> > Is there any chance you can give us a sample MPI app that reproduces the problem? I can't think of another way I can give you more help without being able to see what's going on. It's always possible there's a bug in the PSM2 MTL, but it would be surprising at this point.
> >
> > Sent from my iPad
> >
> > On Jan 26, 2021, at 1:13 PM, Patrick Begou via users wrote:
> >
> > Hi all,
> >
> > I ran many tests today. I saw that an older 4.0.2 version of OpenMPI packaged with Nix was running using openib. So I added the --with-verbs option to set up this module.
> >
> > What I can see now is that, with:
> >
> > mpirun -hostfile $OAR_NODEFILE --mca mtl psm -mca btl_openib_allow_ib true
> >
> > - the testcase test_layout_array is running without error
> >
> > - the bandwidth measured with osu_bw is half of what it should be:
> >
> > # OSU MPI Bandwidth Test v5.7
> > # Size      Bandwidth (MB/s)
> > 1           0.54
> > 2           1.13
> > 4           2.26
> > 8           4.51
> > 16          9.06
> > 32          17.93
> > 64          33.87
> > 128         69.29
> > 256         161.24
> > 512         333.82
> > 1024        682.66
> > 2048        1188.63
> > 4096        1760.14
> > 8192        2166.08
> > 16384       2036.95
> > 32768       3466.63
> > 65536       6296.73
> > 131072      7509.43
> > 262144      9104.78
> > 524288      6908.55
> > 1048576     5530.37
> > 2097152     4489.16
> > 4194304     3498.14
> >
> > mpirun -hostfile $OAR_NODEFILE --mca mtl psm2 -mca btl_openib_allow_ib true ...
> >
> > - the testcase test_layout_array is not giving correct results
> >
Re: [OMPI users] [EXTERNAL] OpenMPI 4.0.5 error with Omni-path
Hi Patrick, Also it might not hurt to disable the openib BTL by setting

export OMPI_MCA_btl=^openib

in your shell prior to invoking mpirun. Howard

From: users on behalf of "Heinz, Michael William via users" Reply-To: Open MPI Users Date: Monday, January 25, 2021 at 8:47 AM To: "users@lists.open-mpi.org" Cc: "Heinz, Michael William" Subject: [EXTERNAL] [OMPI users] OpenMPI 4.0.5 error with Omni-path

Patrick, You really have to provide us some detailed information if you want assistance. At a minimum, we need to know whether you're using the PSM2 MTL or the OFI MTL, and what the actual error is. Please provide the actual command line you are having problems with, along with any errors. In addition, I recommend adding the following to your command line:

-mca mtl_base_verbose 99

If you have a way to reproduce the problem quickly, you might also want to add:

-x PSM2_TRACEMASK=11

But that will add very detailed debug output to your command, and you haven't mentioned that PSM2 is failing, so it may not be useful.
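Combining the suggestions from this thread, a single diagnostic invocation might look like the following. This is a sketch; the benchmark name is a placeholder, and the MTL choice (psm2 vs. ofi) should match what the verbose output reports as selectable:

```shell
# Disable the openib BTL and make the MTL selection visible:
export OMPI_MCA_btl=^openib
mpirun -np 2 --mca mtl psm2 --mca mtl_base_verbose 99 ./osu_bw

# If PSM2 itself is suspect, add its (very chatty) provider tracing:
mpirun -np 2 --mca mtl psm2 -x PSM2_TRACEMASK=11 ./osu_bw
```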
Re: [OMPI users] [EXTERNAL] RMA breakage
Hello Dave, There's an issue opened about this - https://github.com/open-mpi/ompi/issues/8252 However, I'm not observing failures with IMB RMA on an IB/aarch64 system with UCX 1.9.0 using OMPI 4.0.x at 6ea9d98. This cluster is running RHEL 7.6 and MLNX_OFED_LINUX-4.5-1.0.1.0. Howard

On 12/7/20, 7:21 AM, "users on behalf of Dave Love via users" wrote:

After seeing several failures with RMA with the change needed to get 4.0.5 through IMB, I looked for simple tests. So, I built the mpich 3.4b1 tests -- or the ones that would build, and I haven't checked why some fail -- and ran the rma set. Three out of 180 passed. Many (most?) aborted in ucx, like I saw with production code, with a backtrace like below; others at least reported an MPI error. This was on two nodes of a ppc64le RHEL7 IB system with 4.0.5, ucx 1.9, and MCA parameters from the ucx FAQ (though I got the same result without those parameters). I haven't tried to reproduce it on x86_64, but it seems unlikely to be CPU-specific. Is there anything we can do to run RMA without just moving to mpich? Do releases actually get tested on run-of-the-mill IB+Lustre systems?
+ mpirun -n 2 winname
[gpu005:50906:0:50906] ucp_worker.c:183 Fatal: failed to set active message handler id 1: Invalid parameter
backtrace (tid: 50906)
 0 0x0005453c ucs_debug_print_backtrace() .../src/ucs/debug/debug.c:656
 1 0x00028218 ucp_worker_set_am_handlers() .../src/ucp/core/ucp_worker.c:182
 2 0x00029ae0 ucp_worker_iface_deactivate() .../src/ucp/core/ucp_worker.c:816
 3 0x00029ae0 ucp_worker_iface_check_events() .../src/ucp/core/ucp_worker.c:766
 4 0x00029ae0 ucp_worker_iface_deactivate() .../src/ucp/core/ucp_worker.c:819
 5 0x00029ae0 ucp_worker_iface_unprogress_ep() .../src/ucp/core/ucp_worker.c:841
 6 0x000582a8 ucp_wireup_ep_t_cleanup() .../src/ucp/wireup/wireup_ep.c:381
 7 0x00068124 ucs_class_call_cleanup_chain() .../src/ucs/type/class.c:56
 8 0x00057420 ucp_wireup_ep_t_delete() .../src/ucp/wireup/wireup_ep.c:28
 9 0x00013de8 uct_ep_destroy() .../src/uct/base/uct_iface.c:546
10 0x000252f4 ucp_proxy_ep_replace() .../src/ucp/core/ucp_proxy_ep.c:236
11 0x00057b88 ucp_wireup_ep_progress() .../src/ucp/wireup/wireup_ep.c:89
12 0x00049820 ucs_callbackq_slow_proxy() .../src/ucs/datastruct/callbackq.c:400
13 0x0002ca04 ucs_callbackq_dispatch() .../src/ucs/datastruct/callbackq.h:211
14 0x0002ca04 uct_worker_progress() .../src/uct/api/uct.h:2346
15 0x0002ca04 ucp_worker_progress() .../src/ucp/core/ucp_worker.c:2040
16 0xc144 progress_callback() osc_ucx_component.c:0
17 0x000374ac opal_progress() ???:0
18 0x0006cc74 ompi_request_default_wait() ???:0
19 0x000e6fcc ompi_coll_base_sendrecv_actual() ???:0
20 0x000e5530 ompi_coll_base_allgather_intra_two_procs() ???:0
21 0x6c44 ompi_coll_tuned_allgather_intra_dec_fixed() ???:0
22 0xdc20 component_select() osc_ucx_component.c:0
23 0x00115b90 ompi_osc_base_select() ???:0
24 0x00075264 ompi_win_create() ???:0
25 0x000cb4e8 PMPI_Win_create() ???:0
26 0x10006ecc MTestGetWin() .../mpich-3.4b1/test/mpi/util/mtest.c:1173
27 0x10002e40 main() .../mpich-3.4b1/test/mpi/rma/winname.c:25
28 0x00025200 generic_start_main.isra.0() libc-start.c:0
29 0x000253f4 __libc_start_main() ???:0

followed by the abort backtrace
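Since the backtrace dies inside the UCX one-sided (osc) component during window creation, one workaround to try while the linked issue is open is steering osc away from UCX. This is a hedged sketch; whether a non-UCX osc fallback (e.g. rdma or pt2pt) is available depends on how the installation was built:

```shell
# Exclude the UCX one-sided component so MPI_Win_create falls back to
# another osc implementation, if one was built:
export OMPI_MCA_osc=^ucx
mpirun -n 2 ./winname
```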
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Hi All, I opened a new issue to track the coll_perf failure, in case it's not related to the HDF5 problem reported earlier: https://github.com/open-mpi/ompi/issues/8246 Howard

On Mon., 23 Nov. 2020 at 12:14, Dave Love via users <users@lists.open-mpi.org> wrote:

> Mark Dixon via users writes:
>
> > Surely I cannot be the only one who cares about using a recent openmpi
> > with hdf5 on lustre?
>
> I generally have similar concerns. I dug out the romio tests, assuming
> something more basic is useful. I ran them with ompi 4.0.5+ucx on
> Mark's lustre system (similar to a few nodes of Summit, apart from the
> filesystem, but with quad-rail IB which doesn't give the bandwidth I
> expected).
>
> The perf test says romio performs a bit better. Also -- from overall
> time -- it's faster on IMB-IO (which I haven't looked at in detail, and
> ran with suboptimal striping).
>
> Test: perf
> romio321
> Access size per process = 4194304 bytes, ntimes = 5
> Write bandwidth without file sync = 19317.372354 Mbytes/sec
> Read bandwidth without prior file sync = 35033.325451 Mbytes/sec
> Write bandwidth including file sync = 1081.096713 Mbytes/sec
> Read bandwidth after file sync = 47135.349155 Mbytes/sec
> ompio
> Access size per process = 4194304 bytes, ntimes = 5
> Write bandwidth without file sync = 18442.698536 Mbytes/sec
> Read bandwidth without prior file sync = 31958.198676 Mbytes/sec
> Write bandwidth including file sync = 1081.058583 Mbytes/sec
> Read bandwidth after file sync = 31506.854710 Mbytes/sec
>
> However, romio coll_perf fails as follows, and ompio runs. Isn't there
> mpi-io regression testing?
> [gpu025:89063:0:89063] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1fffbc10)
> backtrace (tid:  89063)
>  0 0x0005453c ucs_debug_print_backtrace()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucs/debug/debug.c:656
>  1 0x00041b04 ucp_rndv_pack_data()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1335
>  2 0x0001c814 uct_self_ep_am_bcopy()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:278
>  3 0x0003f7ac uct_ep_am_bcopy()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2561
>  4 0x0003f7ac ucp_do_am_bcopy_multi()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.inl:79
>  5 0x0003f7ac ucp_rndv_progress_am_bcopy()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1352
>  6 0x00041cb8 ucp_request_try_send()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
>  7 0x00041cb8 ucp_request_send()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
>  8 0x00041cb8 ucp_rndv_rtr_handler()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1754
>  9 0x0001c984 uct_iface_invoke_am()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/base/uct_iface.h:635
> 10 0x0001c984 uct_self_iface_sendrecv_am()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:149
> 11 0x0001c984 uct_self_ep_am_short()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:262
> 12 0x0002ee30 uct_ep_am_short()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2549
> 13 0x0002ee30 ucp_do_am_single()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.c:68
> 14 0x00042908 ucp_proto_progress_rndv_rtr()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:172
> 15 0x0003f4c4 ucp_request_try_send()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
> 16 0x0003f4c4 ucp_request_send()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
> 17 0x0003f4c4 ucp_rndv_req_send_rtr()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/ta
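[Editor's note: for anyone wanting to reproduce Dave's ROMIO/OMPIO comparison — in the Open MPI 4.0.x series the MPI-IO implementation can be selected at run time through the io MCA framework. A sketch (the component name romio321 matches the 4.0.x series; verify with ompi_info on your build, and the test binary name coll_perf is taken from the thread):

    # See which MPI-IO components this build provides
    ompi_info | grep "MCA io"

    # Run the test with ROMIO...
    mpirun -np 16 --mca io romio321 ./coll_perf

    # ...and again with OMPIO
    mpirun -np 16 --mca io ompio ./coll_perf

The same selection can be made persistent via OMPI_MCA_io in the environment or an entry in openmpi-mca-params.conf.]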
Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.
Hi Martin,

Thanks, this is helpful. Are you getting this timeout when you're running the spawner process as a singleton?

Howard

On Fri., Aug. 14, 2020 at 17:44, Martín Morales < martineduardomora...@hotmail.com> wrote:

> Howard,
>
> I pasted below the error message after a while of the hang I referred to.
>
> Regards,
>
> Martín
>
> --------------------------------------------------------------------------
> A request has timed out and will therefore fail:
>
>   Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345
>
> Your job may terminate as a result of this problem. You may want to
> adjust the MCA parameter pmix_server_max_wait and try again. If this
> occurred during a connect/accept operation, you can adjust that time
> using the pmix_base_exchange_timeout parameter.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_dpm_dyn_init() failed
>   --> Returned "Timeout" (-15) instead of "Success" (0)
> --------------------------------------------------------------------------
> [nos-GF7050VT-M:03767] *** An error occurred in MPI_Init
> [nos-GF7050VT-M:03767] *** reported by process [2337734658,0]
> [nos-GF7050VT-M:03767] *** on a NULL communicator
> [nos-GF7050VT-M:03767] *** Unknown error
> [nos-GF7050VT-M:03767] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [nos-GF7050VT-M:03767] ***    and potentially your MPI job)
> [osboxes:02457] *** An error occurred in MPI_Comm_spawn
> [osboxes:02457] *** reported by process [2337734657,0]
> [osboxes:02457] *** on communicator MPI_COMM_WORLD
> [osboxes:02457] *** MPI_ERR_UNKNOWN: unknown error
> [osboxes:02457] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [osboxes:02457] ***    and potentially your MPI job)
> [osboxes:02458] 1 more process has sent help message help-orted.txt / timedout
> [osboxes:02458] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
> From: Martín Morales via users
> Sent: Friday, August 14, 2020 19:40
> To: Howard Pritchard
> Cc: Martín Morales; Open MPI Users
> Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.
>
> Hi Howard.
>
> Thanks for opening the GitHub issue to track this. I have run with mpirun without "master" in
> the hostfile and it runs OK. The hang occurs when I run as a singleton
> (no mpirun), which is the way I need to run. If I run top on both
> machines, the processes are correctly mapped but hung. It seems the
> MPI_Init() function doesn't return. Thanks for your help.
> > Best regards, > > > > Martín > > > > > > > > > > > > > > *From: *Howard Pritchard > *Sent: *viernes, 14 de agosto de 2020 15:18 > *To: *Martín Morales > *Cc: *Open MPI Users > *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with > dynamically processes allocation. OMPI 4.0.1 don't. > > > > Hi Martin, > > > > I opened an issue on Open MPI's github to track this > https://github.com/open-mpi/ompi/issues/8005 > > > > You may be seeing another problem if you removed master from the host > file. > > Could you add the --debug-daemons option to the mpirun and post the output? > > > > Howard > > > > > > Am Di., 11. Aug. 2020 um 17:35 Uhr schrieb Martín Morales < > martineduardomora...@hotmail.com>: > > Hi Howard. > > > > Great!, that works for the crashing problem with OMPI 4.0.4. However It > stills hanging if I remove “master” (host which launches spawning > processes) from my hostfile. > > I need spawn only in “worker”. Is there a way or workaround for doing this > without mpirun? > > Thanks a lot for your assistance. > > > > Martín > > > > > > > > > > *From: *Howard Pritchard > *Sent: *lunes, 10 de agosto de 2020 19:13 > *To: *Martín
Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.
Hi Martin, I opened an issue on Open MPI's github to track this https://github.com/open-mpi/ompi/issues/8005 You may be seeing another problem if you removed master from the host file. Could you add the --debug-daemons option to the mpirun and post the output? Howard Am Di., 11. Aug. 2020 um 17:35 Uhr schrieb Martín Morales < martineduardomora...@hotmail.com>: > Hi Howard. > > > > Great!, that works for the crashing problem with OMPI 4.0.4. However It > stills hanging if I remove “master” (host which launches spawning > processes) from my hostfile. > > I need spawn only in “worker”. Is there a way or workaround for doing this > without mpirun? > > Thanks a lot for your assistance. > > > > Martín > > > > > > > > > > *From: *Howard Pritchard > *Sent: *lunes, 10 de agosto de 2020 19:13 > *To: *Martín Morales > *Cc: *Open MPI Users > *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with > dynamically processes allocation. OMPI 4.0.1 don't. > > > > Hi Martin, > > > > I was able to reproduce this with 4.0.x branch. I'll open an issue. > > > > If you really want to use 4.0.4, then what you'll need to do is build an > external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and > then build Open MPI using the --with-pmix=where your pmix is installed > > You will also need to build both Open MPI and PMIx against the same > libevent. There's a configure option with both packages to use an > external libevent installation. > > > > Howard > > > > > > Am Mo., 10. Aug. 2020 um 13:52 Uhr schrieb Martín Morales < > martineduardomora...@hotmail.com>: > > Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Do I have > to post this on the bug section? Thanks and regards. > > > > Martín > > > > *From: *Howard Pritchard > *Sent: *lunes, 10 de agosto de 2020 14:44 > *To: *Open MPI Users > *Cc: *Martín Morales > *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with > dynamically processes allocation. OMPI 4.0.1 don't. 
> > > > Hello Martin, > > > > Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx > version that introduced a problem with spawn for the 4.0.2-4.0.4 versions. > > This is supposed to be fixed in the 4.0.5 release. Could you try the > 4.0.5rc1 tarball and see if that addresses the problem you're seeing? > > > > https://www.open-mpi.org/software/ompi/v4.0/ > > > > Howard > > > > > > > > Am Do., 6. Aug. 2020 um 09:50 Uhr schrieb Martín Morales via users < > users@lists.open-mpi.org>: > > > > Hello people! > > I'm using OMPI 4.0.4 in a very simple scenario. Just 2 machines, one > "master", one "worker" on a Ethernet LAN. Both with Ubuntu 18.04.I builded > OMPI just like this: > > > > ./configure --prefix=/usr/local/openmpi-4.0.4/bin/ > > > > My hostfile is this: > > > > master slots=2 > worker slots=2 > > > > I'm trying to dynamically allocate the processes with MPI_Comm_Spawn(). > > If I launch the processes only on the "master" machine It's ok. But if I > use the hostfile crashes with this: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *-- > At least one pair of MPI processes are unable to reach each other for MPI > communications. This means that no Open MPI device has indicated that it > can be used to communicate between these processes. This is an error; Open > MPI requires that all MPI processes be able to reach each other. This > error can sometimes be the result of forgetting to specify the "self" BTL. > Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M Process 2 > ([[35155,1],0]) is on host: unknown! BTLs attempted: tcp self Your MPI > job is now going to abort; sorry. > -- > [nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file > dpm/dpm.c at line 493 > -- > It looks like MPI_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can fail > during MPI_INIT; some of which are due to configuration or environment > problems. 
This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open MPI > developer): ompi_dpm_dyn_init() failed
Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.
Hi Ralph, I've not yet determined whether this is actually a PMIx issue or the way the dpm stuff in OMPI is handling PMIx namespaces. Howard Am Di., 11. Aug. 2020 um 19:34 Uhr schrieb Ralph Castain via users < users@lists.open-mpi.org>: > Howard - if there is a problem in PMIx that is causing this problem, then > we really could use a report on it ASAP as we are getting ready to release > v3.1.6 and I doubt we have addressed anything relevant to what is being > discussed here. > > > > On Aug 11, 2020, at 4:35 PM, Martín Morales via users < > users@lists.open-mpi.org> wrote: > > Hi Howard. > > Great!, that works for the crashing problem with OMPI 4.0.4. However It > stills hanging if I remove “master” (host which launches spawning > processes) from my hostfile. > I need spawn only in “worker”. Is there a way or workaround for doing this > without mpirun? > Thanks a lot for your assistance. > > Martín > > > > > *From: *Howard Pritchard > *Sent: *lunes, 10 de agosto de 2020 19:13 > *To: *Martín Morales > *Cc: *Open MPI Users > *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with > dynamically processes allocation. OMPI 4.0.1 don't. > > Hi Martin, > > I was able to reproduce this with 4.0.x branch. I'll open an issue. > > If you really want to use 4.0.4, then what you'll need to do is build an > external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and > then build Open MPI using the --with-pmix=where your pmix is installed > You will also need to build both Open MPI and PMIx against the same > libevent. There's a configure option with both packages to use an > external libevent installation. > > Howard > > > Am Mo., 10. Aug. 2020 um 13:52 Uhr schrieb Martín Morales < > martineduardomora...@hotmail.com>: > > Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Do I have > to post this on the bug section? Thanks and regards. 
> > > Martín > > > *From: *Howard Pritchard > *Sent: *lunes, 10 de agosto de 2020 14:44 > *To: *Open MPI Users > *Cc: *Martín Morales > *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with > dynamically processes allocation. OMPI 4.0.1 don't. > > > Hello Martin, > > > Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx > version that introduced a problem with spawn for the 4.0.2-4.0.4 versions. > This is supposed to be fixed in the 4.0.5 release. Could you try the > 4.0.5rc1 tarball and see if that addresses the problem you're seeing? > > > https://www.open-mpi.org/software/ompi/v4.0/ > > > Howard > > > > > > > Am Do., 6. Aug. 2020 um 09:50 Uhr schrieb Martín Morales via users < > users@lists.open-mpi.org>: > > > Hello people! > I'm using OMPI 4.0.4 in a very simple scenario. Just 2 machines, one > "master", one "worker" on a Ethernet LAN. Both with Ubuntu 18.04.I builded > OMPI just like this: > > > ./configure --prefix=/usr/local/openmpi-4.0.4/bin/ > > > My hostfile is this: > > > master slots=2 > worker slots=2 > > > I'm trying to dynamically allocate the processes with MPI_Comm_Spawn(). > If I launch the processes only on the "master" machine It's ok. But if I > use the hostfile crashes with this: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *--At > least one pair of MPI processes are unable to reach each other forMPI > communications. This means that no Open MPI device has indicatedthat it > can be used to communicate between these processes. This isan error; Open > MPI requires that all MPI processes be able to reacheach other. This error > can sometimes be the result of forgetting tospecify the "self" BTL. > Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M Process 2 > ([[35155,1],0]) is on host: unknown! 
BTLs attempted: tcp selfYour MPI job > is now going to abort; > sorry.--[nos-GF7050VT-M:22526] > [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line > 493--It > looks like MPI_INIT failed for some reason; your parallel process islikely > to abort. There are many reasons that a parallel process canfail during > MPI_INIT; some of which are due to configuration or environmentproblems. > This failure appears to be an internal failure; here's someadditional > information (which may only be relevant to an Open MPIdeveloper): >
Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.
Hi Martin, I was able to reproduce this with 4.0.x branch. I'll open an issue. If you really want to use 4.0.4, then what you'll need to do is build an external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and then build Open MPI using the --with-pmix=where your pmix is installed You will also need to build both Open MPI and PMIx against the same libevent. There's a configure option with both packages to use an external libevent installation. Howard Am Mo., 10. Aug. 2020 um 13:52 Uhr schrieb Martín Morales < martineduardomora...@hotmail.com>: > Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Do I have > to post this on the bug section? Thanks and regards. > > > > Martín > > > > *From: *Howard Pritchard > *Sent: *lunes, 10 de agosto de 2020 14:44 > *To: *Open MPI Users > *Cc: *Martín Morales > *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with > dynamically processes allocation. OMPI 4.0.1 don't. > > > > Hello Martin, > > > > Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx > version that introduced a problem with spawn for the 4.0.2-4.0.4 versions. > > This is supposed to be fixed in the 4.0.5 release. Could you try the > 4.0.5rc1 tarball and see if that addresses the problem you're seeing? > > > > https://www.open-mpi.org/software/ompi/v4.0/ > > > > Howard > > > > > > > > Am Do., 6. Aug. 2020 um 09:50 Uhr schrieb Martín Morales via users < > users@lists.open-mpi.org>: > > > > Hello people! > > I'm using OMPI 4.0.4 in a very simple scenario. Just 2 machines, one > "master", one "worker" on a Ethernet LAN. Both with Ubuntu 18.04.I builded > OMPI just like this: > > > > ./configure --prefix=/usr/local/openmpi-4.0.4/bin/ > > > > My hostfile is this: > > > > master slots=2 > worker slots=2 > > > > I'm trying to dynamically allocate the processes with MPI_Comm_Spawn(). > > If I launch the processes only on the "master" machine It's ok. 
But if I > use the hostfile crashes with this: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *-- > At least one pair of MPI processes are unable to reach each other for MPI > communications. This means that no Open MPI device has indicated that it > can be used to communicate between these processes. This is an error; Open > MPI requires that all MPI processes be able to reach each other. This > error can sometimes be the result of forgetting to specify the "self" BTL. > Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M Process 2 > ([[35155,1],0]) is on host: unknown! BTLs attempted: tcp self Your MPI > job is now going to abort; sorry. > -- > [nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file > dpm/dpm.c at line 493 > -- > It looks like MPI_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can fail > during MPI_INIT; some of which are due to configuration or environment > problems. This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open MPI > developer): ompi_dpm_dyn_init() failed --> Returned "Unreachable" (-12) > instead of "Success" (0) > -- > [nos-GF7050VT-M:22526] *** An error occurred in MPI_Init > [nos-GF7050VT-M:22526] *** reported by process [2303918082,1] > [nos-GF7050VT-M:22526] *** on a NULL communicator [nos-GF7050VT-M:22526] > *** Unknown error [nos-GF7050VT-M:22526] *** MPI_ERRORS_ARE_FATAL > (processes in this communicator will now abort, [nos-GF7050VT-M:22526] *** >and potentially your MPI job)* > > > > Note: host "nos-GF7050VT-M" is "worker" > > > > But If I run without "master" in hostfile, the processes are launched but > It hangs: MPI_Init() doesn't returns. 
> > I launched the script (pasted below) in this 2 ways with the same result: > > > > $ ./simple_spawn 2 > > $ mpirun -np 1 ./simple_spawn 2 > > > > The "simple_spawn" script: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > &
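[Editor's note: Howard's workaround above — an external PMIx 3.1.2 plus a libevent shared by both packages — amounts to roughly the following. All install prefixes and source directories here are illustrative, not prescriptive:

    # libevent first, so PMIx and Open MPI can link the same copy
    cd libevent-2.1.8 && ./configure --prefix=/opt/libevent && make install

    # PMIx 3.1.2 -- the version that was embedded in Open MPI 4.0.1
    cd pmix-3.1.2 && ./configure --prefix=/opt/pmix-3.1.2 \
        --with-libevent=/opt/libevent && make install

    # Open MPI 4.0.4 against the external PMIx and the same libevent
    cd openmpi-4.0.4 && ./configure --prefix=/opt/openmpi-4.0.4 \
        --with-pmix=/opt/pmix-3.1.2 \
        --with-libevent=/opt/libevent && make install

Mismatched libevent copies between PMIx and Open MPI are a common source of subtle runtime failures, which is why both --with-libevent flags point at the same prefix.]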
Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.
Hello Martin, Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx version that introduced a problem with spawn for the 4.0.2-4.0.4 versions. This is supposed to be fixed in the 4.0.5 release. Could you try the 4.0.5rc1 tarball and see if that addresses the problem you're seeing? https://www.open-mpi.org/software/ompi/v4.0/ Howard Am Do., 6. Aug. 2020 um 09:50 Uhr schrieb Martín Morales via users < users@lists.open-mpi.org>: > > > Hello people! > > I'm using OMPI 4.0.4 in a very simple scenario. Just 2 machines, one > "master", one "worker" on a Ethernet LAN. Both with Ubuntu 18.04.I builded > OMPI just like this: > > > > ./configure --prefix=/usr/local/openmpi-4.0.4/bin/ > > > > My hostfile is this: > > > > master slots=2 > worker slots=2 > > > > I'm trying to dynamically allocate the processes with MPI_Comm_Spawn(). > > If I launch the processes only on the "master" machine It's ok. But if I > use the hostfile crashes with this: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *-- > At least one pair of MPI processes are unable to reach each other for MPI > communications. This means that no Open MPI device has indicated that it > can be used to communicate between these processes. This is an error; Open > MPI requires that all MPI processes be able to reach each other. This > error can sometimes be the result of forgetting to specify the "self" BTL. > Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M Process 2 > ([[35155,1],0]) is on host: unknown! BTLs attempted: tcp self Your MPI > job is now going to abort; sorry. > -- > [nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file > dpm/dpm.c at line 493 > -- > It looks like MPI_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can fail > during MPI_INIT; some of which are due to configuration or environment > problems. 
> This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_dpm_dyn_init() failed
>   --> Returned "Unreachable" (-12) instead of "Success" (0)
> --------------------------------------------------------------------------
> [nos-GF7050VT-M:22526] *** An error occurred in MPI_Init
> [nos-GF7050VT-M:22526] *** reported by process [2303918082,1]
> [nos-GF7050VT-M:22526] *** on a NULL communicator
> [nos-GF7050VT-M:22526] *** Unknown error
> [nos-GF7050VT-M:22526] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [nos-GF7050VT-M:22526] ***    and potentially your MPI job)
>
> Note: host "nos-GF7050VT-M" is "worker"
>
> But if I run without "master" in the hostfile, the processes are launched but
> it hangs: MPI_Init() doesn't return.
>
> I launched the script (pasted below) in these 2 ways with the same result:
>
> $ ./simple_spawn 2
> $ mpirun -np 1 ./simple_spawn 2
>
> The "simple_spawn" script:
>
> #include "mpi.h"
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char ** argv){
>     int processesToRun;
>     MPI_Comm parentcomm, intercomm;
>     MPI_Info info;
>     int rank, size, hostName_len;
>     char hostName[200];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_get_parent(&parentcomm);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     MPI_Get_processor_name(hostName, &hostName_len);
>
>     if (parentcomm == MPI_COMM_NULL) {
>         if (argc < 2) {
>             printf("Processes number needed!");
>             return 0;
>         }
>         processesToRun = atoi(argv[1]);
>         MPI_Info_create(&info);
>         MPI_Info_set(info, "hostfile", "./hostfile");
>         MPI_Info_set(info, "map_by", "node");
>         MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, processesToRun, info, 0,
>                        MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
>         printf("I'm the parent.\n");
>     } else {
>         printf("I'm the spawned h: %s r/s: %i/%i.\n", hostName, rank, size);
>     }
>     fflush(stdout);
>     MPI_Finalize();
>     return 0;
> }
>
> I came from OMPI 4.0.1. In that version it's working... with some
> inconsistencies, I'm afraid. That's why I decided to upgrade to OMPI 4.0.4.
> I tried several versions with no luck. Is there maybe an intrinsic problem
> with the OMPI dynamic allocation functionality?
>
> Any help will be very appreciated. Best regards.
>
> Martín
Re: [OMPI users] Differences 4.0.3 -> 4.0.4 (Regression?)
Hello Michael,

Not sure what could be causing this in terms of the delta between v4.0.3 and v4.0.4. Two things to try:

- add --debug-daemons and --mca pmix_base_verbose 100 to the mpirun line and compare output from the v4.0.3 and v4.0.4 installs
- perhaps try using the --enable-mpirun-prefix-by-default configure option and reinstall v4.0.4

Howard

On Thu., Aug. 6, 2020 at 04:48, Michael Fuckner via users < users@lists.open-mpi.org> wrote:

> Hi,
>
> I have a small setup with one headnode and two compute nodes connected
> via IB-QDR running CentOS 8.2 and Mellanox OFED 4.9 LTS. I installed
> openmpi 3.0.6, 3.1.6, 4.0.3 and 4.0.4 with identical configuration
> (configure, compile, nothing configured in openmpi-mca-params.conf), the
> output from ompi-info and orte-info looks identical.
>
> There is a small benchmark basically just doing MPI_Send() and
> MPI_Recv(). I can invoke it directly like this (as 4.0.3 and 4.0.4):
>
> /opt/openmpi/4.0.3/gcc/bin/mpirun -np 16 -hostfile HOSTFILE_2x8 -nolocal
> ./OWnetbench.openmpi-4.0.3
>
> When running this job from slurm, it works with 4.0.3, but there is an
> error with 4.0.4. Any hint what to check?
>
> ### running ./OWnetbench/OWnetbench.openmpi-4.0.4 with
> /opt/openmpi/4.0.4/gcc/bin/mpirun ###
> [node002.cluster:04960] MCW rank 0 bound to socket 0[core 7[hwt 0-1]]:
> [../../../../../../../BB]
> [node002.cluster:04963] PMIX ERROR: OUT-OF-RESOURCE in file
> client/pmix_client.c at line 231
> [node002.cluster:04963] OPAL ERROR: Error in file pmix3x_client.c at
> line 112
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node002.cluster:04963] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able
> to guarantee that all other processes were killed!
> -- > Primary job terminated normally, but 1 process returned > a non-zero exit code. Per user-direction, the job has been aborted. > -- > -- > mpirun detected that one or more processes exited with non-zero status, > thus causing > the job to be terminated. The first process to do so was: > >Process name: [[15424,1],0] >Exit code:1 > -- > > Any hint why 4.0.4 behaves not like the other versions? > > -- > DELTA Computer Products GmbH > Röntgenstr. 4 > D-21465 Reinbek bei Hamburg > T: +49 40 300672-30 > F: +49 40 300672-11 > E: michael.fuck...@delta.de > > Internet: https://www.delta.de > Handelsregister Lübeck HRB 3678-RE, Ust.-IdNr.: DE135110550 > Geschäftsführer: Hans-Peter Hellmann >
Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Collin, A couple of things to try. First, could you just configure without using the mellanox platform file and see if you can run the app with 100 or more processes? Another thing to try is to keep using the mellanox platform file, but run the app with mpirun --mca pml ob1 -np 100 bin/xhpcg and see if the app runs successfully. Howard Am Mo., 27. Jan. 2020 um 09:29 Uhr schrieb Collin Strassburger < cstrassbur...@bihrle.com>: > Hello Howard, > > > > To remove potential interactions, I have found that the issue persists > without ucx and hcoll support. > > > > Run command: mpirun -np 128 bin/xhpcg > > Output: > > -- > > mpirun was unable to start the specified application as it encountered an > > error: > > > > Error code: 63 > > Error name: (null) > > Node: Gen2Node4 > > > > when attempting to start process rank 0. > > -- > > 128 total processes failed to start > > > > It returns this error for any process I initialize with >100 processes per > node. I get the same error message for multiple different codes, so the > error code is mpi related rather than being program specific. > > > > Collin > > > > *From:* Howard Pritchard > *Sent:* Monday, January 27, 2020 11:20 AM > *To:* Open MPI Users > *Cc:* Collin Strassburger > *Subject:* Re: [OMPI users] OMPI returns error 63 on AMD 7742 when > utilizing 100+ processors per node > > > > Hello Collen, > > > > Could you provide more information about the error. Is there any output > from either Open MPI or, maybe, UCX, that could provide more information > about the problem you are hitting? > > > > Howard > > > > > > Am Mo., 27. Jan. 2020 um 08:38 Uhr schrieb Collin Strassburger via users < > users@lists.open-mpi.org>: > > Hello, > > > > I am having difficulty with OpenMPI versions 4.0.2 and 3.1.5. Both of > these versions cause the same error (error code 63) when utilizing more > than 100 cores on a single node. The processors I am utilizing are AMD > Epyc “Rome” 7742s. The OS is CentOS 8.1. 
I have tried compiling with both > the default gcc 8 and locally compiled gcc 9. I have already tried > modifying the maximum name field values with no success. > > > > My compile options are: > > ./configure > > --prefix=${HPCX_HOME}/ompi > > --with-platform=contrib/platform/mellanox/optimized > > > > Any assistance would be appreciated, > > Collin > > > > Collin Strassburger > > Bihrle Applied Research Inc. > > > >
Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Hello Collin,

Could you provide more information about the error? Is there any output from either Open MPI or, maybe, UCX, that could provide more information about the problem you are hitting?

Howard

On Mon., Jan. 27, 2020 at 08:38, Collin Strassburger via users < users@lists.open-mpi.org> wrote:

> Hello,
>
> I am having difficulty with OpenMPI versions 4.0.2 and 3.1.5. Both of
> these versions cause the same error (error code 63) when utilizing more
> than 100 cores on a single node. The processors I am utilizing are AMD
> Epyc “Rome” 7742s. The OS is CentOS 8.1. I have tried compiling with both
> the default gcc 8 and locally compiled gcc 9. I have already tried
> modifying the maximum name field values with no success.
>
> My compile options are:
>
> ./configure
> --prefix=${HPCX_HOME}/ompi
> --with-platform=contrib/platform/mellanox/optimized
>
> Any assistance would be appreciated,
>
> Collin
>
> Collin Strassburger
> Bihrle Applied Research Inc.
Re: [OMPI users] Do idle MPI threads consume clock cycles?
Hello Mark,

You may want to check out this package: https://github.com/lanl/libquo

Another option would be to use an MPI_Ibarrier in the application, with all the MPI processes but rank 0 going into a loop that waits for completion of the barrier and sleeps. Once rank 0 has completed the OpenMP work, it would then enter the barrier and wait for completion.

This type of problem may be helped by a future MPI that supports the notion of MPI Sessions. With this approach, you would initialize one MPI session for normal messaging behavior, using polling for fast processing of messages. Your MPI library would use this for its existing messaging. You could initialize a second MPI session to use blocking methods for message receipt. You would use a communicator derived from the second session to do what's described above for the loop with sleep on an Ibarrier.

Good luck,

Howard

On Thu., Feb. 21, 2019 at 11:25, Mark McClure < mark.w.m...@gmail.com> wrote:

> I have the following, rather unusual, scenario...
>
> I have a program running with OpenMP on a multicore computer. At one point
> in the program, I want to use an external package that is written to
> exploit MPI, not OpenMP, parallelism. So a (rather awkward) solution could
> be to launch the program in MPI, but most of the time, everything is being
> done in a single MPI process, which is using OpenMP (ie, run my current
> program in a single MPI process). Then, when I get to the part where I need
> to use the external package, distribute out the information to all the MPI
> processes, run it across all, and then pull them back to the master
> process. This is awkward, but probably better than my current approach,
> which is running the external package on a single processor (ie, not
> exploiting parallelism in this time-consuming part of the code).
>
> If I use this strategy, I fear that the idle MPI processes may be
> consuming clock cycles while I am running the rest of the program on the
> master process with OpenMP. Thus, they may compete with the OpenMP threads.
> OpenMP does not close threads between every pragma, but OMP_WAIT_POLICY can
> be set to sleep idle threads (actually, this is the default behavior). I
> have not been able to find any equivalent documentation regarding the
> behavior of idle threads in MPI.
>
> Best regards,
> Mark
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)
Hello Adam, This helps some. Could you post the first 20 lines of your config.log? This will help in trying to reproduce. The content of your host file (you can use generic names for the nodes if that's an issue to publicize) would also help, as the number of nodes and number of MPI processes/node impacts the way the reduce scatter operation works. One thing to note about the openib BTL - it is on life support. That's why you needed to set btl_openib_allow_ib 1 on the mpirun command line. You may get much better success by installing UCX <https://github.com/openucx/ucx/releases> and rebuilding Open MPI to use UCX. You may actually already have UCX installed on your system if a recent version of MOFED is installed. You can check this by running /usr/bin/ofed_rpm_info. It will show which ucx version has been installed. If UCX is installed, you can add --with-ucx to the Open MPI configuration line and it should build in UCX support. If Open MPI is built with UCX support, it will by default use UCX for message transport rather than the OpenIB BTL. Thanks, Howard Am Mi., 20. Feb.
2019 um 12:49 Uhr schrieb Adam LeBlanc < alebl...@iol.unh.edu>: > On tcp side it doesn't seg fault anymore but will timeout on some tests > but on the openib side it will still seg fault, here is the output: > > [pandora:19256] *** Process received signal *** > [pandora:19256] Signal: Segmentation fault (11) > [pandora:19256] Signal code: Address not mapped (1) > [pandora:19256] Failing at address: 0x7f911c69fff0 > [pandora:19255] *** Process received signal *** > [pandora:19255] Signal: Segmentation fault (11) > [pandora:19255] Signal code: Address not mapped (1) > [pandora:19255] Failing at address: 0x7ff09cd3fff0 > [pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680] > [pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0] > [pandora:19256] [ 2] > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55] > [pandora:19256] [ 3] > /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b] > [pandora:19256] [ 4] [pandora:19255] [ 0] > /usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680] > [pandora:19255] [ 1] > /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7] > [pandora:19256] [ 5] IMB-MPI1[0x40b83b] > [pandora:19256] [ 6] IMB-MPI1[0x407155] > [pandora:19256] [ 7] IMB-MPI1[0x4022ea] > [pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0] > [pandora:19255] [ 2] > /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5] > [pandora:19256] [ 9] IMB-MPI1[0x401d49] > [pandora:19256] *** End of error message *** > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55] > [pandora:19255] [ 3] > /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b] > [pandora:19255] [ 4] > /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7] > [pandora:19255] [ 5] IMB-MPI1[0x40b83b] > [pandora:19255] [ 6] IMB-MPI1[0x407155] > [pandora:19255] [ 7] IMB-MPI1[0x4022ea] > 
[pandora:19255] [ 8] > /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5] > [pandora:19255] [ 9] IMB-MPI1[0x401d49] > [pandora:19255] *** End of error message *** > [phoebe:12418] *** Process received signal *** > [phoebe:12418] Signal: Segmentation fault (11) > [phoebe:12418] Signal code: Address not mapped (1) > [phoebe:12418] Failing at address: 0x7f5ce27dfff0 > [phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680] > [phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0] > [phoebe:12418] [ 2] > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55] > [phoebe:12418] [ 3] > /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b] > [phoebe:12418] [ 4] > /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7] > [phoebe:12418] [ 5] IMB-MPI1[0x40b83b] > [phoebe:12418] [ 6] IMB-MPI1[0x407155] > [phoebe:12418] [ 7] IMB-MPI1[0x4022ea] > [phoebe:12418] [ 8] > /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5] > [phoebe:12418] [ 9] IMB-MPI1[0x401d49] > [phoebe:12418] *** End of error message *** > -- > Primary job terminated normally, but 1 process returned > a non-zero exit code. Per user-direction, the job has been aborted. > -- > -- > mpirun noticed that process rank 0 with PID 0 on node pandora exited on > signal 11 (Segmentation fault). > --
Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)
Hi Adam, As a sanity check, if you try to use --mca btl self,vader,tcp do you still see the segmentation fault? Howard On Wed., Feb. 20, 2019 at 08:50 Adam LeBlanc < alebl...@iol.unh.edu> wrote: > Hello, > > When I do a run with OpenMPI v4.0.0 on Infiniband with this command: > mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca > orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca > btl_openib_allow_ib 1 -np 6 > -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1 > > I get this error: > > # > # Benchmarking Reduce_scatter > # #processes = 4 > # ( 2 additional processes waiting in MPI_Barrier) > # > #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] > 0 1000 0.14 0.15 0.14 > 4 1000 5.00 7.58 6.28 > 8 1000 5.13 7.68 6.41 > 16 1000 5.05 7.74 6.39 > 32 1000 5.43 7.96 6.75 > 64 1000 6.78 8.56 7.69 > 128 1000 7.77 9.55 8.59 > 256 1000 8.28 10.96 9.66 > 512 1000 9.19 12.49 10.85 > 1024 1000 11.78 15.01 13.38 > 2048 1000 17.41 19.51 18.52 > 4096 1000 25.73 28.22 26.89 > 8192 1000 47.75 49.44 48.79 > 16384 1000 81.10 90.15 84.75 > 32768 1000 163.01 178.58 173.19 > 65536 640 315.63 340.51 333.18 > 131072 320 475.48 528.82 510.85 > 262144 160 979.70 1063.81 1035.61 > 524288 80 2070.51 2242.58 2150.15 > 1048576 40 4177.36 4527.25 4431.65 > 2097152 20 8738.08 9340.50 9147.89 > [pandora:04500] *** Process received signal *** > [pandora:04500] Signal: Segmentation fault (11) > [pandora:04500] Signal code: Address not mapped (1) > [pandora:04500] Failing at address: 0x7f310eb0 > [pandora:04499] *** Process received signal *** > [pandora:04499] Signal: Segmentation fault (11) > [pandora:04499] Signal code: Address not mapped (1) > [pandora:04499] Failing at address: 0x7f28b110 > [pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680] > [pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0] > [pandora:04500] [ 2] > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55] > [pandora:04500] [ 3] [pandora:04499] [ 0] > 
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b] > [pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680] > [pandora:04499] [ 1] > /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7] > [pandora:04500] [ 5] IMB-MPI1[0x40b83b] > [pandora:04500] [ 6] IMB-MPI1[0x407155] > [pandora:04500] [ 7] IMB-MPI1[0x4022ea] > [pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0] > [pandora:04499] [ 2] > /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5] > [pandora:04500] [ 9] IMB-MPI1[0x401d49] > [pandora:04500] *** End of error message *** > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55] > [pandora:04499] [ 3] > /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b] > [pandora:04499] [ 4] > /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7] > [pandora:04499] [ 5] IMB-MPI1[0x40b83b] > [pandora:04499] [ 6] IMB-MPI1[0x407155] > [pandora:04499] [ 7] IMB-MPI1[0x4022ea] > [pandora:04499] [ 8] > /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5] > [pandora:04499] [ 9] IMB-MPI1[0x401d49] > [pandora:04499] *** End of error message *** > [phoebe:03779] *** Process received signal *** > [phoebe:03779] Signal: Segmentation fault (11) > [phoebe:03779] Signal code: Address not mapped (1) > [phoebe:03779] Failing at address: 0x7f483d60 > [phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680] > [phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0] > [phoebe:03779] [ 2] > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55] > [phoebe:03779] [ 3] > /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_b
Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX
Hi Matt, Definitely do not include the ucx option for an Omni-Path cluster. Actually, if you accidentally installed UCX in its default location on the system, switch to this config option: --with-ucx=no Otherwise you will hit https://github.com/openucx/ucx/issues/750 Howard Gilles Gouaillardet wrote on Sat., Jan. 19, 2019 at 18:41: > Matt, > > There are two ways of using PMIx > > - if you use mpirun, then the MPI app (e.g. the PMIx client) will talk > to mpirun and orted daemons (e.g. the PMIx server) > - if you use SLURM srun, then the MPI app will directly talk to the > PMIx server provided by SLURM. (note you might have to srun > --mpi=pmix_v2 or something) > > In the former case, it does not matter whether you use the embedded or > external PMIx. > In the latter case, Open MPI and SLURM have to use compatible PMIx > libraries, and you can either check the cross-version compatibility > matrix, > or build Open MPI with the same PMIx used by SLURM to be on the safe > side (not a bad idea IMHO). > > > Regarding the hang, I suggest you try different things > - use mpirun in a SLURM job (e.g. sbatch instead of salloc so mpirun > runs on a compute node rather than on a frontend node) > - try something even simpler such as mpirun hostname (both with sbatch > and salloc) > - explicitly specify the network to be used for the wire-up. you can > for example mpirun --mca oob_tcp_if_include 192.168.0.0/24 if this is > the network subnet by which all the nodes (e.g. compute nodes and > frontend node if you use salloc) communicate. > > > Cheers, > > Gilles > > On Sat, Jan 19, 2019 at 3:31 AM Matt Thompson wrote: > > > > On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users < > users@lists.open-mpi.org> wrote: > >> > >> On Jan 18, 2019, at 12:43 PM, Matt Thompson wrote: > >> > > >> > With some help, I managed to build an Open MPI 4.0.0 with: > >> > >> We can discuss each of these params to let you know what they are. 
> >> > >> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath > >> > >> Did you have a reason for disabling these? They're generally good > things. What they do is add linker flags to the wrapper compilers (i.e., > mpicc and friends) that basically put a default path to find libraries at > run time (that can/will in most cases override LD_LIBRARY_PATH -- but you > can override these linked-in-default-paths if you want/need to). > > > > > > I've had these in my Open MPI builds for a while now. The reason was one > of the libraries I need for the climate model I work on went nuts if both > of them weren't there. It was originally the rpath one but then eventually > (Open MPI 3?) I had to add the runpath one. But I have been updating the > libraries more aggressively recently (due to OS upgrades) so it's possible > this is no longer needed. > > > >> > >> > >> > --with-psm2 > >> > >> Ensure that Open MPI can include support for the PSM2 library, and > abort configure if it cannot. > >> > >> > --with-slurm > >> > >> Ensure that Open MPI can include support for SLURM, and abort configure > if it cannot. > >> > >> > --enable-mpi1-compatibility > >> > >> Add support for MPI_Address and other MPI-1 functions that have since > been deleted from the MPI 3.x specification. > >> > >> > --with-ucx > >> > >> Ensure that Open MPI can include support for UCX, and abort configure > if it cannot. > >> > >> > --with-pmix=/usr/nlocal/pmix/2.1 > >> > >> Tells Open MPI to use the PMIx that is installed at > /usr/nlocal/pmix/2.1 (instead of using the PMIx that is bundled internally > to Open MPI's source code tree/expanded tarball). > >> > >> Unless you have a reason to use the external PMIx, the internal/bundled > PMIx is usually sufficient. > > > > > > Ah. I did not know that. I figured if our SLURM was built linked to a > specific PMIx v2 that I should build Open MPI with the same PMIx. I'll > build an Open MPI 4 without specifying this. 
> > > >> > >> > >> > --with-libevent=/usr > >> > >> Same as previous; change "pmix" to "libevent" (i.e., use the external > libevent instead of the bundled libevent). > >> > >> > CC=icc CXX=icpc FC=ifort > >> > >> Specify the exact compilers to use. > >> > >> > The MPI 1 is because I need to build HDF5 eventually and I added psm2 > because it's an Omnipath cluster. The libevent was prob
Re: [OMPI users] Segmentation fault using openmpi-master-201901030305-ee26ed9
Hi Sigmar, I observed this problem yesterday myself and should have a fix in to master later today. Howard Am Fr., 4. Jan. 2019 um 05:30 Uhr schrieb Siegmar Gross < siegmar.gr...@informatik.hs-fulda.de>: > Hi, > > I've installed (tried to install) openmpi-master-201901030305-ee26ed9 on > my "SUSE Linux Enterprise Server 12.3 (x86_64)" with gcc-7.3.0, > icc-19.0.1.144 > pgcc-18.4-0, and Sun C 5.15 (Oracle Developer Studio 12.6). Unfortunately, > I > still cannot build it with Sun C and I get a segmentation fault for one of > my small programs for the other compilers. > > I get the following error for Sun C that I reported some time ago. > https://www.mail-archive.com/users@lists.open-mpi.org/msg32816.html > > > The program runs as expected if I only use my local machine "loki" and it > breaks if I add a remote machine (even if I only use the remote machine > without "loki"). > > loki hello_1 114 ompi_info | grep -e "Open MPI repo revision" -e"Configure > command line" >Open MPI repo revision: v2.x-dev-6601-gee26ed9 >Configure command line: '--prefix=/usr/local/openmpi-master_64_gcc' > '--libdir=/usr/local/openmpi-master_64_gcc/lib64' > '--with-jdk-bindir=/usr/local/jdk-11/bin' > '--with-jdk-headers=/usr/local/jdk-11/include' > 'JAVA_HOME=/usr/local/jdk-11' > 'LDFLAGS=-m64 -L/usr/local/cuda/lib64' 'CC=gcc' 'CXX=g++' 'FC=gfortran' > 'CFLAGS=-m64 -I/usr/local/cuda/include' 'CXXFLAGS=-m64 > -I/usr/local/cuda/include' 'FCFLAGS=-m64' 'CPP=cpp > -I/usr/local/cuda/include' > 'CXXCPP=cpp -I/usr/local/cuda/include' '--enable-mpi-cxx' > '--enable-cxx-exceptions' '--enable-mpi-java' > '--with-cuda=/usr/local/cuda' > '--with-valgrind=/usr/local/valgrind' '--with-hwloc=internal' > '--without-verbs' > '--with-wrapper-cflags=-std=c11 -m64' '--with-wrapper-cxxflags=-m64' > '--with-wrapper-fcflags=-m64' '--enable-debug' > > > loki hello_1 115 mpiexec -np 4 --host loki:2,nfs2:2 hello_1_mpi > Process 0 of 4 running on loki > Process 1 of 4 running on loki > Process 2 of 4 running on 
nfs2 > Process 3 of 4 running on nfs2 > > Now 3 slave tasks are sending greetings. > > Greetings from task 1: >message type:3 >msg length: 132 characters > ... (complete output of my program) > > [nfs2:01336] *** Process received signal *** > [nfs2:01336] Signal: Segmentation fault (11) > [nfs2:01336] Signal code: Address not mapped (1) > [nfs2:01336] Failing at address: 0x7feea4849268 > [nfs2:01336] [ 0] /lib64/libpthread.so.0(+0x10c10)[0x7feeacbbec10] > [nfs2:01336] [ 1] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(+0x7cd34)[0x7feeadd94d34] > [nfs2:01336] [ 2] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(+0x78673)[0x7feeadd90673] > [nfs2:01336] [ 3] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(+0x7ac2c)[0x7feeadd92c2c] > [nfs2:01336] [ 4] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(opal_finalize_cleanup_domain+0x3e)[0x7feeadd56507] > [nfs2:01336] [ 5] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(opal_finalize_util+0x56)[0x7feeadd56667] > [nfs2:01336] [ 6] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(opal_finalize+0xd3)[0x7feeadd567de] > [nfs2:01336] [ 7] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-rte.so.0(orte_finalize+0x1ba)[0x7feeae09d7ea] > [nfs2:01336] [ 8] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-rte.so.0(orte_daemon+0x3ddd)[0x7feeae0cf55d] > [nfs2:01336] [ 9] orted[0x40086d] > [nfs2:01336] [10] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7feeac829725] > [nfs2:01336] [11] orted[0x400739] > [nfs2:01336] *** End of error message *** > Segmentation fault (core dumped) > loki hello_1 116 > > > I would be grateful, if somebody can fix the problem. Do you need anything > else? Thank you very much for any help in advance. > > > Kind regards > > Siegmar > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] Unable to build Open MPI with external PMIx library support
Hi Eduardo, The config.log looked nominal. Could you try the following additional options to the build with the internal PMIx: --enable-orterun-prefix-by-default --disable-dlopen ? Also, for the mpirun built using the internal PMIx, could you check the output of ldd? And just in case, check if the PMIX_INSTALL_PREFIX is somehow being set? Howard On Mon., Dec. 17, 2018 at 03:29 Eduardo Rothe < eduardo.ro...@yahoo.co.uk> wrote: > Hi Howard, > > Thank you for you reply. I have just re-executed the whole process and > here is the config.log (in attachment to this message)! > > Just for restating, when I use internal PMIx I get the following error > while running mpirun (using Open MPI 4.0.0): > > -- > We were unable to find any usable plugins for the BFROPS framework. This > PMIx > framework requires at least one plugin in order to operate. This can be > caused > by any of the following: > > * we were unable to build any of the plugins due to some combination > of configure directives and available system support > > * no plugin was selected due to some combination of MCA parameter > directives versus built plugins (i.e., you excluded all the plugins > that were built and/or could execute) > > * the PMIX_INSTALL_PREFIX environment variable, or the MCA parameter > "mca_base_component_path", is set and doesn't point to any location > that includes at least one usable plugin for this framework. > > Please check your installation and environment. > ------ > > Regards, > Eduardo > > > On Saturday, 15 December 2018, 18:35:44 CET, Howard Pritchard < > hpprit...@gmail.com> wrote: > > > Hi Eduardo > > Could you post the config.log for the build with internal PMIx so we can > figure that out first. > > Howard > > Eduardo Rothe via users schrieb am Fr. 14. > Dez.
2018 um 09:41: > > Open MPI: 4.0.0 > PMIx: 3.0.2 > OS: Debian 9 > > I'm building a debian package for Open MPI and either I get the following > error messages while configuring: > > undefined reference to symbol 'dlopen@@GLIBC_2.2.5' > undefined reference to symbol 'lt_dlopen' > > when using the configure option: > > ./configure --with-pmix=/usr/lib/x86_64-linux-gnu/pmix > > or otherwise, if I use the following configure options: > > ./configure --with-pmix=external > --with-pmix-libdir=/usr/lib/x86_64-linux-gnu/pmix > > I have a successfull compile, but when running mpirun I get the following > message: > > -- > We were unable to find any usable plugins for the BFROPS framework. This > PMIx > framework requires at least one plugin in order to operate. This can be > caused > by any of the following: > > * we were unable to build any of the plugins due to some combination > of configure directives and available system support > > * no plugin was selected due to some combination of MCA parameter > directives versus built plugins (i.e., you excluded all the plugins > that were built and/or could execute) > > * the PMIX_INSTALL_PREFIX environment variable, or the MCA parameter > "mca_base_component_path", is set and doesn't point to any location > that includes at least one usable plugin for this framework. > > Please check your installation and environment. > -- > > What I find most strange is that I get the same error message (unable to > find > any usable plugins for the BFROPS framework) even if I don't configure > external PMIx support! > > Can someone please hint me about what's going on? > > Cheers! > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] Unable to build Open MPI with external PMIx library support
Hi Eduardo Could you post the config.log for the build with internal PMIx so we can figure that out first. Howard Eduardo Rothe via users schrieb am Fr. 14. Dez. 2018 um 09:41: > Open MPI: 4.0.0 > PMIx: 3.0.2 > OS: Debian 9 > > I'm building a debian package for Open MPI and either I get the following > error messages while configuring: > > undefined reference to symbol 'dlopen@@GLIBC_2.2.5' > undefined reference to symbol 'lt_dlopen' > > when using the configure option: > > ./configure --with-pmix=/usr/lib/x86_64-linux-gnu/pmix > > or otherwise, if I use the following configure options: > > ./configure --with-pmix=external > --with-pmix-libdir=/usr/lib/x86_64-linux-gnu/pmix > > I have a successfull compile, but when running mpirun I get the following > message: > > -- > We were unable to find any usable plugins for the BFROPS framework. This > PMIx > framework requires at least one plugin in order to operate. This can be > caused > by any of the following: > > * we were unable to build any of the plugins due to some combination > of configure directives and available system support > > * no plugin was selected due to some combination of MCA parameter > directives versus built plugins (i.e., you excluded all the plugins > that were built and/or could execute) > > * the PMIX_INSTALL_PREFIX environment variable, or the MCA parameter > "mca_base_component_path", is set and doesn't point to any location > that includes at least one usable plugin for this framework. > > Please check your installation and environment. > -- > > What I find most strange is that I get the same error message (unable to > find > any usable plugins for the BFROPS framework) even if I don't configure > external PMIx support! > > Can someone please hint me about what's going on? > > Cheers! > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] [Open MPI Announce] Open MPI 4.0.0 Released
Hi Bert, If you'd prefer to return to the land of convenience and don't need to mix MPI and OpenSHMEM, then you may want to try the path I outlined in the email archived at the following link https://www.mail-archive.com/users@lists.open-mpi.org/msg32274.html Howard Am Di., 13. Nov. 2018 um 23:10 Uhr schrieb Bert Wesarg via users < users@lists.open-mpi.org>: > Dear Takahiro, > On Wed, Nov 14, 2018 at 5:38 AM Kawashima, Takahiro > wrote: > > > > XPMEM moved to GitLab. > > > > https://gitlab.com/hjelmn/xpmem > > the first words from the README aren't very pleasant to read: > > This is an experimental version of XPMEM based on a version provided by > Cray and uploaded to https://code.google.com/p/xpmem. This version > supports > any kernel 3.12 and newer. *Keep in mind there may be bugs and this version > may cause kernel panics, code crashes, eat your cat, etc.* > > Installing this on my laptop where I just want developing with SHMEM > it would be a pitty to lose work just because of that. > > Best, > Bert > > > > > Thanks, > > Takahiro Kawashima, > > Fujitsu > > > > > Hello Bert, > > > > > > What OS are you running on your notebook? > > > > > > If you are running Linux, and you have root access to your system, > then > > > you should be able to resolve the Open SHMEM support issue by > installing > > > the XPMEM device driver on your system, and rebuilding UCX so it picks > > > up XPMEM support. > > > > > > The source code is on GitHub: > > > > > > https://github.com/hjelmn/xpmem > > > > > > Some instructions on how to build the xpmem device driver are at > > > > > > https://github.com/hjelmn/xpmem/wiki/Installing-XPMEM > > > > > > You will need to install the kernel source and symbols rpms on your > > > system before building the xpmem device driver. > > > > > > Hope this helps, > > > > > > Howard > > > > > > > > > Am Di., 13. Nov. 
2018 um 15:00 Uhr schrieb Bert Wesarg via users < > > > users@lists.open-mpi.org>: > > > > > > > Hi, > > > > > > > > On Mon, Nov 12, 2018 at 10:49 PM Pritchard Jr., Howard via announce > > > > wrote: > > > > > > > > > > The Open MPI Team, representing a consortium of research, > academic, and > > > > > industry partners, is pleased to announce the release of Open MPI > version > > > > > 4.0.0. > > > > > > > > > > v4.0.0 is the start of a new release series for Open MPI. > Starting with > > > > > this release, the OpenIB BTL supports only iWarp and RoCE by > default. > > > > > Starting with this release, UCX is the preferred transport > protocol > > > > > for Infiniband interconnects. The embedded PMIx runtime has been > updated > > > > > to 3.0.2. The embedded Romio has been updated to 3.2.1. This > > > > > release is ABI compatible with the 3.x release streams. There have > been > > > > numerous > > > > > other bug fixes and performance improvements. > > > > > > > > > > Note that starting with Open MPI v4.0.0, prototypes for several > > > > > MPI-1 symbols that were deleted in the MPI-3.0 specification > > > > > (which was published in 2012) are no longer available by default in > > > > > mpi.h. See the README for further details. > > > > > > > > > > Version 4.0.0 can be downloaded from the main Open MPI web site: > > > > > > > > > > https://www.open-mpi.org/software/ompi/v4.0/ > > > > > > > > > > > > > > > 4.0.0 -- September, 2018 > > > > > > > > > > > > > > > - OSHMEM updated to the OpenSHMEM 1.4 API. > > > > > - Do not build OpenSHMEM layer when there are no SPMLs available. > > > > > Currently, this means the OpenSHMEM layer will only build if > > > > > a MXM or UCX library is found. > > > > > > > > so what is the most convenience way to get SHMEM working on a single > > > > shared memory node (aka. notebook)? I just realized that I don't have > > > > a SHMEM since Open MPI 3.0. But building with UCX does not help > > > > either. 
I tried with UCX 1.4 but Open MPI SHMEM > > > > still does not work: > > > > > > > > $ oshcc -o shmem_hello_world-
Re: [OMPI users] [Open MPI Announce] Open MPI 4.0.0 Released
Hello Bert, What OS are you running on your notebook? If you are running Linux, and you have root access to your system, then you should be able to resolve the Open SHMEM support issue by installing the XPMEM device driver on your system, and rebuilding UCX so it picks up XPMEM support. The source code is on GitHub: https://github.com/hjelmn/xpmem Some instructions on how to build the xpmem device driver are at https://github.com/hjelmn/xpmem/wiki/Installing-XPMEM You will need to install the kernel source and symbols rpms on your system before building the xpmem device driver. Hope this helps, Howard Am Di., 13. Nov. 2018 um 15:00 Uhr schrieb Bert Wesarg via users < users@lists.open-mpi.org>: > Hi, > > On Mon, Nov 12, 2018 at 10:49 PM Pritchard Jr., Howard via announce > wrote: > > > > The Open MPI Team, representing a consortium of research, academic, and > > industry partners, is pleased to announce the release of Open MPI version > > 4.0.0. > > > > v4.0.0 is the start of a new release series for Open MPI. Starting with > > this release, the OpenIB BTL supports only iWarp and RoCE by default. > > Starting with this release, UCX is the preferred transport protocol > > for Infiniband interconnects. The embedded PMIx runtime has been updated > > to 3.0.2. The embedded Romio has been updated to 3.2.1. This > > release is ABI compatible with the 3.x release streams. There have been > numerous > > other bug fixes and performance improvements. > > > > Note that starting with Open MPI v4.0.0, prototypes for several > > MPI-1 symbols that were deleted in the MPI-3.0 specification > > (which was published in 2012) are no longer available by default in > > mpi.h. See the README for further details. > > > > Version 4.0.0 can be downloaded from the main Open MPI web site: > > > > https://www.open-mpi.org/software/ompi/v4.0/ > > > > > > 4.0.0 -- September, 2018 > > > > > > - OSHMEM updated to the OpenSHMEM 1.4 API. 
> > - Do not build OpenSHMEM layer when there are no SPMLs available. > > Currently, this means the OpenSHMEM layer will only build if > > a MXM or UCX library is found. > > so what is the most convenience way to get SHMEM working on a single > shared memory node (aka. notebook)? I just realized that I don't have > a SHMEM since Open MPI 3.0. But building with UCX does not help > either. I tried with UCX 1.4 but Open MPI SHMEM > still does not work: > > $ oshcc -o shmem_hello_world-4.0.0 openmpi-4.0.0/examples/hello_oshmem_c.c > $ oshrun -np 2 ./shmem_hello_world-4.0.0 > [1542109710.217344] [tudtug:27715:0] select.c:406 UCX ERROR > no remote registered memory access transport to tudtug:27716: > self/self - Destination is unreachable, tcp/enp0s31f6 - no put short, > tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable, > mm/posix - Destination is unreachable, cma/cma - no put short > [1542109710.217344] [tudtug:27716:0] select.c:406 UCX ERROR > no remote registered memory access transport to tudtug:27715: > self/self - Destination is unreachable, tcp/enp0s31f6 - no put short, > tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable, > mm/posix - Destination is unreachable, cma/cma - no put short > [tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266 > Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable > [tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305 > Error: add procs FAILED rc=-2 > [tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266 > Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable > [tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305 > Error: add procs FAILED rc=-2 > -- > It looks like SHMEM_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can > fail during SHMEM_INIT; some of which are due to configuration or > environment > problems. 
This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open SHMEM > developer): > > SPML add procs failed > --> Returned "Out of resource" (-2) instead of "Success" (0) > -- > [tudtug:27715] Error: pshmem_init.c:80 - _shmem_init() SHMEM failed to > initialize - aborting > [tudtug:27716] Error: pshmem_init.c:80 - _shmem_init() SHMEM failed to > initialize - aborting > -- > SHMEM_ABORT was invo
Re: [OMPI users] [EXTERNAL] Re: OpenMPI 3.1.0 Lock Up on POWER9 w/ CUDA9.2
Hi Si, Could you add --disable-builtin-atomics to the configure options and see if the hang goes away? Howard 2018-07-02 8:48 GMT-06:00 Jeff Squyres (jsquyres) via users < users@lists.open-mpi.org>: > Simon -- > > You don't currently have another Open MPI installation in your PATH / > LD_LIBRARY_PATH, do you? > > I have seen dependency library loads cause "make check" to get confused, > and instead of loading the libraries from the build tree, actually load > some -- but not all -- of the required OMPI/ORTE/OPAL/etc. libraries from > an installation tree. Hilarity ensues (to include symptoms such as running > forever). > > Can you double check that you have no Open MPI libraries in your > LD_LIBRARY_PATH before running "make check" on the build tree? > > > > > On Jun 30, 2018, at 3:18 PM, Hammond, Simon David via users < > users@lists.open-mpi.org> wrote: > > > > Nathan, > > > > Same issue with OpenMPI 3.1.1 on POWER9 with GCC 7.2.0 and CUDA9.2. > > > > S. > > > > -- > > Si Hammond > > Scalable Computer Architectures > > Sandia National Laboratories, NM, USA > > [Sent from remote connection, excuse typos] > > > > > > On 6/16/18, 10:10 PM, "Nathan Hjelm" wrote: > > > >Try the latest nightly tarball for v3.1.x. Should be fixed. > > > >> On Jun 16, 2018, at 5:48 PM, Hammond, Simon David via users < > users@lists.open-mpi.org> wrote: > >> > >> The output from the test in question is: > >> > >> Single thread test. Time: 0 s 10182 us 10 nsec/poppush > >> Atomics thread finished. Time: 0 s 169028 us 169 nsec/poppush > >> > >> > >> S. > >> > >> -- > >> Si Hammond > >> Scalable Computer Architectures > >> Sandia National Laboratories, NM, USA > >> [Sent from remote connection, excuse typos] > >> > >> > >> On 6/16/18, 5:45 PM, "Hammond, Simon David" wrote: > >> > >> Hi OpenMPI Team, > >> > >> We have recently updated an install of OpenMPI on POWER9 system > (configuration details below). We migrated from OpenMPI 2.1 to OpenMPI 3.1. 
> We seem to have a symptom where code than ran before is now locking up and > making no progress, getting stuck in wait-all operations. While I think > it's prudent for us to root cause this a little more, I have gone back and > rebuilt MPI and re-run the "make check" tests. The opal_fifo test appears > to hang forever. I am not sure if this is the cause of our issue but wanted > to report that we are seeing this on our system. > >> > >> OpenMPI 3.1.0 Configuration: > >> > >> ./configure --prefix=/home/projects/ppc64le-pwr9-nvidia/openmpi/3. > 1.0-nomxm/gcc/7.2.0/cuda/9.2.88 --with-cuda=$CUDA_ROOT --enable-mpi-java > --enable-java --with-lsf=/opt/lsf/10.1 --with-lsf-libdir=/opt/lsf/10. > 1/linux3.10-glibc2.17-ppc64le/lib --with-verbs > >> > >> GCC versions are 7.2.0, built by our team. CUDA is 9.2.88 from NVIDIA > for POWER9 (standard download from their website). We enable IBM's JDK > 8.0.0. > >> RedHat: Red Hat Enterprise Linux Server release 7.5 (Maipo) > >> > >> Output: > >> > >> make[3]: Entering directory `/home/sdhammo/openmpi/ > openmpi-3.1.0/test/class' > >> make[4]: Entering directory `/home/sdhammo/openmpi/ > openmpi-3.1.0/test/class' > >> PASS: ompi_rb_tree > >> PASS: opal_bitmap > >> PASS: opal_hash_table > >> PASS: opal_proc_table > >> PASS: opal_tree > >> PASS: opal_list > >> PASS: opal_value_array > >> PASS: opal_pointer_array > >> PASS: opal_lifo > >> > >> > >> Output from Top: > >> > >> 20 0 73280 4224 2560 S 800.0 0.0 17:22.94 lt-opal_fifo > >> > >> -- > >> Si Hammond > >> Scalable Computer Architectures > >> Sandia National Laboratories, NM, USA > >> [Sent from remote connection, excuse typos] > >> > >> > >> > >> > >> ___ > >> users mailing list > >> users@lists.open-mpi.org > >> https://lists.open-mpi.org/mailman/listinfo/users > > > > > > ___ > > users mailing list > > users@lists.open-mpi.org > > https://lists.open-mpi.org/mailman/listinfo/users > > > -- > Jeff Squyres > jsquy...@cisco.com > > ___ > users mailing list > 
users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
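Howard's suggestion and Jeff's LD_LIBRARY_PATH check can be combined into a short checklist. This is a sketch only; the prefix and CUDA path are placeholders, not the poster's actual paths:

```shell
# 1. Make sure no previously installed Open MPI libraries can shadow
#    the build tree while "make check" runs (should print nothing).
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i openmpi

# 2. Reconfigure without the compiler's built-in atomics and re-run
#    the test that hung (paths are placeholders).
./configure --prefix=$HOME/opt/openmpi-3.1.0 \
    --disable-builtin-atomics --with-cuda=$CUDA_ROOT
make -j all
cd test/class && make check   # watch whether opal_fifo still spins
```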
Re: [OMPI users] A couple of general questions
Hello Charles You are heading in the right direction. First you might want to run the libfabric fi_info command to see what capabilities you picked up from the libfabric RPMs. Next you may well not actually be using the OFI mtl. Could you run your app with export OMPI_MCA_mtl_base_verbose=100 and post the output? It would also help if you described the system you are using : OS interconnect cpu type etc. Howard Charles A Taylor schrieb am Do. 14. Juni 2018 um 06:36: > Because of the issues we are having with OpenMPI and the openib BTL > (questions previously asked), I’ve been looking into what other transports > are available. I was particularly interested in OFI/libfabric support but > cannot find any information on it more recent than a reference to the usNIC > BTL from 2015 (Jeff Squyres, Cisco). Unfortunately, the openmpi-org > website FAQ’s covering OpenFabrics support don’t mention anything beyond > OpenMPI 1.8. Given that 3.1 is the current stable version, that seems odd. > > That being the case, I thought I’d ask here. After laying down the > libfabric-devel RPM and building (3.1.0) with —with-libfabric=/usr, I end > up with an “ofi” MTL but nothing else. I can run with OMPI_MCA_mtl=ofi > and OMPI_MCA_btl=“self,vader,openib” but it eventually crashes in > libopen-pal.so. (mpi_waitall() higher up the stack). > > GIZMO:9185 terminated with signal 11 at PC=2b4d4b68a91d SP=7ffcfbde9ff0. > Backtrace: > > /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(+0x9391d)[0x2b4d4b68a91d] > > /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(opal_progress+0x24)[0x2b4d4b632754] > > /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(ompi_request_default_wait_all+0x11f)[0x2b4d47be2a6f] > > /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(PMPI_Waitall+0xbd)[0x2b4d47c2ce4d] > > Questions: Am I using the OFI MTL as intended? Should there be an “ofi” > BTL? Does anyone use this? 
> > Thanks, > > Charlie Taylor > UF Research Computing > > PS - If you could use some help updating the FAQs, I’d be willing to put > in some time. I’d probably learn a lot. > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
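The two diagnostics Howard asks for can be run as follows. A sketch; the application name is a placeholder:

```shell
# List the libfabric providers picked up from the libfabric RPMs.
fi_info -l

# Re-run the application with the OFI MTL selected and verbose
# MTL component output, then post the result to the list.
export OMPI_MCA_mtl=ofi
export OMPI_MCA_mtl_base_verbose=100
mpirun -np 2 ./your_app   # placeholder application
```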
Re: [OMPI users] Problem running with UCX/oshmem on single node?
Hi Craig, You are experiencing problems because you don't have a transport installed that UCX can use for oshmem. You either need to go and buy a ConnectX-4/5 HCA from Mellanox (and maybe a switch) and install that on your system, or else install xpmem (https://github.com/hjelmn/xpmem). Note there is a bug right now in UCX that you may hit if you try to go the xpmem-only route: https://github.com/open-mpi/ompi/issues/5083 and https://github.com/openucx/ucx/issues/2588 If you are just running on a single node and want to experiment with the OpenSHMEM programming model, and do not have Mellanox mlx5 equipment installed on the node, you are much better off trying to use SOS over OFI libfabric: https://github.com/Sandia-OpenSHMEM/SOS https://github.com/ofiwg/libfabric/releases For SOS you will need to install the hydra launcher as well: http://www.mpich.org/downloads/ I really wish Google would do a better job of surfacing my responses about this type of problem. I seem to respond every couple of months to this exact problem on this mail list. 
Howard 2018-05-09 13:11 GMT-06:00 Craig Reese <cfre...@super.org>: > > I'm trying to play with oshmem on a single node (just to have a way to do > some simple > experimentation and playing around) and having spectacular problems: > > CentOS 6.9 (gcc 4.4.7) > built and installed ucx 1.3.0 > built and installed openmpi-3.1.0 > > [cfreese]$ cat oshmem.c > > #include > int > main() { > shmem_init(); > } > > [cfreese]$ mpicc oshmem.c -loshmem > > [cfreese]$ shmemrun -np 2 ./a.out > > [ucs1l:30118] mca: base: components_register: registering framework spml > components > [ucs1l:30118] mca: base: components_register: found loaded component ucx > [ucs1l:30119] mca: base: components_register: registering framework spml > components > [ucs1l:30119] mca: base: components_register: found loaded component ucx > [ucs1l:30119] mca: base: components_register: component ucx register > function successful > [ucs1l:30118] mca: base: components_register: component ucx register > function successful > [ucs1l:30119] mca: base: components_open: opening spml components > [ucs1l:30119] mca: base: components_open: found loaded component ucx > [ucs1l:30118] mca: base: components_open: opening spml components > [ucs1l:30118] mca: base: components_open: found loaded component ucx > [ucs1l:30119] mca: base: components_open: component ucx open function > successful > [ucs1l:30118] mca: base: components_open: component ucx open function > successful > [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 - > mca_spml_base_select() select: initializing spml component ucx > [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173 > - mca_spml_ucx_component_init() in ucx, my priority is 21 > [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 - > mca_spml_base_select() select: initializing spml component ucx > [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173 > - mca_spml_ucx_component_init() in ucx, my priority is 21 > 
[ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184 > - mca_spml_ucx_component_init() *** ucx initialized > [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 - > mca_spml_base_select() select: init returned priority 21 > [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 - > mca_spml_base_select() selected ucx best priority 21 > [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 - > mca_spml_base_select() select: component ucx selected > [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 - > mca_spml_ucx_enable() *** ucx ENABLED > [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184 > - mca_spml_ucx_component_init() *** ucx initialized > [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 - > mca_spml_base_select() select: init returned priority 21 > [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 - > mca_spml_base_select() selected ucx best priority 21 > [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 - > mca_spml_base_select() select: component ucx selected > [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 - > mca_spml_ucx_enable() *** ucx ENABLED > > here's where I think the real issue is > > [1525891910.424102] [ucs1l:30119:0] select.c:316 UCX ERROR no > remote registered memory access transport to : mm/posix - > Destination is unreachable, mm/sysv - Destination is unreachable, tcp/eth0 > - no put short, self/self - Destination is unreachable > [1525891910.424104] [ucs1l:30118:0] select.c:316 UCX ERROR no > remote registered memory ac
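Before buying hardware or installing xpmem, you can check which transports the installed UCX build actually found; if only mm/posix, mm/sysv, tcp, and self show up, the spml/ucx component will fail exactly as in the log above. A sketch, assuming `ucx_info` is on PATH and the xpmem install prefix is a placeholder:

```shell
# Show the devices and transports UCX detected on this node.
ucx_info -d | grep -i transport

# After installing the xpmem driver and loading xpmem.ko, rebuild
# UCX against it so the xpmem transport becomes available.
./configure --prefix=$HOME/opt/ucx --with-xpmem=/opt/xpmem
make -j install
```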
Re: [OMPI users] Debug build of v3.0.1 tarball
Hi Adam, I think you'll have better luck setting the CFLAGS on the configure line. Try ./configure CFLAGS="-g -O0" <your other configure options>. Howard 2018-05-04 12:09 GMT-06:00 Moody, Adam T. <mood...@llnl.gov>: > Hi Howard, > > I do have a make clean after the configure. To be extra safe, I’m now > also deleting the source directory and untarring for each build to make > sure I have a clean starting point. > > > > I do get a successful build if I add --enable-debug to configure and then > do a simple make that has no CFLAGS or LDFLAGS: > > > > make -j VERBOSE=1 > > > > So that’s good. However, looking at the compile lines that were used, I > see a -g but no -O0. I’m trying to force the -g -O0, because our debuggers > show the best info at that optimization level. > > > > If I then also add a CFLAGS=”-g -O0” to my make command, I see the “-g > -O0” in the compile lines, but then the pthread link error shows up: > > > > make -j CFLAGS=”-g -O0” VERBOSE=1 > > > > CC opal_wrapper.o > > GENERATE opal_wrapper.1 > > CCLD opal_wrapper > > ../../../opal/.libs/libopen-pal.so: undefined reference to > `pthread_atfork' > > collect2: error: ld returned 1 exit status > > make[2]: *** [opal_wrapper] Error 1 > > > > Also setting LDFLAGS fixes that up. Just wondering whether I’m going > about it the right way in trying to get -g -O0 in the build. > > > > Thanks for your help, > > -Adam > > > > *From: *users <users-boun...@lists.open-mpi.org> on behalf of Howard > Pritchard <hpprit...@gmail.com> > *Reply-To: *Open MPI Users <users@lists.open-mpi.org> > *Date: *Friday, May 4, 2018 at 7:46 AM > *To: *Open MPI Users <users@lists.open-mpi.org> > *Subject: *Re: [OMPI users] Debug build of v3.0.1 tarball > > > > HI Adam, > > > > Sorry didn't notice you did try the --enable-debug flag. That should not > have > > led to the link error building the opal dso. Did you do a make clean after > > rerunning configure? 
> > > > Howard > > > > > > 2018-05-04 8:22 GMT-06:00 Howard Pritchard <hpprit...@gmail.com>: > > Hi Adam, > > > > Did you try using the --enable-debug configure option along with your > CFLAGS options? > > You may want to see if that simplifies your build. > > > > In any case, we'll fix the problems you found. > > > > Howard > > > > > > 2018-05-03 15:00 GMT-06:00 Moody, Adam T. <mood...@llnl.gov>: > > Hello Open MPI team, > > I'm looking for the recommended way to produce a debug build of Open MPI > v3.0.1 that compiles with “-g -O0” so that I get accurate debug info under > a debugger. > > So far, I've gone through the following sequence. I started with > CFLAGS="-g -O0" on make: > > shell$ ./configure --prefix=$installdir --disable-silent-rules \ > > --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi > > shell$ make -j CFLAGS="-g -O0" VERBOSE=1 > > That led to the following error: > > In file included from ../../../../opal/util/arch.h:26:0, > > from btl_openib.h:43, > > from btl_openib_component.c:79: > > btl_openib_component.c: In function 'progress_pending_frags_wqe': > > btl_openib_component.c:3351:29: error: 'opal_list_item_t' has no member named > 'opal_list_item_refcount' > > assert(0 == frag->opal_list_item_refcount); > > ^ > > make[2]: *** [btl_openib_component.lo] Error 1 > > make[2]: *** Waiting for unfinished jobs > > make[2]: Leaving directory `.../openmpi-3.0.1/opal/mca/btl/openib' > > So it seems the assert is referring to a field structure that is protected > by a debug flag. 
I then added --enable-debug to configure, which led to: > > make[2]: Entering directory `.../openmpi-3.0.1/opal/tools/wrappers' > > CC opal_wrapper.o > > GENERATE opal_wrapper.1 > > CCLD opal_wrapper > > ../../../opal/.libs/libopen-pal.so: undefined reference to `pthread_atfork' > > collect2: error: ld returned 1 exit status > > make[2]: *** [opal_wrapper] Error 1 > > make[2]: Leaving directory `.../openmpi-3.0.1/opal/tools/wrappers' > > Finally, if I also add LDFLAGS="-lpthread" to make, I get a build: > > shell$ ./configure --prefix=$installdir --enable-debug --disable-silent-rules > \ > > --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi > > shell$ make -j CFLAGS="-g -O0" LDFLAGS="-lpthread" VERBOSE=1 > > Am I doing this correct
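Putting the advice above together: the debug flags go on the configure line rather than the make line, so libtool records them consistently and the pthread link line is generated correctly. A sketch using the configure options from the thread; $installdir is a placeholder:

```shell
# Pass the debug flags to configure, not to make, so every
# sub-library (including libopen-pal) is built the same way.
./configure --prefix=$installdir --enable-debug CFLAGS="-g -O0" \
    --disable-silent-rules --disable-new-dtags \
    --enable-mpi-cxx --enable-cxx-exceptions --with-pmi
make -j VERBOSE=1   # no CFLAGS/LDFLAGS override needed here
make install
```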
Re: [OMPI users] Debug build of v3.0.1 tarball
HI Adam, Sorry didn't notice you did try the --enable-debug flag. That should not have led to the link error building the opal dso. Did you do a make clean after rerunning configure? Howard 2018-05-04 8:22 GMT-06:00 Howard Pritchard <hpprit...@gmail.com>: > Hi Adam, > > Did you try using the --enable-debug configure option along with your > CFLAGS options? > You may want to see if that simplifies your build. > > In any case, we'll fix the problems you found. > > Howard > > > 2018-05-03 15:00 GMT-06:00 Moody, Adam T. <mood...@llnl.gov>: > >> Hello Open MPI team, >> >> I'm looking for the recommended way to produce a debug build of Open MPI >> v3.0.1 that compiles with “-g -O0” so that I get accurate debug info under >> a debugger. >> >> So far, I've gone through the following sequence. I started with >> CFLAGS="-g -O0" on make: >> >> shell$ ./configure --prefix=$installdir --disable-silent-rules \ >> >> --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi >> >> shell$ make -j CFLAGS="-g -O0" VERBOSE=1 >> >> That led to the following error: >> >> In file included from ../../../../opal/util/arch.h:26:0, >> >> from btl_openib.h:43, >> >> from btl_openib_component.c:79: >> >> btl_openib_component.c: In function 'progress_pending_frags_wqe': >> >> btl_openib_component.c:3351:29: error: 'opal_list_item_t' has no member >> named 'opal_list_item_refcount' >> >> assert(0 == frag->opal_list_item_refcount); >> >> ^ >> >> make[2]: *** [btl_openib_component.lo] Error 1 >> >> make[2]: *** Waiting for unfinished jobs >> >> make[2]: Leaving directory `.../openmpi-3.0.1/opal/mca/btl/openib' >> >> So it seems the assert is referring to a field structure that is >> protected by a debug flag. 
I then added --enable-debug to configure, which >> led to: >> >> make[2]: Entering directory `.../openmpi-3.0.1/opal/tools/wrappers' >> >> CC opal_wrapper.o >> >> GENERATE opal_wrapper.1 >> >> CCLD opal_wrapper >> >> ../../../opal/.libs/libopen-pal.so: undefined reference to `pthread_atfork' >> >> collect2: error: ld returned 1 exit status >> >> make[2]: *** [opal_wrapper] Error 1 >> >> make[2]: Leaving directory `.../openmpi-3.0.1/opal/tools/wrappers' >> >> Finally, if I also add LDFLAGS="-lpthread" to make, I get a build: >> >> shell$ ./configure --prefix=$installdir --enable-debug >> --disable-silent-rules \ >> >> --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi >> >> shell$ make -j CFLAGS="-g -O0" LDFLAGS="-lpthread" VERBOSE=1 >> >> Am I doing this correctly? >> >> Is there a pointer to the configure/make flags for this? >> >> I did find this page that describes the developer build from a git clone, >> but that seemed a bit overkill since I am looking for a debug build from >> the distribution tarball instead of the git clone (avoid the autotools >> nightmare): >> >> https://www.open-mpi.org/source/building.php >> >> Thanks. >> >> -Adam >> >> ___ >> users mailing list >> users@lists.open-mpi.org >> https://lists.open-mpi.org/mailman/listinfo/users >> > > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] Debug build of v3.0.1 tarball
Hi Adam, Did you try using the --enable-debug configure option along with your CFLAGS options? You may want to see if that simplifies your build. In any case, we'll fix the problems you found. Howard 2018-05-03 15:00 GMT-06:00 Moody, Adam T. <mood...@llnl.gov>: > Hello Open MPI team, > > I'm looking for the recommended way to produce a debug build of Open MPI > v3.0.1 that compiles with “-g -O0” so that I get accurate debug info under > a debugger. > > So far, I've gone through the following sequence. I started with > CFLAGS="-g -O0" on make: > > shell$ ./configure --prefix=$installdir --disable-silent-rules \ > > --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi > > shell$ make -j CFLAGS="-g -O0" VERBOSE=1 > > That led to the following error: > > In file included from ../../../../opal/util/arch.h:26:0, > > from btl_openib.h:43, > > from btl_openib_component.c:79: > > btl_openib_component.c: In function 'progress_pending_frags_wqe': > > btl_openib_component.c:3351:29: error: 'opal_list_item_t' has no member named > 'opal_list_item_refcount' > > assert(0 == frag->opal_list_item_refcount); > > ^ > > make[2]: *** [btl_openib_component.lo] Error 1 > > make[2]: *** Waiting for unfinished jobs > > make[2]: Leaving directory `.../openmpi-3.0.1/opal/mca/btl/openib' > > So it seems the assert is referring to a field structure that is protected > by a debug flag. 
I then added --enable-debug to configure, which led to: > > make[2]: Entering directory `.../openmpi-3.0.1/opal/tools/wrappers' > > CC opal_wrapper.o > > GENERATE opal_wrapper.1 > > CCLD opal_wrapper > > ../../../opal/.libs/libopen-pal.so: undefined reference to `pthread_atfork' > > collect2: error: ld returned 1 exit status > > make[2]: *** [opal_wrapper] Error 1 > > make[2]: Leaving directory `.../openmpi-3.0.1/opal/tools/wrappers' > > Finally, if I also add LDFLAGS="-lpthread" to make, I get a build: > > shell$ ./configure --prefix=$installdir --enable-debug --disable-silent-rules > \ > > --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi > > shell$ make -j CFLAGS="-g -O0" LDFLAGS="-lpthread" VERBOSE=1 > > Am I doing this correctly? > > Is there a pointer to the configure/make flags for this? > > I did find this page that describes the developer build from a git clone, > but that seemed a bit overkill since I am looking for a debug build from > the distribution tarball instead of the git clone (avoid the autotools > nightmare): > > https://www.open-mpi.org/source/building.php > > Thanks. > > -Adam > > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] Eager RDMA causing slow osu_bibw with 3.0.0
Hello Ben, Thanks for the info. You would probably be better off installing UCX on your cluster and rebuilding your Open MPI with the --with-ucx configure option. Here's what I'm seeing with Open MPI 3.0.1 on a ConnectX-5 based cluster using the ob1/openib BTL:

mpirun -map-by ppr:1:node -np 2 ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v5.1
# Size      Bandwidth (MB/s)
1                       0.00
2                       0.00
4                       0.01
8                       0.02
16                      0.04
32                      0.07
64                      0.13
128                   273.64
256                   485.04
512                   869.51
1024                 1434.99
2048                 2208.12
4096                 3055.67
8192                 3896.93
16384                  89.29
32768                 252.59
65536                 614.42
131072              22878.74
262144              23846.93
524288              24256.23
1048576             24498.27
2097152             24615.64
4194304             24632.58

export OMPI_MCA_pml=ucx
# OSU MPI Bi-Directional Bandwidth Test v5.1
# Size      Bandwidth (MB/s)
1                       4.57
2                       8.95
4                      17.67
8                      35.99
16                     71.99
32                    141.56
64                    208.86
128                   410.32
256                   495.56
512                  1455.98
1024                 2414.78
2048                 3008.19
4096                 5351.62
8192                 5563.66
16384                5945.16
32768                6061.33
65536               21376.89
131072              23462.99
262144              24064.56
524288              24366.84
1048576             24550.75
2097152             24649.03
4194304             24693.77

You can get UCX off of GitHub: https://github.com/openucx/ucx/releases There is also a pre-release version of UCX (1.3.0RCX?) packaged as an RPM available in MOFED 4.3. See http://www.mellanox.com/page/products_dyn?product_family=26=linux_sw_drivers I was using UCX 1.2.2 for the results above. Good luck, Howard 2018-04-05 1:12 GMT-06:00 Ben Menadue <ben.mena...@nci.org.au>: > Hi, > > Another interesting point. I noticed that the last two message sizes > tested (2MB and 4MB) are lower than expected for both osu_bw and osu_bibw. 
> Increasing the minimum size to use the RDMA pipeline to above these sizes > brings those two data-points up to scratch for both benchmarks: > > *3.0.0, osu_bw, no rdma for large messages* > > > mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -map-by > ppr:1:node -np 2 -H r6,r7 ./osu_bw -m 2097152:4194304 > # OSU MPI Bi-Directional Bandwidth Test v5.4.0 > # Size Bandwidth (MB/s) > 2097152 6133.22 > 4194304 6054.06 > > *3.0.0, osu_bibw, eager rdma disabled, no rdma for large messages* > > > mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -mca > btl_openib_use_eager_rdma 0 -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw -m > 2097152:4194304 > # OSU MPI Bi-Directional Bandwidth Test v5.4.0 > # Size Bandwidth (MB/s) > 2097152 11397.85 > 4194304 11389.64 > > This makes me think something odd is going on in the RDMA pipeline. > > Cheers, > Ben > > > > On 5 Apr 2018, at 5:03 pm, Ben Menadue <ben.mena...@nci.org.au> wrote: > > Hi, > > We’ve just been running some OSU benchmarks with OpenMPI 3.0.0 and noticed > that *osu_bibw* gives nowhere near the bandwidth I’d expect (this is on > FDR IB). However, *osu_bw* is fine. > > If I disable eager RDMA, then *osu_bibw* gives the expected > numbers. Similarly, if I increase the number of eager RDMA buffers, it > gives the expected results. > > OpenMPI 1.10.7 gives consistent, reasonable numbers with default settings, > but they’re not as good as 3.0.0 (when tuned) for large buffers. The same > option changes produce no different in the performance for 1.10.7. > > I was wondering if anyone else has noticed anything similar, and if this > is unexpected, if anyone has a suggestion on how to investigate further? 
> > Thanks, > Ben > > > Here’s are the numbers: > > *3.0.0, osu_bw, default settings* > > > mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bw > # OSU MPI Bandwidth Test v5.4.0 > # Size Bandwidth (MB/s) > 1 1.13 > 2 2.29 > 4 4.63 > 8 9.21 > 16 18.18 > 32 36.46 > 64 69.95 > 128 128.55 > 256 250.74 > 512 451.54 > 1024 829.44 > 2048 1475.87 > 4096
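Until UCX is installed, the two workarounds identified in this thread can also be applied globally through MCA environment variables rather than per-mpirun options. A sketch using the parameters shown above:

```shell
# Work around the eager-RDMA osu_bibw slowdown seen with ob1/openib.
export OMPI_MCA_btl_openib_use_eager_rdma=0
# Avoid the RDMA pipeline for the two largest message sizes tested.
export OMPI_MCA_btl_openib_min_rdma_pipeline_size=4194304

mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw
```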
Re: [OMPI users] OpenMPI with Portals4 transport
Hi Brian, Thanks for the info. I'm not sure I quite get the response though. Is the race condition in the way the Open MPI Portals4 MTL is using Portals, or is it a problem in the Portals implementation itself? Howard 2018-02-08 9:20 GMT-07:00 D. Brian Larkins <brianlark...@gmail.com>: > Howard, > > Looks like ob1 is working fine. When I looked into the problems with ob1, > it looked like the progress thread was polling the Portals event queue > before it had been initialized. > > b. > > $ mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib osu_latency > WARNING: Ummunotify not found: Not using ummunotify can result in > incorrect results download and install ummunotify from: > http://support.systemfabricworks.com/downloads/ummunotify/ > ummunotify-v2.tar.bz2 > WARNING: Ummunotify not found: Not using ummunotify can result in > incorrect results download and install ummunotify from: > http://support.systemfabricworks.com/downloads/ummunotify/ > ummunotify-v2.tar.bz2 > # OSU MPI Latency Test > # Size  Latency (us) > 0 1.87 > 1 1.93 > 2 1.90 > 4 1.94 > 8 1.94 > 16 1.96 > 32 1.97 > 64 1.99 > 128 2.43 > 256 2.50 > 512 2.71 > 1024 3.01 > 2048 3.45 > 4096 4.56 > 8192 6.39 > 16384 8.79 > 32768 11.50 > 65536 16.59 > 131072 27.10 > 262144 46.97 > 524288 87.55 > 1048576 168.89 > 2097152 331.40 > 4194304 654.08 > > > On Feb 7, 2018, at 9:04 PM, Howard Pritchard <hpprit...@gmail.com> wrote: > > HI Brian, > > As a sanity check, can you see if the ob1 pml works okay, i.e. > > mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib ./osu_latency > > Howard > > > 2018-02-07 11:03 GMT-07:00 brian larkins <brianlark...@gmail.com>: > >> Hello, >> >> I’m doing some work with Portals4 and am trying to run some MPI programs >> using the Portals 4 as the transport layer. I’m running into problems and >> am hoping that someone can help me figure out how to get things working. 
>> I’m using OpenMPI 3.0.0 with the following configuration: >> >> ./configure CFLAGS=-pipe —prefix=path/to/install --enable-picky >> --enable-debug --enable-mpi-fortran --with-portals4=path/to/portals4 >> --disable-oshmem --disable-vt --disable-java --disable-mpi-io >> --disable-io-romio --disable-libompitrace --disable-btl-portals4-flow-control >> --disable-mtl-portals4-flow-control >> >> I have also tried the head from the git repo and 2.1.2 with the same >> results. A simpler configure line (w —prefix and —with-portals4=) also gets >> same results. >> >> Portals4 configuration is from github master and configured thus: >> >> ./configure —prefix=path/to/portals4 --with-ev=path/to/libev >> --enable-transport-ib --enable-fast --enable-zero-mrs --enable-me-triggered >> >> If I specify the cm pml on the command-line, I can get examples/hello_c >> to run correctly. Trying to get some latency numbers using the OSU >> benchmarks is where my trouble begins: >> >> $ mpirun -n 2 --mca mtl portals4 --mca pml cm env >> PTL_DISABLE_MEM_REG_CACHE=1 ./osu_latency >> NOTE: Ummunotify and IB registered mem cache disabled, set >> PTL_DISABLE_MEM_REG_CACHE=0 to re-enable. >> NOTE: Ummunotify and IB registered mem cache disabled, set >> PTL_DISABLE_MEM_REG_CACHE=0 to re-enable. >> # OSU MPI Latency Test >> # SizeLatency (us) >> 025.96 >> [node41:19740] *** An error occurred in MPI_Barrier >> [node41:19740] *** reported by process [139815819542529,4294967297] >> [node41:19740] *** on communicator MPI_COMM_WORLD >> [node41:19740] *** MPI_ERR_OTHER: known error not in list >> [node41:19740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator >> will now abort, >> [node41:19740] ***and potentially your MPI job) >> >> Not specifying CM gets an earlier segfault (defaults to ob1) and looks to >> be a progress thread initialization problem. 
>> Using PTL_IGNORE_UMMUNOTIFY=1 gets here: >> >> $ mpirun --mca pml cm -n 2 env PTL_IGNORE_UMMUNOTIFY=1 ./osu_latency >> # OSU MPI Latency Test >> # SizeLatency (us) >> 0
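When it is unclear which PML/MTL pairing is actually being selected (ob1 vs. cm over portals4), the selection logic itself can be made verbose. A sketch along the lines of the runs above:

```shell
# Print the PML and MTL selection decisions during startup so the
# chosen components (and their priorities) show up in the output.
mpirun -np 2 --mca pml cm --mca mtl portals4 \
    --mca pml_base_verbose 100 --mca mtl_base_verbose 100 \
    ./osu_latency
```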
Re: [OMPI users] Using OpenSHMEM with Shared Memory
Hi Ben, I'm afraid this is bad news for using UCX. The problem is that when UCX was configured/built, it did not find a transport for doing one-sided put/get transfers. If you're feeling lucky, you may want to install xpmem (https://github.com/hjelmn/xpmem) and rebuild UCX. This requires building a device driver against your kernel source and taking steps to get the xpmem.ko loaded into the kernel, etc. There's an alternative, however, which works just fine on a laptop running Linux or macOS. Check out https://github.com/Sandia-OpenSHMEM/SOS/releases and get the 1.4.0 release. For build/install, follow the directions at https://github.com/Sandia-OpenSHMEM/SOS/wiki/OFI-Build-Instructions Note you will need to install the MPICH hydra launcher as well. Sandia OpenSHMEM over OFI libfabric uses TCP sockets as the fallback if nothing else is available. I use this version of OpenSHMEM if I'm doing SHMEM stuff on my Mac (no VMs). Howard 2018-02-07 12:49 GMT-07:00 Benjamin Brock <br...@cs.berkeley.edu>: > > Here's what I get with those environment variables: > > https://hastebin.com/ibimipuden.sql > > I'm running Arch Linux (but with OpenMPI/UCX installed from source as > described in my earlier message). > > Ben > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
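A condensed sketch of the SOS-over-libfabric route described above. The prefixes are placeholders and the exact steps (and any additional configure options) are in the wiki page linked in the reply, so treat this as an outline rather than a recipe:

```shell
# Build libfabric first (run in the libfabric source tree).
./configure --prefix=$HOME/opt/libfabric
make -j install

# Then build Sandia OpenSHMEM against it (run in the SOS source tree).
./configure --prefix=$HOME/opt/sos --with-ofi=$HOME/opt/libfabric
make -j install

# Finally install the MPICH hydra launcher from mpich.org, which
# provides the oshrun/mpiexec used to launch SOS programs.
```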
Re: [OMPI users] OpenMPI with Portals4 transport
HI Brian, As a sanity check, can you see if the ob1 pml works okay, i.e. mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib ./osu_latency Howard 2018-02-07 11:03 GMT-07:00 brian larkins <brianlark...@gmail.com>: > Hello, > > I’m doing some work with Portals4 and am trying to run some MPI programs > using the Portals 4 as the transport layer. I’m running into problems and > am hoping that someone can help me figure out how to get things working. > I’m using OpenMPI 3.0.0 with the following configuration: > > ./configure CFLAGS=-pipe —prefix=path/to/install --enable-picky > --enable-debug --enable-mpi-fortran --with-portals4=path/to/portals4 > --disable-oshmem --disable-vt --disable-java --disable-mpi-io > --disable-io-romio --disable-libompitrace --disable-btl-portals4-flow-control > --disable-mtl-portals4-flow-control > > I have also tried the head from the git repo and 2.1.2 with the same > results. A simpler configure line (w —prefix and —with-portals4=) also gets > same results. > > Portals4 configuration is from github master and configured thus: > > ./configure —prefix=path/to/portals4 --with-ev=path/to/libev > --enable-transport-ib --enable-fast --enable-zero-mrs --enable-me-triggered > > If I specify the cm pml on the command-line, I can get examples/hello_c to > run correctly. Trying to get some latency numbers using the OSU benchmarks > is where my trouble begins: > > $ mpirun -n 2 --mca mtl portals4 --mca pml cm env > PTL_DISABLE_MEM_REG_CACHE=1 ./osu_latency > NOTE: Ummunotify and IB registered mem cache disabled, set > PTL_DISABLE_MEM_REG_CACHE=0 to re-enable. > NOTE: Ummunotify and IB registered mem cache disabled, set > PTL_DISABLE_MEM_REG_CACHE=0 to re-enable. 
> # OSU MPI Latency Test > # SizeLatency (us) > 025.96 > [node41:19740] *** An error occurred in MPI_Barrier > [node41:19740] *** reported by process [139815819542529,4294967297] > [node41:19740] *** on communicator MPI_COMM_WORLD > [node41:19740] *** MPI_ERR_OTHER: known error not in list > [node41:19740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator > will now abort, > [node41:19740] ***and potentially your MPI job) > > Not specifying CM gets an earlier segfault (defaults to ob1) and looks to > be a progress thread initialization problem. > Using PTL_IGNORE_UMMUNOTIFY=1 gets here: > > $ mpirun --mca pml cm -n 2 env PTL_IGNORE_UMMUNOTIFY=1 ./osu_latency > # OSU MPI Latency Test > # SizeLatency (us) > 024.14 > 126.24 > [node41:19993] *** Process received signal *** > [node41:19993] Signal: Segmentation fault (11) > [node41:19993] Signal code: Address not mapped (1) > [node41:19993] Failing at address: 0x141 > [node41:19993] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7fa6ac73b710] > [node41:19993] [ 1] /ascldap/users/dblarki/opt/portals4.master/lib/ > libportals.so.4(+0xcd65)[0x7fa69b770d65] > [node41:19993] [ 2] /ascldap/users/dblarki/opt/portals4.master/lib/ > libportals.so.4(PtlPut+0x143)[0x7fa69b773fb3] > [node41:19993] [ 3] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_ > portals4.so(+0xa961)[0x7fa698cf5961] > [node41:19993] [ 4] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_ > portals4.so(+0xb0e5)[0x7fa698cf60e5] > [node41:19993] [ 5] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_ > portals4.so(ompi_mtl_portals4_send+0x90)[0x7fa698cf61d1] > [node41:19993] [ 6] /ascldap/users/dblarki/opt/ > ompi/lib/openmpi/mca_pml_cm.so(+0x5430)[0x7fa69a794430] > [node41:19993] [ 7] /ascldap/users/dblarki/opt/ompi/lib/libmpi.so.40(PMPI_ > Send+0x2b4)[0x7fa6ac9ff018] > [node41:19993] [ 8] ./osu_latency[0x40106f] > [node41:19993] [ 9] /lib64/libc.so.6(__libc_start_ > main+0xfd)[0x7fa6ac3b6d5d] > [node41:19993] [10] ./osu_latency[0x400c59] > > This 
cluster is running RHEL 6.5 without ummunotify modules, but I get the > same results on a local (small) cluster running ubuntu 16.04 with > ummunotify loaded. > > Any help would be much appreciated. > thanks, > > brian. > > > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] Using OpenSHMEM with Shared Memory
HI Ben, Could you set these environment variables and post the output ? export OMPI_MCA_spml=ucx export OMPI_MCA_spml_base_verbose=100 then run your test? Also, what OS are you using? Howard 2018-02-06 20:10 GMT-07:00 Jeff Hammond <jeff.scie...@gmail.com>: > > On Tue, Feb 6, 2018 at 3:58 PM Benjamin Brock <br...@cs.berkeley.edu> > wrote: > >> How can I run an OpenSHMEM program just using shared memory? I'd like to >> use OpenMPI to run SHMEM programs locally on my laptop. >> > > It’s not Open-MPI itself but OSHMPI sits on top of any MPI-3 library and > has a mode to bypass MPI for one-sided if only used within a shared-memory > domain. > > > See https://github.com/jeffhammond/oshmpi and use --enable-smp-optimizations. > While I don’t actively maintain it and it doesn’t support the latest spec, > I’ll fix bugs and implement features on demand if users file GitHub issues. > > Sorry for the shameless self-promotion but I know a few folks who use > OSHMPI specifically because of the SMP feature. > > Sandia OpenSHMEM with OFI definitely works on shared-memory as well. I use > it for all of my Travis CI testing of SHMEM code on both Mac and Linux. > > Jeff > > >> I understand that the old SHMEM component (Yoda?) was taken out, and that >> UCX is now required. I have a build of OpenMPI with UCX as per the >> directions on this random GitHub Page >> <https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX> >> . >> >> When I try to just `shmemrun`, I get a complaint about not haivng any >> splm components available. >> >> [xiii@shini kmer_hash]$ shmemrun -np 2 ./kmer_generic_hash >> >> -- >> No available spml components were found! >> >> This means that there are no components of this type installed on your >> system or all the components reported that they could not be used. >> >> This is a fatal error; your SHMEM process is likely to abort. 
Check the >> output of the "ompi_info" command and ensure that components of this >> type are available on your system. You may also wish to check the >> value of the "component_path" MCA parameter and ensure that it has at >> least one directory that contains valid MCA components. >> >> -- >> [shini:16341] SPML ikrit cannot be selected >> [shini:16342] SPML ikrit cannot be selected >> [shini:16336] 1 more process has sent help message >> help-oshmem-memheap.txt / find-available:none-found >> [shini:16336] Set MCA parameter "orte_base_help_aggregate" to 0 to see >> all help / error messages >> >> >> I tried fiddling with the MCA command-line settings, but didn't have any >> luck. Is it possible to do this? Can anyone point me to some >> documentation? >> >> Thanks, >> >> Ben >> ___ >> users mailing list >> users@lists.open-mpi.org >> https://lists.open-mpi.org/mailman/listinfo/users > > -- > Jeff Hammond > jeff.scie...@gmail.com > http://jeffhammond.github.io/ > > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
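Howard's diagnostic request above can be collected into a single shell fragment. The MCA environment variable names are the ones quoted in the thread; the program name is just the poster's example binary:

```shell
# Force the UCX SPML and turn on verbose component selection so the
# reason behind "No available spml components were found!" is printed.
export OMPI_MCA_spml=ucx
export OMPI_MCA_spml_base_verbose=100

# Re-run the failing OpenSHMEM program with the verbose output enabled.
shmemrun -np 2 ./kmer_generic_hash

# ompi_info shows which SPML components this build actually contains.
ompi_info | grep spml
```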
Re: [OMPI users] About my GPU performance using Openmpi-2.0.4
Hi Phanikumar It’s unlikely the warning message you are seeing is related to GPU performance. Have you tried adding —with-verbs=no to your config line? That should quash openib complaint. Howard Phanikumar Pentyala <phani12.c...@gmail.com> schrieb am Mo. 11. Dez. 2017 um 22:43: > Dear users and developers, > > Currently I am using two Tesla K40m cards for my computational work on > quantum espresso (QE) suit http://www.quantum-espresso.org/. My GPU > enabled QE code running very slower than normal version. When I am > submitting my job on gpu it was showing some error that "A high-performance > Open MPI point-to-point messaging module was unable to find any relevant > network interfaces: > > Module: OpenFabrics (openib) > Host: qmel > > Another transport will be used instead, although this may result in > lower performance. > > Is this the reason for diminishing GPU performance ?? > > I done installation by > > 1. ./configure --prefix=/home//software/openmpi-2.0.4 > --disable-openib-dynamic-sl --disable-openib-udcm --disable-openib-rdmacm" > because we don't have any Infiband adapter HCA in server. > > 2. make all > > 3. make install > > Please correct me If I done any mistake in my installation or I have to > use Infiband adaptor for using Openmpi?? > > I read lot of posts in openmpi forum to remove above error while > submitting job, I added tag of "--mca btl ^openib" , still no use error > vanished but performance was same. > > Current details of server are: > > Server: FUJITSU PRIMERGY RX2540 M2 > CUDA version: 9.0 > openmpi version: 2.0.4 with intel mkl libraries > QE-gpu version (my application): 5.4.0 > > P.S: Extra information attached > > Thanks in advance > > Regards > Phanikumar > Research scholar > IIT Kharagpur > Kharagpur, westbengal > India > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
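Howard's suggestion amounts to two alternatives, sketched below. Only `--with-verbs=no` and `--mca btl ^openib` come from the thread; the prefix path and the application name (`./your_app`) are illustrative placeholders:

```shell
# Build-time fix: configure Open MPI without openib (verbs) support,
# since this server has no InfiniBand HCA.
./configure --prefix=$HOME/software/openmpi-2.0.4 --with-verbs=no
make all && make install

# Run-time alternative: keep the existing build but exclude the openib
# BTL so the "unable to find any relevant network interfaces" warning
# is not emitted.
mpirun --mca btl ^openib -np 2 ./your_app
```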
Re: [OMPI users] [EXTERNAL] Re: Using shmem_int_fadd() in OpenMPI's SHMEM
Hi Ben, Actually I did some checking about the brew install for OFi libfabric. It looks like if your brew is up to date, it will pick up libfabric 1.5.2. Howard 2017-11-22 15:21 GMT-07:00 Howard Pritchard <hpprit...@gmail.com>: > HI Ben, > > Even on one box, the yoda component doesn't work any more. > > If you want to do OpenSHMEM programming on you Macbook pro (like I do) > and you don't want to set up a VM to use UCX, then you can use > Sandia OpenSHMEM implementation. > > https://github.com/Sandia-OpenSHMEM/SOS > > You will need to install the MPICH hydra launcher > > http://www.mpich.org/downloads/versions/ > > as the SOS needs that for its oshrun launcher. > > I use hydra-3.2 on my mac with SOS. > > You will also need to install OFI libfabric: > > https://github.com/ofiwg/libfabric > > I'd suggest installing the OFI 1.5.1 tarball. OFI is also available via > brew > but its so old that I doubt it will work with recent versions of SOS. > > If you'd like to use UCX, you'll need to install it and Open MPI on a VM > running a linux distro. > > Howard > > > 2017-11-21 12:47 GMT-07:00 Benjamin Brock <br...@cs.berkeley.edu>: > >> > What version of Open MPI are you trying to use? >> >> Open MPI 2.1.1-2 as distributed by Arch Linux. >> >> > Also, could you describe something about your system. >> >> This is all in shared memory on a MacBook Pro; no networking involved. 
>> >> The seg fault with the code example above looks like this: >> >> [xiii@shini kmer_hash]$ g++ minimal.cpp -o minimal `shmemcc >> --showme:link` >> [xiii@shini kmer_hash]$ !shm >> shmemrun -n 2 ./minimal >> [shini:08284] *** Process received signal *** >> [shini:08284] Signal: Segmentation fault (11) >> [shini:08284] Signal code: Address not mapped (1) >> [shini:08284] Failing at address: 0x18 >> [shini:08284] [ 0] /usr/lib/libpthread.so.0(+0x11da0)[0x7f06fb763da0] >> [shini:08284] [ 1] /usr/lib/openmpi/openmpi/mca_s >> pml_yoda.so(mca_spml_yoda_get+0x7da)[0x7f06e0eef0aa] >> [shini:08284] [ 2] /usr/lib/openmpi/openmpi/mca_a >> tomic_basic.so(atomic_basic_lock+0xb2)[0x7f06e08d90d2] >> [shini:08284] [ 3] /usr/lib/openmpi/openmpi/mca_a >> tomic_basic.so(mca_atomic_basic_fadd+0x4a)[0x7f06e08d949a] >> [shini:08284] [ 4] /usr/lib/openmpi/liboshmem.so. >> 20(shmem_int_fadd+0x90)[0x7f06fc5a7660] >> [shini:08284] [ 5] ./minimal(+0x94f)[0x55a5cde7e94f] >> [shini:08284] [ 6] /usr/lib/libc.so.6(__libc_star >> t_main+0xea)[0x7f06fb3baf6a] >> [shini:08284] [ 7] ./minimal(+0x80a)[0x55a5cde7e80a] >> [shini:08284] *** End of error message *** >> >> -- >> shmemrun noticed that process rank 1 with PID 0 on node shini exited on >> signal 11 (Segmentation fault). >> >> -- >> >> Cheers, >> >> Ben >> >> ___ >> users mailing list >> users@lists.open-mpi.org >> https://lists.open-mpi.org/mailman/listinfo/users >> > > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] [EXTERNAL] Re: Using shmem_int_fadd() in OpenMPI's SHMEM
HI Ben, Even on one box, the yoda component doesn't work any more. If you want to do OpenSHMEM programming on you Macbook pro (like I do) and you don't want to set up a VM to use UCX, then you can use Sandia OpenSHMEM implementation. https://github.com/Sandia-OpenSHMEM/SOS You will need to install the MPICH hydra launcher http://www.mpich.org/downloads/versions/ as the SOS needs that for its oshrun launcher. I use hydra-3.2 on my mac with SOS. You will also need to install OFI libfabric: https://github.com/ofiwg/libfabric I'd suggest installing the OFI 1.5.1 tarball. OFI is also available via brew but its so old that I doubt it will work with recent versions of SOS. If you'd like to use UCX, you'll need to install it and Open MPI on a VM running a linux distro. Howard 2017-11-21 12:47 GMT-07:00 Benjamin Brock <br...@cs.berkeley.edu>: > > What version of Open MPI are you trying to use? > > Open MPI 2.1.1-2 as distributed by Arch Linux. > > > Also, could you describe something about your system. > > This is all in shared memory on a MacBook Pro; no networking involved. > > The seg fault with the code example above looks like this: > > [xiii@shini kmer_hash]$ g++ minimal.cpp -o minimal `shmemcc --showme:link` > [xiii@shini kmer_hash]$ !shm > shmemrun -n 2 ./minimal > [shini:08284] *** Process received signal *** > [shini:08284] Signal: Segmentation fault (11) > [shini:08284] Signal code: Address not mapped (1) > [shini:08284] Failing at address: 0x18 > [shini:08284] [ 0] /usr/lib/libpthread.so.0(+0x11da0)[0x7f06fb763da0] > [shini:08284] [ 1] /usr/lib/openmpi/openmpi/mca_s > pml_yoda.so(mca_spml_yoda_get+0x7da)[0x7f06e0eef0aa] > [shini:08284] [ 2] /usr/lib/openmpi/openmpi/mca_a > tomic_basic.so(atomic_basic_lock+0xb2)[0x7f06e08d90d2] > [shini:08284] [ 3] /usr/lib/openmpi/openmpi/mca_a > tomic_basic.so(mca_atomic_basic_fadd+0x4a)[0x7f06e08d949a] > [shini:08284] [ 4] /usr/lib/openmpi/liboshmem.so. 
> 20(shmem_int_fadd+0x90)[0x7f06fc5a7660] > [shini:08284] [ 5] ./minimal(+0x94f)[0x55a5cde7e94f] > [shini:08284] [ 6] /usr/lib/libc.so.6(__libc_star > t_main+0xea)[0x7f06fb3baf6a] > [shini:08284] [ 7] ./minimal(+0x80a)[0x55a5cde7e80a] > [shini:08284] *** End of error message *** > -- > shmemrun noticed that process rank 1 with PID 0 on node shini exited on > signal 11 (Segmentation fault). > -- > > Cheers, > > Ben > > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] [EXTERNAL] Re: Using shmem_int_fadd() in OpenMPI's SHMEM
HI Folks, For the Open MPI 2.1.1 release, the only OSHMEM SPML's that work are the ikrit and ucx. yoda doesn't work. Ikrit only works on systems with Mellanox iinterconnects and requires MXM to be installed. This is recommended for systems with connectx3 or older HCAs. For systems with connectx4 or connectx5 you should be using UCX. You'll need to add --with-ucx + arguments as required to the configure command line when you build Open MPI/OSHMEM to pick up the ucx stuff. A gotcha is that by default, the ucx spml is not selected, so either on the oshrun command line add --mca spml ucx or via env. variable export OMPI_MCA_spml=ucx I verified that a 2.1.1 release + UCX 1.2.0 builds your test (after fixing the unusual include files) and passes on my mellanox connectx5 cluster. Howard 2017-11-21 8:24 GMT-07:00 Hammond, Simon David <sdha...@sandia.gov>: > Hi Howard/OpenMPI Users, > > > > I have had a similar seg-fault this week using OpenMPI 2.1.1 with GCC > 4.9.3 so I tried to compile the example code in the email below. I see > similar behavior to a small benchmark we have in house (but using inc not > finc). > > > > When I run on a single node (both PE’s on the same node) I get the error > below. But, if I run on multiple nodes (say 2 nodes with one PE per node) > then the code runs fine. Same thing for my benchmark which uses > shmem_longlong_inc. For reference, we are using InfiniBand on our cluster > and dual-socket Haswell processors. > > > > Hope that helps, > > > > S. > > > > $ shmemrun -n 2 ./testfinc > > -- > > WARNING: There is at least non-excluded one OpenFabrics device found, > > but there are no active ports detected (or Open MPI was unable to use > > them). This is most certainly not what you wanted. Check your > > cables, subnet manager configuration, etc. The openib BTL will be > > ignored for this job. 
> > > > Local host: shepard-lsm1 > > -- > > [shepard-lsm1:49505] *** Process received signal *** > > [shepard-lsm1:49505] Signal: Segmentation fault (11) > > [shepard-lsm1:49505] Signal code: Address not mapped (1) > > [shepard-lsm1:49505] Failing at address: 0x18 > > [shepard-lsm1:49505] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7ffc4cd9e710] > > [shepard-lsm1:49505] [ 1] /home/projects/x86-64-haswell/ > openmpi/2.1.1/gcc/4.9.3/lib/openmpi/mca_spml_yoda.so(mca_ > spml_yoda_get+0x86d)[0x7ffc337cf37d] > > [shepard-lsm1:49505] [ 2] /home/projects/x86-64-haswell/ > openmpi/2.1.1/gcc/4.9.3/lib/openmpi/mca_atomic_basic.so( > atomic_basic_lock+0x9a)[0x7ffc32f190aa] > > [shepard-lsm1:49505] [ 3] /home/projects/x86-64-haswell/ > openmpi/2.1.1/gcc/4.9.3/lib/openmpi/mca_atomic_basic.so( > mca_atomic_basic_fadd+0x39)[0x7ffc32f19409] > > [shepard-lsm1:49505] [ 4] /home/projects/x86-64-haswell/ > openmpi/2.1.1/gcc/4.9.3/lib/liboshmem.so.20(shmem_int_ > fadd+0x80)[0x7ffc4d2fc110] > > [shepard-lsm1:49505] [ 5] ./testfinc[0x400888] > > [shepard-lsm1:49505] [ 6] /lib64/libc.so.6(__libc_start_ > main+0xfd)[0x7ffc4ca19d5d] > > [shepard-lsm1:49505] [ 7] ./testfinc[0x400739] > > [shepard-lsm1:49505] *** End of error message *** > > -- > > shmemrun noticed that process rank 1 with PID 0 on node shepard-lsm1 > exited on signal 11 (Segmentation fault). 
> > -- > > [shepard-lsm1:49499] 1 more process has sent help message > help-mpi-btl-openib.txt / no active ports found > > [shepard-lsm1:49499] Set MCA parameter "orte_base_help_aggregate" to 0 to > see all help / error messages > > > > -- > > Si Hammond > > Scalable Computer Architectures > > Sandia National Laboratories, NM, USA > > > > > > *From: *users <users-boun...@lists.open-mpi.org> on behalf of Howard > Pritchard <hpprit...@gmail.com> > *Reply-To: *Open MPI Users <users@lists.open-mpi.org> > *Date: *Monday, November 20, 2017 at 4:11 PM > *To: *Open MPI Users <users@lists.open-mpi.org> > *Subject: *[EXTERNAL] Re: [OMPI users] Using shmem_int_fadd() in > OpenMPI's SHMEM > > > > HI Ben, > > > > What version of Open MPI are you trying to use? > > > > Also, could you describe something about your system. If its a cluster > > what sort of interconnect is being used. > > > > Howard > > > > > > 2017-11-20 14:13 GMT-07:00 Benjamin Brock <br...@cs.berkeley.edu>: > > What's the proper way to use
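The selection gotcha Howard describes (the ucx SPML is not picked by default in 2.1.1) can be summarized as a short fragment. The UCX install path is an example; the flags and variable names are the ones given in the thread:

```shell
# Build Open MPI/OSHMEM with UCX support (path is an example).
./configure --with-ucx=/opt/ucx ...

# At run time, select the ucx SPML explicitly, either per-command:
oshrun --mca spml ucx -n 2 ./testfinc

# or once via the environment:
export OMPI_MCA_spml=ucx
```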
Re: [OMPI users] Using shmem_int_fadd() in OpenMPI's SHMEM
HI Ben, What version of Open MPI are you trying to use? Also, could you describe something about your system. If its a cluster what sort of interconnect is being used. Howard 2017-11-20 14:13 GMT-07:00 Benjamin Brock <br...@cs.berkeley.edu>: > What's the proper way to use shmem_int_fadd() in OpenMPI's SHMEM? > > A minimal example seems to seg fault: > > #include > #include > > #include > > int main(int argc, char **argv) { > shmem_init(); > const size_t shared_segment_size = 1024; > void *shared_segment = shmem_malloc(shared_segment_size); > > int *arr = (int *) shared_segment; > int *local_arr = (int *) malloc(sizeof(int) * 10); > > if (shmem_my_pe() == 1) { > shmem_int_fadd((int *) shared_segment, 1, 0); > } > shmem_barrier_all(); > > return 0; > } > > > Where am I going wrong here? This sort of thing works in Cray SHMEM. > > Ben Bock > > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] Problems building OpenMPI 2.1.1 on Intel KNL
Hello Ake, Would you mind opening an issue on Github so we can track this? https://github.com/open-mpi/ompi/issues There's a template to show what info we need to fix this. Thanks very much for reporting this, Howard 2017-11-20 3:26 GMT-07:00 Åke Sandgren <ake.sandg...@hpc2n.umu.se>: > Hi! > > When the xppsl-libmemkind-dev package version 1.5.3 is installed > building OpenMPI fails. > > opal/mca/mpool/memkind uses the macro MEMKIND_NUM_BASE_KIND which has > been moved to memkind/internal/memkind_private.h > > Current master is also using that so I think that will also fail. > > Are there anyone working on this? > > -- > Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden > Internet: a...@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90-580 14 > Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] OMPI 2.1.2 and SLURM compatibility
Hello Bennet, What you are trying to do using srun as the job launcher should work. Could you post the contents of /etc/slurm/slurm.conf for your system? Could you also post the output of the following command: ompi_info --all | grep pmix to the mail list. the config.log from your build would also be useful. Howard 2017-11-16 9:30 GMT-07:00 r...@open-mpi.org <r...@open-mpi.org>: > What Charles said was true but not quite complete. We still support the > older PMI libraries but you likely have to point us to wherever slurm put > them. > > However,we definitely recommend using PMIx as you will get a faster launch > > Sent from my iPad > > > On Nov 16, 2017, at 9:11 AM, Bennet Fauber <ben...@umich.edu> wrote: > > > > Charlie, > > > > Thanks a ton! Yes, we are missing two of the three steps. > > > > Will report back after we get pmix installed and after we rebuild > > Slurm. We do have a new enough version of it, at least, so we might > > have missed the target, but we did at least hit the barn. ;-) > > > > > > > >> On Thu, Nov 16, 2017 at 10:54 AM, Charles A Taylor <chas...@ufl.edu> > wrote: > >> Hi Bennet, > >> > >> Three things... > >> > >> 1. OpenMPI 2.x requires PMIx in lieu of pmi1/pmi2. > >> > >> 2. You will need slurm 16.05 or greater built with —with-pmix > >> > >> 2a. You will need pmix 1.1.5 which you can get from github. > >> (https://github.com/pmix/tarballs). > >> > >> 3. then, to launch your mpi tasks on the allocated resources, > >> > >> srun —mpi=pmix ./hello-mpi > >> > >> I’m replying to the list because, > >> > >> a) this information is harder to find than you might think. > >> b) someone/anyone can correct me if I’’m giving a bum steer. > >> > >> Hope this helps, > >> > >> Charlie Taylor > >> University of Florida > >> > >> On Nov 16, 2017, at 10:34 AM, Bennet Fauber <ben...@umich.edu> wrote: > >> > >> I think that OpenMPI is supposed to support SLURM integration such that > >> > >> srun ./hello-mpi > >> > >> should work? 
I built OMPI 2.1.2 with > >> > >> export CONFIGURE_FLAGS='--disable-dlopen --enable-shared' > >> export COMPILERS='CC=gcc CXX=g++ FC=gfortran F77=gfortran' > >> > >> CMD="./configure \ > >> --prefix=${PREFIX} \ > >> --mandir=${PREFIX}/share/man \ > >> --with-slurm \ > >> --with-pmi \ > >> --with-lustre \ > >> --with-verbs \ > >> $CONFIGURE_FLAGS \ > >> $COMPILERS > >> > >> I have a simple hello-mpi.c (source included below), which compiles > >> and runs with mpirun, both on the login node and in a job. However, > >> when I try to use srun in place of mpirun, I get instead a hung job, > >> which upon cancellation produces this output. > >> > >> [bn2.stage.arc-ts.umich.edu:116377] PMI_Init [pmix_s1.c:162:s1_init]: > >> PMI is not initialized > >> [bn1.stage.arc-ts.umich.edu:36866] PMI_Init [pmix_s1.c:162:s1_init]: > >> PMI is not initialized > >> [warn] opal_libevent2022_event_active: event has no event_base set. > >> [warn] opal_libevent2022_event_active: event has no event_base set. > >> slurmstepd: error: *** STEP 86.0 ON bn1 CANCELLED AT > 2017-11-16T10:03:24 *** > >> srun: Job step aborted: Waiting up to 32 seconds for job step to finish. > >> slurmstepd: error: *** JOB 86 ON bn1 CANCELLED AT 2017-11-16T10:03:24 > *** > >> > >> The SLURM web page suggests that OMPI 2.x and later support PMIx, and > >> to use `srun --mpi=pimx`, however that no longer seems to be an > >> option, and using the `openmpi` type isn't working (neither is pmi2). > >> > >> [bennet@beta-build hello]$ srun --mpi=list > >> srun: MPI types are... > >> srun: mpi/pmi2 > >> srun: mpi/lam > >> srun: mpi/openmpi > >> srun: mpi/mpich1_shmem > >> srun: mpi/none > >> srun: mpi/mvapich > >> srun: mpi/mpich1_p4 > >> srun: mpi/mpichgm > >> srun: mpi/mpichmx > >> > >> To get the Intel PMI to work with srun, I have to set > >> > >> I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so > >> > >> Is there a comparable environment variable that must be set to enable > >> `srun` to work? 
> >> > >> Am I missing a build option or misspecifying one? > >> > >> -- bennet > >>
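Charles's three-step recipe from this thread, condensed into commands. Version numbers and the PMIx prefix are the ones he cites; treat them as a sketch for that era of Slurm/Open MPI rather than current guidance:

```shell
# 1. Build PMIx 1.1.5 (tarballs at https://github.com/pmix/tarballs).
# 2. Build Slurm 16.05 or newer against it:
#      ./configure --with-pmix=/path/to/pmix ...
# 3. Build Open MPI 2.x with matching PMIx support, then launch with:
srun --mpi=pmix ./hello-mpi

# Verify which MPI plugin types this Slurm installation offers;
# "pmix" must appear in the list for the launch line above to work.
srun --mpi=list
```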
Re: [OMPI users] [OMPI devel] Open MPI 2.0.4rc2 available for testing
HI Siegmar, Could you check if you also see a similar problem with OMPI master when you build with the Sun compiler? I opened issue 4436 to track this issue. Not sure we'll have time to fix it for 2.0.4 though. Howard 2017-11-02 3:49 GMT-06:00 Siegmar Gross < siegmar.gr...@informatik.hs-fulda.de>: > Hi, > > thank you very much for the fix. Unfortunately, I still get an error > with Sun C 5.15. > > > loki openmpi-2.0.4rc2-Linux.x86_64.64_cc 125 tail -30 > log.make.Linux.x86_64.64_cc > CC src/client/pmix_client.lo > "/export2/src/openmpi-2.0.4/openmpi-2.0.4rc2/opal/include/opal/sys/x86_64/atomic.h", > line 161: warning: parameter in inline asm statement unused: %3 > "/export2/src/openmpi-2.0.4/openmpi-2.0.4rc2/opal/include/opal/sys/x86_64/atomic.h", > line 207: warning: parameter in inline asm statement unused: %2 > "/export2/src/openmpi-2.0.4/openmpi-2.0.4rc2/opal/include/opal/sys/x86_64/atomic.h", > line 228: warning: parameter in inline asm statement unused: %2 > "/export2/src/openmpi-2.0.4/openmpi-2.0.4rc2/opal/include/opal/sys/x86_64/atomic.h", > line 249: warning: parameter in inline asm statement unused: %2 > "/export2/src/openmpi-2.0.4/openmpi-2.0.4rc2/opal/include/opal/sys/x86_64/atomic.h", > line 270: warning: parameter in inline asm statement unused: %2 > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 235: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Get_version > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 240: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Init > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 408: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Initialized > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > 
line 416: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Finalize > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 488: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Abort > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 616: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Put > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 703: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Commit > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 789: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Resolve_peers > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 852: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Resolve_nodes > cc: acomp failed for ../../../../../../openmpi-2.0. 
> 4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c > Makefile:1242: recipe for target 'src/client/pmix_client.lo' failed > make[4]: *** [src/client/pmix_client.lo] Error 1 > make[4]: Leaving directory '/export2/src/openmpi-2.0.4/op > enmpi-2.0.4rc2-Linux.x86_64.64_cc/opal/mca/pmix/pmix112/pmix' > Makefile:1486: recipe for target 'all-recursive' failed > make[3]: *** [all-recursive] Error 1 > make[3]: Leaving directory '/export2/src/openmpi-2.0.4/op > enmpi-2.0.4rc2-Linux.x86_64.64_cc/opal/mca/pmix/pmix112/pmix' > Makefile:1935: recipe for target 'all-recursive' failed > make[2]: *** [all-recursive] Error 1 > make[2]: Leaving directory '/export2/src/openmpi-2.0.4/op > enmpi-2.0.4rc2-Linux.x86_64.64_cc/opal/mca/pmix/pmix112' > Makefile:2301: recipe for target 'all-recursive' failed > make[1]: *** [all-recursive] Error 1 > make[1]: Leaving directory '/export2/src/openmpi-2.0.4/op > enmpi-2.0.4rc2-Linux.x86_64.64_cc/opal' > Makefile:1800: recipe for target 'all-recursive' failed > make: *** [all-recursive] Error 1 > loki openmpi-2.0.4rc2-Linux.x86_64.64_cc 125 > > > > I would be grateful, if somebody can fix these problems as well. > Thank you very much for any help in advance. > > > Kind regards > > Siegmar > > > > On 11/01/17 23:18, Howard Pritchard wrote: > >> HI Folks, >> >> We decided to roll an rc2 to pick up a PMIx fix: >> >&g
Re: [OMPI users] Strange benchmarks at large message sizes
Hello Cooper Could you rerun your test with the following env. variable set export OMPI_MCA_coll=self,basic,libnbc and see if that helps? Also, what type of interconnect are you using - ethernet, IB, ...? Howard 2017-09-19 8:56 GMT-06:00 Cooper Burns <cooper.bu...@convergecfd.com>: > Hello, > > I have been running some simple benchmarks and saw some strange behaviour: > All tests are done on 4 nodes with 24 cores each (total of 96 mpi > processes) > > When I run MPI_Allreduce() I see the run time spike up (about 10x) when I > go from reducing a total of 4096KB to 8192KB for example, when count is > 2^21 (8192 kb of 4 byte ints): > > MPI_Allreduce(send_buf, recv_buf, count, MPI_SUM, MPI_COMM_WORLD) > > is slower than: > > MPI_Allreduce(send_buf, recv_buf, count*/2*, MPI_INT, MPI_SUM, > MPI_COMM_WORLD) > MPI_Allreduce(send_buf* + count/2*, recv_buf *+ count/2*, count*/2*,MPI_INT, > MPI_SUM, MPI_COMM_WORLD) > > Just wondering if anyone knows what the cause of this behaviour is. > > Thanks! > Cooper > > > Cooper Burns > Senior Research Engineer > <https://www.linkedin.com/company/convergent-science-inc> > <https://www.facebook.com/ConvergentScience> > <https://twitter.com/convergecfd> > <https://www.youtube.com/user/convergecfd> > <https://vimeo.com/convergecfd> > (608) 230-1551 > convergecfd.com > <https://convergecfd.com/?utm_source=Email_medium=signature_campaign=CSIEmailSignature> > > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
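Howard's experiment above, as a runnable fragment. The coll component list is the one he gives; the benchmark binary name is a hypothetical stand-in for the poster's Allreduce test:

```shell
# Replace the tuned collective component with basic/libnbc to check
# whether a large-message algorithm switch causes the ~10x latency
# spike between 4096 KB and 8192 KB reductions.
export OMPI_MCA_coll=self,basic,libnbc
mpirun -np 96 ./allreduce_bench
```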
Re: [OMPI users] openmpi-2.1.2rc2: warnings from "make" and "make check"
Hi Siegmar, Opened issue 4151 to track this. Thanks, Howard 2017-08-21 7:13 GMT-06:00 Siegmar Gross < siegmar.gr...@informatik.hs-fulda.de>: > Hi, > > I've installed openmpi-2.1.2rc2 on my "SUSE Linux Enterprise Server 12.2 > (x86_64)" with Sun C 5.15 (Oracle Developer Studio 12.6) and gcc-7.1.0. > Perhaps somebody wants to eliminate the following warnings. > > > openmpi-2.1.2rc2-Linux.x86_64.64_gcc/log.make.Linux.x86_64.6 > 4_gcc:openmpi-2.1.2rc2/ompi/mca/io/romio314/romio/adio/common/utils.c:97:3: > warning: passing argument 3 of 'PMPI_Type_hindexed' discards 'const' > qualifier from pointer target type [-Wdiscarded-qualifiers] > openmpi-2.1.2rc2-Linux.x86_64.64_gcc/log.make.Linux.x86_64.6 > 4_gcc:openmpi-2.1.2rc2/ompi/mpiext/cuda/c/mpiext_cuda_c.h:16:0: warning: > "MPIX_CUDA_AWARE_SUPPORT" redefined > > > openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64 > _cc:"openmpi-2.1.2rc2/opal/mca/hwloc/hwloc1112/hwloc/src/topology-custom.c", > line 88: warning: initializer will be sign-extended: -1 > openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64 > _cc:"openmpi-2.1.2rc2/opal/mca/hwloc/hwloc1112/hwloc/src/topology-linux.c", > line 2640: warning: initializer will be sign-extended: -1 > openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64 > _cc:"openmpi-2.1.2rc2/opal/mca/hwloc/hwloc1112/hwloc/src/topology-synthetic.c", > line 851: warning: initializer will be sign-extended: -1 > openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64 > _cc:"openmpi-2.1.2rc2/opal/mca/hwloc/hwloc1112/hwloc/src/topology-x86.c", > line 113: warning: initializer will be sign-extended: -1 > openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64 > _cc:"openmpi-2.1.2rc2/opal/mca/hwloc/hwloc1112/hwloc/src/topology-xml.c", > line 1667: warning: initializer will be sign-extended: -1 > openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64 > _cc:"openmpi-2.1.2rc2/ompi/mca/io/romio314/romio/adio/common/ad_fstype.c", > line 428: warning: statement not 
reached
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/ompi/mca/io/romio314/romio/adio/common/ad_threaded_io.c", line 31: warning: statement not reached
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/ompi/mca/io/romio314/romio/adio/common/utils.c", line 97: warning: argument #3 is incompatible with prototype:
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/atomic.h", line 161: warning: parameter in inline asm statement unused: %3
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/atomic.h", line 207: warning: parameter in inline asm statement unused: %2
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/atomic.h", line 228: warning: parameter in inline asm statement unused: %2
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/atomic.h", line 249: warning: parameter in inline asm statement unused: %2
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/atomic.h", line 270: warning: parameter in inline asm statement unused: %2
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/mca/pmix/pmix112/pmix/src/client/pmi1.c", line 708: warning: null dimension: argvp
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c", line 266: warning: initializer will be sign-extended: -1
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c", line 267: warning: initializer will be sign-extended: -1
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/ompi/mpiext/cuda/c/mpiext_cuda_c.h", line 16: warning: macro redefined: MPIX_CUDA_AWARE_SUPPORT
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/timer.h", line 49: warning: initializer does not fit or is out of range: 0x8007
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/mca/pmix/pmix112/pmix1_client.c", line 408: warning: enum type mismatch: arg #1
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linu
Re: [OMPI users] openmpi-master-201708190239-9d3f451: warnings from "make" and "make check"
Hi Siegmar, I opened issue 4151 to track this. It is relevant to a project to get Open MPI to build with -Werror. Thanks very much, Howard 2017-08-21 7:27 GMT-06:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
> Hi,
>
> I've installed openmpi-master-201708190239-9d3f451 on my "SUSE Linux Enterprise Server 12.2 (x86_64)" with Sun C 5.15 (Oracle Developer Studio 12.6) and gcc-7.1.0. Perhaps somebody wants to eliminate the following warnings.
>
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_gcc/log.make.Linux.x86_64.64_gcc:../../../../../../../../../openmpi-master-201708190239-9d3f451/opal/mca/pmix/pmix2x/pmix/src/mca/bfrops/base/bfrop_base_copy.c:414:22: warning: statement will never be executed [-Wswitch-unreachable]
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_gcc/log.make.Linux.x86_64.64_gcc:../../../../../openmpi-master-201708190239-9d3f451/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:136:34: warning: passing argument 1 of '__xpg_basename' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_gcc/log.make.Linux.x86_64.64_gcc:../../../../../openmpi-master-201708190239-9d3f451/ompi/mpiext/cuda/c/mpiext_cuda_c.h:16:0: warning: "MPIX_CUDA_AWARE_SUPPORT" redefined
>
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_gcc/log.make-check.Linux.x86_64.64_gcc:../../../openmpi-master-201708190239-9d3f451/test/class/opal_fifo.c:109:26: warning: assignment discards 'volatile' qualifier from pointer target type [-Wdiscarded-qualifiers]
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_gcc/log.make-check.Linux.x86_64.64_gcc:../../../openmpi-master-201708190239-9d3f451/test/class/opal_lifo.c:72:26: warning: assignment discards 'volatile' qualifier from pointer target type [-Wdiscarded-qualifiers]
>
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/opal/mca/pmix/pmix2x/pmix/src/mca/base/pmix_mca_base_component_repository.c", line 266: warning: statement not reached
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/opal/mca/pmix/pmix2x/pmix/src/mca/bfrops/base/bfrop_base_copy.c", line 414: warning: statement not reached
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/opal/mca/hwloc/hwloc2a/hwloc/hwloc/topology-linux.c", line 2797: warning: initializer will be sign-extended: -1
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/opal/mca/hwloc/hwloc2a/hwloc/hwloc/topology-synthetic.c", line 946: warning: initializer will be sign-extended: -1
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/opal/mca/hwloc/hwloc2a/hwloc/hwloc/topology-x86.c", line 238: warning: initializer will be sign-extended: -1
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/opal/mca/hwloc/hwloc2a/hwloc/hwloc/topology-xml.c", line 2404: warning: initializer will be sign-extended: -1
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/opal/mca/pmix/pmix2x/pmix/src/client/pmi1.c", line 711: warning: null dimension: argvp
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/ompi/mca/io/romio314/romio/adio/common/ad_fstype.c", line 428: warning: statement not reached
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/ompi/mca/io/romio314/romio/adio/common/ad_threaded_io.c", line 31: warning: statement not reached
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/ompi/mca/coll/monitoring/coll_monitoring_component.c", line 160: warning: improper pointer/integer combination: op "="
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c", line 136: warning: argument #1 is incompatible with prototype:
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmp
Re: [OMPI users] pmix, lxc, hpcx
Hi John, In the 2.1.x release stream a shared memory capability was introduced into the PMIx component. I know nothing about LXC containers, but it looks to me like there's some issue when PMIx tries to create these shared memory segments. I'd check to see if there's something about your container configuration that is preventing the creation of shared memory segments. Howard 2017-05-26 15:18 GMT-06:00 John Marshall <john.marsh...@ssc-spc.gc.ca>: > Hi, > > I have built openmpi 2.1.1 with hpcx-1.8 and tried to run some mpi code > under > ubuntu 14.04 and LXC (1.x) but I get the following: > > [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file > src/dstore/pmix_esh.c at line 1651 > [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file > src/dstore/pmix_esh.c at line 1751 > [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file > src/dstore/pmix_esh.c at line 1114 > [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file > src/common/pmix_jobdata.c at line 93 > [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file > src/common/pmix_jobdata.c at line 333 > [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file > src/server/pmix_server.c at line 606 > > I do not get the same outside of the LXC container and my code runs fine. > > I've looked for more info on these messages but could not find anything > helpful. Are these messages indicative of something missing in, or some > incompatibility with, the container? > > When I build using 2.0.2, I do not have a problem running inside or > outside of > the container. > > Thanks, > John > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes
Forgot you probably need an equal sign after btl arg Howard Pritchard <hpprit...@gmail.com> schrieb am Mi. 22. März 2017 um 18:11: > Hi Goetz > > Thanks for trying these other versions. Looks like a bug. Could you post > the config.log output from your build of the 2.1.0 to the list? > > Also could you try running the job using this extra command line arg to > see if the problem goes away? > > mpirun --mca btl ^vader (rest of your args) > > Howard > > Götz Waschk <goetz.was...@gmail.com> schrieb am Mi. 22. März 2017 um > 13:09: > > On Wed, Mar 22, 2017 at 7:46 PM, Howard Pritchard <hpprit...@gmail.com> > wrote: > > Hi Goetz, > > > > Would you mind testing against the 2.1.0 release or the latest from the > > 1.10.x series (1.10.6)? > > Hi Howard, > > after sending my mail I have tested both 1.10.6 and 2.1.0 and I have > received the same error. I have also tested outside of slurm using > ssh, same problem. > > Here's the message from 2.1.0: > [pax11-10:21920] *** Process received signal *** > [pax11-10:21920] Signal: Bus error (7) > [pax11-10:21920] Signal code: Non-existant physical address (2) > [pax11-10:21920] Failing at address: 0x2b5d5b752290 > [pax11-10:21920] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b5d446e9370] > [pax11-10:21920] [ 1] > > /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x70)[0x2b5d531645e0] > [pax11-10:21920] [ 2] > > /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x211)[0x2b5d44f607c1] > [pax11-10:21920] [ 3] > > /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(+0x2b51)[0x2b5d53162b51] > [pax11-10:21920] [ 4] > > /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x3f)[0x2b5d5bb0a17f] > [pax11-10:21920] [ 5] > > /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xa7a)[0x2b5d5bafe0aa] > [pax11-10:21920] [ 6] > > 
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x399)[0x2b5d44480429] > [pax11-10:21920] [ 7] > > /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2b5d86ab] > [pax11-10:21920] [ 8] IMB-MPI1[0x40b2ff] > [pax11-10:21920] [ 9] IMB-MPI1[0x402646] > [pax11-10:21920] [10] > /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b5d44917b35] > [pax11-10:21920] [11] IMB-MPI1[0x401f79] > [pax11-10:21920] *** End of error message *** > -- > mpirun noticed that process rank 320 with PID 21920 on node pax11-10 > exited on signal 7 (Bus error). > -- > > > Regards, Götz Waschk > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes
Hi Goetz Thanks for trying these other versions. Looks like a bug. Could you post the config.log output from your build of the 2.1.0 to the list? Also could you try running the job using this extra command line arg to see if the problem goes away? mpirun --mca btl ^vader (rest of your args) Howard Götz Waschk <goetz.was...@gmail.com> schrieb am Mi. 22. März 2017 um 13:09: On Wed, Mar 22, 2017 at 7:46 PM, Howard Pritchard <hpprit...@gmail.com> wrote: > Hi Goetz, > > Would you mind testing against the 2.1.0 release or the latest from the > 1.10.x series (1.10.6)? Hi Howard, after sending my mail I have tested both 1.10.6 and 2.1.0 and I have received the same error. I have also tested outside of slurm using ssh, same problem. Here's the message from 2.1.0: [pax11-10:21920] *** Process received signal *** [pax11-10:21920] Signal: Bus error (7) [pax11-10:21920] Signal code: Non-existant physical address (2) [pax11-10:21920] Failing at address: 0x2b5d5b752290 [pax11-10:21920] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b5d446e9370] [pax11-10:21920] [ 1] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x70)[0x2b5d531645e0] [pax11-10:21920] [ 2] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x211)[0x2b5d44f607c1] [pax11-10:21920] [ 3] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(+0x2b51)[0x2b5d53162b51] [pax11-10:21920] [ 4] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x3f)[0x2b5d5bb0a17f] [pax11-10:21920] [ 5] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xa7a)[0x2b5d5bafe0aa] [pax11-10:21920] [ 6] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x399)[0x2b5d44480429] [pax11-10:21920] [ 7] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2b5d86ab] [pax11-10:21920] [ 8] IMB-MPI1[0x40b2ff] [pax11-10:21920] [ 9] 
IMB-MPI1[0x402646] [pax11-10:21920] [10] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b5d44917b35] [pax11-10:21920] [11] IMB-MPI1[0x401f79] [pax11-10:21920] *** End of error message *** -- mpirun noticed that process rank 320 with PID 21920 on node pax11-10 exited on signal 7 (Bus error). -- Regards, Götz Waschk ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
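Putting Howard's two messages together, the workaround can be spelled either way below. This is a sketch; the process count and the IMB binary are taken from the report. The caret means "exclude this component", and quoting keeps the shell from touching it:

```shell
# Command-line form (two arguments: parameter name, then value):
#   mpirun --mca btl '^vader' -np 1024 ./IMB-MPI1
# Environment-variable form -- this is where the equal sign is required:
export OMPI_MCA_btl='^vader'
```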
Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes
Hi Goetz, Would you mind testing against the 2.1.0 release or the latest from the 1.10.x series (1.10.6)? Thanks, Howard 2017-03-22 6:25 GMT-06:00 Götz Waschk <goetz.was...@gmail.com>: > Hi everyone, > > I'm testing a new machine with 32 nodes of 32 cores each using the IMB > benchmark. It is working fine with 512 processes, but it crashes with > 1024 processes after a running for a minute: > > [pax11-17:16978] *** Process received signal *** > [pax11-17:16978] Signal: Bus error (7) > [pax11-17:16978] Signal code: Non-existant physical address (2) > [pax11-17:16978] Failing at address: 0x2b147b785450 > [pax11-17:16978] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b1473b13370] > [pax11-17:16978] [ 1] > /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_ > vader.so(mca_btl_vader_frag_init+0x8e)[0x2b14794a413e] > [pax11-17:16978] [ 2] > /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(ompi_ > free_list_grow+0x199)[0x2b147384f309] > [pax11-17:16978] [ 3] > /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_ > vader.so(+0x270d)[0x2b14794a270d] > [pax11-17:16978] [ 4] > /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ > ob1.so(mca_pml_ob1_send_request_start_prepare+0x43)[0x2b1479ae3a13] > [pax11-17:16978] [ 5] > /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ > ob1.so(mca_pml_ob1_send+0x89a)[0x2b1479ad90ca] > [pax11-17:16978] [ 6] > /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_coll_ > tuned.so(ompi_coll_tuned_allreduce_intra_ring+0x3f1)[0x2b147ad6ec41] > [pax11-17:16978] [ 7] > /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(MPI_ > Allreduce+0x17b)[0x2b147387d6bb] > [pax11-17:16978] [ 8] IMB-MPI1[0x40b316] > [pax11-17:16978] [ 9] IMB-MPI1[0x407284] > [pax11-17:16978] [10] IMB-MPI1[0x40250e] > [pax11-17:16978] [11] > /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1473d41b35] > [pax11-17:16978] [12] IMB-MPI1[0x401f79] > [pax11-17:16978] *** End of error message *** > -- > mpirun noticed that process rank 552 with PID 0 on 
node pax11-17 > exited on signal 7 (Bus error). > -- > > The program is started from the slurm batch system using mpirun. The > same application is working fine when using mvapich2 instead. > > Regards, Götz Waschk > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Shared Windows and MPI_Accumulate
Hello Joseph, I'm still unable to reproduce this issue on my SLES12 x86_64 node. Are you building with CFLAGS=-O3? If so, could you build without CFLAGS set and see if you still see the failure? Howard 2017-03-02 2:34 GMT-07:00 Joseph Schuchart <schuch...@hlrs.de>: > Hi Howard, > > Thanks for trying to reproduce this. It seems that on master the issue > occurs less frequently but is still there. I used the following bash > one-liner on my laptop and on our Linux Cluster (single node, 4 processes): > > ``` > $ for i in $(seq 1 100) ; do echo $i && mpirun -n 4 > ./mpi_shared_accumulate | grep \! && break ; done > 1 > 2 > [0] baseptr[0]: 1004 (expected 1010) [!!!] > [0] baseptr[1]: 1005 (expected 1011) [!!!] > [0] baseptr[2]: 1006 (expected 1012) [!!!] > [0] baseptr[3]: 1007 (expected 1013) [!!!] > [0] baseptr[4]: 1008 (expected 1014) [!!!] > ``` > > Sometimes the error occurs after one or two iterations (like above), > sometimes only at iteration 20 or later. However, I can reproduce it within > the 100 runs every time I run the statement above. I am attaching the > config.log and output of ompi_info of master on my laptop. Please let me > know if I can help with anything else. > > Thanks, > Joseph > > On 03/01/2017 11:24 PM, Howard Pritchard wrote: > > Hi Joseph, > > I built this test with craypich (Cray MPI) and it passed. I also tried > with Open MPI master and the test passed. I also tried with 2.0.2 > and can't seem to reproduce on my system. > > Could you post the output of config.log? > > Also, how intermittent is the problem? > > > Thanks, > > Howard > > > > > 2017-03-01 8:03 GMT-07:00 Joseph Schuchart <schuch...@hlrs.de>: > >> Hi all, >> >> We are seeing issues in one of our applications, in which processes in a >> shared communicator allocate a shared MPI window and execute MPI_Accumulate >> simultaneously on it to iteratively update each process' values. The test >> boils down to the sample code attached. 
Sample output is as follows: >> >> ``` >> $ mpirun -n 4 ./mpi_shared_accumulate >> [1] baseptr[0]: 1010 (expected 1010) >> [1] baseptr[1]: 1011 (expected 1011) >> [1] baseptr[2]: 1012 (expected 1012) >> [1] baseptr[3]: 1013 (expected 1013) >> [1] baseptr[4]: 1014 (expected 1014) >> [2] baseptr[0]: 1005 (expected 1010) [!!!] >> [2] baseptr[1]: 1006 (expected 1011) [!!!] >> [2] baseptr[2]: 1007 (expected 1012) [!!!] >> [2] baseptr[3]: 1008 (expected 1013) [!!!] >> [2] baseptr[4]: 1009 (expected 1014) [!!!] >> [3] baseptr[0]: 1010 (expected 1010) >> [0] baseptr[0]: 1010 (expected 1010) >> [0] baseptr[1]: 1011 (expected 1011) >> [0] baseptr[2]: 1012 (expected 1012) >> [0] baseptr[3]: 1013 (expected 1013) >> [0] baseptr[4]: 1014 (expected 1014) >> [3] baseptr[1]: 1011 (expected 1011) >> [3] baseptr[2]: 1012 (expected 1012) >> [3] baseptr[3]: 1013 (expected 1013) >> [3] baseptr[4]: 1014 (expected 1014) >> ``` >> >> Each process should hold the same values but sometimes (not on all >> executions) random processes diverge (marked through [!!!]). >> >> I made the following observations: >> >> 1) The issue occurs with both OpenMPI 1.10.6 and 2.0.2 but not with MPICH >> 3.2. >> 2) The issue occurs only if the window is allocated through >> MPI_Win_allocate_shared, using MPI_Win_allocate works fine. >> 3) The code assumes that MPI_Accumulate atomically updates individual >> elements (please correct me if that is not covered by the MPI standard). >> >> Both OpenMPI and the example code were compiled using GCC 5.4.1 and run >> on a Linux system (single node). OpenMPI was configure with >> --enable-mpi-thread-multiple and --with-threads but the application is not >> multi-threaded. Please let me know if you need any other information. >> >> Cheers >> Joseph >> >> -- >> Dipl.-Inf. Joseph Schuchart >> High Performance Computing Center Stuttgart (HLRS) >> Nobelstr. 
19 >> D-70569 Stuttgart >> >> Tel.: +49(0)711-68565890 >> Fax: +49(0)711-6856832 >> E-Mail: schuch...@hlrs.de >> >> >> ___ >> users mailing list >> users@lists.open-mpi.org >> https://rfd.newmexicoconsortium.org/mailman/listinfo/users >> > > > > ___ > users mailing > listus...@lists.open-mpi.orghttps://rfd.newmexicoconsortium.org/mailman/listinfo/users > > > -- > Dipl.-Inf. Joseph Schuchart > High Performance Computing Center Stuttgart (HLRS) > Nobelstr. 19 > D-70569 Stuttgart > > Tel.: +49(0)711-68565890 <+49%20711%2068565890> > Fax: +49(0)711-6856832 <+49%20711%206856832> > E-Mail: schuch...@hlrs.de > > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] sharedfp/lockedfile collision between multiple program instances
Hi Edgar Please open an issue too so we can track the fix. Howard Edgar Gabriel <egabr...@central.uh.edu> schrieb am Fr. 3. März 2017 um 07:45: > Nicolas, > > thank you for the bug report, I can confirm the behavior. I will work on > a patch and will try to get that into the next release, should hopefully > not be too complicated. > > Thanks > > Edgar > > > On 3/3/2017 7:36 AM, Nicolas Joly wrote: > > Hi, > > > > We just got hit by a problem with sharedfp/lockedfile component under > > v2.0.1 (should be identical with v2.0.2). We had 2 instances of an MPI > > program running conccurrently on the same input file and using > > MPI_File_read_shared() function ... > > > > If the shared file pointer is maintained with the lockedfile > > component, a "XXX.lockedfile" is created near to the data > > file. Unfortunately, this fixed name will collide with multiple tools > > instances ;) > > > > Running 2 instances of the following command line (source code > > attached) on the same machine will show the problematic behaviour. > > > > mpirun -n 1 --mca sharedfp lockedfile ./shrread -v input.dat > > > > Confirmed with lsof(8) output : > > > > njoly@tars [~]> lsof input.dat.lockedfile > > COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME > > shrread 5876 njoly 21w REG 0,308 13510798885996031 > input.dat.lockedfile > > shrread 5884 njoly 21w REG 0,308 13510798885996031 > input.dat.lockedfile > > > > Thanks in advance. > > > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
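The collision Nicolas describes is easy to demonstrate outside MPI. The sketch below uses the util-linux flock(1) utility rather than Open MPI's actual sharedfp/lockedfile code, and the file names are illustrative: a lock file whose name is derived only from the data file is contested by a second tool instance, while a name that also encodes the instance is not.

```shell
workdir=$(mktemp -d)

# Fixed lock-file name derived only from the data file -- this is what
# makes two independent tool instances collide:
lock="$workdir/input.dat.lockedfile"

flock -x "$lock" sleep 2 &      # "instance 1" acquires and holds the lock
holder=$!
sleep 0.5                       # give instance 1 time to acquire it

# "Instance 2": a non-blocking attempt on the same fixed name fails.
if flock -xn "$lock" true; then second=acquired; else second=collision; fi

# A name that also encodes the instance (here: the PID) cannot collide.
unique="$workdir/input.dat.$$.lockedfile"
if flock -xn "$unique" true; then third=acquired; else third=collision; fi

echo "second=$second third=$third"   # second=collision third=acquired
wait "$holder"
rm -rf "$workdir"
```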
Re: [OMPI users] Shared Windows and MPI_Accumulate
Hi Joseph, I built this test with craypich (Cray MPI) and it passed. I also tried with Open MPI master and the test passed. I also tried with 2.0.2 and can't seem to reproduce on my system. Could you post the output of config.log? Also, how intermittent is the problem? Thanks, Howard 2017-03-01 8:03 GMT-07:00 Joseph Schuchart <schuch...@hlrs.de>: > Hi all, > > We are seeing issues in one of our applications, in which processes in a > shared communicator allocate a shared MPI window and execute MPI_Accumulate > simultaneously on it to iteratively update each process' values. The test > boils down to the sample code attached. Sample output is as follows: > > ``` > $ mpirun -n 4 ./mpi_shared_accumulate > [1] baseptr[0]: 1010 (expected 1010) > [1] baseptr[1]: 1011 (expected 1011) > [1] baseptr[2]: 1012 (expected 1012) > [1] baseptr[3]: 1013 (expected 1013) > [1] baseptr[4]: 1014 (expected 1014) > [2] baseptr[0]: 1005 (expected 1010) [!!!] > [2] baseptr[1]: 1006 (expected 1011) [!!!] > [2] baseptr[2]: 1007 (expected 1012) [!!!] > [2] baseptr[3]: 1008 (expected 1013) [!!!] > [2] baseptr[4]: 1009 (expected 1014) [!!!] > [3] baseptr[0]: 1010 (expected 1010) > [0] baseptr[0]: 1010 (expected 1010) > [0] baseptr[1]: 1011 (expected 1011) > [0] baseptr[2]: 1012 (expected 1012) > [0] baseptr[3]: 1013 (expected 1013) > [0] baseptr[4]: 1014 (expected 1014) > [3] baseptr[1]: 1011 (expected 1011) > [3] baseptr[2]: 1012 (expected 1012) > [3] baseptr[3]: 1013 (expected 1013) > [3] baseptr[4]: 1014 (expected 1014) > ``` > > Each process should hold the same values but sometimes (not on all > executions) random processes diverge (marked through [!!!]). > > I made the following observations: > > 1) The issue occurs with both OpenMPI 1.10.6 and 2.0.2 but not with MPICH > 3.2. > 2) The issue occurs only if the window is allocated through > MPI_Win_allocate_shared, using MPI_Win_allocate works fine. 
> 3) The code assumes that MPI_Accumulate atomically updates individual > elements (please correct me if that is not covered by the MPI standard). > > Both OpenMPI and the example code were compiled using GCC 5.4.1 and run on > a Linux system (single node). OpenMPI was configure with > --enable-mpi-thread-multiple and --with-threads but the application is not > multi-threaded. Please let me know if you need any other information. > > Cheers > Joseph > > -- > Dipl.-Inf. Joseph Schuchart > High Performance Computing Center Stuttgart (HLRS) > Nobelstr. 19 > D-70569 Stuttgart > > Tel.: +49(0)711-68565890 > Fax: +49(0)711-6856832 > E-Mail: schuch...@hlrs.de > > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Issues with different IB adapters and openmpi 2.0.2
Hi Orion, Does the problem occur if you only use font2 and font3? Do you have MXM installed on the font1 node? The 2.x series uses PMIx, and it could be that this is impacting the PML sanity check. Howard Orion Poplawski <or...@cora.nwra.com> schrieb am Mo. 27. Feb. 2017 um 14:50: > We have a couple nodes with different IB adapters in them: > > font1/var/log/lspci:03:00.0 InfiniBand [0c06]: Mellanox Technologies > MT25204 > [InfiniHost III Lx HCA] [15b3:6274] (rev 20) > font2/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220 > InfiniBand > HCA [1077:7220] (rev 02) > font3/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220 > InfiniBand > HCA [1077:7220] (rev 02) > > With 1.10.3 we saw the following errors with mpirun: > > [font2.cora.nwra.com:13982] [[23220,1],10] selected pml cm, but peer > [[23220,1],0] on font1 selected pml ob1 > > which crashed MPI_Init. > > We worked around this by passing "--mca pml ob1". I notice now with > openmpi > 2.0.2 without that option I no longer see errors, but the mpi program will > hang shortly after startup. Re-adding the option makes it work, so I'm > assuming the underlying problem is still the same, but openmpi appears to > have > stopped alerting me to the issue. > > Thoughts? > > -- > Orion Poplawski > Technical Manager 720-772-5637 > NWRA, Boulder/CoRA Office FAX: 303-415-9702 > 3380 Mitchell Lane or...@nwra.com > Boulder, CO 80301 http://www.nwra.com > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] MPI_THREAD_MULTIPLE: Fatal error on MPI_Win_create
Hi Joseph, What OS are you using when running the test? Could you try running with export OMPI_MCA_osc=^pt2pt and export OMPI_MCA_osc_base_verbose=10 This error message was put into this OMPI release because this part of the code has known problems when used multi-threaded. Joseph Schuchart schrieb am Sa. 18. Feb. 2017 um 04:02: > All, > > I am seeing a fatal error with OpenMPI 2.0.2 if requesting support for > MPI_THREAD_MULTIPLE and afterwards creating a window using > MPI_Win_create. I am attaching a small reproducer. The output I get is > the following: > > ``` > MPI_THREAD_MULTIPLE supported: yes > MPI_THREAD_MULTIPLE supported: yes > MPI_THREAD_MULTIPLE supported: yes > MPI_THREAD_MULTIPLE supported: yes > -- > The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this > release. > Workarounds are to run on a single node, or to use a system with an RDMA > capable network such as Infiniband. > -- > [beryl:10705] *** An error occurred in MPI_Win_create > [beryl:10705] *** reported by process [2149974017,2] > [beryl:10705] *** on communicator MPI_COMM_WORLD > [beryl:10705] *** MPI_ERR_WIN: invalid window > [beryl:10705] *** MPI_ERRORS_ARE_FATAL (processes in this communicator > will now abort, > [beryl:10705] ***and potentially your MPI job) > [beryl:10698] 3 more processes have sent help message help-osc-pt2pt.txt > / mpi-thread-multiple-not-supported > [beryl:10698] Set MCA parameter "orte_base_help_aggregate" to 0 to see > all help / error messages > [beryl:10698] 3 more processes have sent help message > help-mpi-errors.txt / mpi_errors_are_fatal > ``` > > I am running on a single node (my laptop). Both OpenMPI and the > application were compiled using GCC 5.3.0. Naturally, there is no > support for Infiniband available. Should I signal OpenMPI that I am > indeed running on a single node? If so, how can I do that? Can't this be > detected by OpenMPI automatically? The test succeeds if I only request > MPI_THREAD_SINGLE. 
> > OpenMPI 2.0.2 has been configured using only > --enable-mpi-thread-multiple and --prefix configure parameters. I am > attaching the output of ompi_info. > > Please let me know if you need any additional information. > > Cheers, > Joseph > > -- > Dipl.-Inf. Joseph Schuchart > High Performance Computing Center Stuttgart (HLRS) > Nobelstr. 19 > D-70569 Stuttgart > > Tel.: +49(0)711-68565890 > Fax: +49(0)711-6856832 > E-Mail: schuch...@hlrs.de > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Problem with MPI_Comm_spawn using openmpi 2.0.x + sbatch
Hi Anastasia, Definitely check which mpirun is being used in the batch environment, but you may also want to upgrade to Open MPI 2.0.2. Howard r...@open-mpi.org <r...@open-mpi.org> schrieb am Mi. 15. Feb. 2017 um 07:49: > Nothing immediate comes to mind - all sbatch does is create an allocation > and then run your script in it. Perhaps your script is using a different > “mpirun” command than when you type it interactively? > > On Feb 14, 2017, at 5:11 AM, Anastasia Kruchinina < > nastja.kruchin...@gmail.com> wrote: > > Hi, > > I am trying to use MPI_Comm_spawn function in my code. I am having trouble > with openmpi 2.0.x + sbatch (batch system Slurm). > My test program is located here: > http://user.it.uu.se/~anakr367/files/MPI_test/ > > When I am running my code I am getting an error: > > OPAL ERROR: Timeout in file > ../../../../openmpi-2.0.1/opal/mca/pmix/base/pmix_base_fns.c at line 193 > *** An error occurred in MPI_Init_thread > *** on a NULL communicator > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, > ***and potentially your MPI job) > -- > It looks like MPI_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can > fail during MPI_INIT; some of which are due to configuration or > environment > problems. This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open MPI > developer): > >ompi_dpm_dyn_init() failed >--> Returned "Timeout" (-15) instead of "Success" (0) > -- > > The interesting thing is that there is no error when I am firstly > allocating nodes with salloc and then run my program. So, I noticed that > the program works fine using openmpi 1.x+sbach/salloc or openmpi > 2.0.x+salloc but not openmpi 2.0.x+sbatch. > > The error was reproduced on three different computer clusters. 
> > Best regards, > Anastasia > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
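To act on the suggestion above, it helps to have the batch script report which mpirun it resolves before launching anything. A sketch of such a script follows; the SBATCH directives and the program name are placeholders, not taken from Anastasia's actual job:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

# Verify the batch environment picks up the intended Open MPI 2.0.x
# install; a PATH set only in the interactive shell can make sbatch
# resolve a different mpirun than the one used when typing commands.
type mpirun
mpirun --version

mpirun -np 1 ./spawn_test    # hypothetical program calling MPI_Comm_spawn
```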
Re: [OMPI users] OpenMPI not running any job on Mac OS X 10.12
Hi Michel, Could you try running the app with export TMPDIR=/tmp set in the shell you are using? Howard 2017-02-02 13:46 GMT-07:00 Michel Lesoinne <mlesoi...@cmsoftinc.com>: Howard, First, thanks to you and Jeff for looking into this with me. I tried ../configure --disable-shared --enable-static --prefix ~/.local The result is the same as without --disable-shared. i.e. I get the following error: [Michels-MacBook-Pro.local:92780] [[46617,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../orte/orted/pmix/pmix_server.c at line 262 [Michels-MacBook-Pro.local:92780] [[46617,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line 666 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): pmix server init failed --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS -- On Thu, Feb 2, 2017 at 12:29 PM, Howard Pritchard <hpprit...@gmail.com> wrote: Hi Michel Try adding --enable-static to the configure. That fixed the problem for me. Howard Michel Lesoinne <mlesoi...@cmsoftinc.com> schrieb am Mi. 1. Feb. 2017 um 19:07: I have compiled OpenMPI 2.0.2 on a new Macbook running OS X 10.12 and have been trying to run simple program. I configured openmpi with ../configure --disable-shared --prefix ~/.local make all install Then I have a simple code only containing a call to MPI_Init. 
I compile it with mpirun -np 2 ./mpitest The output is: [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: unable to open mca_patcher_overwrite: File not found (ignored) [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: unable to open mca_shmem_mmap: File not found (ignored) [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: unable to open mca_shmem_posix: File not found (ignored) [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: unable to open mca_shmem_sysv: File not found (ignored) -- It looks like opal_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during opal_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): opal_shmem_base_select failed --> Returned value -1 instead of OPAL_SUCCESS -- Without the --disable-shared in the configuration, then I get: [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../orte/orted/pmix/pmix_server.c at line 264 [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line 666 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): pmix server init failed --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS -- Has anyone seen this? What am I missing? 
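Howard's TMPDIR workaround from the top of this thread, as a minimal shell sketch (the /tmp path and the relaunch command are assumptions based on the report above, not a verified fix):

```shell
# macOS often sets $TMPDIR to a long per-user path under /var/folders;
# Open MPI's session-directory setup can trip over such paths, so point
# it at a short, writable location before launching.
export TMPDIR=/tmp
# then relaunch the failing job as before, e.g.:
#   mpirun -np 2 ./mpitest
```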
Re: [OMPI users] Open MPI over RoCE using breakout cable and switch
Hello Brendan, Sorry for the delay in responding. I've been on travel the past two weeks. I traced through the debug output you sent. It provided enough information to show that for some reason, when using the breakout cable, Open MPI is unable to complete the initialization it needs to use the openib BTL. It correctly detects that the first port is not available, but for port 1, it still fails to initialize. To debug this further, I'd need to provide you with a custom Open MPI to try that would have more debug output in the suspect area. If you'd like to go this route, let me know and I'll build a one-off library to try to debug this problem. One thing to do just as a sanity check is to try TCP: mpirun --mca btl tcp,self,sm with the breakout cable. If that doesn't work, then I think there may be some network setup problem that needs to be resolved first before trying custom Open MPI tarballs. Thanks, Howard 2017-02-01 15:08 GMT-07:00 Brendan Myers <brendan.my...@soft-forge.com>: > Hello Howard, > > I was wondering if you have been able to look at this issue at all, or if > anyone has any ideas on what to try next. > > > > Thank you, > > Brendan > > > > *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Brendan > Myers > *Sent:* Tuesday, January 24, 2017 11:11 AM > > *To:* 'Open MPI Users' <users@lists.open-mpi.org> > *Subject:* Re: [OMPI users] Open MPI over RoCE using breakout cable and > switch > > > > Hello Howard, > > Here is the error output after building with debug enabled. These CX4 > Mellanox cards view each port as a separate device and I am using port 1 on > the card which is device mlx5_0. 
> > > > Thank you, > > Brendan > > > > *From:* users [mailto:users-boun...@lists.open-mpi.org > <users-boun...@lists.open-mpi.org>] *On Behalf Of *Howard Pritchard > *Sent:* Tuesday, January 24, 2017 8:21 AM > *To:* Open MPI Users <users@lists.open-mpi.org> > *Subject:* Re: [OMPI users] Open MPI over RoCE using breakout cable and > switch > > > > Hello Brendan, > > > > This helps some, but looks like we need more debug output. > > > > Could you build a debug version of Open MPI by adding --enable-debug > > to the config options and rerun the test with the breakout cable setup > > and keeping the --mca btl_base_verbose 100 command line option? > > > > Thanks > > > > Howard > > > > > > 2017-01-23 8:23 GMT-07:00 Brendan Myers <brendan.my...@soft-forge.com>: > > Hello Howard, > > Thank you for looking into this. Attached is the output you requested. > Also, I am using Open MPI 2.0.1. > > > > Thank you, > > Brendan > > > > *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Howard > Pritchard > *Sent:* Friday, January 20, 2017 6:35 PM > *To:* Open MPI Users <users@lists.open-mpi.org> > *Subject:* Re: [OMPI users] Open MPI over RoCE using breakout cable and > switch > > > > Hi Brendan > > > > I doubt this kind of config has gotten any testing with OMPI. Could you > rerun with > > > > --mca btl_base_verbose 100 > > > > added to the command line and post the output to the list? > > > > Howard > > > > > > Brendan Myers <brendan.my...@soft-forge.com> schrieb am Fr. 20. Jan. 
2017 > um 15:04: > > Hello, > > I am attempting to get Open MPI to run over 2 nodes using a switch and a > single breakout cable with this design: > > (100GbE)QSFP <-> 2x (50GbE)QSFP > > > > Hardware Layout: > > Breakout cable module A connects to switch (100GbE) > > Breakout cable module B1 connects to node 1 RoCE NIC (50GbE) > > Breakout cable module B2 connects to node 2 RoCE NIC (50GbE) > > Switch is Mellanox SN 2700 100GbE RoCE switch > > > > · I am able to pass RDMA traffic between the nodes with perftest > (ib_write_bw) when using the breakout cable as the IC from both nodes to > the switch. > > · When attempting to run a job using the breakout cable as the IC > Open MPI aborts with failure to initialize open fabrics device errors. > > · If I replace the breakout cable with 2 standard QSFP cables the > Open MPI job will complete correctly. > > > > > > This is the command I use, it works unless I attempt a run with the > breakout cable used as IC: > > *mpirun --mca btl openib,self,sm --mca btl_openib_receive_queues > P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm -hostfile > mpi-hosts-ce /usr/local/bin/IMB-MPI1* > > > > If anyone has any idea as to why using a breakout cable is causing my jobs > to fail please let me
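The TCP sanity check Howard suggests, written out in full (the hostfile and benchmark paths are taken from Brendan's original command; this is a sketch of the suggested invocation, to be run on the cluster itself):

```shell
# Run the same benchmark over TCP only, bypassing the openib BTL,
# to separate Open MPI initialization problems from network setup problems.
mpirun --mca btl tcp,self,sm -hostfile mpi-hosts-ce /usr/local/bin/IMB-MPI1
```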
Re: [OMPI users] OpenMPI not running any job on Mac OS X 10.12
Hi Michel Try adding --enable-static to the configure. That fixed the problem for me. Howard Michel Lesoinne <mlesoi...@cmsoftinc.com> schrieb am Mi. 1. Feb. 2017 um 19:07: > I have compiled OpenMPI 2.0.2 on a new Macbook running OS X 10.12 and have > been trying to run simple program. > I configured openmpi with > ../configure --disable-shared --prefix ~/.local > make all install > > Then I have a simple code only containing a call to MPI_Init. > I compile it with > mpirun -np 2 ./mpitest > > The output is: > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_patcher_overwrite: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_mmap: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_posix: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_sysv: File not found (ignored) > > -- > > It looks like opal_init failed for some reason; your parallel process is > > likely to abort. There are many reasons that a parallel process can > > fail during opal_init; some of which are due to configuration or > > environment problems. 
This failure appears to be an internal failure; > > here's some additional information (which may only be relevant to an > > Open MPI developer): > > > opal_shmem_base_select failed > > --> Returned value -1 instead of OPAL_SUCCESS > > -- > > Without the --disable-shared in the configuration, then I get: > > > [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad > parameter in file ../../orte/orted/pmix/pmix_server.c at line 264 > > [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad > parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line > 666 > > -- > > It looks like orte_init failed for some reason; your parallel process is > > likely to abort. There are many reasons that a parallel process can > > fail during orte_init; some of which are due to configuration or > > environment problems. This failure appears to be an internal failure; > > here's some additional information (which may only be relevant to an > > Open MPI developer): > > > pmix server init failed > > --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS > > -- > > > > > Has anyone seen this? What am I missing? > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] OpenMPI not running any job on Mac OS X 10.12
Hi Michel, I reproduced this problem on my Mac too: pn1249323:~/ompi/examples (v2.0.x *)$ mpirun -np 2 ./ring_c [pn1249323.lanl.gov:94283] mca_base_component_repository_open: unable to open mca_patcher_overwrite: File not found (ignored) [pn1249323.lanl.gov:94283] mca_base_component_repository_open: unable to open mca_shmem_mmap: File not found (ignored) [pn1249323.lanl.gov:94283] mca_base_component_repository_open: unable to open mca_shmem_posix: File not found (ignored) [pn1249323.lanl.gov:94283] mca_base_component_repository_open: unable to open mca_shmem_sysv: File not found (ignored) -- It looks like opal_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during opal_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): opal_shmem_base_select failed --> Returned value -1 instead of OPAL_SUCCESS Is there a reason why you are using the --disable-shared option? Can you use --disable-dlopen instead? I'll do some more investigating and open an issue. Howard 2017-02-01 19:05 GMT-07:00 Michel Lesoinne <mlesoi...@cmsoftinc.com>: > I have compiled OpenMPI 2.0.2 on a new Macbook running OS X 10.12 and have > been trying to run simple program. > I configured openmpi with > ../configure --disable-shared --prefix ~/.local > make all install > > Then I have a simple code only containing a call to MPI_Init. 
> I compile it with > mpirun -np 2 ./mpitest > > The output is: > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_patcher_overwrite: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_mmap: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_posix: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_sysv: File not found (ignored) > > -- > > It looks like opal_init failed for some reason; your parallel process is > > likely to abort. There are many reasons that a parallel process can > > fail during opal_init; some of which are due to configuration or > > environment problems. This failure appears to be an internal failure; > > here's some additional information (which may only be relevant to an > > Open MPI developer): > > > opal_shmem_base_select failed > > --> Returned value -1 instead of OPAL_SUCCESS > > -- > > Without the --disable-shared in the configuration, then I get: > > > [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad > parameter in file ../../orte/orted/pmix/pmix_server.c at line 264 > > [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad > parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at > line 666 > > -- > > It looks like orte_init failed for some reason; your parallel process is > > likely to abort. There are many reasons that a parallel process can > > fail during orte_init; some of which are due to configuration or > > environment problems. This failure appears to be an internal failure; > > here's some additional information (which may only be relevant to an > > Open MPI developer): > > > pmix server init failed > > --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS > > -- > > > > > Has anyone seen this? 
What am I missing? > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
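Howard's --disable-dlopen suggestion, sketched as a configure recipe (the prefix mirrors the one in Michel's report; treat this as an untested outline, not a verified build):

```shell
# Build Open MPI without dlopen support: MCA components are linked
# directly into the libraries instead of being opened at runtime,
# which sidesteps the mca_base_component_repository_open failures above.
../configure --disable-dlopen --prefix=$HOME/.local
make all install
```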
Re: [OMPI users] OpenMPI not running any job on Mac OS X 10.12
Hi Michel It's somewhat unusual to use the disable-shared configure option. That may be causing this. Could you try to build without using this option and see if you still see the problem? Thanks, Howard Michel Lesoinne <mlesoi...@cmsoftinc.com> schrieb am Mi. 1. Feb. 2017 um 21:07: > I have compiled OpenMPI 2.0.2 on a new Macbook running OS X 10.12 and have > been trying to run simple program. > I configured openmpi with > ../configure --disable-shared --prefix ~/.local > make all install > > Then I have a simple code only containing a call to MPI_Init. > I compile it with > mpirun -np 2 ./mpitest > > The output is: > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_patcher_overwrite: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_mmap: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_posix: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_sysv: File not found (ignored) > > -- > > It looks like opal_init failed for some reason; your parallel process is > > likely to abort. There are many reasons that a parallel process can > > fail during opal_init; some of which are due to configuration or > > environment problems. 
This failure appears to be an internal failure; > > here's some additional information (which may only be relevant to an > > Open MPI developer): > > > opal_shmem_base_select failed > > --> Returned value -1 instead of OPAL_SUCCESS > > -- > > Without the --disable-shared in the configuration, then I get: > > > [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad > parameter in file ../../orte/orted/pmix/pmix_server.c at line 264 > > [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad > parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line > 666 > > -- > > It looks like orte_init failed for some reason; your parallel process is > > likely to abort. There are many reasons that a parallel process can > > fail during orte_init; some of which are due to configuration or > > environment problems. This failure appears to be an internal failure; > > here's some additional information (which may only be relevant to an > > Open MPI developer): > > > pmix server init failed > > --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS > > -- > > > > > Has anyone seen this? What am I missing? > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Error using hpcc benchmark
Hi Wodel, The RandomAccess part of HPCC is probably causing this. Perhaps set the PSM env. variable PSM_MQ_RECVREQS_MAX to a larger value. Alternatively, launch the job using mpirun --mca pml ob1 to avoid use of PSM. Performance will probably suffer with this option, however. Howard wodel youchi <wodel.you...@gmail.com> schrieb am Di. 31. Jan. 2017 um 08:27: > Hi, > > I am a newbie in the HPC world > > I am trying to execute the hpcc benchmark on our cluster, but every time I > start the job, I get this error, then the job exits > > > > > > > > > > > > > > > *compute017.22840Exhausted 1048576 MQ irecv request descriptors, which > usually indicates a user program error or insufficient request descriptors > (PSM_MQ_RECVREQS_MAX=1048576)compute024.22840Exhausted 1048576 MQ irecv > request descriptors, which usually indicates a user program error or > insufficient request descriptors > (PSM_MQ_RECVREQS_MAX=1048576)compute019.22847Exhausted 1048576 MQ irecv > request descriptors, which usually indicates a user program error or > insufficient request descriptors > (PSM_MQ_RECVREQS_MAX=1048576)---Primary > job terminated normally, but 1 process returneda non-zero exit code.. Per > user-direction, the job has been > aborted.-mpirun > detected that one or more processes exited with non-zero status, thus > causingthe job to be terminated. The first process to do so was: Process > name: [[19601,1],272] Exit code: > 255--* > > Platform : IBM PHPC > OS : RHEL 6.5 > one management node > 32 compute node : 16 cores, 32GB RAM, intel qlogic QLE7340 one port QRD > infiniband 40Gb/s > > I compiled hpcc against : IBM MPI, Openmpi 2.0.1 (compiled with gcc 4.4.7) > and Openmpi 1.8.1 (compiled with gcc 4.4.7) > > I get the errors, but each time on different compute nodes. 
> > This is the command I used to start the job > > *mpirun -np 512 --mca mtl psm --hostfile hosts32 > /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt* > > Any help will be appreciated, and if you need more details, let me know. > Thanks in advance. > > > Regards. > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
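The two workarounds above, sketched as shell settings (the descriptor limit below is an illustrative guess above the 1048576 default shown in the error message, not a tuned value):

```shell
# Option 1: raise the PSM matched-queue receive-request limit before
# launching the job with the original mpirun command.
export PSM_MQ_RECVREQS_MAX=4194304

# Option 2 (likely slower): avoid PSM entirely by forcing the ob1 PML, e.g.:
#   mpirun -np 512 --mca pml ob1 --hostfile hosts32 ./hpcc hpccinf.txt
```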
Re: [OMPI users] Open MPI over RoCE using breakout cable and switch
Hello Brendan, This helps some, but looks like we need more debug output. Could you build a debug version of Open MPI by adding --enable-debug to the config options and rerun the test with the breakout cable setup and keeping the --mca btl_base_verbose 100 command line option? Thanks, Howard 2017-01-23 8:23 GMT-07:00 Brendan Myers <brendan.my...@soft-forge.com>: > Hello Howard, > > Thank you for looking into this. Attached is the output you requested. > Also, I am using Open MPI 2.0.1. > > > > Thank you, > > Brendan > > > > *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Howard > Pritchard > *Sent:* Friday, January 20, 2017 6:35 PM > *To:* Open MPI Users <users@lists.open-mpi.org> > *Subject:* Re: [OMPI users] Open MPI over RoCE using breakout cable and > switch > > > > Hi Brendan > > > > I doubt this kind of config has gotten any testing with OMPI. Could you > rerun with > > > > --mca btl_base_verbose 100 > > > > added to the command line and post the output to the list? > > > > Howard > > > > > > Brendan Myers <brendan.my...@soft-forge.com> schrieb am Fr. 20. Jan. 2017 > um 15:04: > > Hello, > > I am attempting to get Open MPI to run over 2 nodes using a switch and a > single breakout cable with this design: > > (100GbE)QSFP <-> 2x (50GbE)QSFP > > > > Hardware Layout: > > Breakout cable module A connects to switch (100GbE) > > Breakout cable module B1 connects to node 1 RoCE NIC (50GbE) > > Breakout cable module B2 connects to node 2 RoCE NIC (50GbE) > > Switch is Mellanox SN 2700 100GbE RoCE switch > > > > · I am able to pass RDMA traffic between the nodes with perftest > (ib_write_bw) when using the breakout cable as the IC from both nodes to > the switch. > > · When attempting to run a job using the breakout cable as the IC > Open MPI aborts with failure to initialize open fabrics device errors. > > · If I replace the breakout cable with 2 standard QSFP cables the > Open MPI job will complete correctly. 
> > > > > > This is the command I use, it works unless I attempt a run with the > breakout cable used as IC: > > *mpirun --mca btl openib,self,sm --mca btl_openib_receive_queues > P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm -hostfile > mpi-hosts-ce /usr/local/bin/IMB-MPI1* > > > > If anyone has any idea as to why using a breakout cable is causing my jobs > to fail please let me know. > > > > Thank you, > > > > Brendan T. W. Myers > > brendan.my...@soft-forge.com > > Software Forge Inc > > > > ___ > > users mailing list > > users@lists.open-mpi.org > > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Open MPI over RoCE using breakout cable and switch
Hi Brendan I doubt this kind of config has gotten any testing with OMPI. Could you rerun with --mca btl_base_verbose 100 added to the command line and post the output to the list? Howard Brendan Myers <brendan.my...@soft-forge.com> schrieb am Fr. 20. Jan. 2017 um 15:04: > Hello, > > I am attempting to get Open MPI to run over 2 nodes using a switch and a > single breakout cable with this design: > > (100GbE)QSFP <-> 2x (50GbE)QSFP > > > > Hardware Layout: > > Breakout cable module A connects to switch (100GbE) > > Breakout cable module B1 connects to node 1 RoCE NIC (50GbE) > > Breakout cable module B2 connects to node 2 RoCE NIC (50GbE) > > Switch is Mellanox SN 2700 100GbE RoCE switch > > > > · I am able to pass RDMA traffic between the nodes with perftest > (ib_write_bw) when using the breakout cable as the IC from both nodes to > the switch. > > · When attempting to run a job using the breakout cable as the IC > Open MPI aborts with failure to initialize open fabrics device errors. > > · If I replace the breakout cable with 2 standard QSFP cables the > Open MPI job will complete correctly. > > > > > > This is the command I use, it works unless I attempt a run with the > breakout cable used as IC: > > *mpirun --mca btl openib,self,sm --mca btl_openib_receive_queues > P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm -hostfile > mpi-hosts-ce /usr/local/bin/IMB-MPI1* > > > > If anyone has any idea as to why using a breakout cable is causing my jobs > to fail please let me know. > > > > Thank you, > > > > Brendan T. W. Myers > > brendan.my...@soft-forge.com > > Software Forge Inc > > > ___ > > users mailing list > > users@lists.open-mpi.org > > https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux
HI Siegmar, You have some config parameters I wasn't trying that may have some impact. I'll give a try with these parameters. This should be enough info for now, Thanks, Howard 2017-01-09 0:59 GMT-07:00 Siegmar Gross < siegmar.gr...@informatik.hs-fulda.de>: > Hi Howard, > > I use the following commands to build and install the package. > ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my > Linux machine. > > mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc > cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc > > ../openmpi-2.0.2rc3/configure \ > --prefix=/usr/local/openmpi-2.0.2_64_cc \ > --libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \ > --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \ > --with-jdk-headers=/usr/local/jdk1.8.0_66/include \ > JAVA_HOME=/usr/local/jdk1.8.0_66 \ > LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \ > CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \ > CPP="cpp" CXXCPP="cpp" \ > --enable-mpi-cxx \ > --enable-mpi-cxx-bindings \ > --enable-cxx-exceptions \ > --enable-mpi-java \ > --enable-heterogeneous \ > --enable-mpi-thread-multiple \ > --with-hwloc=internal \ > --without-verbs \ > --with-wrapper-cflags="-m64 -mt" \ > --with-wrapper-cxxflags="-m64" \ > --with-wrapper-fcflags="-m64" \ > --with-wrapper-ldflags="-mt" \ > --enable-debug \ > |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc > > make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc > rm -r /usr/local/openmpi-2.0.2_64_cc.old > mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old > make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc > make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc > > > I get a different error if I run the program with gdb. > > loki spawn 118 gdb /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec > GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1 > Copyright (C) 2016 Free Software Foundation, Inc. 
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.h > tml> > This is free software: you are free to change and redistribute it. > There is NO WARRANTY, to the extent permitted by law. Type "show copying" > and "show warranty" for details. > This GDB was configured as "x86_64-suse-linux". > Type "show configuration" for configuration details. > For bug reporting instructions, please see: > <http://bugs.opensuse.org/>. > Find the GDB manual and other documentation resources online at: > <http://www.gnu.org/software/gdb/documentation/>. > For help, type "help". > Type "apropos word" to search for commands related to "word"... > Reading symbols from /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done. > (gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master > Starting program: /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1 --host > loki --slot-list 0:0-5,1:0-5 spawn_master > Missing separate debuginfos, use: zypper install > glibc-debuginfo-2.24-2.3.x86_64 > [Thread debugging using libthread_db enabled] > Using host libthread_db library "/lib64/libthread_db.so.1". > [New Thread 0x73b97700 (LWP 13582)] > [New Thread 0x718a4700 (LWP 13583)] > [New Thread 0x710a3700 (LWP 13584)] > [New Thread 0x7fffebbba700 (LWP 13585)] > Detaching after fork from child process 13586. > > Parent process 0 running on loki > I create 4 slave processes > > Detaching after fork from child process 13589. > Detaching after fork from child process 13590. > Detaching after fork from child process 13591. 
> [loki:13586] OPAL ERROR: Timeout in file ../../../../openmpi-2.0.2rc3/o > pal/mca/pmix/base/pmix_base_fns.c at line 193 > [loki:13586] *** An error occurred in MPI_Comm_spawn > [loki:13586] *** reported by process [2873294849,0] > [loki:13586] *** on communicator MPI_COMM_WORLD > [loki:13586] *** MPI_ERR_UNKNOWN: unknown error > [loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will > now abort, > [loki:13586] ***and potentially your MPI job) > [Thread 0x7fffebbba700 (LWP 13585) exited] > [Thread 0x710a3700 (LWP 13584) exited] > [Thread 0x718a4700 (LWP 13583) exited] > [Thread 0x73b97700 (LWP 13582) exited] > [Inferior 1 (process 13567) exited with code 016] > Missing separate debuginfos, use: zypper install > libpciaccess0-debuginfo-0.13.2-5.1.x86_64 libudev1-debuginfo-210-116.3.3 > .x86_64 > (gdb) bt > No stack. > (gdb) > > Do you need anything else? > > &g
Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux
Hi Siegmar, Could you post the configure options you used when building 2.0.2rc3? Maybe that will help in trying to reproduce the segfault you are observing. Howard 2017-01-07 2:30 GMT-07:00 Siegmar Gross < siegmar.gr...@informatik.hs-fulda.de>: > Hi, > > I have installed openmpi-2.0.2rc3 on my "SUSE Linux Enterprise > Server 12 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Unfortunately, > I still get the same error that I reported for rc2. > > I would be grateful, if somebody can fix the problem before > releasing the final version. Thank you very much for any help > in advance. > > > Kind regards > > Siegmar > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] segmentation fault with openmpi-2.0.2rc2 on Linux
Hi Siegmar,

Could you please rerun the spawn_slave program with 4 processes? Your original traceback indicates a failure in the barrier in the slave program. I'm interested in seeing whether the barrier failure is also observed when you run the slave program standalone with 4 processes.

Thanks,

Howard

2017-01-03 0:32 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
> Hi Howard,
>
> thank you very much for trying to solve my problem. I haven't
> changed the programs since 2013, so you are using the correct
> version. The program works as expected with the master trunk, as
> you can see at the bottom of this email from my last mail. The
> slave program works when I launch it directly.
>
> loki spawn 122 mpicc --showme
> cc -I/usr/local/openmpi-2.0.2_64_cc/include -m64 -mt -mt -Wl,-rpath -Wl,/usr/local/openmpi-2.0.2_64_cc/lib64 -Wl,--enable-new-dtags -L/usr/local/openmpi-2.0.2_64_cc/lib64 -lmpi
> loki spawn 123 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
>   Open MPI: 2.0.2rc2
>   C compiler absolute: /opt/solstudio12.5b/bin/cc
> loki spawn 124 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 --mca btl_base_verbose 10 spawn_slave
> [loki:05572] mca: base: components_register: registering framework btl components
> [loki:05572] mca: base: components_register: found loaded component self
> [loki:05572] mca: base: components_register: component self register function successful
> [loki:05572] mca: base: components_register: found loaded component sm
> [loki:05572] mca: base: components_register: component sm register function successful
> [loki:05572] mca: base: components_register: found loaded component tcp
> [loki:05572] mca: base: components_register: component tcp register function successful
> [loki:05572] mca: base: components_register: found loaded component vader
> [loki:05572] mca: base: components_register: component vader register function successful
> [loki:05572] mca: base: components_open: opening btl components
> [loki:05572] mca: base: components_open: found loaded component self
> [loki:05572] mca: base: components_open: component self open function successful
> [loki:05572] mca: base: components_open: found loaded component sm
> [loki:05572] mca: base: components_open: component sm open function successful
> [loki:05572] mca: base: components_open: found loaded component tcp
> [loki:05572] mca: base: components_open: component tcp open function successful
> [loki:05572] mca: base: components_open: found loaded component vader
> [loki:05572] mca: base: components_open: component vader open function successful
> [loki:05572] select: initializing btl component self
> [loki:05572] select: init of component self returned success
> [loki:05572] select: initializing btl component sm
> [loki:05572] select: init of component sm returned failure
> [loki:05572] mca: base: close: component sm closed
> [loki:05572] mca: base: close: unloading component sm
> [loki:05572] select: initializing btl component tcp
> [loki:05572] select: init of component tcp returned success
> [loki:05572] select: initializing btl component vader
> [loki][[35331,1],0][../../../../../openmpi-2.0.2rc2/opal/mca/btl/vader/btl_vader_component.c:454:mca_btl_vader_component_init] No peers to communicate with. Disabling vader.
> [loki:05572] select: init of component vader returned failure
> [loki:05572] mca: base: close: component vader closed
> [loki:05572] mca: base: close: unloading component vader
> [loki:05572] mca: bml: Using self btl for send to [[35331,1],0] on node loki
> Slave process 0 of 1 running on loki
> spawn_slave 0: argv[0]: spawn_slave
> [loki:05572] mca: base: close: component self closed
> [loki:05572] mca: base: close: unloading component self
> [loki:05572] mca: base: close: component tcp closed
> [loki:05572] mca: base: close: unloading component tcp
> loki spawn 125
>
> Kind regards, and thank you very much once more
>
> Siegmar
>
> On 03.01.2017 at 00:17, Howard Pritchard wrote:
>> Hi Siegmar,
>>
>> I've attempted to reproduce this using gnu compilers and
>> the version of this test program(s) you posted earlier in 2016
>> but am unable to reproduce the problem.
>>
>> Could you double check that the slave program can be
>> successfully run when launched directly by mpirun/mpiexec?
>> It might also help to use --mca btl_base_verbose 10 when
>> running the slave program standalone.
>>
>> Thanks,
>>
>> Howard
>>
>> 2016-12-28 7:06 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de <mailto:siegmar.gr...@informatik.hs-fulda.de>>:
>
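[Editor's note] The `--mca btl_base_verbose 10` output above is dominated by component bookkeeping; when comparing the master and slave runs, it can help to reduce each log to a per-component verdict. A small helper along those lines (an illustration only — the line format is taken from the log above, and the name `btl_verdicts` is hypothetical):

```python
import re

# Matches verbose-BTL lines like:
#   [loki:05572] select: init of component tcp returned success
PATTERN = re.compile(r"select: init of component (\w+) returned (success|failure)")

def btl_verdicts(log_lines):
    """Map each BTL component name to its init verdict ('success'/'failure')."""
    verdicts = {}
    for line in log_lines:
        m = PATTERN.search(line)
        if m:
            verdicts[m.group(1)] = m.group(2)
    return verdicts

sample = [
    "[loki:05572] select: init of component self returned success",
    "[loki:05572] select: init of component sm returned failure",
    "[loki:05572] select: init of component tcp returned success",
    "[loki:05572] select: init of component vader returned failure",
]
print(btl_verdicts(sample))
```

Diffing the two resulting dictionaries quickly shows whether the spawned run and the standalone run ended up selecting different BTLs.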
Re: [OMPI users] segmentation fault with openmpi-2.0.2rc2 on Linux
Hi Siegmar,

I've attempted to reproduce this using gnu compilers and the version of the test program(s) you posted earlier in 2016, but am unable to reproduce the problem.

Could you double check that the slave program can be successfully run when launched directly by mpirun/mpiexec? It might also help to use --mca btl_base_verbose 10 when running the slave program standalone.

Thanks,

Howard

2016-12-28 7:06 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
> Hi,
>
> I have installed openmpi-2.0.2rc2 on my "SUSE Linux Enterprise
> Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.2.0. Unfortunately,
> I get an error when I run one of my programs. Everything works as
> expected with openmpi-master-201612232109-67a08e8. The program
> gets a timeout with openmpi-v2.x-201612232156-5ce66b0.
>
> loki spawn 144 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
>   Open MPI: 2.0.2rc2
>   C compiler absolute: /opt/solstudio12.5b/bin/cc
>
> loki spawn 145 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>
> Parent process 0 running on loki
>   I create 4 slave processes
>
> --------------------------------------------------------------------------
> A system call failed during shared memory initialization that should
> not have. It is likely that your MPI job will now either abort or
> experience performance degradation.
>
>   Local host:  loki
>   System call: open(2)
>   Error:       No such file or directory (errno 2)
> --------------------------------------------------------------------------
> [loki:17855] *** Process received signal ***
> [loki:17855] Signal: Segmentation fault (11)
> [loki:17855] Signal code: Address not mapped (1)
> [loki:17855] Failing at address: 0x8
> [loki:17855] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f053d0e9870]
> [loki:17855] [ 1] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(+0x990ae)[0x7f05325060ae]
> [loki:17855] [ 2] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_req_start+0x196)[0x7f053250cb16]
> [loki:17855] [ 3] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_irecv+0x2f8)[0x7f05324bd3d8]
> [loki:17855] [ 4] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_coll_base_bcast_intra_generic+0x34c)[0x7f053e52300c]
> [loki:17855] [ 5] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_coll_base_bcast_intra_binomial+0x1ed)[0x7f053e523eed]
> [loki:17855] [ 6] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x1a3)[0x7f0531ea7c03]
> [loki:17855] [ 7] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_dpm_connect_accept+0xab8)[0x7f053d484f38]
> [loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in file ../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line 186
> [loki:17855] [ 8] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_dpm_dyn_init+0xcd)[0x7f053d48aeed]
> [loki:17855] [ 9] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_mpi_init+0xf93)[0x7f053d53d5f3]
> [loki:17855] [10] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(PMPI_Init+0x8d)[0x7f053db209cd]
> [loki:17855] [11] spawn_slave[0x4009cf]
> [loki:17855] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f053cd53b25]
> [loki:17855] [13] spawn_slave[0x400892]
> [loki:17855] *** End of error message ***
> [loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in file ../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line 186
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[55817,2],0]) is on host: loki
>   Process 2 ([[55817,2],1]) is on host: unknown!
>   BTLs attempted: self sm tcp vader
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MP
Re: [OMPI users] Segmentation Fault (Core Dumped) on mpif90 -v
Hi Paul,

Thanks very much for the Christmas present. The Open MPI README has been updated to include a note about issues with the Intel 16.0.3-4 compiler suites.

Enjoy the holidays,

Howard

2016-12-23 3:41 GMT-07:00 Paul Kapinos <kapi...@itc.rwth-aachen.de>:
> Hi all,
>
> we discussed this issue with Intel compiler support, and it looks like they
> now know what the issue is and how to guard against it. It is a known issue
> resulting from a backwards incompatibility in an OS/glibc update, cf.
> https://sourceware.org/bugzilla/show_bug.cgi?id=20019
>
> Affected versions of the Intel compilers: 16.0.3, 16.0.4
> Not affected versions: 16.0.2, 17.0
>
> So, simply do not use the affected versions (and hope for a bugfix update in
> the 16.x series if, like us, you cannot immediately upgrade to 17.x, even
> though upgrading is the option Intel favours).
>
> Have a nice Christmas time!
>
> Paul Kapinos
>
> On 12/14/16 13:29, Paul Kapinos wrote:
>> Hello all,
>> we seem to run into the same issue: 'mpif90' sigsegvs immediately for Open MPI
>> 1.10.4 compiled using Intel compilers 16.0.4.258 and 16.0.3.210, while it works
>> fine when compiled with 16.0.2.181.
>>
>> It seems to be a compiler issue (more exactly: a library issue in the libs
>> delivered with the 16.0.4.258 and 16.0.3.210 versions). Changing the loaded
>> compiler version back to 16.0.2.181 (=> a change of the dynamically loaded
>> libs) lets the previously-failing binary (compiled with the newer compilers)
>> work properly.
>>
>> Compiling with -O0 does not help. As the issue is likely in the Intel libs
>> (as said, swapping these out solves/raises the issue) we will fall back to
>> the 16.0.2.181 compiler version. We will try to open a case with Intel -
>> let's see...
>>
>> Have a nice day,
>>
>> Paul Kapinos
>>
>> On 05/06/16 14:10, Jeff Squyres (jsquyres) wrote:
>>> Ok, good.
>>>
>>> I asked that question because typically when we see errors like this, it is
>>> usually either a busted compiler installation or inadvertently mixing the
>>> run-times of multiple different compilers in some kind of incompatible way.
>>> Specifically, the mpifort (aka mpif90) application is a fairly simple program
>>> -- there's no reason it should segv, especially with a stack trace like the
>>> one you sent, which implies that it's dying early in startup, potentially
>>> even before it has hit any Open MPI code (i.e., it could even be pre-main).
>>>
>>> BTW, you might be able to get a more complete stack trace from the debugger
>>> that comes with the Intel compiler (idb? I don't remember offhand).
>>>
>>> Since you are able to run simple programs compiled by this compiler, it
>>> sounds like the compiler is working fine. Good!
>>>
>>> The next thing to check is to see if somehow the compiler and/or run-time
>>> environments are getting mixed up. E.g., the apps were compiled for one
>>> compiler/run-time but are being used with another. Also ensure that any
>>> compiler/linker flags that you are passing to Open MPI's configure script
>>> are native and correct for the platform for which you're compiling (e.g.,
>>> don't pass in flags that optimize for a different platform; that may result
>>> in generating machine code instructions that are invalid for your platform).
>>>
>>> Try recompiling/re-installing Open MPI from scratch, and if it still doesn't
>>> work, then send all the information listed here:
>>>
>>>     https://www.open-mpi.org/community/help/
>>>
>>> On May 6, 2016, at 3:45 AM, Giacomo Rossi <giacom...@gmail.com> wrote:
>>>>
>>>> Yes, I've tried three simple "Hello world" programs in Fortran, C and C++,
>>>> and they compile and run with Intel 16.0.3. The problem is with the
>>>> Open MPI compiled from source.
>>>>
>>>> Giacomo Rossi Ph.D., Space Engineer
>>>> Research Fellow at Dept. of Mechanical and Aerospace Engineering,
>>>> "Sapienza" University of Rome
>>>> p: (+39) 0692927207 | m: (+39) 3408816643 | e: giacom...@gmail.com
>>>> Member of Fortran-FOSS-programmers
>>>>
>>>> 2016-05-05 11:15 GMT+02:00 Giacomo Rossi <giacom...@gmail.com>:
>>>> gdb /opt/openmpi/
Re: [OMPI users] device failed to appear .. Connection timed out
Hi Daniele,

I bet this psm2 got installed as part of MPSS 3.7. I see something in the readme for that about an MPSS install with OFED support.

I think if you want to go the route of using the RHEL Open MPI RPMs, you could use the mca-params.conf file approach to disabling the use of psm2. This file, and a lot of other stuff about MCA parameters, is described here:

https://www.open-mpi.org/faq/?category=tuning

Alternatively, you could try to build/install Open MPI yourself from the download page:

https://www.open-mpi.org/software/ompi/v1.10/

The simplest solution - but you need to be confident that nothing is using the PSM2 software - would be to just use yum to deinstall the psm2 rpm.

Good luck,

Howard

2016-12-08 14:17 GMT-07:00 Daniele Tartarini <d.tartar...@sheffield.ac.uk>:
> Hi,
> many thanks for your reply.
>
> I have a S2600IP Intel motherboard. It is a stand-alone server, and I
> cannot see any Omni-Path device, so no such modules.
> opainfo is not available on my system.
>
> Am I missing anything?
> cheers
> Daniele
>
> On 8 December 2016 at 17:55, Cabral, Matias A <matias.a.cab...@intel.com> wrote:
>> > Anyway, /dev/hfi1_0 doesn't exist.
>>
>> Make sure you have the hfi1 module/driver loaded.
>> In addition, please confirm the links are in active state on all the
>> nodes (`opainfo`).
>>
>> _MAC
>>
>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard Pritchard
>> Sent: Thursday, December 08, 2016 9:23 AM
>> To: Open MPI Users <users@lists.open-mpi.org>
>> Subject: Re: [OMPI users] device failed to appear .. Connection timed out
>>
>> hello Daniele,
>>
>> Could you post the output from the ompi_info command? I'm noticing on the
>> RPMs that came with the rhel7.2 distro on one of our systems that it was
>> built to support psm2/hfi-1.
>>
>> Two things: could you try running applications with
>>
>> mpirun --mca pml ob1 (all the rest of your args)
>>
>> and see if that works?
>>
>> Second, what sort of system are you using? Is this a cluster? If it is,
>> you may want to check whether you have a situation where it's an Omni-Path
>> interconnect and you have the psm2/hfi1 packages installed, but for some
>> reason the Omni-Path HCAs themselves are not active.
>>
>> On one of our Omni-Path systems the following hfi1-related rpms are installed:
>>
>> hfidiags-0.8-13.x86_64
>> hfi1-psm-devel-0.7-244.x86_64
>> libhfi1verbs-0.5-16.el7.x86_64
>> hfi1-psm-0.7-244.x86_64
>> hfi1-firmware-0.9-36.noarch
>> hfi1-psm-compat-0.7-244.x86_64
>> libhfi1verbs-devel-0.5-16.el7.x86_64
>> hfi1-0.11.3.10.0_327.el7.x86_64-245.x86_64
>> hfi1-firmware_debug-0.9-36.noarch
>> hfi1-diagtools-sw-0.8-13.x86_64
>>
>> Howard
>>
>> 2016-12-08 8:45 GMT-07:00 r...@open-mpi.org <r...@open-mpi.org>:
>> Sounds like something didn't quite get configured right, or maybe you
>> have a library installed that isn't quite set up correctly, or...
>>
>> Regardless, we generally advise building from source to avoid such
>> problems. Is there some reason not to just do so?
>>
>> On Dec 8, 2016, at 6:16 AM, Daniele Tartarini <d.tartar...@sheffield.ac.uk> wrote:
>>
>> Hi,
>>
>> I've installed on a Red Hat 7.2 machine the Open MPI distributed via yum:
>>
>> openmpi-devel.x86_64 1.10.3-3.el7
>>
>> Any code I try to run (including the mpitests-*) gives the following
>> message, with slight variants:
>>
>> my_machine.171619 hfi_wait_for_device: The /dev/hfi1_0 device
>> failed to appear after 15.0 seconds: Connection timed out
>>
>> Is anyone able to help me in identifying the source of the problem?
>> Anyway, /dev/hfi1_0 doesn't exist.
>>
>> If I use an Open MPI version compiled from source I have no issue (gcc 4.8.5).
>>
>> many thanks in advance.
>>
>> cheers
>> Daniele
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
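[Editor's note] The mca-params.conf approach Howard describes above could look something like the following. This is a sketch, not a tested configuration: the file is typically placed at $HOME/.openmpi/mca-params.conf (per-user) or <prefix>/etc/openmpi-mca-params.conf (system-wide), and the parameter values shown are assumptions to adapt to your installation.

```
# $HOME/.openmpi/mca-params.conf
# Steer Open MPI away from the PSM2 MTL so the missing /dev/hfi1_0
# device is never probed; use the ob1 PML instead.
pml = ob1
mtl = ^psm2
```

Command-line `--mca` options and `OMPI_MCA_*` environment variables override anything set in this file, so it is a safe default that can still be bypassed per-run.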
Re: [OMPI users] device failed to appear .. Connection timed out
hello Daniele,

Could you post the output from the ompi_info command? I'm noticing on the RPMs that came with the rhel7.2 distro on one of our systems that it was built to support psm2/hfi-1.

Two things: could you try running applications with

mpirun --mca pml ob1 (all the rest of your args)

and see if that works?

Second, what sort of system are you using? Is this a cluster? If it is, you may want to check whether you have a situation where it's an Omni-Path interconnect and you have the psm2/hfi1 packages installed, but for some reason the Omni-Path HCAs themselves are not active.

On one of our Omni-Path systems the following hfi1-related rpms are installed:

hfidiags-0.8-13.x86_64
hfi1-psm-devel-0.7-244.x86_64
libhfi1verbs-0.5-16.el7.x86_64
hfi1-psm-0.7-244.x86_64
hfi1-firmware-0.9-36.noarch
hfi1-psm-compat-0.7-244.x86_64
libhfi1verbs-devel-0.5-16.el7.x86_64
hfi1-0.11.3.10.0_327.el7.x86_64-245.x86_64
hfi1-firmware_debug-0.9-36.noarch
hfi1-diagtools-sw-0.8-13.x86_64

Howard

2016-12-08 8:45 GMT-07:00 r...@open-mpi.org <r...@open-mpi.org>:
> Sounds like something didn't quite get configured right, or maybe you have
> a library installed that isn't quite set up correctly, or...
>
> Regardless, we generally advise building from source to avoid such
> problems. Is there some reason not to just do so?
>
> On Dec 8, 2016, at 6:16 AM, Daniele Tartarini <d.tartar...@sheffield.ac.uk> wrote:
>
> Hi,
>
> I've installed on a Red Hat 7.2 machine the Open MPI distributed via yum:
>
> openmpi-devel.x86_64 1.10.3-3.el7
>
> Any code I try to run (including the mpitests-*) gives the following
> message, with slight variants:
>
> my_machine.171619 hfi_wait_for_device: The /dev/hfi1_0 device
> failed to appear after 15.0 seconds: Connection timed out
>
> Is anyone able to help me in identifying the source of the problem?
> Anyway, /dev/hfi1_0 doesn't exist.
>
> If I use an Open MPI version compiled from source I have no issue (gcc 4.8.5).
>
> many thanks in advance.
> cheers
> Daniele
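[Editor's note] The same steering can also be done per-job with environment variables instead of a config file: Open MPI reads any `OMPI_MCA_<param>` variable present in the launching environment. A sketch, where the specific values mirror the workaround discussed above and are assumptions to adapt:

```shell
# Select the ob1 PML and exclude the psm2 MTL for any mpirun
# invocations subsequently launched from this shell.
export OMPI_MCA_pml=ob1
export OMPI_MCA_mtl=^psm2
echo "pml=$OMPI_MCA_pml mtl=$OMPI_MCA_mtl"
```

When running under a batch scheduler, make sure these variables are exported into the environment that actually invokes mpirun.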
Re: [OMPI users] Follow-up to Open MPI SC'16 BOF
Hi Jeff,

I don't think it was the use of memkind itself, but rather a need to refactor the way Open MPI uses info objects, that was the issue. I don't recall the details.

Howard

2016-11-22 16:27 GMT-07:00 Jeff Hammond <jeff.scie...@gmail.com>:
>>
>> 1. MPI_ALLOC_MEM integration with memkind
>
> It would make sense to prototype this as a standalone project that is
> integrated with any MPI library via PMPI. It's probably a day or two of
> work to get that going.
>
> Jeff
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
[OMPI users] Follow-up to Open MPI SC'16 BOF
Hello Folks,

This is a follow-up to the question posed at the SC'16 Open MPI BOF: would the community prefer a v2.2.x limited-feature but backwards-compatible release sometime in 2017, or a v3.x (not backwards compatible, but with potentially more features) sometime in late 2017 to early 2018?

BOF attendees expressed an interest in having a list of features that might make it into v2.2.x, and of ones that the Open MPI developers think would be too hard to back-port from the development branch (master) to a v2.2.x release stream. Here are the requested lists:

Features that we anticipate we could port to a v2.2.x release:
1. Improved collective performance (a new "tuned" module)
2. Enable Linux CMA shared memory support by default
3. PMIx 3.0 (if new functionality were to be used in this release of Open MPI)

Features that we anticipate would be too difficult to port to a v2.2.x release:
1. Revamped CUDA support
2. MPI_ALLOC_MEM integration with memkind
3. OpenMP affinity/placement integration
4. THREAD_MULTIPLE improvements to MTLs (not so clear on the level of difficulty for this one)

You can register your opinion on whether to go with a v2.2.x release next year, or to go from v2.1.x to v3.x in late 2017 or early 2018, at the link below:

https://www.open-mpi.org/sc16/

Thanks very much,

Howard

--
Howard Pritchard
HPC-DES
Los Alamos National Laboratory