Hi Jerry,
A Cray EX HPC system with Slingshot 10 (NOT 11!) is basically a Mellanox cluster
using RoCE rather than InfiniBand.
For this sort of interconnect, don't use OFI; use UCX. UCX 1.12.0 is getting a
bit old, though. I'd recommend 1.14.0 or newer, especially if your system has
nodes with GPUs.
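If you are building Open MPI yourself, pointing configure at your UCX install is
the main step. A minimal sketch, assuming UCX is installed under /opt/ucx (a
hypothetical path):

./configure --with-ucx=/opt/ucx
make -j install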
CXI is the
Hi Jeffrey,
I would suggest trying to debug what may be going wrong with UCX on your DGX
box.
There are several things to try from the UCX faq -
https://openucx.readthedocs.io/en/master/faq.html
I'd suggest setting the UCX_LOG_LEVEL environment variable to info or debug and
seeing if UCX says
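For example, a minimal sketch (./my_app stands in for your application):

export UCX_LOG_LEVEL=debug
mpirun -np 2 ./my_app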
Hello Jeffrey,
A couple of things to try first.
Try running without UCX. Add --mca pml ^ucx to the mpirun command line. If
the app functions without UCX, then the next thing is to see what may be going
wrong with UCX and the Open MPI components that use it.
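For reference, the UCX-disabled run would look like this (a sketch; ./my_app is
a placeholder):

mpirun --mca pml ^ucx -np 2 ./my_app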
You may want to set the
Hi Aziz,
Oh I see you referenced the faq. That section of the faq is discussing how to
make the Open MPI 4 series (and older) job launcher “know” about the batch
scheduler you are using.
The relevant section for launching with srun is covered by this faq -
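For completeness, the basic direct-launch invocation looks like this (a sketch,
assuming your Slurm build provides the PMI2 plugin):

srun --mpi=pmi2 -n 4 ./my_app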
Hi Aziz,
Did you include --with-pmi2 on your Open MPI configure line?
Howard
From: users on behalf of Aziz Ogutlu via
users
Organization: Eduline Bilisim
Reply-To: Open MPI Users
Date: Tuesday, July 25, 2023 at 8:18 AM
To: Open MPI Users
Cc: Aziz Ogutlu
Subject: [EXTERNAL] Re: [OMPI users]
Hi Arun,
Interesting. For problem b) I would suggest one of two things:
- if you want to dig deeper yourself, and it's possible on your system, I'd look
at the output of dmesg -H -w on the node where the job is hitting this failure
(you'll need to rerun the job; see the sketch after this list)
- ping the UCX group mail list
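A sketch of the dmesg suggestion above, run on the failing node:

dmesg -H -w   # follow new kernel messages in human-readable form while the job reruns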
Hi Arun,
It's going to be chatty, but you may want to see if strace helps in diagnosing:
mpirun -np 2 (all your favorite mpi args) strace -f send_recv 1000 1
Huge pages often help reduce pressure on a NIC's I/O MMU and speed up
resolving virtual-to-physical (VA to PA) address translations.
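If you want to check the huge page configuration on a node first, these
standard Linux queries work:

grep -i huge /proc/meminfo
cat /proc/sys/vm/nr_hugepages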
On 7/19/23, 9:24
Hi Kingshuk,
Looks like the MPI_T Events feature is parked in this PR -
https://github.com/open-mpi/ompi/pull/8057 - at the moment.
Howard
From: users on behalf of Kingshuk Haldar via
users
Reply-To: Open MPI Users
Date: Wednesday, March 15, 2023 at 4:00 AM
To: OpenMPI-lists-users
Cc:
Hi,
You are using MPICH or a vendor derivative of MPICH. You probably want to
resend this email to the mpich users/help mail list.
Howard
From: users on behalf of mrlong via users
Reply-To: Open MPI Users
Date: Tuesday, November 1, 2022 at 11:26 AM
To: "de...@lists.open-mpi.org" ,
Hi Jeff,
I think you are now at the "send the system admin an email to install RPMs"
stage; in particular, ask that the numa and udev devel RPMs be installed. They
will need to install these RPMs on the compute node image(s) as well.
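For reference, on RHEL-family systems the packages are typically numactl-devel
and systemd-devel (the latter carries the udev headers); exact names vary by
distro, so treat this as a sketch:

sudo yum install numactl-devel systemd-devel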
Howard
From: "Jeffrey D. (JD) Tamucci"
Date: Wednesday, October
Could you change the --with-pmi to be
--with-pmi=/cm/shared/apps/slurm21.08.8
?
From: "Jeffrey D. (JD) Tamucci"
Date: Tuesday, October 4, 2022 at 10:40 AM
To: "Pritchard Jr., Howard" , "bbarr...@amazon.com"
Cc: Open MPI Users
Subject: Re: [EXTERNAL] [OMPI users] Beginner Troubleshooting
Hi JD,
Could you post the configure options your script uses to build Open MPI?
Howard
From: users on behalf of "Jeffrey D. (JD)
Tamucci via users"
Reply-To: Open MPI Users
Date: Tuesday, October 4, 2022 at 10:07 AM
To: "users@lists.open-mpi.org"
Cc: "Jeffrey D. (JD) Tamucci"
Subject:
Hi Boyrie,
The warning message is coming from the older ibverbs component of the Open MPI
4.0/4.1 releases.
You can make this message go away in several ways. One, at configure time, is to add
--disable-verbs
to the configure options.
At runtime you can set
export OMPI_MCA_btl=^openib
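Equivalently, per job on the mpirun command line (./my_app is a placeholder):

mpirun --mca btl ^openib -np 4 ./my_app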
The ucx
Hi Janek,
A few questions.
First which version of Open MPI are you using?
Did you compile your code with the Open MPI mpijavac wrapper?
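For reference, the Java bindings workflow looks like this (a sketch; Hello.java
is a placeholder):

mpijavac Hello.java
mpirun -np 2 java Hello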
Howard
From: users on behalf of "Laudan, Janek via
users"
Reply-To: "Laudan, Janek" , Open MPI Users
Date: Thursday, March 17, 2022 at 9:52 AM
To:
Hi Kurt,
This documentation is rather Slurm-centric. If you build the Open MPI 4.1.x series
the default way, it will build its internal PMIx package and use that when
launching your app using mpirun.
In that case, you can use MPI_Comm_spawn within a Slurm allocation as long as
there are
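A minimal sketch of that pattern (./spawner is a hypothetical program that
calls MPI_Comm_spawn):

salloc -N 2
mpirun -np 1 ./spawner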
Hi Jose,
I bet this device has not been tested with UCX.
You may want to join the ucx users mail list at
https://elist.ornl.gov/mailman/listinfo/ucx-group
and ask whether this Marvell device has been tested and workarounds for
disabling features that this device doesn't support.
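One common knob for that kind of workaround is restricting which transports UCX
will try; this is a guess on my part, not something verified for this device:

export UCX_TLS=tcp,self,sm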
Again
Hi Jose,
A number of things.
First for recent versions of Open MPI including the 4.1.x release stream,
MPI_THREAD_MULTIPLE is supported by default. However, some transport options
available when using MPI_Init may not be available when requesting
MPI_THREAD_MULTIPLE.
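You can confirm the thread support your build reports with:

ompi_info | grep -i thread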
You may want to let
Hello Jose,
I suspect the issue here is that the OpenIB BTL isn't finding a connection
module when you are requesting MPI_THREAD_MULTIPLE.
The rdmacm connection module is deselected if the MPI_THREAD_MULTIPLE thread
support level is being requested.
If you run the test in a shell with
export
A second release candidate for Open MPI v4.0.7 is now available for testing:
https://www.open-mpi.org/software/ompi/v4.0/
New fixes with this release candidate:
- Fix an issue with MPI_IALLREDUCE_SCATTER when using large count arguments.
- Fixed an issue with POST/START/COMPLETE/WAIT when
The first release candidate for Open MPI v4.0.7 is now available for testing:
https://www.open-mpi.org/software/ompi/v4.0/
Some fixes include:
- Numerous fixes from vendor partners.
- Fix a problem with a couple of MPI_IALLREDUCE algorithms. Thanks to
John Donners for reporting.
- Fix
Hi Greg,
I believe so concerning your TCP question.
I think the patch probably isn’t actually being used otherwise you would have
noticed the curious print statement.
Sorry about that. I’m out of ideas on what may be happening.
Howard
From: "Fischer, Greg A."
Date: Friday, October 15, 2021
Hi Greg,
Oh yes that’s not good about rdmacm.
Yes the OFED looks pretty old.
Did you by any chance apply that patch? I generated that for a sysadmin here
who was in the situation where they needed to maintain Open MPI 3.1.6 but had
to also upgrade to some newer RHEL release, but the Open MPI
Hi Greg,
I think the UCX PML may be discomfited by the lack of thread safety.
Could you try using the contrib/configure-release-mt script in your ucx folder?
You want to add --enable-mt.
That’s what stands out in your configure output from the one I usually get when
building on a MLNX connectx5
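A sketch of the UCX rebuild, with a hypothetical install prefix:

cd ucx
./contrib/configure-release-mt --enable-mt --prefix=$HOME/ucx-mt
make -j install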
Hi Greg,
It's the aging openib BTL.
You may be able to apply the attached patch. Note the 3.1.x release stream is
no longer supported.
You may want to try using the 4.1.1 release, in which case you’ll want to use
UCX.
Howard
From: users on behalf of "Fischer, Greg A.
via users"
Hello Arturo,
Would you mind filing an issue against Open MPI and use the template to provide
info we could use to help triage this problem?
https://github.com/open-mpi/ompi/issues/new
Thanks,
Howard
From: users on behalf of Arturo Fernandez
via users
Reply-To: Open MPI Users
Date:
Hi John,
Good to know. For the record were you using a docker container unmodified from
docker hub?
Howard
From: John Haiducek
Date: Wednesday, May 26, 2021 at 9:35 AM
To: "Pritchard Jr., Howard"
Cc: "users@lists.open-mpi.org"
Subject: Re: [EXTERNAL] [OMPI users] Linker errors in Fedora 34
Hi John,
I don’t like this in the make output:
../../libtool: line 5705: find: command not found
Maybe you need to install findutils or relevant fedora rpm in your container?
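For a Fedora-based image that would be something like:

dnf install -y findutils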
Howard
From: John Haiducek
Date: Wednesday, May 26, 2021 at 7:29 AM
To: "Pritchard Jr., Howard" ,
Hi John,
I don’t think an external dependency is going to fix this.
In your build area, do you see any .lo files in
opal/util/keyval
?
Which compiler are you using?
Also, are you building from the tarballs at
https://www.open-mpi.org/software/ompi/v4.1/ ?
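As a quick way to check for the .lo files from the top of the build tree (a
sketch):

ls opal/util/keyval/*.lo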
Howard
From: users on behalf of
Hi Ben,
You're heading down the right path.
On our HPC systems, we use modules to handle things like setting
LD_LIBRARY_PATH etc. when using Intel 21.x.y and other Intel compilers.
For example, for the Intel/21.1.1 the following were added to LD_LIBRARY_PATH
(edited to avoid posting
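A typical module-based workflow looks like this (the module name is
hypothetical):

module load intel/21.1.1
echo $LD_LIBRARY_PATH   # confirm the Intel runtime directories were prepended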
Hi Michael,
You may want to try
https://github.com/Sandia-OpenSHMEM/SOS
if you want to use OpenSHMEM over OPA.
If you have lots of cycles for development work, you could write an OFI SPML
for the OSHMEM component of Open MPI.
Howard
On 3/22/21, 8:56 AM, "users on behalf of Michael Di
Hi Folks,
I'm also having problems reproducing this on one of our OPA clusters:
libpsm2-11.2.78-1.el7.x86_64
libpsm2-devel-11.2.78-1.el7.x86_64
cluster runs RHEL 7.8
hca_id: hfi1_0
transport: InfiniBand (0)
fw_ver: 1.27.0
Hi Patrick,
Also, it might not hurt to disable the OpenIB BTL by setting
export OMPI_MCA_btl=^openib
in your shell prior to invoking mpirun
Howard
From: users on behalf of "Heinz, Michael
William via users"
Reply-To: Open MPI Users
Date: Monday, January 25, 2021 at 8:47 AM
To:
Hello Dave,
There's an issue opened about this -
https://github.com/open-mpi/ompi/issues/8252
However, I'm not observing failures with IMB RMA on an IB/aarch64 system and UCX
1.9.0 using OMPI 4.0.x at 6ea9d98.
This cluster is running RHEL 7.6 and MLNX_OFED_LINUX-4.5-1.0.1.0.
Howard
On