Re: [OMPI users] [EXTERNAL] Confusions on building and running OpenMPI over Slingshot 10 on Cray EX HPC

2024-05-09 Thread Pritchard Jr., Howard via users
Hi Jerry, Cray EX HPC with Slingshot 10 (NOT 11!!!) is basically a Mellanox IB cluster using RoCE rather than IB. For this sort of interconnect, don't use OFI; use UCX. Note, though, that UCX 1.12.0 is getting a bit old; I'd recommend 1.14.0 or newer, especially if your system has nodes with GPUs. CXI is the
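A hedged sketch of a build along those lines (the install prefix, UCX path, and application name are assumptions, not from the message):
  # Build Open MPI against a recent UCX rather than OFI for Slingshot 10 (RoCE)
  ./configure --prefix=$HOME/ompi --with-ucx=/opt/ucx-1.14.0
  make -j 8 install
  # At run time, select the UCX PML explicitly
  mpirun -np 4 --mca pml ucx ./my_app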

Re: [OMPI users] [EXTERNAL] Helping interpreting error output

2024-04-16 Thread Pritchard Jr., Howard via users
Hi Jeffrey, I would suggest trying to debug what may be going wrong with UCX on your DGX box. There are several things to try from the UCX FAQ - https://openucx.readthedocs.io/en/master/faq.html I'd suggest setting the UCX_LOG_LEVEL environment variable to info or debug and seeing if UCX says
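A minimal sketch of that diagnostic step (rank count, application name, and log file are hypothetical):
  # Ask UCX to report transport selection details and failures
  export UCX_LOG_LEVEL=debug
  mpirun -np 2 ./my_app 2>&1 | tee ucx_debug.log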

Re: [OMPI users] [EXTERNAL] Help deciphering error message

2024-03-07 Thread Pritchard Jr., Howard via users
Hello Jeffrey, A couple of things to try first. Try running without UCX. Add --mca pml ^ucx to the mpirun command line. If the app functions without UCX, then the next thing is to see what may be going wrong with UCX and the Open MPI components that use it. You may want to set the
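A minimal sketch of that run (process count and binary name are hypothetical):
  # Exclude the UCX PML so Open MPI falls back to its other transports
  mpirun --mca pml ^ucx -np 4 ./my_app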

Re: [OMPI users] [EXTERNAL] Re: MPI_Init_thread error

2023-07-25 Thread Pritchard Jr., Howard via users
Hi Aziz, Oh, I see you referenced the FAQ. That section of the FAQ discusses how to make the Open MPI 4 series (and older) job launcher "know" about the batch scheduler you are using. The relevant section for launching with srun is covered by this FAQ -
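A hedged sketch of a direct srun launch, assuming Open MPI was built with Slurm PMI support (the PMI flavor and task count are assumptions):
  # Launch the MPI job directly with srun instead of mpirun
  srun --mpi=pmi2 -n 4 ./my_app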

Re: [OMPI users] [EXTERNAL] Re: MPI_Init_thread error

2023-07-25 Thread Pritchard Jr., Howard via users
Hi Aziz, Did you include --with-pmi2 on your Open MPI configure line? Howard From: users on behalf of Aziz Ogutlu via users Organization: Eduline Bilisim Reply-To: Open MPI Users Date: Tuesday, July 25, 2023 at 8:18 AM To: Open MPI Users Cc: Aziz Ogutlu Subject: [EXTERNAL] Re: [OMPI users]

Re: [OMPI users] [EXTERNAL] Re: How to use hugetlbfs with openmpi and ucx

2023-07-24 Thread Pritchard Jr., Howard via users
Hi Arun, Interesting. For problem b) I would suggest one of two things - if you want to dig deeper yourself, and it's possible on your system, I'd look at the output of dmesg -H -w on the node where the job is hitting this failure (you'll need to rerun the job) - ping the UCX group mailing list
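A minimal sketch of the suggested check on the failing node:
  # Watch kernel messages in human-readable form while re-running the job
  dmesg -H -w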

Re: [OMPI users] [EXTERNAL] Re: How to use hugetlbfs with openmpi and ucx

2023-07-20 Thread Pritchard Jr., Howard via users
Hi Arun, It's going to be chatty, but you may want to see if strace helps in diagnosing: mpirun -np 2 (all your favorite mpi args) strace -f send_recv 1000 1 Huge pages often help reduce pressure on a NIC's I/O MMU widget and speed up resolving virtual-to-physical (VA to PA) memory addresses. On 7/19/23, 9:24
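A slightly expanded sketch of that invocation (the per-rank output prefix is an assumption, not in the original):
  # Trace system calls of each rank; -ff with -o writes one trace file per process
  mpirun -np 2 strace -ff -o /tmp/send_recv_trace ./send_recv 1000 1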

Re: [OMPI users] [EXTERNAL] Requesting information about MPI_T events

2023-03-15 Thread Pritchard Jr., Howard via users
Hi Kingshuk, Looks like the MPI_T Events feature is parked in this PR - https://github.com/open-mpi/ompi/pull/8057 - at the moment. Howard From: users on behalf of Kingshuk Haldar via users Reply-To: Open MPI Users Date: Wednesday, March 15, 2023 at 4:00 AM To: OpenMPI-lists-users Cc:

Re: [OMPI users] [EXTERNAL] OFI, destroy_vni_context(1137).......: OFI domain close failed (ofi_init.c:1137:destroy_vni_context:Device or resource busy)

2022-11-01 Thread Pritchard Jr., Howard via users
Hi, You are using MPICH or a vendor derivative of MPICH. You probably want to resend this email to the MPICH users/help mailing list. Howard From: users on behalf of mrlong via users Reply-To: Open MPI Users Date: Tuesday, November 1, 2022 at 11:26 AM To: "de...@lists.open-mpi.org" ,

Re: [OMPI users] [EXTERNAL] Beginner Troubleshooting OpenMPI Installation - pmi.h Error

2022-10-05 Thread Pritchard Jr., Howard via users
Hi Jeff, I think you are now at the “send the system admin an email asking them to install RPMs” stage; in particular, ask that the numa and udev devel RPMs be installed. They will need to install these RPMs on the compute node image(s) as well. Howard From: "Jeffrey D. (JD) Tamucci" Date: Wednesday, October
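A hedged sketch of the request to pass along; exact package names vary by distribution and are assumptions here:
  # On RHEL-like systems, the numa and udev development headers typically come from
  yum install numactl-devel systemd-devel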

Re: [OMPI users] [EXTERNAL] Beginner Troubleshooting OpenMPI Installation - pmi.h Error

2022-10-04 Thread Pritchard Jr., Howard via users
Could you change the --with-pmi to be --with-pmi=/cm/shared/apps/slurm21.08.8 ? From: "Jeffrey D. (JD) Tamucci" Date: Tuesday, October 4, 2022 at 10:40 AM To: "Pritchard Jr., Howard" , "bbarr...@amazon.com" Cc: Open MPI Users Subject: Re: [EXTERNAL] [OMPI users] Beginner Troubleshooting
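A sketch of the corrected configure invocation (other options are elided; only the Slurm prefix comes from the message, the install prefix is hypothetical):
  # Point --with-pmi at the Slurm installation prefix so pmi.h can be found
  ./configure --prefix=$HOME/ompi --with-pmi=/cm/shared/apps/slurm21.08.8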

Re: [OMPI users] [EXTERNAL] Beginner Troubleshooting OpenMPI Installation - pmi.h Error

2022-10-04 Thread Pritchard Jr., Howard via users
Hi JD, Could you post the configure options your script uses to build Open MPI? Howard From: users on behalf of "Jeffrey D. (JD) Tamucci via users" Reply-To: Open MPI Users Date: Tuesday, October 4, 2022 at 10:07 AM To: "users@lists.open-mpi.org" Cc: "Jeffrey D. (JD) Tamucci" Subject:

Re: [OMPI users] [EXTERNAL] Problem with Mellanox ConnectX3 (FDR) and openmpi 4

2022-08-19 Thread Pritchard Jr., Howard via users
Hi Boyrie, The warning message is coming from the older ibverbs component of the Open MPI 4.0/4.1 releases. You can silence this message in several ways. One, at configure time, is to add --disable-verbs to the configure options. At runtime you can set export OMPI_MCA_btl=^openib The UCX
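A sketch of the two options mentioned in the reply (the install prefix and application name are assumptions; the flags are as named in the message):
  # Option 1: build without the old ibverbs (openib) component
  ./configure --prefix=$HOME/ompi --disable-verbs
  # Option 2: keep the existing build, but disable the openib BTL at run time
  export OMPI_MCA_btl=^openib
  mpirun -np 4 ./my_app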

Re: [OMPI users] [EXTERNAL] Java Segentation Fault

2022-03-17 Thread Pritchard Jr., Howard via users
Hi Janek, A few questions. First, which version of Open MPI are you using? Did you compile your code with the Open MPI mpijavac wrapper? Howard From: users on behalf of "Laudan, Janek via users" Reply-To: "Laudan, Janek" , Open MPI Users Date: Thursday, March 17, 2022 at 9:52 AM To:
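A hedged sketch of compiling and launching a Java MPI program with the Open MPI wrappers, assuming Open MPI was built with its Java bindings (the class name is hypothetical):
  # Compile against Open MPI's Java bindings and launch through mpirun
  mpijavac Hello.java
  mpirun -np 2 java Hello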

Re: [OMPI users] [EXTERNAL] OpenMPI, Slurm and MPI_Comm_spawn

2022-03-08 Thread Pritchard Jr., Howard via users
Hi Kurt, This documentation is rather Slurm-centric. If you build the Open MPI 4.1.x series the default way, it will build its internal PMIx package and use that when launching your app using mpirun. In that case, you can use MPI_Comm_spawn within a Slurm allocation as long as there are
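A hedged sketch of that usage, assuming the parent spawns children within the allocation (slot counts and binary name are hypothetical):
  # Allocate more slots than the initial job uses, leaving room for MPI_Comm_spawn
  salloc -n 8
  mpirun -np 4 ./spawn_parent   # leaves 4 free slots for spawned children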

Re: [OMPI users] [EXTERNAL] openib BTL disabled when using MPI_Init_thread

2022-02-07 Thread Pritchard Jr., Howard via users
Hi Jose, I bet this device has not been tested with UCX. You may want to join the UCX users mailing list at https://elist.ornl.gov/mailman/listinfo/ucx-group and ask whether this Marvell device has been tested, and about workarounds for disabling features that this device doesn't support. Again

Re: [OMPI users] [EXTERNAL] openib BTL disabled when using MPI_Init_thread

2022-02-03 Thread Pritchard Jr., Howard via users
Hi Jose, A number of things. First, for recent versions of Open MPI, including the 4.1.x release stream, MPI_THREAD_MULTIPLE is supported by default. However, some transport options available when using MPI_Init may not be available when requesting MPI_THREAD_MULTIPLE. You may want to let

Re: [OMPI users] [EXTERNAL] openib BTL disabled when using MPI_Init_thread

2022-02-03 Thread Pritchard Jr., Howard via users
Hello Jose, I suspect the issue here is that the OpenIB BTL isn't finding a connection module when you are requesting MPI_THREAD_MULTIPLE. The rdmacm connection module is deselected if the MPI_THREAD_MULTIPLE thread support level is being requested. If you run the test in a shell with export

[OMPI users] Open MPI v4.0.7rc2 available for testing

2021-11-08 Thread Pritchard Jr., Howard via users
A second release candidate for Open MPI v4.0.7 is now available for testing: https://www.open-mpi.org/software/ompi/v4.0/ New fixes with this release candidate: - Fix an issue with MPI_IALLREDUCE_SCATTER when using large count arguments. - Fixed an issue with POST/START/COMPLETE/WAIT when

[OMPI users] Open MPI v4.0.7rc1 available for testing

2021-10-25 Thread Pritchard Jr., Howard via users
The first release candidate for Open MPI v4.0.7 is now available for testing: https://www.open-mpi.org/software/ompi/v4.0/ Some fixes include: - Numerous fixes from vendor partners. - Fix a problem with a couple of MPI_IALLREDUCE algorithms. Thanks to John Donners for reporting. - Fix

Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

2021-10-19 Thread Pritchard Jr., Howard via users
HI Greg, I believe so concerning your TCP question. I think the patch probably isn’t actually being used otherwise you would have noticed the curious print statement. Sorry about that. I’m out of ideas on what may be happening. Howard From: "Fischer, Greg A." Date: Friday, October 15, 2021

Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

2021-10-14 Thread Pritchard Jr., Howard via users
Hi Greg, Oh yes, that's not good about rdmacm. Yes, the OFED looks pretty old. Did you by any chance apply that patch? I generated that for a sysadmin here who was in the situation where they needed to maintain Open MPI 3.1.6 but also had to upgrade to some newer RHEL release, but the Open MPI

Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

2021-10-14 Thread Pritchard Jr., Howard via users
Hi Greg, I think the UCX PML may be discomfited by the lack of thread safety. Could you try using the contrib/configure-release-mt script in your UCX folder? You want to add --enable-mt. That's what stands out in your configure output compared to the one I usually get when building on a Mellanox ConnectX-5
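A sketch of rebuilding UCX with thread support using that script (the install prefix is hypothetical):
  # configure-release-mt builds UCX with multi-thread support enabled
  cd ucx
  ./contrib/configure-release-mt --prefix=$HOME/ucx-mt
  make -j 8 install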

Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

2021-10-13 Thread Pritchard Jr., Howard via users
Hi Greg, It's due to the aging of the openib BTL. You may be able to apply the attached patch. Note the 3.1.x release stream is no longer supported. You may want to try the 4.1.1 release instead, in which case you'll want to use UCX. Howard From: users on behalf of "Fischer, Greg A. via users"

Re: [OMPI users] [EXTERNAL] Error Signal code: Address not mapped (1)

2021-06-22 Thread Pritchard Jr., Howard via users
Hello Arturo, Would you mind filing an issue against Open MPI and using the template to provide info we could use to help triage this problem? https://github.com/open-mpi/ompi/issues/new Thanks, Howard From: users on behalf of Arturo Fernandez via users Reply-To: Open MPI Users Date:

Re: [OMPI users] [EXTERNAL] Linker errors in Fedora 34 Docker container

2021-05-26 Thread Pritchard Jr., Howard via users
Hi John, Good to know. For the record were you using a docker container unmodified from docker hub? Howard From: John Haiducek Date: Wednesday, May 26, 2021 at 9:35 AM To: "Pritchard Jr., Howard" Cc: "users@lists.open-mpi.org" Subject: Re: [EXTERNAL] [OMPI users] Linker errors in Fedora 34

Re: [OMPI users] [EXTERNAL] Linker errors in Fedora 34 Docker container

2021-05-26 Thread Pritchard Jr., Howard via users
Hi John, I don't like this in the make output: ../../libtool: line 5705: find: command not found Maybe you need to install findutils or the relevant Fedora RPM in your container? Howard From: John Haiducek Date: Wednesday, May 26, 2021 at 7:29 AM To: "Pritchard Jr., Howard" ,
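A minimal sketch for the Fedora 34 container:
  # libtool needs find; install it inside the container before rebuilding
  dnf install -y findutils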

Re: [OMPI users] [EXTERNAL] Linker errors in Fedora 34 Docker container

2021-05-25 Thread Pritchard Jr., Howard via users
Hi John, I don’t think an external dependency is going to fix this. In your build area, do you see any .lo files in opal/util/keyval ? Which compiler are you using? Also, are you building from the tarballs at https://www.open-mpi.org/software/ompi/v4.1/ ? Howard From: users on behalf of

Re: [OMPI users] [EXTERNAL] Re: Newbie With Issues

2021-03-30 Thread Pritchard Jr., Howard via users
Hi Ben, You're heading down the right path. On our HPC systems, we use modules to handle things like setting LD_LIBRARY_PATH, etc., when using Intel 21.x.y and other Intel compilers. For example, for Intel/21.1.1 the following were added to LD_LIBRARY_PATH (edited to avoid posting
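A hedged sketch of that approach (the exact module name is site-specific and assumed here):
  # Let the module system set LD_LIBRARY_PATH for the Intel compiler runtime
  module load intel/21.1.1
  # Confirm the Intel runtime directories are now on the library path
  echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i intel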

Re: [OMPI users] [EXTERNAL] building openshem on opa

2021-03-22 Thread Pritchard Jr., Howard via users
Hi Michael, You may want to try https://github.com/Sandia-OpenSHMEM/SOS if you want to use OpenSHMEM over OPA. If you have lots of cycles for development work, you could write an OFI SPML for the OSHMEM component of Open MPI. Howard On 3/22/21, 8:56 AM, "users on behalf of Michael Di

Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Pritchard Jr., Howard via users
Hi Folks, I'm also having problems reproducing this on one of our OPA clusters: libpsm2-11.2.78-1.el7.x86_64 libpsm2-devel-11.2.78-1.el7.x86_64 The cluster runs RHEL 7.8 hca_id: hfi1_0 transport: InfiniBand (0) fw_ver: 1.27.0

Re: [OMPI users] [EXTERNAL] OpenMPI 4.0.5 error with Omni-path

2021-01-25 Thread Pritchard Jr., Howard via users
Hi Patrick, Also, it might not hurt to disable the openib BTL by setting export OMPI_MCA_btl=^openib in your shell prior to invoking mpirun Howard From: users on behalf of "Heinz, Michael William via users" Reply-To: Open MPI Users Date: Monday, January 25, 2021 at 8:47 AM To:

Re: [OMPI users] [EXTERNAL] RMA breakage

2020-12-07 Thread Pritchard Jr., Howard via users
Hello Dave, There's an issue open about this - https://github.com/open-mpi/ompi/issues/8252 However, I'm not observing failures with IMB RMA on an IB/aarch64 system and UCX 1.9.0 using OMPI 4.0.x at 6ea9d98. This cluster is running RHEL 7.6 and MLNX_OFED_LINUX-4.5-1.0.1.0. Howard On
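For reference, a sketch of how such a check might be run (the benchmark path, rank count, and PML choice are assumptions):
  # Run the Intel MPI Benchmarks one-sided (RMA) suite over the UCX PML
  mpirun -np 2 --mca pml ucx ./IMB-RMA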