Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Joshua Ladd via users
Hi, Noam. Can you try your original command line with the following addition: mpirun -mca pml ucx -mca btl ^vader,tcp,openib -mca osc ucx I think we're seeing some conflict between UCX PML and UCT OSC. Josh On Wed, Jun 19, 2019 at 4:36 PM Noam Bernstein via users <
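
For reference, a sketch of how the suggested MCA settings would be applied to a full launch; "./my_app" and the rank count are placeholders, not details from the thread.

    # Suggested settings applied to an application launch: UCX PML,
    # UCX one-sided (osc) component, and the vader/tcp/openib BTLs excluded.
    mpirun -np 64 -mca pml ucx -mca btl ^vader,tcp,openib -mca osc ucx ./my_app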

Re: [OMPI users] UCX and MPI_THREAD_MULTIPLE

2019-08-26 Thread Joshua Ladd via users
it was built using the default version > of UCX that comes with EPEL (1.5.1). We only built 1.6.0 as the version > provided by EPEL did not build with MT enabled, which to me seems strange > as I don't see any reason not to build with MT enabled. Anyways that's the > deeper context.
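
As context, a minimal sketch of the kind of UCX rebuild described above, assuming a release tarball; the install prefix is a placeholder, and --enable-mt is the configure switch that turns on the multi-threading (MT) support the message refers to.

    # Build UCX 1.6.0 with multi-threading support (paths are placeholders).
    tar xzf ucx-1.6.0.tar.gz && cd ucx-1.6.0
    ./configure --prefix=/opt/ucx-1.6.0-mt --enable-mt
    make -j && make install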

Re: [OMPI users] UCX and MPI_THREAD_MULTIPLE

2019-08-26 Thread Joshua Ladd via users
out >> this kink so if users do hit it we have a solution. As noted UCX is very >> new to us and thus it is entirely possible that we are missing something in >> its interaction with OpenMPI. Our MPI is compiled thusly: >> >> >> https://github.com/fasrc/helmod/bl
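
The truncated link above points at the site's actual build recipe; purely as an assumed illustration, an Open MPI build picking up a specific UCX installation would look roughly like the following, with all prefixes as placeholders.

    # Configure Open MPI against a particular UCX install
    # (prefixes are placeholders, not the site's recipe).
    ./configure --prefix=/opt/openmpi-4.0.1 --with-ucx=/opt/ucx-1.6.0-mt
    make -j && make install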

Re: [OMPI users] UCX and MPI_THREAD_MULTIPLE

2019-08-23 Thread Joshua Ladd via users
Paul, Can you provide a repro and command line, please? Also, what network hardware are you using? Josh On Fri, Aug 23, 2019 at 3:35 PM Paul Edmon via users < users@lists.open-mpi.org> wrote: > I have a code using MPI_THREAD_MULTIPLE along with MPI-RMA that I'm > using with OpenMPI 4.0.1. Since
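
A hypothetical shape for the requested command line, useful for isolating UCX; "./rma_mt_test" is a placeholder for the reporter's MPI_THREAD_MULTIPLE + RMA code, and the second run swaps in the ob1 PML for comparison.

    # Run the repro over UCX (binary name is a placeholder)...
    mpirun -np 4 -mca pml ucx -mca osc ucx ./rma_mt_test
    # ...and the same repro with the ob1 PML, to see whether UCX is implicated.
    mpirun -np 4 -mca pml ob1 ./rma_mt_test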

Re: [OMPI users] CUDA-aware codes not using GPU

2019-09-06 Thread Joshua Ladd via users
Did you build UCX with CUDA support (--with-cuda)? Josh On Thu, Sep 5, 2019 at 8:45 PM AFernandez via users < users@lists.open-mpi.org> wrote: > Hello OpenMPI Team, > > I'm trying to use CUDA-aware OpenMPI but the system simply ignores the GPU > and the code runs on the CPUs. I've tried
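
A sketch of how one might check the point raised here; the CUDA path and UCX prefix are placeholders, and the two query commands below are standard UCX/Open MPI introspection tools.

    # Build UCX with CUDA support (paths are placeholders).
    ./configure --prefix=/opt/ucx-cuda --with-cuda=/usr/local/cuda
    # Then check what the installed stack actually reports:
    ucx_info -d | grep -i cuda                                   # CUDA transports in UCX
    ompi_info --parsable --all | grep mpi_built_with_cuda_support:value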

Re: [OMPI users] Trouble with Mellanox's hcoll component and MPI_THREAD_MULTIPLE support?

2020-02-04 Thread Joshua Ladd via users
We cannot reproduce this. On four nodes with 20 PPN, with and without hcoll, it takes exactly the same 19 secs (80 ranks). What version of HCOLL are you using? Command line? Josh On Tue, Feb 4, 2020 at 8:44 AM George Bosilca via users < users@lists.open-mpi.org> wrote: > Hcoll will be present in many
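
For anyone following along, an assumed sketch of the A/B comparison being described; the application name is a placeholder, and coll_hcoll_enable toggles the hcoll component on or off.

    # Same 80-rank job with and without hcoll (binary is a placeholder).
    mpirun -np 80 -mca coll_hcoll_enable 1 ./collectives_benchmark
    mpirun -np 80 -mca coll_hcoll_enable 0 ./collectives_benchmark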

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
Also, can you try running: mpirun -np 128 hostname Josh On Tue, Jan 28, 2020 at 11:49 AM Joshua Ladd wrote: > I don't see how this can be diagnosed as a "problem with the Mellanox > Software". This is on a single node. What happens when you try to launch on > more than one node? > > Josh > >
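
Since the follow-up asks what happens across nodes, a hypothetical multi-node variant of the same sanity test; host names and slot counts are placeholders.

    # Two-node launch of the hostname sanity test (hosts/slots are placeholders).
    mpirun -np 256 --host node1:128,node2:128 hostname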

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
Can you send the output of a failed run, including your command line? Josh On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users < users@lists.open-mpi.org> wrote: > Okay, so this is a problem with the Mellanox software - copying Artem. > > On Jan 28, 2020, at 8:15 AM, Collin Strassburger >

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
Interesting. Can you try: mpirun -np 128 --debug-daemons hostname Josh On Tue, Jan 28, 2020 at 12:14 PM Collin Strassburger < cstrassbur...@bihrle.com> wrote: > In relation to the multi-node attempt, I haven’t set that up yet, as > the per-node configuration doesn’t pass its tests (full

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
I don't see how this can be diagnosed as a "problem with the Mellanox Software". This is on a single node. What happens when you try to launch on more than one node? Josh On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger < cstrassbur...@bihrle.com> wrote: > Here’s the I/O for these high

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
OK. Please try: mpirun -np 128 --debug-daemons --map-by ppr:64:socket hostname Josh On Tue, Jan 28, 2020 at 12:49 PM Collin Strassburger < cstrassbur...@bihrle.com> wrote: > Input: mpirun -np 128 --debug-daemons -mca plm rsh hostname > > > > Output: > > [Gen2Node3:54039] [[16643,0],0]
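
For readers of the archive, the suggested mapping spelled out, assuming dual-socket 64-core EPYC 7742 nodes (128 cores per node).

    # ppr:64:socket places 64 processes per socket, i.e. 128 ranks on a
    # dual-socket 7742 node; --debug-daemons prints the daemons' activity.
    mpirun -np 128 --debug-daemons --map-by ppr:64:socket hostname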

Re: [OMPI users] Trouble with Mellanox's hcoll component and MPI_THREAD_MULTIPLE support?

2020-02-05 Thread Joshua Ladd via users
This is an ancient version of HCOLL. Please upgrade to the latest version (you can do this by installing HPC-X: https://www.mellanox.com/products/hpc-x-toolkit). Josh On Wed, Feb 5, 2020 at 4:35 AM Angel de Vicente wrote: > Hi, > > Joshua Ladd writes: > > > We cannot reproduce this. On four
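
A rough sketch of picking up an HPC-X installation so its bundled HCOLL/UCX are used; the extraction path and version are placeholders, and hpcx-init.sh/hpcx_load are the environment helpers HPC-X ships.

    # Activate HPC-X (path/version are placeholders), then confirm HCOLL comes from it.
    source /opt/hpcx-v2.x/hpcx-init.sh
    hpcx_load
    env | grep -i hcoll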

Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support

2022-03-01 Thread Joshua Ladd via users
These are very, very old versions of UCX and HCOLL installed in your environment. Also, MXM was deprecated years ago in favor of UCX. What version of MOFED is installed (run ofed_info -s)? What HCA generation is present (run ibstat)? Josh On Tue, Mar 1, 2022 at 6:42 AM Angel de Vicente via users
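
The two checks requested in the reply, for reference:

    ofed_info -s     # prints the installed MOFED version string
    ibstat           # shows the HCA model, firmware, and port state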