Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support

2022-03-01 Thread Joshua Ladd via users
These are very, very old versions of UCX and HCOLL installed in your environment. Also, MXM was deprecated years ago in favor of UCX. What version of MOFED is installed (run ofed_info -s)? What HCA generation is present (run ibstat)? Josh On Tue, Mar 1, 2022 at 6:42 AM Angel de Vicente via users
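For reference, both checks can be run directly on a compute node (assuming a standard MOFED install that puts these tools on the PATH):

$ ofed_info -s     # prints the installed MOFED version string
$ ibstat           # lists each HCA, its CA type (ConnectX generation), firmware, and port state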

Re: [OMPI users] Trouble with Mellanox's hcoll component and MPI_THREAD_MULTIPLE support?

2020-02-05 Thread Joshua Ladd via users
This is an ancient version of HCOLL. Please upgrade to the latest version (you can do this by installing HPC-X https://www.mellanox.com/products/hpc-x-toolkit) Josh On Wed, Feb 5, 2020 at 4:35 AM Angel de Vicente wrote: > Hi, > > Joshua Ladd writes: > > > We cannot repro
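For anyone following along, picking up a newer HCOLL via HPC-X looks roughly like this; the tarball name is a placeholder and the exact steps can differ per release, so check the README shipped in the archive:

$ tar -xjf hpcx-<version>.tbz && cd hpcx-<version>
$ source hpcx-init.sh && hpcx_load
$ ompi_info | grep -i hcoll     # confirm the bundled Open MPI sees the newer hcoll component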

Re: [OMPI users] Trouble with Mellanox's hcoll component and MPI_THREAD_MULTIPLE support?

2020-02-04 Thread Joshua Ladd via users
We cannot reproduce this. On four nodes 20 PPN with and w/o hcoll it takes exactly the same 19 secs (80 ranks). What version of HCOLL are you using? Command line? Josh On Tue, Feb 4, 2020 at 8:44 AM George Bosilca via users < users@lists.open-mpi.org> wrote: > Hcoll will be present in many

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
------- > > 128 total processes failed to start > > > > Collin > > > > *From:* Joshua Ladd > *Sent:* Tuesday, January 28, 2020 12:48 PM > *To:* Collin Strassburger > *Cc:* Open MPI Users ; Ralph Castain < > r...@o

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
> > 128 total processes failed to start > > > > > > Collin > > > > > > *From:* users *On Behalf Of *Ralph > Castain via users > *Sent:* Tuesday, January 28, 2020 12:06 PM > *To:* Joshua Ladd > *Cc:* Ralph Castain ; Open MPI Users < > users@li

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
Also, can you try running: mpirun -np 128 hostname Josh On Tue, Jan 28, 2020 at 11:49 AM Joshua Ladd wrote: > I don't see how this can be diagnosed as a "problem with the Mellanox > Software". This is on a single node. What happens when you try to launch on > more than

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
specified application as it encountered an > > error: > > > > Error code: 63 > > Error name: (null) > > Node: Gen2Node4 > > > > when attempting to start process rank 0. > > ------ > > 128 tot

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
Can you send the output of a failed run including your command line. Josh On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users < users@lists.open-mpi.org> wrote: > Okay, so this is a problem with the Mellanox software - copying Artem. > > On Jan 28, 2020, at 8:15 AM, Collin Strassburger >

Re: [OMPI users] CUDA-aware codes not using GPU

2019-09-06 Thread Joshua Ladd via users
Did you build UCX with CUDA support (--with-cuda) ? Josh On Thu, Sep 5, 2019 at 8:45 PM AFernandez via users < users@lists.open-mpi.org> wrote: > Hello OpenMPI Team, > > I'm trying to use CUDA-aware OpenMPI but the system simply ignores the GPU > and the code runs on the CPUs. I've tried
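A minimal build sketch for a CUDA-aware stack, assuming CUDA is installed under /usr/local/cuda; the install prefixes under $HOME are placeholders:

# UCX with CUDA support
$ ./configure --prefix=$HOME/ucx --with-cuda=/usr/local/cuda && make -j install
# Open MPI built against that UCX, also CUDA-aware
$ ./configure --prefix=$HOME/ompi --with-ucx=$HOME/ucx --with-cuda=/usr/local/cuda && make -j install
# sanity check: CUDA transports should appear
$ $HOME/ucx/bin/ucx_info -d | grep -i cuda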

Re: [OMPI users] UCX and MPI_THREAD_MULTIPLE

2019-08-26 Thread Joshua Ladd via users
**apropos :-) On Mon, Aug 26, 2019 at 9:19 PM Joshua Ladd wrote: > Hi, Paul > > I must say, this is eerily appropo. I've just sent a request for Wombat > last week as I was planning to have my group start looking at the > performance of UCX OSC on IB. We are most interested in e

Re: [OMPI users] UCX and MPI_THREAD_MULTIPLE

2019-08-26 Thread Joshua Ladd via users
it was built using the default version > of UCX that comes with EPEL (1.5.1). We only built 1.6.0 as the version > provided by EPEL did not build with MT enabled, which to me seems strange > as I don't see any reason not to build with MT enabled. Anyways that's the > deeper context. &

Re: [OMPI users] UCX and MPI_THREAD_MULTIPLE

2019-08-23 Thread Joshua Ladd via users
Paul, Can you provide a repro and command line, please. Also, what network hardware are you using? Josh On Fri, Aug 23, 2019 at 3:35 PM Paul Edmon via users < users@lists.open-mpi.org> wrote: > I have a code using MPI_THREAD_MULTIPLE along with MPI-RMA that I'm > using OpenMPI 4.0.1. Since

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Joshua Ladd via users
Hi, Noam Can you try your original command line with the following addition: mpirun --mca pml ucx --mca btl ^vader,tcp,openib --mca osc ucx I think we're seeing some conflict between UCX PML and UCT OSC. Josh On Wed, Jun 19, 2019 at 4:36 PM Noam Bernstein via users <
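Written out on one line (rank count and binary name are placeholders):

$ mpirun -np 64 --mca pml ucx --mca btl ^vader,tcp,openib --mca osc ucx ./your_app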

Re: [OMPI users] OSHMEM: shmem_ptr always returns NULL

2018-06-01 Thread Joshua Ladd
**xpmem kernel module. On Fri, Jun 1, 2018 at 3:16 PM, Joshua Ladd wrote: > Hi, Marcin > > Sorry for the late response (somehow this one got lost in the clutter). We > added support for shmem_ptr in the UCX SPML in Open MPI 3.0. However, in > order to use it, you must install

Re: [OMPI users] OSHMEM: shmem_ptr always returns NULL

2018-06-01 Thread Joshua Ladd
Hi, Marcin Sorry for the late response (somehow this one got lost in the clutter). We added support for shmem_ptr in the UCX SPML in Open MPI 3.0. However, in order to use it, you must install the Knem kernel module (https://github.com/hjelmn/xpmem). Best, Josh On Wed, Apr 18, 2018 at 4:01

Re: [OMPI users] Bottleneck of OpenMPI over 100Gbps ROCE

2017-08-25 Thread Joshua Ladd
Hi, There is a known issue in ConnectX-4 which impacts RDMA_READ bandwidth with a single QP. The overhead in the HCA of processing a single RDMA_READ response packet is too high due to the need to lock the QP. With a small MTU (as is the case with Ethernet packets), the impact is magnified

Re: [OMPI users] KNEM errors when running OMPI 2.0.1

2017-01-17 Thread Joshua Ladd
Can you please attach your configure log. It looks like both MXM and the Vader BTL (used for OSC) are complaining because they can't find your KNEM installation. Josh On Tue, Jan 17, 2017 at 7:16 AM, Juan A. Cordero Varelaq < bioinformatica-i...@us.es> wrote: > Hi, I am running on my SCG

Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?

2016-08-15 Thread Joshua Ladd
Hi, Martin The environment variable: MXM_RDMA_PORTS=device:port is what you're looking for. You can specify a device/port pair on your OMPI command line like: mpirun -np 2 ... -x MXM_RDMA_PORTS=mlx4_0:1 ... Best, Josh On Fri, Aug 12, 2016 at 5:03 PM, Audet, Martin

Re: [OMPI users] Ability to overlap communication and computation on Infiniband

2016-08-01 Thread Joshua Ladd
>> When a more modern MTL/mxm or PML/yalla framework/component is used, I hope things are different and result in more communication/computation overlap potential. >Others will need to comment on that; the cm PML (i.e., all MTLs) and PML/yalla are super-thin shims to get to the underlying

Re: [OMPI users] Fw: OpenSHMEM Runtime Error

2016-06-23 Thread Joshua Ladd
Ryan, Four suggestions are provided in the help output. Please try these. Josh On Thu, Jun 23, 2016 at 1:25 AM, Jeff Squyres (jsquyres) wrote: > Ryan -- > > Did you try the suggestions listed in the help message? > > > > On Jun 23, 2016, at 1:24 AM, RYAN RAY

Re: [OMPI users] SLOAVx alltoallv

2016-05-06 Thread Joshua Ladd
It did not make it upstream. Josh On Fri, May 6, 2016 at 9:28 AM, Gilles Gouaillardet < gilles.gouaillar...@gmail.com> wrote: > Dave, > > I briefly read the papers and it suggests the SLOAVx algorithm is > implemented by the ml collective module > this module had some issues and was judged not

Re: [OMPI users] Problems using 1.10.2 with MOFED 3.1-1.1.0.1

2016-05-05 Thread Joshua Ladd
We are working with Andy offline. Josh On Thu, May 5, 2016 at 7:32 AM, Andy Riebs wrote: > I've built 1.10.2 with all my favorite configuration options, but I get > messages such as this (one for each rank with orte_base_help_aggregate=0) > when I try to run on a MOFED

Re: [OMPI users] terrible infiniband performance for

2016-03-23 Thread Joshua Ladd
Hi, Ron Please include the command line you used in your tests. Have you run any sanity checks, like OSU latency and bandwidth benchmarks between the nodes? Josh On Wed, Mar 23, 2016 at 8:47 AM, Ronald Cohen wrote: > Thank you! Here are the answers: > > I did not try a

Re: [OMPI users] Questions about non-blocking collective calls...

2015-10-22 Thread Joshua Ladd
Instead of posting a single, big MPI_Igather, why not post and simultaneously progress multiple, small MPI_Igathers? In this way, you can pipeline multiple outstanding collectives and do post collective processing as soon as requests complete. Without network hardware offload capability, Gilles'
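A rough C sketch of that pipelining idea; the chunk count, chunk size, and process_chunk() hook are invented for illustration:

#include <mpi.h>
#include <stdlib.h>

#define NCHUNKS 8       /* small gathers kept in flight */
#define CHUNK   1024    /* elements per rank per gather */

/* placeholder for whatever post-collective processing is needed */
static void process_chunk(double *chunk, int nranks) { (void)chunk; (void)nranks; }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc(NCHUNKS * CHUNK * sizeof(double));
    double *recvbuf = (rank == 0) ? malloc((size_t)NCHUNKS * CHUNK * size * sizeof(double)) : NULL;
    for (int i = 0; i < NCHUNKS * CHUNK; i++) sendbuf[i] = (double)rank;

    MPI_Request reqs[NCHUNKS];

    /* Post several small gathers instead of one big one. */
    for (int i = 0; i < NCHUNKS; i++) {
        void *rbuf = (rank == 0) ? (void *)(recvbuf + (size_t)i * CHUNK * size) : NULL;
        MPI_Igather(sendbuf + i * CHUNK, CHUNK, MPI_DOUBLE,
                    rbuf, CHUNK, MPI_DOUBLE, 0, MPI_COMM_WORLD, &reqs[i]);
    }

    /* Process each chunk as soon as its gather finishes; without hardware
     * offload, these MPI calls are also what progresses the remaining requests. */
    for (int done = 0; done < NCHUNKS; done++) {
        int idx;
        MPI_Waitany(NCHUNKS, reqs, &idx, MPI_STATUS_IGNORE);
        if (rank == 0) process_chunk(recvbuf + (size_t)idx * CHUNK * size, size);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}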

Re: [OMPI users] orted segmentation fault in pmix on master

2015-06-11 Thread Joshua Ladd
Ken, Could you try to launch the job with aprun instead of mpirun? Thanks, Josh On Thu, Jun 11, 2015 at 12:21 PM, Howard Pritchard wrote: > Hello Ken, > > Could you give the details of the allocation request (qsub args) > as well as the mpirun command line args? I'm

[OMPI users] Fwd: job post

2015-05-22 Thread Joshua Ladd
Dear Open MPI Community, I'd like to advertise multiple positions of particular relevance to this community. Please feel free to contact me directly or our US Hiring Manager, Scott Chong sco...@mellanox.com, if you or someone you know may be a good fit. Two open positions reporting to me. Can

Re: [OMPI users] disappearance of the memory registration error in 1.8.x?

2015-03-11 Thread Joshua Ladd
Hi, Greg We changed the default behavior to essentially assume folks were running with current MOFED/OFED drivers which allow one to register twice the amount of physical memory. If you are running OFED less than 2.0 or using older drivers, then you should set the following mca parameter: -mca

Re: [OMPI users] Several Bcast calls in a loop causing the code to hang

2015-02-23 Thread Joshua Ladd
der "Basic" colls. Josh On Mon, Feb 23, 2015 at 4:13 PM, Nathan Hjelm <hje...@lanl.gov> wrote: > > Josh, do you see a hang when using vader? It is preferred over the old > sm btl. > > -Nathan > > On Mon, Feb 23, 2015 at 03:48:17PM -0500, Joshua Ladd wr

Re: [OMPI users] Several Bcast calls in a loop causing the code to hang

2015-02-23 Thread Joshua Ladd
2015 at 9:31 PM, Sachin Krishnan <sachk...@gmail.com> >> wrote: >> >>> Josh, >>> >>> Thanks for the help. >>> I'm running on a single host. How do I confirm that it is an issue with >>> the shared memory? >>> >>> Sac

Re: [OMPI users] Several Bcast calls in a loop causing the code to hang

2015-02-20 Thread Joshua Ladd
Sachin, Are you running this on a single host or across multiple hosts (i.e., are you communicating between processes over a network)? If it's on a single host, then it might be an issue with shared memory. Josh On Fri, Feb 20, 2015 at 1:51 AM, Sachin Krishnan wrote: >

Re: [OMPI users] Several Bcast calls in a loop causing the code to hang

2015-02-18 Thread Joshua Ladd
Sachin, Can you, please, provide a command line? Additional information about your system could be helpful also. Josh On Wed, Feb 18, 2015 at 3:43 AM, Sachin Krishnan wrote: > Hello, > > I am new to MPI and also this list. > I wrote an MPI code with several MPI_Bcast calls

Re: [OMPI users] Determine IB transport type of OpenMPI job

2015-01-09 Thread Joshua Ladd
Open MPI's openib BTL only supports RC transport. Best, Josh Sent from my iPhone > On Jan 9, 2015, at 9:03 AM, "Sasso, John (GE Power & Water, Non-GE)" > wrote: > > For a multi-node job using OpenMPI 1.6.5 over InfiniBand where the OFED > library is used, is there a

Re: [OMPI users] Warning about not enough registerable memory on SL6.6

2014-12-08 Thread Joshua Ladd
Hi, This should be fixed in OMPI 1.8.3. Is it possible for you to give 1.8.3 a shot? Best, Josh On Mon, Dec 8, 2014 at 8:43 AM, Götz Waschk wrote: > Dear Open-MPI experts, > > I have updated my little cluster from Scientific Linux 6.5 to 6.6, > this included extensive

Re: [OMPI users] File-backed mmaped I/O and openib btl.

2014-11-11 Thread Joshua Ladd
h the simplified test case. I hope someone will be able to > reproduce the problem. > > Best regards, > > E. > > > On Mon, Nov 10, 2014 at 5:48 PM, Emmanuel Thomé > <emmanuel.th...@gmail.com> wrote: > > Thanks for your answer. > > > > On Mon,

Re: [OMPI users] File-backed mmaped I/O and openib btl.

2014-11-10 Thread Joshua Ladd
Just really quick off the top of my head, mmaping relies on the virtual memory subsystem, whereas IB RDMA operations rely on physical memory being pinned (unswappable.) For a large message transfer, the OpenIB BTL will register the user buffer, which will pin the pages and make them unswappable.

Re: [OMPI users] low CPU utilization with OpenMPI

2014-10-23 Thread Joshua Ladd
It's not coming from OSHMEM but from the OPAL "shmem" framework. You are going to get terrible performance - possibly slowing to a crawl having all processes open their backing files for mmap on NFS. I think that's the error that he's getting. Josh On Thu, Oct 23, 2014 at 6:06 AM, Vinson Leung

Re: [OMPI users] Update/patch to check/opal_check_pmi.m4

2014-10-06 Thread Joshua Ladd
We only link in libpmi(2).so if specifically requested to do so via "--with-pmi" configure flag. It is not automatic. Josh On Mon, Oct 6, 2014 at 3:28 PM, Timothy Brown wrote: > Hi, > > I’m not too sure if this is the right list, or if I should be posting to > the
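For example (the SLURM prefix is a placeholder; point it at whatever directory holds include/pmi.h or pmi2.h and lib/libpmi*.so):

$ ./configure --with-pmi=/opt/slurm ...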

Re: [OMPI users] Clarification about OpenMPI, slurm and PMI interface

2014-08-20 Thread Joshua Ladd
Hi, Filippo When launching with mpirun in a SLURM environment, srun is only being used to launch the ORTE daemons (orteds.) Since the daemon will already exist on the node from which you invoked mpirun, this node will not be included in the list of nodes. SLURM's PMI library is not involved

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Joshua Ladd
Maxime, Can you run with: mpirun -np 4 --mca plm_base_verbose 10 /path/to/examples/ring_c On Mon, Aug 18, 2014 at 12:21 PM, Maxime Boissonneault < maxime.boissonnea...@calculquebec.ca> wrote: > Hi, > I just did compile without Cuda, and the result is the same. No output, > exits with

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-15 Thread Joshua Ladd
output when running and exited with code 65. > > Thanks, > > Maxime > > Le 2014-08-14 15:26, Joshua Ladd a écrit : > > One more, Maxime, can you please make sure you've covered everything > here: > > http://www.open-mpi.org/community/help/ > > Josh > >

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Joshua Ladd
One more, Maxime, can you please make sure you've covered everything here: http://www.open-mpi.org/community/help/ Josh On Thu, Aug 14, 2014 at 3:18 PM, Joshua Ladd <jladd.m...@gmail.com> wrote: > And maybe include your LD_LIBRARY_PATH > > Josh > > > On Thu, Aug 14,

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Joshua Ladd
And maybe include your LD_LIBRARY_PATH Josh On Thu, Aug 14, 2014 at 3:16 PM, Joshua Ladd <jladd.m...@gmail.com> wrote: > Can you try to run the example code "ring_c" across nodes? > > Josh > > > On Thu, Aug 14, 2014 at 3:14 PM, Maxime Boissonneault < &

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Joshua Ladd
n the OpenMPI 1.8.1 build and the gromacs build. For OpenMPI 1.8.2rc4 > however, it was the exact same compiler for everything. > > Maxime > > Le 2014-08-14 14:57, Joshua Ladd a écrit : > > Hmmm...weird. Seems like maybe a mismatch between libraries. Did you > build OMPI with the

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Joshua Ladd
shared --enable-static \ > --with-io-romio-flags="--with-file-system=nfs+lustre" \ > --without-loadleveler --without-slurm --with-tm \ >--with-cuda=$(dirname $(dirname $(which nvcc))) > > Maxime > > > Le 2014-08-14 14:20, Joshua Ladd a écrit : &

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Joshua Ladd
Boissonneault < maxime.boissonnea...@calculquebec.ca> wrote: > Hi, > I ran gromacs successfully with OpenMPI 1.8.1 and Cuda 6.0.37 on a single > node, with 8 ranks and multiple OpenMP threads. > > Maxime > > > Le 2014-08-14 14:15, Joshua Ladd a écrit : > > Hi, Ma

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Joshua Ladd
Hi, Maxime Just curious, are you able to run a vanilla MPI program? Can you try one of the example programs in the "examples" subdirectory? Looks like a threading issue to me. Thanks, Josh

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-23 Thread Joshua Ladd
Ahsan, This link might be helpful in trying to diagnose and treat IB fabric issues: http://docs.oracle.com/cd/E18476_01/doc.220/e18478/fabric.htm#CIHIHJGD You might try resetting the problematic port, or just use port 2 for your jobs as a quick workaround: -mca btl_openib_if_include mlx4_0:2
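For example (rank count and binary name are placeholders):

$ mpirun -np 16 --mca btl_openib_if_include mlx4_0:2 ./your_app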

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-22 Thread Joshua Ladd
Sayed, You might try this link (or have your sysadmin do it if you do not have admin privileges.) To me it looks like your second port is in the "INIT" state but has not been added by the subnet manager.

Re: [OMPI users] poor performance using the openib btl

2014-06-26 Thread Joshua Ladd
You might try restarting the device drivers. $pdsh -g yourcluster service openibd restart Josh Sent from my iPhone > On Jun 26, 2014, at 6:55 AM, "Jeff Squyres (jsquyres)" > wrote: > > Just curious -- if you run standard ping-pong kinds of MPI benchmarks with > the

Re: [OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb error after Infini-band stack update.

2014-06-20 Thread Joshua Ladd
Aleksandar, Please ensure your system administrator follows the guidelines outlined in the link printed in the error message http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages Best, Josh On Fri, Jun 20, 2014 at 2:56 PM, Ivanov, Aleksandar (INR) < aleksandar.iva...@kit.edu>
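The usual fix from that FAQ entry is to raise the locked-memory limit for the users running MPI jobs, roughly along these lines (the exact values and scope are a site decision):

# /etc/security/limits.conf (or a file under /etc/security/limits.d/)
* soft memlock unlimited
* hard memlock unlimited

Afterwards, log in again (and restart any resource-manager daemons) so the new limit propagates, and verify with "ulimit -l" inside a job.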

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Joshua Ladd
lp to use RDMACM though as you will just see the > resource failure somewhere else. UDCM is not the problem. Something is > wrong with the system. Allocating a 512 entry CQ should not fail. > > -Nathan > > On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote: > >

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Joshua Ladd
I'm guessing it's a resource limitation issue coming from Torque. Hmm...I found something interesting on the interwebs that looks awfully similar: http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html Greg, if the suggestion from the Torque users doesn't resolve your

Re: [OMPI users] OPENIB unknown transport errors

2014-06-05 Thread Joshua Ladd
ing is fine. > > I'm really very befuddled by this. OpenMPI sees that the two cards are the > same and made by the same vendor, yet it thinks the transport types are > different (and one is unknown). I'm hoping someone with some experience > with how the OpenIB BTL works can shed som

Re: [OMPI users] intel compiler and openmpi 1.8.1

2014-05-29 Thread Joshua Ladd
Lorenzo, set $> export PATH=/Users/lorenzodona/Documents/openmpi-1.8.1/bin:$PATH $> export LD_LIBRARY_PATH=/Users/lorenzodona/Documents/openmpi-1.8.1/lib:$LD_LIBRARY_PATH Then try. Josh On Thu, May 29, 2014 at 9:04 AM, Ralph Castain wrote: > Are you sure you have

Re: [OMPI users] Using PMI as RTE component

2014-05-15 Thread Joshua Ladd
Hadi, Is your job launching and executing normally? During the launch, frameworks are initialized by opening all components, selecting the desired one, and closing the others. I think you're just seeing components being opened, queried, and ultimately closed. The important thing is knowing if PMI

Re: [OMPI users] OPENIB unknown transport errors

2014-05-09 Thread Joshua Ladd
Hi, Tim Run "ibstat" on each host: 1. Make sure the adapters are alive and active. 2. Look at the Link Layer settings for host w34. Does it match host w4's? Josh On Fri, May 9, 2014 at 1:18 PM, Tim Miller wrote: > Hi All, > > We're using OpenMPI 1.7.3 with Mellanox

Re: [OMPI users] Call stack upon MPI routine error

2014-03-21 Thread Joshua Ladd
Hi, Vince Couple of ideas off the top of my head: 1. Try disabling eager RDMA. Eager RDMA can consume significant resources: "-mca btl_openib_use_eager_rdma 0" 2. Try using the TCP BTL - is the error still present? 3. Try the poor man's debugger - print the pid and hostname of the process
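A common form of the poor man's debugger from item 3, dropped near the top of main(); the volatile flag name is arbitrary:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[256];
    gethostname(host, sizeof(host));
    printf("rank %d: pid %d on %s waiting for debugger\n", rank, (int)getpid(), host);
    fflush(stdout);

    /* attach with: gdb -p <pid>, then: set var holding = 0; continue */
    volatile int holding = 1;
    while (holding)
        sleep(1);

    /* ... rest of the application ... */

    MPI_Finalize();
    return 0;
}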

[OMPI users] FW: LOCAL QP OPERATION ERROR

2014-03-11 Thread Joshua Ladd
Hi, Vince Have you tried with a different BTL? In particular, have you tried with the TCP BTL? Please try setting "-mca btl sm,self,tcp" and see if you still run into the issue. How is your OMPI configured? Josh > From: Vince Grimes > Subject: [OMPI users] LOCAL QP
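Spelled out as a full command line (rank count and binary name are placeholders):

$ mpirun -np 4 -mca btl sm,self,tcp ./your_app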