The versions of UCX and HCOLL installed in your environment are very, very
old. Also, MXM was deprecated years ago in favor of UCX. What version of
MOFED is installed (run ofed_info -s)? What HCA generation is present (run
ibstat)?
Josh
On Tue, Mar 1, 2022 at 6:42 AM Angel de Vicente via users
This is an ancient version of HCOLL. Please upgrade to the latest version
(you can do this by installing HPC-X
https://www.mellanox.com/products/hpc-x-toolkit)
Josh
On Wed, Feb 5, 2020 at 4:35 AM Angel de Vicente
wrote:
> Hi,
>
> Joshua Ladd writes:
>
> > We cannot repro
We cannot reproduce this. On four nodes with 20 PPN (80 ranks), it takes
exactly the same 19 seconds with and without hcoll.
What version of HCOLL are you using? What is your command line?
Josh
On Tue, Feb 4, 2020 at 8:44 AM George Bosilca via users <
users@lists.open-mpi.org> wrote:
> Hcoll will be present in many
-------
>
> 128 total processes failed to start
>
>
>
> Collin
>
>
>
> *From:* Joshua Ladd
> *Sent:* Tuesday, January 28, 2020 12:48 PM
> *To:* Collin Strassburger
> *Cc:* Open MPI Users ; Ralph Castain <
> r...@o
>
> 128 total processes failed to start
>
>
>
>
>
> Collin
>
>
>
>
>
> *From:* users *On Behalf Of *Ralph
> Castain via users
> *Sent:* Tuesday, January 28, 2020 12:06 PM
> *To:* Joshua Ladd
> *Cc:* Ralph Castain ; Open MPI Users <
> users@li
Also, can you try running:
mpirun -np 128 hostname
Josh
On Tue, Jan 28, 2020 at 11:49 AM Joshua Ladd wrote:
> I don't see how this can be diagnosed as a "problem with the Mellanox
> Software". This is on a single node. What happens when you try to launch on
> more than
specified application as it encountered an
>
> error:
>
>
>
> Error code: 63
>
> Error name: (null)
>
> Node: Gen2Node4
>
>
>
> when attempting to start process rank 0.
>
> ------
>
> 128 tot
Can you send the output of a failed run, including your command line?
Josh
On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <
users@lists.open-mpi.org> wrote:
> Okay, so this is a problem with the Mellanox software - copying Artem.
>
> On Jan 28, 2020, at 8:15 AM, Collin Strassburger
>
Did you build UCX with CUDA support (--with-cuda)?
Josh
On Thu, Sep 5, 2019 at 8:45 PM AFernandez via users <
users@lists.open-mpi.org> wrote:
> Hello OpenMPI Team,
>
> I'm trying to use CUDA-aware OpenMPI but the system simply ignores the GPU
> and the code runs on the CPUs. I've tried
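As a quick sanity check (not from the original thread), Open MPI exposes a
CUDA-aware query extension in mpi-ext.h; below is a minimal sketch, assuming
an Open MPI build where that extension header is available (the macro and
function are Open MPI-specific and absent in other MPI implementations):

/* cuda_check.c: report compile-time and run-time CUDA-aware support.
 * Build: mpicc cuda_check.c -o cuda_check ; run: mpirun -np 1 ./cuda_check */
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>   /* Open MPI extensions, including CUDA-aware query */
#endif

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("compile-time CUDA-aware support: yes\n");
    printf("run-time CUDA-aware support:     %s\n",
           MPIX_Query_cuda_support() ? "yes" : "no");
#else
    printf("this Open MPI build advertises no CUDA-aware support\n");
#endif
    MPI_Finalize();
    return 0;
}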
**apropos :-)
On Mon, Aug 26, 2019 at 9:19 PM Joshua Ladd wrote:
> Hi, Paul
>
> I must say, this is eerily appropo. I've just sent a request for Wombat
> last week as I was planning to have my group start looking at the
> performance of UCX OSC on IB. We are most interested in e
it was built using the default version
> of UCX that comes with EPEL (1.5.1). We only built 1.6.0 as the version
> provided by EPEL did not build with MT enabled, which to me seems strange
> as I don't see any reason not to build with MT enabled. Anyways that's the
> deeper context.
>
Paul,
Can you provide a repro and command line, please? Also, what network
hardware are you using?
Josh
On Fri, Aug 23, 2019 at 3:35 PM Paul Edmon via users <
users@lists.open-mpi.org> wrote:
> I have a code using MPI_THREAD_MULTIPLE along with MPI-RMA that I'm
> running with OpenMPI 4.0.1. Since
Hi, Noam
Can you try your original command line with the following addition:
mpirun --mca pml ucx --mca btl ^vader,tcp,openib --mca osc ucx
I think we're seeing some conflict between UCX PML and UCT OSC.
Josh
On Wed, Jun 19, 2019 at 4:36 PM Noam Bernstein via users <
**xpmem kernel module.
On Fri, Jun 1, 2018 at 3:16 PM, Joshua Ladd wrote:
> Hi, Marcin
>
> Sorry for the late response (somehow this one got lost in the clutter). We
> added support for shmem_ptr in the UCX SPML in Open MPI 3.0. However, in
> order to use it, you must install
Hi, Marcin
Sorry for the late response (somehow this one got lost in the clutter). We
added support for shmem_ptr in the UCX SPML in Open MPI 3.0. However, in
order to use it, you must install the Knem kernel module (
https://github.com/hjelmn/xpmem).
Best,
Josh
On Wed, Apr 18, 2018 at 4:01
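For reference, a minimal OpenSHMEM sketch (illustrative, not from the
thread) of what shmem_ptr gives you once the kernel module is in place; it
assumes a standard OpenSHMEM installation (compile with oshcc, run with
oshrun):

/* shmem_ptr_demo.c: shmem_ptr() returns a directly load/store-accessible
 * pointer to a neighbor's symmetric object when the SPML supports it
 * (e.g. same node with the xpmem/knem module loaded), NULL otherwise. */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    int *sym = shmem_malloc(sizeof(int));   /* symmetric allocation */
    *sym = me;
    shmem_barrier_all();

    int peer = (me + 1) % npes;
    int *remote = shmem_ptr(sym, peer);     /* NULL if no load/store path */
    if (remote != NULL)
        printf("PE %d reads PE %d's value directly: %d\n", me, peer, *remote);
    else
        printf("PE %d: no direct mapping to PE %d (xpmem/knem missing?)\n",
               me, peer);

    shmem_barrier_all();
    shmem_free(sym);
    shmem_finalize();
    return 0;
}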
Hi,
There is a known issue in ConnectX-4 which impacts RDMA_READ bandwidth with
a single QP. The overhead in the HCA of processing a single RDMA_READ
response packet is too high due to the need to lock the QP. With a small
MTU (as is the case with Ethernet packets), the impact is magnified
Can you please attach your configure log? It looks like both MXM and the
Vader BTL (used for OSC) are complaining because they can't find your KNEM
installation.
Josh
On Tue, Jan 17, 2017 at 7:16 AM, Juan A. Cordero Varelaq <
bioinformatica-i...@us.es> wrote:
> Hi, I am running on my SCG
Hi, Martin
The environment variable:
MXM_RDMA_PORTS=device:port
is what you're looking for. You can specify a device/port pair on your OMPI
command line like:
mpirun -np 2 ... -x MXM_RDMA_PORTS=mlx4_0:1 ...
Best,
Josh
On Fri, Aug 12, 2016 at 5:03 PM, Audet, Martin
>> When a more modern MTL/mxm or PML/yalla framework/component is used, I
hope things are different and result in more communication/computation
overlap potential.
>Others will need to comment on that; the cm PML (i.e., all MTLs) and
PML/yalla are super-thin shims to get to the underlying
Ryan,
Four suggestions are provided in the help output. Please try these.
Josh
On Thu, Jun 23, 2016 at 1:25 AM, Jeff Squyres (jsquyres) wrote:
> Ryan --
>
> Did you try the suggestions listed in the help message?
>
>
> > On Jun 23, 2016, at 1:24 AM, RYAN RAY
It did not make it upstream.
Josh
On Fri, May 6, 2016 at 9:28 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:
> Dave,
>
> I briefly read the papers and it suggests the SLOAVx algorithm is
> implemented by the ml collective module
> this module had some issues and was judged not
We are working with Andy offline.
Josh
On Thu, May 5, 2016 at 7:32 AM, Andy Riebs wrote:
> I've built 1.10.2 with all my favorite configuration options, but I get
> messages such as this (one for each rank with orte_base_help_aggregate=0)
> when I try to run on a MOFED
Hi, Ron
Please include the command line you used in your tests. Have you run any
sanity checks, like OSU latency and bandwidth benchmarks between the nodes?
Josh
On Wed, Mar 23, 2016 at 8:47 AM, Ronald Cohen wrote:
> Thank you! Here are the answers:
>
> I did not try a
Instead of posting a single, big MPI_Igather, why not post and
simultaneously progress multiple, small MPI_Igathers? In this way, you can
pipeline multiple outstanding collectives and do post-collective processing
as soon as requests complete. Without network hardware offload
capability, Gilles'
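A minimal sketch of the pipelining idea described above (illustrative only;
NCHUNKS, CHUNK, and process_chunk are made-up names, not from this thread):

/* igather_pipeline.c: split the send buffer into NCHUNKS slices, gather
 * each with its own nonblocking call, and let the root process each slice
 * as soon as its request completes. */
#include <stdlib.h>
#include <mpi.h>

#define NCHUNKS 8
#define CHUNK   1024            /* elements per rank per chunk */

static void process_chunk(const double *buf, int n) { (void)buf; (void)n; }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *send = malloc(NCHUNKS * CHUNK * sizeof(double));
    double *recv = rank == 0
        ? malloc((size_t)NCHUNKS * CHUNK * size * sizeof(double)) : NULL;
    MPI_Request req[NCHUNKS];

    for (int c = 0; c < NCHUNKS; c++)       /* post all gathers up front */
        MPI_Igather(send + c * CHUNK, CHUNK, MPI_DOUBLE,
                    rank == 0 ? recv + (size_t)c * CHUNK * size : NULL,
                    CHUNK, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req[c]);

    for (int done = 0; done < NCHUNKS; done++) {
        int idx;
        MPI_Waitany(NCHUNKS, req, &idx, MPI_STATUS_IGNORE);
        if (rank == 0)                      /* per-slice post-processing */
            process_chunk(recv + (size_t)idx * CHUNK * size, CHUNK * size);
    }

    free(send); free(recv);
    MPI_Finalize();
    return 0;
}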
Ken,
Could you try to launch the job with aprun instead of mpirun?
Thanks,
Josh
On Thu, Jun 11, 2015 at 12:21 PM, Howard Pritchard
wrote:
> Hello Ken,
>
> Could you give the details of the allocation request (qsub args)
> as well as the mpirun command line args? I'm
Dear Open MPI Community,
I'd like to advertise multiple positions of particular relevance to this
community. Please feel free to contact me directly or our US Hiring
Manager, Scott Chong sco...@mellanox.com, if you or someone you know may
be a good fit.
Two open positions reporting to me. Can
Hi, Greg
We changed the default behavior to essentially assume folks were running
with current MOFED/OFED drivers which allow one to register twice the
amount of physical memory. If you are running OFED less than 2.0 or using
older drivers, then you should set the following mca parameter:
-mca
der "Basic" colls.
Josh
On Mon, Feb 23, 2015 at 4:13 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>
> Josh, do you see a hang when using vader? It is preferred over the old
> sm btl.
>
> -Nathan
>
> On Mon, Feb 23, 2015 at 03:48:17PM -0500, Joshua Ladd wr
2015 at 9:31 PM, Sachin Krishnan <sachk...@gmail.com>
>> wrote:
>>
>>> Josh,
>>>
>>> Thanks for the help.
>>> I'm running on a single host. How do I confirm that it is an issue with
>>> the shared memory?
>>>
>>> Sac
Sachin,
Are you running this on a single host or across multiple hosts (i.e., are
you communicating between processes via networking)? If it's on a single
host, then it might be an issue with shared memory.
Josh
On Fri, Feb 20, 2015 at 1:51 AM, Sachin Krishnan wrote:
>
Sachin,
Can you please provide a command line? Additional information about your
system would also be helpful.
Josh
On Wed, Feb 18, 2015 at 3:43 AM, Sachin Krishnan wrote:
> Hello,
>
> I am new to MPI and also this list.
> I wrote an MPI code with several MPI_Bcast calls
Open MPI's openib BTL only supports RC transport.
Best,
Josh
Sent from my iPhone
> On Jan 9, 2015, at 9:03 AM, "Sasso, John (GE Power & Water, Non-GE)"
> wrote:
>
> For a multi-node job using OpenMPI 1.6.5 over InfiniBand where the OFED
> library is used, is there a
Hi,
This should be fixed in OMPI 1.8.3. Is it possible for you to give 1.8.3 a
shot?
Best,
Josh
On Mon, Dec 8, 2014 at 8:43 AM, Götz Waschk wrote:
> Dear Open-MPI experts,
>
> I have updated my little cluster from Scientific Linux 6.5 to 6.6,
> this included extensive
h the simplified test case. I hope someone will be able to
> reproduce the problem.
>
> Best regards,
>
> E.
>
>
> On Mon, Nov 10, 2014 at 5:48 PM, Emmanuel Thomé
> <emmanuel.th...@gmail.com> wrote:
> > Thanks for your answer.
> >
> > On Mon,
Just really quick off the top of my head: mmapping relies on the virtual
memory subsystem, whereas IB RDMA operations rely on physical memory being
pinned (unswappable). For a large message transfer, the OpenIB BTL will
register the user buffer, which will pin the pages and make them
unswappable.
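For anyone unfamiliar with what "registering" means here, a minimal ibverbs
sketch (illustrative, not from the thread; assumes libibverbs is installed,
compile with -libverbs):

/* reg_mr_demo.c: ibv_reg_mr() pins the pages backing `buf` so the HCA can
 * DMA to/from them; until ibv_dereg_mr() the memory is unswappable and
 * counts against the locked-memory (ulimit -l) limit. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no IB devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    size_t len = 1 << 20;                  /* 1 MiB user buffer */
    void *buf = malloc(len);

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr)
        perror("ibv_reg_mr (check locked-memory limits)");
    else
        printf("registered %zu bytes, lkey=0x%x\n", len, mr->lkey);

    if (mr) ibv_dereg_mr(mr);              /* unpins the pages */
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}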
It's not coming from OSHMEM but from the OPAL "shmem" framework. You are
going to get terrible performance, possibly slowing to a crawl, with all
processes opening their backing files for mmap on NFS. I think that's the
error that he's getting.
Josh
On Thu, Oct 23, 2014 at 6:06 AM, Vinson Leung
We only link in libpmi(2).so if specifically requested to do so via the
"--with-pmi" configure flag. It is not automatic.
Josh
On Mon, Oct 6, 2014 at 3:28 PM, Timothy Brown
wrote:
> Hi,
>
> I’m not too sure if this is the right list, or if I should be posting to
> the
Hi, Filippo
When launching with mpirun in a SLURM environment, srun is only used to
launch the ORTE daemons (orteds). Since the daemon will already exist
on the node from which you invoked mpirun, this node will not be included
in the list of nodes. SLURM's PMI library is not involved
Maxime,
Can you run with:
mpirun -np 4 --mca plm_base_verbose 10 /path/to/examples/ring_c
On Mon, Aug 18, 2014 at 12:21 PM, Maxime Boissonneault <
maxime.boissonnea...@calculquebec.ca> wrote:
> Hi,
> I just did compile without Cuda, and the result is the same. No output,
> exits with
output when running and exited with code 65.
>
> Thanks,
>
> Maxime
>
> Le 2014-08-14 15:26, Joshua Ladd a écrit :
>
> One more, Maxime, can you please make sure you've covered everything
> here:
>
> http://www.open-mpi.org/community/help/
>
> Josh
>
>
One more, Maxime, can you please make sure you've covered everything here:
http://www.open-mpi.org/community/help/
Josh
On Thu, Aug 14, 2014 at 3:18 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> And maybe include your LD_LIBRARY_PATH
>
> Josh
>
>
> On Thu, Aug 14,
And maybe include your LD_LIBRARY_PATH
Josh
On Thu, Aug 14, 2014 at 3:16 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> Can you try to run the example code "ring_c" across nodes?
>
> Josh
>
>
> On Thu, Aug 14, 2014 at 3:14 PM, Maxime Boissonneault <
>
n the OpenMPI 1.8.1 build and the gromacs build. For OpenMPI 1.8.2rc4
> however, it was the exact same compiler for everything.
>
> Maxime
>
> Le 2014-08-14 14:57, Joshua Ladd a écrit :
>
> Hmmm...weird. Seems like maybe a mismatch between libraries. Did you
> build OMPI with the
shared --enable-static \
> --with-io-romio-flags="--with-file-system=nfs+lustre" \
> --without-loadleveler --without-slurm --with-tm \
> --with-cuda=$(dirname $(dirname $(which nvcc)))
>
> Maxime
>
>
> Le 2014-08-14 14:20, Joshua Ladd a écrit :
>
Boissonneault <
maxime.boissonnea...@calculquebec.ca> wrote:
> Hi,
> I ran gromacs successfully with OpenMPI 1.8.1 and Cuda 6.0.37 on a single
> node, with 8 ranks and multiple OpenMP threads.
>
> Maxime
>
>
> Le 2014-08-14 14:15, Joshua Ladd a écrit :
>
> Hi, Ma
Hi, Maxime
Just curious, are you able to run a vanilla MPI program? Can you try one
of the example programs in the "examples" subdirectory? Looks like a
threading issue to me.
Thanks,
Josh
Ahsan,
This link might be helpful in trying to diagnose and treat IB fabric issues:
http://docs.oracle.com/cd/E18476_01/doc.220/e18478/fabric.htm#CIHIHJGD
You might try resetting the problematic port, or just use port 2 for your
jobs as a quick workaround:
-mca btl_openib_if_include mlx4_0:2
Sayed,
You might try this link (or have your sysadmin do it if you do not have
admin privileges). To me it looks like your second port is in the "INIT"
state but has not been added by the subnet manager.
You might try restarting the device drivers:
$ pdsh -g yourcluster service openibd restart
Josh
Sent from my iPhone
> On Jun 26, 2014, at 6:55 AM, "Jeff Squyres (jsquyres)"
> wrote:
>
> Just curious -- if you run standard ping-pong kinds of MPI benchmarks with
> the
Aleksandar,
Please ensure your system administrator follows the guidelines outlined in
the link printed in the error message
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Best,
Josh
On Fri, Jun 20, 2014 at 2:56 PM, Ivanov, Aleksandar (INR) <
aleksandar.iva...@kit.edu>
lp to use RDMACM though as you will just see the
> resource failure somewhere else. UDCM is not the problem. Something is
> wrong with the system. Allocating a 512 entry CQ should not fail.
>
> -Nathan
>
> On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:
> >
I'm guessing it's a resource limitation issue coming from Torque.
Hmm...I found something interesting on the interwebs that looks awfully
similar:
http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html
Greg, if the suggestion from the Torque users doesn't resolve your
ing is fine.
>
> I'm really very befuddled by this. OpenMPI sees that the two cards are the
> same and made by the same vendor, yet it thinks the transport types are
> different (and one is unknown). I'm hoping someone with some experience
> with how the OpenIB BTL works can shed som
Lorenzo,
set
$> export PATH=/Users/lorenzodona/Documents/openmpi-1.8.1/bin:$PATH
$> export LD_LIBRARY_PATH=/Users/lorenzodona/Documents/openmpi-1.8.1/lib:$LD_LIBRARY_PATH
Then try.
Josh
On Thu, May 29, 2014 at 9:04 AM, Ralph Castain wrote:
> Are you sure you have
Hadi,
Is your job launching and executing normally? During the launch, frameworks
are initialized by opening all components, selecting the desired one, and
closing the others. I think you're just seeing components being opened,
queried, and ultimately closed. The important thing is knowing if PMI
Hi, Tim
Run "ibstat" on each host:
1. Make sure the adapters are alive and active.
2. Look at the Link Layer settings for host w34. Does it match host w4's?
Josh
On Fri, May 9, 2014 at 1:18 PM, Tim Miller wrote:
> Hi All,
>
> We're using OpenMPI 1.7.3 with Mellanox
Hi, Vince
A couple of ideas off the top of my head:
1. Try disabling eager RDMA. Eager RDMA can consume significant resources:
"-mca btl_openib_use_eager_rdma 0"
2. Try using the TCP BTL; is the error still present?
3. Try the poor man's debugger (a sketch follows below): print the pid and hostname of the process
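A minimal sketch of the poor man's debugger pattern from item 3 (variable
names are illustrative): each rank prints its hostname and PID, then spins
until you attach a debugger (gdb -p <pid>) and set hold = 0 to let it
continue.

/* poor_mans_debugger.c */
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);
    printf("rank %d: host=%s pid=%d\n", rank, host, (int)getpid());
    fflush(stdout);

    volatile int hold = 1;                 /* flip to 0 from the debugger */
    while (hold)
        sleep(1);

    /* ... rest of the application ... */
    MPI_Finalize();
    return 0;
}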
Hi, Vince
Have you tried with a different BTL? In particular, have you tried with the TCP
BTL? Please try setting "-mca btl sm,self,tcp" and see if you still run into
the issue.
How is your OMPI configured?
Josh
> From: Vince Grimes
> Subject: [OMPI users] LOCAL QP