Look at the referenced code area to see if there needs to be some
Cygwin-related tweak.
Ralph
> On Jan 9, 2022, at 11:09 PM, Marco Atzeri via devel
> wrote:
>
> On 10.01.2022 06:50, Marco Atzeri wrote:
>> On 09.01.2022 15:54, Ralph Castain via devel wrote:
>>> Hi Ma
Hi Marco
Try the patch here (for the prrte 3rd-party subdirectory):
https://github.com/openpmix/prrte/pull/1173
Ralph
> On Jan 9, 2022, at 12:29 AM, Marco Atzeri via devel
> wrote:
>
> On 01.01.2022 20:07, Barrett, Brian wrote:
>> Marco -
>> There are some patches that haven't made it to
It was a bug (typo in the attribute name when backported from OMPI master) in
OMPI 4.1.1 - it has been fixed.
> On Oct 9, 2021, at 9:18 PM, Orion Poplawski via devel
> wrote:
>
> It looks like openmpi 4.1.1 is not compatible with pmix 4.1.0 - is that
> expected?
>
> In file included from
Answered on packager list - apologies that it didn't get answered there in a
timely fashion.
> On Sep 16, 2021, at 6:56 AM, Orion Poplawski via devel
> wrote:
>
> Is there any documentation that would indicate how long the 4.0 (or any
> particular release series) will be supported? This
We've been struggling a bit lately with the problem of resolving multiple names
for the same host. Part of the problem has been the need to minimize DNS
lookups, as systems were taking way too long to perform them, resulting in very
long startup times. I've done my best to minimize this and
PMIx and PRRTE both read and forward their respective default MCA parameters
from default system and user-level param files:
/etc/pmix-mca-params.conf
/.pmix/mca-params.conf
/etc/prte-mca-params.conf
/.prte/mca-params.conf
PMIx will also do the same thing for OMPI default system and user-level
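For illustration, these param files are simple `key = value` lines; a hypothetical user-level PMIx file might contain (values shown are examples only, echoing the gds workaround mentioned elsewhere in this archive):

```
# user-level .pmix/mca-params.conf (illustrative values only)
gds = ^ds21
```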
can't possibly describe an arbitrary
(rankfile) layout, so I was nervous about why they would be required if a
rankfile was provided...
Martyn
On Mon, 15 Mar 2021 at 19:57, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:
Martyn? Why are you saying SLURM_TASKS_PE
Hi folks
I've written a wiki page explaining how OMPI handles HWLOC from inside the OMPI
code base starting with OMPI v5. The link is on the home page under the
Developer Documents (Accessing the HWLOC topology tree from inside the MPI/OPAL
layers):
22:19:09 +0000
> Ralph Castain via devel wrote:
>
>> Why would it not be set? AFAICT, Slurm is supposed to always set that
>> envar, or so we've been told.
>
> Maybe confusion on the exact name?
>
> AFAIK slurm always sets SLURM_TASKS_PER_NODE but only sets
> SLU
unset in the
configuration.
Martyn
On Thu, 11 Mar 2021 at 16:09, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:
What version of Slurm is this?
> On Mar 11, 2021, at 8:03 AM, Martyn Foster via devel
> wrote:
>
> Hi all,
>
> Using a rather trivial example
> mpirun -np 1 -rf rankfile ./HelloWorld
> on a Slurm system;
> --
> While
aving the independent MPI jobs leave a trace there, such that they can find
each other and create the initial socket.
2. You could replace ssh/rsh with a no-op script (that returns success such
that the mpirun process thinks it successfully started the processes), and then
handcraft the environmen
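The second approach could be sketched as a stand-in launcher script (entirely hypothetical; the environment handcrafting it would have to be paired with is not shown):

```shell
#!/bin/sh
# No-op stand-in for ssh/rsh: pretend the remote launch succeeded so that
# mpirun believes its daemons started. A real setup would then have to
# handcraft the daemon environment on each node by hand.
exit 0
```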
I'm afraid that won't work - there is no way for the job to "self assemble".
One could create a way to do it, but it would take some significant coding in
the guts of OMPI to get there.
On Mar 5, 2021, at 9:40 AM, Gabriel Tanase via devel <devel@lists.open-mpi.org> wrote:
Hi all,
I
Hi folks
I'm planning on removing the OPAL dss (pack/unpack) code as part of my work to
reduce the code base I historically supported. The pack/unpack functionality is
now in PMIx (has been since v3.0 was released), and so we had duplicate
capabilities spread across OPAL and PRRTE. I have
FWIW: now that I am out of Intel, we are planning on upping the PMIx support
for GPUs in general, so I expect we'll be including this one. Support will
include providing info on capabilities (for both local and remote devices),
distances from every proc to each of its local GPUs, affinity
Sounds like I need to resync the PMIx lustre configury with the OMPI one - I'll
do that.
On Feb 4, 2021, at 11:56 AM, Gabriel, Edgar via devel <devel@lists.open-mpi.org> wrote:
I have a weird problem running configure on master on our cluster. Basically,
configure fails when I request
021, at 8:09 AM, Ralph Castain via devel
> wrote:
>
> What if we do this:
>
> - if you are using PMIx v4.1 or above, then there is no problem. Call
> PMIx_Load_topology and we will always return a valid pointer to the topology,
> subject to the caveat that all members
see short of asking every library to provide us with
the ability to pass hwloc_topology_t down to them. Outside of that obvious
answer, I suppose we could put the hwloc_topology_t address into the
environment and have them connect that way?
> On Feb 3, 2021, at 7:36 AM, Ralph Castain via
> using the topology pointer in the process must use the same hwloc
> version (e.g. not 2.0 vs 2.4). shmem_adopt() verifies that the exported
> and importer are compatible. But passing the topology pointer doesn't
> provide any way to verify that the caller doesn't use its own
> incom
Hi folks
Per today's telecon, here is a link to a description of the HWLOC duplication
issue for many-core environments and methods by which you can mitigate the
impact.
https://openpmix.github.io/support/faq/avoid-hwloc-dup
George: for lower-level libs like treematch or HAN, you might want
It could be a Slurm issue, but I'm seeing one thing that makes me suspicious
that this might be a problem reported elsewhere.
Andrej - what version of Slurm are you using here?
> On Feb 1, 2021, at 5:34 PM, Gilles Gouaillardet via devel
> wrote:
>
> Andrej,
>
> that really looks like a
The Slurm launch component would only disqualify itself if it didn't see a
Slurm allocation - i.e., there is no SLURM_JOBID in the environment. If you
want to use mpirun in a Slurm cluster, you need to:
1. get an allocation from Slurm using "salloc"
2. then run "mpirun"
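A minimal session following those two steps might look like this (node count and program name are placeholders):

```
salloc -N 2              # 1. get an allocation from Slurm
mpirun -np 2 ./my_app    # 2. run mpirun inside that allocation
```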
Did you remember to
Andrej
I fail to understand why you continue to think that PMI has anything to do with
this problem. I see no indication of a PMIx-related issue in anything you have
provided to date.
In the output below, it is clear what the problem is - you locked it to the
"slurm" launcher (with -mca plm
Just trying to understand - why are you saying this is a pmix problem?
Obviously, something to do with mpirun is failing, but I don't see any
indication here that it has to do with pmix.
Can you add --enable-debug to your configure line and inspect the core file
from the dump?
> On Jan 31,
, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:
Just a point of clarification since there was a comment on the PR that made
this change. This is _not_ a permanent situation, nor was it done because PMIx
had achieved some magic milestone. We changed the submodule to
prefer, but we can do either.
On Dec 17, 2020, at 9:03 AM, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:
Hi folks
I just switched OMPI's PMIx submodule to point at the new v4.0 branch. When you
want to update, you may need to do the following after a "git pull"
Hi folks
I just switched OMPI's PMIx submodule to point at the new v4.0 branch. When you
want to update, you may need to do the following after a "git pull":
git submodule sync
git submodule update --init --recursive --remote
to get yourself onto the proper branch.
Ralph
e/z04/ompi_slurm
>> --with-pmix=/lustre/z04/pmix --with-pmi=/lustre/z04/pmix --with-slurm
>> --with-cuda=/lustre/sw/nvidia/hpcsdk/Linux_x86_64/cuda/10.1/include/
>> --with-libevent=/lustre/z04/pmix/libevent/
>>
>> and I can see that header file in the pmix build:
>
pmix_rename.h pmix_tool.h
>
> What am I missing?
>
> Cheers,
> Luis
> On 05/08/2020 15:21, Ralph Castain via devel wrote:
>> For OMPI, I would recommend installing PMIx:
>> https://github.com/openpmix/openpmix/releases/tag/v3.1.5
>>
>>
>>> On Aug
For OMPI, I would recommend installing PMIx:
https://github.com/openpmix/openpmix/releases/tag/v3.1.5
> On Aug 5, 2020, at 12:40 AM, Luis Cebamanos via devel
> wrote:
>
> Hi all,
>
> We are trying to install OpenMPI with Slurm support on a recently
> upgraded system. Unfortunately libpmi,
We forgot to discuss this at the last telecon - GP, would you please ensure it
is on next week's agenda?
FWIW: I agree that this should not have been committed. We need to stop doing
local patches to public packages and instead focus on getting them into the
upstream (which has still not been
I would have hoped that the added protections we put into PMIx would have
resolved ds12 as well as ds21, but it is possible those changes didn't get into
OMPI v4.0.x. Regardless, I think you should be just fine using the gds/hash
component for cygwin. I would suggest simply "locking" that param
lviewtech.com>
All Done!
mic:/amd/home/jdelsign/PMIx>
Does that mean there is something wrong with microway2? If that were the case,
then why would it ever work?
On 2020-05-04 12:08, Ralph Castain via devel wrote:
What happens if you run your "3 procs on two nodes" case using just microway1
port 1024 allowed to connect to ?
George.
On Mon, May 4, 2020 at 11:36 AM John DelSignore via devel <devel@lists.open-mpi.org> wrote:
Inline below...
On 2020-05-04 11:09, Ralph Castain via devel wrote:
Staring at this some more, I do have the following questions:
* in your fir
Good to confirm - thanks! This does indeed look like an issue in the btl/tcp
component's reachability code.
On May 4, 2020, at 8:34 AM, John DelSignore <jdelsign...@perforce.com> wrote:
Inline below...
On 2020-05-04 11:09, Ralph Castain via devel wrote:
Staring at this some m
Staring at this some more, I do have the following questions:
* in your first case, it looks like "prte" was started from microway3 - correct?
* in the second case, that worked, it looks like "mpirun" was executed from
microway1 - correct?
* in the third case, you state that "mpirun" was again
So here is an interesting consequence of moving from ORTE to PRRTE. In ORTE,
you could express any mapping policy as an MCA param - e.g., the following:
OMPI_MCA_rmaps_base_mapping_policy=core
OMPI_MCA_rmaps_base_display_map=1
would be the equivalent of a cmd line that included "--map-by core
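In other words, under ORTE the two forms below would be interchangeable (program name hypothetical):

```
# as MCA params in the environment:
OMPI_MCA_rmaps_base_mapping_policy=core \
OMPI_MCA_rmaps_base_display_map=1 \
mpirun -np 4 ./my_app

# or as the equivalent cmd line:
mpirun -np 4 --map-by core --display-map ./my_app
```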
020, at 9:51 AM, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:
>
> We have deprecated a number of cmd line options (e.g., bynode, npernode,
> npersocket) - what do we want to do about their MPI_Info equivalents when
> calling comm_spawn?
>
> Do I sil
I just want to confirm the default behaviors we want for OMPI v5. This is what
we have currently set:
* if the user specifies nothing:
if np <=2: map-by core, rank-by core, bind-to core
if np > 2: map-by socket, rank-by core, bind-to socket
* if the user only specifies map-by:
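Spelled out explicitly, those "user specifies nothing" defaults would be equivalent to (sketch; program name hypothetical):

```
# np <= 2:
mpirun -np 2 --map-by core --rank-by core --bind-to core ./my_app
# np > 2:
mpirun -np 4 --map-by socket --rank-by core --bind-to socket ./my_app
```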
There was a recent discussion regarding whether or not two jobs could
communicate via shared memory. I recalled adding support for this, but thought
that Nathan needed to do something in "vader" to enable it. Turns out I
remembered correctly about adding the support - but I believe "vader"
We have deprecated a number of cmd line options (e.g., bynode, npernode,
npersocket) - what do we want to do about their MPI_Info equivalents when
calling comm_spawn?
Do I silently convert them? Should we output a deprecation warning? Return an
error?
Ralph
Hey folks
I have been fighting the build system for the last two days and discovered
something a little bothersome. It appears that there are only two ways to build
OMPI:
* with all three of libevent, hwloc, and pmix internal
* with all three of libevent, hwloc, and pmix external
In other
for the
specified proc, one per NIC, ordered as above.
I'll be posting some example code illustrating the use of all these in the near
future and will alert anyone interested when I do.
Ralph
> On Mar 22, 2020, at 11:36 AM, Ralph Castain via devel
> wrote:
>
> I'll be writing a ser
I'll be writing a series of notes containing thoughts on how to exploit
PMIx-provided information, especially covering aspects that might not be
obvious (e.g., attributes that might not be widely known). This first note
covers the topic of collective optimization.
PMIx provides network-related
support for ofi MTL using
hwloc
https://github.com/open-mpi/ompi/pull/7547 fixes it and has an explanation as
to why it wasn't catching us elsewhere in the MPI code
On Mar 20, 2020, at 9:22 AM, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:
Odd - the topology object gets filled in during init, well
e, but wanted to figure out if this
> was expected and, if so, if we had options for getting the right data from
> PMIx early enough in the process. Sorry, this is part of the runtime changes
> I haven't been following closely enough.
>
> Brian
>
> -----Original Message-----
iam
>
> On 3/17/20, 11:54 PM, "devel on behalf of Ralph Castain via devel"
>
> wrote:
>
>CAUTION: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
>
Hey folks
I saw the referenced "new feature" on the v5 feature spreadsheet and wanted to
ask a quick question. Is the OFI MTL going to be doing its own hwloc topology
discovery for this feature? Or is it going to access the topology info via PMIx
and the OPAL hwloc abstraction?
I ask because
You have a missing symbol in your component:
undefined symbol: ompi_coll_libpnbc_osc_neighbor_alltoall_init (ignored)
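One quick way to spot such missing symbols before the component is loaded is to inspect the plugin's shared object (filename below is hypothetical):

```
nm -u mca_coll_foo.so | grep ompi_coll   # list undefined symbols
ldd -r mca_coll_foo.so                   # or check that all relocations resolve
```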
On Mar 5, 2020, at 5:57 AM, Luis Cebamanos via devel <devel@lists.open-mpi.org> wrote:
Hi folks,
We are developing a (hopefully) new component for the coll
I checked this with a fresh clone and everything is working fine, so I expect
this is a stale submodule issue again. I've asked John to check.
> On Mar 4, 2020, at 8:05 AM, John DelSignore via devel
> wrote:
>
> Hi,
>
> I've been working with Ralph to try to get the PMIx debugging interfaces
You'll have to do it in the PRRTE project: https://github.com/openpmix/prrte
OMPI has removed the ORTE code base and replaced it with PRRTE, which is
effectively the same code but taken from a repo that multiple projects support.
You can use any of the components in there as a template - I
Just an FYI: GitHub is degraded today, especially on the webhooks and actions
that we depend upon for things like CI. Hopefully, they will get it fixed soon.
Ralph
Hey folks
Now that we have multiple projects sharing a build system, we need to be
careful how we name our #if's. For example, using:
#ifdef MCA_FOO_BAR
#define MCA_FOO_BAR
...
might have been fine in the past, but you wind up clobbering another project
that also has a "foo_bar" component.
What do we want to do with the following options? These have either been
renamed (changing from "orte..." to a "prrte" equivalent) or are no longer
valid:
--enable-orterun-prefix-by-default
--enable-mpirun-prefix-by-default
These are now --enable-prte-prefix-by-default. Should I error out via
Hi folks
I integrated the minutes from this week's meeting into the meeting's wiki page:
https://github.com/open-mpi/ompi/wiki/Meeting-2020-02
Feel free to update and/or let me know of errors or omissions
Ralph
Hey folks
Based on the discussion at the OMPI developer's meeting this week, I have
created the following wiki page explaining how OMPI's command line and envars
will be processed for OMPI v5:
https://github.com/open-mpi/ompi/wiki/Command-Line-Envar-Parsing
Feel free to comment and/or ask
We are seeing many failures on MTT because of errors on the cmd line. Note that
by request of the OMPI community, PRRTE is strictly enforcing the Posix "dash"
syntax:
* a single-dash must be used only for single-character options. You can
combine the single-character options like "-abc" as
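Concretely, under this rule (option combinations are illustrative, program name hypothetical):

```
mpirun -abc ./my_app           # OK: bundled single-character options
mpirun --map-by core ./my_app  # OK: double dash for a multi-character option
mpirun -map-by core ./my_app   # rejected: single dash, multi-character name
```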
FYI: pursuant to the objectives outlined last year, I have committed PR #7202
and removed ORTE from the OMPI repository. It has been replaced with a PRRTE
submodule pointed at the PRRTE master branch. At the same time, we replaced the
embedded PMIx code tree with a submodule pointed to the PMIx
FWIW: I have major problems when rebasing if that rebase runs across the point
where a submodule is added. Every file that was removed and replaced by the
submodule generates a conflict. Only solution I could find was to whack the
subdirectory containing the files-to-be-replaced and work thru
It is the latter one it is complaining about:
> /tmp/ompi.LAPTOP-82F08ILC.197609/pid.93/0/debugger_attach_fifo
I have no idea why it is complaining.
> On Feb 3, 2020, at 2:03 PM, Marco Atzeri via devel
> wrote:
>
> Am 03.02.2020 um 18:15 schrieb Ralph Castain via dev
Hi Marco
mpirun isn't trying to run a debugger. It is opening a fifo pipe in case a
debugger later wishes to attach to the running job - it is used by an
MPIR-based debugger to let mpirun know that it is attaching. My guess is that
the code is attempting to create the fifo in an unacceptable
Actually, I take that back - making a separate PR to change the opal/pmix
embedded component to a submodule was way too painful. I simply added it to the
existing #7202.
> On Jan 7, 2020, at 1:33 PM, Ralph Castain via devel
> wrote:
>
> Just an FYI: there will soon be THREE PRs
Just an FYI: there will soon be THREE PRs introducing submodules - I am
breaking #7202 into two pieces. The first will replace opal/pmix with direct
use of PMIx everywhere and replace the embedded pmix component with a submodule
pointing to PMIx master, and the second will replace ORTE with
I was able to create the fix - it is in OMPI master. I have provided a patch
for OMPI v3.1.5 here:
https://github.com/open-mpi/ompi/pull/7276
Ralph
> On Jan 3, 2020, at 6:04 PM, Ralph Castain via devel
> wrote:
>
> I'm afraid the fix uncovered an issue in the ds
I'm afraid the fix uncovered an issue in the ds21 component that will require
Mellanox to address it - unsure of the timetable for that to happen.
> On Jan 3, 2020, at 6:28 AM, Ralph Castain via devel
> wrote:
>
> I committed something upstream in PMIx master and v3.1 that proba
wrote:
>
> Is there a configure test we can add to make this kind of behavior be the
> default?
>
>
>> On Jan 1, 2020, at 11:50 PM, Marco Atzeri via devel
>> wrote:
>>
>> thanks Ralph
>>
>> gds = ^ds21
>> works as expected
>>
>
interfaces?
On Jan 2, 2020, at 9:35 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
Ralph,
I think the first use is still pending reviews (more precisely my review) at
https://github.com/open-mpi/ompi/pull/7134.
George.
On Wed, Jan 1, 2020 at 9:53 PM Ralph Castain via dev
Hey folks
I can't find where the opal/reachable framework is being used in OMPI. I would
like to utilize it in the PRRTE oob/tcp component, but need some guidance on
how to do so, or pointers to an example.
Ralph
PMIx likely defaults to the ds12 component - which will work fine but a tad
slower than ds21. It is likely something to do with the way cygwin handles
memory locks. You can avoid the error message by simply adding "gds = ^ds21" to
your default MCA param file (the pmix one - should be named
Hi folks
The move to replace ORTE with PRRTE is now ready to go (the OSHMEM team needs
to fix something in that project). This means that all further development
activity and/or PRs involving ORTE should be transferred to the PRRTE project
(https://github.com/openpmix/prrte). Existing PRs that
ting of those paths is why we are moving to PMIx-based tool support in OMPI
v5.
HTH
Ralph
On Nov 13, 2019, at 10:40 AM, Ralph Castain via devel <devel@lists.open-mpi.org> wrote:
Agreed and understood. My point was only that I'm not convinced the problem was
"fixed" as i
il.
I think that unless there's already a problem in the code, the debugger should
not be able to interfere at all.
Cheers, John D.
On 11/12/19 6:51 PM, Ralph Castain via devel wrote:
Again, John, I'm not convinced your last statement is true. However, I think it
is "good enough" fo
thread polling on the flag to see the
update - something the volatile keyword doesn't do on its own. I think it's
also much cleaner as it eliminates an arbitrary sleep from the code - which I
see as a good thing as well.
"Ralph Castain via devel" ---11/12/2019 09:24:23 AM---> On Nov
pthread_mutex_lock(&lock);
flg = new_val;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);
This provides the memory barrier for the thread polling on the flag to see the
update - something the volatile keyword doesn't do on its own. I think it's
also much cleaner as it eliminates an arbitrary sleep from the code - which I
see as a good thing as well.
"Ralph Castain via devel" ---11/12/2019 09:24:23 AM---> On Nov 11,
2019, at 4:53 PM, Gilles Gouaillardet via
ndled by PMIx, at least in the master branch) ?
IIRC, that progress thread only runs if explicitly asked to do so by MCA param.
We don't need that code any more as PMIx takes care of it.
>
> Cheers,
>
> Gilles
>
> On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
>
Hi John
Sorry to say, but there is no way to really answer your question as the OMPI
community doesn't actively test MPIR support. I haven't seen any reports of
hangs during MPI_Init from any release series, including 4.x. My guess is that
it may have something to do with the debugger
Hi folks
I mentioned this very briefly at the Tues telecon, but didn't explain it well
as there just wasn't adequate time available. With the recent updates of the
embedded PMIx code, OMPI's mpirun now has the ability to fully support
pre-launch network resource assignment for processes. This
rinted only once by the
second case.
On Fri, Jul 12, 2019 at 6:00 PM Ralph Castain via devel <devel@lists.open-mpi.org> wrote:
Afraid I don't know anything about that program, but it looks like it is
printing the same number of times in both cases. It only appears to be more in
th
<cs15mtech11...@iith.ac.in> wrote:
Thanks, Ralph.
Why is the output of the program (mm-llvm.out) printed only once, while the
mpirun from Intel prints it as many times as specified on the command line?
On Thu, Jul 11, 2019 at 11:08 PM Ralph Castain via devel <devel@lists.open-m
Because OMPI binds to core by default when np=2. If you have an OpenMP process,
you want to add "--bind-to numa" to your mpirun cmd line.
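For example (binary name hypothetical):

```
# 2 MPI ranks, each free to spread its OpenMP threads across a NUMA domain
# instead of being pinned to a single core:
mpirun -np 2 --bind-to numa ./hybrid_app
```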
On Jul 11, 2019, at 10:28 AM, Dangeti Tharun kumar via devel <devel@lists.open-mpi.org> wrote:
Hi Devs,
I have built openmpi with LLVM-8