Andrej,
that's odd, there should be a mca_pmix_pmix3x.so (assuming you built
with the internal pmix)
what was your exact configure command line?
fwiw, in your build tree, there should be a
opal/mca/pmix/pmix3x/.libs/mca_pmix_pmix3x.so
if it's there, try running
sudo make install
once more and
Andrej,
what is your mpirun command line?
is mpirun invoked from a batch allocation?
in order to get some more debug info, you can
mpirun --mca ess_base_verbose 10 --mca pmix_base_verbose 10 ...
Cheers,
Gilles
On Mon, Feb 1, 2021 at 10:27 PM Andrej Prsa via devel
wrote:
>
> Hi Gilles,
>
>
Andrej,
it seems only flux is a PMIx option, which is very suspicious.
can you check other components are available?
ls -l /usr/local/lib/openmpi/mca_pmix_*.so
will list them.
Cheers,
Gilles
On Mon, Feb 1, 2021 at 10:53 PM Andrej Prsa via devel
wrote:
>
> Hi Gilles,
>
> > what is your
Hi Gilles,
I invite you to do some cleanup
sudo rm -rf /usr/local/lib/openmpi /usr/local/lib/pmix
and then
sudo make install
and try again.
Good catch! Alright, I deleted /usr/local/lib/openmpi and
/usr/local/lib/pmix, then I rebuilt (make clean; make) and installed
pmix from the latest
Hi Gilles,
what is your mpirun command line?
is mpirun invoked from a batch allocation?
I call mpirun directly; here's a full output:
andrej@terra:~/system/tests/MPI$ mpirun --mca ess_base_verbose 10 --mca
pmix_base_verbose 10 -np 4 python testmpi.py
[terra:203257] mca: base:
Hi Gilles,
it seems only flux is a PMIx option, which is very suspicious.
can you check other components are available?
ls -l /usr/local/lib/openmpi/mca_pmix_*.so
andrej@terra:~/system/tests/MPI$ ls -l /usr/local/lib/openmpi/mca_pmix_*.so
-rwxr-xr-x 1 root root 97488 Feb 1 08:20
Hi Gilles,
that's odd, there should be a mca_pmix_pmix3x.so (assuming you built
with the internal pmix)
Ah, I didn't -- I linked against the latest git pmix; here's the
configure line:
./configure --prefix=/usr/local --with-pmix=/usr/local --with-slurm
--without-tm --without-moab
Hi Joseph,
Thanks -- I did that and checked that the configure summary says
internal for pmix. I also distcleaned the tree just to be sure. It's
building as we speak.
Cheers,
Andrej
On 2/1/21 9:55 AM, Joseph Schuchart via devel wrote:
Andrej,
If your installation originally picked up a
Alright, I rebuilt mpirun and it's working on a local machine. But now
I'm back to my original problem: running this works:
mpirun -mca plm rsh -np 384 -H node15:96,node16:96,node17:96,node18:96
python testmpi.py
but running this doesn't:
mpirun -mca plm slurm -np 384 -H
On Jan 27, 2021, at 7:19 PM, Gilles Gouaillardet wrote:
>
> What I meant is the default Linux behavior is to first lookup dependencies in
> the rpath, and then fallback to LD_LIBRARY_PATH
> *unless* -Wl,--enable-new-dtags was used at link time.
>
> In the case of Open MPI,
Andrej,
If your installation originally picked up a preinstalled PMIx and you
deleted it, it's better to run OMPI's configure again (make/make install
might not be sufficient to install the internal PMIx).
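Joseph's advice amounts to a full reconfigure-and-rebuild cycle. A sketch of what that might look like, run from the Open MPI source tree and reusing the prefix and flags quoted elsewhere in this thread (omitting --with-pmix is what makes configure select the bundled, internal PMIx):

```shell
# Start from a pristine tree so stale PMIx configure decisions are discarded.
make distclean
# No --with-pmix here: configure falls back to the internal (bundled) PMIx.
./configure --prefix=/usr/local --with-slurm --without-tm --without-moab
make -j"$(nproc)"
sudo make install
```

The configure summary near the end of its output should then report the internal PMIx, which is worth checking before starting the build.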
Cheers
Joseph
On 2/1/21 3:48 PM, Gilles Gouaillardet via devel wrote:
Andrej,
On Mon, 1 Feb 2021 14:46:22 +
"Jeff Squyres \(jsquyres\) via devel" wrote:
> On Jan 27, 2021, at 7:19 PM, Gilles Gouaillardet
> wrote:
> >
> > What I meant is the default Linux behavior is to first lookup
> > dependencies in the rpath, and then fallback to LD_LIBRARY_PATH
> > *unless*
Andrej,
you are now invoking mpirun from a slurm allocation, right?
you can try this:
/usr/local/bin/mpirun -mca plm slurm -np 384 -H
node15:96,node16:96,node17:96,node18:96
python testmpi.py
if it does not work, you can collect more relevant logs with
mpirun -mca plm slurm -mca
FYI - I wasn’t bothered by the default behavior... I was just looking for a
sanctioned way for an installer (e.g. a sysadmin) to make the UCX be loaded
based on LD_LIBRARY_PATH so that there was an ability for the user to swap
in a debug build of UCX at runtime.
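That use case can be sketched as follows; /opt/ucx-debug is a hypothetical install prefix, and the swap only takes effect if the UCX component was linked with RUNPATH rather than RPATH (i.e. with -Wl,--enable-new-dtags), since only RUNPATH yields to LD_LIBRARY_PATH:

```shell
# Hypothetical: put a debug build of UCX ahead of the installed one.
# With RUNPATH (new dtags), LD_LIBRARY_PATH is consulted first, so the
# debug libucp/libuct get resolved at mpirun time.
export LD_LIBRARY_PATH=/opt/ucx-debug/lib:$LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH"
```

After exporting, run mpirun as usual; `ldd` on the application or on the UCX MCA component shows which libucp was actually resolved.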
On Mon, Feb 1, 2021 at 10:14 AM
Hi Ralph, Gilles,
I fail to understand why you continue to think that PMI has anything to do with
this problem. I see no indication of a PMIx-related issue in anything you have
provided to date.
Oh, I went off the traceback that yelled about pmix, and slurm not being
able to find it until
Here is what you can try
$ salloc -N 4 -n 384
/* and then from the allocation */
$ srun -n 1 orted
/* that should fail, but the error message can be helpful */
$ /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true
Cheers
Gilles
On Tue, Feb 2, 2021 at 10:03 AM Andrej Prsa
Hi Gilles,
Here is what you can try
$ salloc -N 4 -n 384
/* and then from the allocation */
$ srun -n 1 orted
/* that should fail, but the error message can be helpful */
$ /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true
andrej@terra:~/system/tests/MPI$ salloc -N 4 -n
Andrej,
that really looks like a SLURM issue that does not involve Open MPI
In order to confirm, you can
$ salloc -N 2 -n 2
/* and then from the allocation */
srun hostname
If this does not work, then this is a SLURM issue you have to fix.
Once fixed, I am confident Open MPI will just work
Andrej,
you *have* to invoke
mpirun --mca plm slurm ...
from a SLURM allocation, and SLURM_* environment variables should have
been set by SLURM
(otherwise, this is a SLURM error out of the scope of Open MPI).
Here is what you can try (and send the logs if that fails)
$ salloc -N 4 -n 384
and
Hi Gilles,
andrej@terra:~/system/tests/MPI$ salloc -N 4 -n 384
salloc: Granted job allocation 836
andrej@terra:~/system/tests/MPI$ env | grep ^SLURM_
SLURM_TASKS_PER_NODE=96(x4)
SLURM_SUBMIT_DIR=/home/users/andrej/system/tests/MPI
SLURM_NODE_ALIASES=(null)
SLURM_CLUSTER_NAME=terra
Andrej,
I can reproduce this behavior ... when running outside of a slurm allocation.
What does
$ env | grep ^SLURM_
report?
Cheers,
Gilles
On Tue, Feb 2, 2021 at 9:06 AM Andrej Prsa via devel
wrote:
>
> Hi Ralph, Gilles,
>
> > I fail to understand why you continue to think that PMI has
Hi Gilles,
I can reproduce this behavior ... when running outside of a slurm allocation.
I just tried from slurm (sbatch run.sh) and I get the exact same error.
What does
$ env | grep ^SLURM_
report?
Empty; no environment variables have been defined.
Thanks,
Andrej
It could be a Slurm issue, but I'm seeing one thing that makes me suspicious
that this might be a problem reported elsewhere.
Andrej - what version of Slurm are you using here?
> On Feb 1, 2021, at 5:34 PM, Gilles Gouaillardet via devel
> wrote:
>
> Andrej,
>
> that really looks like a
The Slurm launch component would only disqualify itself if it didn't see a
Slurm allocation - i.e., there is no SLURM_JOBID in the environment. If you
want to use mpirun in a Slurm cluster, you need to:
1. get an allocation from Slurm using "salloc"
2. then run "mpirun"
Did you remember to
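The disqualification rule described above can be expressed as a quick pre-flight check (a hypothetical helper, not part of Open MPI):

```shell
# Sanity check before "mpirun -mca plm slurm": the slurm launch component
# disqualifies itself when SLURM_JOBID is absent from the environment,
# i.e. when mpirun is not running inside a Slurm allocation.
if [ -z "$SLURM_JOBID" ]; then
    echo "no Slurm allocation detected; run salloc (or sbatch) first" >&2
else
    echo "inside Slurm job $SLURM_JOBID"
fi
```

This matches the empty `env | grep ^SLURM_` output reported earlier in the thread: without those variables, the slurm plm can never be selected.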
Josh posted https://github.com/open-mpi/ompi/pull/8432 last week to propagate
arguments from PMIx and PRRTE into the top level configure --help, and it
probably deserves more discussion.
There are three patches, the last of which renames arguments so you could (for
example) specify different
Hi Ralph,
Andrej - what version of Slurm are you using here?
It's slurm 20.11.3, i.e. the latest release afaik.
But Gilles is correct; the proposed test failed:
andrej@terra:~/system/tests/MPI$ salloc -N 2 -n 2
salloc: Granted job allocation 838
andrej@terra:~/system/tests/MPI$ srun
The saga continues.
I managed to build slurm with pmix by first patching slurm using this
patch and manually building the plugin:
https://bugs.schedmd.com/show_bug.cgi?id=10683
Now srun shows pmix as an option:
andrej@terra:~/system/tests/MPI$ srun --mpi=list
srun: MPI types are...
srun:
Hi Gilles,
srun -N 1 -n 1 orted
that is expected to fail, but it should at least find all its
dependencies and start
This was quite illuminating!
andrej@terra:~/system/tests/MPI$ srun -N 1 -n 1 orted
srun: /usr/local/lib/slurm/switch_generic.so: Incompatible Slurm plugin
version (20.02.6)
Andrej,
My previous email listed other things to try
Cheers,
Gilles
> On Feb 2, 2021, at 6:23, Andrej Prsa via devel
> wrote:
>
> The saga continues.
>
> I managed to build slurm with pmix by first patching slurm using this patch
> and manually building the plugin:
>
>
Andrej
I fail to understand why you continue to think that PMI has anything to do with
this problem. I see no indication of a PMIx-related issue in anything you have
provided to date.
In the output below, it is clear what the problem is - you locked it to the
"slurm" launcher (with -mca plm