Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Gilles Gouaillardet via devel
Andrej, that's odd, there should be a mca_pmix_pmix3x.so (assuming you built with the internal pmix) what was your exact configure command line? fwiw, in your build tree, there should be a opal/mca/pmix/pmix3x/.libs/mca_pmix_pmix3x.so if it's there, try running sudo make install once more and

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Gilles Gouaillardet via devel
Andrej, what is your mpirun command line? is mpirun invoked from a batch allocation? in order to get some more debug info, you can mpirun --mca ess_base_verbose 10 --mca pmix_base_verbose 10 ... Cheers, Gilles On Mon, Feb 1, 2021 at 10:27 PM Andrej Prsa via devel wrote: > > Hi Gilles, > >
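The debug invocation Gilles suggests can be assembled programmatically. A minimal sketch (the helper below is illustrative, not part of Open MPI; the framework names, verbosity levels, and `testmpi.py` are the ones quoted above):

```python
def mpirun_debug_argv(np, program, mca_params):
    """Build an mpirun argument vector with the given MCA parameters."""
    argv = ["mpirun"]
    for name, value in sorted(mca_params.items()):
        argv += ["--mca", name, str(value)]
    argv += ["-np", str(np)] + program
    return argv

# The exact invocation suggested in the thread:
argv = mpirun_debug_argv(
    4, ["python", "testmpi.py"],
    {"ess_base_verbose": 10, "pmix_base_verbose": 10},
)
print(" ".join(argv))
# mpirun --mca ess_base_verbose 10 --mca pmix_base_verbose 10 -np 4 python testmpi.py
```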

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Gilles Gouaillardet via devel
Andrej, it seems only flux is a PMIx option, which is very suspicious. can you check other components are available? ls -l /usr/local/lib/openmpi/mca_pmix_*.so will list them. Cheers, Gilles On Mon, Feb 1, 2021 at 10:53 PM Andrej Prsa via devel wrote: > > Hi Gilles, > > > what is your

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, I invite you to do some cleanup sudo rm -rf /usr/local/lib/openmpi /usr/local/lib/pmix and then sudo make install and try again. Good catch! Alright, I deleted /usr/local/lib/openmpi and /usr/local/lib/pmix, then I rebuilt (make clean; make) and installed pmix from the latest

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, what is your mpirun command line? is mpirun invoked from a batch allocation? I call mpirun directly; here's a full output: andrej@terra:~/system/tests/MPI$ mpirun --mca ess_base_verbose 10 --mca pmix_base_verbose 10 -np 4 python testmpi.py [terra:203257] mca: base:

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, it seems only flux is a PMIx option, which is very suspicious. can you check other components are available? ls -l /usr/local/lib/openmpi/mca_pmix_*.so andrej@terra:~/system/tests/MPI$ ls -l /usr/local/lib/openmpi/mca_pmix_*.so -rwxr-xr-x 1 root root 97488 Feb  1 08:20
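The `ls -l /usr/local/lib/openmpi/mca_pmix_*.so` check above can be expressed as a small helper that reports which PMIx components an installation actually ships. This is a sketch (the function and default path are assumptions for illustration; a healthy build normally shows more than just `flux`, which is what made the single result suspicious):

```python
import glob
import os

def pmix_components(libdir="/usr/local/lib/openmpi"):
    """Return the PMIx MCA component names found in an Open MPI plugin dir,
    by stripping the mca_pmix_ prefix and .so suffix from matching files."""
    pattern = os.path.join(libdir, "mca_pmix_*.so")
    return sorted(
        os.path.basename(p)[len("mca_pmix_"):-len(".so")]
        for p in glob.glob(pattern)
    )
```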

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, that's odd, there should be a mca_pmix_pmix3x.so (assuming you built with the internal pmix) Ah, I didn't -- I linked against the latest git pmix; here's the configure line: ./configure --prefix=/usr/local --with-pmix=/usr/local --with-slurm --without-tm --without-moab

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Joseph, Thanks -- I did that and checked that the configure summary says internal for pmix. I also distcleaned the tree just to be sure. It's building as we speak. Cheers, Andrej On 2/1/21 9:55 AM, Joseph Schuchart via devel wrote: Andrej, If your installation originally picked up a

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Alright, I rebuilt mpirun and it's working on a local machine. But now I'm back to my original problem: running this works: mpirun -mca plm rsh -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py but running this doesn't: mpirun -mca plm slurm -np 384 -H
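The `-H` host list used above pairs each node with its slot count. A trivial helper to render it (illustrative; the node names and the 96 slots per node are taken from the command quoted above):

```python
def host_list(nodes, slots):
    """Render mpirun's -H argument: node:slots entries, comma-joined."""
    return ",".join(f"{node}:{slots}" for node in nodes)

print(host_list(["node15", "node16", "node17", "node18"], 96))
# node15:96,node16:96,node17:96,node18:96
```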

Re: [OMPI devel] How to build Open MPI so the UCX used can be changed at runtime?

2021-02-01 Thread Jeff Squyres (jsquyres) via devel
On Jan 27, 2021, at 7:19 PM, Gilles Gouaillardet wrote: > > What I meant is the default Linux behavior is to first lookup dependencies in > the rpath, and then fallback to LD_LIBRARY_PATH > *unless* -Wl,--enable-new-dtags was used at link time. > > In the case of Open MPI,
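The search-order rule Gilles describes can be written down explicitly: with classic DT_RPATH the rpath is consulted before LD_LIBRARY_PATH, whereas `-Wl,--enable-new-dtags` emits DT_RUNPATH, which the glibc loader consults only after LD_LIBRARY_PATH (and which causes DT_RPATH to be ignored). A toy model of that ordering (a sketch of the rule, not of ld.so itself):

```python
def search_order(rpath=None, runpath=None, ld_library_path=None):
    """Directory search order used by the glibc loader for a dependency.

    If DT_RUNPATH is present, DT_RPATH is ignored and LD_LIBRARY_PATH wins;
    otherwise DT_RPATH is searched before LD_LIBRARY_PATH.  System default
    directories come last either way.
    """
    if runpath is not None:
        order = (ld_library_path or []) + runpath
    else:
        order = (rpath or []) + (ld_library_path or [])
    return order + ["<system default dirs>"]
```

This is why new dtags matter for the question in the subject line: only with DT_RUNPATH can a user's LD_LIBRARY_PATH swap in a different UCX at runtime.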

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Joseph Schuchart via devel
Andrej, If your installation originally picked up a preinstalled PMIx and you deleted it, it's better to run OMPI's configure again (make/make install might not be sufficient to install the internal PMIx). Cheers Joseph On 2/1/21 3:48 PM, Gilles Gouaillardet via devel wrote: Andrej,

Re: [OMPI devel] How to build Open MPI so the UCX used can be changed at runtime?

2021-02-01 Thread Peter Kjellström via devel
On Mon, 1 Feb 2021 14:46:22 + "Jeff Squyres (jsquyres) via devel" wrote: > On Jan 27, 2021, at 7:19 PM, Gilles Gouaillardet > wrote: > > What I meant is the default Linux behavior is to first lookup > > dependencies in the rpath, and then fallback to LD_LIBRARY_PATH > > *unless*

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Gilles Gouaillardet via devel
Andrej, you are now invoking mpirun from a slurm allocation, right? you can try this: /usr/local/bin/mpirun -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py if it does not work, you can collect more relevant logs with mpirun -mca plm slurm -mca

Re: [OMPI devel] How to build Open MPI so the UCX used can be changed at runtime?

2021-02-01 Thread Tim Mattox via devel
FYI - I wasn’t bothered by the default behavior... I was just looking for a sanctioned way for an installer (e.g. a sysadmin) to make the UCX be loaded based on LD_LIBRARY_PATH so that there was an ability for the user to swap in a debug build of UCX at runtime. On Mon, Feb 1, 2021 at 10:14 AM

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Ralph, Gilles, I fail to understand why you continue to think that PMI has anything to do with this problem. I see no indication of a PMIx-related issue in anything you have provided to date. Oh, I went off the traceback that yelled about pmix, and slurm not being able to find it until

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Gilles Gouaillardet via devel
Here is what you can try $ salloc -N 4 -n 384 /* and then from the allocation */ $ srun -n 1 orted /* that should fail, but the error message can be helpful */ $ /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true Cheers Gilles On Tue, Feb 2, 2021 at 10:03 AM Andrej Prsa

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, Here is what you can try $ salloc -N 4 -n 384 /* and then from the allocation */ $ srun -n 1 orted /* that should fail, but the error message can be helpful */ $ /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true andrej@terra:~/system/tests/MPI$ salloc -N 4 -n

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Gilles Gouaillardet via devel
Andrej, that really looks like a SLURM issue that does not involve Open MPI In order to confirm, you can $ salloc -N 2 -n 2 /* and then from the allocation */ srun hostname If this does not work, then this is a SLURM issue you have to fix. Once fixed, I am confident Open MPI will just work

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Gilles Gouaillardet via devel
Andrej, you *have* to invoke mpirun --mca plm slurm ... from a SLURM allocation, and SLURM_* environment variables should have been set by SLURM (otherwise, this is a SLURM error out of the scope of Open MPI). Here is what you can try (and send the logs if that fails) $ salloc -N 4 -n 384 and

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, andrej@terra:~/system/tests/MPI$ salloc -N 4 -n 384 salloc: Granted job allocation 836 andrej@terra:~/system/tests/MPI$ env | grep ^SLURM_ SLURM_TASKS_PER_NODE=96(x4) SLURM_SUBMIT_DIR=/home/users/andrej/system/tests/MPI SLURM_NODE_ALIASES=(null) SLURM_CLUSTER_NAME=terra
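The `SLURM_TASKS_PER_NODE=96(x4)` value above uses Slurm's compact repetition syntax. A small parser to expand it (illustrative helper, not part of Slurm or Open MPI):

```python
import re

def parse_tasks_per_node(value):
    """Expand Slurm's compact SLURM_TASKS_PER_NODE syntax,
    e.g. '96(x4)' -> [96, 96, 96, 96] and '2(x2),1' -> [2, 2, 1]."""
    tasks = []
    for part in value.split(","):
        m = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", part)
        if not m:
            raise ValueError(f"unrecognized part: {part!r}")
        count, repeat = int(m.group(1)), int(m.group(2) or 1)
        tasks += [count] * repeat
    return tasks

print(parse_tasks_per_node("96(x4)"))
# [96, 96, 96, 96]  -- 4 nodes x 96 tasks, matching the allocation above
```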

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Gilles Gouaillardet via devel
Andrej, I can reproduce this behavior ... when running outside of a slurm allocation. What does $ env | grep ^SLURM_ report? Cheers, Gilles On Tue, Feb 2, 2021 at 9:06 AM Andrej Prsa via devel wrote: > > Hi Ralph, Gilles, > > > I fail to understand why you continue to think that PMI has

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, I can reproduce this behavior ... when running outside of a slurm allocation. I just tried from slurm (sbatch run.sh) and I get the exact same error. What does $ env | grep ^SLURM_ report? Empty; no environment variables have been defined. Thanks, Andrej

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Ralph Castain via devel
It could be a Slurm issue, but I'm seeing one thing that makes me suspicious that this might be a problem reported elsewhere. Andrej - what version of Slurm are you using here? > On Feb 1, 2021, at 5:34 PM, Gilles Gouaillardet via devel > wrote: > > Andrej, > > that really looks like a

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Ralph Castain via devel
The Slurm launch component would only disqualify itself if it didn't see a Slurm allocation - i.e., there is no SLURM_JOBID in the environment. If you want to use mpirun in a Slurm cluster, you need to: 1. get an allocation from Slurm using "salloc" 2. then run "mpirun" Did you remember to
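The disqualification test Ralph describes is, at its core, an environment check. A minimal sketch (illustrative of the rule, not the component's actual code):

```python
import os

def in_slurm_allocation(environ=None):
    """The slurm launch component's basic precondition: SLURM_JOBID is set,
    which only happens inside an salloc/sbatch allocation."""
    env = os.environ if environ is None else environ
    return "SLURM_JOBID" in env

# Outside salloc/sbatch this is False, so `mpirun -mca plm slurm`
# has no allocation to launch into.
```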

[OMPI devel] Configure --help results

2021-02-01 Thread Barrett, Brian via devel
Josh posted https://github.com/open-mpi/ompi/pull/8432 last week to propagate arguments from PMIx and PRRTE into the top level configure --help, and it probably deserves more discussion. There are three patches, the last of which renames arguments so you could (for example) specify different

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Ralph, Andrej - what version of Slurm are you using here? It's slurm 20.11.3, i.e. the latest release afaik. But Gilles is correct; the proposed test failed: andrej@terra:~/system/tests/MPI$ salloc -N 2 -n 2 salloc: Granted job allocation 838 andrej@terra:~/system/tests/MPI$ srun

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
The saga continues. I managed to build slurm with pmix by applying this patch and manually building the plugin: https://bugs.schedmd.com/show_bug.cgi?id=10683 Now srun shows pmix as an option: andrej@terra:~/system/tests/MPI$ srun --mpi=list srun: MPI types are... srun:

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, srun -N 1 -n 1 orted that is expected to fail, but it should at least find all its dependencies and start This was quite illuminating! andrej@terra:~/system/tests/MPI$ srun -N 1 -n 1 orted srun: /usr/local/lib/slurm/switch_generic.so: Incompatible Slurm plugin version (20.02.6)
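The "Incompatible Slurm plugin version (20.02.6)" message indicates a stale plugin built against an older release series than the running 20.11.3 daemons. A sketch of that kind of major/minor release-series check (illustrative only; not Slurm's actual code):

```python
def plugin_compatible(plugin_version, daemon_version):
    """Illustrative check: a plugin built for one release series
    (major.minor) is rejected by a daemon from another, which is what
    'Incompatible Slurm plugin version (20.02.6)' reports."""
    series = lambda v: tuple(int(x) for x in v.split(".")[:2])
    return series(plugin_version) == series(daemon_version)

print(plugin_compatible("20.02.6", "20.11.3"))
# False -- the leftover 20.02 plugin cannot be loaded by 20.11 daemons
```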

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Gilles Gouaillardet via devel
Andrej, My previous email listed other things to try Cheers, Gilles Sent from my iPod > On Feb 2, 2021, at 6:23, Andrej Prsa via devel > wrote: > > The saga continues. > > I managed to build slurm with pmix by first patching slurm using this patch > and manually building the plugin: > >

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Ralph Castain via devel
Andrej I fail to understand why you continue to think that PMI has anything to do with this problem. I see no indication of a PMIx-related issue in anything you have provided to date. In the output below, it is clear what the problem is - you locked it to the "slurm" launcher (with -mca plm