Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-27 Thread Heinz, Michael William via users
Pavel, Did you ever resolve this? A co-worker pointed out that setting that variable is the recommended way to use OMPI, PSM2 and SLURM. You can download the user manual here:

Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-21 Thread Pavel Mezentsev via users
Thank you very much for all the suggestions. 1) Sadly, setting `OMPI_MCA_orte_precondition_transports="0123456789ABCDEF-0123456789ABCDEF"` did not help; I still got the same error about not getting this piece of info from ORTE. 2) I rebuilt OpenMPI without slurm. Don't remember the exact message but

Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-19 Thread Peter Kjellström via users
On Wed, 19 May 2021 15:53:50 +0200 Pavel Mezentsev via users wrote: > It took some time but my colleague was able to build OpenMPI and get > it working with OmniPath, however the performance is quite > disappointing. The configuration line used was the > following: ./configure

Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-19 Thread Heinz, Michael William via users
Right. There was a reference counting issue in OMPI that required a change to PSM2 to properly fix. There's a configuration option to disable the reference count check at build time, although I don't recall what the option is off the top of my head. From: Carlson, Timothy S Sent: Wednesday,

Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-19 Thread Heinz, Michael William via users
After thinking about this for a few more minutes, it occurred to me that you might be able to "fake" the required UUID support by passing it as a shell variable. For example: export OMPI_MCA_orte_precondition_transports="0123456789ABCDEF-0123456789ABCDEF" would probably do it. However, note
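
A minimal sketch of the workaround suggested in this message, assuming the key only needs to be two 16-hex-digit fields joined by a dash, as in the example value, and that it is set once in the batch script so every rank inherits the same value. The helper name `make_transport_key` is hypothetical and not part of Open MPI or Slurm:

```shell
# Hypothetical helper: print a key shaped like the example value,
# i.e. two 16-hex-digit fields separated by a dash.
make_transport_key() {
    # Read 8 random bytes per field and print them as 16 lowercase hex digits.
    first=$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')
    second=$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')
    printf '%s-%s\n' "$first" "$second"
}

# Assumption: set this once, before the launcher starts the ranks,
# so that all processes of the job see an identical key.
OMPI_MCA_orte_precondition_transports=$(make_transport_key)
export OMPI_MCA_orte_precondition_transports
```

Generating the key per job rather than hard-coding it avoids two concurrent jobs sharing one identifier, though whether that matters on a given fabric is not established in this thread.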

Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-19 Thread Heinz, Michael William via users
So, the bad news is that the PSM2 MTL requires ORTE - ORTE generates a UUID to identify the job across all nodes in the fabric, allowing processes to find each other over OPA at init time. I believe the reason this works when you use OFI/libfabric is that libfabric generates its own UUIDs.

Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-19 Thread Ralph Castain via users
The original configure line is correct ("--without-orte") - just a typo in the later text. You may be running into some issues with Slurm's built-in support for OMPI. Try running it with OMPI's "mpirun" instead and see if you get better performance. You'll have to reconfigure to remove the

Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-19 Thread Jorge D'Elia via users
- Original message - > From: "Pavel Mezentsev via users" > To: users@lists.open-mpi.org > CC: "Pavel Mezentsev" > Sent: Wednesday, 19 May 2021 10:53:50 > Subject: Re: [OMPI users] unable to launch a job on a system with OmniPath > > It took some time but my colleague was able to

Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-19 Thread Pavel Mezentsev via users
It took some time but my colleague was able to build OpenMPI and get it working with OmniPath, however the performance is quite disappointing. The configuration line used was the following: ./configure --prefix=$INSTALL_PATH --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --enable-shared

Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-10 Thread Heinz, Michael William via users
That warning is an annoying bit of cruft from the openib / verbs provider that can be ignored. (Actually, I recommend using "-btl ^openib" to suppress the warning.) That said, there is a known issue with selecting PSM2 and OMPI 4.1.0. I'm not sure that that's the problem you're hitting,

[OMPI users] unable to launch a job on a system with OmniPath

2021-05-10 Thread Pavel Mezentsev via users
Hi! I'm working on a system with KNL and OmniPath and I'm trying to launch a job but it fails. Could someone please advise what parameters I need to add to make it work properly? At first I need to make it work within one node, however later I need to use multiple nodes and eventually I may need