Re: [OMPI users] Order of ranks in mpirun

2019-05-15 Thread Ralph Castain via users
> On May 15, 2019, at 7:18 PM, Adam Sylvester via users > wrote: > > Up to this point, I've been running a single MPI rank per physical host > (using multithreading within my application to use all available cores). I > use this command: > mpirun -N 1 --bind-to none --hostfile hosts.txt >

Re: [OMPI users] Errors using contexts with OSHMEM 4.1.0

2019-05-17 Thread Ralph Castain via users
Hi Lee Ann I fear so - and assign it to @hoopoepg , @brminich and @yosefe Ralph > On May 17, 2019, at 11:14 AM, Riesen, Lee Ann via users > wrote: > > I haven't received a reply to this. Should I submit a bug report? Lee Ann > > - > Lee Ann Riesen, Enterprise and Government Group,

Re: [OMPI users] process mapping

2019-06-21 Thread Ralph Castain via users
On Jun 21, 2019, at 1:52 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote: On Jun 21, 2019, at 4:45 PM, Ralph Castain <r...@open-mpi.org> wrote: Hilarious - I wrote that code and I have no idea who added that option or what it is supposed to do. I can assure, however,

Re: [OMPI users] process mapping

2019-06-21 Thread Ralph Castain via users
, at 1:43 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote: On Jun 21, 2019, at 4:04 PM, Ralph Castain via users <users@lists.open-mpi.org> wrote: I’m unaware of any “map-to cartofile” option, nor do I find it in mpirun’s help or man page. Are you seeing it somew

Re: [OMPI users] process mapping

2019-06-21 Thread Ralph Castain via users
I’m unaware of any “map-to cartofile” option, nor do I find it in mpirun’s help or man page. Are you seeing it somewhere? On Jun 21, 2019, at 12:43 PM, Noam Bernstein via users <users@lists.open-mpi.org> wrote: Hi - are there any examples of the cartofile format?  Or is there some

Re: [OMPI users] Unable to run a python code on cluster with mpirun in parallel

2019-09-09 Thread Ralph Castain via users
Take a look at "man orte_hosts" for a full explanation of how to use hostfile - /etc/hosts is not a properly formatted hostfile. You really just want a file that lists the names of the hosts, one per line, as that is the simplest hostfile. > On Sep 7, 2019, at 4:23 AM, Sepinoud Azimi via users

Re: [OMPI users] MPI_Comm_Spawn failure: All nodes already filled

2019-08-07 Thread Ralph Castain via users
Yeah, we do currently require that to be true. Process mapping is distributed across the daemons - i.e., the daemon on each node independently computes the map. We have talked about picking up the hostfile on the head node and sending out the contents, but haven't implemented that yet. On Aug

[OMPI users] SGE Users/Dev: Request

2019-07-26 Thread Ralph Castain via users
I just wanted to address a question to the SGE users and/or developers on this list. As you may know, we have been developing PMIx for the last few years and have now integrated it into various RMs. This allows the RMs to directly launch application processes without going through mpirun and

Re: [OMPI users] MPI_Comm_Spawn failure: All nodes already filled

2019-08-06 Thread Ralph Castain via users
I'm afraid I cannot replicate this problem on OMPI master, so it could be something different about OMPI 4.0.1 or your environment. Can you download and test one of the nightly tarballs from the "master" branch and see if it works for you? https://www.open-mpi.org/nightly/master/ Ralph On

Re: [OMPI users] OMPI was not built with SLURM's PMI support

2019-08-08 Thread Ralph Castain via users
Did you configure Slurm to use PMIx? If so, then you simply need to set the "--mpi=pmix" or "--mpi=pmix_v2" (depending on which version of PMIx you used) flag on your srun cmd line so it knows to use it. If not (and you can't fix it), then you have to explicitly configure OMPI to use Slurm's
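
As a sketch, assuming Slurm was built with PMIx support and ./a.out is a placeholder MPI binary:
    $ srun --mpi=list                    # shows which MPI plugins this Slurm build supports
    $ srun --mpi=pmix_v2 -N 2 -n 8 ./a.out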

Re: [OMPI users] OMPI was not built with SLURM's PMI support

2019-08-09 Thread Ralph Castain via users
Artem - do you have any suggestions? On Aug 8, 2019, at 12:06 PM, Jing Gong <gongj...@kth.se> wrote: Hi Ralph, $ Did you remember to add "--mpi=pmix" to your srun cmd line? On the cluster, $ srun  --mpi=list srun: MPI types are... srun: none srun: openmpi srun: pmi2 srun: pmix srun:

Re: [OMPI users] OMPI was not built with SLURM's PMI support

2019-08-09 Thread Ralph Castain via users
failed --> Returned value Not found (-13) instead of ORTE_SUCCESS -- What is the issue? Thanks a lot. /Jing From: users <users-boun...@lists.open-mpi.org> on behalf of Ra

Re: [OMPI users] How is the rank determined (Open MPI and Podman)

2019-07-24 Thread Ralph Castain via users
rent MODEX keys > are used. It seems like MODEX can not fetch messages in another order > than it was sent. Is that so? > > Not sure how to tell the other processes to not use CMA, while some > processes are still transmitting their user namespace ID to PROC 0. > >

Re: [OMPI users] How is the rank determined (Open MPI and Podman)

2019-07-22 Thread Ralph Castain via users
If that works, then it might be possible to include the namespace ID in the job-info provided by PMIx at startup - would have to investigate, so please confirm that the modex option works first. > On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users > wrote: > > Adrian, > > > An

Re: [OMPI users] TMPDIR for running openMPI job under grid

2019-07-26 Thread Ralph Castain via users
Upgrade to OMPI v4 or at least something in the v3 series. If you continue to have a problem, then set PMIX_MCA_ptl=tcp in your environment. On Jul 26, 2019, at 12:12 PM, Kulshrestha, Vipul via users <users@lists.open-mpi.org> wrote: Hi,  I am trying to setup my open-mpi application

Re: [OMPI users] Issue with OpenMPI + SLURM + Plugins when using OpenMPI 4.0.1

2019-11-02 Thread Ralph Castain via users
I'm afraid I don't know how to advise you on this - you may need to talk to the Slurm folks. When you start your application with mpirun, we use "srun" to start our own daemons on the job's nodes. The application processes, however, are subsequently started by those daemons using our own

Re: [OMPI users] How can I specify the number of threads for mpirun?

2019-11-14 Thread Ralph Castain via users
I _think_ what the user is saying is that their "hello world" program is returning rank=0 for all procs when started with mpirun, but not when started with MPICH's mpiexec.hydra. The most likely problem is that your "hello" program wasn't built against OMPI - are you trying to run the same

Re: [OMPI users] OpenMPI - Job pauses and goes no further

2019-11-13 Thread Ralph Castain via users
Difficult to know what to say here. I have no idea what your program does after validating the license. Does it execute some kind of MPI collective operation? Does only one proc validate the license and all others just use it? All I can tell from your output is that the procs all launched okay.

Re: [OMPI users] Change behavior of --output-filename

2019-11-12 Thread Ralph Castain via users
The man page is simply out of date - see https://github.com/open-mpi/ompi/issues/7095 for further thinking On Nov 12, 2019, at 1:26 AM, Max Sagebaum via users <users@lists.open-mpi.org> wrote: Hello @ all, Short question: How to select what is the behavior of --output-filename? Long

Re: [OMPI users] Singleton and Spawn

2019-09-25 Thread Ralph Castain via users
It's a different code path, that's all - just a question of what path gets traversed. Would you mind posting a little more info on your two use-cases? For example, do you have a default hostfile telling mpirun what machines to use?  On Sep 25, 2019, at 12:41 PM, Martín Morales

Re: [OMPI users] Singleton and Spawn

2019-09-25 Thread Ralph Castain via users
Yes, of course it can - however, I believe there is a bug in the add-hostfile code path. We can address that problem far easier than moving to a different interconnect. On Sep 25, 2019, at 11:39 AM, Martín Morales via users <users@lists.open-mpi.org> wrote: Thanks Steven. So,

Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-29 Thread Ralph Castain via users
Urrr...this problem has been resolved, Howard. On Jan 29, 2020, at 2:51 PM, Howard Pritchard via users <users@lists.open-mpi.org> wrote: Collin, A couple of things to try.  First, could you just configure without using the mellanox platform file and see if you can run the app with 100

Re: [OMPI users] [EXTERNAL] Shmem errors on Mac OS Catalina

2020-02-06 Thread Ralph Castain via users
It is also wise to create a "tmp" directory under your home directory, and reset TMPDIR to point there. Avoiding use of the system tmpdir is highly advisable under Mac OS, especially Catalina. On Feb 6, 2020, at 4:09 PM, Gutierrez, Samuel K. via users mailto:users@lists.open-mpi.org> > wrote:

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Does it work with pbs but not Mellanox? Just trying to isolate the problem. On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote: Hello,  I have done some additional testing and I can say that it works correctly with gcc8 and no mellanox or pbs

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Can you send the output of a failed run including your command line.    Josh   On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <users@lists.open-mpi.org> wrote: Okay, so this is a problem with the Mellanox software - copying Artem. On Jan 28, 2020, at 8:15 AM, Collin S

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users Sent: Tuesday, January 28, 2020 11:02 AM To: Open MPI Users <users@lists.open-mpi.org> Cc: Ralph Castain <r...@open-mpi.org> Subject: Re: [OMPI users] [External] Re: OMPI returns e

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Okay, debug-daemons isn't going to help as we aren't launching any daemons. This is all one node. So try adding "--mca odls_base_verbose 10 --mca state_base_verbose 10" to the cmd line and let's see what is going on. I agree with Josh - neither mpirun nor hostname are invoking the Mellanox

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Okay, that nailed it down - the problem is the number of open file descriptors is exceeding your system limit. I suspect the connection to the Mellanox drivers is solely due to it also having some descriptors open, and you are just close enough to the boundary that it causes you to hit it. See
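
A typical way to check and raise the per-process limit before launching (whether you may raise it, and how far, depends on your system; ./a.out is a placeholder):
    $ ulimit -n              # show the current soft limit on open file descriptors
    $ ulimit -n 65536        # raise it for this shell, up to the hard limit
    $ mpirun -np 128 ./a.out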

Re: [OMPI users] Clean termination after receiving multiple SIGINT

2020-04-06 Thread Ralph Castain via users
reut...@siemens.com> www.sw.siemens.com <http://www.sw.siemens.com/> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users Sent: Monday, 6 April 2020 16:32 To: Open MPI Users <users@lists.open-mpi.org> Cc: Ralph Castain <r...@open-mp

Re: [OMPI users] Clean termination after receiving multiple SIGINT

2020-04-06 Thread Ralph Castain via users
Currently, mpirun takes that second SIGINT to mean "you seem to be stuck trying to cleanly abort - just die", which means mpirun exits immediately without doing any cleanup. The individual procs all commit suicide when they see their daemons go away, which is why you don't get zombies left

Re: [OMPI users] Meaning of mpiexec error flags

2020-04-13 Thread Ralph Castain via users
I updated the message to explain the flags (instead of a numerical value) for OMPI v5. In brief: #define PRRTE_NODE_FLAG_DAEMON_LAUNCHED    0x01   // whether or not the daemon on this node has been launched #define PRRTE_NODE_FLAG_LOC_VERIFIED               0x02   // whether or not the

Re: [OMPI users] Interpreting the output of --display-map and --display-allocation

2020-03-16 Thread Ralph Castain via users
FWIW: I have replaced those flags in the display option output with their string equivalent to make interpretation easier. This is available in OMPI master and will be included in the v5 release. > On Nov 21, 2019, at 2:08 AM, Peter Kjellström via users > wrote: > > On Mon, 18 Nov 2019

Re: [OMPI users] Propagating SIGINT instead of SIGTERM to children processes

2020-03-16 Thread Ralph Castain via users
Hi Nathan Sorry for the long, long delay in responding - no reasonable excuse (just busy, switching over support areas, etc.). Hopefully, you already found the solution. You can specify the signals to forward to children using an MCA parameter: OMPI_MCA_ess_base_forward_signals=SIGINT should
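
For example, using the parameter named in the reply (the application name is a placeholder):
    $ export OMPI_MCA_ess_base_forward_signals=SIGINT
    $ mpirun -np 4 ./a.out    # SIGINT received by the runtime is now forwarded to the child processes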

Re: [OMPI users] mpirun CLI parsing

2020-03-30 Thread Ralph Castain via users
I'm afraid the short answer is "no" - there is no way to do that today. > On Mar 30, 2020, at 1:45 PM, Jean-Baptiste Skutnik via users > wrote: > > Hello, > > I am writing a wrapper around `mpirun` which requires pre-processing of the > user's program. To achieve this, I need to isolate the

Re: [OMPI users] Can't start jobs with srun.

2020-04-26 Thread Ralph Castain via users
.. srun: none srun: pmi2 srun: openmpi I did launch the job with srun --mpi=pmi2 Does OpenMPI 4 need PMIx specifically? On 4/23/20 10:23 AM, Ralph Castain via users wrote: Is Slurm built with PMIx support? Did you tell srun to use it? On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users

Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-04-23 Thread Ralph Castain via users
--mpi=pmi2 > > Does OpenMPI 4 need PMIx specifically? > > > On 4/23/20 10:23 AM, Ralph Castain via users wrote: >> Is Slurm built with PMIx support? Did you tell srun to use it? >> >> >>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users

Re: [OMPI users] Can't start jobs with srun.

2020-04-23 Thread Ralph Castain via users
Is Slurm built with PMIx support? Did you tell srun to use it? > On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users > wrote: > > I'm using OpenMPI 4.0.3 with Slurm 19.05.5 I'm testing the software with a > very simple hello, world MPI program that I've used reliably for years. When > I

Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-04-23 Thread Ralph Castain via users
Why is that? Can I not trust the output > of --mpi=list? > > Prentice > > On 4/23/20 10:43 AM, Ralph Castain via users wrote: >> No, but you do have to explicitly build OMPI with non-PMIx support if that >> is what you are going to use. In this case, you need to conf

Re: [OMPI users] Meaning of mpiexec error flags

2020-04-14 Thread Ralph Castain via users
he working node flag (0x11) and the non-working nodes’ flags (0x13) is the flag PRRTE_NODE_FLAG_LOC_VERIFIED.    What does that imply?   The location of the daemon has NOT been verified?  Kurt  From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users Sent: Mo

Re: [OMPI users] I can't build openmpi 4.0.X using PMIx 3.1.5 to use with Slurm

2020-05-11 Thread Ralph Castain via users
I'm not sure I understand why you are trying to build CentOS rpms for PMIx, Slurm, or OMPI - all three are readily available online. Is there some particular reason you are trying to do this yourself? I ask because it is non-trivial to do and requires significant familiarity with both the

Re: [OMPI users] I can't build openmpi 4.0.X using PMIx 3.1.5 to use with Slurm

2020-05-12 Thread Ralph Castain via users
Try adding --without-psm2 to the PMIx configure line - sounds like you have that library installed on your machine, even though you don't have omnipath. On May 12, 2020, at 4:42 AM, Leandro via users <users@lists.open-mpi.org> wrote: HI,  I compile it statically to make sure compilers
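
In other words, something along these lines when configuring PMIx (the install prefix is a placeholder):
    $ ./configure --without-psm2 --prefix=$HOME/pmix
    $ make -j && make install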

Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-05-06 Thread Ralph Castain via users
The following (from what you posted earlier): $ srun --mpi=list srun: MPI types are... srun: none srun: pmix_v3 srun: pmi2 srun: openmpi srun: pmix would indicate that Slurm was built against a PMIx v3.x release. Using OMPI v4.0.3 with pmix=internal should be just fine so long as you set

Re: [OMPI users] MPI_Comm_spawn: no allocated resources for the application ...

2020-03-16 Thread Ralph Castain via users
Sorry for the incredibly late reply. Hopefully, you have already managed to find the answer. I'm not sure what your comm_spawn command looks like, but it appears you specified the host in it using the "dash_host" info-key, yes? The problem is that this is interpreted the same way as the "-host

Re: [OMPI users] can't open /dev/ipath, network down (err=26)

2020-05-08 Thread Ralph Castain via users
I fear those cards are past end-of-life so far as support is concerned. I'm not sure if anyone can really advise you on them. It sounds like the fabric is experiencing failures, but that's just a guess. On May 8, 2020, at 12:56 PM, Prentice Bisbal via users <users@lists.open-mpi.org>

Re: [OMPI users] OMPI 4.0.4 how to use mpirun properly in numa architecture

2020-08-20 Thread Ralph Castain via users
Your use-case sounds more like a workflow than an application - in which case, you probably should be using PRRTE to execute it instead of "mpirun" as PRRTE will "remember" the multiple jobs and avoid the overload scenario you describe. This link will walk you thru how to get and build it: 

Re: [OMPI users] OMPI 4.0.4 how to use mpirun properly in numa architecture

2020-08-20 Thread Ralph Castain via users
chemist and not a sysadmin (I miss a lot a specialized sysadmin in our Department!). Carlo On Thu, Aug 20, 2020 at 18:45 Ralph Castain via users <users@lists.open-mpi.org> wrote: Your use-case sounds more like a workflow than an application - in which case, you pro

Re: [OMPI users] Limiting IP addresses used by OpenMPI

2020-09-30 Thread Ralph Castain via users
I'm not sure where you are looking, but those params are indeed present in the opal/mca/btl/tcp component: /*  *  Called by MCA framework to open the component, registers  *  component parameters.  */ static int mca_btl_tcp_component_register(void) {     char* message;     /* register TCP

Re: [OMPI users] Running mpirun with grid

2020-05-31 Thread Ralph Castain via users
The messages about the daemons is coming from two different sources. Grid is saying it was able to spawn the orted - then the orted is saying it doesn't know how to communicate and fails. I think the root of the problem lies in the plm output that shows the qrsh it will use to start the job.

Re: [OMPI users] Running mpirun with grid

2020-06-01 Thread Ralph Castain via users
_base). > Please check with your sys admin to determine the correct location to use. > > * compilation of the orted with dynamic libraries when static are required > (e.g., on Cray). Please check your configure cmd line and consider using > one of the contrib/platform definitions for your system type. > > * an inability to create a connection back to mpirun due to a > lack of common ne

Re: [OMPI users] Running mpirun with grid

2020-06-01 Thread Ralph Castain via users
Afraid I have no real ideas here. Best I can suggest is taking the qrsh cmd line from the prior debug output and try running it manually. This might give you a chance to manipulate it and see if you can identify what is causing it an issue, if anything. Without mpirun executing, the daemons

Re: [OMPI users] Correct mpirun Options for Hybrid OpenMPI/OpenMP

2020-08-03 Thread Ralph Castain via users
By default, OMPI will bind your procs to a single core. You probably want to at least bind to socket (for NUMA reasons), or not bind at all if you want to use all the cores on the node. So either add "--bind-to socket" or "--bind-to none" to your cmd line. On Aug 3, 2020, at 1:33 AM, John
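
For example, one of the following (rank count and executable are placeholders):
    $ mpirun -np 16 --bind-to socket ./a.out   # bind each rank to its socket (NUMA-friendly)
    $ mpirun -np 16 --bind-to none   ./a.out   # no binding; threads may use all cores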

Re: [OMPI users] MPI is still dominant paradigm?

2020-08-07 Thread Ralph Castain via users
The Java bindings were added specifically to support the Spark/Hadoop communities, so I see no reason why you couldn't use them for Akka or whatever. Note that there are also Python wrappers for MPI at mpi4py that you could build upon. There is plenty of evidence out there for a general

Re: [OMPI users] ORTE HNP Daemon Error - Generated by Tweaking MTU

2020-08-10 Thread Ralph Castain via users
Well, we aren't really that picky :-) While I agree with Gilles that we are unlikely to be able to help you resolve the problem, we can give you a couple of ideas on how to chase it down First, be sure to build OMPI with "--enable-debug" and then try adding "--mca oob_base_verbose 100" to you
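
A sketch of that debugging recipe (paths, hostfile, and executable are placeholders; the follow-up message adds --debug-daemons as well):
    $ ./configure --enable-debug --prefix=$HOME/ompi-debug && make -j && make install
    $ mpirun --mca oob_base_verbose 100 -np 2 --hostfile hosts.txt ./a.out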

Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.

2020-08-11 Thread Ralph Castain via users
Howard - if there is a problem in PMIx that is causing this problem, then we really could use a report on it ASAP as we are getting ready to release v3.1.6 and I doubt we have addressed anything relevant to what is being discussed here. On Aug 11, 2020, at 4:35 PM, Martín Morales via users

Re: [OMPI users] ORTE HNP Daemon Error - Generated by Tweaking MTU

2020-08-10 Thread Ralph Castain via users
My apologies - I should have included "--debug-daemons" for the mpirun cmd line so that the stderr of the backend daemons would be output. > On Aug 10, 2020, at 10:28 AM, John Duffy via users > wrote: > > Thanks Ralph > > I will do all of that. Much appreciated.

Re: [OMPI users] Issues with MPI_Comm_Spawn

2020-08-12 Thread Ralph Castain via users
Setting aside the known issue with comm_spawn in v4.0.4, how are you planning to forward stdin without the use of "mpirun"? Something has to collect stdin of the terminal and distribute it to the stdin of the processes > On Aug 12, 2020, at 9:20 AM, Alvaro Payero Pinto via users > wrote: > >

Re: [OMPI users] Running with Intel Omni-Path

2020-08-01 Thread Ralph Castain via users
Add "--mca pml cm" to your cmd line On Jul 31, 2020, at 9:54 PM, Supun Kamburugamuve via users mailto:users@lists.open-mpi.org> > wrote: Hi all, I'm trying to setup OpenMPI on a cluster with the Omni-Path network. When I try the following command it gives an error. mpirun -n 2 --hostfile

Re: [OMPI users] Issues with MPI_Comm_Spawn

2020-08-12 Thread Ralph Castain via users
18:29, Ralph Castain via users <users@lists.open-mpi.org> wrote: Setting aside the known issue with comm_spawn in v4.0.4, how are you planning to forward stdin without the use of "mpirun"? Something has to collect stdin of the terminal and distribute it to the stdin of t

Re: [OMPI users] Any reason why I can't start an mpirun job from within an mpi process?

2020-07-11 Thread Ralph Castain via users
You cannot cascade mpirun cmds like that - the child mpirun picks up envars that causes it to break. You'd have to either use comm_spawn to start the child job, or do a fork/exec where you can set the environment to be some pristine set of values. > On Jul 11, 2020, at 1:12 PM, John Retterer

Re: [OMPI users] slot number calculation when no config files?

2020-06-08 Thread Ralph Castain via users
Note that you can also resolve it by adding --use-hwthread-cpus to your cmd line - it instructs mpirun to treat the HWTs as independent cpus so you would have 4 slots in this case. > On Jun 8, 2020, at 11:28 AM, Collin Strassburger via users > wrote: > > Hello David, > > The slot
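
For example, on a node with 2 cores and 4 hardware threads (executable is a placeholder):
    $ mpirun --use-hwthread-cpus -np 4 ./a.out   # each hardware thread counts as a slot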

Re: [OMPI users] Moving an installation

2020-07-24 Thread Ralph Castain via users
While possible, it is highly unlikely that your desktop version is going to be binary compatible with your cluster... On Jul 24, 2020, at 9:55 AM, Lana Deere via users <users@lists.open-mpi.org> wrote: I have open-mpi 4.0.4 installed on my desktop and my small test programs are

Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-12-02 Thread Ralph Castain via users
Just a point to consider. OMPI does _not_ want to get in the mode of modifying imported software packages. That is a blackhole of effort we simply cannot afford. The correct thing to do would be to flag Rob Latham on that PR and ask that he upstream the fix into ROMIO so we can absorb it. We

Re: [OMPI users] MPMD hostfile: executables on same hosts

2020-12-21 Thread Ralph Castain via users
You want to use the "sequential" mapper and then specify each proc's location, like this for your hostfile: host1 host1 host2 host2 host3 host3 host1 host2 host3 and then add "--mca rmaps seq" to your mpirun cmd line. Ralph On Dec 21, 2020, at 5:22 AM, Vineet Soni via users

Re: [OMPI users] pmi.h/pmi2.h found but libpmi/libpmi missing

2020-12-20 Thread Ralph Castain via users
Did you remember to build the Slurm pmi and pmi2 libraries? They aren't built by default - IIRC, you have to manually go into a subdirectory and do a "make install" to have them built and installed. You might check the Slurm documentation for details. You also might need to add a

Re: [OMPI users] [External] Re: mpi/pmix: ERROR: Error handler invoked: status = -25: No such file or directory (2)

2020-11-12 Thread Ralph Castain via users
expected. I just want to make sure that this was the case, and the error > below wasn't a sign of another issue with the job. > > Prentice > > On 11/11/20 5:47 PM, Ralph Castain via users wrote: >> Looks like it is coming from the Slurm PMIx plugin, not OMPI. >>

Re: [OMPI users] mpi/pmix: ERROR: Error handler invoked: status = -25: No such file or directory (2)

2020-11-11 Thread Ralph Castain via users
Looks like it is coming from the Slurm PMIx plugin, not OMPI. Artem - any ideas? Ralph > On Nov 11, 2020, at 10:03 AM, Prentice Bisbal via users > wrote: > > One of my users recently reported a failed job that was using OpenMPI 4.0.4 > compiled with PGI 20.4. There two different errors

Re: [OMPI users] PRRTE DVM: how to tell prun to not share nodes among prun jobs?

2020-11-14 Thread Ralph Castain via users
That would be very kind of you and most welcome! > On Nov 14, 2020, at 12:38 PM, Alexei Colin wrote: > > On Sat, Nov 14, 2020 at 08:07:47PM +0000, Ralph Castain via users wrote: >> IIRC, the correct syntax is: >> >> prun -host +e ... >> >> Thi

Re: [OMPI users] Starting a mixed fortran python MPMD application

2020-11-04 Thread Ralph Castain via users
Afraid I would have no idea - all I could tell them is that there was a bug and it has been fixed On Nov 2, 2020, at 12:18 AM, Andrea Piacentini via users <users@lists.open-mpi.org> wrote: I installed version 4.0.5 and the problem appears to be fixed. Can you please help us

Re: [OMPI users] Starting a mixed fortran python MPMD application

2020-10-28 Thread Ralph Castain via users
Could you please tell us what version of OMPI you are using? On Oct 28, 2020, at 11:16 AM, Andrea Piacentini via users <users@lists.open-mpi.org> wrote: Good morning we need to launch a MPMD application with two fortran executables and one interpreted python (mpi4py) application.

Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-25 Thread Ralph Castain via users
I think you mean add "--mca mtl ofi" to the mpirun cmd line > On Jan 25, 2021, at 10:18 AM, Heinz, Michael William via users > wrote: > > What happens if you specify -mtl ofi ? > > -Original Message- > From: users On Behalf Of Patrick Begou via > users > Sent: Monday, January 25,

Re: [OMPI users] MCA parameter "orte_base_help_aggregate"

2021-01-25 Thread Ralph Castain via users
There should have been an error message right above that - all this is saying is that the same error message was output by 7 more processes besides the one that was output. It then indicates that process 3 (which has pid 0?) was killed. Looking at the help message tag, it looks like no NICs

Re: [OMPI users] Binding blocks of processes in round-robin fashion

2021-01-29 Thread Ralph Castain via users
Okay, I can't promise when I'll get to it, but I'll try to have it in time for OMPI v5. On Jan 29, 2021, at 1:30 AM, Luis Cebamanos via users <users@lists.open-mpi.org> wrote: Hi Ralph, It would be great to have it for load balancing issues. Ideally one could do something like

Re: [OMPI users] Binding blocks of processes in round-robin fashion

2021-01-28 Thread Ralph Castain via users
the app-contexts wind up in MPI_COMM_WORLD. On Jan 28, 2021, at 3:18 PM, Luis Cebamanos via users <users@lists.open-mpi.org> wrote: That's right Ralph! On 28/01/2021 23:13, Ralph Castain via users wrote: Trying to wrap my head around this, so let me try a 2-node example. You want

Re: [OMPI users] Binding blocks of processes in round-robin fashion

2021-01-28 Thread Ralph Castain via users
Trying to wrap my head around this, so let me try a 2-node example. You want (each rank bound to a single core): ranks 0-3 to be mapped onto node1 ranks 4-7 to be mapped onto node2 ranks 8-11 to be mapped onto node1 ranks 12-15 to be mapped onto node2 etc.etc. Correct? > On Jan 28, 2021, at

Re: [OMPI users] How to set parameters to utilize multiple network interfaces?

2021-06-11 Thread Ralph Castain via users
You can still use "map-by" to get what you want since you know there are four interfaces per node - just do "--map-by ppr:8:node". Note that you definitely do NOT want to list those multiple IP addresses in your hostfile - all you are doing is causing extra work for mpirun as it has to DNS

Re: [OMPI users] [Help] Must orted exit after all spawned proecesses exit

2021-05-19 Thread Ralph Castain via users
To answer your specific questions: The backend daemons (orted) will not exit until all locally spawned procs exit. This is not configurable - for one thing, OMPI procs will suicide if they see the daemon depart, so it makes no sense to have the daemon fail if a proc terminates. The logic

Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-19 Thread Ralph Castain via users
The original configure line is correct ("--without-orte") - just a typo in the later text. You may be running into some issues with Slurm's built-in support for OMPI. Try running it with OMPI's "mpirun" instead and see if you get better performance. You'll have to reconfigure to remove the

Re: [OMPI users] How do I launch workers by our private protocol?

2021-04-21 Thread Ralph Castain via users
I'm not sure we support what you are wanting to do. You can direct mpiexec to use a specified script to launch its daemons on remote nodes. The daemons will need to connect back via TCP to mpiexec. The daemons are responsible for fork/exec'ing the local MPI application procs on each node.

Re: [OMPI users] Dynamic process allocation hangs

2021-03-24 Thread Ralph Castain via users
Apologies for the very long delay in response. This has been verified fixed in OMPI's master branch that is to be released as v5.0 in the near future. Unfortunately, there are no plans to backport that fix to earlier release series. We therefore recommend that you upgrade to v5.0 if you retain

Re: [OMPI users] Dynamic process allocation hangs

2021-03-25 Thread Ralph Castain via users
Hmmm...disturbing. The changes I made have somehow been lost. I'll have to redo it - will get back to you when it is restored. On Mar 25, 2021, at 2:54 PM, L Lutret <lu.lut...@gmail.com> wrote: Hi Ralph, Thanks for your response. I tried with the master branch a very simple spawn

Re: [OMPI users] How do you change ports used? [EXT]

2021-03-18 Thread Ralph Castain via users
(pure default), it just doesn’t function (I’m guessing because it chose “bad” or in-use ports). On 18 Mar 2021, at 14:11, Ralph Castain via users <users@lists.open-mpi.org> wrote: Hard to say - unless there is some reason, why not make it large enough to not be an issue?

Re: [OMPI users] How do you change ports used? [EXT]

2021-03-18 Thread Ralph Castain via users
t range resulted in the issue I posted about here before, where mpirun just does nothing for 5mins and then terminates itself, without any error messages.) Cheers, Sendu. On 17 Mar 2021, at 13:25, Ralph Castain via users <users@lists.open-mpi.org> wrote: What you are missing i

Re: [OMPI users] How do you change ports used?

2021-03-17 Thread Ralph Castain via users
What you are missing is that there are _two_ messaging layers in the system. You told the btl/tcp layer to use the specified ports, but left the oob/tcp one unspecified. You need to add oob_tcp_dynamic_ipv4_ports = 46207-46239 or whatever range you want to specify Note that if you want the
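
Putting the two layers together might look like this (the oob range comes from the thread; the btl/tcp parameter names are what I believe the corresponding settings are called, so verify them with ompi_info; hostfile and executable are placeholders):
    $ mpirun --mca oob_tcp_dynamic_ipv4_ports 46207-46239 \
             --mca btl_tcp_port_min_v4 46207 --mca btl_tcp_port_range_v4 32 \
             -np 4 --hostfile hosts.txt ./a.out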

Re: [OMPI users] How do you change ports used? [EXT]

2021-03-19 Thread Ralph Castain via users
available ports, but is it checking those ports are also available on all the other hosts it’s going to run on? On 18 Mar 2021, at 15:57, Ralph Castain via users <users@lists.open-mpi.org> wrote: Hmmm...then you have something else going on. By default, OMPI will ask the loca

Re: [OMPI users] building openshem on opa

2021-03-22 Thread Ralph Castain via users
You did everything right - the OSHMEM implementation in OMPI only supports UCX as it is essentially a Mellanox offering. I think the main impediment to broadening it is simply interest and priority on the part of the non-UCX developers. > On Mar 22, 2021, at 7:51 AM, Michael Di Domenico via

Re: [OMPI users] Mapping, binding and ranking

2021-02-28 Thread Ralph Castain via users
[binding map output omitted] On 28/02/2021 16:24, Ralph Castain via users wrote: Did you read the documentation on rankfile? The "slot=N" directive says to "put this proc on

Re: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 ?

2021-03-04 Thread Ralph Castain via users
Excuse me, but would you please ensure that you do not send mail to a mailing list containing this label: [AMD Official Use Only - Internal Distribution Only] Thank you Ralph On Mar 4, 2021, at 4:55 AM, Raut, S Biplab via users <users@lists.open-mpi.org> wrote: [AMD Official Use Only

Re: [OMPI users] Mapping, binding and ranking

2021-03-01 Thread Ralph Castain via users
e core, and the second bound to all the rest, with no use of hyperthreads. Would this be --map-by ppr:2:node --bind-to core --cpu-list 0,1-31 ? Thx On 2/28/21 5:44 PM, Ralph Castain via users wrote: The only way I know of to do what you want is --map-by ppr:32:socket

Re: [OMPI users] Mapping, binding and ranking

2021-03-01 Thread Ralph Castain via users
other policies. I have also tried with --cpu-set with identical results. Probably rankfile is my only option too. On 28/02/2021 22:44, Ralph Castain via users wrote: The only way I know of to do what you want is --map-by ppr:32:socket --bind-to core --cpu-list 0,2,4,6,... whe

Re: [OMPI users] Mapping, binding and ranking

2021-03-01 Thread Ralph Castain via users
I need a rankfile listing all the hosts? John On 3/1/21 10:26 AM, Ralph Castain via users wrote: I'm afraid not - you have simply told us that all cpus are available. I don't know of any way to accomplish what John wants other than with a rankfile. On Mar 1, 2021, at 7:13 AM, Luis Ceb

Re: [OMPI users] Mapping, binding and ranking

2021-02-28 Thread Ralph Castain via users
Your command line is incorrect: --map-by ppr:32:socket:PE=4 --bind-to hwthread should be --map-by ppr:32:socket:PE=2 --bind-to core On Feb 28, 2021, at 5:57 AM, Luis Cebamanos via users <users@lists.open-mpi.org> wrote: I should have said, "I would like to run 128 MPI processes on 2
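
So the corrected invocation would be along these lines (the rank count assumes 2 nodes x 2 sockets x 32 ranks per socket; executable is a placeholder):
    $ mpirun -np 128 --map-by ppr:32:socket:PE=2 --bind-to core ./a.out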

Re: [OMPI users] Mapping, binding and ranking

2021-02-28 Thread Ralph Castain via users
And this is still different from the output produced using the rankfile. Cheers, Luis On 28/02/2021 14:06, Ralph Castain via users wrote: Your command line is incorrect: --map-by ppr:32:socket:PE=4 --bind-to hwthread

Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-08-11 Thread Ralph Castain via users
[signature] Ryan Novosielski - novos...@rutgers.edu | Sr. Technologist | Office of Advanced Research Computing, Rutgers >> On Jul

Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-07-29 Thread Ralph Castain via users
Job step aborted: Waiting up to 32 seconds for job step to finish. > srun: error: gpu004: tasks 0-1: Exited with exit code 1 > > -- > #BlackLivesMatter > [signature: Ryan Novosielski, Rutgers] >> On Jul

Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-07-29 Thread Ralph Castain via users
Ryan - I suspect what Sergey was trying to say was that you need to ensure OMPI doesn't try to use the OpenIB driver, or at least that it doesn't attempt to initialize it. Try adding OMPI_MCA_pml=ucx to your environment. On Jul 29, 2021, at 1:56 AM, Sergey Oblomov via users

Re: [OMPI users] [External] Re: cpu binding of mpirun to follow slurm setting

2021-10-11 Thread Ralph Castain via users
You just need to tell mpirun that you want your procs to be bound to cores, not socket (which is the default). Add "--bind-to core" to your mpirun cmd line On Oct 10, 2021, at 11:17 PM, Chang Liu via users <users@lists.open-mpi.org> wrote: Yes they are. This is an interactive job from

Re: [OMPI users] [External] Re: cpu binding of mpirun to follow slurm setting

2021-10-11 Thread Ralph Castain via users
d that? Thanks. >Ray > > > From: users on behalf of Ralph Castain via > users > Sent: Monday, October 11, 2021 1:49 PM > To: Open MPI Users > Cc: Ralph Castain > Subject: Re: [OMPI users] [External] Re: cpu bi

Re: [OMPI users] [External] Re: cpu binding of mpirun to follow slurm setting

2021-10-11 Thread Ralph Castain via users
o processes sharing a physical core. I guess there is a way to do that by playing with mapping. I just want to know if this is a bug in mpirun, or this feature for interacting with slurm was never implemented. Chang On 10/11/21 10:07 AM, Ralph Castain via users wrote: You just need to tell

Re: [OMPI users] [External] Re: cpu binding of mpirun to follow slurm setting

2021-10-11 Thread Ralph Castain via users
<users@lists.open-mpi.org> wrote: OK thank you. Seems that srun is a better option for normal users. Chang On 10/11/21 1:23 PM, Ralph Castain via users wrote: Sorry, your output wasn't clear about cores vs hwthreads. Apparently, your Slurm config is setup to use hwthreads as indep

Re: [OMPI users] cpu binding of mpirun to follow slurm setting

2021-10-10 Thread Ralph Castain via users
Could you please include (a) what version of OMPI you are talking about, and (b) the binding patterns you observed from both srun and mpirun? > On Oct 9, 2021, at 6:41 PM, Chang Liu via users > wrote: > > Hi, > > I wonder if mpirun can follow the cpu binding settings from slurm, when >
