Re: [OMPI users] mpirun hanging when processes started on head node

2007-06-12 Thread Ralph H Castain
Hi Sean > [Sean] I'm working through the strace output to follow the progression on the > head node. It looks like mpirun consults '/bpfs/self' and determines that the > request is to be run on the local machine so it fork/execs 'orted' which then > runs 'hostname'. 'mpirun' didn't consult '/bpfs'

Re: [OMPI users] Connection to HNP lost

2007-07-10 Thread Ralph H Castain
On 7/10/07 11:08 AM, "Glenn Carver" wrote: > Hi, > > I'd be grateful if someone could explain the meaning of this error > message to me and whether it indicates a hardware problem or > application software issue: > > [node2:11881] OOB: Connection to HNP lost > [node1:09876] OOB: Connection t

Re: [OMPI users] Recursive use of "orterun"

2007-07-11 Thread Ralph H Castain
I'm unaware of any issues that would cause it to fail just because it is being run via that interface. The error message is telling us that the procs got launched, but then orterun went away unexpectedly. Are you seeing your procs complete? We do sometimes see that message due to a race condition

Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)

2007-07-11 Thread Ralph H Castain
ing that module; the next thing in > the code is the os.system call to start orterun with 2 processors.) > > Also, there is absolutely no output from the second orterun-launched > program (even the first line does not execute.) > > Cheers, > > Lev > > > >>

Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Ralph H Castain
On 7/18/07 9:49 AM, "Adam C Powell IV" wrote: > As mentioned, I'm running in a chroot environment, so rsh and ssh won't > work: "rsh localhost" will rsh into the primary local host environment, > not the chroot, which will fail. > > [The purpose is to be able to build and test MPI programs in

Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Ralph H Castain
Tim has proposed a clever fix that I had not thought of - just be aware that it could cause unexpected behavior at some point. Still, for what you are trying to do, that might meet your needs. Ralph On 7/18/07 11:44 AM, "Tim Prins" wrote: > Adam C Powell IV wrote: >> As mentioned, I'm running

Re: [OMPI users] mpirun hanging followup

2007-07-18 Thread Ralph H Castain
On 7/18/07 11:46 AM, "Bill Johnstone" wrote: > --- Ralph Castain wrote: > >> No, the session directory is created in the tmpdir - we don't create >> anything anywhere else, nor do we write any executables anywhere. > > In the case where the TMPDIR env variable isn't specified, what is the >

Re: [OMPI users] mpirun hanging followup

2007-07-18 Thread Ralph H Castain
Hooray! Glad we could help track this down - sorry it was so hard to do so. To answer your questions: 1. Yes - ORTE should bail out gracefully. It definitely should not hang. I will log the problem and investigate. I believe I know where the problem lies, and it may already be fixed on our trunk,

Re: [OMPI users] OpenMPI start up problems

2007-07-19 Thread Ralph H Castain
I gather you are running under TM since you have a PBS_NODEFILE? If so, in 1.2 we setup to read that file directly - you cannot specify it on the command line. We will fix this in 1.3 so you can do both, but for now - under TM - you have to leave that "-machinefile $PBS_NODEFILE" off of the comman

Re: [OMPI users] orterun --bynode/--byslot problem

2007-07-23 Thread Ralph H Castain
No, byslot appears to be working just fine on our bproc clusters (it is the default mode). As you probably know, bproc is a little strange in how we launch - we have to launch the procs in "waves" that correspond to the number of procs on a node. In other words, the first "wave" launches a proc on

Re: [OMPI users] orterun --bynode/--byslot problem

2007-07-23 Thread Ralph H Castain
Yes...it would indeed. On 7/23/07 9:03 AM, "Kelley, Sean" wrote: > Would this logic be in the bproc pls component? > Sean > > > From: users-boun...@open-mpi.org on behalf of Ralph H Castain > Sent: Mon 7/23/2007 9:18 AM > To: Open MPI Users > Subject:

Re: [OMPI users] mpi daemon

2007-08-02 Thread Ralph H Castain
The daemon's name is "orted" - one will be launched on each remote node as the application is started, but they only live for as long as the application is executing. Then they go away. On 8/2/07 12:47 PM, "Reuti" wrote: > Am 02.08.2007 um 18:32 schrieb Francesco Pietra: > >> I compiled succes

Re: [OMPI users] memory leaks on solaris

2007-08-06 Thread Ralph H Castain
On 8/5/07 6:35 PM, "Glenn Carver" wrote: > I'd appreciate some advice and help on this one. We're having > serious problems running parallel applications on our cluster. After > each batch job finishes, we lose a certain amount of available > memory. Additional jobs cause free memory to grad

Re: [OMPI users] memory leaks on solaris

2007-08-06 Thread Ralph H Castain
ld be curious if this helps. > > -DON > p.s. orte-clean does not exist in the ompi v1.2 branch, it is in the > trunk but I think there is an issue with it currently > > Ralph H Castain wrote: > >> >> On 8/5/07 6:35 PM, "Glenn Carver" wrote: >>

Re: [OMPI users] Circumvent --host or dynamically read host info?

2007-08-30 Thread Ralph H Castain
I take it you are running in an rsh/ssh environment (as opposed to a managed environment like SLURM)? I'm afraid that you have to tell us -all- of the nodes that will be utilized in your job at the beginning (i.e., to mpirun). This requirement is planned to be relaxed in a later version, but that

Re: [OMPI users] Job does not quit even when the simulation dies

2007-11-07 Thread Ralph H Castain
As Jeff indicated, the degree of capability has improved over time - I'm not sure which version this represents. The type of failure also plays a major role in our ability to respond. If a process actually segfaults or dies, we usually pick that up pretty well and abort the rest of the job (certai

Re: [OMPI users] mca_oob_tcp_peer_try_connect problem

2007-11-29 Thread Ralph H Castain
Hi Bob I'm afraid the person most familiar with the oob subsystem recently left the project, so we are somewhat hampered at the moment. I don't recognize the "Software caused connection abort" error message - it doesn't appear to be one of ours (at least, I couldn't find it anywhere in our code ba

Re: [OMPI users] Q: Problems launching MPMD applications? ('mca_oob_tcp_peer_try_connect' error 103)

2007-12-06 Thread Ralph H Castain
On 12/5/07 8:47 AM, "Brian Dobbins" wrote: > Hi Josh, > >> I believe the problem is that you are only applying the MCA >> parameters to the first app context instead of all of them: > > Thank you very much.. applying the parameters with -gmca works fine with the > test case (and I'll try t

Re: [OMPI users] ORTE_ERROR_LOG: Data unpack had inadequate space in file gpr_replica_cmd_processor.c at line 361

2007-12-14 Thread Ralph H Castain
Hi Qiang This error message usually indicates that you have more than one Open MPI installation around, and that the backend nodes are picking up a different version than mpirun is using. Check to make sure that you have a consistent version across all the nodes. I also noted you were building wi

Re: [OMPI users] ORTE_ERROR_LOG: Data unpack had inadequate space in file gpr_replica_cmd_processor.c at line 361

2007-12-14 Thread Ralph H Castain
er > Moffett Field, CA 94035-1000 > > Fax: 415-604-3957 > > > If I try to use multiple nodes, I got the error messages: > ORTE_ERROR_LOG: Data unpack had inadequate space in file dss/dss_unpack.c at > line 90 > ORTE_ERROR_LOG: Data unpack had inadequate space in file > gpr_replica

Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration

2007-12-17 Thread Ralph H Castain
On 12/12/07 5:46 AM, "Elena Zhebel" wrote: > > > Hello, > > > > I'm working on a MPI application where I'm using OpenMPI instead of MPICH. > > In my "master" program I call the function MPI::Intracomm::Spawn which spawns > "slave" processes. It is not clear for me how to spawn the

Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration

2007-12-17 Thread Ralph H Castain
e available in a future release - TBD. Hope that helps Ralph > > Thanks and regards, > Elena > > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Ralph H Castain > Sent: Monday, December 17, 2007 3:31 PM

Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration

2007-12-18 Thread Ralph H Castain
hing should work the same. Just as an FYI: the name of that environmental variable is going to change in the 1.3 release, but everything will still work the same. Hope that helps Ralph > > Thanks and regards, > Elena > > > -Original Message- > From: Ralph H Castain [mai

Re: [OMPI users] Torque and OpenMPI 1.2

2007-12-18 Thread Ralph H Castain
Hate to be a party-pooper, but the answer is "no" in OpenMPI 1.2. We don't allow the use of a hostfile in a Torque environment in that version. We have changed this for v1.3, but you'll have to wait for that release. Sorry Ralph On 12/18/07 11:12 AM, "pat.o'bry...@exxonmobil.com" wrote: > Ti

Re: [OMPI users] Torque and OpenMPI 1.2

2007-12-19 Thread Ralph H Castain
>>>>> "--without-tm", >>>>> the OpenMPI 1.2.4 build allows the use of "-hostfile". >>>> Apparently, by >>>>> default, OpenMPI 1.2.4 will incorporate Torque if it >>>> exists, so it is >>>>> necessary to specifically request "no Torque support". I >>>>&

Re: [OMPI users] Torque and OpenMPI 1.2

2007-12-19 Thread Ralph H Castain
he same > restrictions you list below? > Pat > > J.W. (Pat) O'Bryant,Jr. > Business Line Infrastructure > Technical Systems, HPC > Office: 713-431-7022 > > > > > Ralph H > Castain > >

Re: [OMPI users] Torque and OpenMPI 1.2

2007-12-19 Thread Ralph H Castain
845, FAX: 505-284-2518 > >> -Original Message- >> From: users-boun...@open-mpi.org >> [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph H Castain >> Sent: Wednesday, December 19, 2007 2:35 PM >> To: Open MPI Users ; pat.o'bry...@exxonmobil.com &g

Re: [OMPI users] mpirun: specify multiple install prefixes

2007-12-20 Thread Ralph H Castain
I'm afraid not - nor is it in the plans for 1.3 either. I'm afraid it fell through the cracks as the needs inside the developer community moved into other channels. I'll raise the question internally and see if people feel we should do this. It wouldn't be hard to put it into 1.3 at this point, bu

Re: [OMPI users] Torque and OpenMPI 1.2

2007-12-20 Thread Ralph H Castain
on >> Sandia National Laboratories >> P.O. Box 5800, Mail Stop 1318 >> Albuquerque, NM 87185-1318 >> Voice: 505-284-8845, FAX: 505-284-2518 >> >>> -Original Message- >>> From: users-boun...@open-mpi.org >>> [mailto:users-boun...@open

Re: [OMPI users] orte in persistent mode

2008-01-02 Thread Ralph H Castain
Hi Neeraj No, we still don't support having a persistent set of daemons acting as some kind of "virtual machine" like LAM/MPI did. We at one time had talked about adding it. However, our most recent efforts have actually taken us away from supporting that mode of operation. As a result, I very muc

Re: [OMPI users] Need explanation for the following ORTE error message

2008-01-23 Thread Ralph H Castain
On 1/23/08 8:26 AM, "David Gunter" wrote: > A user of one of our OMPI 1.2.3 builds encountered the following error > message during an MPI job run: > > ORTE_ERROR_LOG: File read failure in file > util/universe_setup_file_io.c at line 123 It means that at some point in the past, an mpirun att

[OMPI users] FW: problems with hostfile when doing MPMD

2008-04-14 Thread Ralph H Castain
Hi Jody I believe this was intended for the Users mailing list, so I'm sending the reply there. We do plan to provide more explanation on these in the 1.3 release - believe me, you are not alone in puzzling over all the configuration params! Many of us in the developer community also sometimes wo

Re: [OMPI users] Proper use of sigaction in Open MPI?

2008-04-24 Thread Ralph H Castain
I have never tested this before, so I could be wrong. However, my best guess is that the following is happening: 1. you trap the signal and do your cleanup. However, when your proc now exits, it does not exit with a status of "terminated-by-signal". Instead, it exits normally. 2. the local daemon
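
A common way to keep the "terminated-by-signal" exit status that the daemon looks for, while still doing cleanup, is to restore the default action and re-raise the signal from the handler. A minimal C sketch (illustrative only, not code from this thread):

    #include <signal.h>
    #include <unistd.h>

    static void cleanup_handler(int sig)
    {
        /* application-specific cleanup would go here */

        /* restore the default disposition and re-raise so the process
           really does exit "terminated by signal" rather than normally */
        signal(sig, SIG_DFL);
        raise(sig);
    }

    int main(void)
    {
        struct sigaction sa;
        sa.sa_handler = cleanup_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGTERM, &sa, NULL);

        pause();  /* wait here until a signal arrives */
        return 0;
    }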

Re: [OMPI users] specifying hosts in mpi_spawn()

2008-05-30 Thread Ralph H Castain
I'm afraid I cannot answer that question without first knowing what version of Open MPI you are using. Could you provide that info? Thanks Ralph On 5/29/08 6:41 PM, "Bruno Coutinho" wrote: > How mpi handles the host string passed in the info argument to > mpi_comm_spawn() ? > > if I set host
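
For reference, the MPI standard reserves a "host" key for the info argument to MPI_Comm_spawn, though how an implementation interprets the value (a single host vs. a list) is implementation-dependent. A minimal C sketch, assuming a hypothetical worker executable "./worker" and placeholder hostname "node1":

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm intercomm;
        MPI_Info info;
        int errcodes[2];

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        /* "node1" is a placeholder hostname for illustration */
        MPI_Info_set(info, "host", "node1");

        /* spawn 2 copies of a hypothetical ./worker program */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, info,
                       0, MPI_COMM_SELF, &intercomm, errcodes);

        MPI_Info_free(&info);
        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }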

Re: [OMPI users] specifying hosts in mpi_spawn()

2008-06-02 Thread Ralph H Castain
ssary. > > > 2008/5/30 Ralph H Castain : >> I'm afraid I cannot answer that question without first knowing what version >> of Open MPI you are using. Could you provide that info? >> >> Thanks >> Ralph >> >> >> >> On 5/29/08

Re: [OMPI users] Application Context and OpenMPI 1.2.4

2008-06-17 Thread Ralph H Castain
Hi Pat A friendly elf forwarded this to me, so please be sure to explicitly include me on any reply. Was that the only error message you received? I would have expected a trail of "error_log" outputs that would help me understand where this came from. If not, I can give you some debug flags to se

Re: [OMPI users] SLURM and OpenMPI

2008-06-17 Thread Ralph H Castain
I can believe 1.2.x has problems in that regard. Some of that has nothing to do with slurm and reflects internal issues with 1.2. We have made it much more resistant to those problems in the upcoming 1.3 release, but there is no plan to retrofit those changes to 1.2. Part of the problem was that w

Re: [OMPI users] SLURM and OpenMPI

2008-06-19 Thread Ralph H Castain
it more effectively. For example if slurm > detects a node has failed, it will stop the job, allocate an additional > free node to make up the deficit, then relaunch. It more difficult (to > put it mildly) for a job launcher to do that. > > Thanks again, > Federico > > -Origi

Re: [OMPI users] null characters in output

2008-06-19 Thread Ralph H Castain
ase, but > unfortunately have not found one that is deterministic so far. > > Thanks, > Federico > > -Original Message- > From: Ralph H Castain [mailto:r...@lanl.gov] > Sent: Tuesday, June 17, 2008 1:09 PM > To: Sacerdoti, Federico; Open MPI Users > Sub

Re: [OMPI users] Displaying Selected MCA Modules

2008-06-23 Thread Ralph H Castain
I can guarantee bproc support isn't broken in 1.2 - we use it on several production machines every day, and it works fine. I heard of only one potential problem having to do with specifying multiple app_contexts on a cmd line, but we are still trying to confirm that it wasn't operator error. In th

Re: [OMPI users] mca parameters: meaning and use

2008-06-26 Thread Ralph H Castain
Actually, I suspect the requestor was hoping for an explanation somewhat more illuminating than the terse comments output by ompi_info. ;-) Bottom line is "no". We have talked numerous times about the need to do this, but unfortunately there has been little accomplished. I doubt it will happen any

Re: [OMPI users] Need some help regarding Linpack execution

2008-07-02 Thread Ralph H Castain
You also might want to resend this to the MPICH mailing list - this is the Open MPI mailing list ;-) On 7/2/08 8:03 AM, "Swamy Kandadai" wrote: > Hi: > Maybe you do not have 12 entries in your machine.list file. You need to have > at least np lines in your machine.list > > Dr. Swamy N. Kandad

Re: [OMPI users] mpirun w/ enable-mpi-threads spinning up cputime when app path is invalid

2008-07-02 Thread Ralph H Castain
Out of curiosity - what version of OMPI are you using? On 7/2/08 10:46 AM, "Steve Johnson" wrote: > If mpirun is given an application that isn't in the PATH, then instead of > exiting it prints the error that it failed to find the executable and then > proceeds to spin up cpu time. strace shows

Re: [OMPI users] mpirun w/ enable-mpi-threads spinning up cputime when app path is invalid

2008-07-02 Thread Ralph H Castain
Sorry - went to one of your links to get that info. We know OMPI 1.2.x isn't thread safe. This is unfortunately another example of it. Hopefully, 1.3 will be better. Ralph On 7/2/08 11:01 AM, "Ralph H Castain" wrote: > Out of curiosity - what version of OMPI are you using?

Re: [OMPI users] ORTE_ERROR_LOG timeout

2008-07-08 Thread Ralph H Castain
Several things are going on here. First, this error message: > mpirun noticed that job rank 1 with PID 9658 on node mac1 exited on signal > 6 (Aborted). > 2 additional processes aborted (not shown) indicates that your application procs are aborting for some reason. The system is then attempting to

Re: [OMPI users] Query regarding OMPI_MCA_ns_nds_vpid env variable

2008-07-11 Thread Ralph H Castain
This variable is only for internal use and has no applicability to a user. Basically, it is used by the local daemon to tell an application process its rank when launched. Note that it disappears in v1.3...so I wouldn't recommend looking for it. Is there something you are trying to do with it? Re

Re: [OMPI users] Query regarding OMPI_MCA_ns_nds_vpid env variable

2008-07-11 Thread Ralph H Castain
On 7/11/08 7:32 AM, "Ashley Pittman" wrote: > On Fri, 2008-07-11 at 07:20 -0600, Ralph H Castain wrote: >> This variable is only for internal use and has no applicability to a user. >> Basically, it is used by the local daemon to tell an application process

Re: [OMPI users] Outputting rank and size for all outputs.

2008-07-11 Thread Ralph H Castain
Adding the ability to tag stdout/err with the process rank is fairly simple. We are going to talk about this next week at a design meeting - we have several different tagging schemes that people have requested, so we want to define a way to meet them all that doesn't create too much ugliness in the

Re: [OMPI users] Outputting rank and size for all outputs.

2008-07-11 Thread Ralph H Castain
ly a little > nicer than my current setup. > > -Mark > > > On Jul 11, 2008, at 9:46 AM, Ralph H Castain wrote: > >> Adding the ability to tag stdout/err with the process rank is fairly >> simple. >> We are going to talk about this next week at a design

Re: [OMPI users] Query regarding OMPI_MCA_ns_nds_vpid env variable

2008-07-11 Thread Ralph H Castain
On 7/11/08 7:50 AM, "Ashley Pittman" wrote: > On Fri, 2008-07-11 at 07:42 -0600, Ralph H Castain wrote: >> >> >> On 7/11/08 7:32 AM, "Ashley Pittman" >> wrote: >> >>> On Fri, 2008-07-11 at 07:20 -0600, Ralph H Castain

Re: [OMPI users] Query regarding OMPI_MCA_ns_nds_vpid env variable

2008-07-11 Thread Ralph H Castain
On 7/11/08 8:33 AM, "Ashley Pittman" wrote: > On Fri, 2008-07-11 at 08:01 -0600, Ralph H Castain wrote: >>>> I believe this is partly what motivated the creation of the MPI envars - to >>>> create a vehicle that -would- be guaranteed stable for just thes
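
The "MPI envars" referred to here are the per-process variables Open MPI began providing in the 1.3 series (e.g., OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_SIZE, OMPI_COMM_WORLD_LOCAL_RANK); unlike the internal ns_nds_vpid parameter, these are intended to be stable for users. A minimal C sketch of reading them, even before MPI_Init (the exact names should be verified against your installed version's documentation):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *rank  = getenv("OMPI_COMM_WORLD_RANK");
        const char *size  = getenv("OMPI_COMM_WORLD_SIZE");
        const char *lrank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");

        printf("rank=%s size=%s local_rank=%s\n",
               rank  ? rank  : "unset",
               size  ? size  : "unset",
               lrank ? lrank : "unset");
        return 0;
    }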

Re: [OMPI users] hwloc, OpenMPI and unsupported OSes and toolchains

2018-03-21 Thread Ralph H Castain
I don’t see how Open MPI can operate without pthreads > On Mar 19, 2018, at 3:23 PM, Gregory (tim) Kelly wrote: > > Hello Everyone, > I'm inquiring to find someone that can answer some multi-part questions about > hwloc, OpenMPI and an alternative OS and toolchain. I have a project as part >

Re: [OMPI users] Comm_connect: Data unpack would read past end of buffer

2018-08-03 Thread Ralph H Castain
The buffer being overrun isn’t anything to do with you - it’s an internal buffer used as part of creating the connections. It indicates a problem in OMPI. The 1.10 series is out of the support window, but if you want to stick with it you should at least update to the last release in that series

Re: [OMPI users] Settings oversubscribe as default?

2018-08-03 Thread Ralph H Castain
The equivalent MCA param is rmaps_base_oversubscribe=1. You can add OMPI_MCA_rmaps_base_oversubscribe to your environ, or set rmaps_base_oversubscribe in your default MCA param file. > On Aug 3, 2018, at 1:24 AM, Florian Lindner wrote: > > Hello, > > I can use --oversubscribe to enable overs

Re: [OMPI users] local communicator and crash of the code

2018-08-03 Thread Ralph H Castain
Those two command lines look exactly the same to me - what am I missing? > On Aug 3, 2018, at 10:23 AM, Diego Avesani wrote: > > Dear all, > > I am experiencing a strange error. > > In my code I use three group communications: > MPI_COMM_WORLD > MPI_MASTERS_COMM > LOCAL_COMM > > which have i

Re: [OMPI users] cannot run openmpi 2.1

2018-08-11 Thread Ralph H Castain
Put "oob=^usock” in your default mca param file, or add OMPI_MCA_oob=^usock to your environment > On Aug 11, 2018, at 5:54 AM, Kapetanakis Giannis > wrote: > > Hi, > > I'm struggling to get 2.1.x to work with our HPC. > > Version 1.8.8 and 3.x works fine. > > In 2.1.3 and 2.1.4 I get errors

Re: [OMPI users] What happened to orte-submit resp. DVM?

2018-08-28 Thread Ralph H Castain
You must have some stale code because those tools no longer exist. Note that we are (gradually) replacing orte-dvm with PRRTE: https://github.com/pmix/prrte See the “how-to” guides for PRRTE towards the bottom of this page: https://pmix.org/support/how-to/

Re: [OMPI users] What happened to orte-submit resp. DVM?

2018-08-29 Thread Ralph H Castain
> On Aug 29, 2018, at 1:59 AM, Reuti wrote: > >> >> On 29.08.2018 at 04:46, Ralph H Castain wrote: >> You must have some stale code because those tools no longer exist. > > Aha. This code is then by accident still

Re: [OMPI users] stdout/stderr question

2018-09-10 Thread Ralph H Castain
I’m not sure why this would be happening. These error outputs go through the “show_help” functionality, and we specifically target it at stderr: /* create an output stream for us */ OBJ_CONSTRUCT(&lds, opal_output_stream_t); lds.lds_want_stderr = true; orte_help_output = opal_outp

Re: [OMPI users] stdout/stderr question

2018-09-10 Thread Ralph H Castain
job to be terminated. The first process to do so was: >>> >>> Process name: [[22380,1],0] >>> Exit code:255 >>> -- >>> $ cat stdout >>> hello from 0 >>&

Re: [OMPI users] No network interfaces were found for out-of-band communications.

2018-09-12 Thread Ralph H Castain
What OMPI version are we talking about here? > On Sep 11, 2018, at 6:56 PM, Greg Russell wrote: > > I have a single machine w 96 cores. It runs CentOS7 and is not connected to > any network as it needs to be isolated for security. > > I attempted the standard install process and upon attempting

Re: [OMPI users] No network interfaces were found for out-of-band communications.

2018-09-12 Thread Ralph H Castain
ers > wrote: > > Can you send all the information listed here: > >https://www.open-mpi.org/community/help/ > > > >> On Sep 12, 2018, at 11:03 AM, Greg Russell wrote: >> >> OpenMPI-3.1.2 >> >> Sent from my iPhone >> >> On

Re: [OMPI users] Unable to spawn MPI processes on multiple nodes with recent version of OpenMPI

2018-09-16 Thread Ralph H Castain
I see you are using “preconnect_all” - that is the source of the trouble. I don’t believe we have tested that option in years and the code is almost certainly dead. I’d suggest removing that option and things should work. > On Sep 15, 2018, at 1:46 PM, Andrew Benson wrote: > > I'm running int

Re: [OMPI users] mpirun noticed that process rank 5 with PID 0 on node localhost exited on signal 9 (Killed).

2018-09-28 Thread Ralph H Castain
Ummm…looks like you have a problem in your input deck to that application. Not sure what we can say about it… > On Sep 28, 2018, at 9:47 AM, Zeinab Salah wrote: > > Hi everyone, > I use openmpi-3.0.2 and I want to run chimere model with 8 processors, but in > the step of parallel mode, the ru

Re: [OMPI users] opal_pmix_base_select failed for master and 4.0.0

2018-10-02 Thread Ralph H Castain
Looks like PMIx failed to build - can you send the config.log? > On Oct 2, 2018, at 12:00 AM, Siegmar Gross > wrote: > > Hi, > > yesterday I've installed openmpi-v4.0.x-201809290241-a7e275c and > openmpi-master-201805080348-b39bbfb on my "SUSE Linux Enterprise Server > 12.3 (x86_64)" with Sun

Re: [OMPI users] opal_pmix_base_select failed for master and 4.0.0

2018-10-02 Thread Ralph H Castain
it, but perhaps something is different about this environment. > On Oct 2, 2018, at 6:36 AM, Ralph H Castain wrote: > > Looks like PMIx failed to build - can you send the config.log? > >> On Oct 2, 2018, at 12:00 AM, Siegmar Gross >> wrote: >> >> Hi, >

Re: [OMPI users] opal_pmix_base_select failed for master and 4.0.0

2018-10-02 Thread Ralph H Castain
libc.so.6 (GLIBC_2.8) => /lib64/libc.so.6 >libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6 >libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6 >libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6 >/lib64/libresolv.so.2: >libc.so.6 (GLIBC_2.14) =

Re: [OMPI users] opal_pmix_base_select failed for master and 4.0.0

2018-10-03 Thread Ralph H Castain
; It looks like Siegmar passed --with-hwloc=internal. > > Open MPI's configure understood this and did the appropriate things. > PMIX's configure didn't. > > I think we need to add an adjustment into the PMIx configure.m4 in OMPI... > > >> On Oct 2, 201

Re: [OMPI users] Cannot run MPI code on multiple cores with PBS

2018-10-03 Thread Ralph H Castain
Did you configure OMPI --with-tm=? It looks like we didn’t build PBS support and so we only see one node with a single slot allocated to it. > On Oct 3, 2018, at 12:02 PM, Castellana Michele > wrote: > Dear all, > I am having trouble running an MPI code across multiple cores on a new > com

Re: [OMPI users] Cannot run MPI code on multiple cores with PBS

2018-10-03 Thread Ralph H Castain
Actually, I see that you do have the tm components built, but they cannot be loaded because you are missing libcrypto from your LD_LIBRARY_PATH > On Oct 3, 2018, at 12:33 PM, Ralph H Castain wrote: > Did you configure OMPI --with-tm=? It looks like we didn’t > build PBS suppo

Re: [OMPI users] opal_pmix_base_select failed for master and 4.0.0

2018-10-05 Thread Ralph H Castain
> > On 10/3/18 8:14 PM, Ralph H Castain wrote: >> Jeff and I talked and believe the patch in >> https://github.com/open-mpi/ompi/pull/5836 >> <https://github.com/open-mpi/ompi/pull/5836> should fix the problem. > > > Today I've installed openmpi-ma

Re: [OMPI users] Unable to spawn MPI processes on multiple nodes with recent version of OpenMPI

2018-10-06 Thread Ralph H Castain
*and potentially your MPI job) > > I've tried increasing both pmix_server_max_wait and > pmix_base_exchange_timeout > as suggested in the error message, but the result is unchanged (it just takes > longer to time out). > > Once again, if I remove "--map-by no

Re: [OMPI users] Unable to spawn MPI processes on multiple nodes with recent version of OpenMPI

2018-10-06 Thread Ralph H Castain
tes.google.com/site/galacticusmodel> > On Sat, Oct 6, 2018, 9:02 AM Ralph H Castain <mailto:r...@open-mpi.org>> wrote: > Sorry for delay - this should be fixed by > https://github.com/open-mpi/ompi/pull/5854 > <https://github.com/open-mpi/ompi/pull/5854> > > &

Re: [OMPI users] issue compiling openmpi 3.2.1 with pmi and slurm

2018-10-10 Thread Ralph H Castain
It appears that the CPPFLAGS isn’t getting set correctly as the component didn’t find the Slurm PMI-1 header file. Perhaps it would help if we saw the config.log output so we can see where OMPI thought the file was located. > On Oct 10, 2018, at 6:44 AM, Ross, Daniel B. via users > wrote: >

Re: [OMPI users] issue compiling openmpi 3.2.1 with pmi and slurm

2018-10-10 Thread Ralph H Castain
A_orte_ess_ALL_SUBDIRS=' mca/ess/env mca/ess/hnp mca/ess/pmi > mca/ess/singleton mca/ess/tool mca/ess/alps mca/ess/lsf mca/ess/slurm > mca/ess/tm' > MCA_orte_ess_DSO_COMPONENTS=' env hnp pmi singleton tool slurm' > MCA_orte_ess_DSO_SUBDIRS=' mca/ess/env

Re: [OMPI users] opal_pmix_base_select failed for master and 4.0.0

2018-10-12 Thread Ralph H Castain
; https://github.com/open-mpi/ompi/pull/5846 >> Both of these should be in tonight's nightly snapshot. >> Thank you! >>> On Oct 5, 2018, at 5:45 AM, Ralph H Castain wrote: >>> >>> Please send Jeff and I the opal/mca/pmix/pmix4x/pmix/config.log

[OMPI users] SC'18 PMIx BoF meeting

2018-10-15 Thread Ralph H Castain
Hello all [I’m sharing this on the OMPI mailing lists (as well as the PMIx one) as PMIx has become tightly integrated to the OMPI code since v2.0 was released] The PMIx Community will once again be hosting a Birds-of-a-Feather meeting at SuperComputing. This year, however, will be a little diff

Re: [OMPI users] Bug with Open-MPI Processor Count

2018-11-01 Thread Ralph H Castain
Set rmaps_base_verbose=10 for debugging output Sent from my iPhone > On Nov 1, 2018, at 9:31 AM, Adam LeBlanc wrote: > > The version by the way for Open-MPI is 3.1.2. > > -Adam LeBlanc > >> On Thu, Nov 1, 2018 at 12:05 PM Adam LeBlanc wrote: >> Hello, >> >> I am an employee of the UNH Inte

Re: [OMPI users] Bug with Open-MPI Processor Count

2018-11-01 Thread Ralph H Castain
> ------ Forwarded message - > From: Ralph H Castain mailto:r...@open-mpi.org>> > Date: Thu, Nov 1, 2018 at 1:07 PM > Subject: Re: [OMPI users] Bug with Open-MPI Processor Count > To: Open MPI Users <mailto:users@lists.open-mpi.org>> > > > Set r

Re: [OMPI users] Bug with Open-MPI Processor Count

2018-11-01 Thread Ralph H Castain
; = > > > Yes the hostfile is available on all nodes through an NFS mount for all of > our home directories. > >> On Thu, Nov 1, 2018 at 2:44 PM Adam LeBlanc wrote: >> >> >> -- Forw

Re: [OMPI users] OMPI 3.1.x, PMIx, SLURM, and mpiexec/mpirun

2018-11-12 Thread Ralph H Castain
mpirun should definitely still work in parallel with srun - they aren’t mutually exclusive. OMPI 3.1.2 contains PMIx v2.1.3. The problem here is that you built Slurm against PMIx v2.0.2, which is not cross-version capable. You can see the cross-version situation here: https://pmix.org/support/f

Re: [OMPI users] OpenMPI2 + slurm

2018-11-23 Thread Ralph H Castain
Couple of comments. Your original cmd line: >> srun -n 2 mpirun MPI-hellow tells srun to launch two copies of mpirun, each of which is to run as many processes as there are slots assigned to the allocation. srun will get an allocation of two slots, and so you’ll get two concurrent MPI jobs, e

Re: [OMPI users] Issue with MPI_Init in MPI_Comm_Spawn

2018-11-29 Thread Ralph H Castain
I ran a simple spawn test - you can find it in the OMPI code at orte/test/mpi/simple_spawn.c - and it worked fine: $ mpirun -n 2 ./simple_spawn [1858076673:0 pid 19909] starting up on node Ralphs-iMac-2.local! [1858076673:1 pid 19910] starting up on node Ralphs-iMac-2.local! 1 completed MPI_Init P

Re: [OMPI users] singularity support

2018-12-12 Thread Ralph H Castain
FWIW: we also automatically detect that the application is a singularity container and do the right stuff > On Dec 12, 2018, at 12:25 AM, Gilles Gouaillardet wrote: > > My understanding is that MPI tasks will be launched inside a singularity > container. > > In a typical environment, mpirun

[OMPI users] open-mpi.org is DOWN

2018-12-22 Thread Ralph H Castain
Hello all Apologies to everyone, but I received an alert this morning that malware has been detected on the www.open-mpi.org site. I have tried to contact the hosting agency and the security scanners, but nobody is around on this pre-holiday weekend. Accordingly, I have taken the site OFFLINE f

Re: [OMPI users] open-mpi.org is DOWN

2018-12-23 Thread Ralph H Castain
The security scanner has apologized for a false positive and fixed their system - the site has been restored. Ralph > On Dec 22, 2018, at 12:12 PM, Ralph H Castain wrote: > > Hello all > > Apologies to everyone, but I received an alert this moring that malware has > be

Re: [OMPI users] Suppress mpirun exit error chatter

2019-01-06 Thread Ralph H Castain
Afraid not. What it says is actually accurate - it didn’t say the application called “abort”. It says that the job was aborted. There is a very different message when the application itself calls MPI_Abort. > On Jan 6, 2019, at 1:19 PM, Jeff Wentworth via users > wrote: > Hi everyone, >

Re: [OMPI users] Fwd: pmix and srun

2019-01-18 Thread Ralph H Castain
Looks strange. I’m pretty sure Mellanox didn’t implement the event notification system in the Slurm plugin, but you should only be trying to call it if OMPI is registering a system-level event code - which OMPI 3.1 definitely doesn’t do. If you are using PMIx v2.2.0, then please note that there

Re: [OMPI users] Fwd: pmix and srun

2019-01-18 Thread Ralph H Castain
-user@labhead slurm]$ git branch > * (detached from origin/slurm-18.08) > master > [ec2-user@labhead slurm]$ cd ../ompi/ > [ec2-user@labhead ompi]$ git branch > * (detached from origin/v3.1.x) > master > > > attached is the debug out from the run with the debugging turne

Re: [OMPI users] pmix and srun

2019-01-18 Thread Ralph H Castain
I have pushed a fix to the v2.2 branch - could you please confirm it? > On Jan 18, 2019, at 2:23 PM, Ralph H Castain wrote: > > Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm plugin > folks seem to be off somewhere for awhile and haven’t been testing it. Sig

Re: [OMPI users] pmix and srun

2019-01-18 Thread Ralph H Castain
Good - thanks! > On Jan 18, 2019, at 3:25 PM, Michael Di Domenico > wrote: > > seems to be better now. jobs are running > > On Fri, Jan 18, 2019 at 6:17 PM Ralph H Castain wrote: >> >> I have pushed a fix to the v2.2 branch - could you please confirm it? >

Re: [OMPI users] Open MPI installation problem

2019-01-23 Thread Ralph H Castain
Your PATH and LD_LIBRARY_PATH setting is incorrect. You installed OMPI into $HOME/openmpi, so you should have done: PATH=$HOME/openmpi/bin:$PATH LD_LIBRARY_PATH=$HOME/openmpi/lib:$LD_LIBRARY_PATH Ralph > On Jan 23, 2019, at 6:36 AM, Serdar Hiçdurmaz > wrote: > > Hi All, > > I try to instal

Re: [OMPI users] Building PMIx and Slurm support

2019-03-04 Thread Ralph H Castain
> On Mar 4, 2019, at 5:34 AM, Daniel Letai wrote: > > Gilles, > On 3/4/19 8:28 AM, Gilles Gouaillardet wrote: >> Daniel, >> >> >> On 3/4/2019 3:18 PM, Daniel Letai wrote: >>> So unless you have a specific reason not to mix both, you might also give the internal PMIx a try. >>>

Re: [OMPI users] IRC/Discord?

2019-03-05 Thread Ralph H Castain
Not IRC or discord, but we do make significant use of Slack: open-mpi.slack.com > On Mar 5, 2019, at 8:04 AM, George Marselis > wrote: > > Hey guys, > > Sorry to bother you. I was wondering if there is an IRC or discord channel > for this mailing list. > > (there is an IRC channel on Free

Re: [OMPI users] IRC/Discord?

2019-03-06 Thread Ralph H Castain
n invitation? > > > Best Regards, > > > George Marselis > > ____ > From: users on behalf of Ralph H Castain > > Sent: Tuesday, March 5, 2019 5:12 PM > To: Open MPI Users > Subject: Re: [OMPI users] IRC/Discord? > &

Re: [OMPI users] local rank to rank comms

2019-03-11 Thread Ralph H Castain
You are probably using the ofi mtl - could be psm2 uses loopback method? Sent from my iPhone > On Mar 11, 2019, at 8:40 AM, Michael Di Domenico > wrote: > i have a user that's claiming when two ranks on the same node want to > talk with each other, they're using the NIC to talk rather than j

Re: [OMPI users] local rank to rank comms

2019-03-11 Thread Ralph H Castain
OFI uses libpsm2 underneath it when omnipath detected Sent from my iPhone > On Mar 11, 2019, at 9:06 AM, Gilles Gouaillardet > wrote: > > Michael, > > You can > > mpirun --mca pml_base_verbose 10 --mca btl_base_verbose 10 --mca > mtl_base_verbose 10 ... > > It might show that pml/cm and m

Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

2019-03-16 Thread Ralph H Castain
FWIW: I just ran a cycle of 10,000 spawns on my Mac without a problem using OMPI master, so I believe this has been resolved. I don’t know if/when the required updates might come into the various release branches. Ralph > On Mar 16, 2019, at 1:13 PM, Thomas Pak wrote: > > Dear Jeff, > > I d

Re: [OMPI users] Low performance of Open MPI-1.3 over Gigabit

2009-03-04 Thread Ralph H. Castain
It would also help to have some idea how you installed and ran this - e.g., did you set mpi_paffinity_alone so that the processes would bind to their processors? That could explain the cpu vs. elapsed time since it helps keep the processes from being swapped out as much. Ralph > Your Intel processors

Re: [OMPI users] MPI_Comm_spawn error messages

2006-07-06 Thread Ralph H Castain
Hi Saadat Could you tell us something more about the system you are using? What type of processors, operating system, any resource manager (e.g., SLURM, PBS), etc? Thanks Ralph On 7/6/06 10:49 AM, "s anwar" wrote: > Good Day: > > I am getting the following error messages every time I run a
