Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

2019-03-16 Thread Ralph H Castain
FWIW: I just ran a cycle of 10,000 spawns on my Mac without a problem using OMPI master, so I believe this has been resolved. I don’t know if/when the required updates might come into the various release branches. Ralph > On Mar 16, 2019, at 1:13 PM, Thomas Pak wrote: > > Dear Jeff, > > I
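A stress loop of this sort can be approximated from the shell; a minimal sketch, assuming a spawn test binary (e.g. one built from OMPI's orte/test/mpi/simple_spawn.c) named ./simple_spawn — the binary name is a placeholder:

```shell
# Hypothetical spawn stress test: repeat until a cycle fails.
# ./simple_spawn is an assumed binary name, not part of any install.
for i in $(seq 1 10000); do
    mpirun -n 1 ./simple_spawn > /dev/null 2>&1 || { echo "failed at cycle $i"; break; }
done
```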

Re: [OMPI users] local rank to rank comms

2019-03-11 Thread Ralph H Castain
OFI uses libpsm2 underneath when Omni-Path is detected. Sent from my iPhone > On Mar 11, 2019, at 9:06 AM, Gilles Gouaillardet > wrote: > > Michael, > > You can > > mpirun --mca pml_base_verbose 10 --mca btl_base_verbose 10 --mca > mtl_base_verbose 10 ... > > It might show that pml/cm and

Re: [OMPI users] local rank to rank comms

2019-03-11 Thread Ralph H Castain
You are probably using the ofi mtl - could be psm2 uses a loopback method? Sent from my iPhone > On Mar 11, 2019, at 8:40 AM, Michael Di Domenico > wrote: > > i have a user that's claiming when two ranks on the same node want to > talk with each other, they're using the NIC to talk rather than

Re: [OMPI users] IRC/Discord?

2019-03-06 Thread Ralph H Castain
n invitation? > > > Best Regards, > > > George Marselis > > ____ > From: users on behalf of Ralph H Castain > > Sent: Tuesday, March 5, 2019 5:12 PM > To: Open MPI Users > Subject: Re: [OMPI users] IRC/Discord? >

Re: [OMPI users] IRC/Discord?

2019-03-05 Thread Ralph H Castain
Not IRC or discord, but we do make significant use of Slack: open-mpi.slack.com > On Mar 5, 2019, at 8:04 AM, George Marselis > wrote: > > Hey guys, > > Sorry to bother you. I was wondering if there is an IRC or discord channel > for this mailing list. > > (there is an IRC channel on

Re: [OMPI users] Building PMIx and Slurm support

2019-03-04 Thread Ralph H Castain
> On Mar 4, 2019, at 5:34 AM, Daniel Letai wrote: > > Gilles, > On 3/4/19 8:28 AM, Gilles Gouaillardet wrote: >> Daniel, >> >> >> On 3/4/2019 3:18 PM, Daniel Letai wrote: >>> So unless you have a specific reason not to mix both, you might also give the internal PMIx a try. >>>

Re: [OMPI users] Open MPI installation problem

2019-01-23 Thread Ralph H Castain
Your PATH and LD_LIBRARY_PATH setting is incorrect. You installed OMPI into $HOME/openmpi, so you should have done: PATH=$HOME/openmpi/bin:$PATH LD_LIBRARY_PATH=$HOME/openmpi/lib:$LD_LIBRARY_PATH Ralph > On Jan 23, 2019, at 6:36 AM, Serdar Hiçdurmaz > wrote: > > Hi All, > > I try to
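As a sketch, the corrected settings (assuming a bash-style shell and the $HOME/openmpi install prefix from the post) would be:

```shell
# Prepend the Open MPI install tree to the search paths.
export PATH=$HOME/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi/lib:$LD_LIBRARY_PATH
# After this, `which mpirun` should resolve to $HOME/openmpi/bin/mpirun.
```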

Re: [OMPI users] pmix and srun

2019-01-18 Thread Ralph H Castain
Good - thanks! > On Jan 18, 2019, at 3:25 PM, Michael Di Domenico > wrote: > > seems to be better now. jobs are running > > On Fri, Jan 18, 2019 at 6:17 PM Ralph H Castain wrote: >> >> I have pushed a fix to the v2.2 branch - could you please confirm it? >

Re: [OMPI users] pmix and srun

2019-01-18 Thread Ralph H Castain
I have pushed a fix to the v2.2 branch - could you please confirm it? > On Jan 18, 2019, at 2:23 PM, Ralph H Castain wrote: > > Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm plugin > folks seem to be off somewhere for awhile and haven’t been testing it. Sig

Re: [OMPI users] Fwd: pmix and srun

2019-01-18 Thread Ralph H Castain
lurm]$ git branch > * (detached from origin/slurm-18.08) > master > [ec2-user@labhead slurm]$ cd ../ompi/ > [ec2-user@labhead ompi]$ git branch > * (detached from origin/v3.1.x) > master > > > attached is the debug out from the run with the debugging turned on > >

Re: [OMPI users] Fwd: pmix and srun

2019-01-18 Thread Ralph H Castain
Looks strange. I’m pretty sure Mellanox didn’t implement the event notification system in the Slurm plugin, but you should only be trying to call it if OMPI is registering a system-level event code - which OMPI 3.1 definitely doesn’t do. If you are using PMIx v2.2.0, then please note that there

Re: [OMPI users] Suppress mpirun exit error chatter

2019-01-06 Thread Ralph H Castain
Afraid not. What it says is actually accurate - it didn’t say the application called “abort”. It says that the job was aborted. There is a very different message when the application itself calls MPI_Abort. > On Jan 6, 2019, at 1:19 PM, Jeff Wentworth via users > wrote: > > Hi everyone,

Re: [OMPI users] open-mpi.org is DOWN

2018-12-23 Thread Ralph H Castain
The security scanner has apologized for a false positive and fixed their system - the site has been restored. Ralph > On Dec 22, 2018, at 12:12 PM, Ralph H Castain wrote: > > Hello all > > Apologies to everyone, but I received an alert this morning that malware has

[OMPI users] open-mpi.org is DOWN

2018-12-22 Thread Ralph H Castain
Hello all Apologies to everyone, but I received an alert this morning that malware has been detected on the www.open-mpi.org site. I have tried to contact the hosting agency and the security scanners, but nobody is around on this pre-holiday weekend. Accordingly, I have taken the site OFFLINE

Re: [OMPI users] singularity support

2018-12-12 Thread Ralph H Castain
FWIW: we also automatically detect that the application is a singularity container and do the right stuff > On Dec 12, 2018, at 12:25 AM, Gilles Gouaillardet wrote: > > My understanding is that MPI tasks will be launched inside a singularity > container. > > In a typical environment, mpirun

Re: [OMPI users] Issue with MPI_Init in MPI_Comm_Spawn

2018-11-29 Thread Ralph H Castain
I ran a simple spawn test - you can find it in the OMPI code at orte/test/mpi/simple_spawn.c - and it worked fine: $ mpirun -n 2 ./simple_spawn [1858076673:0 pid 19909] starting up on node Ralphs-iMac-2.local! [1858076673:1 pid 19910] starting up on node Ralphs-iMac-2.local! 1 completed MPI_Init

Re: [OMPI users] OpenMPI2 + slurm

2018-11-23 Thread Ralph H Castain
Couple of comments. Your original cmd line: >> srun -n 2 mpirun MPI-hellow tells srun to launch two copies of mpirun, each of which is to run as many processes as there are slots assigned to the allocation. srun will get an allocation of two slots, and so you’ll get two concurrent MPI jobs,
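A sketch of the two non-conflicting alternatives (assuming the binary is ./MPI-hellow, as in the original command, and Slurm's PMI support is in place):

```shell
# Option 1: direct launch - srun starts the MPI processes itself, no mpirun.
srun -n 2 ./MPI-hellow

# Option 2: get the allocation first, then run a single mpirun inside it.
salloc -n 2 mpirun ./MPI-hellow
```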

Re: [OMPI users] OMPI 3.1.x, PMIx, SLURM, and mpiexec/mpirun

2018-11-12 Thread Ralph H Castain
mpirun should definitely still work in parallel with srun - they aren’t mutually exclusive. OMPI 3.1.2 contains PMIx v2.1.3. The problem here is that you built Slurm against PMIx v2.0.2, which is not cross-version capable. You can see the cross-version situation here:
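Two quick checks that can help diagnose such a version mismatch (both commands are from the stock OMPI and Slurm tool sets; output formats vary by version):

```shell
# What PMIx did this Open MPI build embed or link against?
ompi_info --parsable | grep -i pmix

# Which MPI/PMI plugins does this Slurm installation offer at launch time?
srun --mpi=list
```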

Re: [OMPI users] Bug with Open-MPI Processor Count

2018-11-01 Thread Ralph H Castain
> = > > > Yes the hostfile is available on all nodes through an NFS mount for all of > our home directories. > >> On Thu, Nov 1, 2018 at 2:44 PM Adam LeBlanc wrote: >> >> >> -- Fo

Re: [OMPI users] Bug with Open-MPI Processor Count

2018-11-01 Thread Ralph H Castain
> ------ Forwarded message - > From: Ralph H Castain mailto:r...@open-mpi.org>> > Date: Thu, Nov 1, 2018 at 1:07 PM > Subject: Re: [OMPI users] Bug with Open-MPI Processor Count > To: Open MPI Users <mailto:users@lists.open-mpi.org>> > > > Set r

Re: [OMPI users] Bug with Open-MPI Processor Count

2018-11-01 Thread Ralph H Castain
Set rmaps_base_verbose=10 for debugging output Sent from my iPhone > On Nov 1, 2018, at 9:31 AM, Adam LeBlanc wrote: > > The version by the way for Open-MPI is 3.1.2. > > -Adam LeBlanc > >> On Thu, Nov 1, 2018 at 12:05 PM Adam LeBlanc wrote: >> Hello, >> >> I am an employee of the UNH
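For example (the application name ./my_app is a placeholder):

```shell
# One-off on the command line:
mpirun --mca rmaps_base_verbose 10 -n 4 ./my_app

# Or via the environment, picked up by any subsequent mpirun:
export OMPI_MCA_rmaps_base_verbose=10
```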

[OMPI users] SC'18 PMIx BoF meeting

2018-10-15 Thread Ralph H Castain
Hello all [I’m sharing this on the OMPI mailing lists (as well as the PMIx one) as PMIx has become tightly integrated to the OMPI code since v2.0 was released] The PMIx Community will once again be hosting a Birds-of-a-Feather meeting at SuperComputing. This year, however, will be a little

Re: [OMPI users] opal_pmix_base_select failed for master and 4.0.0

2018-10-12 Thread Ralph H Castain
ithub.com/open-mpi/ompi/pull/5846 >> Both of these should be in tonight's nightly snapshot. >> Thank you! >>> On Oct 5, 2018, at 5:45 AM, Ralph H Castain wrote: >>> >>> Please send Jeff and I the opal/mca/pmix/pmix4x/pmix/config.log again - >>>

Re: [OMPI users] issue compiling openmpi 3.2.1 with pmi and slurm

2018-10-10 Thread Ralph H Castain
='' > opal_pmi1_LDFLAGS='' > opal_pmi1_LIBS='-lpmi' > opal_pmi1_rpath='' > opal_pmi2_CPPFLAGS='' > opal_pmi2_LDFLAGS='' > opal_pmi2_LIBS='-lpmi2' > opal_pmi2_rpath='' > opal_pmix_ext1x_CPPFLAGS='' > opal_pmix_ext1x_LDFLAGS='' > opal_pmix_ext1x_LIBS='' > opal_p

Re: [OMPI users] issue compiling openmpi 3.2.1 with pmi and slurm

2018-10-10 Thread Ralph H Castain
It appears that the CPPFLAGS isn’t getting set correctly as the component didn’t find the Slurm PMI-1 header file. Perhaps it would help if we saw the config.log output so we can see where OMPI thought the file was located. > On Oct 10, 2018, at 6:44 AM, Ross, Daniel B. via users > wrote: >

Re: [OMPI users] Unable to spawn MPI processes on multiple nodes with recent version of OpenMPI

2018-10-06 Thread Ralph H Castain
.com/site/galacticusmodel> > On Sat, Oct 6, 2018, 9:02 AM Ralph H Castain <mailto:r...@open-mpi.org>> wrote: > Sorry for delay - this should be fixed by > https://github.com/open-mpi/ompi/pull/5854 > <https://github.com/open-mpi/ompi/pull/5854> > > > On Sep

Re: [OMPI users] Unable to spawn MPI processes on multiple nodes with recent version of OpenMPI

2018-10-06 Thread Ralph H Castain
*and potentially your MPI job) > > I've tried increasing both pmix_server_max_wait and > pmix_base_exchange_timeout > as suggested in the error message, but the result is unchanged (it just takes > longer to time out). > > Once again, if I remove "--map-by node

Re: [OMPI users] opal_pmix_base_select failed for master and 4.0.0

2018-10-05 Thread Ralph H Castain
> > On 10/3/18 8:14 PM, Ralph H Castain wrote: >> Jeff and I talked and believe the patch in >> https://github.com/open-mpi/ompi/pull/5836 >> <https://github.com/open-mpi/ompi/pull/5836> should fix the problem. > > > Today I've installed openmpi-master-

Re: [OMPI users] Cannot run MPI code on multiple cores with PBS

2018-10-03 Thread Ralph H Castain
Actually, I see that you do have the tm components built, but they cannot be loaded because you are missing libcrypto from your LD_LIBRARY_PATH > On Oct 3, 2018, at 12:33 PM, Ralph H Castain wrote: > > Did you configure OMPI —with-tm=? It looks like we didn’t > build PBS suppo

Re: [OMPI users] Cannot run MPI code on multiple cores with PBS

2018-10-03 Thread Ralph H Castain
Did you configure OMPI —with-tm=? It looks like we didn’t build PBS support and so we only see one node with a single slot allocated to it. > On Oct 3, 2018, at 12:02 PM, Castellana Michele > wrote: > > Dear all, > I am having trouble running an MPI code across multiple cores on a new >
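A configure sketch, assuming a hypothetical Torque/PBS install prefix of /opt/pbs (adjust to your site):

```shell
# Point configure at the Torque/PBS tree so the TM components get built.
./configure --with-tm=/opt/pbs --prefix=$HOME/openmpi
make -j8 && make install

# Verify the TM launch/allocation components are present afterwards.
ompi_info | grep tm
```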

Re: [OMPI users] opal_pmix_base_select failed for master and 4.0.0

2018-10-03 Thread Ralph H Castain
looks like Siegmar passed --with-hwloc=internal. > > Open MPI's configure understood this and did the appropriate things. > PMIX's configure didn't. > > I think we need to add an adjustment into the PMIx configure.m4 in OMPI... > > >> On Oct 2, 2018, at 5:25 PM, Ralph

Re: [OMPI users] opal_pmix_base_select failed for master and 4.0.0

2018-10-02 Thread Ralph H Castain
> /lib64/libc.so.6 >libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6 >libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6 >libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6 >/lib64/libresolv.so.2: > libc.so.6 (GLIBC_2.14) => /lib64/libc.so.6 >

Re: [OMPI users] opal_pmix_base_select failed for master and 4.0.0

2018-10-02 Thread Ralph H Castain
it, but perhaps something is different about this environment. > On Oct 2, 2018, at 6:36 AM, Ralph H Castain wrote: > > Looks like PMIx failed to build - can you send the config.log? > >> On Oct 2, 2018, at 12:00 AM, Siegmar Gross >> wrote: >> >> Hi, >

Re: [OMPI users] opal_pmix_base_select failed for master and 4.0.0

2018-10-02 Thread Ralph H Castain
Looks like PMIx failed to build - can you send the config.log? > On Oct 2, 2018, at 12:00 AM, Siegmar Gross > wrote: > > Hi, > > yesterday I've installed openmpi-v4.0.x-201809290241-a7e275c and > openmpi-master-201805080348-b39bbfb on my "SUSE Linux Enterprise Server > 12.3 (x86_64)" with Sun

Re: [OMPI users] mpirun noticed that process rank 5 with PID 0 on node localhost exited on signal 9 (Killed).

2018-09-28 Thread Ralph H Castain
Ummm…looks like you have a problem in the input deck for that application. Not sure what we can say about it… > On Sep 28, 2018, at 9:47 AM, Zeinab Salah wrote: > > Hi everyone, > I use openmpi-3.0.2 and I want to run chimere model with 8 processors, but in > the step of parallel mode, the

[MTT users] test message

2018-09-25 Thread Ralph H Castain
This is just a test message to ensure the mailing list is active. Ralph

Re: [OMPI users] Unable to spawn MPI processes on multiple nodes with recent version of OpenMPI

2018-09-16 Thread Ralph H Castain
I see you are using “preconnect_all” - that is the source of the trouble. I don’t believe we have tested that option in years and the code is almost certainly dead. I’d suggest removing that option and things should work. > On Sep 15, 2018, at 1:46 PM, Andrew Benson wrote: > > I'm running

Re: [OMPI users] No network interfaces were found for out-of-band communications.

2018-09-12 Thread Ralph H Castain
ers > wrote: > > Can you send all the information listed here: > >https://www.open-mpi.org/community/help/ > > > >> On Sep 12, 2018, at 11:03 AM, Greg Russell wrote: >> >> OpenMPI-3.1.2 >> >> Sent from my iPhone >> >> On

Re: [OMPI users] No network interfaces were found for out-of-band communications.

2018-09-12 Thread Ralph H Castain
What OMPI version are we talking about here? > On Sep 11, 2018, at 6:56 PM, Greg Russell wrote: > > I have a single machine w 96 cores. It runs CentOS7 and is not connected to > any network as it needs to isolated for security. > > I attempted the standard install process and upon

Re: [OMPI users] stdout/stderr question

2018-09-10 Thread Ralph H Castain
job to be terminated. The first process to do so was: >>> >>> Process name: [[22380,1],0] >>> Exit code:255 >>> -- >>> $ cat stdout >>> hello from 0 >&g

Re: [OMPI users] stdout/stderr question

2018-09-10 Thread Ralph H Castain
I’m not sure why this would be happening. These error outputs go through the “show_help” functionality, and we specifically target it at stderr: /* create an output stream for us */ OBJ_CONSTRUCT(&lds, opal_output_stream_t); lds.lds_want_stderr = true; orte_help_output =

Re: [OMPI users] What happened to orte-submit resp. DVM?

2018-08-29 Thread Ralph H Castain
> On Aug 29, 2018, at 1:59 AM, Reuti wrote: > >> >> Am 29.08.2018 um 04:46 schrieb Ralph H Castain > <mailto:r...@open-mpi.org>>: >> You must have some stale code because those tools no longer exist. > > Aha. This code is then by accient still

Re: [OMPI users] What happened to orte-submit resp. DVM?

2018-08-28 Thread Ralph H Castain
You must have some stale code because those tools no longer exist. Note that we are (gradually) replacing orte-dvm with PRRTE: https://github.com/pmix/prrte See the “how-to” guides for PRRTE towards the bottom of this page: https://pmix.org/support/how-to/

[MTT users] Python client requires MTT_HOME

2018-08-14 Thread Ralph H Castain
Hello all During the telecon today, we decided to enforce a requirement in the Python client that MTT_HOME be set in the environment to point at the top of the MTT directory tree. This significantly simplified some code and seemed a reasonable minimum requirement for operation. The commit for

Re: [OMPI users] cannot run openmpi 2.1

2018-08-11 Thread Ralph H Castain
Put "oob=^usock” in your default mca param file, or add OMPI_MCA_oob=^usock to your environment > On Aug 11, 2018, at 5:54 AM, Kapetanakis Giannis > wrote: > > Hi, > > I'm struggling to get 2.1.x to work with our HPC. > > Version 1.8.8 and 3.x works fine. > > In 2.1.3 and 2.1.4 I get
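A sketch of both options (the ~/.openmpi/mca-params.conf location is the conventional per-user default MCA param file):

```shell
# Persistent: add the setting to the per-user default MCA param file.
mkdir -p "$HOME/.openmpi"
echo "oob = ^usock" >> "$HOME/.openmpi/mca-params.conf"

# Or per-session, via the environment:
export OMPI_MCA_oob=^usock
```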

Re: [OMPI users] local communicator and crash of the code

2018-08-03 Thread Ralph H Castain
Those two command lines look exactly the same to me - what am I missing? > On Aug 3, 2018, at 10:23 AM, Diego Avesani wrote: > > Dear all, > > I am experiencing a strange error. > > In my code I use three group communications: > MPI_COMM_WORLD > MPI_MASTERS_COMM > LOCAL_COMM > > which have

Re: [OMPI users] Settings oversubscribe as default?

2018-08-03 Thread Ralph H Castain
The equivalent MCA param is rmaps_base_oversubscribe=1. You can add OMPI_MCA_rmaps_base_oversubscribe to your environ, or set rmaps_base_oversubscribe in your default MCA param file. > On Aug 3, 2018, at 1:24 AM, Florian Lindner wrote: > > Hello, > > I can use --oversubscribe to enable
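Both forms, sketched (the ~/.openmpi/mca-params.conf path is the conventional per-user default param file):

```shell
# Per-session, via the environment:
export OMPI_MCA_rmaps_base_oversubscribe=1

# Or persistently, in the per-user default MCA param file:
mkdir -p "$HOME/.openmpi"
echo "rmaps_base_oversubscribe = 1" >> "$HOME/.openmpi/mca-params.conf"
```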

Re: [OMPI users] Comm_connect: Data unpack would read past end of buffer

2018-08-03 Thread Ralph H Castain
The buffer being overrun isn’t anything to do with you - it’s an internal buffer used as part of creating the connections. It indicates a problem in OMPI. The 1.10 series is out of the support window, but if you want to stick with it you should at least update to the last release in that series

Re: [OMPI users] hwloc, OpenMPI and unsupported OSes and toolchains

2018-03-21 Thread Ralph H Castain
I don’t see how Open MPI can operate without pthreads > On Mar 19, 2018, at 3:23 PM, Gregory (tim) Kelly wrote: > > Hello Everyone, > I'm inquiring to find someone that can answer some multi-part questions about > hwloc, OpenMPI and an alternative OS and toolchain. I have a

Re: [OMPI users] Query regarding OMPI_MCA_ns_nds_vpid env variable

2008-07-11 Thread Ralph H Castain
On 7/11/08 8:33 AM, "Ashley Pittman" <apitt...@concurrent-thinking.com> wrote: > On Fri, 2008-07-11 at 08:01 -0600, Ralph H Castain wrote: >>>> I believe this is partly what motivated the creation of the MPI envars - to >>>> create a vehicle

Re: [OMPI users] Query regarding OMPI_MCA_ns_nds_vpid env variable

2008-07-11 Thread Ralph H Castain
On 7/11/08 7:50 AM, "Ashley Pittman" <apitt...@concurrent-thinking.com> wrote: > On Fri, 2008-07-11 at 07:42 -0600, Ralph H Castain wrote: >> >> >> On 7/11/08 7:32 AM, "Ashley Pittman" <apitt...@concurrent-thinking.com> >> wrote:

Re: [OMPI users] Outputting rank and size for all outputs.

2008-07-11 Thread Ralph H Castain
reat, and are probably a little > nicer than my current setup. > > -Mark > > > On Jul 11, 2008, at 9:46 AM, Ralph H Castain wrote: > >> Adding the ability to tag stdout/err with the process rank is fairly >> simple. >> We are going to talk about this next week

Re: [OMPI users] Outputting rank and size for all outputs.

2008-07-11 Thread Ralph H Castain
Adding the ability to tag stdout/err with the process rank is fairly simple. We are going to talk about this next week at a design meeting - we have several different tagging schemes that people have requested, so we want to define a way to meet them all that doesn't create too much ugliness in

Re: [OMPI users] Query regarding OMPI_MCA_ns_nds_vpid env variable

2008-07-11 Thread Ralph H Castain
On 7/11/08 7:32 AM, "Ashley Pittman" <apitt...@concurrent-thinking.com> wrote: > On Fri, 2008-07-11 at 07:20 -0600, Ralph H Castain wrote: >> This variable is only for internal use and has no applicability to a user. >> Basically, it is used by the local daemon

Re: [OMPI users] Query regarding OMPI_MCA_ns_nds_vpid env variable

2008-07-11 Thread Ralph H Castain
This variable is only for internal use and has no applicability to a user. Basically, it is used by the local daemon to tell an application process its rank when launched. Note that it disappears in v1.3...so I wouldn't recommend looking for it. Is there something you are trying to do with it?
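For reference, in v1.3 and later a launched process can read its identity from the documented OMPI_COMM_WORLD_* environment variables instead; a sketch (these are only populated inside a process started by mpirun/orted):

```shell
# Only set when the process was launched by Open MPI >= 1.3.
echo "rank:       ${OMPI_COMM_WORLD_RANK:-unset}"
echo "size:       ${OMPI_COMM_WORLD_SIZE:-unset}"
echo "local rank: ${OMPI_COMM_WORLD_LOCAL_RANK:-unset}"
```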

Re: [OMPI users] ORTE_ERROR_LOG timeout

2008-07-08 Thread Ralph H Castain
Several things are going on here. First, this error message: > mpirun noticed that job rank 1 with PID 9658 on node mac1 exited on signal > 6 (Aborted). > 2 additional processes aborted (not shown) indicates that your application procs are aborting for some reason. The system is then attempting

Re: [OMPI users] mpirun w/ enable-mpi-threads spinning up cputime when app path is invalid

2008-07-02 Thread Ralph H Castain
Sorry - went to one of your links to get that info. We know OMPI 1.2.x isn't thread safe. This is unfortunately another example of it. Hopefully, 1.3 will be better. Ralph On 7/2/08 11:01 AM, "Ralph H Castain" <r...@lanl.gov> wrote: > Out of curiosity - what version of

Re: [OMPI users] mpirun w/ enable-mpi-threads spinning up cputime when app path is invalid

2008-07-02 Thread Ralph H Castain
Out of curiosity - what version of OMPI are you using? On 7/2/08 10:46 AM, "Steve Johnson" wrote: > If mpirun is given an application that isn't in the PATH, then instead of > exiting it prints the error that it failed to find the executable and then > proceeds to spin up cpu

Re: [OMPI users] Need some help regarding Linpack execution

2008-07-02 Thread Ralph H Castain
You also might want to resend this to the MPICH mailing list - this is the Open MPI mailing list ;-) On 7/2/08 8:03 AM, "Swamy Kandadai" wrote: > Hi: > May be you do not have 12 entries in your machine.list file. You need to have > at least np lines in your machine.list > >

Re: [OMPI users] mca parameters: meaning and use

2008-06-26 Thread Ralph H Castain
Actually, I suspect the requestor was hoping for an explanation somewhat more illuminating than the terse comments output by ompi_info. ;-) Bottom line is "no". We have talked numerous times about the need to do this, but unfortunately there has been little accomplished. I doubt it will happen

Re: [OMPI users] Displaying Selected MCA Modules

2008-06-23 Thread Ralph H Castain
I can guarantee bproc support isn't broken in 1.2 - we use it on several production machines every day, and it works fine. I heard of only one potential problem having to do with specifying multiple app_contexts on a cmd line, but we are still trying to confirm that it wasn't operator error. In

Re: [OMPI users] null characters in output

2008-06-19 Thread Ralph H Castain
is before? I am working on a simple test case, but > unfortunately have not found one that is deterministic so far. > > Thanks, > Federico > > -Original Message- > From: Ralph H Castain [mailto:r...@lanl.gov] > Sent: Tuesday, June 17, 2008 1:09 PM > To: Sacerdoti,

Re: [OMPI users] SLURM and OpenMPI

2008-06-19 Thread Ralph H Castain
such failures, and in fact did it more effectively. For example if slurm > detects a node has failed, it will stop the job, allocate an additional > free node to make up the deficit, then relaunch. It more difficult (to > put it mildly) for a job launcher to do that. > > Thanks again,

Re: [OMPI users] SLURM and OpenMPI

2008-06-17 Thread Ralph H Castain
I can believe 1.2.x has problems in that regard. Some of that has nothing to do with slurm and reflects internal issues with 1.2. We have made it much more resistant to those problems in the upcoming 1.3 release, but there is no plan to retrofit those changes to 1.2. Part of the problem was that

Re: [OMPI users] Application Context and OpenMPI 1.2.4

2008-06-17 Thread Ralph H Castain
Hi Pat A friendly elf forwarded this to me, so please be sure to explicitly include me on any reply. Was that the only error message you received? I would have expected a trail of "error_log" outputs that would help me understand where this came from. If not, I can give you some debug flags to

Re: [OMPI users] specifying hosts in mpi_spawn()

2008-06-02 Thread Ralph H Castain
o another > version if necessary. > > > 2008/5/30 Ralph H Castain <r...@lanl.gov>: >> I'm afraid I cannot answer that question without first knowing what version >> of Open MPI you are using. Could you provide that info? >> >> Thanks >> Ralph >> >

Re: [OMPI users] specifying hosts in mpi_spawn()

2008-05-30 Thread Ralph H Castain
I'm afraid I cannot answer that question without first knowing what version of Open MPI you are using. Could you provide that info? Thanks Ralph On 5/29/08 6:41 PM, "Bruno Coutinho" wrote: > How mpi handles the host string passed in the info argument to >

Re: [OMPI users] Proper use of sigaction in Open MPI?

2008-04-24 Thread Ralph H Castain
I have never tested this before, so I could be wrong. However, my best guess is that the following is happening: 1. you trap the signal and do your cleanup. However, when your proc now exits, it does not exit with a status of "terminated-by-signal". Instead, it exits normally. 2. the local

[OMPI users] FW: problems with hostfile when doing MPMD

2008-04-14 Thread Ralph H Castain
Hi Jody I believe this was intended for the Users mailing list, so I'm sending the reply there. We do plan to provide more explanation on these in the 1.3 release - believe me, you are not alone in puzzling over all the configuration params! Many of us in the developer community also sometimes

Re: [OMPI users] Need explanation for the following ORTE error message

2008-01-23 Thread Ralph H Castain
On 1/23/08 8:26 AM, "David Gunter" wrote: > A user of one of our OMPI 1.2.3 builds encountered the following error > message during an MPI job run: > > ORTE_ERROR_LOG: File read failure in file > util/universe_setup_file_io.c at line 123 It means that at some point in the

Re: [OMPI users] orte in persistent mode

2008-01-02 Thread Ralph H Castain
Hi Neeraj No, we still don't support having a persistent set of daemons acting as some kind of "virtual machine" like LAM/MPI did. We at one time had talked about adding it. However, our most recent efforts have actually taken us away from supporting that mode of operation. As a result, I very

Re: [OMPI users] Torque and OpenMPI 1.2

2007-12-20 Thread Ralph H Castain
t; Optimization and Uncertainty Estimation >> Sandia National Laboratories >> P.O. Box 5800, Mail Stop 1318 >> Albuquerque, NM 87185-1318 >> Voice: 505-284-8845, FAX: 505-284-2518 >> >>> -Original Message- >>> From: users-boun..

Re: [OMPI users] mpirun: specify multiple install prefixes

2007-12-20 Thread Ralph H Castain
I'm afraid not - nor is it in the plans for 1.3 either. I'm afraid it fell through the cracks as the needs inside the developer community moved into other channels. I'll raise the question internally and see if people feel we should do this. It wouldn't be hard to put it into 1.3 at this point,

Re: [OMPI users] Torque and OpenMPI 1.2

2007-12-19 Thread Ralph H Castain
ra parms. Will 1.3 also carry the same > restrictions you list below? > Pat > > J.W. (Pat) O'Bryant,Jr. > Business Line Infrastructure > Technical Systems, HPC > Office: 713-431-7022 > > > > >

Re: [OMPI users] Torque and OpenMPI 1.2

2007-12-19 Thread Ralph H Castain
>>>>> Terry, >>>>>Your suggestion worked. So long as I specifically state >>>>> "--without-tm", >>>>> the OpenMPI 1.2.4 build allows the use of "-hostfile". >>>> Apparently, by >>>>> default, OpenMPI 1.2.4 will incorporate To

Re: [OMPI users] Torque and OpenMPI 1.2

2007-12-18 Thread Ralph H Castain
Hate to be a party-pooper, but the answer is "no" in OpenMPI 1.2. We don't allow the use of a hostfile in a Torque environment in that version. We have changed this for v1.3, but you'll have to wait for that release. Sorry Ralph On 12/18/07 11:12 AM, "pat.o'bry...@exxonmobil.com"

Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration

2007-12-18 Thread Ralph H Castain
eside - everything should work the same. Just as an FYI: the name of that environmental variable is going to change in the 1.3 release, but everything will still work the same. Hope that helps Ralph > > Thanks and regards, > Elena > > > -Original Message- > From:

Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration

2007-12-17 Thread Ralph H Castain
above). This may become available in a future release - TBD. Hope that helps Ralph > > Thanks and regards, > Elena > > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Ralph H Castain > Sent: Monday, December

Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration

2007-12-17 Thread Ralph H Castain
On 12/12/07 5:46 AM, "Elena Zhebel" wrote: > > > Hello, > > > > I'm working on a MPI application where I'm using OpenMPI instead of MPICH. > > In my "master" program I call the function MPI::Intracomm::Spawn which spawns > "slave" processes. It is not

Re: [OMPI users] ORTE_ERROR_LOG: Data unpack had inadequate space in file gpr_replica_cmd_processor.c at line 361

2007-12-14 Thread Ralph H Castain
ter > Moffett Field, CA 94035-1000 > > Fax: 415-604-3957 > > > If I try to use multiple nodes, I got the error messages: > ORTE_ERROR_LOG: Data unpack had inadequate space in file dss/dss_unpack.c at > line 90 > ORTE_ERROR_LOG: Data unpack had inadequate space in fil

Re: [OMPI users] ORTE_ERROR_LOG: Data unpack had inadequate space in file gpr_replica_cmd_processor.c at line 361

2007-12-14 Thread Ralph H Castain
Hi Qiang This error message usually indicates that you have more than one Open MPI installation around, and that the backend nodes are picking up a different version than mpirun is using. Check to make sure that you have a consistent version across all the nodes. I also noted you were building

Re: [OMPI users] Q: Problems launching MPMD applications? ('mca_oob_tcp_peer_try_connect' error 103)

2007-12-06 Thread Ralph H Castain
On 12/5/07 8:47 AM, "Brian Dobbins" wrote: > Hi Josh, > >> I believe the problem is that you are only applying the MCA >> parameters to the first app context instead of all of them: > > Thank you very much.. applying the parameters with -gmca works fine with the > test

Re: [OMPI users] Job does not quit even when the simulation dies

2007-11-07 Thread Ralph H Castain
As Jeff indicated, the degree of capability has improved over time - I'm not sure which version this represents. The type of failure also plays a major role in our ability to respond. If a process actually segfaults or dies, we usually pick that up pretty well and abort the rest of the job

Re: [OMPI users] Circumvent --host or dynamically read host info?

2007-08-30 Thread Ralph H Castain
I take it you are running in an rsh/ssh environment (as opposed to a managed environment like SLURM)? I'm afraid that you have to tell us -all- of the nodes that will be utilized in your job at the beginning (i.e., to mpirun). This requirement is planned to be relaxed in a later version, but that

Re: [OMPI users] memory leaks on solaris

2007-08-06 Thread Ralph H Castain
t; I would be curious if this helps. > > -DON > p.s. orte-clean does not exist in the ompi v1.2 branch, it is in the > trunk but I think there is an issue with it currently > > Ralph H Castain wrote: > >> >> On 8/5/07 6:35 PM, "Glenn Carver" <gl

Re: [OMPI users] memory leaks on solaris

2007-08-06 Thread Ralph H Castain
On 8/5/07 6:35 PM, "Glenn Carver" wrote: > I'd appreciate some advice and help on this one. We're having > serious problems running parallel applications on our cluster. After > each batch job finishes, we lose a certain amount of available > memory.

Re: [OMPI users] mpi daemon

2007-08-02 Thread Ralph H Castain
The daemon's name is "orted" - one will be launched on each remote node as the application is started, but they only live for as long as the application is executing. Then they go away. On 8/2/07 12:47 PM, "Reuti" wrote: > Am 02.08.2007 um 18:32 schrieb Francesco

Re: [OMPI users] orterun --bynode/--byslot problem

2007-07-23 Thread Ralph H Castain
Yes...it would indeed. On 7/23/07 9:03 AM, "Kelley, Sean" <sean.kel...@solers.com> wrote: > Would this logic be in the bproc pls component? > Sean > > > From: users-boun...@open-mpi.org on behalf of Ralph H Castain > Sent: Mon 7/23/2007 9:18 AM > T

Re: [OMPI users] orterun --bynode/--byslot problem

2007-07-23 Thread Ralph H Castain
No, byslot appears to be working just fine on our bproc clusters (it is the default mode). As you probably know, bproc is a little strange in how we launch - we have to launch the procs in "waves" that correspond to the number of procs on a node. In other words, the first "wave" launches a proc

Re: [OMPI users] OpenMPI start up problems

2007-07-19 Thread Ralph H Castain
I gather you are running under TM since you have a PBS_NODEFILE? If so, in 1.2 we set up to read that file directly - you cannot specify it on the command line. We will fix this in 1.3 so you can do both, but for now - under TM - you have to leave that "-machinefile $PBS_NODEFILE" off of the
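For context on what "read that file directly" means: a PBS_NODEFILE conventionally lists one hostname per allocated slot, repeating a host once per slot. A minimal sketch of collapsing such a file into per-host slot counts (hypothetical helper and hostnames, not Open MPI's parser):

```python
from collections import Counter

def parse_nodefile(lines):
    """Collapse PBS_NODEFILE-style lines (one hostname per slot)
    into an ordered list of (hostname, slots) pairs."""
    counts = Counter()
    order = []
    for line in lines:
        host = line.strip()
        if not host:
            continue
        if host not in counts:
            order.append(host)
        counts[host] += 1
    return [(h, counts[h]) for h in order]

lines = ["node01", "node01", "node02", "node02", "node02"]
print(parse_nodefile(lines))  # [('node01', 2), ('node02', 3)]
```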

Re: [OMPI users] mpirun hanging followup

2007-07-18 Thread Ralph H Castain
Hooray! Glad we could help track this down - sorry it was so hard to do so. To answer your questions: 1. Yes - ORTE should bail out gracefully. It definitely should not hang. I will log the problem and investigate. I believe I know where the problem lies, and it may already be fixed on our

Re: [OMPI users] mpirun hanging followup

2007-07-18 Thread Ralph H Castain
On 7/18/07 11:46 AM, "Bill Johnstone" wrote: > --- Ralph Castain wrote: > >> No, the session directory is created in the tmpdir - we don't create >> anything anywhere else, nor do we write any executables anywhere. > > In the case where the TMPDIR env

Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Ralph H Castain
Tim has proposed a clever fix that I had not thought of - just be aware that it could cause unexpected behavior at some point. Still, for what you are trying to do, that might meet your needs. Ralph On 7/18/07 11:44 AM, "Tim Prins" wrote: > Adam C Powell IV wrote: >> As

Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Ralph H Castain
On 7/18/07 9:49 AM, "Adam C Powell IV" wrote: > As mentioned, I'm running in a chroot environment, so rsh and ssh won't > work: "rsh localhost" will rsh into the primary local host environment, > not the chroot, which will fail. > > [The purpose is to be able to build

Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)

2007-07-11 Thread Ralph H Castain
put is from loading that module; the next thing in > the code is the os.system call to start orterun with 2 processors.) > > Also, there is absolutely no output from the second orterun-launched > program (even the first line does not execute.) > > Cheers, > > Lev > > > >&

Re: [OMPI users] Recursive use of "orterun"

2007-07-11 Thread Ralph H Castain
I'm unaware of any issues that would cause it to fail just because it is being run via that interface. The error message is telling us that the procs got launched, but then orterun went away unexpectedly. Are you seeing your procs complete? We do sometimes see that message due to a race condition

Re: [OMPI users] mpirun hanging when processes started on head node

2007-06-12 Thread Ralph H Castain
Hi Sean > [Sean] I'm working through the strace output to follow the progression on the > head node. It looks like mpirun consults '/bpfs/self' and determines that the > request is to be run on the local machine so it fork/execs 'orted' which then > runs 'hostname'. 'mpirun' didn't consult

Re: [OMPI users] mpirun hanging when processes started on head node

2007-06-11 Thread Ralph H Castain
Hi Sean Could you please clarify something? I'm a little confused by your comments about where things are running. I'm assuming that you mean everything works fine if you type the mpirun command on the head node and just let it launch on your compute nodes – that the problems only occur when you

Re: [OMPI users] MPI_Comm_Spawn

2007-04-04 Thread Ralph H Castain
sible. Threading support is VERY lightly tested, but I >> doubt it is the problem since it always fails after 31 spawns. >> >> Again, I have tried with these configure options and the same version >> of Open MPI and have still have been able to replicate this (after >> letti
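A loop that reliably fails after a fixed number of iterations (here 31 spawns) is the classic signature of a leaked per-iteration resource hitting a process limit, such as the pipe leak named in the later thread above. A stand-in sketch using plain OS pipes rather than MPI (made-up function, not Open MPI code) shows the pattern of releasing both descriptors every cycle:

```python
import os

def cycle(n):
    """Open a pipe pair each iteration and close both ends.
    Omitting the close() calls is the kind of leak that makes a
    spawn loop die once the file-descriptor limit is reached."""
    for _ in range(n):
        r, w = os.pipe()
        os.close(r)
        os.close(w)
    return n

print(cycle(10000))  # completes because each pair is closed
```

With the two `os.close()` calls removed, the same loop would abort far earlier with EMFILE on a typical default descriptor limit.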

Re: [OMPI users] Open MPI error when using MPI_Comm_spawn

2007-04-04 Thread Ralph H Castain
Hi Prakash I can't really test this solution as the Torque dynamic host allocator appears to be something you are adding to that system (so it isn't part of the released code). However, the attached code should cleanly add any nodes to any existing allocation known to OpenRTE. I hope to resume
