[OMPI devel] ORTE->PRRTE: some consequences to communicate to users

2020-04-28 Thread Ralph Castain via devel
So here is an interesting consequence of moving from ORTE to PRRTE. In ORTE, you could express any mapping policy as an MCA param - e.g., the following: OMPI_MCA_rmaps_base_mapping_policy=core OMPI_MCA_rmaps_base_display_map=1 would be the equivalent of a cmd line that included "--map-by core -
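
The ORTE-era convention described in the message above (an MCA parameter exported as an environment variable with the `OMPI_MCA_` prefix) can be sketched as follows. The helper name `mca_env` is hypothetical and used only for illustration. The sketch covers the ORTE behaviour the message describes and makes no claim about what PRRTE accepts.

```python
def mca_env(params):
    """Map MCA parameter names to ORTE-style environment variables.

    Hypothetical helper: it only applies the OMPI_MCA_ prefix shown in
    the message above and does not check that the parameters exist.
    """
    return {f"OMPI_MCA_{name}": str(value) for name, value in params.items()}


# The two parameters quoted in the message.
env = mca_env({
    "rmaps_base_mapping_policy": "core",
    "rmaps_base_display_map": 1,
})

for key, value in env.items():
    print(f"{key}={value}")
```

The resulting dictionary could be merged into `os.environ` before launching a job, which is one way to keep the mapping policy out of the command line.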

Re: [OMPI devel] ORTE has been removed!

2020-02-10 Thread Jeff Squyres (jsquyres) via devel
On Feb 8, 2020, at 3:30 PM, Ralph Castain via devel wrote: > > FYI: pursuant to the objectives outlined last year, I have committed PR #7202 > and removed ORTE from the OMPI repository. It has been replaced with a PRRTE > submodule pointed at the PRRTE master branch. At the same time, we replace

[OMPI devel] ORTE has been removed!

2020-02-08 Thread Ralph Castain via devel
FYI: pursuant to the objectives outlined last year, I have committed PR #7202 and removed ORTE from the OMPI repository. It has been replaced with a PRRTE submodule pointed at the PRRTE master branch. At the same time, we replaced the embedded PMIx code tree with a submodule pointed to the PMIx ma

[OMPI devel] ORTE replacement

2019-12-25 Thread Ralph Castain via devel
Hi folks The move to replace ORTE with PRRTE is now ready to go (the OSHMEM team needs to fix something in that project). This means that all further development activity and/or PRs involving ORTE should be transferred to the PRRTE project (https://github.com/openpmix/prrte). Existing PRs that

[OMPI devel] ORTE DVM update

2017-09-18 Thread r...@open-mpi.org
Hi all The DVM on master is working again. You will need to use the new “prun” tool instead of “orterun” to submit your jobs - note that “prun” automatically finds the DVM, and so there is no longer any need to have orte-dvm report its URI, nor does prun take the “-hnp” argument. The “orte-ps”

Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-06-23 Thread Christoph Niethammer
Hi Howard, You find the pull request under https://github.com/open-mpi/ompi/pull/3739 Best Christoph - Original Message - From: "Howard Pritchard" To: "Open MPI Developers" Sent: Thursday, June 22, 2017 4:42:14 PM Subject: Re: [OMPI devel] orte-clean not cleaning l

Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-06-22 Thread Howard Pritchard
en a pull request? > > Best > Christoph > > - Original Message - > From: "Howard Pritchard" > To: "Open MPI Developers" > Sent: Wednesday, June 21, 2017 5:57:05 PM > Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary I/O > files i

Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-06-22 Thread Christoph Niethammer
Hi Howard, Sorry, missed the new license policy. I added a Sign-off now. Shall I open a pull request? Best Christoph - Original Message - From: "Howard Pritchard" To: "Open MPI Developers" Sent: Wednesday, June 21, 2017 5:57:05 PM Subject: Re: [OMPI devel] orte-cle

Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-06-21 Thread Howard Pritchard
2c02 >> >> Best >> Christoph >> >> - Original Message - >> From: "Ralph Castain" >> To: "Open MPI Developers" >> Sent: Wednesday, June 21, 2017 4:33:29 AM >> Subject: Re: [OMPI devel] orte-clean not cleaning lef

Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-06-21 Thread Howard Pritchard
ph > > - Original Message - > From: "Ralph Castain" > To: "Open MPI Developers" > Sent: Wednesday, June 21, 2017 4:33:29 AM > Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary I/O > files in /tmp > > I updated orte-c

Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-06-21 Thread Christoph Niethammer
://github.com/cniethammer/ompi/commit/2aedf6134813299803628e7d6856a3b781542c02 Best Christoph - Original Message - From: "Ralph Castain" To: "Open MPI Developers" Sent: Wednesday, June 21, 2017 4:33:29 AM Subject: Re: [OMPI devel] orte-clean not cleaning left over tem

Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-06-20 Thread r...@open-mpi.org
To: "Open MPI Developers" > Sent: Monday, May 8, 2017 6:28:42 PM > Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary I/O > files in /tmp > > What version of OMPI are you using? > >> On May 8, 2017, at 8:56 AM, Christoph Niethammer wrote: >

Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-05-09 Thread Christoph Niethammer
Hi, I am using Open MPI 2.1.0. Best Christoph - Original Message - From: "Ralph Castain" To: "Open MPI Developers" Sent: Monday, May 8, 2017 6:28:42 PM Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp What version of OMPI

Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-05-08 Thread r...@open-mpi.org
What version of OMPI are you using? > On May 8, 2017, at 8:56 AM, Christoph Niethammer wrote: > > Hello > > According to the manpage "...orte-clean attempts to clean up any processes > and files left over from Open MPI jobs that were run in the past as well as > any currently running jobs. Th

[OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-05-08 Thread Christoph Niethammer
Hello According to the manpage "...orte-clean attempts to clean up any processes and files left over from Open MPI jobs that were run in the past as well as any currently running jobs. This includes OMPI infrastructure and helper commands, any processes that were spawned as part of the job, and

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-11-01 Thread Ralph Castain
Just to close the loop: on a prior email, you had requested adding an MCA param for -host that mirrored what we have for -hostfile. I have now added that to the OMPI master - you can set the MCA param orte_default_dash_host and both orterun and orte-submit will properly pick it up. Ralph > On

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-27 Thread Ralph Castain
Good to hear - thanks! > On Oct 27, 2015, at 11:37 AM, Mark Santcroos > wrote: > > >> On 24 Oct 2015, at 7:54 , Mark Santcroos wrote: >> Will test it on real systems once it hits master. > > FYI: It's been holding up pretty well on real deployment too!

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-27 Thread Mark Santcroos
> On 24 Oct 2015, at 7:54 , Mark Santcroos wrote: > Will test it on real systems once it hits master. FYI: It's been holding up pretty well on real deployment too!

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-24 Thread Mark Santcroos
Great!! Can't reproduce the problem anymore on my laptop either! Thanks Ralph! Will test it on real systems once it hits master. Cheers, Mark > On 24 Oct 2015, at 6:08 , Ralph Castain wrote: > > Thanks Mark!! > > Your logs helped me track it down - a fix is in the oven: > > https://github.c

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-24 Thread Ralph Castain
Thanks Mark!! Your logs helped me track it down - a fix is in the oven: https://github.com/open-mpi/ompi/pull/1067 Once that completes, I’ll schedule it for 1.10.1 and 2.0. If you get a chance, please give it a try and let me know how it works for y

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-23 Thread Ralph Castain
Could be - let me investigate this weekend. Thanks for all that parsing!!! > On Oct 23, 2015, at 5:00 PM, Mark Santcroos > wrote: > > Is this the culprit? > > 'ACTIVATING PROC [[8679,2],0] STATE IOF COMPLETE PRI 4', > 'state:base:track_procs called for proc [[8679,2],0] state RUNNING', > > T

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-23 Thread Mark Santcroos
Is this the culprit? 'ACTIVATING PROC [[8679,2],0] STATE IOF COMPLETE PRI 4', 'state:base:track_procs called for proc [[8679,2],0] state RUNNING', That seems to be out of order for the hanging processes.

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-23 Thread Mark Santcroos
> On 23 Oct 2015, at 23:45 , Mark Santcroos wrote: > the second is output from my parser script I figured you might want the output of the succeeded jobs too, please see the updated output attached. Jobs started 16 Jobs completed: 15 Procs completed: 16 Communication "Errors": 15 JOB [9019,

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-23 Thread Mark Santcroos
> On 21 Oct 2015, at 2:50 , Ralph Castain wrote: > Can you do me a favor? Hi Ralph, It required some parsing-fu, but here you go! :-) Three text files attached. One is the raw log, the second is output from my parser script and the third is the output of pstree after it hangs. Hopefully this

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-20 Thread Ralph Castain
Hey Mark Can you do me a favor? I’m totally buried, but I have been able to replicate this on my machine, so it is a definite race condition. What would really help me is if you could do the following: * start the orte-dvm with the “--mca state_base_verbose 10” option, and capture stdout/stderr

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Mark Santcroos
> On 16 Oct 2015, at 0:44 , Ralph Castain wrote: > > Hmmm...ok. I'll have to look at it this weekend when I return from travel. > Can you please send me your test program so I can try to locally reproduce it? Ok, thanks Ralph. Start the DVM with: orte-dvm --report-uri dvm_uri --debug-devel

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Ralph Castain
Hmmm...ok. I'll have to look at it this weekend when I return from travel. Can you please send me your test program so I can try to locally reproduce it? On Thu, Oct 15, 2015 at 3:42 PM, Mark Santcroos wrote: > > > On 16 Oct 2015, at 0:23 , Ralph Castain wrote: > > Okay, that means that the

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Mark Santcroos
> On 16 Oct 2015, at 0:23 , Ralph Castain wrote: > Okay, that means that the dvm isn't recognizing that the jobs actually > completed. Ok. > So the question is: what is it about those jobs? They are all the same. > Are those 6 jobs very short-lived, and the others are longer-lived? All very

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Ralph Castain
Okay, that means that the dvm isn't recognizing that the jobs actually completed. So the question is: what is it about those jobs? Are those 6 jobs very short-lived, and the others are longer-lived? If you look at the nodes (before you kill the dvm), are any of those procs still there? On Thu, Oct

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Mark Santcroos
> On 16 Oct 2015, at 0:09 , Ralph Castain wrote: > > Help me out a bit - how many jobs did you actually run? 42 tasks in total, 6 stalled, 36 returned.

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Ralph Castain
Help me out a bit - how many jobs did you actually run? On Thu, Oct 15, 2015 at 2:33 PM, Mark Santcroos wrote: > > > On 15 Oct 2015, at 17:25 , Ralph Castain wrote: > > > > Interesting - I see why. Please try this version. > > Ok, that works as expected. > > I'll repeat the results with this

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Mark Santcroos
> On 15 Oct 2015, at 17:25 , Ralph Castain wrote: > > Interesting - I see why. Please try this version. Ok, that works as expected. I'll repeat the results with this version too: $ grep TERMINATED dvm_output-patched.txt |wc -l 36 $ grep NOTIFYING dvm_output-patched.txt |wc -l 36
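
The `grep ... | wc -l` check in the message above compares how many jobs were marked terminated with how many completion notices were sent. A minimal sketch of the same comparison is shown below. The sample log lines are invented for illustration and are not taken from the real `dvm_output-patched.txt`; only the two marker words come from the message.

```python
def count_markers(lines, markers=("TERMINATED", "NOTIFYING")):
    """Count lines containing each marker, like `grep MARKER file | wc -l`."""
    counts = {marker: 0 for marker in markers}
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


# Invented sample lines; a real run would iterate over
# open("dvm_output-patched.txt") instead.
sample_log = [
    "job 1 marked TERMINATED",
    "NOTIFYING requestor for job 1",
    "job 2 marked TERMINATED",
    "unrelated debug output",
]

counts = count_markers(sample_log)
missing = counts["TERMINATED"] - counts["NOTIFYING"]
print(counts, "jobs without a completion notice:", missing)
```

A non-zero difference would point at jobs that the DVM considers finished but whose requestor was never told, which is the symptom discussed in this thread.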

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Ralph Castain
Interesting - I see why. Please try this version. Ralph On Thu, Oct 15, 2015 at 4:05 AM, Mark Santcroos wrote: > > > On 15 Oct 2015, at 4:38 , Ralph Castain wrote: > > Okay, please try the attached patch. > > *scratch* > > Although I reported results with the patch earlier, I can't reproduce

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Mark Santcroos
> On 15 Oct 2015, at 4:38 , Ralph Castain wrote: > Okay, please try the attached patch. *scratch* Although I reported results with the patch earlier, I can't reproduce it anymore. Now orte-dvm shuts down after the first orte-submit completes with: [netbook:72038] [[9827,0],0] orted:comm:proc

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Mark Santcroos
Another data point, this only seems to happen for really short tasks, i.e. < 1 sec.

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Mark Santcroos
Hi! > On 15 Oct 2015, at 4:38 , Ralph Castain wrote: > > Okay, please try the attached patch. It will cause two messages to be output > for each job: one indicating the job has been marked terminated, and the > other reporting that the completion message was sent to the requestor. Let's > see

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-14 Thread Ralph Castain
Okay, please try the attached patch. It will cause two messages to be output for each job: one indicating the job has been marked terminated, and the other reporting that the completion message was sent to the requestor. Let's see what that tells us. Thanks Ralph On Wed, Oct 14, 2015 at 3:44 PM,

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-14 Thread Mark Santcroos
Hi Ralph, > On 15 Oct 2015, at 0:26 , Ralph Castain wrote: > Okay, so each orte-submit is reporting job has launched, which means the hang > is coming while waiting to hear the job completed. Are you sure that orte-dvm > believes the job has completed? No, I'm not. > In other words, when you

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-14 Thread Ralph Castain
Okay, so each orte-submit is reporting job has launched, which means the hang is coming while waiting to hear the job completed. Are you sure that orte-dvm believes the job has completed? In other words, when you say that you observe the job as completing, are you basing that on some output from or

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-14 Thread Mark Santcroos
Hi Ralph, > On 14 Oct 2015, at 21:50 , Ralph Castain wrote: > I wonder if they might be getting duplicate process names if started quickly > enough. Do you get the "job has launched" message (orte-submit outputs a > message after orte-dvm responds that the job launched)? Based on the output bel

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-14 Thread Ralph Castain
I wonder if they might be getting duplicate process names if started quickly enough. Do you get the "job has launched" message (orte-submit outputs a message after orte-dvm responds that the job launched)? On Wed, Oct 14, 2015 at 12:04 PM, Mark Santcroos wrote: > Hi, > > By hammering on a DVM

[OMPI devel] orte-dvm / orte-submit race condition

2015-10-14 Thread Mark Santcroos
Hi, By hammering on a DVM with orte-submit I can reproducibly make orte-submit not return, but hang instead. The task is executed correctly though. It can be reproduced using the small snippet below. Switching from sequential to "concurrent" execution of the orte-submit's triggers the effect.
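
The snippet referred to in the message above is not preserved in this listing. As a stand-in, the sketch below shows one way to run many submissions concurrently and count how many fail to return within a timeout. The function names, the timeout, and the worker count are assumptions, and the command is a harmless placeholder rather than a real `orte-submit` invocation, whose arguments are not given here.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_with_timeout(cmd, timeout_s):
    """Return True if cmd exits within timeout_s seconds, False if it stalls."""
    try:
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return False


def hammer(cmd, n_tasks, timeout_s=30.0, workers=8):
    """Run cmd n_tasks times concurrently; return (returned, stalled)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: run_with_timeout(cmd, timeout_s),
                                range(n_tasks)))
    returned = sum(results)
    return returned, n_tasks - returned


# Placeholder command; substitute the real submission command line.
placeholder_cmd = [sys.executable, "-c", "pass"]
returned, stalled = hammer(placeholder_cmd, n_tasks=4)
print(f"{returned} returned, {stalled} stalled")
```

Setting `workers=1` gives the sequential case, so the same harness can compare sequential and concurrent behaviour.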

Re: [OMPI devel] orte-dvm and orte_max_vm_size

2015-09-17 Thread Mark Santcroos
Hi Ralph, Sorry for the late reply, something along the lines of "swamped" ;-) > On 03 Sep 2015, at 16:04 , Ralph Castain wrote: > The purpose of orte_max_vm_size is to subdivide the allocation - i.e., for a > given mpirun execution, you can specify to only use a certain number of the > alloca

Re: [OMPI devel] orte-dvm and orte_max_vm_size

2015-09-03 Thread Ralph Castain
Hi Mark The purpose of orte_max_vm_size is to subdivide the allocation - i.e., for a given mpirun execution, you can specify to only use a certain number of the allocated nodes. If you want to further limit the VM to specific nodes in the allocation, then you would use -host option. It’s a lit

[OMPI devel] orte-dvm and orte_max_vm_size

2015-09-03 Thread Mark Santcroos
Hi, I've been running into some funny issue with using orte-dvm (Hi Ralph ;-) and trying to define the size of the created vm and for that I use "--mca orte_max_vm_size" which in general seems to work. In this example I have a PBS job of 4 nodes and want to run the DVM on < 4 nodes. If I creat

Re: [OMPI devel] orte-dvm startup fails on HEAD

2015-08-22 Thread Mark Santcroos
Yep, it works again, thanks! > On 22 Aug 2015, at 0:00 , Mark Santcroos wrote: > > Thanks Ralph. > The machine in question is in maintenance currently, so can't check, will get > back to you as soon as I can. > >> On 21 Aug 2015, at 16:51 , Ralph Castain wrote: >> >> Okay Mark, I just pushed

Re: [OMPI devel] orte-dvm startup fails on HEAD

2015-08-21 Thread Mark Santcroos
Thanks Ralph. The machine in question is in maintenance currently, so can't check, will get back to you as soon as I can. > On 21 Aug 2015, at 16:51 , Ralph Castain wrote: > > Okay Mark, I just pushed a fix. Sorry for the problem > > >> On Aug 21, 2015, at 7:39 AM, Ralph Castain wrote: >> >

Re: [OMPI devel] orte-dvm startup fails on HEAD

2015-08-21 Thread Ralph Castain
Okay Mark, I just pushed a fix. Sorry for the problem > On Aug 21, 2015, at 7:39 AM, Ralph Castain wrote: > > I found the problem, Howard - has nothing to do with the Cray, but is a > selection issue on the state framework. > > >> On Aug 21, 2015, at 7:37 AM, Howard Pritchard >

Re: [OMPI devel] orte-dvm startup fails on HEAD

2015-08-21 Thread Ralph Castain
I found the problem, Howard - has nothing to do with the Cray, but is a selection issue on the state framework. > On Aug 21, 2015, at 7:37 AM, Howard Pritchard wrote: > > I will check if i can reproduce on nersc systems. > > -- > > sent from my smart phonr so no good type. > > Howar

Re: [OMPI devel] orte-dvm startup fails on HEAD

2015-08-21 Thread Howard Pritchard
I will check if i can reproduce on nersc systems. -- sent from my smart phonr so no good type. Howard On Aug 21, 2015 7:51 AM, "Ralph Castain" wrote: > I’ll take a look at it > > > On Aug 20, 2015, at 11:34 PM, Mark Santcroos > wrote: > > > > Hi all, > > > > I see the errors below on

Re: [OMPI devel] orte-dvm startup fails on HEAD

2015-08-21 Thread Ralph Castain
I’ll take a look at it > On Aug 20, 2015, at 11:34 PM, Mark Santcroos > wrote: > > Hi all, > > I see the errors below on startup of orte-dvm on a Cray XE/XK hybrid. > Didn't track the commit that caused it yet, but maybe somebody has a clue > from the error already. > Last known to work was o

[OMPI devel] orte-dvm startup fails on HEAD

2015-08-21 Thread Mark Santcroos
Hi all, I see the errors below on startup of orte-dvm on a Cray XE/XK hybrid. Didn't track the commit that caused it yet, but maybe somebody has a clue from the error already. Last known to work was on July 14. The 2.x branch works fine. Please let me know if this should be a ticket. Thanks Ma

Re: [OMPI devel] ORTE headers in OPAL source

2014-10-19 Thread Josh Hursey
The first variable can probably be moved to opal pretty easily. That is used when we need to fully shutdown the BTLs and re-init them on continue. We do not have to do that for tcp (since we leave the sockets open), but do have to do that for IB, for example. The second call is a bit tricky since

Re: [OMPI devel] ORTE headers in OPAL source

2014-10-17 Thread Adrian Reber
Josh, I had a look at the code (e.g., opal/mca/btl/sm/btl_sm.c) and there are two uses of orte code: if (orte_cr_continue_like_restart) and /* On restart we need the old file names to exist (not necessarily * contain content) so the CRS component does not fail when searching * for these o

Re: [OMPI devel] ORTE headers in OPAL source

2014-08-11 Thread Adrian Reber
I have seen it. I am still waiting for things to settle down before I start fixing the FT code ( again ;-) Adrian On Mon, Aug 11, 2014 at 01:40:33PM +, Jeff Squyres (jsquyres) wrote: > Ah, I see. > > Ok -- add it to the list of > FT-things-to-be-fixed-before-FT-can-be-suppor

Re: [OMPI devel] ORTE headers in OPAL source

2014-08-11 Thread Jeff Squyres (jsquyres)
Ah, I see. Ok -- add it to the list of FT-things-to-be-fixed-before-FT-can-be-supported-again (which I think Josh just did :-) ). Also: Adrian -- FYI. :-) On Aug 11, 2014, at 9:05 AM, George Bosilca wrote: > I just checked the code and noticed that all the usages of the sstore are > prote

Re: [OMPI devel] ORTE headers in OPAL source

2014-08-11 Thread George Bosilca
I just checked the code and noticed that all the usages of the sstore are protected by an OPAL_ENABLE_FT_CR define. As we are not supporting FT, I don't think this is something we should spend time fixing right now. George. On Sat, Aug 9, 2014 at 8:06 AM, Jeff Squyres (jsquyres) wrote: > I

Re: [OMPI devel] ORTE headers in OPAL source

2014-08-09 Thread Josh Hursey
Those calls should be protected with the CR FT #define - If I remember correctly. We were using the sstore to track the shared memory file names so we could clean them up on restart. I'm not sure if the sstore framework is necessary in this location, since we should be able to tell opal_crs and it

Re: [OMPI devel] ORTE headers in OPAL source

2014-08-09 Thread Jeff Squyres (jsquyres)
I think you're making a joke, right...? I see direct calls to ORTE sstore functionality in all three. On Aug 8, 2014, at 5:42 PM, George Bosilca wrote: > These are harmless. They are only used when FT is enabled which should rarely > be the case. > > George. > > > > On Fri, Aug 8, 201

Re: [OMPI devel] ORTE headers in OPAL source

2014-08-08 Thread George Bosilca
These are harmless. They are only used when FT is enabled which should rarely be the case. George. On Fri, Aug 8, 2014 at 4:36 PM, Jeff Squyres (jsquyres) wrote: > Here's a few ORTE headers in OPAL source -- can respective owners clean > these up? Thanks. > > - > mca/btl/smcuda/btl_smc

[OMPI devel] ORTE headers in OPAL source

2014-08-08 Thread Jeff Squyres (jsquyres)
Here's a few ORTE headers in OPAL source -- can respective owners clean these up? Thanks. - mca/btl/smcuda/btl_smcuda.c 63:#include "orte/mca/sstore/sstore.h" mca/btl/sm/btl_sm.c 62:#include "orte/mca/sstore/sstore.h" mca/mpool/sm/mpool_sm_module.c 34:#include "orte/mca/sstore/sstore.h" --

Re: [OMPI devel] orte-restart and PATH

2014-03-14 Thread Josh Hursey
It looks like I did not add the prefix path to the binary name before fork/exec in orte-restart. There is a string variable that you can use to get the appropriate prefix: opal_install_dirs.prefix from opal/mca/installdirs/installdirs.h It's the same one that Ralph mentioned that orterun uses

Re: [OMPI devel] orte-restart and PATH

2014-03-12 Thread Ralph Castain
That's what the --enable-orterun-prefix-by-default configure option is for On Mar 12, 2014, at 9:28 AM, Adrian Reber wrote: > I am using orte-restart without setting my PATH to my Open MPI > installation. I am running /full/path/to/orte-restart and orte-restart > tries to run mpirun to restart

[OMPI devel] orte-restart and PATH

2014-03-12 Thread Adrian Reber
I am using orte-restart without setting my PATH to my Open MPI installation. I am running /full/path/to/orte-restart and orte-restart tries to run mpirun to restart the process. This fails on my system because I do not have any mpirun in my PATH. Is it expected for an Open MPI installation to set u

Re: [OMPI devel] ORTE

2012-06-19 Thread George Bosilca
We've been through similar processes quite a few times, with the result we all know today. Such a major change should not be done without thoughtful analysis, a careful consideration of all the potential drawbacks and benefits and, of course, the consideration of all competing approaches offerin

[OMPI devel] ORTE

2012-06-16 Thread Ralph Castain
Over the next month, there will be significant changes to ORTE both in terms of framework APIs and internal behavior. This work will focus on a few areas: 1. launch scalability and timing. I try to review our status on this whenever we prepare for the start of a new release series, and as usual

[OMPI devel] ORTE progress thread

2012-06-07 Thread Ralph Castain
Hi folks At the developer's meeting on Wed, we decided to switch the default setting of two configure options: --enable-orte-progress-threads now defaults to being "enabled" --enable-event-thread-support now defaults to being "enabled" This is a trial condition to see what, if any, impacts are o

[OMPI devel] ORTE async operations

2012-05-20 Thread Ralph Castain
Hi folks Just an FYI: I have committed a change to the developer's trunk that allows ORTE to use an asynchronous progress thread for application processes. This allows out-of-band communications to progress independently from any calls down into the MPI library. This capability is -only- on if

Re: [OMPI devel] orte question

2011-07-27 Thread Ralph Castain
Hmmm...I'm not seeing that behavior. I get a 0 exit code every time. You'll get a 243 if there are stale session directories laying around as it indicates that the mpirun's in those dirs are not reachable. Perhaps that is what's happening? On Jul 27, 2011, at 3:14 PM, Greg Watson wrote: > Ral

Re: [OMPI devel] orte question

2011-07-27 Thread Ralph Castain
Hmmm...no, can't imagine why. I'll fix - thanks! On Jul 27, 2011, at 3:14 PM, Greg Watson wrote: > Ralph, > > Looking good so far. I did notice that ompi-ps always seems to have an exit > code of 243. Is that on purpose? > > Greg > > On Jul 25, 2011, at 4:44 PM, Ralph Castain wrote: > >> r2

Re: [OMPI devel] orte question

2011-07-27 Thread Greg Watson
Ralph, Looking good so far. I did notice that ompi-ps always seems to have an exit code of 243. Is that on purpose? Greg On Jul 25, 2011, at 4:44 PM, Ralph Castain wrote: > r24944 - let me know how it works! > > > On Jul 25, 2011, at 1:01 PM, Greg Watson wrote: > >> That would probably be m

Re: [OMPI devel] orte question

2011-07-25 Thread Ralph Castain
r24944 - let me know how it works! On Jul 25, 2011, at 1:01 PM, Greg Watson wrote: > That would probably be more intuitive. > > Thanks, > Greg > > On Jul 25, 2011, at 2:28 PM, Ralph Castain wrote: > >> job 0 is mpirun and its daemons - I can have it ignore that job as I doubt >> users care :

Re: [OMPI devel] orte question

2011-07-25 Thread Greg Watson
That would probably be more intuitive. Thanks, Greg On Jul 25, 2011, at 2:28 PM, Ralph Castain wrote: > job 0 is mpirun and its daemons - I can have it ignore that job as I doubt > users care :-) > > On Jul 25, 2011, at 12:25 PM, Greg Watson wrote: > >> Ralph, >> >> The output format looks g

Re: [OMPI devel] orte question

2011-07-25 Thread Ralph Castain
job 0 is mpirun and its daemons - I can have it ignore that job as I doubt users care :-) On Jul 25, 2011, at 12:25 PM, Greg Watson wrote: > Ralph, > > The output format looks good, but I'm not sure it's quite correct. If I run > the mpirun command, I see the following: > > mpirun:47520:num n

Re: [OMPI devel] orte question

2011-07-25 Thread Greg Watson
Ralph, The output format looks good, but I'm not sure it's quite correct. If I run the mpirun command, I see the following: mpirun:47520:num nodes:1:num jobs:2 jobid:0:state:RUNNING:slots:0:num procs:0 jobid:1:state:RUNNING:slots:1:num procs:4 process:x:rank:0:pid:47522:node:greg.local:state:SYN
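
The colon-separated output quoted above alternates field names and values, so each record can be read as key/value pairs. A minimal parsing sketch follows. It assumes one record per line (the line breaks are flattened in this listing) and that values contain no colons; `parse_record` is a hypothetical name.

```python
def parse_record(line):
    """Split 'key:value:key:value...' into a dict of strings."""
    fields = line.strip().split(":")
    if len(fields) % 2 != 0:
        raise ValueError(f"odd number of fields in record: {line!r}")
    return dict(zip(fields[0::2], fields[1::2]))


# Records copied from the message above, one per line.
records = [
    "mpirun:47520:num nodes:1:num jobs:2",
    "jobid:0:state:RUNNING:slots:0:num procs:0",
    "jobid:1:state:RUNNING:slots:1:num procs:4",
]

parsed = [parse_record(r) for r in records]
for record in parsed:
    print(record)
```

Field names such as `num procs` contain spaces, which is why the sketch splits on colons only and never on whitespace.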

Re: [OMPI devel] orte question

2011-07-23 Thread Ralph Castain
On Jul 23, 2011, at 5:04 PM, Ashley Pittman wrote: > > On 23 Jul 2011, at 03:55, Ralph Castain wrote: >>> c) A more easily parsable output format from ompi-ps. It doesn't need to be >>> a full blown XML format, just something like the following would suffice: >>> >>> jobid:719585280:state:Runn

Re: [OMPI devel] orte question

2011-07-23 Thread Ashley Pittman
On 23 Jul 2011, at 03:55, Ralph Castain wrote: >> c) A more easily parsable output format from ompi-ps. It doesn't need to be >> a full blown XML format, just something like the following would suffice: >> >> jobid:719585280:state:Running:slots:1:num procs:4 >> process_name:./x:rank:0:pid:3082:n

Re: [OMPI devel] orte question

2011-07-23 Thread Ralph Castain
Okay, you should have it in r24929. Use: orte-ps --parseable to get the new output. On Jul 23, 2011, at 11:43 AM, Ralph Castain wrote: > Gar - have to eat my words a bit. The jobid requested by orte-ps is just the > "local" jobid - i.e., it is expecting you to provide a number from 0-N, as I

Re: [OMPI devel] orte question

2011-07-23 Thread Ralph Castain
Gar - have to eat my words a bit. The jobid requested by orte-ps is just the "local" jobid - i.e., it is expecting you to provide a number from 0-N, as I described below (copied here): > A jobid of 1 indicates the primary application, 2 and above would specify > comm_spawned jobs. Not providi
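
The relation between the integer jobid and the "local" jobid can be illustrated with a short sketch. It assumes the ORTE convention, not stated in this thread, that a 32-bit jobid holds a 16-bit job family in its upper half and the local jobid in its lower half. The function names are hypothetical.

```python
def split_jobid(jobid):
    """Split a 32-bit jobid into (job_family, local_jobid).

    Assumes the upper 16 bits are the job family and the lower 16 bits
    the local jobid; this layout is an assumption, not from the thread.
    """
    return (jobid >> 16) & 0xFFFF, jobid & 0xFFFF


def join_jobid(job_family, local_jobid):
    """Inverse of split_jobid under the same assumption."""
    return ((job_family & 0xFFFF) << 16) | (local_jobid & 0xFFFF)


# 719585280 is the integer jobid quoted elsewhere in this thread.
family, local = split_jobid(719585280)
print(f"[{family},{local}]")
```

Under this assumption, the local jobid of 1 for the primary application and 2 and above for comm_spawned jobs would correspond to the second number in the bracketed form.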

Re: [OMPI devel] orte question

2011-07-22 Thread Ralph Castain
On Jul 22, 2011, at 3:57 PM, Greg Watson wrote: > Hi Ralph, > > I'd like three things :-) > > a) A --report-jobid option that prints the jobid on the first line in a form > that can be passed to the -jobid option on ompi-ps. Probably tagging it in > the output if -tag-output is enabled (e.g.

Re: [OMPI devel] orte question

2011-07-22 Thread Greg Watson
Hi Ralph, I'd like three things :-) a) A --report-jobid option that prints the jobid on the first line in a form that can be passed to the -jobid option on ompi-ps. Probably tagging it in the output if -tag-output is enabled (e.g. jobid:) would be a good idea. b) The orte-ps command output to

Re: [OMPI devel] orte question

2011-07-22 Thread Ralph Castain
Hmmm...well, it looks like we could have made this nicer than we did :-/ If you add --report-uri to the mpirun command line, you'll get back the uri for that mpirun. This has the form of :. As the -h option indicates: -report-uri | --report-uri Printout URI on stdou

[OMPI devel] orte question

2011-07-22 Thread Greg Watson
Hi all, Does anyone know if it's possible to get the orte jobid from the mpirun command? If not, how are you supposed to get it to use with orte-ps? Also, orte-ps reports the jobid in [x,y] notation, but the jobid argument seems to be an integer. How does that work? Thanks, Greg

Re: [OMPI devel] orte does not compile on XT5 (pgcc)

2010-10-26 Thread Ralph Castain
Did this ever get fixed? Anyone able to do so (I can't - no access to PGI or environment)? On Sep 29, 2010, at 1:45 PM, Aurélien Bouteiller wrote: > Here is the problem. The PGI compiler is especially paranoid regarding post > declared structures typedefs. It looks like the include ordering ma

Re: [OMPI devel] orte does not compile on XT5 (pgcc)

2010-09-29 Thread Ralph Castain
I have no way of testing this as I don't have access to a PGI compiler - so if you do, by all means feel free to change the ordering to make them happy. On Sep 29, 2010, at 1:45 PM, Aurélien Bouteiller wrote: > Here is the problem. The PGI compiler is especially paranoid regarding post > decla

[OMPI devel] orte does not compile on XT5 (pgcc)

2010-09-29 Thread Aurélien Bouteiller
Here is the problem. The PGI compiler is especially paranoid regarding post declared structures typedefs. It looks like the include ordering makes the nidmap.h file being included before orte_jmap_t typedefs and siblings have been done. /opt/cray/xt-asyncpe/4.0/bin/cc: INFO: linux target is be

[OMPI devel] ORTE thread safety

2009-12-28 Thread Ralph Castain
Hi folks I have mentioned this on several recent OMPI telecons, but wanted to (a) ensure people on the calls remember, and (b) alert the broader audience to an upcoming significant change to the ORTE layer to make it thread safe - i.e., to allow operation with --enable-progress-thread set. Thi

Re: [OMPI devel] ORTE meeting Feb 25-27, 2008

2009-02-19 Thread Jeff Squyres
I also updated the same wiki page with a few google maps: - One showing the Cisco office general location - One showing where to turn and the satellite picture of the office (it's set a little back from the main road) - One showing klumps of hotels near the Cisco offices Enjoy. On Feb 19

[OMPI devel] ORTE meeting Feb 25-27, 2008

2009-02-19 Thread Ralph Castain
Hi folks, I have (finally) updated the wiki to include a topic list for this meeting: https://svn.open-mpi.org/trac/ompi/wiki/Feb09Meeting As noted, I expect the meeting will begin with a discussion over a more complete topic list, and a proposed agenda. I will attempt to update the wiki

[OMPI devel] ORTE Dec 08 Design Meeting Notes

2009-01-09 Thread Ralph Castain
FYI: I have completed the meeting notes from the Dec ORTE design meeting held in San Jose CA: https://svn.open-mpi.org/trac/ompi/wiki/Dec08MeetingNotes Please feel free to comment and/or correct them - I tried to capture everything, but freely admit there may be holes and mistakes. The not

Re: [OMPI devel] ORTE environments supported

2008-11-17 Thread Terry Dontje
Jeff Squyres wrote: I'm reviewing the README for v1.3; I somewhat doubt that the following is true -- can those who know send updates? - The run-time systems that are currently supported are: - rsh / ssh - LoadLeveler - PBS Pro, Open PBS, Torque - Platform LSF - SLURM - XGrid - Cr

[OMPI devel] ORTE environments supported

2008-11-15 Thread Jeff Squyres
I'm reviewing the README for v1.3; I somewhat doubt that the following is true -- can those who know send updates? - The run-time systems that are currently supported are: - rsh / ssh - LoadLeveler - PBS Pro, Open PBS, Torque - Platform LSF - SLURM - XGrid - Cray XT-3 and XT-4 -

[OMPI devel] ORTE Scaling results: updated

2008-04-08 Thread Ralph H Castain
Hello all, The wiki page has been updated with the latest test results from a new branch that implemented inbound collectives on the modex and barrier operations. As you will see from the graphs, ORTE/OMPI now exhibits a negative 2nd-derivative on the launch time curve for mpi_no_op (i.e., MPI_Init

Re: [OMPI devel] orte\mca\smr

2008-03-17 Thread Ralph H Castain
Hello, As Jeff stated, the smr has been removed from the system. We did this because experience showed that monitoring process/node status was highly system dependent and directly correlated with the launch system. Thus, it made no sense to separate those two functions. For example, we have succes

Re: [OMPI devel] orte\mca\smr

2008-03-10 Thread Leonardo Fialho
Hi Jeff, I need to implement a heartbeat/watchdog monitoring system. I'm looking for the "best place" to put it, and I don't want to duplicate code. I'll try putting it into PLM for now, and when I get Ralph's response I will change it if necessary. Jeff Squyres wrote: Yes, it all got co

Re: [OMPI devel] orte\mca\smr

2008-03-10 Thread Jeff Squyres
Yes, it all got consolidated down into plm. We need to update the FAQ; the ORTE frameworks changed quite a bit in the recent ORTE merge... Ralph's on vacation this week. A detailed answer to your question may not occur until he returns... On Mar 10, 2008, at 10:05 AM, Leonardo Fialho wro

[OMPI devel] orte\mca\smr

2008-03-10 Thread Leonardo Fialho
Hi all, Where is the "old" orte\mca\smr? I can't find it in orte/mca/plm... -- Leonardo Fialho Computer Architecture and Operating Systems Department - CAOS Universidad Autonoma de Barcelona - UAB ETSE, Edifcio Q, QC/3088 http://www.caos.uab.es Phone: +34-93-581-2888 Fax: +34-93-581-2478

Re: [OMPI devel] Orte cleanup

2008-03-07 Thread Aurélien Bouteiller
Looks like it works. Aurelien On 6 March 08 at 10:36, Ralph Castain wrote: I believe I have at least helped reduce this with r17761. I added the ability for procs to detect that their "lifeline" connection (either the HNP for unity routed, or their local daemon for tree) has been lost and

Re: [OMPI devel] Orte cleanup

2008-03-06 Thread Ralph Castain
I believe I have at least helped reduce this with r17761. I added the ability for procs to detect that their "lifeline" connection (either the HNP for unity routed, or their local daemon for tree) has been lost and gracefully abort. Let me know if that helps. Ralph On 3/4/08 9:37 PM, "Aurélien B

Re: [OMPI devel] orte can't launch process

2008-03-06 Thread Gleb Natapov
On Thu, Mar 06, 2008 at 07:49:13AM -0500, Tim Prins wrote: > Sorry about that. I removed a field in a structure, then 'svn up' seems > to have added it back, so we were using a field that should not even > exist in a couple places. > > Should be fixed in r17757 Works again. Thanks --
