Re: [OMPI users] null characters in output
Ralph, I'm working on a test-case for that now; hopefully I can nail it down to a particular openmpi version.

I have another small issue, which is somewhat bothersome: orterun 1.2.6 exits with return code zero if the executable cannot be found. Should this be non-zero? E.g.

$ orterun /asdf
--------------------------------------------------------------------------
Failed to find or execute the following executable:

Host:       drdblogin2.en.desres.deshaw.com
Executable: /asdf

Cannot continue.
--------------------------------------------------------------------------
$ echo $?
0

Thanks
Federico

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph H Castain
Sent: Thursday, June 19, 2008 10:24 AM
To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] null characters in output

No, I haven't seen that - if you can provide an example, we can take a look at it.

Thanks
Ralph

On 6/19/08 8:15 AM, "Sacerdoti, Federico" <federico.sacerd...@deshawresearch.com> wrote:

> Ralph, another issue perhaps you can shed some light on.
>
> When launching with orterun, we sometimes see null characters in the
> stdout output. These do not show up on a terminal, but when piped to a
> file they are visible in an editor. They also can show up in the middle
> of a line, and so can interfere with greps on the output, etc.
>
> Have you seen this before? I am working on a simple test case, but
> unfortunately have not found one that is deterministic so far.
>
> Thanks,
> Federico
>
> -----Original Message-----
> From: Ralph H Castain [mailto:r...@lanl.gov]
> Sent: Tuesday, June 17, 2008 1:09 PM
> To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] SLURM and OpenMPI
>
> I can believe 1.2.x has problems in that regard. Some of that has
> nothing to do with slurm and reflects internal issues with 1.2.
>
> We have made it much more resistant to those problems in the upcoming
> 1.3 release, but there is no plan to retrofit those changes to 1.2. Part
> of the problem was that we weren't using the --kill-on-bad-exit flag
> when we called srun internally, which has been fixed for 1.3.
>
> BTW: we actually do use srun to launch the daemons - we just call it
> internally from inside orterun. The only real difference is that we use
> orterun to setup the cmd line and then tell the daemons what they need
> to do. The issues you are seeing relate to our ability to detect that
> srun has failed, and/or that one or more daemons have failed to launch
> or do something they were supposed to do. The 1.2 system has problems in
> that regard, which was one motivation for the 1.3 overhaul.
>
> I would argue that slurm allowing us to attempt to launch on a
> no-longer-valid allocation is a slurm issue, not OMPI's. As I said, we
> use srun to launch the daemons - the only reason we hang is that srun is
> not returning with an error. I've seen this on other systems as well,
> but have no real answer - if slurm doesn't indicate an error has
> occurred, I'm not sure what I can do about it.
>
> We are unlikely to use srun to directly launch jobs (i.e., to have slurm
> directly launch the job from an srun cmd line without mpirun) anytime
> soon. It isn't clear there is enough benefit to justify the rather large
> effort, especially considering what would be required to maintain
> scalability. Decisions on all that are still pending, though, which
> means any significant change in that regard wouldn't be released until
> sometime next year.
>
> Ralph
>
> On 6/17/08 10:39 AM, "Sacerdoti, Federico"
> <federico.sacerd...@deshawresearch.com> wrote:
>
>> Ralph,
>>
>> I was wondering what the status of this feature was (using srun to
>> launch orted daemons)? I have two new bug reports to add from our
>> experience using orterun from 1.2.6 on our 4000 CPU infiniband cluster.
>>
>> 1. Orterun will happily hang if it is asked to run on an invalid slurm
>> job, e.g. if the job has exceeded its timelimit. This would be
>> trivially fixed if you used srun to launch, as they would fail with
>> non-zero exit codes.
>>
>> 2. A very simple orterun invocation hangs instead of exiting with an
>> error. In this case the executable does not exist, and we would expect
>> orterun to exit non-zero. This has caused headaches with some workflow
>> management scripts that automatically start jobs.
>>
>> salloc -N2 -p swdev orterun dummy-binary-I-dont-exist
>> [hang]
>>
>> orterun dummy-binary-I-dont-exist >&g
Re: [OMPI users] SLURM and OpenMPI
Ralph,

Thanks for your reply. Let me know if I can help in any way.

fds

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph H Castain
Sent: Thursday, June 19, 2008 10:24 AM
To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] SLURM and OpenMPI

Well, if the only system I cared about was slurm, there are some things I could possibly do to make things better, but at the expense of our support for other environments - which is unacceptable. There are a few technical barriers to doing this without the orteds on slurm, and a major licensing issue that prohibits us from calling any slurm APIs. How all that gets resolved is unclear.

Frankly, one reason we don't put more emphasis on it is that we don't see a significant launch time difference between the two modes, and we truly do want to retain the ability to utilize different error response strategies (which slurm will not allow - you can only follow theirs). So I would say we simply have different objectives than what you stated, and different concerns that make a deeper slurm integration less favorable. May still happen, but not anytime soon.

Ralph

On 6/19/08 8:08 AM, "Sacerdoti, Federico" <federico.sacerd...@deshawresearch.com> wrote:

> Ralph, thanks for your quick response.
>
> Regarding your fourth paragraph, slurm will not let you run on a
> no-longer-valid allocation; an srun will correctly exit non-zero with a
> useful failure reason. So perhaps openmpi 1.3 with your changes will
> just work; I look forward to testing it.
>
> E.g.
> $ srun hostname
> srun: error: Unable to confirm allocation for job 745346: Invalid job id
> specified
> srun: Check SLURM_JOBID environment variable for expired or invalid job.
>
> Regarding srun to launch the jobs directly (no orteds), I am sad to hear
> the idea is not in favor.
> We have found srun to be extremely scalable (tested up to 4096 MPI
> processes) and very good at cleaning up after an error or node failure.
> It seems you could simplify orterun quite a bit by relying on slurm (or
> whatever resource manager) to handle job cleanup after failures; it is
> their responsibility after all, and they have better knowledge about the
> health and availability of nodes than any launcher can hope for.
>
> I helped write an mvapich launcher used internally called mvrun, which
> was used for several years. I wrote a lot of logic to run down and stop
> all processes when one had failed, which I understand you have as well.
> We came to the conclusion that slurm was in a better position to handle
> such failures, and in fact did it more effectively. For example, if
> slurm detects a node has failed, it will stop the job, allocate an
> additional free node to make up the deficit, then relaunch. It is more
> difficult (to put it mildly) for a job launcher to do that.
>
> Thanks again,
> Federico
>
> -----Original Message-----
> From: Ralph H Castain [mailto:r...@lanl.gov]
> Sent: Tuesday, June 17, 2008 1:09 PM
> To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] SLURM and OpenMPI
>
> I can believe 1.2.x has problems in that regard. Some of that has
> nothing to do with slurm and reflects internal issues with 1.2.
>
> We have made it much more resistant to those problems in the upcoming
> 1.3 release, but there is no plan to retrofit those changes to 1.2. Part
> of the problem was that we weren't using the --kill-on-bad-exit flag
> when we called srun internally, which has been fixed for 1.3.
>
> BTW: we actually do use srun to launch the daemons - we just call it
> internally from inside orterun. The only real difference is that we use
> orterun to setup the cmd line and then tell the daemons what they need
> to do.
The issues you are seeing relate to our ability to detect that srun > has > failed, and/or that one or more daemons have failed to launch or do > something they were supposed to do. The 1.2 system has problems in that > regard, which was one motivation for the 1.3 overhaul. > > I would argue that slurm allowing us to attempt to launch on a > no-longer-valid allocation is a slurm issue, not OMPI's. As I said, we > use > srun to launch the daemons - the only reason we hang is that srun is not > returning with an error. I've seen this on other systems as well, but > have > no real answer - if slurm doesn't indicate an error has occurred, I'm > not > sure what I can do about it. > > We are unlikely to use srun to directly launch jobs (i.e., to have slurm > directly launch the job from an srun cmd line without mpirun) anytime > soon. > It isn't clear there is enough benefit to justify the rather large
Re: [OMPI users] SLURM and OpenMPI
Ralph wrote:

"I don't know if I would say we "interfere" with SLURM - I would say that we are only lightly integrated with SLURM at this time. We use SLURM as a resource manager to assign nodes, and then map processes onto those nodes according to the user's wishes. We chose to do this because srun applies its own load balancing algorithms if you launch processes directly with it, which leaves the user with little flexibility to specify their desired rank/slot mapping. We chose to support the greater flexibility."

Ralph, we wrote a launcher for mvapich that uses srun to launch but keeps tight control of where processes are started. The way we did it was to force srun to launch a single process on a particular node. The launcher calls many of these:

srun --jobid $JOBID -N 1 -n 1 -w host005 CMD ARGS

Hope this helps (and we are looking forward to a tighter orterun/slurm integration, as you know).

Regards,
Federico

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, March 20, 2008 6:41 PM
To: Open MPI Users
Cc: Ralph Castain
Subject: Re: [OMPI users] SLURM and OpenMPI

Hi there

I am no slurm expert. However, it is our understanding that SLURM_TASKS_PER_NODE means the number of slots allocated to the job, not the number of tasks to be executed on each node. So the 4(x2) tells us that we have 4 slots on each of two nodes to work with. You got 4 slots on each node because you used the -N option, which told slurm to assign all slots on that node to this job - I assume you have 4 processors on your nodes.

OpenMPI parses that string to get the allocation, then maps the number of specified processes against it. It is possible that the interpretation of SLURM_TASKS_PER_NODE is different when used to allocate as opposed to directly launch processes.
Our typical usage is for someone to do: srun -N 2 -A mpirun -np 2 helloworld In other words, we use srun to create an allocation, and then run mpirun separately within it. I am therefore unsure what the "-n 2" will do here. If I believe the documentation, it would seem to imply that srun will attempt to launch two copies of "mpirun -np 2 helloworld", yet your output doesn't seem to support that interpretation. It would appear that the "-n 2" is being ignored and only one copy of mpirun is being launched. I'm no slurm expert, so perhaps that interpretation is incorrect. Assuming that the -n 2 is ignored in this situation, your command line: > srun -N 2 -n 2 -b mpirun -np 2 helloworld will cause mpirun to launch two processes, mapped byslot against the slurm allocation of two nodes, each having 4 slots. Thus, both processes will be launched on the first node, which is what you observed. Similarly, the command line > srun -N 2 -n 2 -b mpirun helloworld doesn't specify the #procs to mpirun. In that case, mpirun will launch a process on every available slot in the allocation. Given this command, that means 4 procs will be launched on each of the 2 nodes, for a total of 8 procs. Ranks 0-3 will be placed on the first node, ranks 4-7 on the second. Again, this is what you observed. I don't know if I would say we "interfere" with SLURM - I would say that we are only lightly integrated with SLURM at this time. We use SLURM as a resource manager to assign nodes, and then map processes onto those nodes according to the user's wishes. We chose to do this because srun applies its own load balancing algorithms if you launch processes directly with it, which leaves the user with little flexibility to specify their desired rank/slot mapping. We chose to support the greater flexibility. Using the SLURM-defined mapping will require launching without our mpirun. 
This capability is still under development, and there are issues with doing that in slurm environments which need to be addressed. It is at a lower priority than providing such support for TM right now, so I wouldn't expect it to become available for several months at least. Alternatively, it may be possible for mpirun to get the SLURM-defined mapping and use it to launch the processes. If we can get it somehow, there is no problem launching it as specified - the problem is how to get the map! Unfortunately, slurm's licensing prevents us from using its internal APIs, so obtaining the map is not an easy thing to do. Anyone who wants to help accelerate that timetable is welcome to contact me. We know the technical issues - this is mostly a problem of (a) priorities versus my available time, and (b) similar considerations on the part of the slurm folks to do the work themselves. Ralph On 3/20/08 3:48 PM, "Tim Prins" wrote: > Hi Werner, > > Open MPI does things a little bit differently than other MPIs when it > comes to supporting SLURM. See > http://www.open-mpi.org/faq/?category=slurm > for general information about running with Open MPI on SLURM. > > After trying the commands you
[OMPI users] FW: slurm and all-srun orterun
Ralph, here is Moe's response. The srun options he mentions look promising: they can signal an otherwise happy orted daemon (sitting on a waitpid) that something is amiss elsewhere in the job. Do orteds change their session ID?

Thanks Moe,
Federico

-----Original Message-----
From: jet...@llnl.gov [mailto:jet...@llnl.gov]
Sent: Wednesday, March 05, 2008 2:21 PM
To: Sacerdoti, Federico; Open MPI Users
Subject: RE: [OMPI users] slurm and all-srun orterun

Slurm and its APIs are available under the GPL license. Since Open MPI is not available under the GPL license it cannot link with the Slurm APIs; however, virtually all of that API functionality is available through existing Slurm commands. The commands are clearly not as simple to use as the APIs, but if you encounter any problems using the commands we can certainly make changes to facilitate their use. For example, Slurm communicates with the Maui and Moab schedulers using an interface that loosely resembles XML. We are also prepared to provide additional functionality as needed by OpenMPI.

Regarding premature termination of processes that Slurm spawns, the srun command has a couple of options that may prove useful:

-K, --kill-on-bad-exit
       Terminate a job if any task exits with a non-zero exit code.

-W, --wait=seconds
       Specify how long to wait after the first task terminates before
       terminating all remaining tasks. A value of 0 indicates an
       unlimited wait (a warning will be issued after 60 seconds). The
       default value is set by the WaitTime parameter in the slurm
       configuration file (see slurm.conf(5)). This option can be useful
       to insure that a job is terminated in a timely fashion in the
       event that one or more tasks terminate prematurely.

Any tasks launched outside of Slurm's control (e.g. rsh) are not purged on job termination. Slurm locates spawned tasks and any of their children using the configured ProcTrack plugin, of which several are available.
If you use the SID (session ID) plugin and spawned tasks change their SID, Slurm will no longer track them. Several reliable process tracking mechanisms are available, but some do require kernel changes. See "man slurm.conf" for more information. Moe At 11:16 AM -0500 3/5/08, Sacerdoti, Federico wrote: >Thanks Ralph, > >First, we would be happy to test the slurm direct launch capability. >Regarding the failure case, I realize that the IB errors do not directly >affect the orted daemons. This is what we observed: > >1. Parallel job started >2. IB errors caused some processes to fail (but not all) >3. slurm tears down entire job, attempting to kill all orted and their >children > >We want this behavior: if any process of a parallel job dies, all >processes should be stopped. The orted daemons in charge of processes >that did not fail are the problem, as slurm was not able to kill them. >Sounds like this is a known issue in openmpi 1.2.x. > >In any case, the new direct launching methods sound promising. I am >surprised there are licensing issues with Slurm, is this a GPL-and-BSD >issue? I am CC'ing slurm author Moe; he may be able to help. > >Thanks again and I look forward to testing the direct launch, >Federico > > >-Original Message- >From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On >Behalf Of Ralph Castain >Sent: Monday, March 03, 2008 8:19 PM >To: Open MPI Users <us...@open-mpi.org> >Cc: Ralph Castain >Subject: Re: [OMPI users] slurm and all-srun orterun > >Hello > >I don't monitor the user list any more, but a friendly elf sent this >along >to me. > >I'm not entirely sure what problem might be causing the behavior you are >seeing. Neither mpirun nor any orted should be impacted by IB problems >as >they aren't MPI processes and thus never interact with IB. Only >application >procs touch the IB subsystem - if an application proc fails, the orted >should see that and correctly order the shutdown of the job. 
So if you >are >having IB problems, that wouldn't explain daemons failing. > >If a daemon is aborting, that will cause problems in 1.2.x. We have >noted >that SLURM (even though the daemons are launched via srun) doesn't >always >tell us when this happens, leaving Open MPI vulnerable to "hangs" as it >attempts to cleanup and finds it can't do it. I'm not sure why you would >see >a daemon die, though - the fact that an application process failed >shouldn't >cause that to happen. Likewise, it would seem strange that the >application >process would fail and the daemon not notice - this has nothing to do >with >slurm, but is just a standard Linux "waitpid" method. > >The most likely reason for the behavior you describe is that an >application >process encounters an IB problem whi
Re: [OMPI users] slurm and all-srun orterun
Thanks Ralph, First, we would be happy to test the slurm direct launch capability. Regarding the failure case, I realize that the IB errors do not directly affect the orted daemons. This is what we observed: 1. Parallel job started 2. IB errors caused some processes to fail (but not all) 3. slurm tears down entire job, attempting to kill all orted and their children We want this behavior: if any process of a parallel job dies, all processes should be stopped. The orted daemons in charge of processes that did not fail are the problem, as slurm was not able to kill them. Sounds like this is a known issue in openmpi 1.2.x. In any case, the new direct launching methods sound promising. I am surprised there are licensing issues with Slurm, is this a GPL-and-BSD issue? I am CC'ing slurm author Moe; he may be able to help. Thanks again and I look forward to testing the direct launch, Federico -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Monday, March 03, 2008 8:19 PM To: Open MPI Users <us...@open-mpi.org> Cc: Ralph Castain Subject: Re: [OMPI users] slurm and all-srun orterun Hello I don't monitor the user list any more, but a friendly elf sent this along to me. I'm not entirely sure what problem might be causing the behavior you are seeing. Neither mpirun nor any orted should be impacted by IB problems as they aren't MPI processes and thus never interact with IB. Only application procs touch the IB subsystem - if an application proc fails, the orted should see that and correctly order the shutdown of the job. So if you are having IB problems, that wouldn't explain daemons failing. If a daemon is aborting, that will cause problems in 1.2.x. We have noted that SLURM (even though the daemons are launched via srun) doesn't always tell us when this happens, leaving Open MPI vulnerable to "hangs" as it attempts to cleanup and finds it can't do it. 
I'm not sure why you would see a daemon die, though - the fact that an application process failed shouldn't cause that to happen. Likewise, it would seem strange that the application process would fail and the daemon not notice - this has nothing to do with slurm, but is just a standard Linux "waitpid" method.

The most likely reason for the behavior you describe is that an application process encounters an IB problem which blocks communication - but the process doesn't actually abort or terminate, it just hangs there. In this case, the orted doesn't see the process exit, so the system doesn't know it should take any action.

That said, we know that 1.2.x has problems with clean shutdown in abnormal situations. Release 1.3 (when it comes out) addresses these issues and appears (from our testing, at least) to be much more reliable about cleanup. You should see a definite improvement in the detection of process failures and subsequent cleanup.

As for your question, I am working as we speak on two new launch modes for Open MPI:

1. "direct" - this uses mpirun to directly launch the application processes without use of the intermediate daemons.

2. "standalone" - this uses the native launch command to simply launch the application processes, without use of mpirun or the intermediate daemons.

The initial target environments for these capabilities are TM and SLURM. The latter poses a bit of a challenge as we cannot use their API due to licensing issues, so it will come a little later. We have a design for getting around the problem - the ordering is more driven by priorities than anything technical.

The direct launch capability -may- be included in 1.3 assuming it can be completed in time for the release. If not, it will almost certainly be in 1.3.1. I'm expecting to complete the TM version in the next few days, and perhaps get the SLURM version working sometime this month - but they will need validation before being included in an official release.
I can keep you posted if you like - once this gets into our repository, you are certainly welcome to try it out. I would welcome feedback on it. Hope that helps Ralph >> From: "Sacerdoti, Federico" <federico.sacerd...@deshaw.com> >> Date: March 3, 2008 12:44:39 PM EST >> To: "Open MPI Users" <us...@open-mpi.org> >> Subject: [OMPI users] slurm and all-srun orterun >> Reply-To: Open MPI Users <us...@open-mpi.org> >> >> Hi, >> >> We are migrating to openmpi on our large (~1000 node) cluster, and >> plan >> to use it exclusively on a multi-thousand core infiniband cluster in >> the >> near future. We had extensive problems with parallel processes not >> dying >> after a job crash, which was largely solved by switching to the slurm >> resource manager. >> >> While orterun supports slurm, it only uses the srun facility to launch >> the
[OMPI users] slurm and all-srun orterun
Hi,

We are migrating to openmpi on our large (~1000 node) cluster, and plan to use it exclusively on a multi-thousand core infiniband cluster in the near future. We had extensive problems with parallel processes not dying after a job crash, which was largely solved by switching to the slurm resource manager.

While orterun supports slurm, it only uses the srun facility to launch the "orted" daemons, which then start the actual user processes themselves. In our recent migration to openmpi, we have noticed occasions where orted did not correctly clean up after a parallel job crash. In most cases the crash was due to an infiniband error. Most worryingly, slurm was not able to clean up the orted, and it along with the user processes were left running.

At SC07 I was told that there is some talk of using srun to launch both orted and user processes, or alternatively to use srun only. Either would solve the cleanup problem, in our experience. Is Ralph Castain on this list?

Thanks,
Federico

P.S. We use the proctrack/linuxproc slurm process tracking plugin. As noted in the config man page, this may fail to find certain processes, which may explain why slurm could not clean up orted effectively.

man slurm.conf(5), version 1.2.22:

NOTE: "proctrack/linuxproc" and "proctrack/pgid" can fail to identify all processes associated with a job since processes can become a child of the init process (when the parent process terminates) or change their process group. To reliably track all processes, one of the other mechanisms utilizing kernel modifications is preferable.
Re: [OMPI users] openmpi credits for eager messages
To keep this out of the weeds, I have attached a program called "bug3" that illustrates this problem on openmpi 1.2.5 using the openib BTL. In bug3 the process with rank 0 uses all available memory buffering "unexpected" messages from its neighbors.

Bug3 is a test-case derived from a real, scalable application (desmond for molecular dynamics) that several experienced MPI developers have worked on. Note the MPI_Send calls of processes N>0 are *blocking*; openmpi silently sends them in the background and overwhelms process 0 due to lack of flow control.

It may not be hard to change desmond to work around openmpi's small message semantics, but a programmer should reasonably be allowed to think a blocking send will block if the receiver cannot handle it yet.

Federico

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Brightwell, Ronald
Sent: Monday, February 04, 2008 3:30 PM
To: Patrick Geoffray
Cc: Open MPI Users
Subject: Re: [OMPI users] openmpi credits for eager messages

> > I'm looking at a network where the number of endpoints is large enough
> > that everybody can't have a credit to start with, and the "offender"
> > isn't any single process, but rather a combination of processes doing
> > N-to-1 where N is sufficiently large. I can't just tell one process to
> > slow down. I have to tell them all to slow down and do it quickly...
>
> When you have N->1 patterns, then the hardware flow-control will
> throttle the senders, or drop packets if there is no hardware
> flow-control. If you don't have HOL blocking but the receiver does not
> consume for any reason (busy, sleeping, dead, whatever), then you can
> still drop packets on the receiver (NIC, driver, thread) as a last
> resort; this is what TCP does. The key is to have exponential backoff
> (or a reasonably large resend timeout) to not continue the hammering.
> It costs nothing in the common case (unlike the credits approach), but
> it does handle corner cases without affecting other nodes too much
> (unlike hardware flow-control).

Right. For a sufficiently large number of endpoints, flow control has to get pushed out of MPI and down into the network, which is why I don't necessarily want an MPI that does flow control at the user-level.

> But you know all that. You are just being mean to your users because you
> can :-) The sick part is that I think I envy you...

You know it :)

-Ron

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Attachment: bug3.c
[OMPI users] openmpi credits for eager messages
Hi,

I am readying an openmpi 1.2.5 software stack for use with a many-thousand core cluster. I have a question about sending small messages that I hope can be answered on this list.

I was under the impression that if node A wants to send a small MPI message to node B, it must have a credit to do so. The credit assures A that B has enough buffer space to accept the message. Credits are required by the mpi layer regardless of the BTL transport layer used.

I have been told by a Voltaire tech that this is not so: the credits are used by the infiniband transport layer to reliably send a message, and are not an openmpi feature.

Thanks,
Federico