On 6/23/08 2:50 PM, "Sacerdoti, Federico"
<federico.sacerd...@deshawresearch.com> wrote:

> Ralph,
> 
> I'm working on a test-case for that now, hopefully I can nail it down to
> a particular openmpi version.

Great - thanks! I have since seen something where forwarded output can have a
stray character on the end - I haven't tracked down the precise character yet,
but it is probably a NULL. If it's the same problem, you would only see it on
output from a remote proc, not on something output by mpirun itself.
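
In case it helps with the test case, one quick way to count NUL bytes in a
captured output file (just a sketch; output.log is a placeholder name):

  tr -dc '\000' < output.log | wc -c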

> 
> I have another small issue, which is somewhat bothering: orterun 1.2.6
> exits with return code zero if the executable cannot be found. Should
> this be non-zero?

Yes - it is fixed in 1.3.

We are also trying to expand our test coverage for the 1.3 release to catch
more of these non-MPI issues, so hopefully they won't slip by in future
releases.
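
In the meantime, a shell-level workaround (just a sketch - the path and binary
name are placeholders) is to verify the executable yourself before handing it
to orterun, since 1.2.x won't report the failure through its exit status:

  #!/bin/sh
  # Hypothetical pre-flight check for orterun 1.2.x, which returns 0 even
  # when the executable cannot be found.
  # (This only checks the node where mpirun runs, of course.)
  app=/path/to/myapp
  if [ ! -x "$app" ]; then
      echo "error: $app not found or not executable" >&2
      exit 1
  fi
  exec orterun "$app"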

> 
> E.g.
> $ orterun /asdf
> --------------------------------------------------------------------------
> Failed to find or execute the following executable:
> 
> Host:       drdblogin2.en.desres.deshaw.com
> Executable: /asdf
> 
> Cannot continue.
> --------------------------------------------------------------------------
> $ echo $?
> 0
> 
> Thanks
> Federico
> 
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Ralph H Castain
> Sent: Thursday, June 19, 2008 10:24 AM
> To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] null characters in output
> 
> No, I haven't seen that - if you can provide an example, we can take a
> look
> at it.
> 
> Thanks
> Ralph
> 
> 
> 
> On 6/19/08 8:15 AM, "Sacerdoti, Federico"
> <federico.sacerd...@deshawresearch.com> wrote:
> 
>> Ralph, another issue perhaps you can shed some light on.
>> 
>> When launching with orterun, we sometimes see null characters in the
>> stdout output. These do not show up on a terminal, but when piped to a
>> file they are visible in an editor. They also can show up in the middle
>> of a line, and so can interfere with greps on the output, etc.
>> 
>> Have you seen this before? I am working on a simple test case, but
>> unfortunately have not found one that is deterministic so far.
>> 
>> Thanks,
>> Federico 
>> 
>> -----Original Message-----
>> From: Ralph H Castain [mailto:r...@lanl.gov]
>> Sent: Tuesday, June 17, 2008 1:09 PM
>> To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org>
>> Subject: Re: [OMPI users] SLURM and OpenMPI
>> 
>> I can believe 1.2.x has problems in that regard. Some of that has nothing to
>> do with slurm and reflects internal issues with 1.2.
>> 
>> We have made it much more resistant to those problems in the upcoming 1.3
>> release, but there is no plan to retrofit those changes to 1.2. Part of the
>> problem was that we weren't using the --kill-on-bad-exit flag when we called
>> srun internally, which has been fixed for 1.3.
>> 
>> BTW: we actually do use srun to launch the daemons - we just call it
>> internally from inside orterun. The only real difference is that we use
>> orterun to set up the cmd line and then tell the daemons what they need to
>> do. The issues you are seeing relate to our ability to detect that srun has
>> failed, and/or that one or more daemons have failed to launch or do
>> something they were supposed to do. The 1.2 system has problems in that
>> regard, which was one motivation for the 1.3 overhaul.
>> 
>> I would argue that slurm allowing us to attempt to launch on a
>> no-longer-valid allocation is a slurm issue, not OMPI's. As I said, we use
>> srun to launch the daemons - the only reason we hang is that srun is not
>> returning with an error. I've seen this on other systems as well, but have
>> no real answer - if slurm doesn't indicate an error has occurred, I'm not
>> sure what I can do about it.
>> 
>> We are unlikely to use srun to directly launch jobs (i.e., to have slurm
>> directly launch the job from an srun cmd line without mpirun) anytime soon.
>> It isn't clear there is enough benefit to justify the rather large effort,
>> especially considering what would be required to maintain scalability.
>> Decisions on all that are still pending, though, which means any significant
>> change in that regard wouldn't be released until sometime next year.
>> 
>> Ralph
>> 
>> On 6/17/08 10:39 AM, "Sacerdoti, Federico"
>> <federico.sacerd...@deshawresearch.com> wrote:
>> 
>>> Ralph,
>>> 
>>> I was wondering what the status of this feature was (using srun to
>>> launch orted daemons)? I have two new bug reports to add from our
>>> experience using orterun from 1.2.6 on our 4000-CPU InfiniBand cluster.
>>> 
>>> 1. Orterun will happily hang if it is asked to run on an invalid slurm
>>> job, e.g. if the job has exceeded its time limit. This would be trivially
>>> fixed if you used srun to launch, as it would fail with a non-zero exit
>>> code.
>>> 
>>> 2. A very simple orterun invocation hangs instead of exiting with an
>>> error. In this case the executable does not exist, and we would expect
>>> orterun to exit non-zero. This has caused headaches with some workflow
>>> management scripts that automatically start jobs.
>>> 
>>> salloc -N2 -p swdev orterun dummy-binary-I-dont-exist
>>> [hang]
>>> 
>>> orterun dummy-binary-I-dont-exist
>>> [hang]
>>> 
>>> Thanks,
>>> Federico
>>> 
>>> -----Original Message-----
>>> From: Sacerdoti, Federico
>>> Sent: Friday, March 21, 2008 5:41 PM
>>> To: 'Open MPI Users'
>>> Subject: RE: [OMPI users] SLURM and OpenMPI
>>> 
>>> 
>>> Ralph wrote:
>>> "I don't know if I would say we "interfere" with SLURM - I would say that we
>>> are only lightly integrated with SLURM at this time. We use SLURM as a
>>> resource manager to assign nodes, and then map processes onto those nodes
>>> according to the user's wishes. We chose to do this because srun applies its
>>> own load balancing algorithms if you launch processes directly with it,
>>> which leaves the user with little flexibility to specify their desired
>>> rank/slot mapping. We chose to support the greater flexibility."
>>>  
>>> Ralph, we wrote a launcher for mvapich that uses srun to launch but
>>> keeps tight control of where processes are started. The way we did it
>>> was to force srun to launch a single process on a particular node.
>>> 
>>> The launcher calls many of these:
>>>  srun --jobid $JOBID -N 1 -n 1 -w host005 CMD ARGS
>>> 
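>>> A rough sketch of the shape of that launcher (the hostnames and variables
>>> are illustrative, not our actual script):
>>> 
>>> #!/bin/sh
>>> # Hypothetical per-host launch loop: one srun call per process, each
>>> # pinned to a specific node, all inside the existing allocation $JOBID.
>>> for host in host005 host006 host007 host008; do
>>>     srun --jobid $JOBID -N 1 -n 1 -w $host $CMD $ARGS &
>>> done
>>> wait
>>> 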
>>> Hope this helps (and we are looking forward to a tighter orterun/slurm
>>> integration as you know).
>>> 
>>> Regards,
>>> Federico
>>> 
>>> -----Original Message-----
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>> Behalf Of Ralph Castain
>>> Sent: Thursday, March 20, 2008 6:41 PM
>>> To: Open MPI Users <us...@open-mpi.org>
>>> Cc: Ralph Castain
>>> Subject: Re: [OMPI users] SLURM and OpenMPI
>>> 
>>> Hi there
>>> 
>>> I am no slurm expert. However, it is our understanding that
>>> SLURM_TASKS_PER_NODE means the number of slots allocated to the job, not the
>>> number of tasks to be executed on each node. So the 4(x2) tells us that we
>>> have 4 slots on each of two nodes to work with. You got 4 slots on each node
>>> because you used the -N option, which told slurm to assign all slots on that
>>> node to this job - I assume you have 4 processors on your nodes. OpenMPI
>>> parses that string to get the allocation, then maps the number of specified
>>> processes against it.
>>> 
>>> It is possible that the interpretation of SLURM_TASKS_PER_NODE is different
>>> when used to allocate as opposed to directly launching processes. Our
>>> typical usage is for someone to do:
>>> 
>>> srun -N 2 -A
>>> mpirun -np 2 helloworld
>>> 
>>> In other words, we use srun to create an allocation, and then run mpirun
>>> separately within it.
>>> 
>>> 
>>> I am therefore unsure what the "-n 2" will do here. If I believe the
>>> documentation, it would seem to imply that srun will attempt to launch two
>>> copies of "mpirun -np 2 helloworld", yet your output doesn't seem to support
>>> that interpretation. It would appear that the "-n 2" is being ignored and
>>> only one copy of mpirun is being launched. I'm no slurm expert, so perhaps
>>> that interpretation is incorrect.
>>> 
>>> Assuming that the -n 2 is ignored in this situation, your command line:
>>> 
>>>> srun -N 2 -n 2 -b mpirun -np 2 helloworld
>>> 
>>> will cause mpirun to launch two processes, mapped byslot against the slurm
>>> allocation of two nodes, each having 4 slots. Thus, both processes will be
>>> launched on the first node, which is what you observed.
>>> 
>>> Similarly, the command line
>>> 
>>>> srun -N 2 -n 2 -b mpirun helloworld
>>> 
>>> doesn't specify the #procs to mpirun. In that case, mpirun will launch a
>>> process on every available slot in the allocation. Given this command, that
>>> means 4 procs will be launched on each of the 2 nodes, for a total of 8
>>> procs. Ranks 0-3 will be placed on the first node, ranks 4-7 on the second.
>>> Again, this is what you observed.
>>> 
>>> I don't know if I would say we "interfere" with SLURM - I would say that we
>>> are only lightly integrated with SLURM at this time. We use SLURM as a
>>> resource manager to assign nodes, and then map processes onto those nodes
>>> according to the user's wishes. We chose to do this because srun applies its
>>> own load balancing algorithms if you launch processes directly with it,
>>> which leaves the user with little flexibility to specify their desired
>>> rank/slot mapping. We chose to support the greater flexibility.
>>> 
>>> Using the SLURM-defined mapping will require launching without our mpirun.
>>> This capability is still under development, and there are issues with doing
>>> that in slurm environments which need to be addressed. It is at a lower
>>> priority than providing such support for TM right now, so I wouldn't expect
>>> it to become available for several months at least.
>>> 
>>> Alternatively, it may be possible for mpirun to get the SLURM-defined
>>> mapping and use it to launch the processes. If we can get it somehow, there
>>> is no problem launching it as specified - the problem is how to get the map!
>>> Unfortunately, slurm's licensing prevents us from using its internal APIs,
>>> so obtaining the map is not an easy thing to do.
>>> 
>>> Anyone who wants to help accelerate that timetable is welcome to contact me.
>>> We know the technical issues - this is mostly a problem of (a) priorities
>>> versus my available time, and (b) similar considerations on the part of the
>>> slurm folks to do the work themselves.
>>> 
>>> Ralph
>>> 
>>> 
>>> On 3/20/08 3:48 PM, "Tim Prins" <tpr...@open-mpi.org> wrote:
>>> 
>>>> Hi Werner,
>>>> 
>>>> Open MPI does things a little bit differently than other MPIs when it
>>>> comes to supporting SLURM. See http://www.open-mpi.org/faq/?category=slurm
>>>> for general information about running with Open MPI on SLURM.
>>>> 
>>>> After trying the commands you sent, I am actually a bit surprised by the
>>>> results. I would have expected this mode of operation to work. But looking
>>>> at the environment variables that SLURM is setting for us, I can see why
>>>> it doesn't.
>>>> 
>>>> On a cluster with 4 cores/node, I ran:
>>>> [tprins@odin ~]$ cat mprun.sh
>>>> #!/bin/sh
>>>> printenv
>>>> [tprins@odin ~]$  srun -N 2 -n 2 -b mprun.sh
>>>> srun: jobid 55641 submitted
>>>> [tprins@odin ~]$ cat slurm-55641.out |grep SLURM_TASKS_PER_NODE
>>>> SLURM_TASKS_PER_NODE=4(x2)
>>>> [tprins@odin ~]$
>>>> 
>>>> Which seems to be wrong, since the srun man page says that
>>>> SLURM_TASKS_PER_NODE is the "Number of tasks to be initiated on each
>>>> node". This seems to imply that the value should be "1(x2)". So maybe
>>>> this is a SLURM problem? If this value were correctly reported, Open MPI
>>>> should work fine for what you wanted to do.
>>>> 
>>>> Two other things:
>>>> 1. You should probably use the command line option '--npernode' for
>>>> mpirun instead of setting the rmaps_base_n_pernode directly.
>>>> 2. In regards to your second example below, Open MPI by default maps 'by
>>>> slot'. That is, it will fill all available slots on the first node
>>>> before moving to the second. You can change this, see:
>>>> http://www.open-mpi.org/faq/?category=running#mpirun-scheduling
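>>>> 
>>>> For the '--npernode' suggestion in point 1, a quick illustration (the
>>>> binary name is just a placeholder): to put exactly one process on each
>>>> allocated node you could do something like:
>>>> 
>>>>   mpirun --npernode 1 ./helloworld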
>>>> 
>>>> I have copied Ralph on this mail to see if he has a better response.
>>>> 
>>>> Tim
>>>> 
>>>> Werner Augustin wrote:
>>>>> Hi,
>>>>> 
>>>>> At our site here at the University of Karlsruhe we are running two
>>>>> large clusters with SLURM and HP-MPI. For our new cluster we want to
>>>>> keep SLURM and switch to OpenMPI. While testing I got the following
>>>>> problem:
>>>>> 
>>>>> with HP-MPI I do something like
>>>>> 
>>>>> srun -N 2 -n 2 -b mpirun -srun helloworld
>>>>> 
>>>>> and get 
>>>>> 
>>>>> Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
>>>>> Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14.
>>>>> 
>>>>> when I try the same with OpenMPI (version 1.2.4)
>>>>> 
>>>>> srun -N 2 -n 2 -b mpirun helloworld
>>>>> 
>>>>> I get
>>>>> 
>>>>> Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13.
>>>>> Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13.
>>>>> Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14.
>>>>> Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13.
>>>>> Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14.
>>>>> Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13.
>>>>> Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14.
>>>>> Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14.
>>>>> 
>>>>> and with 
>>>>> 
>>>>> srun -N 2 -n 2 -b mpirun -np 2 helloworld
>>>>> 
>>>>> Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
>>>>> Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13.
>>>>> 
>>>>> which is still wrong, because it uses only one of the two allocated
>>>>> nodes.
>>>>> 
>>>>> OpenMPI uses the SLURM_NODELIST and SLURM_TASKS_PER_NODE environment
>>>>> variables, uses slurm to start one orted per node, and starts tasks up to
>>>>> the maximum number of slots on every node. So basically it also does
>>>>> some 'resource management' and interferes with slurm. OK, I can fix that
>>>>> with an mpirun wrapper script which calls mpirun with the right -np and
>>>>> the right rmaps_base_n_pernode setting, but it gets worse. We want to
>>>>> allocate computing power on a per-CPU basis instead of per node, i.e.
>>>>> different users might share a node. In addition slurm allows scheduling
>>>>> according to memory usage. Therefore it is important that on every node
>>>>> there is exactly the number of tasks running that slurm wants. The only
>>>>> solution I came up with is to generate for every job a detailed
>>>>> hostfile and call mpirun --hostfile. Any suggestions for improvement?
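>>>>> 
>>>>> For what it's worth, a minimal sketch of the wrapper I have in mind
>>>>> (untested, and the application name is just a placeholder):
>>>>> 
>>>>> #!/bin/sh
>>>>> # Hypothetical wrapper: build a hostfile matching slurm's intended task
>>>>> # layout, then hand it to mpirun. 'srun hostname' runs one task per slot
>>>>> # slurm assigned, so counting duplicate hostnames gives the slots per node.
>>>>> srun hostname | sort | uniq -c | \
>>>>>     awk '{ printf "%s slots=%d\n", $2, $1 }' > hostfile.$SLURM_JOBID
>>>>> mpirun --hostfile hostfile.$SLURM_JOBID ./myapp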
>>>>> 
>>>>> I've found a discussion thread "slurm and all-srun orterun" in the
>>>>> mailing list archive concerning the same problem, where Ralph Castain
>>>>> announced that he is working on two new launch methods which would fix
>>>>> my problems. Unfortunately his email address is deleted from the
>>>>> archive, so it would be really nice if the friendly elf mentioned there
>>>>> is still around and could forward my mail to him.
>>>>> 
>>>>> Thanks in advance,
>>>>> Werner Augustin