Yikes - it looks like a bug crept in at the last minute. I actually had it
working just fine - not sure what happened here. I'm traveling this week,
but I'll try to dig into this a bit and spot the issue.

Thanks!
Ralph


On Mon, Feb 2, 2015 at 3:50 AM, Mark Santcroos <mark.santcr...@rutgers.edu>
wrote:

> Hi Ralph,
>
> Great, the semantics look exactly as what I need!
>
> (To aid in debugging, I added "--debug-devel" to orte-dvm.c, which was
> useful for detecting and getting past some initial bumps.)
>
> The current status:
>
> * I can submit applications and see their output on the orte-dvm console
>
> * The following message is repeated endlessly on the orte-submit console:
>
> [warn] opal_libevent2022_event_base_loop: reentrant invocation.  Only one
> event_base_loop can run on each event_base at once.
>
> * orte-submit doesn't return, even though I see "[nid02819:20571] [[2120,0],0]
> dvm: job [2120,9] has completed" on the orte-dvm console.
>
> * On the orte-dvm console I see the following when submitting (so also for
> "successful" runs):
>
> [nid02434:00564] [[9021,0],0] Releasing job data for [INVALID]
> [nid03388:26474] [[9021,0],2] ORTE_ERROR_LOG: Not found in file
> ../../../../orte/mca/odls/base/odls_base_default_fns.c at line 433
> [nid03534:31545] procdir: /tmp/openmpi-sessions-62758@nid03534_0/9021/1/0
> [nid03534:31545] jobdir: /tmp/openmpi-sessions-62758@nid03534_0/9021/1
> [nid03534:31545] top: openmpi-sessions-62758@nid03534_0
> [nid03534:31545] tmp: /tmp
> [nid03534:31545] sess_dir_finalize: proc session dir does not exist
>
> * If I don't specify "-np" on the orte-submit command line, then I see the
> following on the orte-dvm console:
>
> [nid02434:00564] [[9021,0],0] Releasing job data for [INVALID]
> [nid03388:26474] [[9021,0],2] ORTE_ERROR_LOG: Not found in file
> ../../../../orte/mca/odls/base/odls_base_default_fns.c at line 433
> [nid03534:31544] [[9021,0],1] ORTE_ERROR_LOG: Not found in file
> ../../../../orte/mca/odls/base/odls_base_default_fns.c at line 433
>
> * It only seems to work for single-node runs (probably related to the
> previous point).
>
>
> Is this all expected behaviour given the current implementation?
>
>
> Thanks!
>
> Mark
>
>
>
> > On 02 Feb 2015, at 4:21 , Ralph Castain <r...@open-mpi.org> wrote:
> >
> > I have pushed the changes to the OMPI master. It took a little more effort
> than I had hoped due to the changes to the ORTE infrastructure, but
> hopefully this will meet your needs. It consists of two new tools:
> >
> > (a) orte-dvm - starts the virtual machine by launching a daemon on every
> node of the allocation, as constrained by -host and/or -hostfile. Check the
> options for outputting the URI as you’ll need that info for the other tool.
> The DVM remains “up” until you issue the orte-submit -terminate command, or
> send the orte-dvm process a SIGTERM.
> >
> > (b) orte-submit - takes the place of mpirun. Basically just packages
> your app and arguments and sends it to orte-dvm for execution. Requires the
> URI of orte-dvm. The tool exits once the job has completed execution,
> though you can run multiple jobs in parallel by backgrounding orte-submit
> or issuing commands from separate shells.
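> >
> > Here's a rough usage sketch of how the two fit together. I'm writing the
> option names from memory, so please double-check the man pages for the
> exact spellings (the app names below are just placeholders):
> >
> >    # start the DVM across the allocation, writing its contact URI to a file
> >    orte-dvm --report-uri dvm_uri &
> >
> >    # run jobs against the standing DVM; each orte-submit exits when its job
> >    # completes, so background it (or use another shell) to run jobs in parallel
> >    orte-submit --hnp file:dvm_uri -np 4 ./app_a
> >    orte-submit --hnp file:dvm_uri -np 8 ./app_b &
> >
> >    # tear the DVM down when you are done
> >    orte-submit --hnp file:dvm_uri -terminate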
> >
> > I’ve added man pages for both tools, though they may not be complete.
> Also, I don’t have all the mapping/ranking/binding options supported just
> yet as I first wanted to see if this meets your basic needs before worrying
> about the detail.
> >
> > Let me know what you think
> > Ralph
> >
> >
> >> On Jan 21, 2015, at 4:07 PM, Mark Santcroos <mark.santcr...@rutgers.edu>
> wrote:
> >>
> >> Hi Ralph,
> >>
> >> All makes sense! Thanks a lot!
> >>
> >> Looking forward to your modifications.
> >> Please don't hesitate to throw things with rough edges my way!
> >>
> >> Cheers,
> >>
> >> Mark
> >>
> >>> On 21 Jan 2015, at 23:21 , Ralph Castain <r...@open-mpi.org> wrote:
> >>>
> >>> Let me address your questions up here so you don’t have to scan thru
> the entire note.
> >>>
> >>> PMIx rationale: PMI has been around for a long time, primarily used
> inside the MPI library implementations to perform wireup. It provided a
> link from the MPI library to the local resource manager. However, as we
> move towards exascale, two things have become apparent:
> >>>
> >>> 1. the current PMI implementations don’t scale adequately to get
> there. The API creates too many communications and assumes everything is a
> blocking operation, thus preventing asynchronous progress.
> >>>
> >>> 2. there were increasing requests for application-level interactions
> with the resource manager. People want ways to spawn jobs (and not just from
> within MPI), request pre-location of data, control power, etc. Rather than
> having every RM write its own interface (and thus make everyone’s code
> non-portable), we at Intel decided to extend the existing PMI definitions
> to support those functions. Thus, an application developer can directly
> access PMIx functions to perform all those operations.
> >>>
> >>> PMIx v1.0 is about to be released - it’ll be backward compatible with
> PMI-1 and PMI-2, plus add non-blocking operations and significantly reduce
> the number of communications. PMIx 2.0 is slated for this summer and will
> include the advanced control capabilities.
> >>>
> >>> ORCM is being developed because we needed a BSD-licensed, fully
> featured resource manager. This will allow us to integrate the RM even more
> tightly with the file system, networking, and other subsystems, thus
> achieving higher launch performance and providing desired features such as
> QoS management. PMIx is a part of that plan, but as you say, they each play
> their separate roles in the overall stack.
> >>>
> >>>
> >>> Persistent ORTE: there is a learning curve on ORTE, I fear. We do have
> some videos on the web site that can help get you started, and I’ve given a
> number of “classes” at Intel now for that purpose. I still have it on my
> “to-do” list to summarize those classes and post them on the web site.
> >>>
> >>> For now, let me summarize how things work. At startup, mpirun reads
> the allocation (usually from the environment, but it depends on the host
> RM) and launches a daemon on each allocated node. Each daemon reads its
> local hardware environment and “phones home” to let mpirun know it is
> alive. Once all daemons have reported, mpirun maps the processes to the
> nodes and sends that map to all the daemons in a scalable broadcast pattern.
> >>>
> >>> Upon receipt of the launch message, each daemon parses it to identify
> which procs it needs to locally spawn. Once spawned, each proc connects
> back to its local daemon via a Unix domain socket for wireup support. As
> procs complete, the daemon maintains bookkeeping and reports back to mpirun
> once all procs are done. When all procs are reported complete (or one
> reports as abnormally terminated), mpirun sends a “die” message to every
> daemon so it will cleanly terminate.
> >>>
> >>> What I will do is simply tell mpirun to not do that last step, but
> instead to wait to receive a “terminate” cmd before ending the daemons.
> This will allow you to reuse the existing DVM, making each independent job
> start a great deal faster. You’ll need to either manually terminate the
> DVM, or the RM will do so when the allocation expires.
> >>>
> >>> HTH
> >>> Ralph
> >>>
> >>>
> >>>> On Jan 21, 2015, at 12:52 PM, Mark Santcroos <
> mark.santcr...@rutgers.edu> wrote:
> >>>>
> >>>> Hi Ralph,
> >>>>
> >>>>> On 21 Jan 2015, at 21:20 , Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>
> >>>>> Hi Mark
> >>>>>
> >>>>>> On Jan 21, 2015, at 11:21 AM, Mark Santcroos <
> mark.santcr...@rutgers.edu> wrote:
> >>>>>>
> >>>>>> Hi Ralph, all,
> >>>>>>
> >>>>>> To give some background, I'm part of the RADICAL-Pilot [1]
> development team.
> >>>>>> RADICAL-Pilot is a Pilot System, an implementation of the Pilot
> (job) concept, which in its most minimal form takes care of decoupling
> resource acquisition from workload management.
> >>>>>> So instead of launching your real_science.exe through PBS, you
> submit a Pilot, which allows you to perform application-level
> scheduling.
> >>>>>> The most obvious use-case is when you want to run many (relatively)
> small tasks; then you really don't want to go through the batch system
> every time. That is besides the fact that these machines are very bad at
> managing many tasks anyway.
> >>>>>
> >>>>> Yeah, we sympathize.
> >>>>
> >>>> That's always good :-)
> >>>>
> >>>>> Of course, one obvious solution is to get an allocation and execute
> a shell script that runs the tasks within that allocation - yes?
> >>>>
> >>>> Not really. Most of our use-cases have dynamic runtime properties,
> which means that at t=0 the exact workload is not known.
> >>>>
> >>>> In addition, I don't think such a script would allow me to work
> around the aprun bottleneck, as I'm not aware of a way to start MPI tasks
> that span multiple nodes from a Cray worker node.
> >>>>
> >>>>>> I looked a bit more closely at ORCM and it clearly overlaps with what
> I want to achieve.
> >>>>>
> >>>>> Agreed. In ORCM, we allow a user to request a “session” that results
> in allocation of resources. Each session is given an “orchestrator” - the
> ORCM “shepherd” daemon - responsible for executing the individual tasks
> across the assigned allocation, and a collection of “lamb” daemons (one on
> each node of the allocation) that forms a distributed VM. The orchestrator
> can execute the tasks very quickly since it doesn’t have to go back to the
> scheduler, and we allow it to do so according to any provided precedence
> requirement. Again, for simplicity, a shell script is the default mechanism
> for submitting the individual tasks.
> >>>>
> >>>> Yeah, similar solution to a similar problem.
> >>>> I noticed that exascale is also part of the motivation? How does this
> relate to the PMIx effort? Different part of the stack, I guess.
> >>>>
> >>>>>> One thing I noticed is that parts of it run as root - why is that?
> >>>>>
> >>>>> ORCM is a full resource manager, which means it has a scheduler
> (rudimentary today) and boot-time daemons that must run as root so they can
> fork/exec the session-level daemons (that run at the user level). The
> orchestrator and its daemons all run at the user-level.
> >>>>
> >>>> Ok. Our solution is user-space only, as one of our features is that
> we are able to run across different types of systems. Both approaches come
> with tradeoffs, obviously.
> >>>>
> >>>>>>> We used to have a cmd line option in ORTE for what you propose -
> it wouldn’t be too hard to restore. Is there some reason to do so?
> >>>>>>
> >>>>>> Can you point me to something that I could look for in the repo
> history? Then I can see if it serves my purpose.
> >>>>>
> >>>>> It would be back in the svn repo, I fear - would take awhile to hunt
> it down. Basically, it just (a) started all the daemons to create a VM, and
> (b) told mpirun to stick around as a persistent daemon. All subsequent
> calls to mpirun would reference back to the persistent one, thus using it
> to launch the jobs against the standing VM instead of starting a new one
> every time.
> >>>>
> >>>> *nod* That's what I tried to do this afternoon actually with the
> "--ompi-server", but that was not meant to be.
> >>>>
> >>>>> For ORCM, we just took that capability and expressed it as the
> “shepherd” plus “lamb” daemon architecture described above.
> >>>>
> >>>> ACK.
> >>>>
> >>>>> If you don’t want to replace the base RM, then using ORTE to
> establish a persistent VM is probably the way to go.
> >>>>
> >>>> Indeed, that's what it sounds like. Plus, ORTE is generic enough
> that I can re-use it on other types of systems too.
> >>>>
> >>>>> I can probably make it do that again fairly readily. We have a
> developer’s meeting next week, which usually means I have some free time
> (during evenings and topics I’m not involved with), so I can take a crack
> at this then if that would be timely enough.
> >>>>
> >>>> Happy to accept that offer. At this stage I'm not sure whether I would
> want a CLI or would be more interested in being able to do this
> programmatically, though.
> >>>> Also more than willing to assist in any way I can.
> >>>>
> >>>> I tried to see how it all worked, but because of the modular nature
> of OMPI that was quite daunting. There is some learning curve, I guess :-)
> >>>> So it seems that mpirun is persistent and opens up a listening port,
> then some orteds get launched that phone home.
> >>>> From there I got lost in the MCA maze. How do the tasks get onto the
> compute nodes and get started?
> >>>>
> >>>> Thanks a lot again, I appreciate your help.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Mark
> >>>
> >>
> >
>