Ok, let me check on some other systems too though, it might be Cray specific.
> On 02 Feb 2015, at 19:07 , Ralph Castain <r...@open-mpi.org> wrote:
>
> Yikes - looks like a bug crept into there at the last minute. I actually had
> it working just fine - not sure what happened here. I'm on travel this week,
> but I'll try to dig into this a bit and spot the issue.
>
> Thanks!
> Ralph
>
>
> On Mon, Feb 2, 2015 at 3:50 AM, Mark Santcroos <mark.santcr...@rutgers.edu>
> wrote:
> Hi Ralph,
>
> Great, the semantics look exactly like what I need!
>
> (To aid in debugging I added "--debug-devel" to orte-dvm.c, which was useful
> to detect and get past some initial bumps)
>
> The current status:
>
> * I can submit applications and see their output on the orte-dvm console
>
> * The following message is reported infinitely on the orte-submit console:
>
> [warn] opal_libevent2022_event_base_loop: reentrant invocation. Only one
> event_base_loop can run on each event_base at once.
>
> * orte-submit doesn't return, while I see "[nid02819:20571] [[2120,0],0] dvm:
> job [2120,9] has completed" on the orte-dvm console.
>
> * On the orte-dvm console I see the following when submitting (so also for
> "successful" runs):
>
> [nid02434:00564] [[9021,0],0] Releasing job data for [INVALID]
> [nid03388:26474] [[9021,0],2] ORTE_ERROR_LOG: Not found in file
> ../../../../orte/mca/odls/base/odls_base_default_fns.c at line 433
> [nid03534:31545] procdir: /tmp/openmpi-sessions-62758@nid03534_0/9021/1/0
> [nid03534:31545] jobdir: /tmp/openmpi-sessions-62758@nid03534_0/9021/1
> [nid03534:31545] top: openmpi-sessions-62758@nid03534_0
> [nid03534:31545] tmp: /tmp
> [nid03534:31545] sess_dir_finalize: proc session dir does not exist
>
> * If I don't specify any "-np" on the orte-submit, then I see on the orte-dvm
> console:
>
> [nid02434:00564] [[9021,0],0] Releasing job data for [INVALID]
> [nid03388:26474] [[9021,0],2] ORTE_ERROR_LOG: Not found in file
> ../../../../orte/mca/odls/base/odls_base_default_fns.c at line 433
> [nid03534:31544] [[9021,0],1] ORTE_ERROR_LOG: Not found in file
> ../../../../orte/mca/odls/base/odls_base_default_fns.c at line 433
>
> * It only seems to work for single nodes (probably related to the previous
> point).
>
>
> Is this all expected behaviour given the current implementation?
>
>
> Thanks!
>
> Mark
>
>
>
> > On 02 Feb 2015, at 4:21 , Ralph Castain <r...@open-mpi.org> wrote:
> >
> > I have pushed the changes to the OMPI master. It took a little bit more
> > than I had hoped due to the changes to the ORTE infrastructure, but
> > hopefully this will meet your needs. It consists of two new tools:
> >
> > (a) orte-dvm - starts the virtual machine by launching a daemon on every
> > node of the allocation, as constrained by -host and/or -hostfile. Check the
> > options for outputting the URI as you’ll need that info for the other tool.
> > The DVM remains “up” until you issue the orte-submit -terminate command, or
> > hit the orte-dvm process with a SIGTERM.
> >
> > (b) orte-submit - takes the place of mpirun.
> > Basically just packages your app and arguments and sends it to orte-dvm
> > for execution. Requires the URI
> > of orte-dvm. The tool exits once the job has completed execution, though
> > you can run multiple jobs in parallel by backgrounding orte-submit or
> > issuing commands from separate shells.
> >
> > I’ve added man pages for both tools, though they may not be complete. Also,
> > I don’t have all the mapping/ranking/binding options supported just yet as
> > I first wanted to see if this meets your basic needs before worrying about
> > the detail.
> >
> > Let me know what you think
> > Ralph
> >
> >
> >> On Jan 21, 2015, at 4:07 PM, Mark Santcroos <mark.santcr...@rutgers.edu>
> >> wrote:
> >>
> >> Hi Ralph,
> >>
> >> All makes sense! Thanks a lot!
> >>
> >> Looking forward to your modifications.
> >> Please don't hesitate to throw things with rough edges at me!
> >>
> >> Cheers,
> >>
> >> Mark
> >>
> >>> On 21 Jan 2015, at 23:21 , Ralph Castain <r...@open-mpi.org> wrote:
> >>>
> >>> Let me address your questions up here so you don’t have to scan thru the
> >>> entire note.
> >>>
> >>> PMIx rationale: PMI has been around for a long time, primarily used
> >>> inside the MPI library implementations to perform wireup. It provided a
> >>> link from the MPI library to the local resource manager. However, as we
> >>> move towards exascale, two things became apparent:
> >>>
> >>> 1. the current PMI implementations don’t scale adequately to get there.
> >>> The API created too many communications and assumed everything was a
> >>> blocking operation, thus preventing asynchronous progress
> >>>
> >>> 2. there were increasing requests for application-level interactions to
> >>> the resource manager. People want ways to spawn jobs (and not just from
> >>> within MPI), request pre-location of data, control power, etc.
> >>> Rather than having every RM write its own interface (and thus make
> >>> everyone’s code non-portable), we at Intel decided to extend the existing
> >>> PMI definitions to support those functions. Thus, an application developer
> >>> can directly access PMIx functions to perform all those operations.
> >>>
> >>> PMIx v1.0 is about to be released - it’ll be backward compatible with
> >>> PMI-1 and PMI-2, plus add non-blocking operations and significantly
> >>> reduce the number of communications. PMIx 2.0 is slated for this summer
> >>> and will include the advanced controls capabilities.
> >>>
> >>> ORCM is being developed because we needed a BSD-licensed, fully featured
> >>> resource manager. This will allow us to integrate the RM even more
> >>> tightly to the file system, networking, and other subsystems, thus
> >>> achieving higher launch performance and providing desired features such
> >>> as QoS management. PMIx is a part of that plan, but as you say, they each
> >>> play their separate roles in the overall stack.
> >>>
> >>>
> >>> Persistent ORTE: there is a learning curve on ORTE, I fear. We do have
> >>> some videos on the web site that can help get you started, and I’ve given
> >>> a number of “classes” at Intel now for that purpose. I still have it on
> >>> my “to-do” list to summarize those classes and post them on the web
> >>> site.
> >>>
> >>> For now, let me summarize how things work. At startup, mpirun reads the
> >>> allocation (usually from the environment, but it depends on the host RM)
> >>> and launches a daemon on each allocated node. Each daemon reads its local
> >>> hardware environment and “phones home” to let mpirun know it is alive.
> >>> Once all daemons have reported, mpirun maps the processes to the nodes
> >>> and sends that map to all the daemons in a scalable broadcast pattern.
> >>>
> >>> Upon receipt of the launch message, each daemon parses it to identify
> >>> which procs it needs to locally spawn.
> >>> Once spawned, each proc connects
> >>> back to its local daemon via a Unix domain socket for wireup support. As
> >>> procs complete, the daemon maintains bookkeeping and reports back to
> >>> mpirun once all procs are done. When all procs are reported complete (or
> >>> one reports as abnormally terminated), mpirun sends a “die” message to
> >>> every daemon so it will cleanly terminate.
> >>>
> >>> What I will do is simply tell mpirun to not do that last step, but
> >>> instead to wait to receive a “terminate” cmd before ending the daemons.
> >>> This will allow you to reuse the existing DVM, making each independent
> >>> job start a great deal faster. You’ll need to either manually terminate
> >>> the DVM, or the RM will do so when the allocation expires.
> >>>
> >>> HTH
> >>> Ralph
> >>>
> >>>
> >>>> On Jan 21, 2015, at 12:52 PM, Mark Santcroos
> >>>> <mark.santcr...@rutgers.edu> wrote:
> >>>>
> >>>> Hi Ralph,
> >>>>
> >>>>> On 21 Jan 2015, at 21:20 , Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>
> >>>>> Hi Mark
> >>>>>
> >>>>>> On Jan 21, 2015, at 11:21 AM, Mark Santcroos
> >>>>>> <mark.santcr...@rutgers.edu> wrote:
> >>>>>>
> >>>>>> Hi Ralph, all,
> >>>>>>
> >>>>>> To give some background, I'm part of the RADICAL-Pilot [1] development
> >>>>>> team.
> >>>>>> RADICAL-Pilot is a Pilot System, an implementation of the Pilot (job)
> >>>>>> concept, which in its most minimal form takes care of the
> >>>>>> decoupling of resource acquisition and workload management.
> >>>>>> So instead of launching your real_science.exe through PBS, you submit
> >>>>>> a Pilot, which will allow you to perform application level scheduling.
> >>>>>> The most obvious use case is when you want to run many (relatively)
> >>>>>> small tasks; then you really don't want to go through the batch system
> >>>>>> every time. That is besides the fact that these machines are very bad
> >>>>>> at managing many tasks anyway.
> >>>>>
> >>>>> Yeah, we sympathize.
> >>>>
> >>>> That's always good :-)
> >>>>
> >>>>> Of course, one obvious solution is to get an allocation and execute a
> >>>>> shell script that runs the tasks within that allocation - yes?
> >>>>
> >>>> Not really. Most of our use-cases have dynamic runtime properties, which
> >>>> means that at t=0 the exact workload is not known.
> >>>>
> >>>> In addition, I don't think such a script would allow me to work around
> >>>> the aprun bottleneck, as I'm not aware of a way to start MPI tasks that
> >>>> span multiple nodes from a Cray worker node.
> >>>>
> >>>>>> I looked a bit more closely at ORCM and it clearly overlaps with what
> >>>>>> I want to achieve.
> >>>>>
> >>>>> Agreed. In ORCM, we allow a user to request a “session” that results in
> >>>>> allocation of resources. Each session is given an “orchestrator” - the
> >>>>> ORCM “shepherd” daemon - responsible for executing the individual tasks
> >>>>> across the assigned allocation, and a collection of “lamb” daemons (one
> >>>>> on each node of the allocation) that forms a distributed VM. The
> >>>>> orchestrator can execute the tasks very quickly since it doesn’t have
> >>>>> to go back to the scheduler, and we allow it to do so according to any
> >>>>> provided precedence requirement. Again, for simplicity, a shell script
> >>>>> is the default mechanism for submitting the individual tasks.
> >>>>
> >>>> Yeah, similar solution to a similar problem.
> >>>> I noticed that Exascale is also part of the motivation? How does this
> >>>> relate to the PMIx effort? Different part of the stack I guess.
> >>>>
> >>>>>> One thing I noticed is that parts of it run as root, why is that?
> >>>>>
> >>>>> ORCM is a full resource manager, which means it has a scheduler
> >>>>> (rudimentary today) and boot-time daemons that must run as root so they
> >>>>> can fork/exec the session-level daemons (that run at the user level).
> >>>>> The orchestrator and its daemons all run at the user-level.
> >>>>
> >>>> Ok.
> >>>> Our solution is user-space only, as one of our features is that we
> >>>> are able to run across different types of systems. Both approaches come
> >>>> with a tradeoff obviously.
> >>>>
> >>>>>>> We used to have a cmd line option in ORTE for what you propose - it
> >>>>>>> wouldn’t be too hard to restore. Is there some reason to do so?
> >>>>>>
> >>>>>> Can you point me to something that I could look for in the repo
> >>>>>> history, then I can see if it serves my purpose.
> >>>>>
> >>>>> It would be back in the svn repo, I fear - would take a while to hunt it
> >>>>> down. Basically, it just (a) started all the daemons to create a VM,
> >>>>> and (b) told mpirun to stick around as a persistent daemon. All
> >>>>> subsequent calls to mpirun would reference back to the persistent one,
> >>>>> thus using it to launch the jobs against the standing VM instead of
> >>>>> starting a new one every time.
> >>>>
> >>>> *nod* That's what I tried to do this afternoon actually with the
> >>>> "--ompi-server", but that was not meant to be.
> >>>>
> >>>>> For ORCM, we just took that capability and expressed it as the
> >>>>> “shepherd” plus “lamb” daemon architecture described above.
> >>>>
> >>>> ACK.
> >>>>
> >>>>> If you don’t want to replace the base RM, then using ORTE to establish
> >>>>> a persistent VM is probably the way to go.
> >>>>
> >>>> Indeed, that's what it sounds like. Plus, ORTE is generic enough that
> >>>> I can re-use it on other types of systems too.
> >>>>
> >>>>> I can probably make it do that again fairly readily. We have a
> >>>>> developer’s meeting next week, which usually means I have some free
> >>>>> time (during evenings and topics I’m not involved with), so I can take
> >>>>> a crack at this then if that would be timely enough.
> >>>>
> >>>> Happy to accept that offer. At this stage I'm not sure if I would want a
> >>>> CLI or would be more interested to be able to do this programmatically
> >>>> though.
> >>>> Also more than willing to assist in any way I can.
> >>>>
> >>>> I tried to see how it all worked, but because of the modular nature of
> >>>> ompi that was quite daunting. There is some learning curve I guess :-)
> >>>> So it seems that mpirun is persistent, and opens up a listening port,
> >>>> then some orted's get launched that phone home.
> >>>> From there I got lost in the MCA maze. How do the tasks get onto the
> >>>> compute nodes and started?
> >>>>
> >>>> Thanks a lot again, I appreciate your help.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Mark
> >>>> _______________________________________________
> >>>> users mailing list
> >>>> us...@open-mpi.org
> >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>> Link to this post:
> >>>> http://www.open-mpi.org/community/lists/users/2015/01/26227.php
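
As an aside, the launch sequence Ralph describes in the quoted message (one daemon per node, a process map broadcast to all daemons, and daemons that stay up between jobs until an explicit terminate) can be illustrated with a toy model. This is only a sketch in plain Python under stated assumptions: `Daemon`, `ToyDVM`, `submit`, `terminate`, the node names, and the round-robin mapping are all hypothetical choices made for illustration. None of it is ORTE code, and it does not reflect the actual orte-dvm or orte-submit interfaces.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Daemon:
    """Stand-in for one per-node daemon (hypothetical, not the real orted)."""
    node: str
    alive: bool = True

    def launch(self, proc_map: dict[int, str]) -> list[int]:
        # Parse the broadcast map and keep only the ranks assigned to this node.
        # A real daemon would fork/exec them; the toy just reports them as done.
        return [rank for rank, node in proc_map.items() if node == self.node]


@dataclass
class ToyDVM:
    """Toy persistent VM: daemons outlive individual jobs."""
    nodes: list[str]
    daemons: list[Daemon] = field(default_factory=list)
    next_job_id: int = 1

    def start(self) -> None:
        # One daemon per allocated node; registering it models "phoning home".
        self.daemons = [Daemon(node) for node in self.nodes]

    def submit(self, np: int) -> tuple[int, dict[str, list[int]]]:
        if not self.daemons:
            raise RuntimeError("DVM is not running")
        job_id = self.next_job_id
        self.next_job_id += 1
        # Map ranks to nodes round-robin (an arbitrary policy for this sketch).
        proc_map = {rank: self.nodes[rank % len(self.nodes)] for rank in range(np)}
        # "Broadcast" the map; each daemon reports the ranks it ran locally.
        completed = {d.node: d.launch(proc_map) for d in self.daemons}
        # Unlike one-shot mpirun, no "die" message is sent when the job ends.
        return job_id, completed

    def terminate(self) -> None:
        # Only an explicit terminate brings the daemons down.
        for d in self.daemons:
            d.alive = False
        self.daemons = []


dvm = ToyDVM(["nid0", "nid1"])
dvm.start()
first = dvm.submit(np=3)   # daemons are started once...
second = dvm.submit(np=2)  # ...and reused for the next job
dvm.terminate()
```

The point of the sketch is only the lifecycle: `start()` is paid once, every `submit()` reuses the same daemons, and nothing shuts down until `terminate()` is called (or, on a real system, until the allocation expires).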