Let me address your questions up here so you don’t have to scan through the 
entire note.

PMIx rationale: PMI has been around for a long time, used primarily inside MPI 
library implementations to perform wireup. It provided a link from the MPI 
library to the local resource manager. However, as we move toward exascale, two 
things have become apparent:

1. the current PMI implementations don’t scale adequately to get there. The API 
generates too many communications and assumes everything is a blocking 
operation, thus preventing asynchronous progress

2. there were increasing requests for application-level interactions with the 
resource manager. People want ways to spawn jobs (and not just from within 
MPI), request pre-location of data, control power, etc. Rather than having 
every RM write its own interface (and thus make everyone’s code non-portable), 
we at Intel decided to extend the existing PMI definitions to support those 
functions. Thus, an application developer can directly access PMIx functions to 
perform all of those operations.

PMIx v1.0 is about to be released - it’ll be backward compatible with PMI-1 and 
PMI-2, plus add non-blocking operations and significantly reduce the number of 
communications. PMIx 2.0 is slated for this summer and will include the 
advanced controls capabilities.
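
To make that concrete, here is a minimal sketch of what a PMIx client looks 
like for the wireup exchange - publish your endpoint, fence, then look up a 
peer. It is only illustrative: the key name and endpoint string are made up, 
and the signatures follow the later v2-style client API, which differs slightly 
from the v1.0 calls (non-blocking variants such as PMIx_Fence_nb and 
PMIx_Get_nb are what deliver the asynchronous behavior mentioned above).

    #include <pmix.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustrative sketch only: the key name and endpoint value are
     * invented, and signatures follow the v2-style client API. */
    int main(void)
    {
        pmix_proc_t myproc, peer;
        pmix_value_t val, *rval;

        if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
            return 1;
        }

        /* publish this proc's endpoint information */
        val.type = PMIX_STRING;
        val.data.string = "tcp://10.0.0.1:4242";
        PMIx_Put(PMIX_GLOBAL, "example.endpoint", &val);
        PMIx_Commit();

        /* blocking fence keeps the sketch short; PMIx_Fence_nb is the
         * non-blocking form */
        PMIx_Fence(NULL, 0, NULL, 0);

        /* retrieve the endpoint published by rank 0 */
        PMIX_PROC_CONSTRUCT(&peer);
        strncpy(peer.nspace, myproc.nspace, PMIX_MAX_NSLEN);
        peer.rank = 0;
        if (PMIX_SUCCESS == PMIx_Get(&peer, "example.endpoint", NULL, 0, &rval)) {
            printf("rank 0 endpoint: %s\n", rval->data.string);
            PMIX_VALUE_RELEASE(rval);
        }

        PMIx_Finalize(NULL, 0);
        return 0;
    }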

ORCM is being developed because we needed a BSD-licensed, fully featured 
resource manager. This will allow us to integrate the RM even more tightly with 
the file system, networking, and other subsystems, thus achieving higher launch 
performance and providing desired features such as QoS management. PMIx is a 
part of that plan, but as you say, they each play their separate roles in the 
overall stack.


Persistent ORTE: there is a learning curve on ORTE, I fear. We do have some 
videos on the web site that can help get you started, and I’ve given a number 
of “classes” at Intel now for that purpose. It is still on my “to-do” list to 
summarize those classes and post the material on the web site.

For now, let me summarize how things work. At startup, mpirun reads the 
allocation (usually from the environment, but it depends on the host RM) and 
launches a daemon on each allocated node. Each daemon reads its local hardware 
environment and “phones home” to let mpirun know it is alive. Once all daemons 
have reported, mpirun maps the processes to the nodes and sends that map to all 
the daemons in a scalable broadcast pattern.
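
To give a rough picture of what “scalable broadcast pattern” means, the sketch 
below prints the relay targets in a binomial tree rooted at mpirun (vpid 0). 
This is a generic illustration of the idea, not ORTE’s actual routed/grpcomm 
code, and the daemon count is just an example.

    #include <stdio.h>

    /* Each daemon receives the launch message from its parent and relays
     * it to at most log2(N) children; no single node talks to everyone. */
    static void relay_targets(int rank, int size)
    {
        int mask = 1;

        /* find the bit at which this rank first receives the message */
        while (mask < size && !(rank & mask)) {
            mask <<= 1;
        }

        /* every lower bit identifies a child this rank relays to */
        for (mask >>= 1; mask > 0; mask >>= 1) {
            if (rank + mask < size) {
                printf("daemon %d relays to daemon %d\n", rank, rank + mask);
            }
        }
    }

    int main(void)
    {
        const int ndaemons = 8;   /* example: an 8-node allocation */
        for (int r = 0; r < ndaemons; r++) {
            relay_targets(r, ndaemons);
        }
        return 0;
    }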

Upon receipt of the launch message, each daemon parses it to identify which 
procs it needs to spawn locally. Once spawned, each proc connects back to its 
local daemon via a Unix domain socket for wireup support. The daemon tracks its 
procs as they complete and reports back to mpirun once all of its local procs 
are done. When all procs are reported complete (or one reports as abnormally 
terminated), mpirun sends a “die” message to every daemon so they all terminate 
cleanly.
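
For reference, the connect-back step looks roughly like the following from the 
proc’s side. This is a generic Unix-domain-socket sketch, not ORTE’s actual 
code, and the socket path shown is hypothetical.

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_un addr;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        /* hypothetical rendezvous path - the real one is set up by the daemon */
        strncpy(addr.sun_path, "/tmp/example_daemon.sock", sizeof(addr.sun_path) - 1);

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            close(fd);
            return 1;
        }

        /* the proc would now exchange its identity and wireup info
         * with the daemon over fd */
        close(fd);
        return 0;
    }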

What I will do is simply tell mpirun not to do that last step, but instead to 
wait for a “terminate” command before ending the daemons. This will allow you 
to reuse the existing DVM, making each independent job start a great deal 
faster. You’ll need to either terminate the DVM manually or let the RM do so 
when the allocation expires.

HTH
Ralph


> On Jan 21, 2015, at 12:52 PM, Mark Santcroos <mark.santcr...@rutgers.edu> 
> wrote:
> 
> Hi Ralph,
> 
>> On 21 Jan 2015, at 21:20 , Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> Hi Mark
>> 
>>> On Jan 21, 2015, at 11:21 AM, Mark Santcroos <mark.santcr...@rutgers.edu> 
>>> wrote:
>>> 
>>> Hi Ralph, all,
>>> 
>>> To give some background, I'm part of the RADICAL-Pilot [1] development team.
>>> RADICAL-Pilot is a Pilot System, an implementation of the Pilot (job) 
>>> concept, which in its most minimal form takes care of decoupling resource 
>>> acquisition from workload management.
>>> So instead of launching your real_science.exe through PBS, you submit a 
>>> Pilot, which will allow you to perform application level scheduling.
>>> The most obvious use case is when you want to run many (relatively) small 
>>> tasks; then you really don't want to go through the batch system every time. 
>>> That is besides the fact that these machines are very bad at managing many 
>>> tasks anyway.
>> 
>> Yeah, we sympathize.
> 
> That's always good :-)
> 
>> Of course, one obvious solution is to get an allocation and execute a shell 
>> script that runs the tasks within that allocation - yes?
> 
> Not really. Most of our use-cases have dynamic runtime properties, which 
> means that at t=0 the exact workload is not known.
> 
> In addition, I don't think such a script would allow me to work around the 
> aprun bottleneck, as I'm not aware of a way to start MPI tasks that span 
> multiple nodes from a Cray worker node.
> 
>>> I looked a bit better at ORCM and it clearly overlaps with what I want to 
>>> achieve.
>> 
>> Agreed. In ORCM, we allow a user to request a “session” that results in 
>> allocation of resources. Each session is given an “orchestrator” - the ORCM 
>> “shepherd” daemon - responsible for executing the individual tasks across 
>> the assigned allocation, and a collection of “lamb” daemons (one on each 
>> node of the allocation) that forms a distributed VM. The orchestrator can 
>> execute the tasks very quickly since it doesn’t have to go back to the 
>> scheduler, and we allow it to do so according to any provided precedence 
>> requirement. Again, for simplicity, a shell script is the default mechanism 
>> for submitting the individual tasks.
> 
> Yeah, similar solution to a similar problem.
> I noticed that exascale is also part of the motivation? How does this relate 
> to the PMIx effort? Different part of the stack, I guess.
> 
>>> One thing I noticed is that parts of it run as root - why is that?
>> 
>> ORCM is a full resource manager, which means it has a scheduler (rudimentary 
>> today) and boot-time daemons that must run as root so they can fork/exec the 
>> session-level daemons (that run at the user level). The orchestrator and its 
>> daemons all run at the user-level.
> 
> Ok. Our solution is user-space only, as one of our features is that we are 
> able to run across different types of systems. Both approaches come with a 
> tradeoff, obviously.
> 
>>>> We used to have a cmd line option in ORTE for what you propose - it 
>>>> wouldn’t be too hard to restore. Is there some reason to do so?
>>> 
>>> Can you point me to something that I could look for in the repo history, 
>>> then I can see if it serves my purpose.
>> 
>> It would be back in the svn repo, I fear - would take awhile to hunt it 
>> down. Basically, it just (a) started all the daemons to create a VM, and (b) 
>> told mpirun to stick around as a persistent daemon. All subsequent calls to 
>> mpirun would reference back to the persistent one, thus using it to launch 
>> the jobs against the standing VM instead of starting a new one every time.
> 
> *nod* That's what I tried to do this afternoon actually with the 
> "--ompi-server", but that was not meant to be.
> 
>> For ORCM, we just took that capability and expressed it as the “shepherd” 
>> plus “lamb” daemon architecture described above.
> 
> ACK.
> 
>> If you don’t want to replace the base RM, then using ORTE to establish a 
>> persistent VM is probably the way to go.
> 
> Indeed, that's what it sounds like. Plus, ORTE is generic enough that I can 
> re-use it on other types of systems too.
> 
>> I can probably make it do that again fairly readily. We have a developer’s 
>> meeting next week, which usually means I have some free time (during 
>> evenings and topics I’m not involved with), so I can take a crack at this 
>> then if that would be timely enough.
> 
> Happy to accept that offer. At this stage I'm not sure if I would want a CLI 
> or would be more interested in being able to do this programmatically, though.
> Also more than willing to assist in any way I can.
> 
> I tried to see how it all worked, but because of the modular nature of Open MPI 
> that was quite daunting. There is some learning curve, I guess :-)
> So it seems that mpirun is persistent and opens up a listening port, and then 
> some orteds get launched that phone home.
> From there I got lost in the MCA maze. How do the tasks get onto the compute 
> nodes and started?
> 
> Thanks a lot again, I appreciate your help.
> 
> Cheers,
> 
> Mark
