On 12/5/07 7:58 AM, "rolf.vandeva...@sun.com" <rolf.vandeva...@sun.com>
wrote:

> Ralph H Castain wrote:
> 
>> I. Support for non-MPI jobs
>> Considerable complexity currently exists in ORTE because of the stipulation
>> in our first requirements document that users be able to mpirun non-MPI jobs
>> - i.e., that we support such calls as "mpirun -n 100 hostname". This creates
>> a situation, however, where the RTE cannot know if the application will call
>> MPI_Init (or at least orte_init), which has significant implications for the
>> RTE's architecture. For example, during the launch of the application's
>> processes, the RTE cannot go into any form of blocking receive while waiting
>> for the procs to report a successful startup, since that report will never
>> come when executing something like "hostname".
>> 
>> Jeff has noted that support for non-MPI jobs is not something most (all?)
>> MPIs currently provide, nor something that users are likely to exploit as
>> they can more easily just "qsub hostname" (or the equivalent for that
>> environment). While nice for debugging purposes, therefore, it isn't clear
>> that supporting non-MPI jobs is worth the increased code complexity and
>> fragility.
>> 
>> In addition, the fact that we do not know if a job will call Init limits our
>> ability to do collective communications within the RTE, and hence our
>> scalability - see the note on that specific subject for further discussion
>> of this area.
>> 
>> This would be a "regression" in behavior, though, so the questions for the
>> community are:
>> 
>> (a) do we want to retain the feature to run non-MPI jobs with mpirun as-is
>> (and accept the tradeoffs, including the one described below in II)?
>>  
>> 
> Hi Ralph:
> From a user standpoint, (a) would be preferable.  However, as you point
> out, there are issues.  Are you saying that we cannot do collectives
> (Item III) if we preserve (a)?  Or is it that things will just be more
> complex?  I guess I am looking for more details about what the tradeoffs
> are for preserving (a).

I believe it would be more accurate to say things would be more complex. I'm
not sure we know enough at the moment to say collectives can't be done at
all. All I can say is that I have spent a little time trying to define a
"snowball" collective (i.e., one that collects info from each daemon it
passes through and delivers the final collection to the HNP), and not knowing
whether or not a process is going to call orte_init is one of the hurdles. I
believe some clever programming could probably overcome it - at least, I'm
not willing to give up yet. It will just take additional time and effort.
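
To make the "snowball" idea a bit more concrete, here is a toy sketch of what
I mean - purely illustrative plain C with made-up names (daemon_t,
relay_toward_hnp, NDAEMONS), not any actual ORTE interface or routing code:

/*
 * Toy illustration only: each hop appends its own contribution to the
 * message it relays toward the HNP, so the last hop delivers the whole
 * collection in a single message.  All names here are made up for this
 * sketch.
 */
#include <stdio.h>
#include <string.h>

#define NDAEMONS 4
#define MAXBUF   256

typedef struct {
    int  rank;          /* position in the relay chain */
    char payload[32];   /* this daemon's contribution  */
} daemon_t;

/* Append our payload to the growing "snowball" before forwarding it. */
static void relay_toward_hnp(daemon_t *d, char *snowball, size_t len)
{
    strncat(snowball, d->payload, len - strlen(snowball) - 1);
    strncat(snowball, ";", len - strlen(snowball) - 1);
}

int main(void)
{
    char snowball[MAXBUF] = "";
    daemon_t daemons[NDAEMONS];

    /* Simulate the message passing through each daemon in turn. */
    for (int i = 0; i < NDAEMONS; i++) {
        daemons[i].rank = i;
        snprintf(daemons[i].payload, sizeof(daemons[i].payload),
                 "daemon%d:ok", i);
        relay_toward_hnp(&daemons[i], snowball, sizeof(snowball));
    }

    /* In the real system the HNP would receive this as one message from
     * the final daemon, rather than assembling it locally like this. */
    printf("HNP received: %s\n", snowball);
    return 0;
}

The open question, of course, is how each daemon knows it has heard from every
local proc that intends to report - which is exactly where the "will it call
orte_init?" uncertainty bites.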

Likewise for the overall logic in the system. The biggest problem with
supporting non-MPI apps is that you have to be -very- careful throughout the
RTE to avoid any blocking operations that depend upon the procs calling
orte_init or orte_finalize - while still requiring that if they -did- call
orte_init, then they must call orte_finalize or else we consider the
termination to have been an 'abort'. Not impossible - but it adds time to
every change we make to the system, and always leaves the door open for a
deadlock condition to jump out of the bushes.
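
To spell out the bookkeeping I mean, here is an equally toy sketch (again with
made-up names - proc_state_t, classify_exit - not the real ORTE state
tracking). The point is simply that the RTE can only classify a termination
after the fact, and can never block waiting for a registration that may never
come:

/*
 * Toy illustration only: classify a terminated proc without ever having
 * blocked on its startup.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool called_init;       /* set if/when the proc registers via orte_init */
    bool called_finalize;   /* set if/when it deregisters via orte_finalize */
} proc_state_t;

/* Called only after the launcher reaps the process - never blocks. */
static const char *classify_exit(const proc_state_t *p)
{
    if (!p->called_init)
        return "normal (non-MPI app - it never registered)";
    if (p->called_finalize)
        return "normal (registered and deregistered)";
    return "abort (called orte_init but exited without orte_finalize)";
}

int main(void)
{
    proc_state_t hostname_run = { false, false };  /* "mpirun hostname"     */
    proc_state_t clean_mpi    = { true,  true  };  /* well-behaved MPI proc */
    proc_state_t crashed_mpi  = { true,  false };  /* died after MPI_Init   */

    printf("%s\n", classify_exit(&hostname_run));
    printf("%s\n", classify_exit(&clean_mpi));
    printf("%s\n", classify_exit(&crashed_mpi));
    return 0;
}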

It can be done - obviously, since we currently do it - though I note it has
taken us nearly three years to identify and resolve all those deadlock
scenarios (fingers crossed). I'm just pointing out that it does introduce some
complexity that could potentially be removed from the current code base, and
that it will make those inbound collectives more difficult. So Jeff and I
thought it would be worth at least asking whether this is a desirable feature
we should preserve, or something people really don't care about or use - just
another of those leftover requirements from the early days.

Personally, I like the feature while debugging the system as it allows me to
test the underlying allocate/map/launch infrastructure without the OOB
wireup - but I can live without it if people would prefer that we further
simplify the code. I can also certainly use the switch to indicate "this is
a non-MPI app" when I'm debugging, but I consider that to be somewhat
user-unfriendly...

...especially since, if the user forgets the switch and mpiruns a non-MPI
job, we would have no choice but to "hang" until they ctrl-c the job, or
introduce some totally artificial timeout constraint!

So I guess my recommendation is: if you believe (a) is preferable from a
user's perspective, then I would preserve the feature "as-is" and accept the
code complexity and risk, since those can be overcome with careful design and
testing - at least until such time as we -prove- that inbound collectives
cannot be written under those conditions. I believe this last point is
critical, as we really shouldn't accept linear scale-by-node as a limitation.

Now if I could just get some help on those inbound collectives so we can
resolve that point...but that was note III, I believe. ;-)



> 
> Having said that, we would probably be OK with (b) if that makes things
> better/faster/more robust.
> 
> Rolf
> 
>> (b) do we provide a flag to mpirun (perhaps adding the distinction that
>> "orterun" must be used for non-MPI jobs?) to indicate "this is NOT an MPI
>> job" so we can act accordingly?
>> 
>> (c) simply eliminate support for non-MPI jobs?
>> 
>> (d) other suggestions?
>> 
>> Ralph
>> 
>> 
> 

