I don't know the provenance of the PMI specification, but would it be possible to add a new function (at least within SLURM's PMI implementation) with the effect that Victor describes? Legacy SHMEM implementations have provided globalexit() and, should OpenSHMEM evolve to include it, it will likely have the semantics that globalexit(0) should cause the launcher to exit with 0.

Andy


On 06/04/2013 07:59 AM, Victor Kocheganov wrote:
OK, I see your points. I did not expect such a behavior to be so inconvenient, but all of the reasons look convincing. The source of the requirement is simply the wishes of our users.
I will try to find another approach, then.

Thanks for the detailed explanation, Ralph!


On Tue, Jun 4, 2013 at 3:34 AM, Ralph Castain <[email protected]> wrote:

    The OMPI developers were meeting this afternoon, so we took
    advantage of it to discuss this topic. We would recommend not
    changing the current behavior for two reasons. First, there is a
    long precedent for returning the first non-zero status, and
    returning a non-zero status if any process causes the entire job
    to abort even if they all abort with status zero. This is the only
    way the user (and any script they are using) can know that an
    "abort" was ordered.
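To make the convention described above concrete, here is a minimal sketch in C of how a launcher might fold per-process results into one job status. It is only an illustration of the rule as stated in this message: the `proc_result` struct, the `job_exit_status` function, and the choice of 1 as the status reported for an abort with all-zero codes are assumptions, not code or values taken from SLURM or Open MPI.

```c
#include <stddef.h>

/* Hypothetical per-process record; not a SLURM or Open MPI type. */
struct proc_result {
    int exit_status;   /* status the process exited or aborted with */
    int called_abort;  /* non-zero if the process ordered a job-wide abort */
};

/*
 * Job status under the convention described in the message:
 *  - the first non-zero exit status wins;
 *  - if every status is zero but some process ordered an abort,
 *    a non-zero status is reported anyway (1 is an arbitrary
 *    choice made for this sketch).
 */
int job_exit_status(const struct proc_result *procs, size_t n)
{
    int abort_seen = 0;

    for (size_t i = 0; i < n; i++) {
        if (procs[i].exit_status != 0)
            return procs[i].exit_status;
        if (procs[i].called_abort)
            abort_seen = 1;
    }
    return abort_seen ? 1 : 0;
}
```

With this rule, a script checking the launcher's status can tell an ordered abort from a clean run even when every process passed zero, which is the point being made about abort with status zero.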

    Second, we have looked at the OpenSHMEM standard and confirmed
    that nothing is said there about returning zero status in such
    situations. We don't know the source of this proposed requirement,
    but feel that it shouldn't override the community's expected behavior.

    Just our $0.02
    Ralph



    On Mon, Jun 3, 2013 at 1:45 PM, Ralph Castain <[email protected]> wrote:

        I'm leery of this patch - will discuss with other MPI folks as
        this could cause problems for existing apps



            On Jun 3, 2013, at 5:17 AM, "Riebs, Andy" <[email protected]> wrote:

            Hi Victor,

            If the patch is straightforward, and the reason for it
            is clear, patches sent to this list tend to be adopted
            quickly. However, since this changes behavior that
            someone else may be counting on, it might get held for
            the next major release if it is accepted.

            Andy

            *From:* Victor Kocheganov [mailto:[email protected]]
            *Sent:* Monday, June 03, 2013 6:48 AM
            *To:* slurm-dev
            *Subject:* [slurm-dev] Re: PMI_Abort with zero value

            OK, I see.

            I obtained the SLURM sources, which include PMI, and found
            the reason for the "strange" behavior (namely, that the rank
            0 process behaves differently from the others in PMI_Abort()).

            It looks straightforward to fix. Is it a complex procedure
            to contribute a minor fix to the community (via a patch)?

            On Mon, Jun 3, 2013 at 12:18 AM, Brian Gilmer <[email protected]> wrote:

            We are able to use _exit(), so I did not go any further.
            The behavior of both PMI_Abort() and exit() was odd, so
            I thought that might save you some time. I would be
            interested to hear if you find another solution.

            On Sun, Jun 2, 2013 at 3:39 AM, Victor Kocheganov <[email protected]> wrote:

            Thank you for the quick answer! I still have several
            questions, though; please see inline.

            On Fri, May 31, 2013 at 6:47 PM, Brian Gilmer <[email protected]> wrote:

            I addressed a similar problem with _exit(<value>).

            [Victor Kocheganov] Where can I find it? I cannot find any
            clue in the archive of the slurm-dev list
            (http://dir.gmane.org/gmane.comp.distributed.slurm.devel).

                Slurm will kill off the rest of the PEs in a job step
                if one exits with a non-zero code.

            [Victor Kocheganov] Unfortunately, as far as I know, that
            depends on the SLURM configuration (whether the '-K' flag is
            set; it could be set implicitly), so I cannot rely on that
            behavior...

                The exit() function doesn't work under mx shmem
                because the exit() function is overridden and does
                not propagate the exit code.  PMI_Abort(exit_code)
                uses exit() so in our case it always returns an exit
                code of 9 regardless of the value of exit_code.
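As a rough illustration of why _exit() can help when exit() is being interfered with, here is a small POSIX C sketch. It does not reproduce how mx shmem overrides exit() (the thread does not show that); instead, an `atexit` handler named `hijack_exit` stands in for the interference, and both it and the helper `observed_exit_status` are hypothetical names made up for this example. The value 9 simply mirrors the exit code mentioned in the paragraph above.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for a runtime that hijacks normal exit processing.
 * Purely illustrative; this is not how mx shmem is known to work. */
static void hijack_exit(void)
{
    _exit(9);   /* replaces whatever code was passed to exit() */
}

/*
 * Fork a child that installs the handler above and then terminates
 * with `code`: via _exit() if use_underscore_exit is non-zero, via
 * exit() otherwise. Returns the exit status seen by the parent,
 * or -1 on error.
 */
int observed_exit_status(int code, int use_underscore_exit)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        atexit(hijack_exit);
        if (use_underscore_exit)
            _exit(code);    /* terminates at once, skipping atexit handlers */
        exit(code);         /* runs hijack_exit, which exits with 9 instead */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

In this sketch, observed_exit_status(3, 0) reports 9 because the handler runs, while observed_exit_status(3, 1) reports 3 because _exit() bypasses exit-time processing entirely. Whether the same holds for a particular SHMEM or PMI library depends on how that library intercepts exit().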

            [Victor Kocheganov] This is interesting, because I see
            that SLURM always returns zero to the system when some
            process invokes PMI_Abort(0,NULL), except when the process
            with rank zero (the PMI daemon, I suspect) invokes it. So
            I still have a little hope that I can make PMI_Abort work
            for me (always return zero for PMI_Abort(0,NULL)).

            But you are saying that there is no hope for PMI_Abort(),
            do I understand correctly? Do you know of any other way to
            make SLURM (with or without PMI) terminate all the
            processes if one of them requests it (passing along the
            exit status, of course)?

                On Fri, May 31, 2013 at 5:23 AM, Victor Kocheganov <[email protected]> wrote:

                    Hello,

                    I am a SHMEM library developer, and I am looking
                    for a way to terminate the whole slurm job with a
                    specific exit status when one of the processes
                    requests it. That is, the SHMEM library should have
                    an API routine named 'globalexit(int status);' that
                    terminates the job, including all other processes
                    in it, with status as the exit code.

                    The only way I have found is to use
                    PMI_Abort(status), but it does not work for a
                    status of zero when PMI_Abort is invoked by the
                    rank zero process (the daemon for PMI, as I
                    understand it). Is this normal behavior or a bug?
                    Could you please help me find another approach if
                    this one is not appropriate for slurm?

                    Thank you in advance,

                    Victor Kocheganov.




-- Speak when you are angry--and you will make the best
                speech you'll ever regret.
                  - Laurence J. Peter








--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP
