The PMI group is handled elsewhere, but I agree that we need a different API for this purpose. We see conditional execution in Hadoop and elsewhere, so it is a reasonable use case; we just need to keep "abort" as something distinct. I'd propose we create something like PMI_Terminate as a separate way of implementing it.
I'll pose the question to the PMI folks and post the conclusion here as well.

Ralph

On Tue, Jun 4, 2013 at 6:52 AM, Andy Riebs wrote:

> I don't know the provenance of the PMI specification, but would it be
> possible to add a new function (at least within SLURM's PMI implementation)
> with the effect that Victor describes? Legacy SHMEM implementations have
> provided globalexit() and, should OpenSHMEM evolve to include it, it will
> likely have the semantics that globalexit(0) should cause the launcher to
> exit with 0.
>
> Andy
>
> On 06/04/2013 07:59 AM, Victor Kocheganov wrote:
>
> OK, I see your points: I did not suspect it would be so inconvenient to
> have such a behavior, but all the reasons look convincing. The source of
> the requirement is just the "will" of our users.
> I will try to find another approach then.
>
> Thanks for the detailed explanation, Ralph!
>
> On Tue, Jun 4, 2013 at 3:34 AM, Ralph Castain wrote:
>
>> The OMPI developers were meeting this afternoon, so we took advantage
>> of it to discuss this topic. We would recommend not changing the current
>> behavior for two reasons. First, there is a long precedent for returning
>> the first non-zero status, and returning a non-zero status if any process
>> causes the entire job to abort even if they all abort with status zero.
>> This is the only way the user (and any script they are using) can know that
>> an "abort" was ordered.
>>
>> Second, we have looked at the OpenSHMEM standard and confirmed that
>> nothing is said there about returning zero status in such situations. We
>> don't know the source of this proposed requirement, but feel that it
>> shouldn't override the community's expected behavior.
>>
>> Just our $0.02
>> Ralph
>>
>> On Mon, Jun 3, 2013 at 1:45 PM, Ralph Castain wrote:
>>
>>> I'm leery of this patch - will discuss with other MPI folks as this
>>> could cause problems for existing apps
>>>
>>>> On Jun 3, 2013, at 5:17 AM, "Riebs, Andy" wrote:
>>>>
>>>> Hi Victor,
>>>>
>>>> If the patch is straightforward, and the reason for it is clear,
>>>> patches sent to this list tend to be adopted quickly. However, since this
>>>> changes behavior that someone else may be counting on, it might get held
>>>> for the next major release if it is accepted.
>>>>
>>>> Andy
>>>>
>>>> *From:* Victor Kocheganov
>>>> *Sent:* Monday, June 03, 2013 6:48 AM
>>>> *To:* slurm-dev
>>>> *Subject:* [slurm-dev] Re: PMI_Abort with zero value
>>>>
>>>> OK, I see.
>>>>
>>>> I've got the SLURM sources with PMI in them and found the reason for the
>>>> "strange" behavior (I mean the rank 0 process behaves differently from
>>>> the others in PMI_Abort()).
>>>>
>>>> It seems straightforward to deal with. Is it a complex procedure to
>>>> provide a minor fix to the community (via a patch)?
>>>>
>>>> On Mon, Jun 3, 2013 at 12:18 AM, Brian Gilmer wrote:
>>>>
>>>> We are able to use _exit() so I did not go any further. The behavior
>>>> of PMI_Abort() and exit() were both odd, so I thought that might save
>>>> you some time. I am interested if you find another solution.
>>>>
>>>> On Sun, Jun 2, 2013 at 3:39 AM, Victor Kocheganov wrote:
>>>>
>>>> Thank you for the rapid answer! But I still have several questions;
>>>> please see inline.
>>>>
>>>> On Fri, May 31, 2013 at 6:47 PM, Brian Gilmer wrote:
>>>>
>>>> I addressed a similar problem with _exit(<value>).
>>>>
>>>> [Victor Kocheganov] Where can I find it?
>>>> I cannot find any clue in the archive of the slurm-dev list
>>>> (http://dir.gmane.org/gmane.comp.distributed.slurm.devel)
>>>>
>>>> Slurm will kill off the rest of the PEs in a job step if one exits
>>>> with a non-zero code.
>>>>
>>>> [Victor Kocheganov] Unfortunately, as far as I know, that depends on the
>>>> slurm configuration (whether the '-K' flag is set or not; it could be set
>>>> implicitly). So I cannot rely on such behavior...
>>>>
>>>> The exit() function doesn't work under mx shmem because the exit()
>>>> function is overridden and does not propagate the exit code.
>>>> PMI_Abort(exit_code) uses exit() so in our case it always returns an exit
>>>> code of 9 regardless of the value of exit_code.
>>>>
>>>> [Victor Kocheganov] And this is interesting, because I see that SLURM
>>>> always returns a zero value to the system when PMI_Abort(0,NULL) is
>>>> invoked by some process, except when the process with rank zero (the PMI
>>>> daemon, I suspect) invokes it. So I still have a little hope that I can
>>>> make PMI_Abort work for me (always return zero in the PMI_Abort(0,NULL)
>>>> case).
>>>>
>>>> But you are saying that there is no hope for PMI_Abort(), do I
>>>> understand correctly? Do you have any other way to make SLURM (using PMI
>>>> or without it) terminate all the processes if one of them requests it
>>>> (with the exit status passed along, of course)?
>>>>
>>>> On Fri, May 31, 2013 at 5:23 AM, Victor Kocheganov wrote:
>>>>
>>>> Hello,
>>>>
>>>> I am a SHMEM library developer, and I am looking for a way to terminate
>>>> the whole slurm job with a specific exit status when one of the processes
>>>> initiates it. That is, the SHMEM library should have an API routine named
>>>> 'globalexit(int status);', which terminates the job, including the other
>>>> processes in it, with status as the exit code.
>>>>
>>>> The only way I have found is to use PMI_Abort(status), but it does not
>>>> work for a zero status value when PMI_Abort is invoked by the rank zero
>>>> process (the daemon for PMI, as I understand it). Is this normal behavior
>>>> or a bug? Could you please help me find another approach if this one does
>>>> not seem proper for slurm?
>>>>
>>>> Thank you in advance,
>>>>
>>>> Victor Kocheganov.
>>>>
>>>> --
>>>> Speak when you are angry--and you will make the best speech you'll ever
>>>> regret.
>>>> - Laurence J. Peter
>>>>
>
> --
> Andy Riebs
> Hewlett-Packard Company
> High Performance Computing
> +1 404 648 9024
> My opinions are not necessarily those of HP
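
For readers following the thread, the exit-status convention Ralph describes (return the first non-zero status, and report a non-zero status whenever any process orders an abort, even with status zero) can be contrasted with the globalexit(status)-style termination that Victor and Andy ask for. The following is only a rough sketch in Python, not SLURM, PMI, or Open MPI code: the names `ProcessResult`, `launcher_status`, and `terminate_status`, the fallback value `ABORT_FALLBACK = 1`, and the choice to let an abort take precedence over a terminate request are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical fallback: the thread only says the launcher reports "a non-zero
# status" when an abort is ordered with status zero; the value 1 is assumed.
ABORT_FALLBACK = 1


@dataclass
class ProcessResult:
    """Outcome of one rank (hypothetical model, not a SLURM or PMI structure)."""
    exit_status: int = 0                    # status the process ended with
    aborted: bool = False                   # True if this rank ordered an abort
    terminate_status: Optional[int] = None  # set by a globalexit()-like request


def launcher_status(results: Iterable[ProcessResult]) -> int:
    """Combine per-rank outcomes into a single launcher exit status."""
    results = list(results)

    # An abort always surfaces as non-zero so that scripts can detect it.
    # (Giving abort precedence over terminate is an assumption of this sketch.)
    if any(r.aborted for r in results):
        for r in results:
            if r.exit_status != 0:
                return r.exit_status        # first non-zero status wins
        return ABORT_FALLBACK               # every status was zero

    # A separate "terminate" request (the PMI_Terminate idea) passes its
    # status through unchanged, so a requested 0 stays 0.
    for r in results:
        if r.terminate_status is not None:
            return r.terminate_status

    # Normal completion: first non-zero status, otherwise 0.
    for r in results:
        if r.exit_status != 0:
            return r.exit_status
    return 0
```

In this model an abort with status zero still yields a non-zero launcher status, while a terminate request with status 0 yields 0, which is the distinction the thread draws between "abort" and a PMI_Terminate-like call.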
