Michael, I was curious, so I tried the:
RESOURCELIMITPOLICY:ALWAYS,EXTENDEDVIOLATION:NOTIFY,CANCEL:12:00:00

parameter on my test cluster so that I could observe the behavior, and I also used the OverTimeLimit parameter in my SLURM test system. When the initial time limit is reached, I see that the job's remaining time in Moab goes negative. From what I've read, Torque supports a hard and a soft limit, so when a job uses up its initial time, the time remaining reflects the extended value. With SLURM showing a negative value, at least there is an indication that the job is running on the extended time allotment.

You are saying the job shows as cancelled after using the initial time, but I have found that if I use the Moab parameter JOBMAXOVERRUN 12:00:00 in my moab.cfg, the job will stay in the system and showq will display it (reflecting a negative time value) until completion.

Phil Eckert
LLNL

On 6/5/12 8:33 AM, "Michael Gutteridge" <[email protected]> wrote:

>
>On Mon, Jun 4, 2012 at 1:48 PM, Lipari, Don <[email protected]> wrote:
>
>> What appears to be happening is that Moab is sending the canceljob
>> message to SLURM when the job's time limit expires. It should email the
>> user at that point, but hold off issuing the canceljob command to SLURM
>> until Moab's EXTENDEDVIOLATION grace period - 12 hours in this case -
>> has transpired.
>
>I didn't go into this in detail, but it is slurm that is issuing the
>cancel command to the job at the originally specified end time - which is
>why I originally set OverTimeLimit=UNLIMITED. Moab is not sending the
>cancel command until it reaches EXTENDEDVIOLATION.
>
>> By setting SLURM's OverTimeLimit to match Moab's grace period, Michael
>> has solved the problem.
>
>What happens at that point is that the job's "EndTime" is set to the
>time at which EXTENDEDVIOLATION was reached. That's when the
>OverTimeLimit timer takes over - thus, slurm won't cancel the job until
>StartTime + WallTime + EXTENDEDVIOLATION + OverTimeLimit.
>It works, but Moab is confused about the job state after EXTENDEDVIOLATION
>(i.e., it thinks the job has been cancelled, but the RM reports it
>active).
>
>So yes, eventually this works, but it has undesirable side effects (e.g.,
>the job isn't visible in showq, I don't know how the resources would
>be scheduled, etc.).
>
>> If the above changes to Moab behavior are not made, I would recommend
>> using SLURM's OverTimeLimit as Michael described. However, I don't see
>> the need to eliminate the _timeout_job function from the
>> wiki*/cancel_job.c modules.
>
>What I've put together (but haven't tried out yet) is leaving the
>_timeout_job module as is, but adding the job cancel code from
>_cancel_job. So it both sets EndTime (which I'm guessing might be
>good for accounting purposes) and cancels the job. Might be
>redundant, but likely harmless anyway.
>
>> Don
>
>>> -----Original Message-----
>>> From: Moe Jette [mailto:[email protected]]
>>> Sent: Monday, June 04, 2012 11:29 AM
>>> To: slurm-dev
>>> Subject: [slurm-dev] Re: Implementing soft limits and notifications
>>> with Slurm/Moab
>>>
>>> The code in question dates back about six years to the first
>>> SLURM/Moab integration. I have no idea what the reason is for the
>>> different treatment of job cancellation for a time limit versus an
>>> administrator cancellation. I can understand the problem caused
>>> by the current SLURM code and your configuration. It seems that
>>> removing the _timeout_job function and calling the _cancel_job()
>>> function in all cases is reasonable. If you want to validate that and
>>> respond to the list, we can change the SLURM code.
>>>
>>> Quoting Michael Gutteridge <[email protected]>:
>>>
>>> > I have kind of an interesting situation. We'd like to enable jobs to
>>> > overrun their requested time by some amount, as well as provide
>>> > notifications when that wall time is close to used up.
>>> > We've got Moab
>>> > Workload Manager (6.1.6) and Slurm 2.3.5 installed. I'd originally
>>> > attempted to use Moab's resource limit policy:
>>> >
>>> > RESOURCELIMITPOLICY:ALWAYS,EXTENDEDVIOLATION:NOTIFY,CANCEL:12:00:00
>>> >
>>> > Meaning that when the job goes over time, Moab notifies the user but
>>> > then cancels the job after it's gone 12 hours past its wall time.
>>> > Now, this initially didn't work - Slurm just kills the job. I set
>>> > OverTimeLimit=UNLIMITED and then I got the notifications OK... but
>>> > when the job reaches its overtime limit, the job isn't actually
>>> > cancelled, even though Moab cancels it. I see it send Slurm the
>>> > message via wiki2:
>>> >
>>> > 05/31 11:15:31 INFO: message sent: 'CMD=CANCELJOB ARG=1508
>>> > TYPE=WALLCLOCK'
>>> >
>>> > And I see slurm acknowledge the event:
>>> >
>>> > 112785 05/31 11:15:31 INFO: received message 'CK=8512712decedc584
>>> > TS=1338488131 AUTH=slurm DT=SC=0 RESPONSE=job 1508 cancelled
>>> > successfully' from wiki server
>>> > 112786 05/31 11:15:31 MSUDisconnect(9)
>>> > 112787 05/31 11:15:31 INFO: job '1508' cancelled through WIKI RM
>>> >
>>> > At higher log levels I see that Slurm sets the end time for the job
>>> > to the current time. In src/plugins/sched/wiki2/cancel_job.c, you
>>> > can see where the plugin checks the command type and calls
>>> > _timeout_job() based on it being "WALLCLOCK". This function simply
>>> > sets the end time for the job and relies on the Slurm scheduler to
>>> > kill it (actually, "purge" I think is the term):
>>> >
>>> > job_ptr->end_time = time(NULL);
>>> > debug("wiki: set end time for job %u", jobid);
>>> >
>>> > Now, since the scheduler is set with OverTimeLimit "UNLIMITED", it
>>> > never purges the job.
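[Editor's note] Michael's proposed fix later in the thread (keep _timeout_job's EndTime update, but also cancel the job the way _cancel_job does) could look roughly like the sketch below. The struct and the signal stub are simplified stand-ins I invented for illustration; they are not SLURM's real job_record or signalling API, and only the end_time assignment and debug message mirror the quoted plugin code.

```c
#include <signal.h>
#include <stdio.h>
#include <time.h>

/* Simplified stand-in for SLURM internals -- illustration only. */
struct job_record {
    unsigned int job_id;
    time_t end_time;
    int last_signal;   /* records what the stub "sent" to the job */
};

/* Stub: the real plugin would ask slurmctld to signal the job's tasks. */
static void job_signal_stub(struct job_record *job_ptr, int sig)
{
    job_ptr->last_signal = sig;
}

/* Proposed behavior for CMD=CANCELJOB ... TYPE=WALLCLOCK: keep setting
 * EndTime (possibly useful for accounting, as Michael guesses), but
 * also cancel the job immediately instead of waiting out SLURM's
 * OverTimeLimit timer. */
void timeout_and_cancel_job(struct job_record *job_ptr)
{
    job_ptr->end_time = time(NULL);     /* what _timeout_job() already does */
    printf("wiki: set end time for job %u\n", job_ptr->job_id);
    job_signal_stub(job_ptr, SIGKILL);  /* new: what _cancel_job() adds */
}
```

As Michael notes, doing both may be redundant, but the EndTime update is likely harmless alongside the cancel.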
>>> > Setting OverTimeLimit to a more reasonable
>>> > value does result in the job being purged, but only *after* both the
>>> > "EXTENDEDVIOLATION" period (12 hours in this example) and the
>>> > OverTimeLimit time periods have elapsed. Note that if OverTimeLimit
>>> > is less than EXTENDEDVIOLATION, the job will be terminated early.
>>> > The job's "EndTime" attribute is set based on the original start
>>> > time plus wall time.
>>> >
>>> > I've been working on ways to get around this. I'm working on having
>>> > Moab simply handle notifications and let Slurm do the dirty work.
>>> > So if I want people to have 12 hours beyond their request, I'd set
>>> > up:
>>> >
>>> > RESOURCELIMITPOLICY:ALWAYS,EXTENDEDVIOLATION:NOTIFY,NOTIFY:12:00:00
>>> >
>>> > in Moab and
>>> >
>>> > OverTimeLimit=720
>>> >
>>> > in slurm.conf. This appears to work the way we'd like it - I get two
>>> > notifications and the job's killed.
>>> >
>>> > That issue resolved, I do think there's kind of a larger issue:
>>> > shouldn't Slurm be cancelling the job no matter what type of
>>> > CANCELJOB command it gets? The wiki specification indicates:
>>> >
>>> > The 'CancelJob' command, if applied to an active job, will
>>> > terminate its execution. If applied to an idle or active job,
>>> > the CancelJob command will change the job's state to 'Canceled'.
>>> >
>>> > However, it's reporting the job cancelled when it isn't (if
>>> > OverTimeLimit > 0). The problem that arises is that while the job
>>> > is still running, Moab thinks it's been cancelled and doesn't
>>> > report on it for that period of time.
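[Editor's note] Pulled together, the working arrangement Michael lands on is just these two settings side by side. Both directives are quoted from the thread; the comments are my gloss on them, and the 12-hour/720-minute values are this thread's example.

```
# moab.cfg - Moab only notifies (at wall time, and again 12 hours
# later); it never sends CANCELJOB itself
RESOURCELIMITPOLICY:ALWAYS,EXTENDEDVIOLATION:NOTIFY,NOTIFY:12:00:00

# slurm.conf - SLURM enforces the hard limit 720 minutes past wall time
OverTimeLimit=720
```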
>>> > There's no room in the spec that suggests
>>> > that the job cancel should change with the type:
>>> >
>>> > <CANCELTYPE> is one of the following:
>>> > ADMIN (command initiated by scheduler administrator)
>>> > WALLCLOCK (command initiated by scheduler because job exceeded
>>> > its specified wallclock limit)
>>> >
>>> > But I don't really know too much about the original reasons this
>>> > was implemented this way. There's probably a couple of ways to
>>> > change that, and I think I can provide some patches, but I thought
>>> > I'd ask the question before going too far down that road.
>>> >
>>> > Thanks - all the best
>>> >
>>> > Michael
>>> >
>>> > --
>>> > Hey! Somebody punched the foley guy!
>>> > - Crow, MST3K ep. 508
>
>--
>Hey! Somebody punched the foley guy!
> - Crow, MST3K ep. 508
