[slurm-dev] Re: Implementing soft limits and notifications with Slurm/Moab

Moe Jette Mon, 04 Jun 2012 11:16:09 -0700

The code in question dates back about six years to the first
SLURM/Moab integration. I have no idea what the reason is for the
different treatment of job cancellation for a time limit versus an
administrator cancellation. I can understand the problem caused
by the current SLURM code and your configuration. It seems that
removing the _timeout_job() function and calling the _cancel_job()
function in all cases is reasonable.  If you want to validate that and
respond to the list, we can change the SLURM code.
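
As a rough illustration of that change, here is a self-contained model
of the two code paths. It is only a sketch, not the actual
src/plugins/sched/wiki2/cancel_job.c: the struct, the enum, the
handle_cancel_* names and the simplified function signatures are all
hypothetical stand-ins, and the real _cancel_job() does considerably
more than flip a state field.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-ins for illustration; not Slurm's real types. */
enum job_state { JOB_RUNNING, JOB_CANCELLED };

struct job {
    unsigned int   id;
    enum job_state state;
    time_t         end_time;
};

/* Models the WALLCLOCK path described in the thread: only the end time
 * is moved, and a later scheduler pass (subject to OverTimeLimit)
 * decides whether the job is actually killed. */
static void timeout_job(struct job *job_ptr, time_t now)
{
    assert(job_ptr != NULL);
    job_ptr->end_time = now;
}

/* Models the ADMIN path: the job is terminated right away. */
static void cancel_job(struct job *job_ptr, time_t now)
{
    assert(job_ptr != NULL);
    job_ptr->state    = JOB_CANCELLED;
    job_ptr->end_time = now;
}

/* Behaviour as reported: the cancel type selects the code path. */
static void handle_cancel_current(struct job *job_ptr, const char *type,
                                  time_t now)
{
    if (strcmp(type, "WALLCLOCK") == 0)
        timeout_job(job_ptr, now);
    else
        cancel_job(job_ptr, now);
}

/* Proposed behaviour: every CANCELJOB really cancels the job. */
static void handle_cancel_proposed(struct job *job_ptr, const char *type,
                                   time_t now)
{
    (void) type;    /* the type no longer changes the outcome */
    cancel_job(job_ptr, now);
}
```

In this model, a WALLCLOCK cancel handled the current way leaves the job
in JOB_RUNNING with only its end time changed, which matches the
behaviour reported with OverTimeLimit=UNLIMITED. The proposed handler
ends the job regardless of the cancel type.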


Quoting Michael Gutteridge <[email protected]>:

>
> I have kind of an interesting situation.  We'd like to enable jobs to
> overrun their requested time by some amount as well as provide
> notifications when that wall time is close to used up.  We've got Moab
> Workload Manager (6.1.6) and Slurm 2.3.5 installed.  I'd originally
> attempted to use Moab's resource limit policy:
>
> RESOURCELIMITPOLICY:ALWAYS,EXTENDEDVIOLATION:NOTIFY,CANCEL:12:00:00
>
> Meaning that when the job goes over time, Moab notifies the user but
> then cancels the job after it's gone 12 hours past its wall time.
> Now, this initially didn't work- Slurm just kills the job.  I set
> OverTimeLimit=UNLIMITED and then I got the notifications OK... but
> when the job reaches its overtime limit, the job isn't actually
> cancelled, even though Moab issues the cancel.  I see it send Slurm
> the message via wiki2:
>
> 05/31 11:15:31  INFO:     message sent: 'CMD=CANCELJOB ARG=1508 TYPE=WALLCLOCK'
>
> And I see slurm acknowledge the event:
>
> 112785 05/31 11:15:31  INFO:     received message 'CK=8512712decedc584
> TS=1338488131 AUTH=slurm DT=SC=0 RESPONSE=job 1508 cancelled
> successfully' from wiki server
> 112786 05/31 11:15:31  MSUDisconnect(9)
> 112787 05/31 11:15:31  INFO:     job '1508' cancelled through WIKI RM
>
> At higher log levels I see that Slurm sets the end time for the job to
> the current time.  In src/plugins/sched/wiki2/cancel_job.c, you can
> see where the plugin checks the command type and calls _timeout_job()
> based on it being "WALLCLOCK".  This function simply sets the end time
> for the job and relies on the Slurm scheduler to kill it (actually,
> purge I think is the term):
>
>     job_ptr->end_time = time(NULL);
>     debug("wiki: set end time for job %u", jobid);
>
> Now, since the scheduler is set with OverTimeLimit "UNLIMITED", it
> never purges the job.  Setting OverTimeLimit to a more reasonable
> value does result in the job being purged, but only *after* both the
> "EXTENDEDVIOLATION" period (12 hours in this example) and the
> OverTimeLimit time periods have elapsed.  Note that if OverTimeLimit
> is less than EXTENDEDVIOLATION, the job will be terminated early.  The
> job's "EndTime" attribute is set based on the original start time plus
> wall time.
>
> I've been working on ways to get around this.  I'm working on having
> Moab simply handle notifications and let Slurm do the dirty work.  So
> if I want people to have 12 hours beyond their request, I'd set up:
>
> RESOURCELIMITPOLICY:ALWAYS,EXTENDEDVIOLATION:NOTIFY,NOTIFY:12:00:00
>
> in Moab and
>
> OverTimeLimit=720
>
> in slurm.conf.  This appears to work the way we'd like it- I get two
> notifications and the job's killed.
>
> That issue resolved, I do think there's kind of a larger issue:
> shouldn't Slurm be cancelling the job no matter what type of CANCELJOB
> command it gets?  The wiki specification indicates:
>
>    The 'CancelJob' command, if applied to an active job, will
>    terminate its execution.  If applied to an idle or active job,
>    the CancelJob command will change the job's state to 'Canceled'.
>
> However, it's reporting the job cancelled when it isn't (if
> OverTimeLimit > 0).  The problem that arises is that while the job is
> still running, Moab thinks it's been cancelled and doesn't report on it
> for that period of time.  Nothing in the spec suggests that the
> behavior of the job cancel should change with the type:
>
>     <CANCELTYPE> is one of the following:
>     ADMIN               (command initiated by scheduler administrator)
>     WALLCLOCK           (command initiated by scheduler because job
>                          exceeded its specified wallclock limit)
>
> But I don't really know too much about the original reasons this was
> implemented this way.  There's probably a couple ways to change that
> and I think I can provide some patches, but I thought I'd ask the
> question before going too far down that road.
>
> Thanks- all the best
>
> Michael
>
> --
> Hey! Somebody punched the foley guy!
>    - Crow, MST3K ep. 508
