The code in question dates back about six years to the first SLURM/Moab integration. I have no idea why a time-limit cancellation is treated differently from an administrator cancellation. I can see how the current SLURM code, combined with your configuration, causes the problem you describe. Removing the _timeout_job() function and calling the _cancel_job() function in all cases seems reasonable. If you want to validate that and respond to the list, we can change the SLURM code.
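To make the suggestion concrete, here is a rough, self-contained sketch of the dispatch as it is described in your message and as it would look after the change. This is not the actual plugin code: struct job_stub, the *_stub functions, and the two handle_cancel_* functions are hypothetical stand-ins, and the real functions in cancel_job.c have different signatures and do considerably more work.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for Slurm's job record; only what the sketch needs. */
struct job_stub {
    unsigned job_id;
    time_t   end_time;
    bool     running;
};

/* Models the behaviour described for _timeout_job(): only move the end
 * time and leave termination to the scheduler's time-limit handling. */
void timeout_job_stub(struct job_stub *job)
{
    assert(job != NULL);
    job->end_time = time(NULL);
}

/* Models an immediate cancellation: the job stops running right away. */
void cancel_job_stub(struct job_stub *job)
{
    assert(job != NULL);
    job->end_time = time(NULL);
    job->running = false;
}

/* Dispatch as described in the report: WALLCLOCK takes the timeout path. */
void handle_cancel_current(struct job_stub *job, const char *type)
{
    if (strcmp(type, "WALLCLOCK") == 0)
        timeout_job_stub(job);
    else
        cancel_job_stub(job);
}

/* Suggested dispatch: every CANCELJOB type cancels the job. */
void handle_cancel_proposed(struct job_stub *job, const char *type)
{
    (void)type; /* ADMIN and WALLCLOCK are treated the same */
    cancel_job_stub(job);
}
```

In this model, a WALLCLOCK cancel through handle_cancel_current() leaves the job running with only its end time changed, which matches what you see with OverTimeLimit=UNLIMITED, while handle_cancel_proposed() stops the job regardless of the type.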

Quoting Michael Gutteridge <[email protected]>:

>
> I have kind of an interesting situation. We'd like to enable jobs to
> overrun their requested time by some amount as well as provide
> notifications when that wall time is close to used up. We've got Moab
> Workload Manager (6.1.6) and Slurm 2.3.5 installed. I'd originally
> attempted to use Moab's resource limit policy:
>
>   RESOURCELIMITPOLICY:ALWAYS,EXTENDEDVIOLATION:NOTIFY,CANCEL:12:00:00
>
> Meaning that when the job goes over time, moab notifies the user but
> then cancels the job after it's gone 12 hours past its wall time.
> Now, this initially didn't work- Slurm just kills the job. I set
> OverTimeLimit=UNLIMITED and then I got the notifications OK... but
> when the job reaches its overtime limit, the job isn't cancelled.
> Moab cancels the job. I see it send Slurm the message via wiki2:
>
>   05/31 11:15:31 INFO: message sent: 'CMD=CANCELJOB ARG=1508 TYPE=WALLCLOCK'
>
> And I see slurm acknowledge the event:
>
>   112785 05/31 11:15:31 INFO: received message 'CK=8512712decedc584 TS=1338488131 AUTH=slurm DT=SC=0 RESPONSE=job 1508 cancelled successfully' from wiki server
>   112786 05/31 11:15:31 MSUDisconnect(9)
>   112787 05/31 11:15:31 INFO: job '1508' cancelled through WIKI RM
>
> At higher log levels I see that Slurm sets the end time for the job to
> the current time. In src/plugins/sched/wiki2/cancel_job.c, you can
> see where the plugin checks the command type and calls _timeout_job()
> based on it being "WALLCLOCK". This function simply sets the end time
> for the job and relies on the Slurm scheduler to kill it (actually,
> purge I think is the term):
>
>   job_ptr->end_time = time(NULL);
>   debug("wiki: set end time for job %u", jobid);
>
> Now, since the scheduler is set with OverTimeLimit "UNLIMITED", it
> never purges the job. Setting OverTimeLimit to a more reasonable
> value does result in the job being purged, but only *after* both the
> "EXTENDEDVIOLATION" period (12 hours in this example) and the
> OverTimeLimit time periods have elapsed. Note that if OverTimeLimit
> is less than EXTENDEDVIOLATION, the job will be terminated early. The
> job's "EndTime" attribute is set based on the original start time plus
> wall time.
>
> I've been working on ways to get around this. I'm working on having
> Moab simply handle notifications and let Slurm do the dirty work. So
> if I want people to have 12 hours beyond their request, I'd set up:
>
>   RESOURCELIMITPOLICY:ALWAYS,EXTENDEDVIOLATION:NOTIFY,NOTIFY:12:00:00
>
> in Moab and
>
>   OverTimeLimit=720
>
> in slurm.conf. This appears to work the way we'd like it- I get two
> notifications and the job's killed.
>
> That issue resolved, I do think there's kind of a larger issue:
> shouldn't Slurm be cancelling the job no matter what type of CANCELJOB
> command it gets? The wiki specification indicates:
>
>   The 'CancelJob' command, if applied to an active job, will
>   terminate its execution. If applied to an idle or active job,
>   the CancelJob command will change the job's state to 'Canceled'.
>
> However it's reporting the job cancelled when it isn't (if
> OverTimeLimit > 0). The problem that arises is that while the job is
> still running, Moab thinks it's been cancelled and doesn't report on it
> for that period of time. There's no room in the spec that suggests
> that the job cancel should change with the type:
>
>   <CANCELTYPE> is one of the following:
>     ADMIN (command initiated by scheduler administrator)
>     WALLCLOCK (command initiated by scheduler because job exceeded its
>     specified wallclock limit)
>
> But I don't really know too much about the original reasons this was
> implemented this way. There's probably a couple ways to change that
> and I think I can provide some patches, but I thought I'd ask the
> question before going too far down that road.
>
> Thanks- all the best
>
> Michael
>
> --
> Hey! Somebody punched the foley guy!
>    - Crow, MST3K ep. 508
