On Mon, Jun 4, 2012 at 1:48 PM, Lipari, Don <[email protected]> wrote:
> What appears to be happening is that Moab is sending the canceljob message to
> SLURM when the job's time limit expires.  It should email the user at that
> point, but hold off issuing the canceljob command to SLURM until Moab's
> EXTENDEDVIOLATION grace period - 12 hours in this case - has transpired.
>

I didn't go into this in detail, but it is Slurm that is issuing the
cancel command to the job at the originally specified end time, which is
why I originally set OverTimeLimit=UNLIMITED.  Moab is not sending the
cancel command until it reaches EXTENDEDVIOLATION.

> By setting SLURM's OverTimeLimit to match Moab's grace period, Michael has
> solved the problem.

What happens at that point is that the job's "EndTime" is set to the time
at which EXTENDEDVIOLATION was reached.  That's when the OverTimeLimit
timer takes over, so Slurm won't cancel the job until
StartTime + WallTime + EXTENDEDVIOLATION + OverTimeLimit.  It works, but
Moab is confused about the job state after EXTENDEDVIOLATION (i.e., it
thinks the job has been cancelled, but the RM reports it active).  So
yes, eventually this works, but it has undesirable side effects (e.g. the
job isn't visible in showq, I don't know how the resources would be
scheduled, etc.).

>
> If the above changes to Moab behavior are not made, I would recommend using
> SLURM's OverTimeLimit as Michael described.  However, I don't see the need to
> eliminate _timeout_job function from the wiki*/cancel_job.c modules.
>

What I've put together (but haven't tried out yet) leaves the
_timeout_job function as is, but adds the job cancel code from
_cancel_job.  So it both sets EndTime (which I'm guessing might be good
for accounting purposes) and cancels the job.  Might be redundant, but
likely harmless anyway.
> Don
>
>> -----Original Message-----
>> From: Moe Jette [mailto:[email protected]]
>> Sent: Monday, June 04, 2012 11:29 AM
>> To: slurm-dev
>> Subject: [slurm-dev] Re: Implementing soft limits and notifications
>> with Slurm/Moab
>>
>>
>> The code in question dates back about six years to the first
>> SLURM/Moab integration.  I have no idea what the reason is for the
>> different treatment of job cancellation for time limit
>> and an administrator cancellation.  I can understand the problem caused
>> by the current SLURM code and your configuration.  It seems that
>> removing the _timeout_job function and calling the _cancel_job()
>> function in all cases is reasonable.  If you want to validate that and
>> respond to the list, we can change the SLURM code.
>>
>> Quoting Michael Gutteridge <[email protected]>:
>>
>> >
>> > I have kind of an interesting situation.  We'd like to enable jobs to
>> > overrun their requested time by some amount as well as provide
>> > notifications when that wall time is close to used up.  We've got Moab
>> > Workload Manager (6.1.6) and Slurm 2.3.5 installed.  I'd originally
>> > attempted to use Moab's resource limit policy:
>> >
>> > RESOURCELIMITPOLICY:ALWAYS,EXTENDEDVIOLATION:NOTIFY,CANCEL:12:00:00
>> >
>> > Meaning that when the job goes over time, Moab notifies the user but
>> > then cancels the job after it's gone 12 hours past its wall time.
>> > Now, this initially didn't work- Slurm just kills the job.  I set
>> > OverTimeLimit=UNLIMITED and then I got the notifications OK... but
>> > when the job reaches its overtime limit, the job isn't cancelled.
>> > Moab cancels the job.
>> > I see it send Slurm the message via wiki2:
>> >
>> > 05/31 11:15:31  INFO:     message sent: 'CMD=CANCELJOB ARG=1508 TYPE=WALLCLOCK'
>> >
>> > And I see Slurm acknowledge the event:
>> >
>> > 112785 05/31 11:15:31  INFO:     received message 'CK=8512712decedc584 TS=1338488131 AUTH=slurm DT=SC=0 RESPONSE=job 1508 cancelled successfully' from wiki server
>> > 112786 05/31 11:15:31  MSUDisconnect(9)
>> > 112787 05/31 11:15:31  INFO:     job '1508' cancelled through WIKI RM
>> >
>> > At higher log levels I see that Slurm sets the end time for the job to
>> > the current time.  In src/plugins/sched/wiki2/cancel_job.c, you can
>> > see where the plugin checks the command type and calls _timeout_job()
>> > based on it being "WALLCLOCK".  This function simply sets the end time
>> > for the job and relies on the Slurm scheduler to kill it (actually,
>> > purge I think is the term):
>> >
>> >         job_ptr->end_time = time(NULL);
>> >         debug("wiki: set end time for job %u", jobid);
>> >
>> > Now, since the scheduler is set with OverTimeLimit "UNLIMITED", it
>> > never purges the job.  Setting OverTimeLimit to a more reasonable
>> > value does result in the job being purged, but only *after* both the
>> > "EXTENDEDVIOLATION" period (12 hours in this example) and the
>> > OverTimeLimit time periods have elapsed.  Note that if OverTimeLimit
>> > is less than EXTENDEDVIOLATION, the job will be terminated early.  The
>> > job's "EndTime" attribute is set based on the original start time plus
>> > wall time.
>> >
>> > I've been working on ways to get around this.  I'm working on having
>> > Moab simply handle notifications and let Slurm do the dirty work.  So
>> > if I want people to have 12 hours beyond their request, I'd set up:
>> >
>> > RESOURCELIMITPOLICY:ALWAYS,EXTENDEDVIOLATION:NOTIFY,NOTIFY:12:00:00
>> >
>> > in Moab and
>> >
>> > OverTimeLimit=720
>> >
>> > in slurm.conf.
>> > This appears to work the way we'd like it- I get two
>> > notifications and the job's killed.
>> >
>> > That issue resolved, I do think there's kind of a larger issue:
>> > shouldn't Slurm be cancelling the job no matter what type of CANCELJOB
>> > command it gets?  The wiki specification indicates:
>> >
>> >     The 'CancelJob' command, if applied to an active job, will
>> >     terminate its execution.  If applied to an idle or active job,
>> >     the CancelJob command will change the job's state to 'Canceled'.
>> >
>> > However, it's reporting the job cancelled when it isn't (if
>> > OverTimeLimit > 0).  The problem that arises is that while the job is
>> > still running, Moab thinks it's been cancelled and doesn't report on it
>> > for that period of time.  There's nothing in the spec that suggests
>> > that the job cancel should change with the type:
>> >
>> >     <CANCELTYPE> is one of the following:
>> >         ADMIN     (command initiated by scheduler administrator)
>> >         WALLCLOCK (command initiated by scheduler because job exceeded
>> >                   its specified wallclock limit)
>> >
>> > But I don't really know too much about the original reasons this was
>> > implemented this way.  There are probably a couple of ways to change
>> > that and I think I can provide some patches, but I thought I'd ask the
>> > question before going too far down that road.
>> >
>> > Thanks- all the best
>> >
>> > Michael
>> >
>> > --
>> > Hey! Somebody punched the foley guy!
>> >    - Crow, MST3K ep. 508

--
Hey! Somebody punched the foley guy!
   - Crow, MST3K ep. 508
