Disclaimer:  the responses below are based on Michael's reported observations.
I have not attempted to recreate the scenario and observe the behavior myself.

> -----Original Message-----
> From: Michael Gutteridge [mailto:[email protected]]
> Sent: Tuesday, June 05, 2012 8:40 AM
> To: slurm-dev
> Subject: [slurm-dev] Re: Implementing soft limits and notifications
> with Slurm/Moab
> 
> 
> On Mon, Jun 4, 2012 at 1:48 PM, Lipari, Don <[email protected]> wrote:
> 
> > What appears to be happening is that Moab is sending the canceljob
> message to SLURM when the job's time limit expires.  It should email
> the user at that point, but hold off issuing the canceljob command to
> SLURM until Moab's EXTENDEDVIOLATION grace period - 12 hours in this
> case - has transpired.
> >
> 
> I didn't go into this in detail, but it is slurm that is issuing the
> cancel command to the job at the originally specified end time, which
> is why I originally set OverTimeLimit=UNLIMITED.  Moab is not sending
> the cancel command until it reaches EXTENDEDVIOLATION.

Given that is the case, I still think the change needs to be in Moab:  as the 
original end time for the job approaches and Moab decides to extend the job's 
run time by an additional 12 hours, it should send a message to SLURM to 
extend the job's end time by 12 hours.

You basically have the scheduler making the decision to extend the end time vs. 
SLURM working off the end time it was originally given.
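As a sketch of that Moab-side fix, the extension could be delivered to SLURM with scontrol; the job id (1234) and the new 36-hour limit (original 24h walltime plus the 12-hour grace) are made-up values for illustration, and these commands obviously require a running SLURM installation:

```shell
# Just before the original limit expires, raise the job's time limit in
# SLURM so SLURM never enforces the old end time.
scontrol update JobId=1234 TimeLimit=36:00:00

# Confirm the job's remaining time reflects the extension
# (%i = job id, %L = time left).
squeue -j 1234 -o "%i %L"
```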

> > By setting SLURM's OverTimeLimit to match Moab's grace period,
> Michael has solved the problem.
> 
> What happens at that point is that the job's "EndTime" is set to the
> time at which EXTENDEDVIOLATION was reached.  That's when the
> OverTimeLimit timer takes over; thus, slurm won't cancel the job until
> StartTime + WallTime + EXTENDEDVIOLATION + OverTimeLimit.  It works,
> but Moab is confused about the job state after EXTENDEDVIOLATION
> (i.e., it thinks the job has been cancelled, but the RM reports it
> active).
> 
> So yes, eventually this works, but it has undesirable side effects
> (e.g., the job isn't visible in showq, it's unclear how the resources
> would be scheduled, etc.)
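The timeline Michael describes can be checked with a little arithmetic.  The concrete times below are assumed values for illustration; the variable names just mirror the Moab/SLURM parameters discussed in this thread, not any real API:

```python
from datetime import datetime, timedelta

start_time = datetime(2012, 6, 5, 8, 0)       # assumed job start
wall_time = timedelta(hours=24)               # requested walltime
extended_violation = timedelta(hours=12)      # Moab EXTENDEDVIOLATION grace
over_time_limit = timedelta(hours=12)         # SLURM OverTimeLimit, matched

# EndTime is reset when EXTENDEDVIOLATION is reached, and only then does
# the OverTimeLimit timer start, so the job survives until:
slurm_kill_time = (start_time + wall_time
                   + extended_violation + over_time_limit)
print(slurm_kill_time)  # -> 2012-06-07 08:00:00
```

Note the job outlives Moab's canceljob (sent at StartTime + WallTime + EXTENDEDVIOLATION) by a full OverTimeLimit, which is exactly the window in which Moab and the RM disagree about the job's state.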
>
> > If the above changes to Moab behavior are not made, I would recommend
> using SLURM's OverTimeLimit as Michael described.  However, I don't see
> the need to eliminate the _timeout_job function from the wiki*/cancel_job.c
> modules.
> >
> 
> What I've put together (but haven't tried out yet) is leaving the
> _timeout_job function as is, but adding the job cancel code from
> _cancel_job.  So it both sets EndTime (which I'm guessing might be
> good for accounting purposes) and cancels the job.  This might be
> redundant, but it's likely harmless.

So let me summarize my understanding.

Option A:  Fix Moab
1.  Moab submits job to SLURM with original walltime limit
2.  Just prior to original walltime limit being reached,
    a. Moab decides to give job EXTENDEDVIOLATION grace time
    b. Moab emails user that job life is being extended
    c. Moab issues a command to SLURM to extend the job's endtime by 
       EXTENDEDVIOLATION (minutes?)
3.  For the next EXTENDEDVIOLATION minutes, showq and squeue show running job
4.  After EXTENDEDVIOLATION minutes have transpired
    a. SLURM cancels job and Moab issues CANCELJOB command to SLURM
    b. Job is instantly cancelled (based on default OverTimeLimit == 0)

Option B:  Modify SLURM's _timeout_job()
1.  Moab submits job to SLURM with original walltime limit
2.  Just prior to original walltime limit being reached,
    a. Moab decides to give job EXTENDEDVIOLATION grace time
    b. Moab emails user that job life is being extended
    c. SLURM wants to cancel the job, but waits for OverTimeLimit minutes
3.  For the next EXTENDEDVIOLATION minutes, showq and squeue show running job 
    (I expect)
4.  After EXTENDEDVIOLATION minutes have transpired
    a. Moab issues job CANCELJOB command to SLURM
    b. _timeout_job() is called with its new behavior to act immediately
    c. Job is instantly cancelled
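Option B, like Michael's current workaround, depends on OverTimeLimit being configured in slurm.conf.  A fragment matching a 12-hour EXTENDEDVIOLATION grace period might look like this (720 is an assumed value; the parameter is specified in minutes):

```
# slurm.conf fragment: let jobs run this many minutes past their time
# limit before slurmctld kills them (default is 0 = kill immediately).
OverTimeLimit=720
```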

Don
