[slurm-dev] Re: what happens after a prolog failure

Alessandro Italiano Mon, 26 Nov 2012 01:59:08 -0800

Hi,

I commented out the repeated-failure check in the following way, and it works.



""""""
if (status != 0) {
        bool kill_job = false;
        slurmctld_lock_t job_write_lock = {
                NO_LOCK, WRITE_LOCK, WRITE_LOCK, NO_LOCK };
        error("prolog_slurmctld job %u prolog exit status %u:%u",
              job_id, WEXITSTATUS(status), WTERMSIG(status));
        lock_slurmctld(job_write_lock);
        /*if (last_job_requeue == job_id) {
                info("prolog_slurmctld failed again for job %u",
                     job_id);
                kill_job = true;
        } else if ((rc = job_requeue(0, job_id, -1,
        */
        if ((rc = job_requeue(0, job_id, -1,
                              (uint16_t)NO_VAL, false))) {
                info("unable to requeue job %u: %m", job_id);
                kill_job = true;
        } else
                last_job_requeue = job_id;
        if (kill_job) {
                srun_user_message(job_ptr,
                                  "PrologSlurmctld failed, job killed");
                (void) job_signal(job_id, SIGKILL, 0, 0, false);
        }
        unlock_slurmctld(job_write_lock);
} else
        debug2("prolog_slurmctld job %u prolog completed", job_id);

"""""""

Let us know whether this is the correct way to achieve our goal.
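
A possible middle ground between killing the job on the second failure and
requeueing it forever is to cap the number of retries. The following is only
a minimal standalone sketch, not Slurm code: MAX_PROLOG_REQUEUES,
struct prolog_retry and should_requeue() are hypothetical names introduced
for illustration, and the cap of 3 is an arbitrary assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cap on requeues; this is NOT a Slurm configuration parameter. */
#define MAX_PROLOG_REQUEUES 3

/* Hypothetical record of the most recently failed job and its failure count. */
struct prolog_retry {
        uint32_t job_id;
        uint32_t failures;
};

/*
 * Record one PrologSlurmctld failure for job_id.
 * Returns true while the job may still be requeued,
 * false once the cap has been exceeded.
 */
static bool should_requeue(struct prolog_retry *r, uint32_t job_id)
{
        if (r->job_id != job_id) {
                /* A different job failed: start counting from zero. */
                r->job_id = job_id;
                r->failures = 0;
        }
        r->failures++;
        return r->failures <= MAX_PROLOG_REQUEUES;
}
```

In the block above, a helper like this would take the place of the
commented-out comparison against last_job_requeue: requeue while it returns
true, otherwise set kill_job. Like last_job_requeue, a single record loses
its count when another job's prolog fails in between, so a per-job counter
would be more robust.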

thanks

Ale

On 11/20/2012 05:03 PM, Moe Jette wrote:
> I'm not sure that you want to requeue the job an indefinite number of
> times, but if that's the case look in src/slurmctld/job_scheduler.c
> around line 2135. Just comment out the line "kill_job = true" on
> repeated failures of the PrologSlurmctld.
>
> Quoting Alessandro Italiano<[email protected]>:
>
>> Hi,
>>
>> I've done several tests and it seems that after one or two
>> PrologSlurmctld failures
>> the batch job [submitted in this way: sbatch -p debug ale.sh] is
>> being canceled.
>>
>> This is an example of the job status reported by the sacct command:
>>
>> """"""""""""""""""""""""""""""""""""""""""""""
>> [root@pccms60 ~]# sacct -j 77
>>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
>> ------------ ---------- ---------- ---------- ---------- ---------- --------
>> 77               ale.sh      debug       root          1  NODE_FAIL      0:0
>>
>>
>> [root@pccms60 ~]# scontrol sho  config | grep JobRequeue
>> JobRequeue              = 1
>>
>> [root@pccms60 ~]# scontrol sho node| grep State
>>      State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>>      State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>>      State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>>
>> [root@pccms60 ~]# scontrol -V
>> slurm 2.4.4
>> """"""""""""""""""""
>>
>>
>> Is it possible to always requeue the job upon a PrologSlurmctld failure ?
>>
>> thanks in advance
>>
>> Ale
>>
>> On 11/19/2012 06:08 PM, Moe Jette wrote:
>>> I've looked at the code and it is somewhat different from what I
>>> thought. If the  PrologSlurmctld fails then batch jobs get requeued.
>>> Interactive jobs (salloc and srun) will be killed.
>>>
>>> diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
>>> index f45e483..96016bd 100644
>>> --- a/doc/man/man5/slurm.conf.5
>>> +++ b/doc/man/man5/slurm.conf.5
>>> @@ -1229,7 +1229,10 @@ also be used to specify more than one program to run (e.g.
>>>  the first job step.  The prolog script or scripts may be used to purge files,
>>>  enable user login, etc.  By default there is no prolog. Any configured script
>>>  is expected to complete execution quickly (in less time than
>>> -\fBMessageTimeout\fR).  See \fBProlog and Epilog Scripts\fR for more information.
>>> +\fBMessageTimeout\fR).
>>> +If the prolog fails (returns a non\-zero exit code), this will result in the
>>> +node being set to a DOWN state and the job requeued to executed on another node.
>>> +See \fBProlog and Epilog Scripts\fR for more information.
>>>
>>>  .TP
>>>  \fBPrologSlurmctld\fR
>>> @@ -1250,7 +1253,7 @@ If some node can not be made available for use, the program should drain
>>>  the node (typically using the scontrol command) and terminate with a non\-zero
>>>  exit code.
>>>  A non\-zero exit code will result in the job being requeued (where possible)
>>> -or killed.
>>> +or killed. Note that only batch jobs can be requeued.
>>>  See \fBProlog and Epilog Scripts\fR for more information.
>>>
>>>  .TP
>>>
>>>
>>>
>>> Quoting Alessandro Italiano<[email protected]>:
>>>
>>>> Hi
>>>>
>>>> We are going to evaluate Slurm as the batch system for our computing
>>>> farm [14k computing slots].
>>>>
>>>> I've done some tests using the prolog script and I've noticed that:
>>>>
>>>> 1. when the "Prolog" script fails, the host where it failed is flagged
>>>>    as DOWN and the job gets stuck in PENDING status.
>>>> 2. when the "PrologSlurmctld" script fails, the job is CANCELLED.
>>>>
>>>>
>>>> First of all, can someone confirm that this is the expected behavior?
>>>>
>>>> Is there a way to configure Slurm to automatically dispatch a job on
>>>> a new host when the "Prolog" script fails?
>>>>
>>>> unfortunately I didn't find any answer to my questions in the "Prolog
>>>> and Epilog Scripts" section of the slurm.conf man page
>>>>
>>>> thanks in advance
>>>>
>>>> Alessandro
>>>>
