[slurm-dev] Re: what happens after a prolog failure

Moe Jette Thu, 29 Nov 2012 09:45:06 -0800

For what you want to do, having the Prolog cancel a job that has a bad
environment and then return an exit code of 0 may be a better solution.
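
A minimal sketch of such a Prolog as a POSIX shell script. The environment
check (does the job owner's home directory exist), CHECK_HOME_BASE and the
SCANCEL override are assumptions made up for illustration, not something
from this thread. SLURM_JOB_ID and SLURM_JOB_USER are assumed to be set by
slurmd for the Prolog; check the documentation for your version.

```shell
#!/bin/sh
# Sketch: cancel a job with a bad environment, but never fail the Prolog.

# Hypothetical knobs so the script can be exercised outside Slurm.
SCANCEL="${SCANCEL:-scancel}"
CHECK_HOME_BASE="${CHECK_HOME_BASE:-/home}"

check_user_env() {
    # Hypothetical check: replace with the site's real environment test.
    [ -d "$CHECK_HOME_BASE/$1" ]
}

run_prolog() {
    if ! check_user_env "$SLURM_JOB_USER"; then
        # Bad environment: cancel the job ourselves.
        "$SCANCEL" "$SLURM_JOB_ID"
    fi
    # Always report success so the node is not set DOWN.
    return 0
}

# Only act when started with a job id in the environment.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_prolog
fi
```

Since the script returns 0 in both cases, slurmd sees no Prolog failure,
and the node stays available to other users.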


Quoting Alessandro Italiano <[email protected]>:

>
> Hi,
>
> I applied the same patch for the Prolog script, modifying the following
> files
>
> 1. src/slurmctld/job_mgr.c
>     line 3394:  /*job_ptr->batch_flag++; only one retry */
> 2. src/slurmctld/node_mgr.c
>     line 1835: /*set_node_down(reg_msg->node_name, "Prolog failed");*/
>
> We use the prolog script to check the user's environment before the job
> starts. In a multiuser computing farm it can be useful to keep the job
> pending and let it land on another node, which may have the correct
> environment for that user. On the other hand, setting a node down can
> reduce the available computing slots even though the node could provide
> the right environment for other users.
>
>
> thanks for the  quick support
>
> Ale
>
> On 11/27/2012 02:05 AM, Moe Jette wrote:
>> Your patch is correct. After some discussion we decided to make this
>> the behavior of Slurm version 2.5, which will be released within a few
>> days.
>>
>> Quoting Alessandro Italiano <[email protected]>:
>>
>>>
>>> Hi,
>>>
>>> I commented out in the following way and it works.
>>>
>>>
>>> """"""
>>> if (status != 0) {
>>>         bool kill_job = false;
>>>         slurmctld_lock_t job_write_lock = {
>>>                 NO_LOCK, WRITE_LOCK, WRITE_LOCK, NO_LOCK };
>>>         error("prolog_slurmctld job %u prolog exit status %u:%u",
>>>               job_id, WEXITSTATUS(status), WTERMSIG(status));
>>>         lock_slurmctld(job_write_lock);
>>>         /*if (last_job_requeue == job_id) {
>>>                 info("prolog_slurmctld failed again for job %u",
>>>                      job_id);
>>>                 kill_job = true;
>>>         } else if ((rc = job_requeue(0, job_id, -1,
>>>         */
>>>         if ((rc = job_requeue(0, job_id, -1,
>>>                               (uint16_t)NO_VAL, false))) {
>>>                 info("unable to requeue job %u: %m", job_id);
>>>                 kill_job = true;
>>>         } else
>>>                 last_job_requeue = job_id;
>>>         if (kill_job) {
>>>                 srun_user_message(job_ptr,
>>>                                   "PrologSlurmctld failed, job killed");
>>>                 (void) job_signal(job_id, SIGKILL, 0, 0, false);
>>>         }
>>>         unlock_slurmctld(job_write_lock);
>>> } else
>>>         debug2("prolog_slurmctld job %u prolog completed", job_id);
>>>
>>> """""""
>>>
>>> Let us know whether this is the correct way to achieve our goal or not.
>>>
>>> thanks
>>>
>>> Ale
>>>
>>> On 11/20/2012 05:03 PM, Moe Jette wrote:
>>>> I'm not sure that you want to requeue the job an indefinite number of
>>>> times, but if that's the case look in src/slurmctld/job_scheduler.c
>>>> around line 2135. Just comment out the line "kill_job = true" on
>>>> repeated failures of the PrologSlurmctld.
>>>>
>>>> Quoting Alessandro Italiano<[email protected]>:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've done several tests and it seems that after one or two
>>>>> PrologSlurmctld failures
>>>>> the batch job [submitted in this way: sbatch -p debug ale.sh] is
>>>>> being canceled.
>>>>>
>>>>> this is an example of the job status reported by sacct command
>>>>>
>>>>> """"""""""""""""""""""""""""""""""""""""""""""
>>>>>    [root@pccms60 ~]# sacct -j 77
>>>>>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
>>>>> ------------ ---------- ---------- ---------- ---------- ---------- --------
>>>>> 77               ale.sh      debug       root          1  NODE_FAIL      0:0
>>>>>
>>>>>
>>>>> [root@pccms60 ~]# scontrol sho  config | grep JobRequeue
>>>>> JobRequeue              = 1
>>>>>
>>>>> [root@pccms60 ~]# scontrol sho node| grep State
>>>>>      State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>>>>>      State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>>>>>      State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>>>>>
>>>>> [root@pccms60 ~]# scontrol -V
>>>>> slurm 2.4.4
>>>>> """"""""""""""""""""
>>>>>
>>>>>
>>>>> Is it possible to always requeue the job upon a PrologSlurmctld
>>>>> failure ?
>>>>>
>>>>> thanks in advance
>>>>>
>>>>> Ale
>>>>>
>>>>> On 11/19/2012 06:08 PM, Moe Jette wrote:
>>>>>> I've looked at the code and it is somewhat different from what I
>>>>>> thought. If the  PrologSlurmctld fails then batch jobs get requeued.
>>>>>> Interactive jobs (salloc and srun) will be killed.
>>>>>>
>>>>>> diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
>>>>>> index f45e483..96016bd 100644
>>>>>> --- a/doc/man/man5/slurm.conf.5
>>>>>> +++ b/doc/man/man5/slurm.conf.5
>>>>>> @@ -1229,7 +1229,10 @@ also be used to specify more than one program to run (e.g.
>>>>>>  the first job step.  The prolog script or scripts may be used to purge files,
>>>>>>  enable user login, etc.  By default there is no prolog. Any configured script
>>>>>>  is expected to complete execution quickly (in less time than
>>>>>> -\fBMessageTimeout\fR).  See \fBProlog and Epilog Scripts\fR for more information.
>>>>>> +\fBMessageTimeout\fR).
>>>>>> +If the prolog fails (returns a non\-zero exit code), this will result in the
>>>>>> +node being set to a DOWN state and the job requeued to executed on another node.
>>>>>> +See \fBProlog and Epilog Scripts\fR for more information.
>>>>>>
>>>>>>  .TP
>>>>>>  \fBPrologSlurmctld\fR
>>>>>> @@ -1250,7 +1253,7 @@ If some node can not be made available for use, the program should drain
>>>>>>  the node (typically using the scontrol command) and terminate with a non\-zero
>>>>>>  exit code.
>>>>>>  A non\-zero exit code will result in the job being requeued (where possible)
>>>>>> -or killed.
>>>>>> +or killed. Note that only batch jobs can be requeued.
>>>>>>  See \fBProlog and Epilog Scripts\fR for more information.
>>>>>>
>>>>>>  .TP
>>>>>>
>>>>>>
>>>>>>
>>>>>> Quoting Alessandro Italiano<[email protected]>:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> we are going to evaluate slurm as batch system for our computing
>>>>>>> farm[14k computing slots].
>>>>>>>
>>>>>>> I've done some tests using the prolog script and I've noticed that
>>>>>>>
>>>>>>> 1. when the "Prolog" script fails, the host where it failed is
>>>>>>>    flagged as DOWN and the job stays stuck in PENDING status.
>>>>>>> 2. when the "PrologSlurmctld" script fails the job is CANCELLED.
>>>>>>>
>>>>>>>
>>>>>>> first of all, can someone confirm that this is the expected
>>>>>>> behavior ?
>>>>>>>
>>>>>>> Is there a way to configure slurm in order to automatically dispatch
>>>>>>> a job on a new host when the "Prolog" script fails?
>>>>>>>
>>>>>>> unfortunately I didn't find any answer to my questions in the
>>>>>>> "Prolog
>>>>>>> and Epilog Scripts" section of the slurm.conf man page
>>>>>>>
>>>>>>> thanks in advance
>>>>>>>
>>>>>>> Alessandro
>>>>>>>
>>>
>>
>>
>>
>
