[slurm-dev] Re: what happens after a prolog failure

Alessandro Italiano Thu, 29 Nov 2012 00:45:09 -0800

Hi,

I applied the same patch for the Prolog script, modifying the following files:


1. src/slurmctld/job_mgr.c
    line 3394:  /*job_ptr->batch_flag++; only one retry */
2. src/slurmctld/node_mgr.c
    line 1835: /*set_node_down(reg_msg->node_name, "Prolog failed");*/

We use the Prolog script to check the user's environment before the job starts.
In a multiuser computing farm it can be useful to keep the job pending and
let it land on another node, which may have the correct environment for that
user. On the other hand, setting a node down reduces the available computing
slots even though the node may provide the right environment for other users.

Thanks for the quick support.

Ale

On 11/27/2012 02:05 AM, Moe Jette wrote:
> Your patch is correct. After some discussion we decided to make this 
> the behavior of Slurm version 2.5, which will be released within a few 
> days.
>
> Quoting Alessandro Italiano <[email protected]>:
>
>>
>> Hi,
>>
>> I commented out in the following way and it works.
>>
>>
>> """"""
>>          if (status != 0) {
>>                  bool kill_job = false;
>>                  slurmctld_lock_t job_write_lock = {
>>                          NO_LOCK, WRITE_LOCK, WRITE_LOCK, NO_LOCK };
>>                  error("prolog_slurmctld job %u prolog exit status %u:%u",
>>                        job_id, WEXITSTATUS(status), WTERMSIG(status));
>>                  lock_slurmctld(job_write_lock);
>>                  /*if (last_job_requeue == job_id) {
>>                          info("prolog_slurmctld failed again for job %u",
>>                               job_id);
>>                          kill_job = true;
>>                  } else if ((rc = job_requeue(0, job_id, -1,
>>                  */
>>                  if ((rc = job_requeue(0, job_id, -1,
>>                                               (uint16_t)NO_VAL, false))) {
>>                          info("unable to requeue job %u: %m", job_id);
>>                          kill_job = true;
>>                  } else
>>                          last_job_requeue = job_id;
>>                  if (kill_job) {
>>                          srun_user_message(job_ptr,
>>                                            "PrologSlurmctld failed, job killed");
>>                          (void) job_signal(job_id, SIGKILL, 0, 0, false);
>>                  }
>>                  unlock_slurmctld(job_write_lock);
>>          } else
>>                  debug2("prolog_slurmctld job %u prolog completed", job_id);
>>
>> """""""
>>
>> Let us know whether this is the correct way to achieve our goal or not.
>>
>> thanks
>>
>> Ale
>>
>> On 11/20/2012 05:03 PM, Moe Jette wrote:
>>> I'm not sure that you want to requeue the job an indefinite number of
>>> times, but if that's the case look in src/slurmctld/job_scheduler.c
>>> around line 2135. Just comment out the line "kill_job = true" on
>>> repeated failures of the PrologSlurmctld.
>>>
>>> Quoting Alessandro Italiano<[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> I've done several tests and it seems that after one or two
>>>> PrologSlurmctld failures
>>>> the batch job [submitted in this way: sbatch -p debug ale.sh] is
>>>> being canceled.
>>>>
>>>> this is an example of the job status reported by sacct command
>>>>
>>>> """"""""""""""""""""""""""""""""""""""""""""""
>>>>    [root@pccms60 ~]# sacct -j 77
>>>>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
>>>> ------------ ---------- ---------- ---------- ---------- ---------- --------
>>>> 77               ale.sh      debug       root          1  NODE_FAIL      0:0
>>>>
>>>>
>>>> [root@pccms60 ~]# scontrol sho  config | grep JobRequeue
>>>> JobRequeue              = 1
>>>>
>>>> [root@pccms60 ~]# scontrol sho node| grep State
>>>>      State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>>>>      State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>>>>      State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>>>>
>>>> [root@pccms60 ~]# scontrol -V
>>>> slurm 2.4.4
>>>> """"""""""""""""""""
>>>>
>>>>
>>>> Is it possible to always requeue the job upon a PrologSlurmctld 
>>>> failure ?
>>>>
>>>> thanks in advance
>>>>
>>>> Ale
>>>>
>>>> On 11/19/2012 06:08 PM, Moe Jette wrote:
>>>>> I've looked at the code and it is somewhat different from what I
>>>>> thought. If the  PrologSlurmctld fails then batch jobs get requeued.
>>>>> Interactive jobs (salloc and srun) will be killed.
>>>>> diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
>>>>> index f45e483..96016bd 100644
>>>>> --- a/doc/man/man5/slurm.conf.5
>>>>> +++ b/doc/man/man5/slurm.conf.5
>>>>> @@ -1229,7 +1229,10 @@ also be used to specify more than one program
>>>>> to run (e.g.
>>>>>     the first job step.  The prolog script or scripts may be used to
>>>>> purge files,
>>>>>     enable user login, etc.  By default there is no prolog. Any
>>>>> configured script
>>>>>     is expected to complete execution quickly (in less time than
>>>>> -\fBMessageTimeout\fR).  See \fBProlog and Epilog Scripts\fR for more
>>>>> information.
>>>>> +\fBMessageTimeout\fR).
>>>>> +If the prolog fails (returns a non\-zero exit code), this will
>>>>> result in the
>>>>> +node being set to a DOWN state and the job requeued to executed on
>>>>> another node.
>>>>> +See \fBProlog and Epilog Scripts\fR for more information.
>>>>>
>>>>>     .TP
>>>>>     \fBPrologSlurmctld\fR
>>>>> @@ -1250,7 +1253,7 @@ If some node can not be made available for use,
>>>>> the program should drain
>>>>>     the node (typically using the scontrol command) and terminate 
>>>>> with a
>>>>> non\-zero
>>>>>     exit code.
>>>>>     A non\-zero exit code will result in the job being requeued
>>>>> (where possible)
>>>>> -or killed.
>>>>> +or killed. Note that only batch jobs can be requeued.
>>>>>     See \fBProlog and Epilog Scripts\fR for more information.
>>>>>
>>>>>     .TP
>>>>>
>>>>>
>>>>>
>>>>> Quoting Alessandro Italiano<[email protected]>:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> we are going to evaluate slurm as batch system for our computing
>>>>>> farm[14k computing slots].
>>>>>>
>>>>>> I've done some tests using the prolog script and I've noticed that
>>>>>>
>>>>>> 1. when the "Prolog" script fails, the host where it failed is
>>>>>>    flagged as DOWN and the job is stuck in PENDING state.
>>>>>> 2. when the "PrologSlurmctld" script fails, the job is CANCELLED.
>>>>>>
>>>>>>
>>>>>> first of all, can someone confirm that this is the expected 
>>>>>> behavior ?
>>>>>>
>>>>>> Is there a way to configure slurm to automatically dispatch a job
>>>>>> to a new host when the "Prolog" script fails?
>>>>>>
>>>>>> unfortunately I didn't find any answer to my questions in the 
>>>>>> "Prolog
>>>>>> and Epilog Scripts" section of the slurm.conf man page
>>>>>>
>>>>>> thanks in advance
>>>>>>
>>>>>> Alessandro
>>>>>>
>>
>
>
>
