Hi,
I applied the same patch for the Prolog script, modifying the following
files:
1. src/slurmctld/job_mgr.c
line 3394: /*job_ptr->batch_flag++; only one retry */
2. src/slurmctld/node_mgr.c
line 1835: /*set_node_down(reg_msg->node_name, "Prolog failed");*/
We use the Prolog script to check the user's environment before the job starts.
In a multiuser computing farm it can be useful to keep the job pending and
let it land on another node, which may have the correct environment for
that user.
On the other hand, setting a node down reduces the available computing
slots, even though the node may provide the right environment for other users.
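
To make the intent concrete, here is a minimal stand-alone sketch of the
retry policy we are after. It is only a model, not the slurmctld code:
prolog_policy_t, on_prolog_failure and the single_retry switch are names
invented for illustration, and requeue_ok stands in for whatever the real
requeue call reports.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Outcome of handling one failed prolog run (illustrative names). */
typedef enum { ACTION_REQUEUE, ACTION_KILL } prolog_action_t;

/* Hypothetical stand-in for the controller's "last requeued job" state. */
typedef struct {
	uint32_t last_job_requeue; /* 0 means no job has been requeued yet */
	bool single_retry;         /* true: give up on the second failure */
} prolog_policy_t;

/*
 * Decide what to do after a prolog exits non-zero.
 * requeue_ok models whether the requeue request itself succeeded.
 */
prolog_action_t on_prolog_failure(prolog_policy_t *p, uint32_t job_id,
				  bool requeue_ok)
{
	assert(p != NULL);

	/* Unpatched model: the same job failing twice in a row is killed. */
	if (p->single_retry && p->last_job_requeue == job_id)
		return ACTION_KILL;

	/* A job that cannot be requeued at all is killed in both models. */
	if (!requeue_ok)
		return ACTION_KILL;

	/* Remember the job and leave it pending for another attempt. */
	p->last_job_requeue = job_id;
	return ACTION_REQUEUE;
}
```

With single_retry set to false, a job whose prolog keeps failing is requeued
every time until the requeue request itself is refused; with it set to true,
the second consecutive failure of the same job kills it.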
Thanks for the quick support.
Ale
On 11/27/2012 02:05 AM, Moe Jette wrote:
> Your patch is correct. After some discussion we decided to make this
> the behavior of Slurm version 2.5, which will be released within a few
> days.
>
> Quoting Alessandro Italiano <[email protected]>:
>
>>
>> Hi,
>>
>> I commented out the check in the following way, and it works.
>>
>>
>> """"""
>> if (status != 0) {
>>         bool kill_job = false;
>>         slurmctld_lock_t job_write_lock = {
>>                 NO_LOCK, WRITE_LOCK, WRITE_LOCK, NO_LOCK };
>>         error("prolog_slurmctld job %u prolog exit status %u:%u",
>>               job_id, WEXITSTATUS(status), WTERMSIG(status));
>>         lock_slurmctld(job_write_lock);
>>         /*if (last_job_requeue == job_id) {
>>                 info("prolog_slurmctld failed again for job %u",
>>                      job_id);
>>                 kill_job = true;
>>         } else if ((rc = job_requeue(0, job_id, -1,
>>         */
>>         if ((rc = job_requeue(0, job_id, -1,
>>                               (uint16_t)NO_VAL, false))) {
>>                 info("unable to requeue job %u: %m", job_id);
>>                 kill_job = true;
>>         } else
>>                 last_job_requeue = job_id;
>>         if (kill_job) {
>>                 srun_user_message(job_ptr,
>>                                   "PrologSlurmctld failed, job killed");
>>                 (void) job_signal(job_id, SIGKILL, 0, 0, false);
>>         }
>>         unlock_slurmctld(job_write_lock);
>> } else
>>         debug2("prolog_slurmctld job %u prolog completed", job_id);
>>
>> """""""
>>
>> Let us know whether this is the correct way to achieve our goal.
>>
>> thanks
>>
>> Ale
>>
>> On 11/20/2012 05:03 PM, Moe Jette wrote:
>>> I'm not sure that you want to requeue the job an indefinite number of
>>> times, but if that's the case look in src/slurmctld/job_scheduler.c
>>> around line 2135. Just comment out the line "kill_job = true" on
>>> repeated failures of the PrologSlurmctld.
>>>
>>> Quoting Alessandro Italiano<[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> I've done several tests and it seems that after one or two
>>>> PrologSlurmctld failures
>>>> the batch job [submitted in this way: sbatch -p debug ale.sh] is
>>>> being canceled.
>>>>
>>>> This is an example of the job status reported by the sacct command:
>>>>
>>>> """"""""""""""""""""""""""""""""""""""""""""""
>>>> [root@pccms60 ~]# sacct -j 77
>>>>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
>>>> ------------ ---------- ---------- ---------- ---------- ---------- --------
>>>> 77               ale.sh      debug       root          1  NODE_FAIL      0:0
>>>>
>>>>
>>>> [root@pccms60 ~]# scontrol sho config | grep JobRequeue
>>>> JobRequeue = 1
>>>>
>>>> [root@pccms60 ~]# scontrol sho node| grep State
>>>> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>>>> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>>>> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>>>>
>>>> [root@pccms60 ~]# scontrol -V
>>>> slurm 2.4.4
>>>> """"""""""""""""""""
>>>>
>>>>
>>>> Is it possible to always requeue the job upon a PrologSlurmctld failure?
>>>>
>>>> thanks in advance
>>>>
>>>> Ale
>>>>
>>>> On 11/19/2012 06:08 PM, Moe Jette wrote:
>>>>> I've looked at the code and it is somewhat different from what I
>>>>> thought. If the PrologSlurmctld fails then batch jobs get requeued.
>>>>> Interactive jobs (salloc and srun) will be killed.
>>>>>
>>>>> diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
>>>>> index f45e483..96016bd 100644
>>>>> --- a/doc/man/man5/slurm.conf.5
>>>>> +++ b/doc/man/man5/slurm.conf.5
>>>>> @@ -1229,7 +1229,10 @@ also be used to specify more than one program to run (e.g.
>>>>> the first job step. The prolog script or scripts may be used to purge files,
>>>>> enable user login, etc. By default there is no prolog. Any configured script
>>>>> is expected to complete execution quickly (in less time than
>>>>> -\fBMessageTimeout\fR). See \fBProlog and Epilog Scripts\fR for more information.
>>>>> +\fBMessageTimeout\fR).
>>>>> +If the prolog fails (returns a non\-zero exit code), this will result in the
>>>>> +node being set to a DOWN state and the job requeued to executed on another node.
>>>>> +See \fBProlog and Epilog Scripts\fR for more information.
>>>>>
>>>>> .TP
>>>>> \fBPrologSlurmctld\fR
>>>>> @@ -1250,7 +1253,7 @@ If some node can not be made available for use, the program should drain
>>>>> the node (typically using the scontrol command) and terminate with a non\-zero
>>>>> exit code.
>>>>> A non\-zero exit code will result in the job being requeued (where possible)
>>>>> -or killed.
>>>>> +or killed. Note that only batch jobs can be requeued.
>>>>> See \fBProlog and Epilog Scripts\fR for more information.
>>>>>
>>>>> .TP
>>>>>
>>>>>
>>>>>
>>>>> Quoting Alessandro Italiano<[email protected]>:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> We are going to evaluate Slurm as the batch system for our computing
>>>>>> farm (14k computing slots).
>>>>>>
>>>>>> I've done some tests using the prolog script and I've noticed that
>>>>>>
>>>>>> 1. When the "Prolog" script fails, the host where it failed is
>>>>>> flagged as DOWN and the job is stuck in PENDING status.
>>>>>> 2. When the "PrologSlurmctld" script fails, the job is CANCELLED.
>>>>>>
>>>>>>
>>>>>> First of all, can someone confirm that this is the expected
>>>>>> behavior?
>>>>>>
>>>>>> Is there a way to configure Slurm so that it automatically
>>>>>> dispatches a job to a new host when the "Prolog" script fails?
>>>>>>
>>>>>> Unfortunately I didn't find any answer to my questions in the
>>>>>> "Prolog and Epilog Scripts" section of the slurm.conf man page.
>>>>>>
>>>>>> thanks in advance
>>>>>>
>>>>>> Alessandro
>>>>>>