This is from June 14:
Hi,
We have a user claiming that his job was not requeued when the node failed.
Slurmctld detects the missing job when the node is rebooted and slurmd sends
the registration message.
In these cases, slurmctld just calls job_complete with requeue=0 and
node_fail=1. I wonder why it is not possible to requeue a job when this
happens. Maybe there is a complex interaction that I cannot see.
Also, slurmctld logs the message "Job 10529777 cancelled from
interactive user", which is not the case, but the code triggers it here
(line 3463 of job_mgr.c):
    if ((job_return_code == NO_VAL) &&
        (IS_JOB_RUNNING(job_ptr) || IS_JOB_PENDING(job_ptr))) {
            info("Job %u cancelled from interactive user", job_ptr->job_id);
    }
An extra check for node_fail should probably be added to this condition, so
that a node failure is not reported as a user cancellation.
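To make the suggestion concrete, here is a minimal standalone sketch of how such a check might look. It is not the actual slurmctld code and not a tested patch: the names sketch_job, sketch_completion_reason and SKETCH_NO_VAL are hypothetical stand-ins for job_ptr, the logic around the snippet above, and NO_VAL, and printf stands in for info(). It only illustrates testing node_fail before logging the "cancelled from interactive user" message.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for Slurm's NO_VAL sentinel (defined in slurm.h). */
#define SKETCH_NO_VAL 0xfffffffeU

/* Why the job is being completed, as far as this check can tell. */
enum sketch_reason { REASON_NONE, REASON_NODE_FAIL, REASON_USER_CANCEL };

/* Hypothetical, reduced stand-in for the job record behind job_ptr. */
struct sketch_job {
	uint32_t job_id;
	bool running;	/* stands in for IS_JOB_RUNNING(job_ptr) */
	bool pending;	/* stands in for IS_JOB_PENDING(job_ptr) */
};

/*
 * Same condition as the snippet above, but node_fail is checked first so
 * that a node failure is not logged as a cancellation by the user.
 */
static enum sketch_reason
sketch_completion_reason(const struct sketch_job *job,
			 uint32_t job_return_code, bool node_fail)
{
	if ((job_return_code != SKETCH_NO_VAL) ||
	    !(job->running || job->pending))
		return REASON_NONE;

	if (node_fail) {
		printf("Job %u terminated due to node failure\n",
		       (unsigned) job->job_id);
		return REASON_NODE_FAIL;
	}

	printf("Job %u cancelled from interactive user\n",
	       (unsigned) job->job_id);
	return REASON_USER_CANCEL;
}
```

With node_fail=1 the sketch reports a node failure; with node_fail=0 it keeps the existing message. Whether the job can also be requeued at that point is a separate question that this sketch does not address.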
On 06/20/2013 08:09 AM, Mario Kadastik wrote:
>
>> One note: Only batch jobs will be requeued. We can't do much for jobs
>> initiated by salloc or srun.
>>
> That would be fine, most of our jobs are sbatch submissions.
>
>
>> Quoting Aaron Knister <[email protected]>:
>>
>>> SLURM can and will, I believe by default, resubmit jobs that fail
>>> due to node failures recognized by slurmctld that put the node in an
>>> offline state. This doesn't help you, however, as SLURM doesn't appear
>>> to notice these failures.
>>>
>>> I wonder if a SPANK plugin could do the job here.
>>>
> Yes, resubmit on node failure is OK, but sometimes the job discovers the
> failure before the health check script does, because the job is actively
> using the service that fails, while the health check runs only every ~5
> minutes. So yes, it would be nice to have a flag that can be set at
> submission time (it should be up to the user to choose whether (s)he wants
> a resubmit or not).
>
> Thanks,
>
> Mario Kadastik, PhD
> Researcher
>
> ---
> "Physics is like sex, sure it may have practical reasons, but that's not
> why we do it"
> -- Richard P. Feynman
>