Vladimir, when you get that 'hairs on the back of your neck' feeling, it
often indicates something real.
However, you do have to be scientific about this: if you think that uptime
is an influence, record job startup times each hour and plot them (see the
sketch below).
Be scientific.
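
Something like the following, run on a submit host from cron or a loop, would
give you data to plot (a rough sketch; the log path and the hourly interval
are my own assumptions, adjust to taste):

    # Time how long a trivial srun takes to get an allocation, once an hour,
    # and append it to a log you can plot later (gnuplot, spreadsheet, ...)
    while true; do
        start=$(date +%s)
        srun -N1 -n1 /bin/true
        end=$(date +%s)
        echo "$(date -Is) $((end - start))" >> /tmp/srun_latency.log
        sleep 3600
    done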

I would also suggest watching a tail -f on the Slurm logs while you submit
a job; you might get some indication of where the slowdown is.
Have you increased the debug level in the logs (*SlurmctldDebug*)?
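
For example (the log path is a guess; check SlurmctldLogFile in your
slurm.conf for the real one):

    # Raise the controller's log verbosity at runtime ...
    scontrol setdebug debug3
    # ... then watch the controller log while submitting a test job
    # from another shell
    tail -f /var/log/slurmctld.log

You can drop the verbosity back afterwards with 'scontrol setdebug info'.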

Finally, my one piece of advice to everyone managing batch systems: it is a
name resolution problem. No, really, it is.
Even if your cluster catches fire, the real reason that your jobs are not
being submitted is that the DNS resolver is burning and the scheduler can't
resolve the hostname of the submit host.
Joking aside, many, many problems with batch systems are due to name
resolution.
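
A quick sanity check from the ControlMachine (and, in the other direction,
from a compute node back to the controller) would be something along these
lines:

    # Check that every node name Slurm knows about resolves quickly and
    # consistently; getent goes through the same NSS lookup path the
    # daemons use
    for host in $(sinfo -N -h -o "%N" | sort -u); do
        time getent hosts "$host"
    done

If any of those lookups takes anywhere near your one-minute lag, you have
found your culprit.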

On 11 October 2017 at 09:33, Vladimir Daric <
vladimir.da...@ips2.universite-paris-saclay.fr> wrote:

> Hello,
>
> We are running a 10-node cluster in our lab and we are experiencing a job
> allocation lag.
>
> srun commands wait up to one minute for resource allocation even if there
> are several idle nodes. It's the same with sbatch scripts: even though there
> are idle nodes, jobs wait for about one minute for resource allocation.
>
> Our ControlMachine is on a virtual node. Compute nodes are all physical
> machines.
>
> In our config file we set those values :
> FastSchedule=1
> SchedulerType=sched/backfill
>
> I feel like after a whole-cluster reboot jobs are scheduled pretty fast,
> and after a few weeks of uptime job scheduling slows down (at the moment the
> ControlMachine uptime is 25 days). I'm not quite sure whether those are related.
>
> Everything looks in order; there are no errors in the log files ...
>
> I'll be grateful for any hint ... or advice.
>
> Thanks,
> Vladimir