Dear John,

Thank you for your answer. Obviously you are right that I could run everything through Slurm and thus avoid the issue, and your points are taken. However, I still insist that it is a serious bug not to take the actual CPU load into account when the scheduler submits a job, regardless of whose fault it is that a non-Slurm job is running. I would not expect that from even the simplest scheduler, and if I had had such prior knowledge I would not have invested so much time and effort in setting up Slurm.

Best regards,
Ketiw

On Sat, Mar 18, 2017 at 5:42 PM, John Hearns <[email protected]> wrote:
>
> Kesim,
>
> What you are saying is that Slurm schedules tasks based on the number of
> allocated CPUs, rather than the actual load factor on the server.
> As I recall, Gridengine actually used the load factor.
>
> However, you comment that "users run programs on the nodes" and "the slurm
> is aware about the load of non-slurm jobs".
> IMHO, in any well-run HPC setup any user running jobs without using the
> scheduler would have their fingers broken, or at least bruised with the
> clue stick.
>
> Seriously, three points:
>
> a) Tell users to use 'salloc' and 'srun' to run interactive jobs. They
> can easily open a Bash session on a compute node and do what they like,
> under the Slurm scheduler.
>
> b) Implement the pam_slurm PAM module. It is a few minutes' work. This
> means your users cannot go behind the Slurm scheduler and log into the
> nodes.
>
> c) On Bright clusters, which I configure, you have a healthcheck running
> which warns you when a user is detected logging in without using Slurm.
>
> Seriously again: you have implemented an HPC infrastructure, and have
> gone to the time and effort to implement a batch scheduling system.
> A batch scheduler can be adapted to let your users do their jobs,
> including interactive shell sessions and remote visualization sessions.
> Do not let the users ride roughshod over you.
>
> ________________________________________
> From: kesim [[email protected]]
> Sent: 18 March 2017 16:16
> To: slurm-dev
> Subject: [slurm-dev] Re: Fwd: Scheduling jobs according to the CPU load
>
> Unbelievable, but it seems that nobody knows how to do that. It is
> astonishing that such a sophisticated system fails at such a simple
> problem. Slurm is aware of the CPU load of non-Slurm jobs, but it does
> not use the information. My original understanding of LLN was apparently
> correct.
> I can practically kill the CPUs on a particular node with non-Slurm
> tasks, and Slurm will still diligently submit 7 jobs to that node,
> leaving others idling. I consider this a serious bug in this program.
>
>
> On Fri, Mar 17, 2017 at 10:32 AM, kesim <[email protected]> wrote:
> Dear All,
> Yesterday I did some tests and it seemed that the scheduling was
> following CPU load, but I was wrong.
> My configuration is at the moment:
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU,CR_LLN
>
> Today I submitted 70 threaded jobs to the queue and here is the CPU_LOAD
> info:
> node1 0.08 7/0/0/7
> node2 0.01 7/0/0/7
> node3 0.00 7/0/0/7
> node4 2.97 7/0/0/7
> node5 0.00 7/0/0/7
> node6 0.01 7/0/0/7
> node7 0.00 7/0/0/7
> node8 0.05 7/0/0/7
> node9 0.07 7/0/0/7
> node10 0.38 7/0/0/7
> node11 0.01 0/7/0/7
> As you can see, it allocated 7 CPUs on node4, with CPU_LOAD 2.97, and 0
> CPUs on the idling node11. Why is such a simple thing not the default?
> What am I missing?
>
> On Thu, Mar 16, 2017 at 7:53 PM, kesim <[email protected]> wrote:
> Thank you for the great suggestion. It is working! However, the
> description of CR_LLN is misleading: "Schedule resources to jobs on the
> least loaded nodes (based upon the number of idle CPUs)". I understood
> this to mean that if two nodes have CPUs that are not fully allocated,
> the node with the smaller number of allocated CPUs takes precedence.
> Therefore the bracketed comment should be removed from the description.
>
> On Thu, Mar 16, 2017 at 6:24 PM, Paul Edmon <[email protected]> wrote:
>
> You should look at LLN (least loaded nodes):
>
> https://slurm.schedmd.com/slurm.conf.html
>
> That should do what you want.
>
> -Paul Edmon-
>
> On 03/16/2017 12:54 PM, kesim wrote:
>
> ---------- Forwarded message ----------
> From: kesim <[email protected]>
> Date: Thu, Mar 16, 2017 at 5:50 PM
> Subject: Scheduling jobs according to the CPU load
> To: [email protected]
>
>
> Hi all,
>
> I am a new user and I created a small network of 11 nodes, 7 CPUs per
> node, out of users' desktops.
> I configured Slurm as:
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU
> When I submit a task with srun -n70 task,
> it will fill 10 nodes with 7 tasks/node. However, I have no clue what
> the algorithm for choosing the nodes is. Users run programs on the
> nodes, and some nodes are busier than others. It seems logical that the
> scheduler should submit the tasks to the less busy nodes, but that is
> not the case.
> In sinfo -N -o '%N %O %C' I can see that jobs are allocated to node11,
> with load 2.06, leaving node4, which is totally idling. That somehow
> makes no sense to me.
> node1 0.00 7/0/0/7
> node2 0.26 7/0/0/7
> node3 0.54 7/0/0/7
> node4 0.07 0/7/0/7
> node5 0.00 7/0/0/7
> node6 0.01 7/0/0/7
> node7 0.00 7/0/0/7
> node8 0.01 7/0/0/7
> node9 0.06 7/0/0/7
> node10 0.11 7/0/0/7
> node11 2.06 7/0/0/7
> How can I configure Slurm to fill the node with minimum load first?
>
>
>
>
>
> Any views or opinions presented in this email are solely those of the
> author and do not necessarily represent those of the company. Employees
> of XMA Ltd are expressly required not to make defamatory statements and
> not to infringe or authorise any infringement of copyright or any other
> legal right by email communications. Any such communication is contrary
> to company policy and outside the scope of the employment of the
> individual concerned. The company will not accept any liability in
> respect of such communication, and the employee responsible will be
> personally liable for any damages or other liability arising.
> XMA Limited is registered in England and Wales (registered no. 2051703).
> Registered Office: Wilford Industrial Estate, Ruddington Lane, Wilford,
> Nottingham, NG11 7EP
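[Editor's note: John's points a) and b) in the thread above can be sketched concretely. The commands and the PAM line below are a minimal illustration, assuming a stock Slurm install with the contributed pam_slurm module available; file paths and the module's packaging vary by distribution, so treat them as assumptions to verify locally.]

```
# a) Interactive work through the scheduler instead of plain ssh:
$ salloc -N1 -n1            # reserve one CPU via the scheduler
$ srun --pty bash -l        # open an interactive shell on the allocated node

# b) Block scheduler-bypassing logins: add to the compute node's
#    /etc/pam.d/sshd (module name/location depends on the Slurm build):
account    required    pam_slurm.so
```

With the PAM line in place, ssh to a compute node succeeds only for users who currently have a Slurm job running there, which closes the loophole that let non-Slurm load appear in the first place.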

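[Editor's note: the sinfo listings in the thread can be sanity-checked outside Slurm. Below is a minimal sketch using sample figures copied from the thread (not live output) in the `sinfo -N -o '%N %O %C'` format, where %O is CPU_LOAD and %C is allocated/idle/other/total CPUs. It ranks nodes by load and sums the idle CPUs.]

```shell
# Sample lines in `sinfo -N -o '%N %O %C'` format, taken from the thread.
data="node4 2.97 7/0/0/7
node10 0.38 7/0/0/7
node11 0.01 0/7/0/7"

# Least-loaded node first: general-numeric sort on the load column.
printf '%s\n' "$data" | sort -k2,2 -g

# Total idle CPUs: with both space and '/' as field separators, the idle
# count of the allocated/idle/other/total field is the 4th token.
printf '%s\n' "$data" | awk -F'[ /]' '{idle += $4} END {print "idle CPUs:", idle}'
```

This makes the complaint in the thread easy to state precisely: the node that tops the load-sorted list (node11 here) is the one a load-aware scheduler would fill first, independent of the allocated-CPU counts that CR_LLN actually uses.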