Hi Moe,

Thank you very much. It seems to be exactly what we want.

This is how far I got with the instructions:

I modified the .conf file with

NodeName=burster0 NodeHostname=burster CPUs=8 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN Port=4000
#
NodeName=burster1 NodeHostname=burster CPUs=8 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN Port=4001
...
NodeName=burster19 NodeHostname=burster CPUs=8 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN Port=4019
#
PartitionName=run Nodes=burster[0-19] Default=YES Shared=NO MaxTime=INFINITE State=UP
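
(For completeness: the elided entries all follow the same pattern, differing only in the node index and the port number; a small loop like this one reproduces the twenty NodeName lines:)

```shell
# Emit the 20 NodeName entries for slurm.conf:
# burster0..burster19 on ports 4000..4019.
for i in $(seq 0 19); do
  echo "NodeName=burster${i} NodeHostname=burster CPUs=8 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN Port=$((4000 + i))"
done
```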

I restarted the daemon; however, slurmd started only on burster0:

> sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
run*         up   infinite     19  down* burster[1-19]
run*         up   infinite      1   idle burster0
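
Should I be starting a separate slurmd instance for each configured node name, passing it with -N? This is the set of commands I imagine would be needed (the loop below only prints them, since I am not sure this is the right approach):

```shell
# Guess: one slurmd instance per configured node name,
# selected with -N. Printed here rather than executed.
for i in $(seq 0 19); do
  echo "slurmd -N burster${i}"
done
```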

In the log file I see:
[2012-06-03T22:20:53] slurmctld version 2.5.0-pre1 started on cluster cluster
[2012-06-03T22:20:53] error: WARNING: Even though we are collecting
accounting information you have asked for it not to be stored
(accounting_storage/none) if this is not what you have in mind you will
need to change it.
[2012-06-03T22:20:53] Recovered state of 20 nodes
[2012-06-03T22:20:53] Recovered job 60 0
[2012-06-03T22:20:53] Recovered job 61 0
[2012-06-03T22:20:53] Recovered information about 2 jobs
[2012-06-03T22:20:53] cons_res: select_p_node_init
[2012-06-03T22:20:53] cons_res: preparing for 1 partitions
[2012-06-03T22:20:53] Recovered state of 0 reservations
[2012-06-03T22:20:53] read_slurm_conf: backup_controller not specified.
[2012-06-03T22:20:53] cons_res: select_p_reconfigure
[2012-06-03T22:20:53] cons_res: select_p_node_init
[2012-06-03T22:20:53] cons_res: preparing for 1 partitions
[2012-06-03T22:20:53] Running as primary controller
[2012-06-03T22:25:53] error: Nodes burster[1-19] not responding
[2012-06-03T22:27:36] error: Nodes burster[1-19] not responding, setting DOWN

In the log for burster0 there is:
[2012-06-03T22:20:53] Node configuration differs from hardware
   CPUs=8:160(hw) Sockets=1:20(hw)
   CoresPerSocket=8:8(hw) ThreadsPerCore=1:1(hw)
[2012-06-03T22:20:53] error: WARNING: Even though we are collecting
accounting information you have asked for it not to be stored
(accounting_storage/none) if this is not what you have in mind you will
need to change it.
[2012-06-03T22:20:53] slurmd version 2.5.0-pre1 started
[2012-06-03T22:20:53] slurmd started on Sun 03 Jun 2012 22:20:53 +0200
[2012-06-03T22:20:53] CPUs=8 Sockets=1 Cores=8 Threads=1 Memory=645937
TmpDisk=258031 Uptime=891138

Could you please suggest what I am doing incorrectly?

Also, where should I configure each slurmd as having the resources of one
NUMA socket, and bind it to that socket using cpusets or Linux cgroups?
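
My best guess so far is something along these lines in slurm.conf and cgroup.conf, but I am not sure these are the right knobs:

```
# slurm.conf (guess)
TaskPlugin=task/cgroup

# cgroup.conf (guess)
ConstrainCores=yes
TaskAffinity=yes
```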

I appreciate your help and patience.

Kind regards,
Artem


On 3 June 2012 19:43, Moe Jette <[email protected]> wrote:

>
> See
> http://www.schedmd.com/slurmdocs/faq.html#multi_slurmd
>
> Configure each slurmd as having the resources of the NUMA socket and
> bind to that socket using cpusets or linux cgroup
>
> Quoting Artem Kulachenko <[email protected]>:
>
> > Hi Everyone,
> >
> > We have one mid-size SMP machine with a NUMA architecture, having 20
> > sockets and 160 cores. We would like to use SLURM, which currently sees
> it
> > as a single node. I wonder if there is a way to split it into 20 nodes
> (one
> > per socket) in order to run the jobs locally at neighboring cores using
> > local memory without the need to assign the processors manually.
> >
> > I would very much appreciate any advice.
> >
> > Kind regards,
> > Artem
> >
>
>