Hi Moe,

Thank you very much. It seems to be exactly what we want.
This is how far I got with the instructions. I modified the .conf file with:

NodeName=burster0 NodeHostname=burster CPUs=8 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN Port=4000
#
NodeName=burster1 NodeHostname=burster CPUs=8 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN Port=4001
...
NodeName=burster19 NodeHostname=burster CPUs=8 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN Port=4019
#
PartitionName=run Nodes=burster[0-19] Default=YES Shared=NO MaxTime=INFINITE State=UP

I restarted the daemon; however, slurmd started only on burster0:

> sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
run*         up   infinite     19  down* burster[1-19]
run*         up   infinite      1   idle burster0

In the slurmctld log file I see:

[2012-06-03T22:20:53] slurmctld version 2.5.0-pre1 started on cluster cluster
[2012-06-03T22:20:53] error: WARNING: Even though we are collecting accounting information you have asked for it not to be stored (accounting_storage/none) if this is not what you have in mind you will need to change it.
[2012-06-03T22:20:53] Recovered state of 20 nodes
[2012-06-03T22:20:53] Recovered job 60 0
[2012-06-03T22:20:53] Recovered job 61 0
[2012-06-03T22:20:53] Recovered information about 2 jobs
[2012-06-03T22:20:53] cons_res: select_p_node_init
[2012-06-03T22:20:53] cons_res: preparing for 1 partitions
[2012-06-03T22:20:53] Recovered state of 0 reservations
[2012-06-03T22:20:53] read_slurm_conf: backup_controller not specified.
[2012-06-03T22:20:53] cons_res: select_p_reconfigure
[2012-06-03T22:20:53] cons_res: select_p_node_init
[2012-06-03T22:20:53] cons_res: preparing for 1 partitions
[2012-06-03T22:20:53] Running as primary controller
[2012-06-03T22:25:53] error: Nodes burster[1-19] not responding
[2012-06-03T22:27:36] error: Nodes burster[1-19] not responding, setting DOWN

In the log for burster0 there is:

[2012-06-03T22:20:53] Node configuration differs from hardware CPUs=8:160(hw) Sockets=1:20(hw) CoresPerSocket=8:8(hw) ThreadsPerCore=1:1(hw)
[2012-06-03T22:20:53] error: WARNING: Even though we are collecting accounting information you have asked for it not to be stored (accounting_storage/none) if this is not what you have in mind you will need to change it.
[2012-06-03T22:20:53] slurmd version 2.5.0-pre1 started
[2012-06-03T22:20:53] slurmd started on Sun 03 Jun 2012 22:20:53 +0200
[2012-06-03T22:20:53] CPUs=8 Sockets=1 Cores=8 Threads=1 Memory=645937 TmpDisk=258031 Uptime=891138

Could you please suggest what I am doing incorrectly? Also, where should I configure each slurmd as having the resources of its NUMA socket and bind it to that socket using cpusets or Linux cgroups?

I appreciate your help and patience.

Kind regards,
Artem

On 3 June 2012 19:43, Moe Jette <[email protected]> wrote:
>
> See
> http://www.schedmd.com/slurmdocs/faq.html#multi_slurmd
>
> Configure each slurmd as having the resources of the NUMA socket and
> bind to that socket using cpusets or linux cgroup
>
> Quoting Artem Kulachenko <[email protected]>:
>
> > Hi Everyone,
> >
> > We have one mid-size SMP machine with a NUMA architecture, having 20
> > sockets and 160 cores. We would like to use SLURM, which currently sees it
> > as a single node. I wonder if there is a way to split it into 20 nodes (one
> > per socket) in order to run jobs locally on neighboring cores using
> > local memory, without the need to assign processors manually.
> >
> > I would greatly appreciate any advice.
> >
> > Kind regards,
> > Artem
> >
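
P.P.S. On the cgroup side of your question: my understanding (to be verified against the slurm.conf and cgroup.conf man pages, so treat this as a sketch rather than a known-working config) is that the task/cgroup plugin is what confines each job's tasks to its allocated cores via cpusets:

```
# slurm.conf (sketch): confine job steps with the cgroup task plugin
TaskPlugin=task/cgroup

# cgroup.conf (sketch): pin tasks to their allocated cores via cpuset
ConstrainCores=yes
```

Note this constrains the jobs SLURM launches, not the slurmd daemons themselves; binding each slurmd to its socket would still be done at launch time (e.g. via numactl or a cpuset).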
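
P.S. For reference, here is a sketch of how I understand the per-node daemons would need to be launched under the multiple-slurmd setup: each emulated node gets its own slurmd instance started with -N so it reads its own NodeName stanza, optionally bound to its socket with numactl as one possible way of doing the cpuset binding you mentioned. This is only a sketch to be verified, not a known-working recipe — it assumes the burster0..burster19 names from the conf above and that socket i corresponds to NUMA node i; the echo makes it a dry run that only prints the commands (remove it to actually start the daemons, as root):

```shell
#!/bin/sh
# Dry run: print one slurmd invocation per emulated node.
# Each instance needs -N so it picks up its own NodeName=... stanza
# (ports 4000-4019) instead of defaulting to the hostname "burster".
# numactl binds the daemon's CPUs and memory to one NUMA node
# (assumption: one socket = one NUMA node, numbered 0-19).
for i in $(seq 0 19); do
    echo numactl --cpunodebind="$i" --membind="$i" slurmd -N "burster$i"
done
```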
