Hi Everyone!

I modified the SLURM start script as follows, and all the nodes are up
now. During service start I bind *slurmd* to physical CPUs directly with
numactl; however, *srun* still places the jobs on the first socket (procs
0-7, as I see with top -iH and f->j), although it reports that the jobs
are placed on different nodes as intended.
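To double-check where the binding actually ends up, I read the affinity
masks directly (a quick sketch; assumes taskset and pgrep are available):

```shell
# Print the CPU affinity list of every running slurmd instance,
# so the intended 0-7, 8-15, ... bindings can be verified.
for pid in $(pgrep -x slurmd); do
    taskset -cp "$pid"
done
```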

My partitions should not allow sharing:
*PartitionName=run Nodes=burster[0-19] Default=YES Shared=NO
MaxTime=INFINITE State=UP*

Here is how I run:
*srun -N 1 -c 8 --partition=run --verbose
/usr/ansys_inc/v140/ansys/bin/ansys140 -b nolist -s noread -np 8 -jobnam
file1 -i example.dat*

Could you please suggest what goes wrong here?
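For reference, I have not configured any task plugin yet. From my reading
of the docs, without something like the lines below in slurm.conf, SLURM
does not itself bind launched tasks to the allocated cores (a sketch only;
I have not tested these values):

```
# slurm.conf (sketch): let SLURM bind tasks to the allocated cores
# instead of relying on slurmd's affinity being inherited
TaskPlugin=task/affinity
TaskPluginParam=Cores
```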

The start script mentioned above:

start() {
    prog=$1
    shift
    unset HOME MAIL USER USERNAME
    if [[ ${prog} == slurmd ]]
    then
        i1=0
        i2=7
        for node in $(scontrol show aliases)
        do
            echo "starting: numactl --physcpubind=${i1}-${i2} ${prog} -N ${node}"
            numactl --physcpubind=${i1}-${i2} ${prog} -N ${node}
            (( i1 += 8 ))
            (( i2 += 8 ))
        done
    else
        echo -n "starting $prog: "
        $STARTPROC $SBINDIR/$prog "$@"
    fi
    rc_status -v
    echo
    touch /var/lock/subsys/slurm
}
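The per-node ranges come from simple stride-8 arithmetic; a minimal
standalone sketch of what the loop computes (hypothetical node names):

```shell
# Node N is given physical CPUs N*8 .. N*8+7.
i1=0
i2=7
for node in burster0 burster1 burster2; do
    echo "${node}: cpus ${i1}-${i2}"
    (( i1 += 8 ))
    (( i2 += 8 ))
done
```

So burster0 gets 0-7, burster1 gets 8-15, and so on up the socket list.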


On 3 June 2012 22:43, Artem Kulachenko <[email protected]> wrote:

>  Hi Moe,
>
> Thank you very much. It seems to be exactly what we want.
>
> This is how far I went through with the instructions:
>
> I modified slurm.conf with:
>
> NodeName=burster0 NodeHostname=burster CPUs=8 CoresPerSocket=8
> ThreadsPerCore=1 State=UNKNOWN Port=4000
> #
> NodeName=burster1 NodeHostname=burster CPUs=8 CoresPerSocket=8
> ThreadsPerCore=1 State=UNKNOWN Port=4001
> ...
> NodeName=burster19 NodeHostname=burster CPUs=8 CoresPerSocket=8
> ThreadsPerCore=1 State=UNKNOWN Port=4019
> #
> PartitionName=run Nodes=burster[0-19] Default=YES Shared=NO
> MaxTime=INFINITE State=UP
>
> I restarted the daemon; however, slurmd started only on burster0:
>
> > sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> run*         up   infinite     19  down* burster[1-19]
> run*         up   infinite      1   idle burster0
>
> In the log file I see:
> [2012-06-03T22:20:53] slurmctld version 2.5.0-pre1 started on cluster
> cluster
> [2012-06-03T22:20:53] error: WARNING: Even though we are collecting
> accounting information you have asked for it not to be stored
> (accounting_storage/none) if this is not what you have in mind you will
> need to change it.
> [2012-06-03T22:20:53] Recovered state of 20 nodes
> [2012-06-03T22:20:53] Recovered job 60 0
> [2012-06-03T22:20:53] Recovered job 61 0
> [2012-06-03T22:20:53] Recovered information about 2 jobs
> [2012-06-03T22:20:53] cons_res: select_p_node_init
> [2012-06-03T22:20:53] cons_res: preparing for 1 partitions
> [2012-06-03T22:20:53] Recovered state of 0 reservations
> [2012-06-03T22:20:53] read_slurm_conf: backup_controller not specified.
> [2012-06-03T22:20:53] cons_res: select_p_reconfigure
> [2012-06-03T22:20:53] cons_res: select_p_node_init
> [2012-06-03T22:20:53] cons_res: preparing for 1 partitions
> [2012-06-03T22:20:53] Running as primary controller
> [2012-06-03T22:25:53] error: Nodes burster[1-19] not responding
> [2012-06-03T22:27:36] error: Nodes burster[1-19] not responding, setting
> DOWN
>
> In the log for burster0 there is:
> [2012-06-03T22:20:53] Node configuration differs from hardware
>    CPUs=8:160(hw) Sockets=1:20(hw)
>    CoresPerSocket=8:8(hw) ThreadsPerCore=1:1(hw)
> [2012-06-03T22:20:53] error: WARNING: Even though we are collecting
> accounting information you have asked for it not to be stored
> (accounting_storage/none) if this is not what you have in mind you will
> need to change it.
> [2012-06-03T22:20:53] slurmd version 2.5.0-pre1 started
> [2012-06-03T22:20:53] slurmd started on Sun 03 Jun 2012 22:20:53 +0200
> [2012-06-03T22:20:53] CPUs=8 Sockets=1 Cores=8 Threads=1 Memory=645937
> TmpDisk=258031 Uptime=891138
>
> Could you please suggest what I am doing incorrectly?
>
> Also, where should I configure each slurmd to have the resources of its
> NUMA socket, and bind it to that socket using cpusets or a Linux cgroup?
>
> I appreciate your help and patience.
>
> Kind regards,
> Artem
>
>
> On 3 June 2012 19:43, Moe Jette <[email protected]> wrote:
>
>>
>> See
>> http://www.schedmd.com/slurmdocs/faq.html#multi_slurmd
>>
>> Configure each slurmd as having the resources of the NUMA socket and
>> bind to that socket using cpusets or Linux cgroups
>>
>> Quoting Artem Kulachenko <[email protected]>:
>>
>> > Hi Everyone,
>> >
>> > We have one mid-size SMP machine with a NUMA architecture: 20
>> > sockets and 160 cores. We would like to use SLURM, which currently
>> > sees it as a single node. I wonder if there is a way to split it
>> > into 20 nodes (one per socket) in order to run jobs locally on
>> > neighboring cores, using local memory, without having to assign the
>> > processors manually.
>> >
>> > I would very much appreciate any advice.
>> >
>> > Kind regards,
>> > Artem
>> >
>>
>>
>
