p.s. same issue on v16
On Wed, Sep 7, 2016 at 9:57 AM, andrealphus <andrealp...@gmail.com> wrote:
>
> p.s. it's listing 36 processors with sinfo, and they're all shown as
> in use, but it's only running 18 jobs. So it looks like, while it can
> see the 36 "processors", it's only allocating at the core level and
> not the thread level:
>
> squeue
>              JOBID PARTITION     NAME   USER ST TIME NODES NODELIST(REASON)
>  3850_[19-1000%25]     debug slurm_ex ashton PD 0:00     1 (Resources)
>             3850_1     debug slurm_ex ashton  R 0:05     1 localhost
>             3850_2     debug slurm_ex ashton  R 0:05     1 localhost
>             3850_3     debug slurm_ex ashton  R 0:05     1 localhost
>             3850_4     debug slurm_ex ashton  R 0:05     1 localhost
>             3850_5     debug slurm_ex ashton  R 0:05     1 localhost
>             3850_6     debug slurm_ex ashton  R 0:05     1 localhost
>             3850_7     debug slurm_ex ashton  R 0:05     1 localhost
>             3850_8     debug slurm_ex ashton  R 0:05     1 localhost
>             3850_9     debug slurm_ex ashton  R 0:05     1 localhost
>            3850_10     debug slurm_ex ashton  R 0:05     1 localhost
>            3850_11     debug slurm_ex ashton  R 0:05     1 localhost
>            3850_12     debug slurm_ex ashton  R 0:05     1 localhost
>            3850_13     debug slurm_ex ashton  R 0:05     1 localhost
>            3850_14     debug slurm_ex ashton  R 0:05     1 localhost
>            3850_15     debug slurm_ex ashton  R 0:05     1 localhost
>            3850_16     debug slurm_ex ashton  R 0:05     1 localhost
>            3850_17     debug slurm_ex ashton  R 0:05     1 localhost
>            3850_18     debug slurm_ex ashton  R 0:05     1 localhost
>
> sinfo -o %C
> CPUS(A/I/O/T)
> 36/0/0/36
>
> On Wed, Sep 7, 2016 at 9:41 AM, andrealphus <andrealp...@gmail.com> wrote:
>>
>> I tried changing the CPUs flag in the compute node section of the
>> conf file to 36, but it didn't make a difference; still limited to
>> 18. Also tried removing the flag and letting slurm calculate it from
>> the other info, e.g.:
>>
>> Sockets=1 CoresPerSocket=18 ThreadsPerCore=2
>>
>> Also no change. Could it be a non-configuration issue, e.g. a slurm
>> bug related to the processor type? I only say that because I am
>> normally a Torque user, but there is an open bug with Adaptive that
>> seems to be related to some of the newer Intel
>> processors/glibc/elision locking....
>>
>> On Tue, Sep 6, 2016 at 7:30 PM, andrealphus <andrealp...@gmail.com> wrote:
>>>
>>> ahhhh......I'll give that a try. Thanks Lachlan, feel better!
>>>
>>> On Tue, Sep 6, 2016 at 6:49 PM, Lachlan Musicman <data...@gmail.com> wrote:
>>>>
>>>> No, sorry, I meant that your config file line needs to change:
>>>>
>>>> NodeName=localhost CPUs=36 RealMemory=120000 Sockets=1
>>>> CoresPerSocket=18 ThreadsPerCore=2 State=UNKNOWN
>>>>
>>>> ------
>>>> The most dangerous phrase in the language is, "We've always done
>>>> it this way."
>>>>
>>>> - Grace Hopper
>>>>
>>>> On 7 September 2016 at 11:34, andrealphus <andrealp...@gmail.com> wrote:
>>>>>
>>>>> Yup, that's what I expect too! Since I'm brand new to slurm, I'm
>>>>> not sure if there is some other config option or srun flag to
>>>>> enable multithreading.
>>>>>
>>>>> On Tue, Sep 6, 2016 at 5:42 PM, Lachlan Musicman <data...@gmail.com> wrote:
>>>>> > Oh, I'm not 100% sure on this (home sick actually), but I think:
>>>>> >
>>>>> > NodeName=localhost CPUs=1 RealMemory=120000 Sockets=1
>>>>> > CoresPerSocket=18 ThreadsPerCore=2 State=UNKNOWN
>>>>> >
>>>>> > should have CPUs=36 (i.e., ThreadsPerCore*CoresPerSocket*Sockets).
>>>>> >
>>>>> > cheers
>>>>> > L,
>>>>> >
>>>>> > ------
>>>>> > The most dangerous phrase in the language is, "We've always done
>>>>> > it this way."
>>>>> >
>>>>> > - Grace Hopper
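(A quick cross-check for the CPUs=36 arithmetic being discussed above:
running "slurmd -C" on the node prints the hardware slurmd actually
detects, already formatted as a slurm.conf node line. A sketch of what
it might print for this box; the exact field set varies between Slurm
versions, so treat the output as illustrative:

    $ slurmd -C
    NodeName=localhost CPUs=36 Boards=1 SocketsPerBoard=1 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=120000

Note that the config file quoted further down sets FastSchedule=1,
which tells Slurm to trust the config file over the detected hardware,
so a stale CPUs=1 in slurm.conf would explain sinfo reporting a single
CPU.)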
>>>>> >
>>>>> > On 7 September 2016 at 10:39, andrealphus <andrealp...@gmail.com> wrote:
>>>>> >>
>>>>> >> Thanks Lachlan, took threads-per-core out and got the same
>>>>> >> behavior, still limited to 18.
>>>>> >>
>>>>> >> On Tue, Sep 6, 2016 at 5:33 PM, Lachlan Musicman <data...@gmail.com> wrote:
>>>>> >> >
>>>>> >> > You don't need --threads-per-core.
>>>>> >> >
>>>>> >> > It's sufficient to have
>>>>> >> >
>>>>> >> > SelectType=select/cons_res
>>>>> >> > SelectTypeParameters=CR_CPU
>>>>> >> >
>>>>> >> > then you should be able to get to all 36.
>>>>> >> >
>>>>> >> > cheers
>>>>> >> > L.
>>>>> >> >
>>>>> >> > ------
>>>>> >> > The most dangerous phrase in the language is, "We've always
>>>>> >> > done it this way."
>>>>> >> >
>>>>> >> > - Grace Hopper
>>>>> >> >
>>>>> >> > On 7 September 2016 at 10:22, andrealphus <andrealp...@gmail.com> wrote:
>>>>> >> >>
>>>>> >> >> One more follow-up....
>>>>> >> >>
>>>>> >> >> This seems to be limited to the number of cores. Any way to
>>>>> >> >> change it so that I can run up to the thread limit (18x2)
>>>>> >> >> concurrently?
>>>>> >> >>
>>>>> >> >> Thanks!
>>>>> >> >>
>>>>> >> >> On Tue, Sep 6, 2016 at 3:21 PM, andrealphus <andrealp...@gmail.com> wrote:
>>>>> >> >> >
>>>>> >> >> > Spoke too soon, so for posterity....
>>>>> >> >> >
>>>>> >> >> > Need to set, in the conf:
>>>>> >> >> >
>>>>> >> >> > SelectType=select/cons_res
>>>>> >> >> > SelectTypeParameters=CR_CPU
>>>>> >> >> >
>>>>> >> >> > and in the script:
>>>>> >> >> >
>>>>> >> >> > #SBATCH --threads-per-core=1
>>>>> >> >> >
>>>>> >> >> > DefMemPerCPU did not matter...
>>>>> >> >> >
>>>>> >> >> > On Tue, Sep 6, 2016 at 3:08 PM, andrealphus <andrealp...@gmail.com> wrote:
>>>>> >> >> >>
>>>>> >> >> >> Hi all,
>>>>> >> >> >>
>>>>> >> >> >> Long-time Torque user, first-time SLURM user. I'm running
>>>>> >> >> >> version 15.08 from APT on Ubuntu Xenial (on an 18-core
>>>>> >> >> >> E5-2697 v4 CPU).
>>>>> >> >> >>
>>>>> >> >> >> I'm trying to figure out the proper slurm.conf
>>>>> >> >> >> configuration and script parameters to run a job array on
>>>>> >> >> >> a single node/server workstation, with more than one task
>>>>> >> >> >> of the job running at a time.
>>>>> >> >> >>
>>>>> >> >> >> e.g.
>>>>> >> >> >>
>>>>> >> >> >> #!/bin/bash
>>>>> >> >> >> #SBATCH -o slurm_example-%A_%a.out
>>>>> >> >> >> #SBATCH --array=1-21%3
>>>>> >> >> >> #SBATCH --mem-per-cpu=2000
>>>>> >> >> >>
>>>>> >> >> >> srun sleep 15
>>>>> >> >> >>
>>>>> >> >> >> Submitting with "sbatch example.sh" should run 21 total
>>>>> >> >> >> instances of sleep, 3 at a time, correct?
>>>>> >> >> >>
>>>>> >> >> >> I can never get more than 1 concurrent process going....
>>>>> >> >> >>
>>>>> >> >> >> My slurm.conf file looks like:
>>>>> >> >> >>
>>>>> >> >> >> ControlMachine=localhost
>>>>> >> >> >> AuthType=auth/munge
>>>>> >> >> >> CacheGroups=0
>>>>> >> >> >> CryptoType=crypto/munge
>>>>> >> >> >> MaxTasksPerNode=32
>>>>> >> >> >> MpiDefault=none
>>>>> >> >> >> ProctrackType=proctrack/pgid
>>>>> >> >> >> ReturnToService=1
>>>>> >> >> >> SlurmctldPidFile=/var/run/slurmctld.pid
>>>>> >> >> >> SlurmctldPort=6817
>>>>> >> >> >> SlurmdPidFile=/var/run/slurmd.pid
>>>>> >> >> >> SlurmdPort=6818
>>>>> >> >> >> SlurmdSpoolDir=/var/spool/slurmd
>>>>> >> >> >> SlurmUser=root
>>>>> >> >> >> StateSaveLocation=/var/spool
>>>>> >> >> >> SwitchType=switch/none
>>>>> >> >> >> TaskPlugin=task/none
>>>>> >> >> >> InactiveLimit=0
>>>>> >> >> >> KillWait=30
>>>>> >> >> >> MinJobAge=300
>>>>> >> >> >> SlurmctldTimeout=120
>>>>> >> >> >> SlurmdTimeout=300
>>>>> >> >> >> Waittime=0
>>>>> >> >> >> FastSchedule=1
>>>>> >> >> >> SchedulerType=sched/backfill
>>>>> >> >> >> SchedulerPort=7321
>>>>> >> >> >> SelectType=select/cons_res
>>>>> >> >> >> SelectTypeParameters=CR_CPU
>>>>> >> >> >> AccountingStorageType=accounting_storage/none
>>>>> >> >> >> AccountingStoreJobComment=YES
>>>>> >> >> >> ClusterName=cluster
>>>>> >> >> >> JobCompType=jobcomp/none
>>>>> >> >> >> JobAcctGatherFrequency=30
>>>>> >> >> >> JobAcctGatherType=jobacct_gather/none
>>>>> >> >> >> SlurmctldDebug=3
>>>>> >> >> >> SlurmdDebug=3
>>>>> >> >> >>
>>>>> >> >> >> # COMPUTE NODES
>>>>> >> >> >> NodeName=localhost CPUs=1 RealMemory=120000 Sockets=1
>>>>> >> >> >> CoresPerSocket=18 ThreadsPerCore=2 State=UNKNOWN
>>>>> >> >> >> PartitionName=debug Nodes=localhost Shared=YES DefMemPerCPU=3000
>>>>> >> >> >> Default=YES MaxTime=INFINITE State=UP
>>>>> >> >> >>
>>>>> >> >> >> I've tried both
>>>>> >> >> >>
>>>>> >> >> >> SelectType=select/cons_res
>>>>> >> >> >> SelectTypeParameters=CR_CPU
>>>>> >> >> >>
>>>>> >> >> >> and
>>>>> >> >> >>
>>>>> >> >> >> SelectType=select/linear
>>>>> >> >> >>
>>>>> >> >> >> but both return
>>>>> >> >> >>
>>>>> >> >> >> sinfo -o %C
>>>>> >> >> >> CPUS(A/I/O/T)
>>>>> >> >> >> 0/0/1/1
>>>>> >> >> >>
>>>>> >> >> >> which didn't seem right, because I thought that if I set
>>>>> >> >> >> SelectType=select/cons_res & SelectTypeParameters=CR_CPU,
>>>>> >> >> >> the threads would be seen as the CPUs.
>>>>> >> >> >>
>>>>> >> >> >> I've tried to piece it together with the slurm and ubuntu
>>>>> >> >> >> mailing lists, but two days later I'm ready to hide in a
>>>>> >> >> >> corner....
>>>>> >> >> >>
>>>>> >> >> >> any info appreciated!
>>>>> >> >> >>
>>>>> >> >> >> ashton
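(For anyone who lands on this thread later: the residual 18-of-36
ceiling reported at the top is consistent with core-level allocation.
With Sockets/CoresPerSocket/ThreadsPerCore declared, each single-CPU
array task is handed a whole core, i.e. two hyperthreads, so the 18
cores saturate at 18 running tasks even though sinfo counts 36 CPUs.
The slurm.conf documentation for CR_CPU notes that declaring the
topology prevents more than one job from being allocated per core. A
minimal sketch of the relevant lines under that reading; untested
here, with the partition line carried over unchanged from the thread:

    # Schedule each hyperthread as an independent CPU. The
    # socket/core/thread topology is deliberately omitted: with
    # CR_CPU, declaring it limits each core to a single job.
    SelectType=select/cons_res
    SelectTypeParameters=CR_CPU
    NodeName=localhost CPUs=36 RealMemory=120000 State=UNKNOWN
    PartitionName=debug Nodes=localhost Shared=YES DefMemPerCPU=3000 Default=YES MaxTime=INFINITE State=UP

With that in place, the array submission shown above (an
--array=1-1000%25 job) should reach 25 concurrently running tasks
rather than stalling at 18.)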