[slurm-dev] DB on worker nodes

2016-03-23 Thread Lachlan Musicman
Hi, I'm just configuring a script to deploy worker nodes. I've realised that version #1, made many moons ago, installed MySQL/MariaDB. But now that I look at my worker nodes, I don't think they need mysql on them. Can anyone confirm or deny whether they do? cheers L. -- The most dangerous

[slurm-dev] Website down?

2016-04-06 Thread Lachlan Musicman
Is it just me, or is schedmd.com having issues at the moment? I'm getting intermittent responses, nothing from the downloads page cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper

[slurm-dev] Re: Website down?

2016-04-06 Thread Lachlan Musicman
lenkov wrote: > It does look like there is a problem! Doesn't work for me either > > Gene > > > On 07/04/16 11:51, Lachlan Musicman wrote: > > Is it just me, or is schedmd.com having issues at the moment? > > I'm getting intermittent responses, nothing fro

[slurm-dev] Slurm, configuration and hostname

2016-04-10 Thread Lachlan Musicman
Hola There are a number of places in the slurm configuration where we need to enter hostnames. The docs (almost?) always recommend the short hostname; the slurm.conf docs are the only place I've found that explicitly states it should be the value returned by `hostname -s`. Our systems have

[slurm-dev] Slurm plugins

2016-04-11 Thread Lachlan Musicman
Is it just me, or have the slurm plugins in 16.05pre2 moved from /usr/local/lib to /usr/lib64? cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper

[slurm-dev] Re: Slurm service timeout - hints on diagnostics please?

2016-04-11 Thread Lachlan Musicman
I think I saw something like this just now - are you running: systemctl start slurm or systemctl start slurmd ? And slurmctld is running on the head? Cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 12 April 2016 at 13:04, Joh
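A minimal sketch of the checks being suggested, assuming the stock systemd unit names (slurmctld on the head node, slurmd on the workers):

    systemctl status slurmctld      # head node
    systemctl status slurmd         # each worker node
    journalctl -u slurmd -n 50      # recent log lines if a daemon won't start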

[slurm-dev] error: slurm_jobcomp plugin context not initialized

2016-04-12 Thread Lachlan Musicman
Hi, While running the tests, I'm seeing a lot of this error: error: slurm_jobcomp plugin context not initialized AFAICT, slurmdbd is set up correctly, and slurm.conf has JobCompType=jobcomp/slurmdbd What do I do to fix this error? cheers L. -- The most dangerous phrase in the language

[slurm-dev] Re: error: slurm_jobcomp plugin context not initialized

2016-04-12 Thread Lachlan Musicman
l doesn't work. Looks like it needs a db schema inserted? Where might I find that or ? Cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 13 April 2016 at 10:59, Lachlan Musicman wrote: > Hi, >

[slurm-dev] Re: strange going-ons with OpenMPI and Infiniband

2016-04-12 Thread Lachlan Musicman
I was reading about this today. Isn't OpenMPI compiled --with-slurm by default when installing with one of the pkg managers? https://www.open-mpi.org/faq/?category=building#build-rte Cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper

[slurm-dev] More than one job/task per node?

2016-04-28 Thread Lachlan Musicman
I'm finding this a little confusing. We have a very simple script we are using to test/train staff how to use SLURM (16.05-pre2). They are moving from an old Torque/Maui system. I have a test partition set up, from slurm.conf NodeName=slurm-[01-02] CPUs=8 RealMemory=32000 Sockets=1 CoresPerSock
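For context, a hedged sketch of a slurm.conf fragment that lets several single-task jobs share a node via per-CPU scheduling; beyond the NodeName line quoted above, the values are illustrative rather than the poster's actual config:

    SelectType=select/cons_res
    SelectTypeParameters=CR_CPU
    NodeName=slurm-[01-02] CPUs=8 RealMemory=32000 State=UNKNOWN
    PartitionName=test Nodes=slurm-[01-02] Default=YES MaxTime=INFINITE State=UP

With that in place, two separate sbatch -n 1 submissions should land on the same node rather than queueing behind each other.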

[slurm-dev] Re: Slurm Upgrade Instructions needed

2016-05-04 Thread Lachlan Musicman
I would backup /etc/slurm. That's about it. Cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 5 May 2016 at 07:36, Balaji Deivam wrote: > Hello, > > Right now we are using Slurm 14.11.3 and planning to upgrade to the > latest ve

[slurm-dev] RE: Guide for beginners Admin to make priorities

2016-05-18 Thread Lachlan Musicman
We haven't got it in production yet, but I don't see why not. There's a section in the docs that talks about sssd, so I presume it "just works" L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 18 May 2016 at 17:46, David Ramírez wrote:

[slurm-dev] 16.05?

2016-05-29 Thread Lachlan Musicman
I know "it will be ready when it's ready" but I am about to deploy to production - how far off is the official 16.05 release? cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper

[slurm-dev] Re: 16.05?

2016-05-29 Thread Lachlan Musicman
> On May 29, 2016 9:01:22 PM PDT, Lachlan Musicman > wrote: >> >> I know "it will be ready when it's ready" but I am about to deploy to >> production - how far off is the official 16.05 release? >> >> cheers >> L. >> -- >> The m

[slurm-dev] Re: Slurm 16.05.0 and 15.08.12 are now available

2016-05-31 Thread Lachlan Musicman
Thanks and congrats! -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 1 June 2016 at 08:03, Danny Auble wrote: > > We are pleased to announce the release of 16.05.0! It contains many new > features and performance enhancements. Please read

[slurm-dev] Building SLURM

2016-05-31 Thread Lachlan Musicman
Hola, I built the newest slurm release for installation. The docs say to install on the head and worker nodes: - slurm - slurm-devel - slurm-munge - slurm-perlapi - slurm-plugins - slurm-sjobexit - slurm-sjstat - slurm-torque My RPM folder also contains: slurm-openlava

[slurm-dev] Re: Building SLURM

2016-06-01 Thread Lachlan Musicman
ce Hopper On 1 June 2016 at 15:39, Lachlan Musicman wrote: > Hola, > > I build the newest slurm release for installation. The docs say to install > on the head and worker nodes: > > >- slurm >- slurm-devel >- slurm-munge >- slurm-perlapi >- sl

[slurm-dev] Re: sacct -j

2016-06-01 Thread Lachlan Musicman
Remi, The obvious questions are: Have you set up the accounting? Added a cluster, added some users, etc? ie, on the link below, there's a section under "Tools" and "Database Configuration" that might apply? http://slurm.schedmd.com/accounting.html I think that this section is ripe for a how
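A minimal sketch of the database-side setup being referred to; the cluster, account and user names are placeholders:

    sacctmgr add cluster mycluster
    sacctmgr add account prod Description="production" Organization="lab"
    sacctmgr add user alice DefaultAccount=prod
    sacctmgr show associations format=Cluster,Account,User,Partition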

[slurm-dev] test 30.1 is failing

2016-06-02 Thread Lachlan Musicman
Hi, I've just run the testsuite and got a couple of failures. The first one I can't solve is 30.1 FAILURE: there was an error during the rpmbuild spawn ls /tmp/built_rpms/RPMS spawn /usr/bin/bash -c exec rpm -qpl /tmp/built_rpms/RPMS//Slurm-0-0..rpm | grep srun.1 error: open of /tmp/built_rpms/RPMS/

[slurm-dev] kill-on-invalid-dep

2016-06-14 Thread Lachlan Musicman
The sbatch command http://slurm.schedmd.com/sbatch.html has the flag --kill-on-invalid-dep= Which we would like to turn on by default. (ie = yes) The man page indicates that there would be a slurm.conf setting kill_invalid_depend but I don't see it in the default slurm.conf? I do see KillOnBa
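For reference, a hedged example of both forms; the slurm.conf spelling should be checked against the version in use, since it sits under SchedulerParameters rather than as a standalone keyword:

    # slurm.conf
    SchedulerParameters=kill_invalid_depend

    # per-job override (job id and script name are placeholders)
    sbatch --kill-on-invalid-dep=yes --dependency=afterok:1234 next_step.sh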

[slurm-dev] Re: kill-on-invalid-dep

2016-06-14 Thread Lachlan Musicman
Thanks Chris! -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 15 June 2016 at 14:12, Christopher Samuel wrote: > > On 15/06/16 14:03, Lachlan Musicman wrote: > > > The man page indicates that there

[slurm-dev] Learning sacctmgr

2016-06-14 Thread Lachlan Musicman
sacctmgr isn't the easiest thing to wrap your head around :) I've just successfully run this command: sacctmgr add user pers...@domain.com DefaultAccount=prod Partition=prod (we use sssd for login against an AD, hence the @domain.com) I tried to modify a user I had added previously: sacctmgr m

[slurm-dev] Re: Learning sacctmgr

2016-06-14 Thread Lachlan Musicman
Ah! I was using the example on the Accounting page http://slurm.schedmd.com/accounting.html Cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 15 June 2016 at 15:16, Christopher Samuel wrote: > > On 15

[slurm-dev] Updating slurm.conf

2016-06-15 Thread Lachlan Musicman
Hi, I would like some clarification on upgrading slurm.conf. As we discover things needing to be added or changed, we update a central slurm.conf and distribute it to all nodes, AllocNodes and head nodes via ansible. This works a treat. Next, we would like to have our new slurm.conf applied without
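The usual pattern after pushing a new slurm.conf to every node is a single reconfigure rather than daemon restarts, though some parameters still require a restart (check the slurm.conf man page per option):

    scontrol reconfigure
    scontrol show config | grep -i <parameter>   # confirm the change took; <parameter> is a placeholder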

[slurm-dev] Re: Learning sacctmgr

2016-06-15 Thread Lachlan Musicman
ays done it this way." - Grace Hopper On 15 June 2016 at 15:16, Christopher Samuel wrote: > > On 15/06/16 15:10, Lachlan Musicman wrote: > > > sacctmgr modify user set Partition=prod where User=pers...@domain.com > > I *think* you need to have the where before the set,
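In other words, the where clause comes before set (the user and partition names here are placeholders):

    sacctmgr modify user where user=someuser set Partition=prod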

[slurm-dev] Re: Updating slurm.conf

2016-06-15 Thread Lachlan Musicman
I forgot the important info, sorry!: running slurm 16.05 and the subject should read "Updating slurm.conf kills the queue" Cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 16 June 2016 at 13:24, Lachl

[slurm-dev] Re: configure:--enable-front-end: just curious

2016-06-17 Thread Lachlan Musicman
I think that's the AllocNode on the Partition? See here http://slurm.schedmd.com/slurm.conf.html and http://slurm.schedmd.com/scontrol.html (search for AllocNode on both) Cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 18 June

[slurm-dev] Job dependency failure bc disk write cache?

2016-06-19 Thread Lachlan Musicman
Morning! We have a scenario where I *think* the problem is a write cache issue, but I'm not 100% sure. We have JobB dependent on JobA. JobA internally (ie, not via --output) writes three small files to nfs-shared disk, the first of which is then parsed by JobB - hence the dependency (using --dep
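A minimal sketch of the dependency chain described, with placeholder script names:

    jobid=$(sbatch --parsable jobA.sh)
    sbatch --dependency=afterok:${jobid} jobB.sh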

[slurm-dev] Re: Job dependency failure bc disk write cache?

2016-06-23 Thread Lachlan Musicman
We worked this one out - it was pebkac, not slurm :/ -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 20 June 2016 at 10:37, Lachlan Musicman wrote: > Morning! > > We have a scenario where I *think* the problem is

[slurm-dev] Re: Updating slurm.conf

2016-06-23 Thread Lachlan Musicman
t out is to do a 'slurmctld -Dv' and it will fail and > tell you what the issue is. > > Hopefully this helps. > > --- > Nicholas McCollum > HPC Systems Administrator > Alabama Supercomputer Authority > > > On Wed, 15 Jun 2016, Lachlan Musicman

[slurm-dev] TMPDIR, clean up and prolog/epilog

2016-06-23 Thread Lachlan Musicman
We are transitioning from Torque/Maui to SLURM and have only just noticed that SLURM puts all files in /tmp and doesn't create a per job/user TMPDIR. On searching, we have found a number of options for creation of TMPDIR on the fly using SPANK and lua and prolog/epilog. I am looking for something

[slurm-dev] Re: TMPDIR, clean up and prolog/epilog

2016-06-26 Thread Lachlan Musicman
art for instance with: > https://github.com/fafik23/slurm_plugins/tree/master/bindtmp > > cheers > marcin > > 2016-06-24 7:22 GMT+02:00 Lachlan Musicman : > >> We are transitioning from Torque/Maui to SLURM and have only just noticed >> that SLURM puts all files in /

[slurm-dev] Re: Learning sacctmgr

2016-06-28 Thread Lachlan Musicman
Chris, Are the Allowgroups groups from the system groups? cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 16 June 2016 at 15:35, Christopher Samuel wrote: > > On 16/06/16 14:28, Lachlan Musicman wrote:

[slurm-dev] Re: Learning sacctmgr

2016-06-28 Thread Lachlan Musicman
Hmm thanks. I'm not seeing it working unfortunately. :/ -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 29 June 2016 at 14:51, Christopher Samuel wrote: > > On 29/06/16 14:39, Lachlan Musicman wrote: >

[slurm-dev] Re: Learning sacctmgr

2016-06-28 Thread Lachlan Musicman
tion". Looks like you are adding a user. gah. Cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 29 June 2016 at 15:19, Lachlan Musicman wrote: > Hmm thanks. I'm not seeing it working unfortunately.

[slurm-dev] Re: Learning sacctmgr

2016-06-28 Thread Lachlan Musicman
er On 29 June 2016 at 15:28, Christopher Samuel wrote: > > On 29/06/16 15:21, Lachlan Musicman wrote: > > > Hmm thanks. I'm not seeing it working unfortunately. > > You need to make sure you've got SSSD set to enumerate (unless you're on > Slurm

[slurm-dev] Accounting, users, associations and Partitions

2016-06-29 Thread Lachlan Musicman
Is it possible to set a Default Partition against a user? So that they can srun hostname while AccountingStorageEnforce=associations instead of srun --partition=dev hostname cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper

[slurm-dev] Re: gmail spam filters?

2016-07-08 Thread Lachlan Musicman
Yeah, I'm marking a lot of slurm list email as "not spam" in my gmail account atm Cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 9 July 2016 at 03:52, Michael Di Domenico wrote: > > On Fri, Jul 8, 2016 at 1:22 PM, Tim Wickberg

[slurm-dev] Re: specifying compute nodes with arbitrary hostnames in configurator.html

2016-07-11 Thread Lachlan Musicman
Regardless of what you put in, make sure that the end product (the text conf file) has something that looks like: NodeName=compute[01-02] CPUs=40 RealMemory=385000 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN If I recall correctly, the NodeNames can be comma delim: NodeName=*A
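Both forms below are accepted; the node names, counts and memory values are illustrative only:

    NodeName=compute[01-02] CPUs=40 RealMemory=385000 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
    NodeName=bigmem01,bigmem02 CPUs=32 RealMemory=256000 State=UNKNOWN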

[slurm-dev] Interesting unexpected consequences

2016-07-17 Thread Lachlan Musicman
Hola, Because I built the cluster on the fly, I named it after my partner. The boss didn't like this, so we wanted to change the name. (to rosalind for Rosalind Franklin). I took a dump of sacctmgr: sacctmgr dump fiona File=/root/fiona_cluster.cfg and took a look inside. I didn't read the enti
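For readers following along, a hedged sketch of the dump/edit/load approach described (back up the accounting database first; the file path is the one quoted above, and the exact load syntax should be verified against sacctmgr(1) for your version):

    sacctmgr dump fiona File=/root/fiona_cluster.cfg
    # edit the leading Cluster - 'fiona' line to the new name, then
    sacctmgr load File=/root/fiona_cluster.cfg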

[slurm-dev] Re: Interesting unexpected consequences

2016-07-17 Thread Lachlan Musicman
Is that expected behaviour for reasons I haven't read yet, or am I just thinking about sacctmgr all wrong? cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 18 July 2016 at 11:21, Lachlan Musicman wrote: >

[slurm-dev] Suspend job/reboot slurm

2016-07-24 Thread Lachlan Musicman
Hola, Just looking for some clarification of the nature of job suspension. I have just had some success with it and am confirming that it was as wonderful as it seemed? Scenario: about to switch from dev to prod, boss asks me to reboot whole cluster to show stakeholders that it will come back up
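For reference, a minimal sketch of the suspend/resume commands (the job id and partition name are placeholders). Suspend pauses the job's processes on the node, so it helps across a Slurm restart but not across a reboot of the node the job is running on:

    scontrol suspend 12345
    scontrol resume 12345
    # or, for everything running in one partition:
    squeue -h -p prod -t R -o %A | xargs -r -n1 scontrol suspend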

[slurm-dev] Re: slurm support for Redhat Identity Management

2016-07-27 Thread Lachlan Musicman
Maybe. My understanding is slightly different. We use RIM (FreeIPA) for our users - since all the users need to be on all the nodes with the same uid. Munge is used by slurmctld service ("head node") to communicate with slurmd services on worker nodes. IE for the underlying management part. So t

[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-02 Thread Lachlan Musicman
I'm pretty sure it's very simple and smooth. Backup /etc/slurm If you have some accounting happening, backup the relevant database (it will be listed in either slurm.conf or slurmdb.conf). You can update while slurm is running too - get everything prepared, update your slurmdbd service, restart
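A hedged sketch of that order of operations; the package and database names are assumptions (check slurm.conf/slurmdbd.conf for the real database name), and slurmdbd should be upgraded before slurmctld, then slurmd on the nodes:

    cp -a /etc/slurm /etc/slurm.bak.$(date +%F)
    mysqldump slurm_acct_db > slurm_acct_db.$(date +%F).sql
    systemctl stop slurmdbd
    yum update 'slurm*'          # or install the freshly built rpms
    systemctl start slurmdbd     # then restart slurmctld, then slurmd on the workers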

[slurm-dev] Running tasks in parallel w/o arrays

2016-08-02 Thread Lachlan Musicman
Sometimes we would like to run jobs in parallel without using arrays because the files aren't well named. But the files are all in the same folder. We have written a small script that loops over each file, constructs the command in question and runs it. We only want each command to run once, but
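One common pattern for this (not necessarily what the poster's script does) is to background one single-task srun per file inside the allocation and wait; the path and worker script are placeholders:

    #!/bin/bash
    #SBATCH --ntasks=8
    for f in /data/project/sample_*; do
        srun --exclusive -N1 -n1 process_one.sh "$f" &
    done
    wait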

[slurm-dev] Re: Running tasks in parallel w/o arrays

2016-08-02 Thread Lachlan Musicman
of Auckland > e: g.soudlen...@auckland.ac.nz > p: +64 9 3737599 ext 89834 c: +64 21 840 825 f: +64 9 373 7453 > w: www.nesi.org.nz > > > On 3/08/16 3:33 pm, Lachlan Musicman wrote: > >> Sometimes we would like to run jobs in parallel without using arrays >> because

[slurm-dev] Re: Running tasks in parallel w/o arrays

2016-08-02 Thread Lachlan Musicman
> > > srun python run_sgRNA.py $1 `basedir $1` #Sbatch has allocated the tasks, > python can use 12 tasks safely. > > > > Then on your scheduling server just do like you did > > $for sdir in /pipeline/Runs/ProjectFolders/Project_Michael-He/Sample_*; do > sba

[slurm-dev] Re: TMPDIR, clean up and prolog/epilog

2016-08-22 Thread Lachlan Musicman
per On 27 June 2016 at 01:10, Marcin Stolarek wrote: > This was discussed numbers of times before. You can check the list > archive, or start for instance with: > https://github.com/fafik23/slurm_plugins/tree/master/bindtmp > > cheers > marcin > > 2016-06-24 7:22 G

[slurm-dev] Job Accounting for sstat

2016-08-29 Thread Lachlan Musicman
Hi, I noticed that sstat wasn't giving any data, so I looked into how to make that happen. After some reading, I cracked open slurm.conf and uncommented these two lines: JobAcctGatherFrequency=30 JobAcctGatherType=jobacct_gather/linux after confirming that jobacct_gather_linux.so was in PluginD
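Once the gather plugin is active and steps are launched with srun, live usage can be pulled per step; the job id is a placeholder:

    sstat -j 12345 --format=JobID,MaxRSS,AveCPU,NTasks
    sstat -j 12345.batch --format=JobID,MaxRSS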

[slurm-dev] Re: Job Accounting for sstat

2016-08-29 Thread Lachlan Musicman
Oh! Thanks. I presume that includes sruns that are in an sbatch file. Cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 30 August 2016 at 12:12, Christopher Samuel wrote: > > On 30/08/16 11:51, Lachlan Mu

[slurm-dev] Re: SLURM daemon doesn't start

2016-08-29 Thread Lachlan Musicman
James, Would be great to know OS and SLURM version. For instance, on Centos 7/Debian 8/Ubuntu 16.04, you might be using systemctl status/start/restart slurmctld (head node) systemctl status/start/restart slurmd (worker nodes) instead? Cheers L. -- The most dangerous phrase in the langu

[slurm-dev] Re: resource usage, TRES and --exclusive option

2016-08-31 Thread Lachlan Musicman
-- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 31 August 2016 at 21:07, Christof Koehler < christof.koeh...@bccms.uni-bremen.de> wrote: > Hello everybody, > > If I understand the slurm documentation correctly the usual configuration > in s

[slurm-dev] Re: resource usage, TRES and --exclusive option

2016-09-01 Thread Lachlan Musicman
On 1 September 2016 at 18:16, Christof Koehler < christof.koeh...@bccms.uni-bremen.de> wrote: > Hello, > > On Wed, Aug 31, 2016 at 05:52:48PM -0700, Lachlan Musicman wrote: > > I don't believe it's 100% necessary to use OverSubscribe Yes. > > > > We

[slurm-dev] Re: single node workstation

2016-09-06 Thread Lachlan Musicman
You don't need --threads-per-core. It's sufficient to have SelectType=select/cons_res SelectTypeParameters=CR_CPU then you should be able to get to all 36. cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 7 September 2016 at 10
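A minimal sketch for a 36-way single-node box, with illustrative names and counts:

    # slurm.conf
    SelectType=select/cons_res
    SelectTypeParameters=CR_CPU
    NodeName=localhost CPUs=36 State=UNKNOWN
    PartitionName=work Nodes=localhost Default=YES State=UP

    # sanity check
    scontrol show node localhost | grep -i cputot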

[slurm-dev] Re: single node workstation

2016-09-06 Thread Lachlan Musicman
ge is, "We've always done it this way." - Grace Hopper On 7 September 2016 at 10:39, andrealphus wrote: > > Thanks Lachman, took threads-per-core and out same behavior, still > limited to 18. > > On Tue, Sep 6, 2016 at 5:33 PM, Lachlan Musicman > wrote:

[slurm-dev] Re: single node workstation

2016-09-06 Thread Lachlan Musicman
mber 2016 at 11:34, andrealphus wrote: > > Yup, thats what I expect too! Since Im brand new to slurm, not sure if > there is some other config option or srun flag to enable > multithreading > > On Tue, Sep 6, 2016 at 5:42 PM, Lachlan Musicman > wrote: > > Oh, I'm

[slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?

2016-09-18 Thread Lachlan Musicman
I think you need a couple of things going on: 1. you have to have some sort of accounting organised and set up 2. your sbatch scripts need to use srun, not just the bare command 3. sinfo should then work on the job number. When I asked, that was the response iirc. cheers L. -- The most dangerous phrase i

[slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?

2016-09-18 Thread Lachlan Musicman
've always done it this way." - Grace Hopper On 19 September 2016 at 12:07, Lachlan Musicman wrote: > I think you need a couple of things going on: > > 1. you have to have some sort of accounting organised and set up > 2. your sbatch scripts need to use: srun not just > 3.

[slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?

2016-09-18 Thread Lachlan Musicman
Gah, yes. sstat, not sinfo. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 19 September 2016 at 13:00, Peter A Ruprecht wrote: > Igor, > > Would sstat give you what you need? (http://slurm.schedmd.com/sstat.html) > It doesn't update in

[slurm-dev] CPUs and Hyperthreading and etc

2016-09-21 Thread Lachlan Musicman
Our nodes have Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 CPUs are set to 40 and SelectTypeParameters=CR_CPU According to this FAQ, this is "not a typical configuration". http://slurm.schedmd.com/faq.html#cpu_count Which is fine, I am aware that this is the set up - I did the configuration.

[slurm-dev] Re: CPUs and Hyperthreading and etc

2016-09-21 Thread Lachlan Musicman
cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 22 September 2016 at 14:51, Lachlan Musicman wrote: > Our nodes have > > Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 > > CPUs are set to 40 a

[slurm-dev] Re: CPUs and Hyperthreading and etc

2016-09-22 Thread Lachlan Musicman
s a hard limit of 20. > > On Wed, Sep 21, 2016 at 9:54 PM, Lachlan Musicman > wrote: > > On a side note, I have a minor documentation bug report: > > > > On this page http://slurm.schedmd.com/cpu_management.html there is a > link to > > > > -s, --oversubscr

[slurm-dev] Slurmdbd

2016-09-23 Thread Lachlan Musicman
Is there a description of what each field is in the slurmdbd? I'm looking, in particular, at the _job_table fields: exit_code state time_eligible timelimit (units are minutes?) tres_alloc tres_req (well, mostly how this is calculated) Cheers L. -- The most dangerous phrase in the language

[slurm-dev] CGroups

2016-09-25 Thread Lachlan Musicman
Hi, cgroups have been on my radar since about two weeks after I started looking into SLURM and I'm just getting around to looking at them now. I note that the ProcTrackType docs say > This plugin writes to disk often and can impact performance. If you are running lots of > short running jobs (le
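For reference, a hedged sketch of a common cgroup configuration; the option names should be verified against cgroup.conf(5) for the version in use:

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup

    # cgroup.conf
    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes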

[slurm-dev] Re: Slurm web dashboards

2016-09-27 Thread Lachlan Musicman
I am surprised how hard I found it to find these as well - especially given how frequently the question is asked. This mob have made one, and it looks good, but all development has happened on .deb systems, and I didn't have sufficient time (or skill) to unpack and repack for rpm or generic. http

[slurm-dev] Re: Slurm web dashboards

2016-09-28 Thread Lachlan Musicman
t // Emersons > Green // Bristol // BS16 7FR > > CFMS Services Ltd is registered in England and Wales No 05742022 - a > subsidiary of CFMS Ltd > CFMS Services Ltd registered office // Victoria House // 51 Victoria > Street // Bristol // BS1 6AD > > On 28 September 2016 at 00:36

[slurm-dev] Struggling with QOS?

2016-09-28 Thread Lachlan Musicman
Hi, After some fun incidents with accidental monopolization of the cluster, we decided to enforce some QOS. I read the documentation. Thus far in the set up the only thing I've done that's even close is I assigned "share" values when I set up each association. The cluster had a QOS called normal
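For anyone following the thread, a hedged sketch of the usual QOS wiring; the QOS name, user and limit are placeholders, and the exact limit keywords should be checked against the sacctmgr man page for the running version:

    sacctmgr add qos interactive MaxTRESPerUser=cpu=80
    sacctmgr modify user where user=alice set qos+=interactive defaultqos=interactive
    # slurm.conf
    AccountingStorageEnforce=associations,limits,qos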

[slurm-dev] Re: Struggling with QOS?

2016-09-28 Thread Lachlan Musicman
ge is, "We've always done it this way." - Grace Hopper On 29 September 2016 at 11:10, Lachlan Musicman wrote: > Hi, > > After some fun incidents with accidental monopolization of the cluster, we > decided to enforce some QOS. > > I read the documentation. Thus far

[slurm-dev] Re: Struggling with QOS?

2016-10-02 Thread Lachlan Musicman
29 September 2016 at 22:01, Janne Blomqvist wrote: > On 2016-09-29 04:11, Lachlan Musicman wrote: > > Hi, > > > > After some fun incidents with accidental monopolization of the cluster, > > we decided to enforce some QOS. > [snip] > > What have I done wrong? I re

[slurm-dev] QOS, Limits, CPUs and threads - something is wrong?

2016-10-02 Thread Lachlan Musicman
I started a thread on understanding QOS, but quickly realised I had made a fundamental error in my configuration. I fixed that problem last week. (ref: https://groups.google.com/forum/#!msg/slurm-devel/dqL30WwmrmU/SoOMHmRVDAAJ ) Despite these changes, the issue remains, so I would like to ask again,

[slurm-dev] Re: Job Accounting for sstat

2016-10-02 Thread Lachlan Musicman
wrote: > > On 30/08/16 12:39, Lachlan Musicman wrote: > > > Oh! Thanks. > > > > I presume that includes sruns that are in an sbatch file. > > Yup, that's right. > > cheers! > Chris > -- > Christopher SamuelSenior Systems Administrator >

[slurm-dev] Re: QOS, Limits, CPUs and threads - something is wrong?

2016-10-03 Thread Lachlan Musicman
've always done it this way." - Grace Hopper > > Doug Jacobsen, Ph.D. > NERSC Computer Systems Engineer > National Energy Research Scientific Computing Center > <http://www.nersc.gov> > dmjacob...@lbl.gov > > - __o > -- _ '\

[slurm-dev] Re: cons_res / CR_CPU - we don't have select plugin type 102

2016-10-04 Thread Lachlan Musicman
Jose, Do all the nodes have access to either a shared /usr/lib64/slurm or do they each have their own? And is there a file in that dir (on each machine) called select_cons_res.so? Also, when changing slurm.conf here's a quick and easy workflow: 1. change slurm.conf 2. deploy to all machines in c
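The checks suggested above, spelled out; the paths assume a standard rpm install:

    ls -l /usr/lib64/slurm/select_cons_res.so
    scontrol show config | grep -i selecttype
    # after editing and copying the same slurm.conf to every node:
    scontrol reconfigure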

[slurm-dev] slurm 16.05.5

2016-10-04 Thread Lachlan Musicman
Hola, Just built the rpms as per the installation docs. Noted that there were three new rpms: slurm-openlava-16.05.5-1.el7.centos.x86_64.rpm slurm-pam_slurm-16.05.5-1.el7.centos.x86_64.rpm slurm-seff-16.05.5-1.el7.centos.x86_64.rpm Is that due to a more sophisticated build machine or due to a

[slurm-dev] Re: slurm build options

2016-10-07 Thread Lachlan Musicman
Check against the installed libs? Check *-devel? Otherwise I'm not 100% sure - unless the rpmbuild folder with all files still exists and there's something in there? FWIW, it's relatively easy to install all the libs that SLURM needs without causing too many problems. The hardest I've found so far

[slurm-dev] Draining, Maint or ?

2016-10-11 Thread Lachlan Musicman
Hola, For reasons, our IT team needs some downtime on our authentication server (FreeIPA/sssd). We would like to minimize the disruption, but also not lose any work. The current plan is for the nodes to be set to DRAIN on Friday afternoon and on Monday morning we will suspend any running jobs, m
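The drain/resume commands in question, with a placeholder node range and reason string:

    scontrol update NodeName=compute[01-10] State=DRAIN Reason="auth server maintenance"
    # after the outage
    scontrol update NodeName=compute[01-10] State=RESUME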

[slurm-dev] Re: Draining, Maint or ?

2016-10-11 Thread Lachlan Musicman
partition - jobs running on that partition will continue to do so cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 12 October 2016 at 10:35, Lachlan Musicman wrote: > Hola, > > For reasons, our IT team ne

[slurm-dev] Re: ulimit issue I'm sure someone has seen before

2016-10-13 Thread Lachlan Musicman
Mike, I would suggest that the limit is a SLURM limit rather than a ulimit. What is the result of scontrol show config | grep Mem ? Because you have set your SelectTypeParameters=CR_Core_Memory Memory will cause jobs to fail if they go over the default memory limit. The SLURM head will kill j
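To illustrate: with Memory in SelectTypeParameters the enforced limit comes from the job's request or the cluster default, so either request enough or raise the default; the values below are illustrative:

    scontrol show config | egrep -i 'defmem|maxmem'
    sbatch --mem=16G job.sh          # per-node request
    sbatch --mem-per-cpu=4G job.sh   # per-CPU alternative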

[slurm-dev] Re: Packaging for fedora (and EPEL)

2016-10-17 Thread Lachlan Musicman
I've had consistent success with the documented system - "rpmbuild slurm-.tgz" then yum installing the resulting files, using 15.x, 16.05 and 17.02. Have on occasion needed to recompile - hdf5 support and for non main line plugins, but otherwise it's been pretty easy. Will happily support/debug y

[slurm-dev] Re: sreport "duplicate" lines

2016-10-20 Thread Lachlan Musicman
On 21 October 2016 at 12:39, Christopher Samuel wrote: > > On 21/10/16 12:29, Andrew Elwell wrote: > > > When running sreport (both 14.11 and 16.05) I'm seeing "duplicate" > > user info with different timings. Can someone say what's being added > > up separately here - it seems to be summing some

[slurm-dev] Re: Impact to jobs when reconfiguring partitions?

2016-10-24 Thread Lachlan Musicman
On 25 October 2016 at 08:42, Tuo Chen Peng wrote: > Hello all, > > This is my first post in the mailing list - nice to join the community! > Welcome! > > > I have a general question regarding slurm partition change: > > If I move one node from one partition to the other, will it cause any > im

[slurm-dev] Re: Impact to jobs when reconfiguring partitions?

2016-10-24 Thread Lachlan Musicman
On 25 October 2016 at 09:17, Tuo Chen Peng wrote: > Oh ok thanks for pointing this out. > > I thought ‘scontrol update’ command is for letting slurmctld to pick up > any change in slurm.conf. > > But after reading the manual again, it seems this command is instead to > change the setting at runti

[slurm-dev] How to restart a job "(launch failed requeued held)"

2016-10-27 Thread Lachlan Musicman
Morning, Yesterday we had some internal network issues that caused havoc on our system. By the end of the day everything was ok on the whole. This morning I came in to see one job on the queue (which was otherwise relatively quiet) with the error message/Nodelist Reason (launch failed requeued he
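One common sequence for a job left in that state, once the node problem is cleared; the node name is a placeholder and the job id is the one quoted in the follow-up message:

    scontrol update NodeName=badnode State=RESUME
    scontrol release 230591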

[slurm-dev] Re: How to restart a job "(launch failed requeued held)"

2016-10-27 Thread Lachlan Musicman
On 28 October 2016 at 09:20, Christopher Samuel wrote: > > On 28/10/16 08:44, Lachlan Musicman wrote: > > > So I checked the system, noticed that one node was drained, resumed it. > > Then I tried both > > > > scontrol requeue 230591 > > scontrol resume 2

[slurm-dev] Re: Can slurm work on one node?

2016-10-30 Thread Lachlan Musicman
I think it should. Can you send through your slurm.conf? Also, the logs usually explicitly say why slurmctld/slurmd don't start, and the best way to judge if slurm is running is with systemd: systemctl status slurmctld systemctl status slurmd cheers L. -- The most dangerous phrase in the l

[slurm-dev] Re: start munge again after boot?

2016-11-07 Thread Lachlan Musicman
On 8 November 2016 at 07:11, Peixin Qiao wrote: > Hi, > > I install munge and restart my computer, then munge stopped work and > restarting munge didn't work. It says: > > munged: Error: Failed to check pidfile dir "/var/run/munge": cannot > canonicalize "/var/run/munge": No such file or director
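Because /var/run is normally a tmpfs, the directory has to be recreated at boot. A hedged fix on systemd-based distributions is a tmpfiles.d entry plus enabling the unit; the uid/gid assume a 'munge' user exists:

    echo 'd /var/run/munge 0755 munge munge -' > /etc/tmpfiles.d/munge.conf
    systemd-tmpfiles --create /etc/tmpfiles.d/munge.conf
    systemctl enable munge && systemctl start munge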

[slurm-dev] Re:

2016-11-07 Thread Lachlan Musicman
Peixin, Again, depends on your OS and deployment methods, but essentially: In slurm.conf set SlurmctldPidFile=/var/run/slurmctld.pid SlurmdPidFile=/var/run/slurmd.pid SlurmdSpoolDir=/var/spool/slurmd SlurmUser=slurm SlurmctldLogFile=/var/log/slurm/slurm-ctld.log SlurmdLogFile=/var/log/slur

[slurm-dev] sinfo man page

2016-11-07 Thread Lachlan Musicman
Priority: Minor I notice that this command works well: sinfo -Nle -o '%C %t' Tue Nov 8 11:38:09 2016 CPUS(A/I/O/T) STATE 40/0/0/40 alloc 38/2/0/40 mix 36/4/0/40 mix 36/4/0/40 mix 6/34/0/40 mix 0/40/0/40 idle 0/40/0/40 idle 0/40/0/40 idle 0/40/0/40 idle 0/40/0/40 idle 0/40/0/40 idle 0/40/0/40 idl

[slurm-dev] Re: sinfo man page

2016-11-07 Thread Lachlan Musicman
Arg, I see now (hit send too soon). My parsing of the man page was wrong. cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 8 November 2016 at 11:39, Lachlan Musicman wrote: > Priority: Minor > > I notice

[slurm-dev] Re: Re:

2016-11-08 Thread Lachlan Musicman
On 9 November 2016 at 09:36, Christopher Samuel wrote: > > But /tmp is almost certainly the second worst place (after /dev/shm). > I don't know Chris, I think that /dev/null would rate tbh. :) cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - G

[slurm-dev] Using slurm to control container images?

2016-11-15 Thread Lachlan Musicman
Hola, We were looking for the ability to make jobs perfectly reproducible. While the system is set up with environment modules, with the increasing number of package management tools - pip/conda; npm; CRAN/Bioconductor - and people building increasingly complex software stacks, our users have

[slurm-dev] Re: Using slurm to control container images?

2016-11-15 Thread Lachlan Musicman
lenv's if that is the case the switch to a container with rkt seems > "normal" instead of a more intrusive one all mighty process to rule > everything that docker had the last time I check, its probably better now. > > Saludos. > Jean > > On Tue, Nov 15, 2016 a

[slurm-dev] New design on schedmd site!

2016-11-15 Thread Lachlan Musicman
Hey Devs, The new design on the schedmd site is pretty - thanks! L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper

[slurm-dev] Email differentials

2016-12-01 Thread Lachlan Musicman
Hi, I've had a request from a user about the email system in SLURM. Basically, there's a team collaboration and the request was: is there an sbatch command such that two groups will get different sets of emails. Group 1: only get the email if the jobs FAIL Group 2: get Begin, End and Fail Cheer

[slurm-dev] Re: submitting jobs based in saccmgr info

2016-12-07 Thread Lachlan Musicman
On 8 December 2016 at 07:54, Mark R. Piercy wrote: > > Is it ever possible to submit jobs based on a users org affiliation? So > if a user is in org (PI) "smith" then their jobs would automatically be > sent to a particular partition. So no need to use the -p option in > sbatch/srun job. > M

[slurm-dev] Re: How to remove node temporal files

2016-12-28 Thread Lachlan Musicman
Hi David, I dealt with this recently (see https://groups.google.com/forum/#!topic/slurm-devel/DKcFng8c1zE for instance ) In the end we went with this solution that has worked well for us: https://slurm.schedmd.com/SUG14/private_tmp.pdf which describes this plugin: https://github.com/hpc2n/span

[slurm-dev] Re: Trying to figure out if I need to use "associations" on my cluster

2016-12-28 Thread Lachlan Musicman
Will, I believe you do. While they aren't necessary in your case, I believe the software has been built for maximum extensibility, and as such there needs to be: at least one cluster at least one account at least one user and an association is the "grouping" of those three. The relevant part of

[slurm-dev] Re: mail job status to user

2017-01-09 Thread Lachlan Musicman
Not 100% sure what you are asking? The mail options are available from within an sbatch script by using the commands you mention. They can also be passed directly to slurm when invoking the commands sbatch --mail-type=ALL --mail-user=e...@mail.com Are you asking if there is a default "always ma
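For completeness, the same options as in-script directives; the address is a placeholder:

    #!/bin/bash
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=someone@example.org
    srun hostname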

[slurm-dev] Re: Job temporary directory

2017-01-22 Thread Lachlan Musicman
We use the SPANK plugin found here https://github.com/hpc2n/spank-private-tmp and find it works very well. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 21 January 2017 at 03:15, John Hearns wrote: > As I remember, in SGE and in PbsPr

[slurm-dev] Re: New User Creation Issue

2017-01-23 Thread Lachlan Musicman
Interesting. To the best of my knowledge, if you are using Accounting, all users actually need to be in an association - ie having a user account is insufficient. An Association is a tuple consisting of: cluster, user, account and (optional) partition. Is that the problem? cheers L. -- The
