[slurm-dev] Re: Cluster nodes

2017-11-04 Thread Lachlan Musicman
On 5 November 2017 at 11:08, ايمان <435204...@student.ksu.edu.sa> wrote: > Good morning; > > > I want to run parallel java code on more than one node, but it > executed only on one node? > > How can I run it on more than one node? > Good morning Eman, Without more details, it's hard to

[slurm-dev] Re: slurm-dev Re: Re: Slurm and parallel java library pj2

2017-10-30 Thread Lachlan Musicman
are referring to - do you need a guide for SLURM or for pj2? Both can be found quickly with Google. Cheers L. > > ---------- > *From:* Lachlan Musicman <data...@gmail.com> > *Sent:* 10/Safar/1439 02:45 AM > * To: slurm-dev * > * Subject: [slurm-de

[slurm-dev] Re: Slurm and parallel java library pj2

2017-10-29 Thread Lachlan Musicman
On 30 October 2017 at 10:04, ايمان <435204...@student.ksu.edu.sa> wrote: > I am programming a parallel program that uses the parallel java library pj2. > > I want to run it using slurm but I did not know if slurm supports this > library. > > And what are the correct commands to run java on a cluster > >

[slurm-dev] Re: Naming of output & error files

2017-10-25 Thread Lachlan Musicman
On 26 October 2017 at 13:27, Alex Chekholko wrote: > Why can't you just do > > for fasta_file in `ls /path/to/fasta_files`; do sbatch > --output=$fasta_file.out --error=$fasta_file.err myscript.sbatch > $fasta_file; done > Because it was staring me in the face and I
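
The one-liner quoted above is a workable pattern; a slightly expanded sketch, assuming myscript.sbatch takes the FASTA file as its first argument (paths are placeholders):

    for fasta_file in /path/to/fasta_files/*; do
        name=$(basename "$fasta_file")
        # name each job's output and error file after the input file
        sbatch --output="$name.out" --error="$name.err" myscript.sbatch "$fasta_file"
    done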

[slurm-dev] Naming of output & error files

2017-10-25 Thread Lachlan Musicman
Hi All, I've now been asked twice in two days if there is any way to intelligently name slurm output files. Sometimes our users will do something like for fasta_file in `ls /path/to/fasta_files`; do sbatch myscript.sbatch $fasta_file; done They would like their output and error files to be

[slurm-dev] Re: Qos limits associations and AD auth

2017-10-19 Thread Lachlan Musicman
On 19 October 2017 at 20:37, Chris Samuel wrote: > > On Thursday, 19 October 2017 7:41:37 PM AEDT Nadav Toledo wrote: > > > running : id -u domain_name\\username , does return its uid > > So your system is not finding users as just "username", but instead only as >

[slurm-dev] Re: Configure partitions to ignore cpu limit

2017-10-08 Thread Lachlan Musicman
greggish/status/873177525903609857 > On Thu, Oct 5, 2017 at 3:34 PM, Lachlan Musicman <data...@gmail.com> > wrote: > >> On 6 October 2017 at 07:35, Doug Meyer <dameye...@gmail.com> wrote: >> >>> Within the cluster we have partitions that are shared and some that

[slurm-dev] Re: Disable accounting/qos policy enforcement for single job

2017-10-05 Thread Lachlan Musicman
On 6 October 2017 at 07:29, Jacob Chappell wrote: > Is there a way (via scontrol for example) to disable accounting/qos policy > enforcement for a single job? We'd like to be able to allow a job to go > ahead and run, even though it may violate policy (MaxTRES) on a >

[slurm-dev] Re: Configure partitions to ignore cpu limit

2017-10-05 Thread Lachlan Musicman
On 6 October 2017 at 07:35, Doug Meyer wrote: > Within the cluster we have partitions that are shared and some that are > dedicated to specific groups. Is there a way to configure slurm so the > private use partitions do not impact the priority system nor are they > counted

[slurm-dev] Re: defaults, passwd and data

2017-09-24 Thread Lachlan Musicman
On 24 September 2017 at 16:20, Daniel Letai wrote: > Hello, > > B. We have active directory(AD) in our faculty, and We prefer manage > users/groups from there , is it possible? any guide available somewhere? > > > Search this mailing list, this question pops up every now and

[slurm-dev] Re: Fwd: Slurm nodes down

2017-09-21 Thread Lachlan Musicman
On 21 September 2017 at 17:55, Fabrice Nininahazwe wrote: > > Dear developer, > > I have encountered some of the nodes that are down, I can ping to node > n003 and not node n001, I have run scontrol update to change the state with > no success below is the result after

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-18 Thread Lachlan Musicman
On 18 September 2017 at 11:13, Christopher Samuel <sam...@unimelb.edu.au> wrote: > > On 14/09/17 16:04, Lachlan Musicman wrote: > > > It's worth noting that before this change cgroups couldn't get down to > > the thread level. We would only consume at the core level -

[slurm-dev] Re: why the env is the env of submit node, not the env of job running node.

2017-09-15 Thread Lachlan Musicman
On 15 September 2017 at 17:09, Dr. Thomas Orgis wrote: > Hi Zhang, > > the default behaviour of slurm is to try to keep the environment > variables from the submit node. I do not like that and in our > installation, we urge users to always specify > > #SBATCH
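
The directive is cut off above, so the following is only a plausible sketch; sbatch does have an --export option, and --export=NONE (an assumption here, not quoted from the thread) stops the submit node's environment from being copied into the job:

    #!/bin/bash
    #SBATCH --export=NONE   # assumption: start from a clean environment on the compute node
    #SBATCH --job-name=env-check
    # print the environment the job actually sees on the running node
    printenv | sort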

[slurm-dev] Re: why the env is the env of submit node, not the env of job running node.

2017-09-14 Thread Lachlan Musicman
c/profile/ /home/user1/.bashrc has already define > many variable, I think these are default variables, currently, every time, > I also need to source them before using, it is not reasonable from my view. > > Whether there is a way to configure slurm to use running node env, not > sub

[slurm-dev] Re: why the env is the env of submit node, not the env of job running node.

2017-09-14 Thread Lachlan Musicman
On 14 September 2017 at 19:41, Chaofeng Zhang wrote: > On node A, I submit job file using sbatch command, the job is running on > the node B, you will find that the output is not the env of node B, it is > the env of node A. > > > > *#!/bin/bash* > > *#SBATCH

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-14 Thread Lachlan Musicman
On 14 September 2017 at 11:06, Lachlan Musicman <data...@gmail.com> wrote: > > I've just implemented the change from > > NodeName=papr-res-compute[34-36] CPUs=8 RealMemory=31000 State=UNKNOWN > > to > > NodeName=papr-res-compute[34-36] CPUs=8 RealMemory=3100

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-13 Thread Lachlan Musicman
On 14 September 2017 at 11:10, Christopher Samuel <sam...@unimelb.edu.au> wrote: > > On 14/09/17 11:07, Lachlan Musicman wrote: > > > Node configuration differs from hardware: CPUs=8:8(hw) Boards=1:1(hw) > > SocketsPerBoard=8:1(hw) CoresPerSocket=1:4(hw) ThreadsPerCo

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-12 Thread Lachlan Musicman
On 13 September 2017 at 10:36, Christopher Samuel wrote: > > On 13/09/17 07:22, Patrick Goetz wrote: > > > All I have to say to this is: um, what? > > My take has always been that ThreadsPerCore is really for HPC workloads > where you've decided not to disable HT full stop

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-11 Thread Lachlan Musicman
On 11 September 2017 at 20:11, Gennaro Oliva wrote: > > Hi Patrick, > > On Fri, Sep 08, 2017 at 01:17:33PM -0600, Patrick Goetz wrote: > > After some > > discussion on this list, someone convinced me that setting > > "ThreadsPerCore=2" informs Slurm that each CPU actually

[slurm-dev] Slurm and Environments and aliases

2017-08-16 Thread Lachlan Musicman
Hola, I was under the impression that environments travelled with slurm when sbatch was executed - so any node could execute any code as if it was the env I executed from or built within my sbatch scripts. We use Environment Modules and this has all worked just great. Very pleased. Recently I

[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-15 Thread Lachlan Musicman
On 16 August 2017 at 00:14, Will French wrote: > > On Aug 15, 2017, at 5:29 AM, Chris Samuel wrote: > > > > > > On Tuesday, 15 August 2017 4:34:55 PM AEST John Hearns wrote: > > > >> For the /proc/self you need to start an interactive job under

[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-14 Thread Lachlan Musicman
On 15 August 2017 at 11:38, Christopher Samuel <sam...@unimelb.edu.au> wrote: > On 15/08/17 09:41, Lachlan Musicman wrote: > > > I guess I'm not 100% sure what I'm looking for, but I do see that there > > is a > > > > 1:name=systemd:/user.slice/user-0.slice/ses

[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-14 Thread Lachlan Musicman
On 15 August 2017 at 07:41, Robbert Eggermont <r.eggerm...@tudelft.nl> wrote: > > On 14-08-17 07:50, Lachlan Musicman wrote: > >> We have TaskPlugin=task/cgroup and when testing I noticed that the # of >> threads/cpus being allocated was rounded up to the nearest even

[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-14 Thread Lachlan Musicman
On 14 August 2017 at 16:22, John Hearns wrote: > Lachlan, forgive me if I am teaching granny to suck eggs... > I have recently been working with cgroups. > If you run an interactive job what do you see when cat /proc/self/cgroups > Also have you explored in

[slurm-dev] Proctrack cgroup; documentation bug

2017-08-13 Thread Lachlan Musicman
Hola, Two things: in the documentation for slurm.conf the reference to ProcTrack = proctrack/cgroup tells people to see `man cgroup.conf` for more details. That man page holds no details re proctrack. https://slurm.schedmd.com/slurm.conf.html The details in question are on

[slurm-dev] Re: RebootProgram - who uses it?

2017-08-08 Thread Lachlan Musicman
Yep, thanks Chris. I went with regular reboot and have now successfully used scontrol reboot ASAP Very handy! L. -- "The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics is the insistence that we cannot ignore the truth, nor should we panic about it. It is a shared
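
For reference, a small sketch of the combination described here (the node list is a placeholder):

    # slurm.conf
    RebootProgram=/usr/sbin/reboot

    # reboot each node as soon as its running jobs finish
    scontrol reboot ASAP node[01-03]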

[slurm-dev] Re: RebootProgram - who uses it?

2017-08-07 Thread Lachlan Musicman
e power plug. > Let's see you deal with that one. > > On 7 August 2017 at 06:08, Lachlan Musicman <data...@gmail.com> wrote: > >> I've just been asked about implementing a "drain and reboot" for >> nodes/partitions. >> >> In slurm.conf, there is a Re

[slurm-dev] RebootProgram - who uses it?

2017-08-06 Thread Lachlan Musicman
I've just been asked about implementing a "drain and reboot" for nodes/partitions. In slurm.conf, there is a RebootProgram - does this need to be a direct link to a bin or can it be a command? RebootProgram=/usr/sbin/reboot or RebootProgram='systemctl disable reboot-guard; reboot' Cheers L.

[slurm-dev] Re: How to use cgroup in slurm?

2017-08-01 Thread Lachlan Musicman
eong-gu, Daejeon > Republic of Korea 305-701 > Tel. +82-10-2075-6911 > > 2017-08-02 13:05 GMT+09:00 Lachlan Musicman <data...@gmail.com>: > >> [root@n6 /]# si >>> >>> PARTITION NODES NODES(A/I/O/T) S:C:T MEMORY T

[slurm-dev] Re: How to use cgroup in slurm?

2017-08-01 Thread Lachlan Musicman
> > [root@n6 /]# si > > PARTITION NODES NODES(A/I/O/T) S:C:T MEMORY TMP_DISK > TIMELIMIT AVAIL_FEATURES NODELIST > > debug* 6 0/6/0/6 1:4:2 7785 113264 > infinite (null) c[1-6] > > (for a moment) > > [root@n6 /]# si > > PARTITION

[slurm-dev] Re: How to use cgroup in slurm?

2017-08-01 Thread Lachlan Musicman
Sumin, The error message is saying that the node is down. When you say "works with sinfo", you need to show us what that means - sinfo is a command that interrogates the state of nodes, whereas srun sends commands *to* nodes. So sinfo is meant to work - even if the nodes are down. It is the

[slurm-dev] Re: Why my slurm is running on only one node?

2017-07-27 Thread Lachlan Musicman
On 28 July 2017 at 14:30, 허웅 wrote: > I modified my slurm.conf like : > > > > NodeName=GO[1-5] > > > > PartitionName=party Default=yes Nodes=GO[1-5] > > > > and I restarted slurmctld and slurmd services. > > > > [root@GO1]~# systemctl start slurmctld > > [root@GO1]~#

[slurm-dev] Re: Why my slurm is running on only one node?

2017-07-27 Thread Lachlan Musicman
> > sgo3 1party* idle > > sgo4 1party* idle > > sgo5 1party* idle > > [root@GO1]~# sn > Fri Jul 28 09:55:53 2017 >HOSTNAMES > GO1 > GO2 > GO3 >

[slurm-dev] Re: Why my slurm is running on only one node?

2017-07-27 Thread Lachlan Musicman
e only way out is through, and the only way through is together. " *Greg Bloom* @greggish https://twitter.com/greggish/status/873177525903609857 On 28 July 2017 at 10:47, Lachlan Musicman <data...@gmail.com> wrote: > I think it's because hostname is so undemanding. > > How many CPUs

[slurm-dev] Re: Why my slurm is running on only one node?

2017-07-27 Thread Lachlan Musicman
I think it's because hostname is so undemanding. How many CPUs does each host have? You may need to use ((number of cpus per host) + 1) to see action on another node. You can try using stress-ng to test higher loads? https://www.cyberciti.biz/faq/stress-test-linux-unix-server-with-stress-ng/
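
A hedged sketch of that kind of load test, assuming stress-ng is installed on the compute nodes (node and task counts are arbitrary):

    #!/bin/bash
    #SBATCH --job-name=stress-test
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4
    #SBATCH --time=00:05:00
    # one single-CPU stress-ng worker per task, so load should show up on both nodes
    srun stress-ng --cpu 1 --timeout 120s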

[slurm-dev] Re: #SBATCH --time= not always overriding default?

2017-06-30 Thread Lachlan Musicman
in/bash > WDIR=$PWD > #SBATCH -t 1:00 > > the -t 1:00 will get ignored by sbatch > > > On Thu, 29 Jun 2017, Lachlan Musicman wrote: > >> We have a 40min default time on our main partition. >> >> We are finding that researchers that use >> >> #SBATCH --ti
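
The underlying behaviour is that sbatch stops reading #SBATCH directives at the first non-comment line, so a directive placed after a command is silently ignored; a minimal illustration:

    #!/bin/bash
    # wrong: the assignment ends directive parsing, so -t 1:00 is ignored
    WDIR=$PWD
    #SBATCH -t 1:00

    #!/bin/bash
    # right: all #SBATCH directives come before the first command
    #SBATCH -t 1:00
    WDIR=$PWD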

[slurm-dev] #SBATCH --time= not always overriding default?

2017-06-29 Thread Lachlan Musicman
We have a 40min default time on our main partition. We are finding that researchers that use #SBATCH --time=0-07:00:00 are still having their jobs terminated at 40 minutes. Using slurm 17.2.04 on Centos 7.3 Has anyone else experienced this? Cheers L. -- "Mission Statement: To provide

[slurm-dev] Re: Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Lachlan Musicman
We did it in place, worked as noted on the tin. It was less painful than I expected. TBH, your procedures are admirable, but you shouldn't worry - it's a relatively smooth process. cheers L. -- "Mission Statement: To provide hope and inspiration for collective action, to build collective

[slurm-dev] Re: Slurm: which packages on hosts

2017-06-08 Thread Lachlan Musicman
On 9 June 2017 at 15:26, Lachlan Musicman <data...@gmail.com> wrote: > On 9 June 2017 at 14:53, Nicholas C Santucci <santu...@uci.edu> wrote: > >> Those first two of your Gone list I noticed when 17.02.0 was released on >> Feb 23. >> A patch was

[slurm-dev] Re: Slurm: which packages on hosts

2017-06-08 Thread Lachlan Musicman
t; +Obsoletes: slurm-sjobexit slurm-sjstat slurm-seff >> %description contribs >> seff is a mail program used directly by the Slurm daemons. On completion >> of a >> job, wait for it's accounting information to be available and include >> that >> >> On 0

[slurm-dev] Slurm: which packages on hosts

2017-06-08 Thread Lachlan Musicman
Hola, I followed the instructions for building the 16.05.0 bz2 and installed the resulting rpms as follows: Each node got: slurm.x86_64 slurm-devel.x86_64 slurm-munge.x86_64 slurm-perlapi.x86_64 slurm-plugins.x86_64 slurm-sjobexit.x86_64 slurm-sjstat.x86_64 slurm-torque.x86_64 The head

[slurm-dev] Re: Multinode MATLAB jobs

2017-05-31 Thread Lachlan Musicman
I'm pretty sure you need the MDCS. Having said that, I know people run GNU Octave on clusters, can't speak to it though. R works on a cluster quite nicely. cheers L. -- "Mission Statement: To provide hope and inspiration for collective action, to build collective power, to achieve

[slurm-dev] Re: questions

2017-05-30 Thread Lachlan Musicman
Hi and welcome to SLURM. It is late and I am tired, but: 1. SLURM is a cluster 2. front end will run the slurmctld service. Compute nodes will run the slurmd service. How that is divided is up to you. cheers L. -- "Mission Statement: To provide hope and inspiration for collective action,

[slurm-dev] Re: Job ends successfully but spawned processes still run?

2017-05-23 Thread Lachlan Musicman
On 24 May 2017 at 13:18, Christopher Samuel <sam...@unimelb.edu.au> wrote: > > Hiya, > > On 24/05/17 13:10, Lachlan Musicman wrote: > > > Occasionally I'll see a bunch of processes "running" (sleeping) on a > > node well after the job they are ass

[slurm-dev] Job ends successfully but spawned processes still run?

2017-05-23 Thread Lachlan Musicman
Hola, Occasionally I'll see a bunch of processes "running" (sleeping) on a node well after the job they are associated with has finished. How does this happen - does slurm not make sure all processes spawned by a job have finished at completion? cheers L. -- "Mission Statement: To provide

[slurm-dev] Time limit exhausted for JobId=

2017-05-22 Thread Lachlan Musicman
One user has recently started to see their jobs killed after roughly 40 minutes, even though they have asked for four hours. 40 minutes is the partition's default, but this user has #SBATCH --time=04:00:00 in their sbatch file? I have found this: https://bugs.schedmd.com/show_bug.cgi?id=2353 and

[slurm-dev] Re: PartitionTimeLimit : what does that mean?

2017-05-22 Thread Lachlan Musicman
- Patrice Cullors, *Black Lives Matter founder* On 23 May 2017 at 09:43, Lachlan Musicman <data...@gmail.com> wrote: > Hola, > > One of my users has been given the PartitionTimeLimit reason for his jobs > not running. > > He has requested 20 days for the job, but I don't reme

[slurm-dev] PartitionTimeLimit : what does that mean?

2017-05-22 Thread Lachlan Musicman
Hola, One of my users has been given the PartitionTimeLimit reason for his jobs not running. He has requested 20 days for the job, but I don't remember setting a time limit on any partition? I do recall setting a default time, but not a time limit. The docs claim:

[slurm-dev] Re: Slurm version 17.02.3 is now available

2017-05-10 Thread Lachlan Musicman
On 11 May 2017 at 08:33, Batsirai Mabvakure wrote: > Is there a command i can execute for slurm to update automatically without > having to download it again? > > Not really. Ubuntu packages SLURM IIRC, but you would need to wait until they do their packaging and push the

[slurm-dev] Re:

2017-05-09 Thread Lachlan Musicman
ooted in grief and rage but pointed towards vision and dreams." - Patrice Cullors, *Black Lives Matter founder* On 10 May 2017 at 10:57, Lachlan Musicman <data...@gmail.com> wrote: > Running Slurm 16.05 on CentOS 7.3 I'm trying to start an interactive > session with > > srun -w

[slurm-dev]

2017-05-09 Thread Lachlan Musicman
Running Slurm 16.05 on CentOS 7.3 I'm trying to start an interactive session with srun -w papr-expanded01 --pty --mem 8192 -t 06:00 /bin/bash --partition=expanded srun -w papr-expanded01 --pty -t 06:00 /bin/bash --partition=expanded srun -w papr-expanded01 --pty --mem 8192 /bin/bash

[slurm-dev] Re: LDAP required?

2017-04-11 Thread Lachlan Musicman
On 11 April 2017 at 02:36, Raymond Wan wrote: > > For SLURM to work, I understand from web pages such as > https://slurm.schedmd.com/accounting.html that UIDs need to be shared > across nodes. Based on this web page, it seems sharing /etc/passwd > between nodes appears

[slurm-dev] Re: Adding more nodes to SLURM

2017-02-18 Thread Lachlan Musicman
mabvakure <batsir...@nicd.ac.za> wrote: > Thank you so much for the reply. Is there another way I can configure the > nodes other than using mpich that allows me only to update the slurm.conf > file and not install slurm on every new node every time I scale up? > > > On 20

[slurm-dev] Re: sharing generic resources

2017-02-17 Thread Lachlan Musicman
I don't know if you can split it at a GRES level, but I would put the node in the two partitions, and then use QOS to only allow one partition access to the single card and the other partition 3 cards. cheers L. -- The most dangerous phrase in the language is, "We've always done it this

[slurm-dev] Re: slurmctld not pinging at regular interval

2017-02-16 Thread Lachlan Musicman
t > > partitions. The cluster of 64 is the only one where I see this > > happening. Unless that number of nodes is pushing the limit for a single > > slurmctld (which I doubt) I'd be inclined to think it's more likely a > > network issue but in that case I'd expect w

[slurm-dev] RE: Configuring slurm accounts

2017-02-16 Thread Lachlan Musicman
On 17 February 2017 at 03:02, Baker D.J. wrote: > Hello, > > > > Thank you for the reply. There are two accounts on this cluster. What I > was primarily trying to do was define a default QOS with a partition. My > idea was to use sacctmgr to create an association between

[slurm-dev] Re: New User Creation Issue

2017-02-15 Thread Lachlan Musicman
On 16 February 2017 at 09:36, Christopher Samuel wrote: > > We also have all our partitions (other than our debug one reserved for > sysadmins) marked as "State=DOWN" in slurm.conf so that they won't start > jobs when slurmctld is brought back up again. > Chris, What's

[slurm-dev] RE: Configuring slurm accounts

2017-02-15 Thread Lachlan Musicman
If you are only in one account, you don't need to list it. What version of slurm are you using? Someone else mentioned needing to restart slurmctld for users to stick. Which is not something I've experienced, but try that maybe? I am presuming that your slurm.conf is set up correctly for

[slurm-dev] Re: Standard suspend/resume scripts?

2017-02-15 Thread Lachlan Musicman
If you are looking to suspend and resume jobs, use scontrol: scontrol suspend scontrol resume https://slurm.schedmd.com/scontrol.html The docs you are pointing to look more like taking nodes offline in times of low usage? cheers L. -- The most dangerous phrase in the language is,
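
Both subcommands take a job id; a small example (the job id is a placeholder):

    scontrol suspend 12345   # stop the job's processes but keep its allocation
    scontrol resume 12345    # let the job carry on from where it was stopped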

[slurm-dev] Re: Can't Specify Memory Constraint or Run Multiple Jobs per Node

2017-02-12 Thread Lachlan Musicman
g that change to all nodes, restarting slurmctld then running scontrol reconfigure? cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper > On Sat, Feb 11, 2017 at 3:26 AM Lachlan Musicman <data...@gmail.com> > wrote: >

[slurm-dev] Re: Can't Specify Memory Constraint or Run Multiple Jobs per Node

2017-02-11 Thread Lachlan Musicman
1. As EV noted, to get Memory as a consumable resource, you will need to add it to the line that says CR_CPU - change to CR_CPU_Memory https://slurm.schedmd.com/slurm.conf.html 2. That's because of the CR_CPU combined with cons_res. Change to CR_CORE for per core or CR_SOCKET for per socket. For
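
Putting both suggestions together (adding Memory, and moving from CPU to Core), a minimal slurm.conf sketch might look like this (illustrative only):

    # slurm.conf
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory   # cores and memory both become consumable
    # jobs should then request memory explicitly, e.g. with "#SBATCH --mem=4G"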

[slurm-dev] Re: let sbatch run a script

2017-01-31 Thread Lachlan Musicman
There's always the --dependency flag for sbatch. So yes, depending on what you wanted, you could line up another sbatch after the first if you liked. cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 1 February 2017 at 08:38,
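
A small sketch of chaining two submissions (script names are placeholders):

    jid=$(sbatch --parsable first.sbatch)              # --parsable prints just the job id
    sbatch --dependency=afterok:"$jid" second.sbatch   # starts only if the first job exits 0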

[slurm-dev] Re: Node switching to DRAIN for unknown reason, trouble shooting ideas?

2017-01-31 Thread Lachlan Musicman
trival questions: does node has correct time wrt head node? and is node correctly configured in slurm.conf? (# of cpus, amount of memory, etc) cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 1 February 2017 at 08:03, E V

[slurm-dev] Re: slurmctld not pinging at regular interval

2017-01-24 Thread Lachlan Musicman
Check they are all set to the same time, or ntpd against the same server. I found that the nodes that kept going down had the time out of sync. Cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 25 January 2017 at 05:49, Allan Streib

[slurm-dev] Re: New User Creation Issue

2017-01-23 Thread Lachlan Musicman
stuser slurm_clu+ > > > > one thing that I did notice when I add the user I see this error in the > slurmctld log > > > > [2017-01-23T16:47:34.351] error: Update Association request from non-super > user uid=450 > > > > UID 450 happens to be the slurm user &

[slurm-dev] Re: New User Creation Issue

2017-01-23 Thread Lachlan Musicman
Interesting. To the best of my knowledge, if you are using Accounting, all users actually need to be in an association - ie having a user account is insufficient. An Association is a tuple consisting of: cluster, user, account and (optional) partition. Is that the problem? cheers L. -- The

[slurm-dev] Re: Job temporary directory

2017-01-22 Thread Lachlan Musicman
We use the SPANK plugin found here https://github.com/hpc2n/spank-private-tmp and find it works very well. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 21 January 2017 at 03:15, John Hearns wrote: > As I

[slurm-dev] Re: Trying to figure out if I need to use "associations" on my cluster

2016-12-28 Thread Lachlan Musicman
Will, I believe you do. While they aren't necessary in your case, I believe the software has been built for maximum extensibility, and as such there needs to be: at least one cluster at least one account at least one user and an association is the "grouping" of those three. The relevant part of
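
The preview cuts off, but the minimal set it describes can be created with sacctmgr; a hedged sketch (all names are placeholders):

    sacctmgr add cluster mycluster
    sacctmgr add account research cluster=mycluster
    # this last step creates the association: (cluster, account, user)
    sacctmgr add user alice account=research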

[slurm-dev] Re: How to remove node temporal files

2016-12-28 Thread Lachlan Musicman
Hi David, I dealt with this recently (see https://groups.google.com/forum/#!topic/slurm-devel/DKcFng8c1zE for instance ) In the end we went with this solution that has worked well for us: https://slurm.schedmd.com/SUG14/private_tmp.pdf which describes this plugin:

[slurm-dev] Re: submitting jobs based in saccmgr info

2016-12-07 Thread Lachlan Musicman
On 8 December 2016 at 07:54, Mark R. Piercy wrote: > > Is it ever possible to submit jobs based on a users org affiliation? So > if a user is in org (PI) "smith" then their jobs would automatically be > sent to a particular partition. So no need to use the -p option in >

[slurm-dev] Email differentials

2016-12-01 Thread Lachlan Musicman
Hi, I've had a request from a user about the email system in SLURM. Basically, there's a team collaboration and the request was: is there an sbatch command such that two groups will get different sets of emails. Group 1: only get the email if the jobs FAIL Group 2: get Begin, End and Fail

[slurm-dev] New design on schedmd site!

2016-11-15 Thread Lachlan Musicman
Hey Devs, The new design on the schedmd site is pretty - thanks! L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper

[slurm-dev] Re: Using slurm to control container images?

2016-11-15 Thread Lachlan Musicman
pip inside > virtualenv's if that is the case the switch to a container with rkt seems > "normal" instead of a more intrusive one all mighty process to rule > everything that docker had the last time I check, its probably better now. > > Saludos. > Jean > > On Tue, Nov 15, 20

[slurm-dev] Using slurm to control container images?

2016-11-15 Thread Lachlan Musicman
Hola, We were looking for the ability to make jobs perfectly reproducible - while the system is set up with environment modules with the increasing number of package management tools - pip/conda; npm; CRAN/Bioconductor - and people building increasingly more complex software stacks, our users

[slurm-dev] Re: sinfo man page

2016-11-07 Thread Lachlan Musicman
Arg, I see now (hit send too soon). My parsing of the man page was wrong. cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 8 November 2016 at 11:39, Lachlan Musicman <data...@gmail.com> wrote: > Priority:

[slurm-dev] sinfo man page

2016-11-07 Thread Lachlan Musicman
Priority: Minor I notice that this command works well: sinfo -Nle -o '%C %t' Tue Nov 8 11:38:09 2016 CPUS(A/I/O/T) STATE 40/0/0/40 alloc 38/2/0/40 mix 36/4/0/40 mix 36/4/0/40 mix 6/34/0/40 mix 0/40/0/40 idle 0/40/0/40 idle 0/40/0/40 idle 0/40/0/40 idle 0/40/0/40 idle 0/40/0/40 idle 0/40/0/40

[slurm-dev] Re:

2016-11-07 Thread Lachlan Musicman
Peixin, Again, depends on your OS and deployment methods, but essentially: In slurm.conf set SlurmctldPidFile=/var/run/slurmctld.pid SlurmdPidFile=/var/run/slurmd.pid SlurmdSpoolDir=/var/spool/slurmd SlurmUser=slurm SlurmctldLogFile=/var/log/slurm/slurm-ctld.log

[slurm-dev] Re: start munge again after boot?

2016-11-07 Thread Lachlan Musicman
On 8 November 2016 at 07:11, Peixin Qiao wrote: > Hi, > > I install munge and restart my computer, then munge stopped work and > restarting munge didn't work. It says: > > munged: Error: Failed to check pidfile dir "/var/run/munge": cannot > canonicalize "/var/run/munge": No

[slurm-dev] Re: Can slurm work on one node?

2016-10-30 Thread Lachlan Musicman
I think it should. Can you send through your slurm.conf? Also, the logs usually explicitly say why slurmctld/slurmd don't start, and the best way to judge if slurm is running is with systemd: systemctl status slurmctld systemctl status slurmd cheers L. -- The most dangerous phrase in the

[slurm-dev] Re: How to restart a job "(launch failed requeued held)"

2016-10-27 Thread Lachlan Musicman
On 28 October 2016 at 09:20, Christopher Samuel <sam...@unimelb.edu.au> wrote: > > On 28/10/16 08:44, Lachlan Musicman wrote: > > > So I checked the system, noticed that one node was drained, resumed it. > > Then I tried both > > > > scontrol requeue 230

[slurm-dev] How to restart a job "(launch failed requeued held)"

2016-10-27 Thread Lachlan Musicman
Morning, Yesterday we had some internal network issues that caused havoc on our system. By the end of the day everything was ok on the whole. This morning I came in to see one job on the queue (which was otherwise relatively quiet) with the error message/Nodelist Reason (launch failed requeued

[slurm-dev] Re: Impact to jobs when reconfiguring partitions?

2016-10-24 Thread Lachlan Musicman
On 25 October 2016 at 09:17, Tuo Chen Peng wrote: > Oh ok thanks for pointing this out. > > I thought ‘scontrol update’ command is for letting slurmctld to pick up > any change in slurm.conf. > > But after reading the manual again, it seems this command is instead to > change

[slurm-dev] Re: Impact to jobs when reconfiguring partitions?

2016-10-24 Thread Lachlan Musicman
On 25 October 2016 at 08:42, Tuo Chen Peng wrote: > Hello all, > > This is my first post in the mailing list - nice to join the community! > Welcome! > > > I have a general question regarding slurm partition change: > > If I move one node from one partition to the other,

[slurm-dev] Re: sreport "duplicate" lines

2016-10-20 Thread Lachlan Musicman
On 21 October 2016 at 12:39, Christopher Samuel wrote: > > On 21/10/16 12:29, Andrew Elwell wrote: > > > When running sreport (both 14.11 and 16.05) I'm seeing "duplicate" > > user info with different timings. Can someone say what's being added > > up separately here - it

[slurm-dev] Re: Packaging for fedora (and EPEL)

2016-10-17 Thread Lachlan Musicman
I've had consistent success with the documented system - "rpmbuild slurm-.tgz" then yum installing the resulting files, using 15.x, 16.05 and 17.02. Have on occasion needed to recompile - hdf5 support and for non-mainline plugins, but otherwise it's been pretty easy. Will happily support/debug

[slurm-dev] Re: ulimit issue I'm sure someone has seen before

2016-10-13 Thread Lachlan Musicman
Mike, I would suggest that the limit is a SLURM limit rather than a ulimit. What is the result of scontrol show config | grep Mem ? Because you have set your SelectTypeParameters=CR_Core_Memory Memory will cause jobs to fail if they go over the default memory limit. The SLURM head will kill

[slurm-dev] Re: Draining, Maint or ?

2016-10-11 Thread Lachlan Musicman
for that partition - jobs running on that partition will continue to do so cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 12 October 2016 at 10:35, Lachlan Musicman <data...@gmail.com> wrote: > Hola, > > For reason

[slurm-dev] Draining, Maint or ?

2016-10-11 Thread Lachlan Musicman
Hola, For reasons, our IT team needs some downtime on our authentication server (FreeIPA/sssd). We would like to minimize the disruption, but also not lose any work. The current plan is for the nodes to be set to DRAIN on Friday afternoon and on Monday morning we will suspend any running jobs,

[slurm-dev] Re: slurm build options

2016-10-07 Thread Lachlan Musicman
Check against the installed libs? Check *-devel? Otherwise I'm not 100% sure - unless the rpmbuild folder with all files still exists and there's something in there? FWIW, it's relatively easy to install all the libs that SLURM needs without causing too many problems. The hardest I've found so

[slurm-dev] slurm 16.05.5

2016-10-04 Thread Lachlan Musicman
Hola, Just built the rpms as per the installation docs. Noted that there were three new rpms: slurm-openlava-16.05.5-1.el7.centos.x86_64.rpm slurm-pam_slurm-16.05.5-1.el7.centos.x86_64.rpm slurm-seff-16.05.5-1.el7.centos.x86_64.rpm Is that due to a more sophisticated build machine or due to

[slurm-dev] Re: cons_res / CR_CPU - we don't have select plugin type 102

2016-10-04 Thread Lachlan Musicman
Jose, Do all the nodes have access to either a shared /usr/lib64/slurm or do they each have their own? And is there a file in that dir (on each machine) called select_cons_res.so? Also, when changing slurm.conf here's a quick and easy workflow: 1. change slurm.conf 2. deploy to all machines in
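
The workflow is cut off above; a hedged guess at how it continues (paths and node names are placeholders, and some options need a daemon restart rather than just a reconfigure):

    # 1. edit slurm.conf on the head node
    # 2. copy it to every node in the cluster
    for n in node01 node02 node03; do scp /etc/slurm/slurm.conf "$n:/etc/slurm/"; done
    # 3. ask the daemons to re-read the configuration
    scontrol reconfigure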

[slurm-dev] Re: QOS, Limits, CPUs and threads - something is wrong?

2016-10-03 Thread Lachlan Musicman
he language is, "We've always done it this way." - Grace Hopper > > Doug Jacobsen, Ph.D. > NERSC Computer Systems Engineer > National Energy Research Scientific Computing Center > <http://www.nersc.gov> > dmjacob...@lbl.gov > > - __o >

[slurm-dev] Re: Job Accounting for sstat

2016-10-02 Thread Lachlan Musicman
.@unimelb.edu.au> wrote: > > On 30/08/16 12:39, Lachlan Musicman wrote: > > > Oh! Thanks. > > > > I presume that includes sruns that are in an sbatch file. > > Yup, that's right. > > cheers! > Chris > -- > Christopher SamuelSenior Systems Admini

[slurm-dev] QOS, Limits, CPUs and threads - something is wrong?

2016-10-02 Thread Lachlan Musicman
I started a thread on understanding QOS, but quickly realised I had made a fundamental error in my configuration. I fixed that problem last week. (ref: https://groups.google.com/forum/#!msg/slurm-devel/dqL30WwmrmU/SoOMHmRVDAAJ ) Despite these changes, the issue remains, so I would like to ask again,

[slurm-dev] Re: Struggling with QOS?

2016-10-02 Thread Lachlan Musicman
:01, Janne Blomqvist <janne.blomqv...@aalto.fi> wrote: > On 2016-09-29 04:11, Lachlan Musicman wrote: > > Hi, > > > > After some fun incidents with accidental monopolization of the cluster, > > we decided to enforce some QOS. > [snip] > > What have I done w

[slurm-dev] Re: Struggling with QOS?

2016-09-28 Thread Lachlan Musicman
is, "We've always done it this way." - Grace Hopper On 29 September 2016 at 11:10, Lachlan Musicman <data...@gmail.com> wrote: > Hi, > > After some fun incidents with accidental monopolization of the cluster, we > decided to enforce some QOS. > > I read the documentatio

[slurm-dev] Struggling with QOS?

2016-09-28 Thread Lachlan Musicman
Hi, After some fun incidents with accidental monopolization of the cluster, we decided to enforce some QOS. I read the documentation. Thus far in the set up the only thing I've done that's even close is I assigned "share" values when I set up each association. The cluster had a QOS called

[slurm-dev] Re: Slurm web dashboards

2016-09-28 Thread Lachlan Musicman
Park // Dirac Crescent // Emersons > Green // Bristol // BS16 7FR > > CFMS Services Ltd is registered in England and Wales No 05742022 - a > subsidiary of CFMS Ltd > CFMS Services Ltd registered office // Victoria House // 51 Victoria > Street // Bristol // BS1 6AD > >

[slurm-dev] Re: Slurm web dashboards

2016-09-27 Thread Lachlan Musicman
I am surprised how hard I found it to find these as well - especially given how frequently the question is asked. This mob have made one, and it looks good, but all development has happened on .deb systems, and I didn't have sufficient time (or skill) to unpack and repack for rpm or generic.

[slurm-dev] CGroups

2016-09-26 Thread Lachlan Musicman
Hi, cgroups have been on my radar since about two weeks after I started looking into SLURM and I'm just getting around to looking at them now. I note that the ProcTrackType docs say > This plugin writes to disk often and can impact performance. If you are running lots of > short running jobs
