[slurm-dev] Re: How to strictly limit the memory per CPU

2017-11-01 Thread Christopher Samuel

On 02/11/17 14:34, 马银萍 wrote:

> It means that he used only one CPU but asked for 125G of memory, so he used
> most of the memory on that node, which affects other users' jobs;
> this is invalid.
> So is there any way to strictly limit the average memory per CPU so that
> users can't override it? Or any way to disable --mem and --mem-per-cpu?

I believe you can restrict the amount of memory jobs can use via TRES
functionality:

https://slurm.schedmd.com/tres.html

It's not something we do here though.
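That said, a rough sketch of the sort of per-user limit that page describes
would be something like the following (untested here; the QOS name and value
are assumptions, and AccountingStorageEnforce would need to include
"limits,qos"):

# Hedged sketch only: cap the memory any single user can have allocated
# at once under the "normal" QOS.
sacctmgr modify qos normal set MaxTRESPerUser=mem=128G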

Best of luck!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Qos limits associations and AD auth

2017-10-18 Thread Christopher Samuel

On 18/10/17 16:27, Nadav Toledo wrote:

> About B: The reason is I don't want to manually add each user to
> the slurm database (sacctmgr create user...)

I'm afraid you don't really have an option there: if you want to use the
slurmdbd limits then you're going to need to add the users to the
database.  You could script the addition/removal to avoid having to do
it all by hand.
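As a hedged sketch of the sort of scripting I mean (the cluster name,
account name and UID cut-off are all assumptions for your site):

# Bulk-add existing OS accounts to slurmdbd; assumes account "general"
# and cluster "mycluster" already exist.
getent passwd | awk -F: '$3 >= 1000 {print $1}' | while read u; do
    sacctmgr -i add user "$u" account=general cluster=mycluster
done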

Regarding the idea of a web portal - I think what is being suggested is
not using AD but instead have your own LDAP server for the cluster which
is populated via a web portal.

If you are tied into using AD (which it sounds like you are) then that's
not really an option for you.

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: mysql job_table and step_table growth

2017-10-18 Thread Christopher Samuel

On 19/10/17 05:24, Douglas Meyer wrote:

> We have job_table purge set for 61 days and step_table for 11.  Seems
> to have no impact.

So you have this in slurmdbd.conf?

PurgeJobAfter=61days
PurgeStepAfter=11days

Anything in the logs when you start up slurmdbd?

What does this say?

sacctmgr list config | fgrep Purge

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: mysql job_table and step_table growth

2017-10-15 Thread Christopher Samuel

On 14/10/17 00:24, Doug Meyer wrote:

> The job_table.idb and step_table.idb do not clear as part of day-to-day
> slurmdbd.conf
> 
> Have slurmdbd.conf set to purge after 8 weeks but this does not appear
> to be working.

Anything in your slurmdbd logs?

-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Slurm 17.02.7 and PMIx

2017-10-09 Thread Christopher Samuel

On 05/10/17 11:27, Christopher Samuel wrote:

> PMIX v1.2.2: Slurm complains and tells me it wants v2.

I think that was due to a config issue on the system I was helping out
with; after installing some extra packages (like a C++ compiler) to get
other things working, I can no longer reproduce this issue.

So next outage they get we can add PMIx support to Slurm (my test build
compiled OK).

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Tasks distribution

2017-10-09 Thread Christopher Samuel

On 09/10/17 22:11, Sysadmin CAOS wrote:

> Now, after that, should srun distribute correctly my tasks as mpirun
> does right?

No, srun will distribute the tasks the way Slurm wants to; remember it's
the MPI implementation's job to listen to what the resource manager tells
it to do, not the other way around.

So the issue here is getting Slurm to allocate nodes in the way you
wish.  On my cluster I see:

srun: Warning: can't honor --ntasks-per-node set to 4 which doesn't
match the requested tasks 17 with the number of requested nodes 5.
Ignoring --ntasks-per-node.

That's Slurm 16.05.8.  Do you see the same?

Did you try both having CR_Pack_Nodes *and* specifying this?

-n 17 --ntasks-per-node=4
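For reference, the slurm.conf side of that would be roughly (a sketch only;
your existing SelectTypeParameters may well differ):

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory,CR_Pack_Nodes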

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Camacho Barranco, Roberto <rcamachobarra...@utep.edu> ssirimu...@utep.edu

2017-10-09 Thread Christopher Samuel

On 10/10/17 07:21, Suman Sirimulla wrote:

> We have installed and configured slurm on our cluster, but unable to
> start the slurmctld daemon. We followed the instructions
> (https://slurm.schedmd.com/troubleshoot.html) and tried to stop and
> restart it multiple times but it is still not working. Please see the error below.

Check your slurmctld.log, that should have hints about why it won't start.
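For instance (a hedged sketch; the log path is just the common default,
adjust to whatever SlurmctldLogFile is set to in your slurm.conf):

grep -i error /var/log/slurm/slurmctld.log | tail -n 20

# Or run the daemon in the foreground with verbose logging:
slurmctld -D -vvv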

cheers!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Slurm 17.02.7 and PMIx

2017-10-04 Thread Christopher Samuel

Hi folks,

Just wondering if anyone here has had any success getting Slurm to
compile with PMIx support?

I'm trying 17.02.7 and I find that with PMIx I get either:

PMIX v1.2.2: Slurm complains and tells me it wants v2.

PMIX v2.0.1: Slurm can't find it because the header files are not
where it is looking for them, and when I do a symlink hack to make
PMIX detection work it then fails to compile, with:

/bin/sh ../../../../libtool  --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I. 
-I../../../.. -I../../../../slurm  -I../../../.. -I../../../../src/common 
-I/usr/include -I/usr/local/pmix/latest/include -DHAVE_PMIX_VER=2   -g -O0 
-pthread -Wall -g -O0 -fno-strict-aliasing -MT mpi_pmix_v2_la-pmixp_client.lo 
-MD -MP -MF .deps/mpi_pmix_v2_la-pmixp_client.Tpo -c -o 
mpi_pmix_v2_la-pmixp_client.lo `test -f 'pmixp_client.c' || echo 
'./'`pmixp_client.c
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../../../.. -I../../../../slurm 
-I../../../.. -I../../../../src/common -I/usr/include 
-I/usr/local/pmix/latest/include -DHAVE_PMIX_VER=2 -g -O0 -pthread -Wall -g -O0 
-fno-strict-aliasing -MT mpi_pmix_v2_la-pmixp_client.lo -MD -MP -MF 
.deps/mpi_pmix_v2_la-pmixp_client.Tpo -c pmixp_client.c  -fPIC -DPIC -o 
.libs/mpi_pmix_v2_la-pmixp_client.o
pmixp_client.c: In function ‘_set_procdatas’:
pmixp_client.c:468:24: error: request for member ‘size’ in something not a 
structure or union
   kvp->value.data.array.size = count;
^
pmixp_client.c:482:24: error: request for member ‘array’ in something not a 
structure or union
   kvp->value.data.array.array = (pmix_info_t *)info;
^
make[4]: *** [mpi_pmix_v2_la-pmixp_client.lo] Error 1


So I'm guessing that I'm missing something but the documentation
for PMIX in Slurm seems pretty much non-existent. :-(
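For reference, the build is pointed at PMIx at configure time along these
lines (a sketch only; the install prefix here is an assumption):

./configure --with-pmix=/usr/local/pmix/latest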

Anyone had any luck with this?

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Upgrading Slurm

2017-10-04 Thread Christopher Samuel

On 04/10/17 20:51, Gennaro Oliva wrote:

> If you are talking about Slurm I would backup the configuration files
> also.

Not directly Slurm related but don't forget to install and configure
etckeeper first.

It puts your /etc/ directory under git version control and will do
commits of changes before and after any package upgrade/install/removal
so you have a good history of changes made.

I'm assuming that the slurm config files in the Debian package are under
/etc so that will be helpful to you for this.
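A hedged sketch of the etckeeper setup on Debian (it defaults to git for
/etc):

apt-get install etckeeper
etckeeper init
etckeeper commit "baseline before Slurm upgrade"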

> Anyway there have been a lot of major changes in SLURM and in Debian since
> 2013 (Wheezy release date), so be prepared that it will be no picnic.

The Debian package name also changed from slurm-llnl to slurm-wlm at
some point, so missing the intermediate release may result in that
transition not happening properly.

To be honest I would never use a distros packages for Slurm, I'd always
install it centrally (NFS exported to compute nodes) to keep things
simple.  That way you decouple your Slurm version from the OS and can
keep it up to date (or keep it on a known working version).

All the best!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Setting up Environment Modules package

2017-10-04 Thread Christopher Samuel

On 05/10/17 03:11, Mike Cammilleri wrote:

> 2. Install Environment Modules packages in a location visible to the
> entire cluster (NFS or similar), including the compute nodes, and the
> user then includes their 'module load' commands in their actual slurm
> submit scripts since the command would be available on the compute
> nodes - loading software (either local or from network locations
> depending on what they're loading) visible to the nodes

This is what we do: the management node for the cluster exports its
/usr/local read-only to the rest of the cluster.

We also have in our taskprolog.sh:

echo export BASH_ENV=/etc/profile.d/module.sh

to try and ensure that bash shells have modules set up, just in case. :-)
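For completeness, a sketch of what that looks like as a whole TaskProlog
script (whatever TaskProlog= points at in slurm.conf); anything it prints
as "export NAME=value" is added to the environment of the user's tasks:

#!/bin/bash
echo export BASH_ENV=/etc/profile.d/module.sh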

-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Upgrading Slurm

2017-10-04 Thread Christopher Samuel

On 04/10/17 17:12, Loris Bennett wrote:

> Ole's pages on Slurm are indeed very useful (Thanks, Ole!).  I just
> thought I point out that the limitation on only upgrading by 2 major
> versions is for the case that you are upgrading a production system and
> don't want to lose any running jobs. 

The on-disk format for spooled jobs may also change between releases,
so you probably want to keep that in mind as well.

-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Is PriorityUsageResetPeriod really required for hard limits?

2017-10-03 Thread Christopher Samuel

On 29/09/17 06:34, Jacob Chappell wrote:

> Hi all. The slurm.conf documentation says that if decayed usage is
> disabled, then PriorityUsageResetPeriod must be set to some value. Is
> this really true? What is the technical reason for this requirement if
> so? Can we set this period to sometime far into the future to have
> effectively an infinite period (no reset)?

Basically this is because once a user exceeds something like their
maximum CPU run time limit then they will never be able to run jobs
again unless you either decay or reset usage.

-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Limiting SSH sessions to cgroups?

2017-09-21 Thread Christopher Samuel

On 21/09/17 00:29, Jacob Chappell wrote:

> I still have one weird issue. I'm probably missing another setting
> somewhere. The cgroup that the SSH session is adopted into does not seem
> to include the /dev files.

That's something I can't help with, I'm afraid; we're still on RHEL6.

In a job there I see:

$ cat /proc/$$/cgroup
4:cpuacct:/slurm/uid_500/job_6900206/step_0/task_0
3:memory:/slurm/uid_500/job_6900206/step_0
2:cpuset:/slurm/uid_500/job_6900206/step_0
1:freezer:/slurm/uid_500/job_6900206/step_0

and in my SSH session I see:

$ cat /proc/$$/cgroup
4:cpuacct:/slurm/uid_500/job_6900206/step_extern/task_0
3:memory:/slurm/uid_500/job_6900206/step_extern/task_0
2:cpuset:/slurm/uid_500/job_6900206/step_extern
1:freezer:/slurm/uid_500/job_6900206/step_extern


I'm about to start travelling for the Slurm User Group in a few hours,
so I'll be off-air for quite a while. Good luck!

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Accounting using LDAP ?

2017-09-20 Thread Christopher Samuel

On 20/09/17 17:14, Loris Bennett wrote:

> Is the user management system homegrown or something more generally
> available?

Both, it was started as a project at $JOB-1 and open-sourced.

http://karaage.readthedocs.org/

The current main developer no longer works in HPC (as $JOB-1 folded
years after I left for here) but he's still looking after this as he's
helping the university out in his spare time.  He's currently moving
from deploying it via Debian packages to using Docker.

We have our own custom module for karaage for project creation as we
needed a lot more information from applicants than what it captures by
default, but that's the nice thing, it is modular.

Also includes Shibboleth support.

All the best!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Accounting using LDAP ?

2017-09-19 Thread Christopher Samuel

On 20/09/17 03:03, Carlos Lijeron wrote:

> I'm trying to enable accounting on our SLURM configuration, but our
> cluster is managed by Bright Management which has its own LDAP for users
> and groups.   When setting up SLURM accounting, I don't know how to make
> the connection between the users and groups from the LDAP as opposed to
> the local UNIX.

Slurm just uses the host's NSS config for that, so as long as the OS can
see the users and groups then slurmdbd will be able to see them too.

*However*, you _still_ need to manually create users in slurmdbd to
ensure that they can run jobs, but that's a separate issue to whether
slurmdbd can resolve users in LDAP.
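For example (a hedged sketch; the account and user names are hypothetical):

sacctmgr add account projecta Description="Project A" Organization=someorg
sacctmgr add user alice account=projecta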

I would hope that Bright would have the ability to do that for you
rather than having you handle it manually, but that's a question for Bright.

Best of luck,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Limiting SSH sessions to cgroups?

2017-09-19 Thread Christopher Samuel

On 20/09/17 06:39, Jacob Chappell wrote:

> Thanks everyone who has replied. I am trying to get pam_slurm_adopt.so
> implemented. Does it work with batch jobs?

It does indeed, we use it as well.

Do you have:

PrologFlags=contain

set?  From slurm.conf:


  Contain     At job allocation time, use the ProcTrack plugin to
              create a job container on all allocated compute
              nodes. This container may be used for user processes
              not launched under Slurm control, for example the PAM
              module may place processes launched through a direct
              user login into this container. Setting the Contain
              flag implicitly sets the Alloc flag.
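The other half is the PAM side; a hedged sketch of the usual hook, typically
appended to /etc/pam.d/sshd on the compute nodes once the module from
contribs/pam_slurm_adopt is built and installed:

account    required     pam_slurm_adopt.so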


-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-17 Thread Christopher Samuel

On 14/09/17 16:04, Lachlan Musicman wrote:

> It's worth noting that before this change cgroups couldn't get down to
> the thread level. We would only consume at the core level - ie, all jobs
> would get an even number of cpus - jobs that requested an odd number of
> cpus (threads) would be rounded up to the next even number.

Did you have this set too (either explicitly or implicitly)?

  CR_ONE_TASK_PER_CORE
 Allocate one task per core by default.  Without this option,
 by default one task will be allocated per thread on nodes
 with more than one ThreadsPerCore configured.


cheers!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-13 Thread Christopher Samuel

On 14/09/17 11:07, Lachlan Musicman wrote:

> Node configuration differs from hardware: CPUs=8:8(hw) Boards=1:1(hw)
> SocketsPerBoard=8:1(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:2(hw)

OK, so this is saying that Slurm is seeing:

8 CPUs
1 board
1 socket per board
4 cores per socket
2 threads per core

which is what lscpu also describes the node as.

Whereas the config that it thinks it should have is:

8 CPUs
1 board
8 sockets per board
1 core per socket
1 thread per core

which to me looks like what you would expect with just CPUS=8 in the
config and nothing else.

I guess a couple of questions:

1) Have you restarted slurmctld and slurmd everywhere?

2) Can you confirm that slurm.conf is the same everywhere?

3) what does slurmd -C report?

cheers!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-13 Thread Christopher Samuel

On 14/09/17 11:07, Lachlan Musicman wrote:

> Node configuration differs from hardware: CPUs=8:8(hw) Boards=1:1(hw)
> SocketsPerBoard=8:1(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:2(hw)

Hmm, are you virtualised by some chance?

If so it might be that the VM layer is lying to the guest about the
actual hardware layout.

What does "lscpu" say?

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: On the need for slurm uid/gid consistency

2017-09-13 Thread Christopher Samuel

On 13/09/17 04:53, Phil K wrote:

> I'm hoping someone can provide an explanation as to why slurm
> requires uid/gid consistency across nodes, with emphasis on the need
> for the 'SlurmUser' to be uid/gid-consistent.

I think this is a consequence of the use of Munge, rather than being
inherent in Slurm itself.

https://dun.github.io/munge/

# It allows a process to authenticate the UID and GID of another
# local or remote process within a group of hosts having common
# users and groups

Gory details are in the munged(8) manual page:

https://github.com/dun/munge/wiki/Man-8-munged

But I think the core of the matter is:

# When a credential is validated, munged first checks the
# message authentication code to ensure the credential has
# not been subsequently altered. Next, it checks the embedded
# UID/GID restrictions to determine whether the requesting
# client is allowed to decode it.

So if the UIDs and GIDs of the user differ across systems then it
appears munge will not allow the receiver to validate the message.
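The standard cross-host check is (a sketch; "node01" is a placeholder, and
the decoded UID/GID should match the submitting user on both ends):

munge -n | ssh node01 unmunge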

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-12 Thread Christopher Samuel

On 13/09/17 10:47, Lachlan Musicman wrote:

> Chris how does this sacrifice performance? If none of my software
> (bioinformatics/perl) is HT, surely I'm sacrificing capacity by leaving
> one thread unused as jobs take an entire core?

A HT is not a core, so if you are running multiple processes on a single
core then you will have some form of extra contention; how much of an
impact that will have depends on your application mix and your
hardware generation.

As ever, benchmark and see if you gain more than you lose in that
method.  For HPC stuff which tends to be compute bound the usual advice
is to disable HT in the BIOS, but for I/O bound things you may not be so
badly off.

Hope that helps!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-12 Thread Christopher Samuel

On 13/09/17 07:22, Patrick Goetz wrote:

> All I have to say to this is: um, what?

My take has always been that ThreadsPerCore is really for HPC workloads
where you've decided not to disable HT full stop but want to allocate
full cores to each task and then let the code have 2 threads per Slurm
task (for HPC often that's the same as an MPI rank).

> So, moving to a specific
> implementation example, the question is should this configuration work
> properly?  I do what to include memory in the resource allocation
> calculations, if possible.  Hence:
> 
>   SelectType=select/cons_res
>   SelectTypeParameters=CR_CPU_Memory
>   NodeName=n[001-048] CPUs=16 RealMemory=61500 State=UNKNOWN
> 
> 
> Is this going to work as expected?

I would think so; basically you're saying you're willing to sacrifice
performance and consider each HT unit as a core to run a job on.

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Exceeded job memory limit problem

2017-09-06 Thread Christopher Samuel

On 06/09/17 17:38, Sema Atasever wrote:

> I tried the line of code what you recommended but the code still
> generates an error unfortunately.

We've seen issues where using:

JobAcctGatherType=jobacct_gather/linux

gathers incorrect values for jobs (in our experience MPI ones).

We constrain jobs via cgroups and have found that using the cgroup
plugin for this means jobs are no longer killed incorrectly.

Using cgroups in Slurm is a definite win for us, so I would suggest
looking into it if you've not already done so.
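A hedged sketch of the sort of configuration I mean (your site's values
will differ):

# slurm.conf:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/cgroup

# cgroup.conf:
ConstrainCores=yes
ConstrainRAMSpace=yes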

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Jobs cancelled "DUE TO TIME LIMIT" long before actual timelimit

2017-08-30 Thread Christopher Samuel

On 30/08/17 04:34, Brian W. Johanson wrote:

> Any idea on what would cause this?

It looks like the job *step* hit the timelimit, not the job itself.

Could you try the sacct command without the -X flag to see what the
timelimit for the step was according to Slurm please?

$ sacct -S 071417 -a --format JobID%20,State%20,timelimit,Elapsed,ExitCode -j 1695151

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Fair resource scheduling

2017-08-27 Thread Christopher Samuel

On 25/08/17 03:03, Patrick Goetz wrote:

> 1. When users submit (say) 8 long running single core jobs, it doesn't
> appear that Slurm attempts to consolidate them on a single node (each of
> our nodes can accommodate 16 tasks).

How much memory have you configured for your nodes and how much memory
are these single CPU jobs asking for?

That's one thing that can make Slurm need to start jobs on other nodes.

You can also tell it to pack single CPU jobs onto nodes at the other end
of the cluster with this:

pack_serial_at_end
        If used with the select/cons_res plugin then put serial jobs at
        the end of the available nodes rather than using a best fit
        algorithm. This may reduce resource fragmentation for some
        workloads.
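A hedged sketch of enabling it (append to any existing SchedulerParameters
in slurm.conf):

SchedulerParameters=pack_serial_at_end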

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Delete jobs from slurmctld runtime database

2017-08-23 Thread Christopher Samuel

On 22/08/17 01:59, Selch, Brigitte (FIDF) wrote:

> That’s the reason for my question.

I'm not aware of any way to do that, and I would advise against mucking
around in the Slurm MySQL database directly.

The idea of slurmdbd is to have a comprehensive view of all jobs (within
its expiry parameters), and removing them will likely break its
statistics and probably do Bad Things(tm).

Here be dragons...

-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-14 Thread Christopher Samuel

On 15/08/17 09:41, Lachlan Musicman wrote:

> I guess I'm not 100% sure what I'm looking for, but I do see that there
> is a
> 
> 1:name=systemd:/user.slice/user-0.slice/session-373.scope
> 
> in /proc/self/cgroup

Something is wrong in your config then. It should look something like:

4:cpuacct:/slurm/uid_3959/job_6779703/step_9/task_1
3:memory:/slurm/uid_3959/job_6779703/step_9/task_1
2:cpuset:/slurm/uid_3959/job_6779703/step_9
1:freezer:/slurm/uid_3959/job_6779703/step_9

for /proc/${PID_OF_PROC}/cgroup

I notice you have /proc/self - that will be the shell you are running in
for your SSH session and not the job!

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Proctrack cgroup; documentation bug

2017-08-13 Thread Christopher Samuel

On 14/08/17 08:55, Lachlan Musicman wrote:

> Was it here I read that proctrack/linuxproc was better than
> proctrack/cgroup?

I think you're thinking of JobAcctGatherType, but even then our
experience there was that jobacct_gather/cgroup was more accurate.

-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: RebootProgram - who uses it?

2017-08-08 Thread Christopher Samuel

On 07/08/17 17:57, Aaron Knister wrote:

> Good grief. "reboot" is a legacy tool?!?! I've about had enough of systemd.

FWIW reboot is provided by the init system implementation (for instance
on RHEL6 it's from upstart), and /sbin/reboot is only optional in the
FHS. Only /sbin/shutdown is required by the FHS.

http://www.pathname.com/fhs/2.2/fhs-3.14.html

On proprietary UNIX versions reboot (not guaranteed to be in /sbin, it
was /etc/reboot on Ultrix 4, /usr/sbin/reboot on Solaris) may not run
shutdown scripts either (eg Solaris), you'd want to use shutdown for that.

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: RebootProgram - who uses it?

2017-08-08 Thread Christopher Samuel

On 07/08/17 14:08, Lachlan Musicman wrote:

> In slurm.conf, there is a RebootProgram - does this need to be a direct
> link to a bin or can it be a command?

We have:

RebootProgram = /sbin/reboot

Works for us.

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Multifactor Priority Plugin for Small clusters

2017-07-03 Thread Christopher Samuel

On 03/07/17 16:02, Loris Bennett wrote:

> I don't think you can achieve what you want with Fairshare and
> Multifactor Priority.  Fairshare looks at distributing resources fairly
> between users over a *period* of time.  At any *point* in time it is
> perfectly possible for all the resources to be allocated to one user.

Loris is quite right about this, but it is possible to impose limits on
a project if you chose to use slurmdbd.

First you need to set up accounting:

https://slurm.schedmd.com/accounting.html

then you can set limits:

https://slurm.schedmd.com/resource_limits.html
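For example, a hedged sketch of capping a hypothetical project account once
accounting and limit enforcement are in place:

sacctmgr modify account projecta set GrpTRES=cpu=64,mem=256G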

Best of luck!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: How to get Qos limits

2017-06-06 Thread Christopher Samuel

On 07/06/17 03:08, Kowshik Thopalli wrote:

>  I wish to know the max number of jobs that as a user I can run. That
> is MaxJobsPerUser*.  *I will be thankful if you can actually provide the
> commands that I will have to execute.

You probably want:

sacctmgr list user ${USER} format=MaxJobsPerUser

For a more general view you would do:

sacctmgr list user ${USER} withassoc

Hope this helps,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: srun - replacement for --x11?

2017-06-06 Thread Christopher Samuel

On 06/06/17 23:46, Edward Walter wrote:

> Doesn't that functionality come from a spank plugin?
> https://github.com/hautreux/slurm-spank-x11

Yes, that's the one we use. Works nicely.

Provides the --x11 option for srun.
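A hedged sketch of the plugstack.conf entry that enables it (the installed
path of the plugin is an assumption):

optional /usr/lib64/slurm/x11.so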

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Accounting: preventing scheduling after TRES limit reached (permanently)

2017-06-04 Thread Christopher Samuel

On 03/06/17 07:03, Jacob Chappell wrote:

> Sorry, that was a mouthful, but important. Does anyone know if Slurm can
> accomplish this for me. If so how?

This was how we used to run prior to switching to fair-share.

Basically you set:

PriorityDecayHalfLife=0

which stops the values decaying over time so once they hit their limit
that's it.

We also set:

PriorityUsageResetPeriod=QUARTERLY

so that limits would reset on the quarter boundaries.  This was because
we used to have fixed quarterly allocations for projects.

We went to fair-share because a change in our funding model meant the
previous rules were removed, and moving to fair-share gave a massive
improvement in utilisation (compute nodes were no longer idle with jobs
waiting but unable to run because they were out of quota).

NOTE: You can't have both fairshare and hard quotas at the same time.

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: sinfo

2017-05-24 Thread Christopher Samuel

On 25/05/17 05:12, Will French wrote:

> We have an alias setup that shows free and allocated nodes grouped by feature:
> 
> sinfo --format="%55N %.35f %.6a %.10A"

Nice, here's an alternative that is more useful in our setup which
groups nodes by reason and GRES.

sinfo --format="%60N %.15G %.30E %.10A"

The reason can be quite long, but there doesn't seem to be a way to just
show the status as down/drain/idle/etc.

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Job ends successfully but spawned processes still run?

2017-05-23 Thread Christopher Samuel

On 24/05/17 13:45, Lachlan Musicman wrote:

> Not yet - that's part of the next update cycle :/

Ah well that might help, along with pam_slurm_adopt so that users
SSH'ing into nodes they have jobs on are put into a cgroup of theirs on
that node.

Helps catch any legacy SSH based MPI launchers (and other naughtiness).

Good luck!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545



[slurm-dev] Re: Job ends successfully but spawned processes still run?

2017-05-23 Thread Christopher Samuel

Hiya,

On 24/05/17 13:10, Lachlan Musicman wrote:

> Occasionally I'll see a bunch of processes "running" (sleeping) on a
> node well after the job they are associated with has finished.
> 
> How does this happen - does slurm not make sure all processes spawned by
> a job have finished at completion?

Are you not using cgroups for enforcement?

Usually that picks everything up.

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: discrepancy between node config and # of cpus found

2017-05-21 Thread Christopher Samuel

On 20/05/17 07:46, Jeff Avila wrote:

> Yes, I did give that a try, though it didn’t seem to make any difference to 
> the error messages I got.

Have you also set DefMemPerCPU and checked how much RAM is allocated to
the jobs?

Remember that you can have free cores but not free memory on a node and
then Slurm isn't going to put more jobs there (unless you tell it to
ignore memory, which is not likely to end well).
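A hedged sketch of setting a per-CPU default in slurm.conf (the value, in
megabytes, is an assumption):

DefMemPerCPU=4096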

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: LDAP required?

2017-04-19 Thread Christopher Samuel

On 13/04/17 18:32, Janne Blomqvist wrote:

> 15.08 should also work with enumeration disabled, except for
> AllowGroups/DenyGroups partition specifications.

I'm pretty sure this was what we got stuck on, and so had to drop AD.

> So how do you manage user accounts? Just curious if someone has a sane
> middle ground between integrating with the organization user account
> system (AD or whatever) and DIY.

We use some software that was developed at my previous employer called
Karaage to manage our projects, including allowing project leaders to
invite members, talking to LDAP and integrating with slurmdbd.

Sadly my previous employer shut down at the end of 2015 (long after I
left I hasten to add!) and the person who was doing a lot of that work
has moved on to other things and only tinkers with the code base.

That said there are 2 different HPC organisations inside the university
using it and the other group use it with Shibboleth integration so that
people with AAF (Australian Access Federation) credentials can auth to
the web interface with their institutional ID (though of course it still
creates a separate LDAP account for them).

https://github.com/Karaage-Cluster

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: LDAP required?

2017-04-12 Thread Christopher Samuel

On 13/04/17 01:47, Jeff White wrote:

> +1 for Active Directory bashing. 

I wasn't intending to "bash" AD here, just that the AD that we were
trying to use (and I suspect that Lachlan might be talking to) has tens
of thousands of accounts in it and we just could not get the
Slurm->sssd->AD chain to work reliably to be able to run a production
system.

This was with both sssd trying to enumerate the whole domain and also
(before that) trying to get Slurm to work without sssd enumeration.

Smaller AD domains might work more reliably, but that's not where we sit
so we fell back to using our own LDAP server with Karaage to manage
project/account applications, adding people to slurmdbd, etc.

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: LDAP required?

2017-04-12 Thread Christopher Samuel

On 11/04/17 16:05, Lachlan Musicman wrote:

> Our auth actually backs onto an Active Directory domain

You have my sympathies. That caused us no end of headaches when we tried
that on a cluster I help out on and in the end we gave up and fell back
to running our own LDAP to make things reliable again.

+1 for running your own LDAP.

I would seriously look at a cluster toolkit for running nodes,
especially if it supports making a single image that your compute nodes
then netboot.  That way you know everything is consistent.

Best of luck,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Jobs submitted simultaneously go on the same GPU

2017-04-11 Thread Christopher Samuel

On 10/04/17 21:08, Oliver Grant wrote:

> We did not have a gres.conf file. I've created one:
> cat /cm/shared/apps/slurm/var/etc/gres.conf
> # Configure support for our four GPU
> NodeName=node[001-018] Name=gpu File=/dev/nvidia[0-3]
> 
> I've read about "global" and "per-node" gres.conf, but I don't know how
> to implement them or if I need to?

Yes you do.

Here's an (anonymised) example from a cluster that I help with that has
both GPUs and MIC's on various nodes.

# We will have GPU & KNC nodes so add the GPU & MIC GresType to manage them
GresTypes=gpu,mic
# Node definitions for nodes with GPUs
NodeName=thing-gpu[001-005] Weight=3000 NodeAddr=thing-gpu[001-005] RealMemory=254000 CoresPerSocket=6 Sockets=2 Gres=gpu:k80:4
# Node definitions for nodes with Xeon Phi
NodeName=thing-knc[01-03] Weight=2000 NodeAddr=thing-knc[01-03] RealMemory=126000 CoresPerSocket=10 Sockets=2 ThreadsPerCore=2 Gres=mic:5110p:2

You'll also need to restart slurmctld and all the slurmd daemons to pick
up this new config; I don't think "scontrol reconfigure" will deal
with this.

Best of luck,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Distinguishing past jobs that waited due to dependencies vs resources?

2017-04-11 Thread Christopher Samuel

Hi folks,

We're looking at wait times on our clusters historically but would like
to be able to distinguish jobs that had long wait times due to
dependencies rather than just waiting for resources (or because the user
had too many other jobs in the queue at that time).

A quick 'git grep' of the source code after reading 'man sacct' and not
finding anything (also running 'sacct -e' and not seeing anything useful
there either) doesn't offer much hope.

Anyone else dealing with this?

We're on 16.05.x at the moment with slurmdbd.

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Randomly jobs failures

2017-04-11 Thread Christopher Samuel

On 11/04/17 17:42, Andrea del Monaco wrote:

> [2017-04-11T08:22:03+02:00] error: Error opening file
> /cm/shared/apps/slurm/var/cm/statesave/job.830332/script, No such file
> or directory
> [2017-04-11T08:22:03+02:00] error: Error opening file
> /cm/shared/apps/slurm/var/cm/statesave/job.830332/environment, No such
> file or directory

I would suggest that you are looking at transient NFS failures (which
may not be logged).

Are you using NFSv3 or v4 to talk to the NFS server and what are the
OS's you are using for both?

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Scheduling jobs according to the CPU load

2017-03-21 Thread Christopher Samuel

On 22/03/17 08:35, kesim wrote:

> You are right. Many thanks for correcting.

Just note that load average is not necessarily the same as CPU load.

If you have tasks blocked for I/O they will contribute to load average
but will not be using much CPU at all.

So, for instance, on one of our compute nodes a Slurm job can ask for 1
core, start 100 tasks doing heavy I/O, they all use the same 1 core and
get the load average to 100 but the other 31 cores on the node are idle
and can quite safely be used for HPC work.

The manual page for "uptime" on RHEL7 describes it thus:

# System load averages is the average number of processes that
# are either in a runnable or uninterruptable state.  A process
# in a runnable state is either using the CPU or waiting to use
# the CPU.  A process in uninterruptable state is waiting for
# some I/O access, eg waiting for disk.

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Fwd: Scheduling jobs according to the CPU load

2017-03-19 Thread Christopher Samuel

On 19/03/17 23:25, kesim wrote:

> I have 11 nodes and declared 7 CPUs per node. My setup is such that all
> desktop belongs to group members who are using them mainly as graphics
> stations. Therefore from time to time an application is requesting high
> CPU usage.

In this case I would suggest you carve off 3 cores via cgroups for
interactive users and give Slurm the other 7 to parcel out to jobs by
ensuring that Slurm starts within a cgroup dedicated to those 7 cores.

This is similar to the "boot CPU set" concept that SGI came up with (at
least I've not come across people doing that before them).

To be fair this is not really Slurm's problem to solve; Linux gives you
the tools to do this already, it's just that people don't realise that
you can use cgroups to do this.
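As a rough illustration of the sort of manual carving I mean, using the
cgroup v1 cpuset controller directly (core and memory-node IDs are
assumptions):

mkdir -p /sys/fs/cgroup/cpuset/interactive
echo 0-2 > /sys/fs/cgroup/cpuset/interactive/cpuset.cpus
echo 0   > /sys/fs/cgroup/cpuset/interactive/cpuset.mems
echo $$  > /sys/fs/cgroup/cpuset/interactive/cgroup.procs  # move this shell in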

Your use case is valid, but it isn't really HPC, and you can't really
blame Slurm for not catering to this.  It can use cgroups to partition
cores to jobs precisely so it doesn't need to care what the load average
is - it knows the kernel is ensuring the cores the jobs want are not
being stomped on by other tasks.

Best of luck!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: reporting used memory with job Accounting or Completion plugins?

2017-03-12 Thread Christopher Samuel

On 11/03/17 09:36, Chris Samuel wrote:

> If you use the slurmdbd accounting database then yes, you get information 
> about memory usage (both RSS and VM).
> 
> Have a look at the sacct manual page and look for MaxRSS and MaxVM. 

I should mention that for jobs that trigger job steps with srun you can
also monitor them as the job is going with 'sstat' (rather than just
post-mortem with sacct).
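For example (the job ID is a placeholder):

sstat -j 1234567 --format=JobID,MaxRSS,MaxVMSize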

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Storage of job submission and working directory paths

2017-03-07 Thread Christopher Samuel

On 08/03/17 08:15, Chad Cropper wrote:

> Am I missing something? Why is it that the DBD cannot store these 2
> pieces of information?

I suspect it's just not been requested, I'd suggest opening a feature
request at:

http://bugs.schedmd.com/

Report the bug ID back as it would be useful to us here too.

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: slurmctld not pinging at regular interval

2017-02-19 Thread Christopher Samuel

On 17/02/17 05:36, Allan Streib wrote:

> t-019 is one of my nodes that's frequently "down" according to slurm but
> really isn't. What is that "Can't find an address" about? DNS lookups
> seem to be working fine in a shell on the same machine.

This looks to be an issue when Slurm is wanting to forward messages and
trying to find hosts in slurm.conf:

src/common/forward.c - _forward_thread():

/* repeat until we are sure the message was sent */
while ((name = hostlist_shift(hl))) {
        if (slurm_conf_get_addr(name, &addr) == SLURM_ERROR) {
                error("forward_thread: can't find address for host "
                      "%s, check slurm.conf", name);
                slurm_mutex_lock(&fwd_struct->forward_mutex);
                mark_as_failed_forward(&fwd_struct->ret_list, name,
                                       SLURM_UNKNOWN_FORWARD_ADDR);
                free(name);
                if (hostlist_count(hl) > 0) {
                        slurm_mutex_unlock(&fwd_struct->forward_mutex);
                        continue;
                }
                goto cleanup;
        }


It would be interesting to know if increasing your TreeWidth to 256
would help (basically turn off forwarding if I'm reading it right).

   TreeWidth
  Slurmd  daemons  use  a virtual tree network for communications.
  TreeWidth specifies the width of the tree (i.e. the fanout).  On
  architectures  with  a front end node running the slurmd daemon,
  the value must always be equal to or greater than the number  of
  front end nodes which eliminates the need for message forwarding
  between the slurmd daemons.  On other architectures the  default
  value  is 50, meaning each slurmd daemon can communicate with up
  to 50 other slurmd daemons and over 2500 nodes can be  contacted
  with  two  message  hops.   The default value will work well for
  most clusters.  Optimal  system  performance  can  typically  be
  achieved if TreeWidth is set to the square root of the number of
  nodes in the cluster for systems having no more than 2500  nodes
  or  the  cube  root for larger systems. The value may not exceed
  65533.

If so then I suspect that this is a possible transient DNS failure?

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: New User Creation Issue

2017-02-15 Thread Christopher Samuel

On 16/02/17 09:45, Lachlan Musicman wrote:

[partitions down in slurm.conf]
> What's the reasoning behind this? So that you can test the cluster still
> works with debug before jobs start getting submitted and failing?

Yeah, pretty much!

It's a continuation from when running Torque+Moab/Maui here and at VPAC
before that - we would always start Moab paused so we could check out
what impact any changes had to our queues & priorities before starting
jobs running.

Measure twice, cut once.

cheers!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: New User Creation Issue

2017-02-15 Thread Christopher Samuel

On 16/02/17 07:09, Katsnelson, Joe wrote:

> Just curious is there a way to restart slurm to get the below working
> without impacting the current jobs that are running?

You should be able to restart slurmctld with running jobs quite safely;
if you are paranoid (like me) then just mark partitions down first so
you know Slurm won't be trying to start any jobs just when you shut it down.

We also have all our partitions (other than our debug one reserved for
sysadmins) marked as "State=DOWN" in slurm.conf so that they won't start
jobs when slurmctld is brought back up again.
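A hedged sketch of the sequence (the partition name is a placeholder and
the restart command depends on your init system):

scontrol update PartitionName=batch State=DOWN
systemctl restart slurmctld
scontrol update PartitionName=batch State=UP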

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: New User Creation Issue

2017-02-15 Thread Christopher Samuel

Hi Joe,

On 27/01/17 09:01, Katsnelson, Joe wrote:

> sacctmgr list clusters

Sorry I missed this before!

Can you check that the machine where your slurmdbd is running can
connect to 172.16.0.1 on port 6817 please?
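A quick hedged check from the slurmdbd host would be something like this
(nc is from the netcat/nmap-ncat package):

nc -vz 172.16.0.1 6817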

If it can't then that'll be the reason why you need to restart.

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Job priority/cluster utilization help

2017-02-08 Thread Christopher Samuel

On 08/02/17 11:19, Vicker, Darby (JSC-EG311) wrote:

> Sorry for the long post but not sure how to get adequate help without
> providing a lot of detail.  Any recommendations on configuring the
> scheduler to help these jobs run and increase the cluster utilization
> would be appreciated.

My one thought after a quick scan is that both the jobs you mention are
listed as reason "Priority" and there's a higher priority job 1772 in
the list before them.  You might want to look at your backfill settings
to see whether it's looking far enough down the queue to see these.
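A hedged sketch of the sort of knobs that widen how far backfill looks down
the queue (all values are assumptions; append to your existing
SchedulerParameters in slurm.conf):

SchedulerParameters=bf_max_job_test=1000,bf_window=4320,bf_continue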

Perhaps an alternative idea would be to instead of using features use
partitions and then have people submit to all partitions (there is a
plugin for that, though we use a submit filter instead to accomplish the
same).

That way Slurm should consider each job against each partition (set of
architectures) individually.

Best of luck!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: 16.05.8 bug with memory handling?

2017-01-29 Thread Christopher Samuel

On 28/01/17 09:04, Bill Broadley wrote:

> I have a very simple script:
> #!/bin/bash -l
> #SBATCH --mem=35
> #SBATCH --time=96:00:00
> #SBATCH --nodes=1
> #SBATCH --cpus-per-task=4
> 
> srun date
> 
> It acts exactly like I expect with srun:
[...]

> So 4 CPUs are allocated, but the binary is run once.  If I used -n 4 I'd
> expect the binary to be run 4 times.

But the srun you are using to launch the script doesn't ask for any
resources, so it'll get the default, which I would expect to be 1 core,
and so the srun inside the script will just use that 1 core.

> But now with sbatch:

That's more peculiar, I would expect it to get 1 task with 4 CPUs
associated and then just run date once, with access to the 4 cores.

Before your script does "srun date" add:

env | fgrep SLURM
echo ''
scontrol show job -dd ${SLURM_JOB_ID}
echo ''

In my Slurm environment it does just what I expect, runs date once with
4 cores allocated, to whit the environment has:

SLURM_JOB_CPUS_PER_NODE=4

and scontrol shows:

[...]
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
[...]

Best of luck,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Daytime Interactive jobs

2017-01-29 Thread Christopher Samuel

On 28/01/17 04:08, Skouson, Gary B wrote:

> We'd like to have some nodes available during the workday exclusively
> for interactive or debug jobs.  These are fairly small, short running
> jobs.  I'd like to make these nodes available for other jobs at night
> and on weekends.

With Slurm you can have reservations that always start a certain
distance into the future:

 TIME_FLOAT   The reservation start time is relative to the
              current time and moves forward through time
              (e.g. a StartTime=now+10minutes will always be
              10 minutes in the future).

They added that for us and we use it to reserve a node 24x7 for jobs of
less than 2 hours (we have a reasonable amount of those to be able to
justify this).
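A hedged sketch of such a floating reservation; because the start is always
two hours away, only jobs shorter than that can be scheduled on the node
(node name, users and times are assumptions):

scontrol create reservation ReservationName=short Nodes=node001 \
    Users=root StartTime=now+2hours Duration=infinite Flags=TIME_FLOAT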

But, I can't see from the docs how that would work with the DAILY flag
to get it to repeat at the same time each day. :-(

Might be a feature request..

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: New User Creation Issue

2017-01-24 Thread Christopher Samuel

On 24/01/17 08:34, Katsnelson, Joe wrote:

>  I’m having an issue creating new users on our cluster. After
> running the below commands slurm has to be restarted in order for that
> user to be able to run sbatch. Otherwise they get an error

When I've seen that before it's been because the slurmdbd cannot connect
back to slurmctld to send RPCs on the IP address that slurmctld has
registered with slurmdbd.

What does this say?

sacctmgr list clusters

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Job temporary directory

2017-01-22 Thread Christopher Samuel

On 23/01/17 08:40, Lachlan Musicman wrote:

> We use the SPANK plugin found here
> 
> https://github.com/hpc2n/spank-private-tmp
> 
> and find it works very well.

+1 to that, though we had to customise it to our environment (it breaks
when your nodes are diskless and your scratch area is a high-performance
parallel filesystem shared across all nodes).

https://github.com/vlsci/spank-private-tmp

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: mail job status to user

2017-01-15 Thread Christopher Samuel

On 16/01/17 15:56, Ryan Novosielski wrote:

> I think you are right actually. Might have also been configurable
> system-wide. The difference though, still, is that you don't have to
> provide an e-mail address, so you could share a script like this and it
> would work for anyone without modifying it. 

You don't need to provide an email address to Slurm either:

 --mail-user=<user>
        User to receive email notification of state
        changes as defined by --mail-type.  The default
        value is the submitting user.

Our Postfix config rewrites their username to their registered email
address that's stored in LDAP.

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: mail job status to user

2017-01-15 Thread Christopher Samuel

On 10/01/17 18:56, Ole Holm Nielsen wrote:

> For the record: Torque will always send mail if a job is aborted

It's been a few years since I've used Torque so I don't remember that
behaviour.

Thanks for the info!

-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: mail job status to user

2017-01-15 Thread Christopher Samuel

On 14/01/17 09:28, Steven Lo wrote:

> Is it true that there is no configurable parameter to achieve what we
> want to do and user need to specific in
> either the sbatch or the submission script?

Not that I'm aware of.

A submit filter would let you set that up though.

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Prolog behavior with and without srun

2017-01-09 Thread Christopher Samuel

On 10/01/17 10:57, Christopher Samuel wrote:

> If you are unlucky enough to have SSH based job launchers then you would
> also look at the BYU contributed pam_slurm_adopt

Actually this is useful even without that as it allows users to SSH into
a node they have a job on and not disturb the cores allocated to other
jobs on the node, just their own.

You could argue that this is more elegant though, to add an interactive
shell job step to a running job:

[samuel@barcoo ~]$ srun --jobid=6522365  --pty -u ${SHELL} -i -l
[samuel@barcoo010 ~]$
[samuel@barcoo010 ~]$ cat /proc/$$/cgroup
4:cpuacct:/slurm/uid_500/job_6522365/step_1/task_0
3:memory:/slurm/uid_500/job_6522365/step_1
2:cpuset:/slurm/uid_500/job_6522365/step_1
1:freezer:/slurm/uid_500/job_6522365/step_1


-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Prolog behavior with and without srun

2017-01-09 Thread Christopher Samuel

On 06/01/17 15:08, Vicker, Darby (JSC-EG311) wrote:

> Among other things we want the prolog and epilog scripts to clean up any 
> stray processes.

I would argue that a much better way to do that is to use Slurm's
cgroups support, that will contain a jobs processes into the cgroup
allowing it to kill off only those processes (and not miss any) when the
job ends.

If you are unlucky enough to have SSH based job launchers then you would
also look at the BYU contributed pam_slurm_adopt which will put those
tasks into the cgroup of that users job on the node they are trying to
SSH into.  You do need PrologFlags=contain for that to ensure that all
jobs get an "extern" batch step on job creation for these processes to
be adopted into.

We use both here with great success.

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Question about -m cyclic and --exclusive options to slurm

2017-01-03 Thread Christopher Samuel

On 04/01/17 10:29, Koziol, Lucas wrote:

> I want to have 1 batch script, where I reserve a certain large number
> of CPUs, and then run multiple 1-CPU tasks from within this single
> script. The reason being that I do several cycles of these tasks, and
> I need to process the outputs between tasks.

OK, I'm not sure how Slurm will behave with multiple srun's and cons_res
and CR_LLN but it's still worth a shot.

Best of luck!
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Question about -m cyclic and --exclusive options to slurm

2017-01-03 Thread Christopher Samuel

On 04/01/17 10:19, Koziol, Lucas wrote:

> Will Slurm read a local slurm.conf file or in my home directory?

No, I'm afraid not, it's a global configuration thing.

> The default slurm.conf file I can't modify. I can ask the admins here
> to modify it I I have to though.

I strongly believe that will be necessary, sorry!

Best of luck,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Question about -m cyclic and --exclusive options to slurm

2017-01-03 Thread Christopher Samuel

On 04/01/17 04:20, Koziol, Lucas wrote:

> The hope was that all 16 tasks would run on Node 1, and 16 tasks would
> run on Node 2. Unfortunately what happens is that all 32 jobs get
> assigned to Node 1. I thought -m cyclic was supposed to avoid this.

You're only running a single task at a time, so it's a bit hard for srun
to distribute 1 task over multiple nodes. :-)

The confusion is, I suspect, that job steps (an srun instance) are not
the same as tasks (individual processes launched in a job step).

The behaviour in the manual page is for things like MPI jobs where you
want to distribute the many ranks (tasks) over nodes/sockets/cores in a
particular way - in this instance a single srun might be launching 10's
through to 100,000's of tasks (or more) at once.

What might work better for you is to use a job array for your work
instead of a single Slurm job and then have this in your slurm.conf:

SelectType=select/cons_res
SelectTypeParameters=CR_LLN

This should get Slurm to distribute the job array elements across nodes
picking the least loaded (allocated) node in each case.

Job arrays are documented here:

https://slurm.schedmd.com/job_array.html
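As a concrete example (a sketch - the script name, input naming and the
1-32 range are just placeholders), each of your 1-CPU tasks becomes one
array element:

  #!/bin/bash
  #SBATCH --ntasks=1
  #SBATCH --array=1-32
  ./my_task --input input.${SLURM_ARRAY_TASK_ID}

Submit it with a plain "sbatch myscript.sh" and Slurm will then schedule
the 32 elements independently, picking the least loaded node for each.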

Hope this helps!

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Christopher Samuel

On 16/12/16 10:33, Kilian Cavalotti wrote:

> I remember Danny recommending to use jobacct_gather/linux over
> jobacct_gather/cgroup, because "cgroup adds quite a bit of overhead
> with very little benefit".
> 
> Did that change?

We took that advice but reverted because of this issue (from memory).

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Christopher Samuel

On 16/12/16 02:15, Stefan Doerr wrote:

> If I check on "top" indeed it shows all processes using the same amount
> of memory. Hence if I spawn 10 processes and you sum usages it would
> look like 10x the memory usage.

Do you have:

JobAcctGatherType=jobacct_gather/linux

or:

JobAcctGatherType=jobacct_gather/cgroup

If the former, try the latter and see if it helps get better numbers (we
went to the former after suggestions from SchedMD but from highly
unreliable memory had to revert due to similar issues to those you are
seeing).

Best of luck,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: job arrays, fifo queueing not wanted

2016-12-14 Thread Christopher Samuel

On 14/12/16 04:57, Michael Miller wrote:

> thank you for your answer. I do not need round-robin - I need some
> mechanism that allows both/multiple job arrays to share the resources.

We set CPU limits per association per cluster:

${SACCTMGR} -i modify account set grpcpus=256   where cluster=snowy

So no project can use more than 256 cores at once.

You can also do that for nodes, GrpCpuMins (product of cores and
runtime), etc.
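For example, in the same style (the limit values here are just placeholders):

${SACCTMGR} -i modify account set grpnodes=16        where cluster=snowy
${SACCTMGR} -i modify account set grpcpumins=1000000 where cluster=snowy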

This only makes sense if your cluster is going to be overcommitted most
of the time though, otherwise you may have jobs pending due to limits
with idle resources.

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] RE: Wrong behaviour of "--tasks-per-node" flag

2016-11-20 Thread Christopher Samuel

On 19/11/16 03:38, Manuel Rodríguez Pascual wrote:

> sbatch --ntasks=16  --tasks-per-node=2  --wrap 'mpiexec ./helloWorldMPI'

If your MPI stack properly supports Slurm shouldn't that be:

sbatch --ntasks=16  --tasks-per-node=2  --wrap 'srun ./helloWorldMPI'

?

Otherwise you're at the mercy of what your mpiexec chooses to do.

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] RE: Wrong behaviour of "--tasks-per-node" flag

2016-11-17 Thread Christopher Samuel

On 28/10/16 18:20, Manuel Rodríguez Pascual wrote:

> -bash-4.2$ sbatch --ntasks=16  --tasks-per-node=2  test.sh 

Could you post the content of your batch script too please?

We're not seeing this on 16.05.5, but I can't be sure I'm correctly
replicating what you are seeing.

cheers,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Gres issue

2016-11-16 Thread Christopher Samuel

On 17/11/16 11:31, Christopher Samuel wrote:

> It depends on the library used to pass options,

Oops - that should be parse, not pass.

Need more caffeine..

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Gres issue

2016-11-16 Thread Christopher Samuel

On 17/11/16 00:04, Michael Di Domenico wrote:

> this might be nothing, but i usually call --gres with an equals
> 
> srun --gres=gpu:k10:8
> 
> i'm not sure if the equals is optional or not

It depends on the library used to pass options, I'm used to it being
mandatory but apparently with Slurm it's not - just tested it out and using:

--gres mic

results in my job being scheduled on a Phi node with OFFLOAD_DEVICES=0
set in its environment.

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Using slurm to control container images?

2016-11-15 Thread Christopher Samuel

On 16/11/16 12:23, Lachlan Musicman wrote:

> Has anyone tried Shifter out and has there been any movement on this? I
> presume the licensing issues remain.

We've got both Shifter and Singularity set up at VLSCI for users.

https://www.vlsci.org.au/documentation/running_jobs/shifter/

https://www.vlsci.org.au/documentation/running_jobs/singularity/

The important thing to recognise for both is that they are *NOT* Docker,
but they are both able to use Docker containers.

Shifter imports directly from the Docker registry (or from other
registries that you configure) and lives entirely on your HPC system,
Singularity needs to be installed on the user's system and configured,
but they can do the conversion there and there is no central public repo
of converted containers (a plus if you need a private container, a minus
if you're going to end up with hundreds of copies).

Having private containers is on the roadmap for Shifter.

Shifter also integrates with Slurm.

All the best!
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: How to account how many cpus/gpus per node has been allocated to a specific job?

2016-11-13 Thread Christopher Samuel

On 09/11/16 13:07, Ran Du wrote:

>However, the scheduler must have information about separate
> allocated number on each node, or they cannot track how many resources
> left on each node. The question is, if SLURM keep these separate numbers
> in files(e.g. log files or database), or just keep them in memory. I am
> going to read other docs info, to see if there is any lead.

I strongly suspect it's only held in memory by slurmctld whilst running
(and in its state files of course).

Unfortunately it doesn't even appear in "scontrol show job --detail"
from what I can see. :-(

Here's the lines from a test job of mine:

   TRES=cpu=2,mem=8G,node=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
 Nodes=barcoo[069-070] CPU_IDs=0 Mem=4096
   MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
   Features=(null) Gres=mic:1 Reservation=(null)

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: How to account how many cpus/gpus per node has been allocated to a specific job?

2016-11-08 Thread Christopher Samuel

On 09/11/16 12:15, Ran Du wrote:

>Thanks a lot for your reply. However, it's not what I want to
> get. For the example of Job 6449483, it is allocated with only one node,
> what if it was allocated with multiple nodes? I'd like to get the
> accounting statistics about how many CPUs/GPUs separately on each node,
> but not the sum number on all nodes.

Oh sorry, that's my fault, I completely misread what you were after
and managed to invert your request!

I don't know if that information is included in the accounting data.

I believe the allocation is uniform across the nodes, for instance:

$ sbatch --gres=mic:1 --mem=4g --nodes=2 --wrap /bin/true

resulted in:

$ sacct -j 6449484 -o jobid%20,jobname,alloctres%20,allocnodes,allocgres
               JobID    JobName            AllocTRES AllocNodes  AllocGRES
-------------------- ---------- -------------------- ---------- ----------
             6449484       wrap  cpu=2,mem=8G,node=2          2      mic:2
       6449484.batch      batch  cpu=1,mem=4G,node=1          1      mic:2
      6449484.extern     extern  cpu=2,mem=8G,node=2          2      mic:2

The only oddity there is that the batch step is of course
only on the first node, but it says it was allocated 2 GRES.
I suspect that's just a symptom of Slurm only keeping a total
number.

I don't think Slurm can give you an uneven GRES allocation, but
the SchedMD folks would need to confirm that I'm afraid.

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Re:

2016-11-08 Thread Christopher Samuel

On 09/11/16 09:50, Lachlan Musicman wrote:

> I don't know Chris, I think that /dev/null would rate tbh. :)

Ah, but that's a file (OK character special device), not a directory. ;-)

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Re:

2016-11-08 Thread Christopher Samuel

Hi there,

On 08/11/16 21:21, Alexandre Strube wrote:

> For example, on slurm 10.05.6, the example config file says:
> 
> StateSaveLocation=/tmp
> 
> Which is not the best place to write sensitive information, but it will
> for sure be there and will be writable by the slurm user.

Frankly using /tmp seems like a *really* bad idea to me.

The reason is that (depending on system configuration) it may be cleaned up
either on a reboot (either by scripts or by using non-persistent tmpfs) or
periodically by tmpwatch or similar scripts.

So if you've got jobs queued for any period of time that information
will be lost.

We build from source and use:

StateSaveLocation   = /var/spool/slurm/jobs

but the decision is yours where exactly to put it.

But /tmp is almost certainly the second worst place (after /dev/shm).

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: sinfo man page

2016-11-07 Thread Christopher Samuel

On 08/11/16 11:42, Lachlan Musicman wrote:
> 
> %C    Number of CPUs by state in the format
> "allocated/idle/other/total". Do not use  this  with  a  node  state
> option ("%t" or "%T") or the different node states will be placed on
> separate lines.
> 
> I presume I am doing something wrong?

I think it's a bug (possibly just in the manual page) as the -N option
says there:

   -N, --Node
  Print  information  in a node-oriented format with one
  line per node.  The default is to print informa-
  tion in a partition-oriented format.  This is ignored if
  the --format option is specified.

Except it's not being ignored when you use --format (-o).

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Passing binding information

2016-11-02 Thread Christopher Samuel

On 02/11/16 02:01, Riebs, Andy wrote:

> Interesting -- thanks for the info Chris.

No worries, it's a bit sad I think, but I can understand it.

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Passing binding information

2016-10-31 Thread Christopher Samuel

On 01/11/16 05:43, Andy Riebs wrote:

> Does anyone have any recent experience with this code who can answer the
> questions?

Unfortunately it looks like all SchedMD folks have dropped off the
mailing list (apart from posting announcements), presumably due to
workload.  You may want to contact them directly.

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Slurm versions 16.05.6 and 17.02.0-pre3 are now available

2016-10-30 Thread Christopher Samuel

On 29/10/16 00:58, Peixin Qiao wrote:

> Will the slurm version 16.05.6 support ubuntu 16.04?

If you build it from source I suspect any moderately recent version will
work there.

If you are asking about the Ubuntu packaged version, then that's a
question for Canonical, not SchedMD. :-)

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: How to restart a job "(launch failed requeued held)"

2016-10-27 Thread Christopher Samuel

On 28/10/16 08:44, Lachlan Musicman wrote:

> So I checked the system, noticed that one node was drained, resumed it.
> Then I tried both
> 
> scontrol requeue 230591
> scontrol resume 230591

What happens if you "scontrol hold" it first before "scontrol release"'ing it?

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Query number of cores allocated per node for a job

2016-10-25 Thread Christopher Samuel

Hi all,

I can't help but think I'm missing something blindingly obvious, but
does anyone know how to find out how Slurm has distributed a job in
terms of cores per node?

In other words, if I submit:

sbatch --ntasks=64 --wrap "sleep 60"

on a system with (say 16 core nodes where nodes are already running
disparate number of jobs using variable cores, how do I see what cores
on what nodes Slurm has allocated my running job?

I know I can go and poke around with cgroups, but is there a way to get
that out of squeue, sstat or sacct?

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: slurm_load_partitions: Unable to contact slurm controller (connect failure)

2016-10-25 Thread Christopher Samuel

On 25/10/16 10:05, Peixin Qiao wrote:
> 
> I installed slurm-llnl on Debian on one computer. When I ran slurmctld
> and slurmd, I got the error:
> slurm_load_partitions: Unable to contact slurm controller (connect failure).

Check your firewall rules to ensure that those connections aren't
getting blocked, and also check that the hostname correctly resolves.
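A few quick things to try (this assumes the default SlurmctldPort of 6817):

  scontrol ping                  # can the client reach slurmctld at all?
  getent hosts $(hostname -s)    # does the hostname resolve sensibly?
  ss -tlnp | grep 6817           # is slurmctld actually listening?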

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: sreport "duplicate" lines

2016-10-20 Thread Christopher Samuel

On 21/10/16 13:07, Andrew Elwell wrote:

> Yep, and for that particular account, not all of the members are
> showing twice - I can't work out what causes it

Looks like you've somehow created partition specific associations for
some people - not something we do at all.

I suspect that's what's triggering the different display in sreport, a
line per association/partition.

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: sreport "duplicate" lines

2016-10-20 Thread Christopher Samuel

On 21/10/16 12:29, Andrew Elwell wrote:

> When running sreport (both 14.11 and 16.05) I'm seeing "duplicate"
> user info with different timings. Can someone say what's being added
> up separately here - it seems to be summing something differently for
> me and I can't work out what makes it split into two:

Not that it helps, but I don't see the same here for me with 16.05.5.

# sreport cluster AccountUtilizationByUser start=2016-07-01 end=2016-10-01 
account=vlsci user=samuel cluster=avoca -t h

Cluster/Account/User Utilization 2016-07-01T00:00:00 - 2016-09-30T23:59:59 
(7948800 secs)
Use reported in TRES Hours

  Cluster         Account      Login      Proper Name      Used   Energy
--------- --------------- ---------- ---------------- --------- --------
    avoca           vlsci     samuel  Christopher Sa+      15103        0



-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Send notification email

2016-10-06 Thread Christopher Samuel

On 06/10/16 03:07, Fanny Pagés Díaz wrote:

> Oct  5 11:34:52 compute-0-3 postfix/smtp[6469]: connect to 
> 10.8.52.254[10.8.52.254]:25: Connection refused

So you are blocked from connecting to the mail server you are trying to
talk to on port 25/tcp (SMTP) - you need to get that opened up.
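You can confirm it from the compute node with something like:

  nc -vz 10.8.52.254 25

(or "telnet 10.8.52.254 25") - until that connects, nothing you change on
the Slurm side will help.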

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Best way to control synchronized clocks in cluster?

2016-10-06 Thread Christopher Samuel

On 07/10/16 01:17, Per Lönnborg wrote:

> But what is the preferred way to check that the compute nodes on our
> have correct time, and if not, see to it that Slurm doesn't allocate
> these nodes to perform tasks?

We run NTP everywhere - we have to because GPFS depends on correct
clocks as well, and if they're out of step then GPFS will stop
working on the node, making Slurm the least of your worries. :-)

So just run ntpd.
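If you really do want Slurm to drain nodes whose clock has drifted, one
option (a sketch only - the script name is made up and it assumes
ntpd/ntpstat are installed) is a HealthCheckProgram in slurm.conf:

  HealthCheckProgram=/usr/local/sbin/check_clock.sh
  HealthCheckInterval=300

with check_clock.sh doing something like:

  #!/bin/bash
  # drain this node if ntpd reports it is not synchronised
  if ! ntpstat > /dev/null 2>&1; then
      scontrol update nodename=$(hostname -s) state=drain reason="clock not in sync"
  fi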

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Send notification email

2016-10-04 Thread Christopher Samuel

On 03/10/16 23:39, Fanny Pagés Díaz wrote:

> I have a slurm running in the same HPC cluster server, but I need send
> all notification using my corporate mail server, which running in
> another server at my internal network. I not need use the local postfix
> installed at slurm server.

The most reliable solution will be to configure Postfix to send emails
via the corporate server.

All our clusters send using our own mail server quite deliberately.

We set:

relayhost (to say where to relay email via)
myorigin (to set the system name to its proper FQDN)
aliasmaps (to add an LDAP lookup to rewrite users email to the value in
LDAP)
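In main.cf terms that looks roughly like this (the hostnames and the LDAP
map file below are placeholders, not our actual values):

  relayhost  = [smtp.example.org]:25
  myorigin   = cluster.example.org
  alias_maps = hash:/etc/aliases, ldap:/etc/postfix/ldap-aliases.cf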

But really this isn't a Slurm issue, it's a host config issue for Postfix.

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-28 Thread Christopher Samuel

On 29/09/16 01:16, John DeSantis wrote:

> We get the same snippet when our logrotate takes action against the
> slurmctld log:

Does your slurmctld restart then too?

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Invalid Protocol Version

2016-09-28 Thread Christopher Samuel

On 28/09/16 16:25, Barbara Krasovec wrote:

> Yes, this worked! Thank you very much for your help!

My pleasure!

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: CGroups

2016-09-27 Thread Christopher Samuel

On 26/09/16 16:51, Lachlan Musicman wrote:

> Does this mean that it's now considered acceptable to run cgroups for
> ProcTrackType?

We've been running with that on all our x86 clusters since we switched
to Slurm, haven't seen an issue yet.

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Invalid Protocol Version

2016-09-27 Thread Christopher Samuel

On 27/09/16 23:54, Barbara Krasovec wrote:

> The version of the client and server is the same. I guess the problem is
> in the slurmctld state file, where the slurm protocol version of some
> worker nodes must be wrong.

I suspect this is bug 3050 - we hit it for frontend nodes on BlueGene/Q
and reported it against that, but I've seen the same symptom that
you've hit on x86 clusters too - as has Ulf from Dresden.

https://bugs.schedmd.com/show_bug.cgi?id=3050

Tim is checking the code for the generic case.

> Creating a clean state file would fix the problem but also kill the
> jobs. Is there another way to fix this? Enforce the correct version in
> node_mgr.c and job_mgr.c? Restarting the services doesn't help at all.

Actually you don't lose jobs - but you do lose the reason for nodes being 
offline..

What we did was:

1) save the state of offline nodes and make a script to restore via scontrol
   (see the sketch after this list)
2) shutdown slurmctld and all slurmds
3) move the node_stat* files out of the way
4) start up slurmd again
5) start up slurmctld
6) run the script created at step 1
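For step 1, one way to generate that restore script is something like this
(a rough sketch - check the state names and format string against your
sinfo version first):

  sinfo -h -N -t drain,drng,down \
    -o 'scontrol update nodename=%N state=drain reason="%E"' > restore_reasons.sh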

Hope that helps!

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-27 Thread Christopher Samuel

On 26/09/16 17:48, Philippe wrote:

> [2016-09-26T08:02:16.582] Terminate signal (SIGINT or SIGTERM) received

So that's some external process sending one of those two signals to
slurmctld, it's not something it's choosing to do at all.  We've never
seen this.

One other question - you've got the shutdown log from slurmctld and the
start log of a slurmd - what happens when slurmctld starts up?

That might be your clue about why yours jobs are getting killed.

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-27 Thread Christopher Samuel

On 26/09/16 17:48, Philippe wrote:

> [2016-09-26T08:01:44.792] debug:  slurmdbd: Issue with call
> DBD_CLUSTER_CPUS(1407): 4294967295(This cluster hasn't been added to
> accounting yet)

Not related - but it looks like whilst it's been told to talk to
slurmdbd you haven't added the cluster to slurmdbd with "sacctmgr" yet
so I suspect all your accounting info is getting lost.
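The fix is a one-liner on the slurmdbd side, where the name must match
ClusterName in slurm.conf:

  sacctmgr add cluster <your ClusterName>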

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-27 Thread Christopher Samuel

On 27/09/16 17:40, Philippe wrote:

>   /usr/sbin/invoke-rc.d --quiet slurm-llnl reconfig >/dev/null

I think you want to check whether that's really restarting it or just
doing an "scontrol reconfigure" which won't (shouldn't) restart it.

-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Bug in node suspend/resume config code with scontrol reconfigure in 16.05.x (bugzilla #3078)

2016-09-25 Thread Christopher Samuel

Hi folks

A heads up for those using or looking to use node suspend/resume aka
power saving aka elastic computing in 16.05.x.

The slurmctld will lose your list of excluded nodes/partitions on an:

scontrol reconfigure

and then will treat all nodes as being eligible for power control,
putting them into a bad state. :-(

This is Slurm bug:

https://bugs.schedmd.com/show_bug.cgi?id=3078

which has been hit separately by two friends of mine at different
places, one of whom I'm helping out with elastic computing/cloudburst.

Hopefully this saves someone else from losing sleep over this!

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: external slurmdbd for multiple clusters

2016-09-25 Thread Christopher Samuel

On 23/09/16 19:12, Miguel Gila wrote:

> We also do this, run a number of Slurm clusters attached to a single
> slurmdbd. One issue with this is setup is that once you decommission
> a cluster, it needs to be removed from the DB somehow, otherwise your
> DB grows beyond reasonable size...

For accounting reasons we can't do that, consequently our slurmdbd is
around 17GB in size.

To give you an idea the largest table is the job step table for our
BlueGene/Q which takes almost 7GB for pushing 19 million job steps.

The next largest is the job step table for one of our x86 clusters which
is around 3GB for 8 million job steps.

Neither cause us any issues these days (we used to have a problem when,
for complicated historical reasons, slurmdbd was running on a 32-bit VM
and could run out of memory).

Admittedly we do have beefy database servers. :-)

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: external slurmdbd for multiple clusters

2016-09-25 Thread Christopher Samuel

On 02/09/16 18:54, Paddy Doyle wrote:

> We currently have the dbd on 16.05.4 and have some 15.x clusters still 
> pointing
> to it fine. I can't recall exactly, but in the past we may have even had 2 
> major
> releases behind pointing to an up-to-date dbd... I'm not sure how far back you
> can go, but I suspect 14.x talking to a 16.x dbd would be fine.

Slurm supports 2 major releases behind.

So a 16.05.x slurmdbd should talk to 15.08.x and 14.11.x
slurmctld's but *not* 14.03.x.

All the best,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Slurm array scheduling question

2016-09-21 Thread Christopher Samuel

On 21/09/16 14:15, Christopher benjamin Coffey wrote:

> I’d like to get some feedback on this please from other sites and the
> developers if possible.  Thank you!

The best I can offer is this from the job array documentation from Slurm:

# When a job array is submitted to Slurm, only one job record is
# created. Additional job records will only be created when the
# state of a task in the job array changes, typically when a task
# is allocated resources or its state is modified using the
# scontrol command.

In 2.6.x I think there was a record for each element, but this was
cleaned up in a later release (can't remember when sorry!).

Hope that helps,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


  1   2   3   4   5   >