users can't override it? Or any way to disable --mem and --mem-per-cpu?
I believe you can restrict the amount of memory jobs can use via TRES
functionality:
https://slurm.schedmd.com/tres.html
it's not something we do here though.
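If you do go the TRES route, the usual mechanism is a QOS-level limit set via sacctmgr. A hypothetical sketch (the QOS name and value are made up, not from this thread):

```
# Cap the total memory a single user's running jobs may consume in the
# "normal" QOS (illustrative names/values only):
sacctmgr modify qos normal set MaxTRESPerUser=mem=256G
```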
Best of luck!
Chris
--
Christopher Samuel
you're tied into using AD (which it sounds like you are) then that's
not really an option for you.
All the best,
Chris
--
Christopher Samuel
Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
sacctmgr list config | fgrep Purge
cheers,
Chris
--
On 14/10/17 00:24, Doug Meyer wrote:
> The job_table.idb and step_table.idb do not clear as part of day-to-day
> slurmdbd.conf
>
> Have slurmdbd.conf set to purge after 8 weeks but this does not appear
> to be working.
Anything in your slurmdbd logs?
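For reference, the purge knobs live in slurmdbd.conf; an eight-week setup might look something like this (a sketch using the standard slurmdbd.conf(5) options, not the poster's actual file; slurmdbd needs a restart to pick up changes):

```
# slurmdbd.conf: prune old accounting records (~8 weeks)
PurgeJobAfter=56days
PurgeStepAfter=56days
PurgeEventAfter=56days
PurgeSuspendAfter=56days
```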
--
On 05/10/17 11:27, Christopher Samuel wrote:
> PMIX v1.2.2: Slurm complains and tells me it wants v2.
I think that was due to a config issue on the system I was helping out
with, after having to install some extra packages (like a C++ compiler)
to get other things working I can no lon
It's Slurm 16.05.8. Do you see the same?
Did you try both having CR_Pack_Nodes *and* specifying this?
-n 17 --ntasks-per-node=4
cheers,
Chris
--
> tried to stop and
> restart it multiple times but still not working. Please see the error below.
Check your slurmctld.log, that should have hints about why it won't start.
cheers!
Chris
--
The documentation for PMIX in Slurm seems pretty much non-existent. :-(
Anyone had any luck with this?
cheers,
Chris
--
packages for Slurm, I'd always
install it centrally (NFS exported to compute nodes) to keep things
simple. That way you decouple your Slurm version from the OS and can
keep it up to date (or keep it on a known working version).
All the best!
Chris
--
the cluster.
We also have in our taskprolog.sh:
echo export BASH_ENV=/etc/profile.d/module.sh
to try and ensure that bash shells have modules set up, just in case. :-)
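As a sketch, the whole task prolog can be that small (the module.sh path is our local convention, not a Slurm requirement); slurmd applies any `export NAME=value` lines the prolog prints to stdout to the task's environment:

```shell
#!/bin/sh
# Minimal TaskProlog sketch: anything echoed as "export NAME=value"
# is injected into the job task's environment by slurmd.
task_prolog() {
    # ensure non-interactive bash shells source the modules setup
    echo "export BASH_ENV=/etc/profile.d/module.sh"
}
task_prolog
```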
--
lose any running jobs.
The on-disk format for spooled jobs may also change between
releases, so you probably want to keep that in mind as well..
--
to sometime far into the future to have
> effectively an infinite period (no reset)?
Basically this is because once a user exceeds something like their
maximum CPU run time limit then they will never be able to run jobs
again unless you either decay or reset usage.
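Those two escape hatches correspond to slurm.conf parameters; a sketch (the values are illustrative, not a recommendation):

```
# slurm.conf: either let historical usage decay away...
PriorityDecayHalfLife=14-0       # half-life of 14 days
# ...or zero it out on a schedule (NONE disables the reset)
PriorityUsageResetPeriod=MONTHLY
```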
--
On 02/10/17 20:51, Sysadmin CAOS wrote:
> I'm execution my MPI program with "mpirun"... Maybe could be this the
> problem? Do I need to execute with "srun"?
I suspect so, try it and see..
--
'll be off-air for quite a while. Good luck!
All the best,
Chris
--
more information from applicants than what it captures by
default, but that's the nice thing, it is modular.
Also includes Shibboleth support.
All the best!
Chris
--
e and assign/change their
target.
All the best,
Chris
--
ensure that they can run jobs, but that's a separate issue to whether
slurmdbd can resolve users in LDAP.
I would hope that Bright would have the ability to do that for you
rather than having you handle it manually, but that's a question for Bright.
Best of luck,
Chris
--
n into this container. Setting the Contain flag
implicitly sets the Alloc flag.
--
ThreadsPerCore configured.
cheers!
Chris
--
A couple of questions:
1) Have you restarted slurmctld and slurmd everywhere?
2) Can you confirm that slurm.conf is the same everywhere?
3) What does slurmd -C report?
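For question 2, one low-tech check is to gather each node's slurm.conf into one directory (e.g. via scp) and compare checksums; a sketch, with made-up helper names:

```shell
# Compare two collected copies of slurm.conf.
configs_match() {
    cmp -s "$1" "$2"        # exit 0 iff the files are byte-identical
}

# How many distinct versions of the file exist among the copies?
count_versions() {
    md5sum "$@" | awk '{print $1}' | sort -u | wc -l
}
```

If count_versions reports anything other than 1 across the collected copies, slurm.conf has drifted somewhere.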
cheers!
Chris
--
about the
actual hardware layout.
What does "lscpu" say?
cheers,
Chris
--
ent is allowed to decode it.
So if the UIDs and GIDs of the user differ across systems then it
appears it will not allow the receiver to validate the message.
cheers,
Chris
--
pute bound the usual advice
is to disable HT in the BIOS, but for I/O bound things you may not be so
badly off.
Hope that helps!
Chris
--
ach HT unit a core to run a job on.
All the best,
Chris
--
We constrain jobs via cgroups and have found that using the cgroup
plugin for this results in jobs not getting killed incorrectly.
Using cgroups in Slurm is a definite win for us, so I would suggest
looking into it if you've not already done so.
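For anyone following along, cgroup-based enforcement is switched on with settings along these lines (a sketch of the standard options, not our exact config):

```
# slurm.conf
ProcTrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes
```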
All the best,
Chris
--
sacct -a --format JobID%20,State%20,timelimit,Elapsed,ExitCode -j 1695151
cheers,
Chris
--
hen put serial jobs at
the end of the available nodes rather than using a best fit
algorithm. This may reduce resource fragmentation for some workloads.
cheers,
Chris
--
s expiry parameters), and removing them will likely break its
statistics and probably do Bad Things(tm).
Here be dragons..
--
you are running in
for your SSH session and not the job!
cheers,
Chris
--
On 14/08/17 08:55, Lachlan Musicman wrote:
> Was it here I read that proctrack/linuxproc was better than
> proctrack/cgroup?
I think you're thinking of JobAcctGatherType, but even then our
experience there was that jobacct_gather/cgroup was more accurate.
--
own for that.
cheers,
Chris
--
On 07/08/17 14:08, Lachlan Musicman wrote:
> In slurm.conf, there is a RebootProgram - does this need to be a direct
> link to a bin or can it be a command?
We have:
RebootProgram = /sbin/reboot
Works for us.
cheers,
Chris
--
limits.html
Best of luck!
Chris
--
R} format=MaxJobsPerUser
For a more general view you would do:
sacctmgr list user ${USER} withassoc
Hope this helps,
Chris
--
On 06/06/17 23:46, Edward Walter wrote:
> Doesn't that functionality come from a spank plugin?
> https://github.com/hautreux/slurm-spank-x11
Yes, that's the one we use. Works nicely.
Provides the --x11 option for srun.
All the best,
Chris
--
e time.
All the best,
Chris
--
sinfo --format="%60N %.15G %.30E %.10A"
The reason can be quite long, but there doesn't seem to be a way to just
show the status as down/drain/idle/etc.
cheers,
Chris
--
MPI launchers (and other naughtiness).
Good luck!
Chris
--
by
> a job have finished at completion?
Are you not using cgroups for enforcement?
Usually that picks everything up.
cheers,
Chris
--
e understand what might be wrong?
Anything setting a drain state is meant to also set a reason; what does
"scontrol show node $NODE" say for these?
Also are there any relevant messages in your slurmctld and slurmd logs?
Best of luck,
Chris
--
a node and
then Slurm isn't going to put more jobs there (unless you tell it to
ignore memory, which is not likely to end well).
All the best,
Chris
--
age-Cluster
All the best,
Chris
--
so we fell back to using our own LDAP server with Karaage to manage
project/account applications, adding people to slurmdbd, etc.
All the best,
Chris
--
ble again.
+1 for running your own LDAP.
I would seriously look at a cluster toolkit for running nodes,
especially if it supports making a single image that your compute nodes
then netboot. That way you know everything is consistent.
Best of luck,
Chris
--
0 NodeAddr=thing-knc[01-03]
RealMemory=126000 CoresPerSocket=10 Sockets=2 ThreadsPerCore=2 Gres=mic:5110p:2
You'll also need to restart slurmctld & all slurmd's to pick up
this new config, I don't think "scontrol reconfigure" will deal
with this.
Best of luck,
Chris
--
A 'grep' of the source code after reading 'man sacct' and not
finding anything (also running 'sacct -e' and not seeing anything useful
there either) doesn't offer much hope.
Anyone else dealing with this?
We're on 16.05.x at the moment with slurmdbd.
All the best
save/job.830332/environment, No such
> file or directory
I would suggest that you are looking at transient NFS failures (which
may not be logged).
Are you using NFSv3 or v4 to talk to the NFS server and what are the
OS's you are using for both?
cheers,
Chris
--
All the best,
Chris
--
't really
blame Slurm for not catering to this. It can use cgroups to partition
cores to jobs precisely so it doesn't need to care what the load average
is - it knows the kernel is ensuring the cores the jobs want are not
being stomped on by other tasks.
Best of luck!
Chris
--
steps with srun you can
also monitor them as the job is going with 'sstat' (rather than just
post-mortem with sacct).
All the best,
Chris
--
be useful to us here too.
All the best,
Chris
--
systems having no more than 2500 nodes
or the cube root for larger systems. The value may not exceed
65533.
If so then I suspect that this is a possible transient DNS failure?
All the best,
Chris
--
Torque+Moab/Maui here and at VPAC
before that - we would always start Moab paused so we could check out
what impact any changes had to our queues & priorities before starting
jobs running.
Measure twice, cut once.
cheers!
Chris
--
ck up again.
All the best,
Chris
--
Christopher Samuel
Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
cheers,
Chris
--
t of
architectures) individually.
Best of luck!
Chris
--
NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
[...]
Best of luck,
Chris
--
ay. :-(
Might be a feature request..
cheers,
Chris
--
's been because the slurmdbd cannot connect
back to slurmctld to send RPCs on the IP address that slurmctld has
registered with slurmdbd.
What does this say?
sacctmgr list clusters
cheers,
Chris
--
area is a high-performance
parallel filesystem shared across all nodes).
https://github.com/vlsci/spank-private-tmp
All the best,
Chris
--
me to their registered email
address that's stored in LDAP.
cheers,
Chris
--
On 10/01/17 18:56, Ole Holm Nielsen wrote:
> For the record: Torque will always send mail if a job is aborted
It's been a few years since I've used Torque so I don't remember that
behaviour.
Thanks for the info!
--
All the best,
Chris
--
On 10/01/17 10:57, Christopher Samuel wrote:
> If you are unlucky enough to have SSH based job launchers then you would
> also look at the BYU contributed pam_slurm_adopt
Actually this is useful even without that as it allows users to SSH into
a node they have a job on and not disturb the
into. You do need PrologFlags=contain for that to ensure that all
jobs get an "extern" batch step on job creation for these processes to
be adopted into.
We use both here with great success.
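A sketch of the two pieces involved (the PAM file path varies by distro, and "required" vs "sufficient" is a local policy choice):

```
# slurm.conf: give every job an "extern" step to adopt SSH sessions into
PrologFlags=contain

# /etc/pam.d/sshd (or your distro's equivalent): deny SSH to nodes where
# the user has no job, and adopt their session into their job's cgroup
account    required    pam_slurm_adopt.so
```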
All the best,
Chris
--
rm?
All the best,
Chris
--
puts between tasks.
OK, I'm not sure how Slurm will behave with multiple srun's and cons_res
and CR_LLN but it's still worth a shot.
Best of luck!
Chris
--
I strongly believe that will be necessary, sorry!
Best of luck,
Chris
--
ml
Hope this helps!
All the best,
Chris
--
cause of this issue (from memory).
--
JobAcctGatherType=jobacct_gather/cgroup
If the former, try the latter and see if it helps get better numbers (we
went to the former after suggestions from SchedMD but from highly
unreliable memory had to revert due to similar issues to those you are
seeing).
Best of luck,
Chris
--
ources.
All the best,
Chris
--
Otherwise you're at the mercy of what your mpiexec chooses to do.
--
ng.
cheers,
Chris
--
On 17/11/16 11:31, Christopher Samuel wrote:
> It depends on the library used to pass options,
Oops - that should be parse, not pass.
Need more caffeine..
--
but apparently with Slurm it's not - just tested it out and using:
--gres mic
results in my job being scheduled on a Phi node with OFFLOAD_DEVICES=0
set in its environment.
All the best,
Chris
--
Having private containers is on the roadmap for Shifter.
Shifter also integrates with Slurm.
All the best!
Chris
--
ic:1 Reservation=(null)
All the best,
Chris
--
is that the batch step is of course
only on the first node, but it says it was allocated 2 GRES.
I suspect that's just a symptom of Slurm only keeping a total
number.
I don't think Slurm can give you an uneven GRES allocation, but
the SchedMD folks would need to confirm that, I'm afraid.
On 09/11/16 09:50, Lachlan Musicman wrote:
> I don't know Chris, I think that /dev/null would rate tbh. :)
Ah, but that's a file (OK character special device), not a directory. ;-)
--
cpu=6,mem=4G,node=1mic:1
6449483.extern extern cpu=6,mem=4G,node=1mic:1
All the best,
Chris
--
any period of time that information
will be lost.
We build from source and use:
StateSaveLocation = /var/spool/slurm/jobs
but the decision is yours where exactly to put it.
But /tmp is almost certainly the second worst place (after /dev/shm).
All the best,
Chris
--
information in a partition-oriented format. This is ignored if
the --format option is specified.
Except it's not being ignored when you use --format (-o).
All the best,
Chris
--
On 02/11/16 02:01, Riebs, Andy wrote:
> Interesting -- thanks for the info Chris.
No worries, it's a bit sad I think, but I can understand it.
--
contact them directly.
All the best,
Chris
--
All the best,
Chris
--
On 28/10/16 08:44, Lachlan Musicman wrote:
> So I checked the system, noticed that one node was drained, resumed it.
> Then I tried both
>
> scontrol requeue 230591
> scontrol resume 230591
What happens if you "scontrol hold" it first before "scontrol release"?
> I guess you will have to query the
> cgroup hierarchy.
No need, I'm just trying to automate the detection of bad jobs which are
spanning nodes but not using the cores on the other nodes and I wanted a
way to quantify how many cores were being wasted by the job.
Thanks again!
Chris
--
re already running
disparate number of jobs using variable cores, how do I see what cores
on what nodes Slurm has allocated my running job?
I know I can go and poke around with cgroups, but is there a way to get
that out of squeue, sstat or sacct?
All the best,
Chris
--
Check that connections aren't
getting blocked, and also check that the hostname correctly resolves.
--
I suspect that's what's triggering the different display in sreport, a
line per association/partition.
--
Login Proper Name Used Energy
- --- - ---
avoca vlscisamuel Christopher Sa+ 151030
--
ned up.
--
s as well and if they're out of step, well, then GPFS will stop
working on the node, making Slurm the least of your worries. :-)
So just run ntpd.
All the best,
Chris
--
LDAP lookup to rewrite users' email to the value in
LDAP)
But really this isn't a Slurm issue, it's a host config issue for Postfix.
All the best,
Chris
--
On 29/09/16 01:16, John DeSantis wrote:
> We get the same snippet when our logrotate takes action against the
> cltdlog:
Does your slurmctld restart then too?
--
On 28/09/16 16:25, Barbara Krasovec wrote:
> Yes, this worked! Thank you very much for your help!
My pleasure!
--
On 26/09/16 16:51, Lachlan Musicman wrote:
> Does this mean that it's now considered acceptable to run cgroups for
> ProcTrackType?
We've been running with that on all our x86 clusters since we switched
to Slurm, haven't seen an issue yet.
All the best,
Chris
--
of offline nodes and make a script to restore via scontrol
2) shutdown slurmctld and all slurmds
3) move the node_stat* files out of the way
4) start up slurmd again
5) start up slurmctld
6) run the script created at step 1
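Step 1 can be scripted by turning `sinfo -R`-style output into scontrol commands; a sketch (the pipe-separated capture format and the node names in the test data are assumptions for illustration):

```shell
# Read lines of "NODELIST|REASON" (e.g. captured beforehand with
# `sinfo -R -h -o "%N|%E"`) and emit the commands to resume those nodes.
make_restore_script() {
    awk -F'|' '{printf "scontrol update NodeName=%s State=RESUME\n", $1}'
}
```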
Hope that helps!
All the best,
Chris
--
question - you've got the shutdown log from slurmctld and the
start log of a slurmd - what happens when slurmctld starts up?
That might be your clue about why your jobs are getting killed.
--