[slurm-dev] Re: Job Accounting for sstat

2016-08-29 Thread Christopher Samuel
On 30/08/16 12:39, Lachlan Musicman wrote: > Oh! Thanks. > > I presume that includes sruns that are in an sbatch file. Yup, that's right. cheers! Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative E

[slurm-dev] Re: SLURM daemon doesn't start

2016-08-30 Thread Christopher Samuel
) and slurm-llnl is now a transitional package to move existing installations to the new name. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://

[slurm-dev] Re: new CRIU plugin

2016-08-30 Thread Christopher Samuel
int/resume for it in the same way it does for BLCR at the moment? All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ h

[slurm-dev] Re: SLURM v14 munge issues auth to SLURM v16 DBD

2016-09-11 Thread Christopher Samuel
hat said, to my untutored eye that looks more like a munge problem than anything else - you will want to check that your keys are the same and that your clocks are in sync (NTP is your friend). Best of luck, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Scien

[slurm-dev] Re: external slurmdbd for multiple clusters

2016-09-25 Thread Christopher Samuel
usters which is around 3GB for 8 million job steps. Neither causes us any issues these days (we used to have a problem when, for complicated historical reasons, slurmdbd was running on a 32-bit VM and could run out of memory). Admittedly we do have beefy database servers. :-) All the best, Chris --

[slurm-dev] Bug in node suspend/resume config code with scontrol reconfigure in 16.05.x (bugzilla #3078)

2016-09-25 Thread Christopher Samuel
the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: external slurmdbd for multiple clusters

2016-09-25 Thread Christopher Samuel
back you > can go, but I suspect 14.x talking to a 16.x dbd would be fine. Slurm supports 2 major releases behind. So a 16.05.x slurmdbd should talk to 15.08.x and 14.11.x slurmctld's but *not* 14.03.x. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSC
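The "2 major releases behind" rule above can be sketched as a small check. This is only an illustration: the release list is limited to the versions mentioned in these threads, the function names are hypothetical, and treating a slurmctld newer than its slurmdbd as unsupported is an assumption rather than something stated in the message.

```python
# Ordered major releases mentioned in these threads (an assumption: only
# these are listed; extend the list for other versions).
RELEASES = ["14.03", "14.11", "15.08", "16.05", "17.02"]


def major(version: str) -> str:
    """Return the major release part, e.g. '16.05.4' -> '16.05'."""
    return ".".join(version.split(".")[:2])


def dbd_accepts(dbd_version: str, ctld_version: str, max_behind: int = 2) -> bool:
    """True if slurmctld is at most `max_behind` major releases behind slurmdbd.

    A slurmctld newer than slurmdbd is treated as unsupported here
    (assumption for this sketch).
    """
    gap = RELEASES.index(major(dbd_version)) - RELEASES.index(major(ctld_version))
    return 0 <= gap <= max_behind


if __name__ == "__main__":
    for ctld in ("15.08.12", "14.11.0", "14.03.1"):
        print(ctld, dbd_accepts("16.05.4", ctld))
```

With the list above, a 16.05.x slurmdbd accepts 15.08.x and 14.11.x but rejects 14.03.x, matching the example in the message.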

[slurm-dev] Re: Send notification email

2016-10-04 Thread Christopher Samuel
LDAP lookup to rewrite users email to the value in LDAP) But really this isn't a Slurm issue, it's a host config issue for Postfix. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au P

[slurm-dev] Re: Send notification email

2016-10-06 Thread Christopher Samuel
ned up. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Best way to control synchronized clocks in cluster?

2016-10-06 Thread Christopher Samuel
s as well and if they're out of step well then GPFS will stop working on the node making Slurm the least of your worries. :-) So just run ntpd. All the best, Chris -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unime

[slurm-dev] Re: Slurm 15.08.12 - Issue after upgrading to 15.08 - only one job per node is running

2016-09-19 Thread Christopher Samuel
RAM/core ratio on the low memory nodes on one system, it's 1/8th of the low memory nodes on another system so making it lower doesn't buy us much and 2 GB/core means most NAMD jobs will run without issues. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victoria

[slurm-dev] Re: fwd_tree_thread ... failed to forward the message

2016-09-19 Thread Christopher Samuel
ere convinced it was related to that, but this looks like the actual problem. Now why slurmctld doesn't upgrade that information on an upgrade is another matter altogether. Thanks! Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiativ

[slurm-dev] Re: Slurm 15.08.12 - Issue after upgrading to 15.08 - only one job per node is running

2016-09-18 Thread Christopher Samuel
On 18/09/16 03:45, John DeSantis wrote: > Try adding a "DefMemPerCPU" statement in your partition definitions, e.g You can also set that globally. # Global default for jobs - request 2GB per core wanted. DefMemPerCPU=2048 All the best, Chris -- Christopher SamuelS
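To see why a per-core memory default matters for packing jobs onto a node, here is a rough arithmetic sketch. It assumes memory is tracked as a consumable resource and that a job's default memory is simply cores times DefMemPerCPU; the function names and the node sizes in the example are hypothetical.

```python
def default_job_memory_mb(cpus: int, def_mem_per_cpu_mb: int = 2048) -> int:
    """Memory a job would get by default: cores requested times DefMemPerCPU."""
    return cpus * def_mem_per_cpu_mb


def jobs_per_node(node_cpus: int, node_mem_mb: int, cpus_per_job: int,
                  def_mem_per_cpu_mb: int = 2048) -> int:
    """How many identical jobs fit, limited by whichever of cores or memory runs out first."""
    job_mem = default_job_memory_mb(cpus_per_job, def_mem_per_cpu_mb)
    return min(node_cpus // cpus_per_job, node_mem_mb // job_mem)


if __name__ == "__main__":
    # Hypothetical 16-core node with 32 GB of RAM, single-core jobs.
    print(jobs_per_node(node_cpus=16, node_mem_mb=32768, cpus_per_job=1))
```

If the effective default were the whole node's memory instead of 2048 MB per core, the memory term would drop to one job per node regardless of how many cores are free.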

[slurm-dev] Re: Slurm 15.08.12 - Issue after upgrading to 15.08 - only one job per node is running

2016-09-19 Thread Christopher Samuel
lurm.conf, this one escaped me! There's always new ones there, I swear they're breeding.. cheers, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/

[slurm-dev] Re: fwd_tree_thread ... failed to forward the message

2016-09-20 Thread Christopher Samuel
On 19/09/16 22:58, Christopher Samuel wrote: > Thanks so much Ulf, you've just answered a puzzle I've been seeing on an > x86 cluster I'm helping out with! ...and stopping slurmctld & slurmd's (slurmdbd was left going), moving /var/spool/slurm/jobs/node_state* out of the way an

[slurm-dev] Re: X11 plugin problems

2016-09-20 Thread Christopher Samuel
SSH host based authentication within the cluster helps with that, along with caching SSH keys in /etc/ssh/ssh_known_hosts. Best of luck, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (

[slurm-dev] Re: Slurm array scheduling question

2016-09-21 Thread Christopher Samuel
ned up in a later release (can't remember when sorry!). Hope that helps, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: strange going-ons with OpenMPI and Infiniband

2016-08-26 Thread Christopher Samuel
age is being constructed inside - we've used this for stock RHEL 6.6, 6.7 and 6.8 with Open-MPI 1.6, 1.8 and 1.10 without issues. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone:

[slurm-dev] Re: Backfill scheduler should look at all jobs

2016-08-23 Thread Christopher Samuel
the new jobs but crawls down first. (These interrupts are > controlled by bf_yield_interval and bf_yield_sleep.) OK, then I'm missing something because I thought that was what you wanted. Sorry for the noise! -- Christopher SamuelSenior Systems Administrator VLSCI - Victoria

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-28 Thread Christopher Samuel
On 29/09/16 01:16, John DeSantis wrote: > We get the same snippet when our logrotate takes action against the > cltdlog: Does your slurmctld restart then too? -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Emai

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-27 Thread Christopher Samuel
got the shutdown log from slurmctld and the start log of a slurmd - what happens when slurmctld starts up? That might be your clue about why your jobs are getting killed. -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email:

[slurm-dev] Re: Invalid Protocol Version

2016-09-28 Thread Christopher Samuel
On 28/09/16 16:25, Barbara Krasovec wrote: > Yes, this worked! Thank you very much for your help! My pleasure! -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 h

[slurm-dev] Re: CGroups

2016-09-27 Thread Christopher Samuel
On 26/09/16 16:51, Lachlan Musicman wrote: > Does this mean that it's now considered acceptable to run cgroups for > ProcTrackType? We've been running with that on all our x86 clusters since we switched to Slurm, haven't seen an issue yet. All the best, Chris -- Christopher

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-27 Thread Christopher Samuel
e cluster to slurmdbd with "sacctmgr" yet so I suspect all your accounting info is getting lost. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsc

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-27 Thread Christopher Samuel
On 27/09/16 17:40, Philippe wrote: > /usr/sbin/invoke-rc.d --quiet slurm-llnl reconfig >/dev/null I think you want to check whether that's really restarting it or just doing an "scontrol reconfigure" which won't (shouldn't) restart it. -- Christopher Samuel

[slurm-dev] Re: Invalid Protocol Version

2016-09-27 Thread Christopher Samuel
and make a script to restore via scontrol 2) shutdown slurmctld and all slurmds 3) move the node_stat* files out of the way 4) start up slurmd again 5) start up slurmctld 6) run the script created at step 1 Hope that helps! All the best, Chris -- Christopher SamuelSenior Systems Administrator

[slurm-dev] Query number of cores allocated per node for a job

2016-10-25 Thread Christopher Samuel
running disparate number of jobs using variable cores, how do I see what cores on what nodes Slurm has allocated my running job? I know I can go and poke around with cgroups, but is there a way to get that out of squeue, sstat or sacct? All the best, Chris -- Christopher SamuelSenior

[slurm-dev] Re: How to restart a job "(launch failed requeued held)"

2016-10-27 Thread Christopher Samuel
On 28/10/16 08:44, Lachlan Musicman wrote: > So I checked the system, noticed that one node was drained, resumed it. > Then I tried both > > scontrol requeue 230591 > scontrol resume 230591 What happens if you "scontrol hold" it first before "scontrol release"

[slurm-dev] Re: How to account how many cpus/gpus per node has been allocated to a specific job?

2016-11-08 Thread Christopher Samuel
ep is of course only on the first node, but it says it was allocated 2 GRES. I suspect that's just a symptom of Slurm only keeping a total number. I don't think Slurm can give you an uneven GRES allocation, but the SchedMD folks would need to confirm that I'm afraid. All the best, Chris -- Chr

[slurm-dev] Re: Re:

2016-11-08 Thread Christopher Samuel
any period of time that information will be lost. We build from source and use: StateSaveLocation = /var/spool/slurm/jobs but the decision is yours where exactly to put it. But /tmp is almost certainly the second worst place (after /dev/shm). All the best, Chris -- Christopher Samuel

[slurm-dev] Re: Re:

2016-11-08 Thread Christopher Samuel
On 09/11/16 09:50, Lachlan Musicman wrote: > I don't know Chris, I think that /dev/null would rate tbh. :) Ah, but that's a file (OK character special device), not a directory. ;-) -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computat

[slurm-dev] Re: sinfo man page

2016-11-07 Thread Christopher Samuel
on in a partition-oriented format. This is ignored if the --format option is specified. Except it's not being ignored when you use --format (-o). All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiativ

[slurm-dev] Re: slurm_load_partitions: Unable to contact slurm controller (connect failure)

2016-10-25 Thread Christopher Samuel
tions aren't getting blocked, and also check that the hostname correctly resolves. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: sreport "duplicate" lines

2016-10-20 Thread Christopher Samuel
what's triggering the different display in sreport, a line per association/partition. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitt

[slurm-dev] Re: sreport "duplicate" lines

2016-10-20 Thread Christopher Samuel
Name Used Energy - --- - --- avoca vlscisamuel Christopher Sa+15103 0 -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 h

[slurm-dev] Re: Gres issue

2016-11-16 Thread Christopher Samuel
On 17/11/16 11:31, Christopher Samuel wrote: > It depends on the library used to pass options, Oops - that should be parse, not pass. Need more caffeine.. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email:

[slurm-dev] Re: Gres issue

2016-11-16 Thread Christopher Samuel
ntly with Slurm it's not - just tested it out and using: --gres mic results in my job being scheduled on a Phi node with OFFLOAD_DEVICES=0 set in its environment. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiat

[slurm-dev] RE: Wrong behaviour of "--tasks-per-node" flag

2016-11-17 Thread Christopher Samuel
ris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: How to account how many cpus/gpus per node has been allocated to a specific job?

2016-11-13 Thread Christopher Samuel
ull) All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Using slurm to control container images?

2016-11-15 Thread Christopher Samuel
rivate containers is on the roadmap for Shifter. Shifter also integrates with Slurm. All the best! Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.

[slurm-dev] RE: Wrong behaviour of "--tasks-per-node" flag

2016-11-20 Thread Christopher Samuel
rcy of what your mpiexec chooses to do. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Slurm versions 16.05.6 and 17.02.0-pre3 are now available

2016-10-30 Thread Christopher Samuel
All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Passing binding information

2016-10-31 Thread Christopher Samuel
ontact them directly. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Passing binding information

2016-11-02 Thread Christopher Samuel
On 02/11/16 02:01, Riebs, Andy wrote: > Interesting -- thanks for the info Chris. No worries, it's a bit sad I think, but I can understand it. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Ph

[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Christopher Samuel
cause of this issue (from memory). -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Christopher Samuel
JobAcctGatherType=jobacct_gather/cgroup If the former, try the latter and see if it helps get better numbers (we went to the former after suggestions from SchedMD but from highly unreliable memory had to revert due to similar issues to those you are seeing). Best of luck, Chris -- Christopher S

[slurm-dev] Re: job arrays, fifo queueing not wanted

2016-12-14 Thread Christopher Samuel
ources. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Prolog behavior with and without srun

2017-01-09 Thread Christopher Samuel
. You do need PrologFlags=contain for that to ensure that all jobs get an "extern" batch step on job creation for these processes to be adopted into. We use both here with great success. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victori

[slurm-dev] Re: Prolog behavior with and without srun

2017-01-09 Thread Christopher Samuel
On 10/01/17 10:57, Christopher Samuel wrote: > If you are unlucky enough to have SSH based job launchers then you would > also look at the BYU contributed pam_slurm_adopt Actually this is useful even without that as it allows users to SSH into a node they have a job on and not disturb the

[slurm-dev] Re: Question about -m cyclic and --exclusive options to slurm

2017-01-03 Thread Christopher Samuel
puts between tasks. OK, I'm not sure how Slurm will behave with multiple srun's and cons_res and CR_LLN but it's still worth a shot. Best of luck! Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Pho

[slurm-dev] Re: Question about -m cyclic and --exclusive options to slurm

2017-01-03 Thread Christopher Samuel
lps! All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Question about -m cyclic and --exclusive options to slurm

2017-01-03 Thread Christopher Samuel
y believe that will be necessary, sorry! Best of luck, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: mail job status to user

2017-01-15 Thread Christopher Samuel
On 10/01/17 18:56, Ole Holm Nielsen wrote: > For the record: Torque will always send mail if a job is aborted It's been a few years since I've used Torque so I don't remember that behaviour. Thanks for the info! -- Christopher SamuelSenior Systems Administrator VLSCI - Victor

[slurm-dev] Re: mail job status to user

2017-01-15 Thread Christopher Samuel
st, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: mail job status to user

2017-01-15 Thread Christopher Samuel
r registered email address that's stored in LDAP. cheers, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Fwd: Scheduling jobs according to the CPU load

2017-03-19 Thread Christopher Samuel
Slurm for not catering to this. It can use cgroups to partition cores to jobs precisely so it doesn't need to care what the load average is - it knows the kernel is ensuring the cores the jobs want are not being stomped on by other tasks. Best of luck! Chris -- Christopher SamuelSenior Systems Ad

[slurm-dev] Re: Scheduling jobs according to the CPU load

2017-03-21 Thread Christopher Samuel
e best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Randomly jobs failures

2017-04-11 Thread Christopher Samuel
save/job.830332/environment, No such > file or directory I would suggest that you are looking at transient NFS failures (which may not be logged). Are you using NFSv3 or v4 to talk to the NFS server and what are the OS's you are using for both? cheers, Chris -- Christopher Samuel

[slurm-dev] Distinguishing past jobs that waited due to dependencies vs resources?

2017-04-11 Thread Christopher Samuel
' of the source code after reading 'man sacct' and not finding anything (also running 'sacct -e' and not seeing anything useful there either) doesn't offer much hope. Anyone else dealing with this? We're on 16.05.x at the moment with slurmdbd. All the best, Chris -- Christopher SamuelSenior

[slurm-dev] Re: LDAP required?

2017-04-12 Thread Christopher Samuel
ble again. +1 for running your own LDAP. I would seriously look at a cluster toolkit for running nodes, especially if it supports making a single image that your compute nodes then netboot. That way you know everything is consistent. Best of luck, Chris -- Christopher SamuelSenior Syst

[slurm-dev] Re: Jobs submitted simultaneously go on the same GPU

2017-04-11 Thread Christopher Samuel
3] RealMemory=126000 CoresPerSocket=10 Sockets=2 ThreadsPerCore=2 Gres=mic:5110p:2 You'll also need to restart slurmctld & all slurmd's to pick up this new config, I don't think "scontrol reconfigure" will deal with this. Best of luck, Chris -- Christopher SamuelSenior S
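As a sanity check on a node definition like the one quoted above, the key=value parameters can be parsed and the logical CPU count derived from them. This is a minimal sketch: the helper names are hypothetical, and computing CPUs as Sockets x CoresPerSocket x ThreadsPerCore is the usual convention, assumed here rather than taken from the message.

```python
def parse_node_params(line: str) -> dict:
    """Split a slurm.conf style node line into a dict of key -> value strings."""
    params = {}
    for token in line.split():
        if "=" in token:
            key, value = token.split("=", 1)
            params[key] = value
    return params


def logical_cpus(params: dict) -> int:
    """Sockets x CoresPerSocket x ThreadsPerCore, each defaulting to 1 if absent."""
    return (int(params.get("Sockets", 1))
            * int(params.get("CoresPerSocket", 1))
            * int(params.get("ThreadsPerCore", 1)))


if __name__ == "__main__":
    line = "RealMemory=126000 CoresPerSocket=10 Sockets=2 ThreadsPerCore=2 Gres=mic:5110p:2"
    params = parse_node_params(line)
    print(logical_cpus(params), params["Gres"])
```

Comparing a figure like this with what the node itself reports is a quick way to spot a mismatch between the configuration and the hardware.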

[slurm-dev] Re: LDAP required?

2017-04-12 Thread Christopher Samuel
l back to using our own LDAP server with Karaage to manage project/account applications, adding people to slurmdbd, etc. All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: reporting used memory with job Accounting or Completion plugins?

2017-03-12 Thread Christopher Samuel
ps with srun you can also monitor them as the job is going with 'sstat' (rather than just post-mortem with sacct). All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Storage of job submission and working directory paths

2017-03-07 Thread Christopher Samuel
to us here too. All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: LDAP required?

2017-04-19 Thread Christopher Samuel
ge-Cluster All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-14 Thread Christopher Samuel
unning in for your SSH session and not the job! cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Proctrack cgroup; documentation bug

2017-08-13 Thread Christopher Samuel
On 14/08/17 08:55, Lachlan Musicman wrote: > Was it here I read that proctrack/linuxproc was better than > proctrack/cgroup? I think you're thinking of JobAcctGatherType, but even then our experience there was that jobacct_gather/cgroup was more accurate. -- Christopher Samuel

[slurm-dev] Re: Multifactor Priority Plugin for Small clusters

2017-07-03 Thread Christopher Samuel
s.html Best of luck! Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: RebootProgram - who uses it?

2017-08-08 Thread Christopher Samuel
On 07/08/17 14:08, Lachlan Musicman wrote: > In slurm.conf, there is a RebootProgram - does this need to be a direct > link to a bin or can it be a command? We have: RebootProgram = /sbin/reboot Works for us. cheers, Chris -- Christopher SamuelSenior Systems Adminis

[slurm-dev] Re: RebootProgram - who uses it?

2017-08-08 Thread Christopher Samuel
cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Job ends successfully but spawned processes still run?

2017-05-23 Thread Christopher Samuel
> a job have finished at completion? Are you not using cgroups for enforcement? Usually that picks everything up. cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Job ends successfully but spawned processes still run?

2017-05-23 Thread Christopher Samuel
ers (and other naughtiness). Good luck! Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: sinfo

2017-05-24 Thread Christopher Samuel
-format="%60N %.15G %.30E %.10A" The reason can be quite long, but there doesn't seem to be a way to just show the status as down/drain/idle/etc. cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@u

[slurm-dev] Re: Accounting: preventing scheduling after TRES limit reached (permanently)

2017-06-04 Thread Christopher Samuel
. All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: srun - replacement for --x11?

2017-06-06 Thread Christopher Samuel
On 06/06/17 23:46, Edward Walter wrote: > Doesn't that functionality come from a spank plugin? > https://github.com/hautreux/slurm-spank-x11 Yes, that's the one we use. Works nicely. Provides the --x11 option for srun. All the best, Chris -- Christopher SamuelSenior S

[slurm-dev] Re: How to get Qos limits

2017-06-06 Thread Christopher Samuel
R} format=MaxJobsPerUser For a more general view you would do: sacctmgr list user ${USER} withassoc Hope this helps, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: discrepancy between node config and # of cpus found

2017-05-21 Thread Christopher Samuel
a node and then Slurm isn't going to put more jobs there (unless you tell it to ignore memory, which is not likely to end well). All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phon

[slurm-dev] Re: Accounting using LDAP ?

2017-09-19 Thread Christopher Samuel
ey can run jobs, but that's a separate issue to whether slurmdbd can resolve users in LDAP. I would hope that Bright would have the ability to do that for you rather than having you handle it manually, but that's a question for Bright. Best of luck, Chris -- Christopher SamuelSenior System

[slurm-dev] Re: Limiting SSH sessions to cgroups?

2017-09-19 Thread Christopher Samuel
login into this container. Setting the Contain implicitly sets the Alloc flag. -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Accounting using LDAP ?

2017-09-20 Thread Christopher Samuel
n from applicants than what it captures by default, but that's the nice thing, it is modular. Also includes Shibboleth support. All the best! Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0

[slurm-dev] Re: Limiting SSH sessions to cgroups?

2017-09-21 Thread Christopher Samuel
Good luck! All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-17 Thread Christopher Samuel
e configured. cheers! Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-13 Thread Christopher Samuel
couple of questions: 1) Have you restarted slurmctld and slurmd everywhere? 2) Can you confirm that slurm.conf is the same everywhere? 3) what does slurmd -C report? cheers! Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Emai

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-13 Thread Christopher Samuel
about the actual hardware layout. What does "lscpu" say? cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: On the need for slurm uid/gid consistency

2017-09-13 Thread Christopher Samuel
to decode it. So if the UID's & GID's of the user differ across systems then it appears it will not allow the receiver to validate the message. cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@u

[slurm-dev] Slurm 17.02.7 and PMIx

2017-10-04 Thread Christopher Samuel
ch non-existent. :-( Anyone had any luck with this? cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Upgrading Slurm

2017-10-04 Thread Christopher Samuel
running jobs. The on-disk format for spooled jobs may also change between releases, so you probably want to keep that in mind as well.. -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone:

[slurm-dev] Re: mysql job_table and step_table growth

2017-10-15 Thread Christopher Samuel
On 14/10/17 00:24, Doug Meyer wrote: > The job_table.idb and step_table.idb do not clear as part of day-to-day > slurmdbd.conf > > Have slurmdbd.conf set to purge after 8 weeks but this does not appear > to be working. Anything in your slurmdbd logs? -- Christopher Samue

[slurm-dev] Re: Exceeded job memory limit problem

2017-09-06 Thread Christopher Samuel
strain jobs via cgroups and have found that using the cgroup plugin for this results in jobs not getting killed incorrectly. Using cgroups in Slurm is a definite win for us, so I would suggest looking into it if you've not already done so. All the best, Chris -- Christopher SamuelSenior S

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-12 Thread Christopher Samuel
ute bound the usual advice is to disable HT in the BIOS, but for I/O bound things you may not be so badly off. Hope that helps! Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-12 Thread Christopher Samuel
o run a job on. All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Fair resource scheduling

2017-08-27 Thread Christopher Samuel
rial jobs at the end of the available nodes rather than using a best fit algorithm. This may reduce resource fragmentation for some workloads. cheers, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne E

[slurm-dev] Re: Delete jobs from slurmctld runtime database

2017-08-23 Thread Christopher Samuel
iry parameters), and removing them will likely break its statistics and probably do Bad Things(tm). Here be dragons.. -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Jobs cancelled "DUE TO TIME LIMIT" long before actual timelimit

2017-08-30 Thread Christopher Samuel
417 -a --format JobID%20,State%20,timelimit,Elapsed,ExitCode -j 1695151 cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Camacho Barranco, Roberto <rcamachobarra...@utep.edu> ssirimu...@utep.edu

2017-10-09 Thread Christopher Samuel
 and tried to stop and > restart it multiple times but still not working. Please see the error below. Check your slurmctld.log, that should have hints about why it won't start. cheers! Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of

[slurm-dev] Re: Tasks distribution

2017-10-09 Thread Christopher Samuel
8. Do you see the same? Did you try both having CR_Pack_Nodes *and* specifying this? -n 17 --ntasks-per-node=4 cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Slurm 17.02.7 and PMIx

2017-10-09 Thread Christopher Samuel
On 05/10/17 11:27, Christopher Samuel wrote: > PMIX v1.2.2: Slurm complains and tells me it wants v2. I think that was due to a config issue on the system I was helping out with, after having to install some extra packages (like a C++ compiler) to get other things working I can no lon

[slurm-dev] Re: Is PriorityUsageResetPeriod really required for hard limits?

2017-10-03 Thread Christopher Samuel
to sometime far into the future to have > effectively an infinite period (no reset)? Basically this is because once a user exceeds something like their maximum CPU run time limit then they will never be able to run jobs again unless you either decay or reset usage. -- Christopher Samue
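The point about decaying usage can be illustrated with a little arithmetic. The sketch below assumes usage decays exponentially with a fixed half-life (the behaviour a half-life setting suggests); the function names and the numbers are hypothetical and not taken from the thread.

```python
import math


def decayed_usage(usage: float, elapsed_days: float, half_life_days: float) -> float:
    """Recorded usage after `elapsed_days`, halving every `half_life_days`."""
    return usage * 0.5 ** (elapsed_days / half_life_days)


def days_until_below(usage: float, limit: float, half_life_days: float) -> float:
    """Days of decay needed before usage falls back to the limit."""
    if usage <= limit:
        return 0.0
    return half_life_days * math.log2(usage / limit)


if __name__ == "__main__":
    # A user at twice their limit with a 7 day half-life.
    print(days_until_below(usage=2000.0, limit=1000.0, half_life_days=7.0))
```

With neither decay nor a reset, the recorded usage never shrinks, so a user who has passed a hard limit stays above it indefinitely.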

[slurm-dev] Re: Setting up Environment Modules package

2017-10-04 Thread Christopher Samuel
e also have in our taskprolog.sh: echo export BASH_ENV=/etc/profile.d/module.sh to try and ensure that bash shells have modules set up, just in case. :-) -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au P
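The same idea can be sketched as a task prolog written in Python rather than shell. It relies on the convention the message itself uses, that a line of the form "export NAME=value" printed by the task prolog sets that variable for the task; the function name is hypothetical and the path is the one quoted in the message.

```python
#!/usr/bin/env python3
"""Hypothetical task prolog: prints 'export' lines for the task environment."""


def prolog_lines(module_init: str = "/etc/profile.d/module.sh") -> list:
    """Return the lines the prolog should print to standard output."""
    # BASH_ENV names a file that non-interactive bash shells read on startup,
    # so pointing it at the modules init script sets modules up for job scripts.
    return [f"export BASH_ENV={module_init}"]


if __name__ == "__main__":
    for line in prolog_lines():
        print(line)
```

Keeping the lines in a function makes it easy to add further variables later without changing how they are printed.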

[slurm-dev] Re: Upgrading Slurm

2017-10-04 Thread Christopher Samuel
r Slurm, I'd always install it centrally (NFS exported to compute nodes) to keep things simple. That way you decouple your Slurm version from the OS and can keep it up to date (or keep it on a known working version). All the best! Chris -- Christopher SamuelSenior Systems Administrato
